Contract testing¶
Contract tests sit between unit tests and integration tests. They verify that two parts of the system agree on a shared interface without actually running those parts together. In this project the main contract boundary is between Python OTel metric definitions and Grafana dashboard JSON files — both reference the same Prometheus metric names, but neither knows about the other at runtime.
Grafana metrics contract¶
The test file lives at backend/tests/contract/test_grafana_metrics.py. It uses a real OTel-to-Prometheus export
pipeline so that metric name conversion (dots to underscores, unit suffixes, _total / _bucket / etc.) is handled by
the SDK, not hand-rolled.
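For intuition, the conversion behaves roughly like the sketch below. This is an illustrative approximation, not the OTel exporter's actual code; the function name and the exact rules are assumptions for the example.

```python
import re

def prometheus_name(otel_name: str, unit: str = "", is_counter: bool = False) -> str:
    """Approximate how an OTel metric name surfaces in Prometheus.

    Illustrative only -- in the real test this mapping is performed by the
    SDK's Prometheus exporter, which is exactly why the test uses it.
    """
    # Dots (and other invalid characters) become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # A unit becomes a suffix, e.g. "seconds".
    if unit and not name.endswith(unit):
        name = f"{name}_{unit}"
    # Counters gain the "_total" suffix.
    if is_counter and not name.endswith("_total"):
        name = f"{name}_total"
    return name

print(prometheus_name("app.requests", is_counter=True))       # app_requests_total
print(prometheus_name("task.duration", unit="seconds"))       # task_duration_seconds
```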
The setup works like this: a PrometheusMetricReader is attached to a MeterProvider, then every BaseMetrics
subclass, the MetricsMiddleware, and the system metrics are instantiated so that their instruments get registered in
the SDK. After that, the fixture walks MeterProvider._meters and triggers every synchronous instrument through
duck-typed getattr — it tries add, record, set in order, calls the first one that exists, and moves on.
Observable instruments (like the system CPU/memory gauges) fire their callbacks automatically during collection, so they
need no explicit trigger. The result is a dict[str, set[str]] mapping each Prometheus family name to the set of sample
names it produces.
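The triggering step can be sketched as follows. The fake classes are stand-ins for the real SDK instruments, not the actual OTel API; only the getattr loop reflects what the fixture does.

```python
class FakeCounter:
    """Stand-in for an SDK Counter (exposes .add)."""
    def __init__(self):
        self.calls = 0
    def add(self, amount, attributes=None):
        self.calls += 1

class FakeHistogram:
    """Stand-in for an SDK Histogram (exposes .record)."""
    def __init__(self):
        self.calls = 0
    def record(self, amount, attributes=None):
        self.calls += 1

def trigger(instrument) -> bool:
    """Call the first of add/record/set that the instrument exposes."""
    for method in ("add", "record", "set"):
        fn = getattr(instrument, method, None)
        if callable(fn):
            fn(1)  # any value works; we only need a sample to be emitted
            return True
    return False  # observable instruments have no sync method and are skipped

instruments = [FakeCounter(), FakeHistogram()]
assert all(trigger(i) for i in instruments)
```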
There are two tests that share this fixture:
test_dashboard_metrics_defined_in_code is the forward check. It parses every *.json dashboard in
backend/grafana/provisioning/dashboards/, extracts expr fields, tokenizes the PromQL, filters out known builtins
(rate, sum, by, etc.), and checks that every remaining metric name exists in the Prometheus sample set. If a
dashboard references foo_bar_total but no Python code defines that metric, the test fails and lists the offending
dashboard and metric names.
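The tokenizing step might look roughly like this. The builtin list here is deliberately partial and the regexes are illustrative; they are not the test's exact implementation.

```python
import re

# Partial, illustrative list -- the real test filters a fuller set of builtins.
PROMQL_BUILTINS = {"rate", "sum", "by", "avg", "max", "min", "increase",
                   "histogram_quantile", "on", "without", "ignoring"}

def metric_names(expr: str) -> set[str]:
    """Pull candidate metric names out of a PromQL expression."""
    expr = re.sub(r'"[^"]*"', "", expr)                              # string literals
    expr = re.sub(r"\{[^}]*\}", "", expr)                            # label selectors
    expr = re.sub(r"\[[^\]]*\]", "", expr)                           # range selectors
    expr = re.sub(r"\b(by|without|on|ignoring)\s*\([^)]*\)", "", expr)  # grouping clauses
    tokens = re.findall(r"[a-zA-Z_:][a-zA-Z0-9_:]*", expr)
    return {t for t in tokens if t not in PROMQL_BUILTINS}

expr = 'sum(rate(http_requests_total{job="api"}[5m])) by (status)'
print(metric_names(expr))  # {'http_requests_total'}
```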
test_code_metrics_used_in_dashboards is the reverse check. It flattens every dashboard metric into one set, then
iterates over the Prometheus families from the fixture. For each family it checks whether any of its sample names appear
in that dashboard set. Because a single histogram like execution_duration produces _bucket, _count, _sum, and
_created samples, the dashboard only needs to reference one of them for the family to pass. The target family is
skipped since it's auto-generated by the OTel SDK and not something you'd panel in Grafana. If a metric family has no
dashboard coverage at all, the test fails with a list of unused families and their samples.
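The family-level coverage logic reduces to a set intersection per family; a minimal sketch (function and variable names are mine, not the test's):

```python
def uncovered_families(families: dict[str, set[str]],
                       dashboard_metrics: set[str]) -> dict[str, set[str]]:
    """Return families where no sample name appears in any dashboard."""
    skip = {"target"}  # auto-generated by the OTel SDK, never paneled
    return {
        family: samples
        for family, samples in families.items()
        if family not in skip and not (samples & dashboard_metrics)
    }

families = {
    "execution_duration": {"execution_duration_bucket", "execution_duration_count",
                           "execution_duration_sum", "execution_duration_created"},
    "queue_depth": {"queue_depth"},
    "target": {"target_info"},
}
# Referencing a single histogram sample is enough to cover the whole family.
dashboards = {"execution_duration_bucket"}
assert uncovered_families(families, dashboards) == {"queue_depth": {"queue_depth"}}
```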
Why duck-typed instrument triggering¶
The earlier version of this test had an isinstance chain — check for Counter, call add; check for Histogram,
call record; check for UpDownCounter, call add. This breaks silently when a new instrument type shows up (the
project already uses ObservableGauge, which the old code didn't handle). The current approach iterates the SDK's
internal meter registry and calls whichever method the instrument exposes. If OTel adds a new synchronous instrument
type tomorrow that has an add or record method, the test picks it up with zero changes.
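The difference is easy to demonstrate with minimal stand-in classes (these are not the real OTel types): a gauge-like instrument that only exposes set slips straight through the isinstance chain but is handled by the duck-typed loop.

```python
class Counter:          # minimal stand-ins, not the real OTel classes
    def add(self, v):
        self.v = v

class Histogram:
    def record(self, v):
        self.v = v

class Gauge:            # a "new" type with only .set -- the failure case
    def set(self, v):
        self.v = v

def trigger_isinstance(inst) -> bool:
    # Fragile: silently skips any type not enumerated here.
    if isinstance(inst, Counter):
        inst.add(1)
        return True
    if isinstance(inst, Histogram):
        inst.record(1)
        return True
    return False

def trigger_duck(inst) -> bool:
    # Robust: call whichever synchronous method the instrument exposes.
    for name in ("add", "record", "set"):
        fn = getattr(inst, name, None)
        if callable(fn):
            fn(1)
            return True
    return False

assert trigger_isinstance(Gauge()) is False  # silently missed
assert trigger_duck(Gauge()) is True         # picked up with zero changes
```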
Running the tests¶
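A typical invocation looks like this (the file path comes from this page; the exact command shape is assumed):

```shell
# Clear the project-wide pytest options for this run only, then run the file.
pytest -o "addopts=" backend/tests/contract/test_grafana_metrics.py
```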
Run the file directly with pytest, passing -o "addopts=" to clear the project-wide options. The override is needed
because pyproject.toml sets -n auto --dist=loadfile for xdist, which interferes with the module-scoped fixture (the
OTel MeterProvider can only be set once per process). Running without xdist is fine since these tests finish in under a
second.
Both tests carry the @pytest.mark.grafana_contract marker, so you can also select them by marker instead of by path.
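A marker-based run might look like the following; the -o "addopts=" override from pyproject.toml applies here too, and the target directory is assumed:

```shell
pytest -o "addopts=" -m grafana_contract backend/tests
```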
Adding a new metric¶
When you add a metric to a BaseMetrics subclass, the forward test will keep passing (dashboards don't reference it
yet). But the reverse test will fail, telling you exactly which family has no dashboard coverage. At that point either
add a panel to an existing dashboard or create a new one. Conversely, if you add a PromQL expression to a dashboard that
references a metric that doesn't exist in code, the forward test catches it.
The goal is to keep the two sides in sync so you don't end up with dead panels pointing at metrics that were renamed three months ago, or metrics that nobody ever looks at because they were never wired into a dashboard.