Metrics Reference

The platform exports metrics via OpenTelemetry to an OTLP-compatible collector (for example, an OpenTelemetry Collector that feeds Prometheus). Each service component has its own metrics class, and metric names follow a consistent dotted pattern, `{domain}.{metric}.{type}`, as in `script.executions.total`.

Architecture

Metrics are collected using the OpenTelemetry SDK and exported every 10 seconds to the configured OTLP endpoint. Each metrics class obtains a meter and creates its instruments in its constructor:

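from opentelemetry import metrics


class BaseMetrics:  # class name assumed; the original lives in core/metrics/base.py
    def __init__(self, settings, meter_name=None):  # signature inferred from the docstring
        """Create the meter and this component's instruments.

        Args:
            settings: Application settings (kept for DI compatibility).
            meter_name: Optional name for the meter. Defaults to class name.
        """
        meter_name = meter_name or self.__class__.__name__
        self._meter = metrics.get_meter(meter_name)
        self._create_instruments()  # subclasses define their instruments here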

When ENABLE_TRACING is false or no OTLP endpoint is configured, the system uses a no-op meter provider to avoid unnecessary overhead.
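
The provider selection described above might be wired up roughly as follows. This is a minimal sketch, not the project's actual code: it assumes the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-grpc` packages are installed, and the function names `should_export` and `configure_metrics` are hypothetical.

```python
from typing import Optional


def should_export(enable_tracing: bool, otlp_endpoint: Optional[str]) -> bool:
    """Export only when tracing is enabled and an endpoint is configured."""
    return bool(enable_tracing and otlp_endpoint)


def configure_metrics(
    enable_tracing: bool,
    otlp_endpoint: Optional[str],
    service_name: str = "integr8scode-backend",
    export_interval_ms: int = 10_000,  # the 10-second export interval
) -> None:
    # OpenTelemetry imports are local so should_export() stays dependency-free.
    from opentelemetry import metrics
    from opentelemetry.metrics import NoOpMeterProvider

    if not should_export(enable_tracing, otlp_endpoint):
        # No-op provider: instruments can still be created but record nothing.
        metrics.set_meter_provider(NoOpMeterProvider())
        return

    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (
        OTLPMetricExporter,
    )
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.resources import Resource

    # The reader pushes all collected metrics to the exporter on a fixed interval.
    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint),
        export_interval_millis=export_interval_ms,
    )
    provider = MeterProvider(
        resource=Resource.create({"service.name": service_name}),
        metric_readers=[reader],
    )
    metrics.set_meter_provider(provider)
```

Because metrics classes only call `metrics.get_meter(...)`, they do not need to know which provider was installed; with the no-op provider their `add` and `record` calls simply do nothing.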

Metric Categories

Execution Metrics

Track script execution performance and resource usage.

Metric	Type	Labels	Description
`script.executions.total`	Counter	status, lang_and_version	Total executions
`script.execution.duration`	Histogram	lang_and_version	Execution time (seconds)
`script.executions.active`	UpDownCounter	-	Currently running executions
`script.memory.usage`	Histogram	lang_and_version	Memory per execution (MiB)
`script.cpu.utilization`	Histogram	-	CPU usage (millicores)
`script.errors.total`	Counter	error_type	Errors by type
`execution.queue.depth`	UpDownCounter	-	Queued executions
`execution.queue.wait_time`	Histogram	lang_and_version	Queue wait time (seconds)
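
As an illustration of how three of these instruments fit together, the sketch below wraps one script run in a context manager. The helper names (`create_execution_instruments`, `track_execution`) and the `"error"` status value are assumptions for the example; only the metric names, types, and labels come from the table above.

```python
import time
from contextlib import contextmanager


def create_execution_instruments(meter):
    """Create three of the instruments listed above on an OpenTelemetry Meter."""
    return {
        "total": meter.create_counter("script.executions.total"),
        "duration": meter.create_histogram("script.execution.duration", unit="s"),
        "active": meter.create_up_down_counter("script.executions.active"),
    }


@contextmanager
def track_execution(instruments, lang_and_version):
    """Record one execution: active count, duration in seconds, final status."""
    labels = {"lang_and_version": lang_and_version}
    instruments["active"].add(1)
    start = time.perf_counter()
    status = "completed"
    try:
        yield
    except Exception:
        status = "error"  # assumed status value, not confirmed by the source
        raise
    finally:
        instruments["duration"].record(time.perf_counter() - start, labels)
        instruments["total"].add(1, {**labels, "status": status})
        instruments["active"].add(-1)
```

With a real meter this could be used as `with track_execution(create_execution_instruments(metrics.get_meter("ExecutionMetrics")), lang): ...`, where the meter name and the label value format are again illustrative.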

Coordinator Metrics

Track scheduling and queue management.

Metric	Type	Labels	Description
`coordinator.scheduling.duration`	Histogram	-	Scheduling time
`coordinator.executions.active`	UpDownCounter	-	Active managed executions
`coordinator.queue.wait_time`	Histogram	priority	Queue wait by priority
`coordinator.executions.scheduled.total`	Counter	status	Scheduled executions

Rate Limit Metrics

Track rate-limiting behavior.

Metric	Type	Labels	Description
`rate_limit.requests.total`	Counter	authenticated, endpoint, algorithm	Total checks
`rate_limit.allowed.total`	Counter	group, priority, multiplier	Allowed requests
`rate_limit.rejected.total`	Counter	group, priority, multiplier	Rejected requests
`rate_limit.bypass.total`	Counter	endpoint	Bypassed checks
`rate_limit.check.duration`	Histogram	endpoint, authenticated	Check duration (ms)
`rate_limit.redis.duration`	Histogram	operation	Redis operation time (ms)
`rate_limit.remaining`	Histogram	-	Remaining requests
`rate_limit.quota.usage`	Histogram	-	Quota usage (%)
`rate_limit.token_bucket.tokens`	Histogram	endpoint	Current tokens
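
The label sets differ between the request counter and the allowed/rejected counters, and the duration histograms use milliseconds. A hypothetical helper that records one check might look like this; the function name, its parameters, and the `instruments` dictionary keys are assumptions made for the sketch.

```python
def record_rate_limit_check(
    instruments,
    *,
    allowed,
    endpoint,
    authenticated,
    algorithm,
    group,
    priority,
    multiplier,
    duration_s,
):
    """Record a single rate-limit check using the labels from the table above."""
    # Every check counts toward rate_limit.requests.total.
    instruments["requests"].add(
        1,
        {"authenticated": authenticated, "endpoint": endpoint, "algorithm": algorithm},
    )
    # Exactly one of rate_limit.allowed.total / rate_limit.rejected.total is bumped.
    outcome = "allowed" if allowed else "rejected"
    instruments[outcome].add(
        1, {"group": group, "priority": priority, "multiplier": multiplier}
    )
    # rate_limit.check.duration is reported in milliseconds, so convert from seconds.
    instruments["check_duration"].record(
        duration_s * 1000.0, {"endpoint": endpoint, "authenticated": authenticated}
    )
```

Keeping the seconds-to-milliseconds conversion in one place avoids mixing units within the same histogram.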

Event Metrics

Track Kafka event processing.

Metric	Type	Labels	Description
`events.produced.total`	Counter	event_type, topic	Events published
`events.consumed.total`	Counter	event_type, topic	Events consumed
`events.processing.duration`	Histogram	event_type	Processing time
`events.errors.total`	Counter	event_type, error_type	Processing errors
`events.lag`	UpDownCounter	topic, partition	Consumer lag

Database Metrics

Track MongoDB operations.

Metric	Type	Labels	Description
`database.operations.total`	Counter	operation, collection	Total operations
`database.operation.duration`	Histogram	operation, collection	Operation time
`database.errors.total`	Counter	operation, error_type	Database errors
`database.connections.active`	UpDownCounter	-	Active connections

Connection Metrics

Track SSE and WebSocket connections.

Metric	Type	Labels	Description
`connections.active`	UpDownCounter	type	Active connections
`connections.total`	Counter	type	Total connections opened
`connections.duration`	Histogram	type	Connection duration
`connections.messages.sent`	Counter	type	Messages sent

Health Metrics

Track service health.

Metric	Type	Labels	Description
`health.checks.total`	Counter	service, status	Health check results
`health.check.duration`	Histogram	service	Check duration
`health.dependencies.status`	UpDownCounter	dependency	Dependency status (1=up)
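
An UpDownCounter records deltas rather than absolute values, so reporting a 1=up / 0=down status means adding only the change since the last report. The sketch below shows one way to do that; the `DependencyStatusTracker` class is hypothetical and not taken from the codebase.

```python
class DependencyStatusTracker:
    """Keep health.dependencies.status at 1 (up) or 0 (down) per dependency."""

    def __init__(self, status_counter):
        self._counter = status_counter  # an UpDownCounter-like object
        self._last = {}  # dependency name -> last reported value (0 or 1)

    def set_status(self, dependency, is_up):
        new = 1 if is_up else 0
        old = self._last.get(dependency, 0)
        if new != old:
            # Emit only the difference so the cumulative value stays 0 or 1.
            self._counter.add(new - old, {"dependency": dependency})
        self._last[dependency] = new
```

Calling `set_status("redis", True)` twice in a row adds 1 only once, which keeps repeated health checks from inflating the value.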

Notification Metrics

Track notification delivery.

Metric	Type	Labels	Description
`notifications.sent.total`	Counter	type, channel	Notifications sent
`notifications.failed.total`	Counter	type, error	Failed notifications
`notifications.delivery.duration`	Histogram	channel	Delivery time

Dead Letter Queue Metrics

Track DLQ operations.

Metric	Type	Labels	Description
`dlq.messages.total`	Counter	topic, reason	Messages sent to DLQ
`dlq.retries.total`	Counter	topic	Retry attempts
`dlq.size`	UpDownCounter	topic	Current DLQ size

Configuration

Metrics are configured via environment variables:

Variable	Default	Description
`ENABLE_TRACING`	`true`	Enable metrics/tracing
`OTEL_EXPORTER_OTLP_ENDPOINT`	-	OTLP collector endpoint
`TRACING_SERVICE_NAME`	`integr8scode-backend`	Service name in traces
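
For illustration, the three variables could be read into a small settings object like this. The `MetricsSettings` dataclass, the `load_metrics_settings` function, and the set of strings treated as true are assumptions; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MetricsSettings:
    enable_tracing: bool = True
    otlp_endpoint: Optional[str] = None
    service_name: str = "integr8scode-backend"


def load_metrics_settings(env: Optional[Mapping[str, str]] = None) -> MetricsSettings:
    """Read metrics settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    raw_flag = env.get("ENABLE_TRACING", "true").strip().lower()
    return MetricsSettings(
        # Accepted truthy spellings are an assumption of this sketch.
        enable_tracing=raw_flag in {"1", "true", "yes", "on"},
        # Treat an empty endpoint the same as an unset one.
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None,
        service_name=env.get("TRACING_SERVICE_NAME", "integr8scode-backend"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.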

Prometheus Queries

Example PromQL queries for common dashboards:

# Execution success rate (last 5 minutes)
sum(rate(script_executions_total{status="completed"}[5m])) /
sum(rate(script_executions_total[5m]))

# P99 execution duration by language
histogram_quantile(0.99, sum(rate(script_execution_duration_bucket[5m])) by (le, lang_and_version))

# Rate limit rejection rate
sum(rate(rate_limit_rejected_total[5m])) /
sum(rate(rate_limit_requests_total[5m]))

# Queue depth trend
avg_over_time(execution_queue_depth[1h])
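
The queries above use underscores where the metric tables use dots, because Prometheus metric names cannot contain dots. A rough sketch of that mapping follows; the helper name `to_prometheus_name` is hypothetical. Depending on how the collector or exporter is configured, unit suffixes such as `_seconds` may also be appended, so the names actually exposed by your Prometheus instance should be checked.

```python
import re


def to_prometheus_name(otel_name: str) -> str:
    """Replace characters that are invalid in Prometheus metric names with '_'."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)


# Histograms are exposed as several series; this lists the usual suffixes.
def histogram_series(otel_name: str) -> list:
    base = to_prometheus_name(otel_name)
    return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
```

For example, `script.execution.duration` maps to the `script_execution_duration_bucket` series used in the P99 query.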

Key Files

File	Purpose
`core/metrics/base.py`	Base metrics class and configuration
`core/metrics/execution.py`	Execution metrics
`core/metrics/coordinator.py`	Coordinator metrics
`core/metrics/rate_limit.py`	Rate limit metrics
`core/metrics/`	All metrics modules