Metrics Reference

The platform exports metrics via OpenTelemetry to an OTLP-compatible collector (for example, the OpenTelemetry Collector, which can forward them to Prometheus). Each service component has its own metrics class. Most metric names follow the dotted pattern {domain}.{metric}.{type}, for example script.executions.total.

Architecture

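Metrics are collected using the OpenTelemetry SDK and exported every 10 seconds to the configured OTLP endpoint. Each metrics class obtains a meter and creates its instruments when it is constructed:

    from opentelemetry import metrics

    # Method of the base metrics class (class header omitted in this excerpt).
    def __init__(self, settings, meter_name=None):
        """Obtain a meter and create this class's instruments.

        Args:
            settings: Application settings (kept for DI compatibility).
            meter_name: Optional name for the meter. Defaults to class name.
        """
        meter_name = meter_name or self.__class__.__name__
        self._meter = metrics.get_meter(meter_name)
        self._create_instruments()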

When ENABLE_TRACING is false or no OTLP endpoint is configured, the system uses a no-op meter provider to avoid unnecessary overhead.
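
The provider selection described above could be wired roughly as follows. This is a minimal sketch, not the project's actual code: the function names `metrics_enabled` and `configure_metrics` are hypothetical, and the gRPC OTLP exporter is an assumption (the HTTP exporter is configured the same way). The OpenTelemetry imports are deferred so that the decision logic can be exercised without the SDK installed.

```python
from typing import Optional


def metrics_enabled(enable_tracing: bool, otlp_endpoint: Optional[str]) -> bool:
    """Return True only when tracing is on and an OTLP endpoint is set."""
    return bool(enable_tracing and otlp_endpoint)


def configure_metrics(
    enable_tracing: bool,
    otlp_endpoint: Optional[str],
    service_name: str = "integr8scode-backend",
) -> None:
    """Install the global meter provider (hypothetical helper, not project code)."""
    # Local imports: this module still loads when the SDK is not installed.
    from opentelemetry import metrics
    from opentelemetry.metrics import NoOpMeterProvider

    if not metrics_enabled(enable_tracing, otlp_endpoint):
        # No exporter is created; instruments obtained later do nothing.
        metrics.set_meter_provider(NoOpMeterProvider())
        return

    # Assumption: gRPC exporter. Swap in the HTTP exporter if the collector needs it.
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.resources import Resource

    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint),
        export_interval_millis=10_000,  # export every 10 seconds
    )
    provider = MeterProvider(
        resource=Resource.create({"service.name": service_name}),
        metric_readers=[reader],
    )
    metrics.set_meter_provider(provider)
```

Calling `configure_metrics(...)` once at startup, before any metrics class is constructed, means `metrics.get_meter(...)` returns a meter from the chosen provider.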

Metric Categories

Execution Metrics

Track script execution performance and resource usage.

| Metric | Type | Labels | Description |
|---|---|---|---|
| script.executions.total | Counter | status, lang_and_version | Total executions |
| script.execution.duration | Histogram | lang_and_version | Execution time (seconds) |
| script.executions.active | UpDownCounter | - | Currently running executions |
| script.memory.usage | Histogram | lang_and_version | Memory per execution (MiB) |
| script.cpu.utilization | Histogram | - | CPU usage (millicores) |
| script.errors.total | Counter | error_type | Errors by type |
| execution.queue.depth | UpDownCounter | - | Queued executions |
| execution.queue.wait_time | Histogram | lang_and_version | Queue wait time (seconds) |
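
To illustrate how three of these instruments fit together, here is a minimal sketch of a recorder for a single execution. The class name `ExecutionMetricsSketch`, the `track` method, and the `"failed"` status value are assumptions made for illustration; only the metric names, types, and labels come from the table. The meter is passed in so the class works with any object exposing the OpenTelemetry `Meter` methods.

```python
import time
from contextlib import contextmanager


class ExecutionMetricsSketch:
    """Hypothetical recorder for a subset of the execution metrics."""

    def __init__(self, meter):
        # `meter` is expected to behave like opentelemetry.metrics.Meter.
        self._total = meter.create_counter(
            "script.executions.total", description="Total executions"
        )
        self._duration = meter.create_histogram(
            "script.execution.duration", unit="s", description="Execution time"
        )
        self._active = meter.create_up_down_counter(
            "script.executions.active", description="Currently running executions"
        )

    @contextmanager
    def track(self, lang_and_version: str):
        """Record active count, duration, and final status around one execution."""
        labels = {"lang_and_version": lang_and_version}
        self._active.add(1)
        start = time.monotonic()
        status = "completed"
        try:
            yield
        except Exception:
            status = "failed"  # assumed status value, not taken from the source
            raise
        finally:
            self._active.add(-1)
            self._duration.record(time.monotonic() - start, labels)
            self._total.add(1, {"status": status, **labels})
```

With the real API this would be constructed as `ExecutionMetricsSketch(metrics.get_meter("ExecutionMetrics"))` and used as `with recorder.track("python-3.11"): run_script()`, where `run_script` stands in for the actual execution call.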

Coordinator Metrics

Track scheduling and queue management.

| Metric | Type | Labels | Description |
|---|---|---|---|
| coordinator.scheduling.duration | Histogram | - | Scheduling time |
| coordinator.executions.active | UpDownCounter | - | Active managed executions |
| coordinator.queue.wait_time | Histogram | priority | Queue wait by priority |
| coordinator.executions.scheduled.total | Counter | status | Scheduled executions |

Rate Limit Metrics

Track rate-limiting behavior.

| Metric | Type | Labels | Description |
|---|---|---|---|
| rate_limit.requests.total | Counter | authenticated, endpoint, algorithm | Total checks |
| rate_limit.allowed.total | Counter | group, priority, multiplier | Allowed requests |
| rate_limit.rejected.total | Counter | group, priority, multiplier | Rejected requests |
| rate_limit.bypass.total | Counter | endpoint | Bypassed checks |
| rate_limit.check.duration | Histogram | endpoint, authenticated | Check duration (ms) |
| rate_limit.redis.duration | Histogram | operation | Redis operation time (ms) |
| rate_limit.remaining | Histogram | - | Remaining requests |
| rate_limit.quota.usage | Histogram | - | Quota usage (%) |
| rate_limit.token_bucket.tokens | Histogram | endpoint | Current tokens |

Event Metrics

Track Kafka event processing.

| Metric | Type | Labels | Description |
|---|---|---|---|
| events.produced.total | Counter | event_type, topic | Events published |
| events.consumed.total | Counter | event_type, topic | Events consumed |
| events.processing.duration | Histogram | event_type | Processing time |
| events.errors.total | Counter | event_type, error_type | Processing errors |
| events.lag | UpDownCounter | topic, partition | Consumer lag |
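
An UpDownCounter accepts deltas rather than absolute values, so reporting a level such as consumer lag means remembering the last value per label set and adding only the difference. The sketch below shows one way to do that; the `LagGauge` class and its `set_lag` method are hypothetical, and the platform may track lag differently.

```python
from typing import Dict, Tuple


class LagGauge:
    """Hypothetical wrapper that turns absolute lag values into deltas."""

    def __init__(self, lag_counter):
        # `lag_counter` is expected to behave like an OpenTelemetry
        # UpDownCounter, i.e. expose add(amount, attributes).
        self._counter = lag_counter
        self._last: Dict[Tuple[str, int], int] = {}

    def set_lag(self, topic: str, partition: int, lag: int) -> None:
        """Report the current lag for one topic partition."""
        key = (topic, partition)
        delta = lag - self._last.get(key, 0)
        if delta:
            # Only the change is sent; the backend sums deltas into a level.
            self._counter.add(delta, {"topic": topic, "partition": partition})
        self._last[key] = lag
```

With the real API, the counter would come from `meter.create_up_down_counter("events.lag")`. The topic name passed to `set_lag` is whatever the consumer is subscribed to.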

Database Metrics

Track MongoDB operations.

| Metric | Type | Labels | Description |
|---|---|---|---|
| database.operations.total | Counter | operation, collection | Total operations |
| database.operation.duration | Histogram | operation, collection | Operation time |
| database.errors.total | Counter | operation, error_type | Database errors |
| database.connections.active | UpDownCounter | - | Active connections |

Connection Metrics

Track SSE and WebSocket connections.

| Metric | Type | Labels | Description |
|---|---|---|---|
| connections.active | UpDownCounter | type | Active connections |
| connections.total | Counter | type | Total connections opened |
| connections.duration | Histogram | type | Connection duration |
| connections.messages.sent | Counter | type | Messages sent |

Health Metrics

Track service health.

| Metric | Type | Labels | Description |
|---|---|---|---|
| health.checks.total | Counter | service, status | Health check results |
| health.check.duration | Histogram | service | Check duration |
| health.dependencies.status | UpDownCounter | dependency | Dependency status (1=up) |

Notification Metrics

Track notification delivery.

| Metric | Type | Labels | Description |
|---|---|---|---|
| notifications.sent.total | Counter | type, channel | Notifications sent |
| notifications.failed.total | Counter | type, error | Failed notifications |
| notifications.delivery.duration | Histogram | channel | Delivery time |

Dead Letter Queue Metrics

Track DLQ operations.

| Metric | Type | Labels | Description |
|---|---|---|---|
| dlq.messages.total | Counter | topic, reason | Messages sent to DLQ |
| dlq.retries.total | Counter | topic | Retry attempts |
| dlq.size | UpDownCounter | topic | Current DLQ size |

Configuration

Metrics are configured via environment variables:

| Variable | Default | Description |
|---|---|---|
| ENABLE_TRACING | true | Enable metrics/tracing |
| OTEL_EXPORTER_OTLP_ENDPOINT | - | OTLP collector endpoint |
| TRACING_SERVICE_NAME | integr8scode-backend | Service name in traces |
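
As a rough illustration of how these variables might be read, the sketch below loads them into a small settings object using the defaults from the table. The `MetricsSettings` dataclass, the `load_metrics_settings` function, and the set of accepted truthy spellings are assumptions; the application's own settings class may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MetricsSettings:
    """Hypothetical container for the three metrics-related variables."""

    enable_tracing: bool = True
    otlp_endpoint: Optional[str] = None
    service_name: str = "integr8scode-backend"


def load_metrics_settings(env: Mapping[str, str] = os.environ) -> MetricsSettings:
    """Build settings from an environment mapping, applying the documented defaults."""
    raw_flag = env.get("ENABLE_TRACING", "true").strip().lower()
    return MetricsSettings(
        # Assumed truthy spellings; anything else disables tracing.
        enable_tracing=raw_flag in {"1", "true", "yes", "on"},
        # An unset or empty endpoint is treated as "not configured".
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None,
        service_name=env.get("TRACING_SERVICE_NAME", "integr8scode-backend"),
    )
```

Passing the mapping explicitly, as in `load_metrics_settings({"ENABLE_TRACING": "false"})`, keeps the function easy to test without touching the process environment.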

Prometheus Queries

Example PromQL queries for common dashboards:

# Execution success rate (last 5 minutes)
sum(rate(script_executions_total{status="completed"}[5m])) /
sum(rate(script_executions_total[5m]))

# P99 execution duration by language
histogram_quantile(0.99, sum(rate(script_execution_duration_bucket[5m])) by (le, lang_and_version))

# Rate limit rejection rate
sum(rate(rate_limit_rejected_total[5m])) /
sum(rate(rate_limit_requests_total[5m]))

# Queue depth trend
avg_over_time(execution_queue_depth[1h])
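
The queries above use underscores where the metric tables use dots, because Prometheus metric names cannot contain dots. The helper below sketches that mapping so a table entry can be turned into the name to query. `to_prometheus_name` is a hypothetical function, and some exporter configurations also append unit suffixes, so the result should be checked against what the collector actually exposes.

```python
import re

# Characters outside this set are not valid in a Prometheus metric name.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def to_prometheus_name(otel_name: str) -> str:
    """Replace characters Prometheus does not allow (such as dots) with underscores."""
    return _INVALID_CHARS.sub("_", otel_name)


# Histograms are queried through their bucket series, as in the P99 query above.
counter_name = to_prometheus_name("script.executions.total")
bucket_name = to_prometheus_name("script.execution.duration") + "_bucket"
```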

Key Files

| File | Purpose |
|---|---|
| core/metrics/base.py | Base metrics class and configuration |
| core/metrics/execution.py | Execution metrics |
| core/metrics/coordinator.py | Coordinator metrics |
| core/metrics/rate_limit.py | Rate limit metrics |
| core/metrics/ | All metrics modules |