Deployment¶
Integr8sCode uses Docker Compose for deployment. All services — backend, frontend, workers, and infrastructure —
run as containers orchestrated by a single docker-compose.yaml. Workers reuse the backend image with different
`command:` overrides, so there is only one application image to build. Kubernetes is used only for executor pods
(running user code); workers just need a kubeconfig to talk to the K8s API.
Architecture¶
```mermaid
flowchart TB
    Script[deploy.sh] --> DC
    subgraph Images["Container Images"]
        Base[Dockerfile.base] --> Backend[Backend Image]
    end
    subgraph DC["Docker Compose"]
        Compose[docker-compose.yaml] --> Containers
    end
    Images --> Containers
```
Deployment script¶
The deploy.sh script wraps Docker Compose:
```bash
show_help() {
    echo "Integr8sCode Deployment Script"
    echo ""
    echo "Usage: ./deploy.sh <command> [options]"
    echo ""
    echo "Commands:"
    echo "  dev [options]        Start full stack (docker-compose)"
    echo "    --build            Rebuild images locally"
    echo "    --no-build         Use pre-built images only (no build fallback)"
    echo "    --wait             Wait for services to be healthy"
    echo "    --timeout <secs>   Health check timeout (default: 300)"
    echo "    --observability    Include Grafana, Jaeger, etc."
    echo "    --debug            Include observability + Kafdrop"
    echo "  infra [options]      Start infrastructure only (mongo, redis, kafka, etc.)"
    echo "    --wait             Wait for services to be healthy"
    echo "    --timeout <secs>   Health check timeout (default: 120)"
    echo "  down                 Stop all services"
    echo "  check                Run quality checks (ruff, mypy, bandit)"
    echo "  test                 Run full test suite"
    echo "  logs [service]       View logs (defaults to all services)"
    echo "  status               Show status of running services"
    echo "  openapi [path]       Generate OpenAPI spec (default: docs/reference/openapi.json)"
    echo "  types                Generate TypeScript types for frontend from OpenAPI spec"
    echo "  help                 Show this help message"
    echo ""
    echo "Configuration:"
    echo "  All settings come from backend/config.toml (single source of truth)"
    echo "  For CI/tests: cp backend/config.test.toml backend/config.toml"
    echo ""
    echo "Examples:"
    echo "  ./deploy.sh dev           # Start dev environment"
    echo "  ./deploy.sh dev --build   # Rebuild and start"
    echo "  ./deploy.sh dev --wait    # Start and wait for healthy"
    echo "  ./deploy.sh logs backend  # View backend logs"
}
```
The script wraps Docker Compose with convenience commands for building, starting, stopping, and running tests.
Local development¶
Local development uses Docker Compose to spin up the entire stack on your machine. The compose file defines all services with health checks and dependency ordering, so containers start in the correct sequence.
This brings up MongoDB, Redis, Kafka (KRaft mode), all six workers, the backend API, and the
frontend. One initialization container runs automatically: `user-seed` populates the database with default user accounts.
Kafka topics are created on demand via `auto.create.topics.enable` when producers first publish or consumers subscribe.
Once the stack is running, you can access the services at their default ports.
| Service | URL |
|---|---|
| Frontend | https://localhost:5001 |
| Backend API | https://localhost:443 |
| Kafdrop (Kafka UI) | http://localhost:9000 |
| Jaeger (Tracing) | http://localhost:16686 |
| Grafana | http://localhost:3000 |
The default credentials created by the seed job are user / user123 for a regular account and admin / admin123
for an administrator. You can override these via environment variables if needed.
Hot reloading works for the backend since the source directory is mounted into the container; changes to Python files trigger an automatic server reload. The frontend runs its own dev server with similar behavior.
Docker build strategy¶
The backend uses a multi-stage build with a shared base image to keep images small and rebuilds fast:
```mermaid
flowchart LR
    subgraph Base["Dockerfile.base"]
        B1[python:3.12-slim]
        B2[system deps]
        B3[uv sync --locked]
    end
    subgraph Services["Service Images"]
        S1[Backend + Workers]
    end
    Base --> S1
```
The base image installs all production dependencies:
```dockerfile
# Shared base image for all backend services
# Contains: Python, system deps, uv, and all Python dependencies
# Multi-stage build: gcc + dev headers only in builder, not in final image
FROM python:3.12-slim AS builder

WORKDIR /app

# Install build-time dependencies (gcc + dev headers for C extensions)
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        gcc \
        libsnappy-dev \
        liblzma-dev \
    && rm -rf /var/lib/apt/lists/*

# Install uv (using Docker Hub mirror - ghcr.io has rate limiting issues)
COPY --from=astral/uv:latest /uv /uvx /bin/

# Pre-compile bytecode for faster startup; copy mode avoids symlink issues with cache mounts
ENV UV_COMPILE_BYTECODE=1 UV_LINK_MODE=copy

# Copy dependency files
COPY pyproject.toml uv.lock ./

# Install Python dependencies with BuildKit cache mount for faster rebuilds
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --locked --no-dev --no-install-project

FROM python:3.12-slim

WORKDIR /app

# Install only runtime dependencies (shared libs, no -dev headers, no gcc)
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
        curl \
        libsnappy1v5 \
        liblzma5 \
    && rm -rf /var/lib/apt/lists/*

# Copy uv from builder (avoids second Docker Hub pull, guarantees version consistency)
COPY --from=builder /bin/uv /bin/uvx /bin/

# Copy pre-built virtual environment from builder stage
COPY --from=builder /app/.venv /app/.venv

# Copy dependency files (needed for uv to recognize the project)
COPY pyproject.toml uv.lock ./

# Set paths: PYTHONPATH for imports, PATH for venv binaries (no uv run needed at runtime)
ENV PYTHONPATH=/app
ENV PATH="/app/.venv/bin:$PATH"
ENV KUBECONFIG=/app/kubeconfig.yaml
```
Each service image extends the base and copies only application code. Since dependencies rarely change, Docker's layer caching means most builds only rebuild the thin application layer.
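As a sketch of that pattern (the image name, paths, and command below are illustrative, not taken from the repo):

```dockerfile
# Hypothetical service image extending the shared base.
# Only this thin layer changes between rebuilds; the dependency
# layers come from the cached base image.
FROM integr8scode-base:latest

WORKDIR /app

# Copy application code only; dependencies already live in /app/.venv
COPY app/ ./app/

# Workers override this command in docker-compose.yaml
CMD ["gunicorn", "app.main:app"]
```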
For local development, the compose file mounts source directories into the service containers and declares named volumes for persistent data:

```yaml
volumes:
  mongo_data:
  redis_data:
  grafana_data:
  victoria_metrics_data:
  shared_ca:
  loki_data:
  kafka_data:
  kafka_logs:
```
This preserves the container's `.venv` while allowing live code changes. Gunicorn watches for file changes and reloads
automatically. The design means `git clone` followed by `docker compose up` just works — no local Python environment
needed.
To stop everything and clean up volumes:
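Assuming the Compose defaults, teardown goes through either the script's `down` command or Docker Compose directly, with `-v` to remove the named volumes:

```shell
# Stop all containers via the wrapper script
./deploy.sh down

# Or stop containers and delete named volumes (mongo_data, kafka_data, etc.)
docker compose down -v --remove-orphans
```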
Running tests locally¶
The test command runs the full unit and E2E test suite:
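For reference, the invocation is just the script's `test` command:

```shell
./deploy.sh test
```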
This builds images, starts services, waits for the backend health endpoint using curl's built-in retry mechanism, runs
pytest with coverage reporting, then tears down the stack. The curl retry approach is cleaner than shell loops and
avoids issues with Docker Compose's --wait flag (which fails on init containers that exit after completion). Key
services define healthchecks in docker-compose.yaml:
| Service | Healthcheck |
|---|---|
| MongoDB | mongosh ping |
| Redis | redis-cli ping |
| Backend | curl /api/v1/health/live |
| Kafka | kafka-broker-api-versions |
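As an illustrative sketch (not the exact stanza from docker-compose.yaml; intervals and flags are assumptions), a backend healthcheck matching the table might look like:

```yaml
backend:
  healthcheck:
    test: ["CMD", "curl", "-kfsS", "https://localhost:443/api/v1/health/live"]
    interval: 10s
    timeout: 5s
    retries: 10
```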
Services without explicit healthchecks (workers, Grafana, Kafdrop) are considered "started" when their container is running. The test suite doesn't require worker containers since tests instantiate worker classes directly.
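The wait-for-healthy step described above can be approximated with curl's built-in retry flags (the exact flags used in deploy.sh may differ):

```shell
# Retry the liveness endpoint until it answers, up to ~60s total
curl --silent --show-error --insecure \
     --retry 30 --retry-delay 2 --retry-all-errors \
     --fail https://localhost:443/api/v1/health/live
```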
Container resource limits¶
Every long-running service has a mem_limit in docker-compose.yaml to prevent any single container from starving the host. The budget targets a 7.7 GB server with the observability profile enabled, leaving ~2 GB for the OS and page cache.
| Service | mem_limit | Internal cap | Notes |
|---|---|---|---|
| MongoDB | 1024m | wiredTiger 0.4 GB | `--wiredTigerCacheSizeGB 0.4` prevents default 50%-of-RAM behavior |
| Redis | 300m | 256 MB maxmemory | LRU eviction, persistence disabled |
| Kafka | 1280m | JVM `-Xms256m -Xmx1g` | Single-broker KRaft mode, low-throughput workload |
| Backend API | 768m | 2 gunicorn workers | Controlled by `WEB_CONCURRENCY` env var |
| Frontend | 128m | nginx serving static assets | |
| Each worker (x6) | 160m | Single-process Python | k8s-worker, pod-monitor, result-processor, saga-orchestrator, event-replay, dlq-processor |
| Grafana | 192m | | Observability profile |
| Jaeger | 256m | All-in-one, in-memory storage | Observability profile |
| Victoria Metrics | 256m | 30-day retention | Observability profile |
| OTel Collector | 192m | `limit_mib: 150` in memory_limiter processor; includes kafkametrics receiver | Observability profile |
All long-running services — core infrastructure (MongoDB, Redis, Kafka, backend, frontend), all six workers (k8s-worker, pod-monitor, result-processor, saga-orchestrator, event-replay, dlq-processor), and observability components (Grafana, victoria-metrics, otel-collector) — have restart: unless-stopped so they recover automatically after an OOM kill or crash.
Monitoring¶
Check service status using the deploy script or Docker Compose directly.
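Either route surfaces the same information:

```shell
# Via the wrapper script
./deploy.sh status

# Or directly: container state plus health status
docker compose ps
```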
Troubleshooting¶
| Issue | Cause | Solution |
|---|---|---|
| Unknown topic errors | Kafka not ready or wrong prefix | Check docker compose logs kafka |
| MongoDB auth errors | Password mismatch | Verify MONGO_USER/MONGO_PASSWORD env vars match MongoDB init |
| Worker crash loop | Config file missing | Ensure config.<worker>.toml exists |
Kafka topic debugging¶
```bash
docker compose logs kafka
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:29092
```
Topics are auto-created on first use. Each topic name is the event type value with an environment prefix (e.g., `dev_execution_requested`).
k3s crash loop after VPN or IP change¶
Symptoms:
- `systemctl status k3s` shows `Active: activating (auto-restart) (Result: exit-code)`
- k3s repeatedly crashes with `status=1/FAILURE`
- `kubectl` commands fail with `connection refused` or `ServiceUnavailable`
- API intermittently responds, then stops
Root cause:
When the host IP changes (VPN on/off, network switch, DHCP renewal), k3s stores stale IP references in two locations:
- SQLite database (`/var/lib/rancher/k3s/server/db/`) — contains cluster state with the old IP
- TLS certificates (`/var/lib/rancher/k3s/server/tls/`) — generated with the old IP in the SAN field
k3s detects the mismatch between the config (`node-ip` in `/etc/rancher/k3s/config.yaml`) and stored data, causing the crash loop.
Solution:
WARNING: DATA LOSS — the steps below will permanently delete all cluster state, including:

- All deployed workloads (pods, deployments, services)
- All cluster configuration (namespaces, RBAC, ConfigMaps, Secrets)
- All PersistentVolume data stored in the default local-path provisioner

Before proceeding, back up:

- etcd snapshots: `sudo k3s etcd-snapshot save`
- kubeconfig files
- Application manifests
- Any critical PersistentVolume data

Confirm backups are complete before continuing.
```bash
# 1. Stop k3s
sudo systemctl stop k3s

# 2. Delete corrupted database (k3s will rebuild it)
sudo rm -rf /var/lib/rancher/k3s/server/db/

# 3. Delete old TLS certificates (k3s will regenerate them)
sudo rm -rf /var/lib/rancher/k3s/server/tls/

# 4. Start k3s with clean state
sudo systemctl start k3s
```
After k3s restarts, regenerate the application kubeconfig:
```bash
# Regenerate kubeconfig with fresh ServiceAccount token
docker compose restart cert-generator

# Restart workers to pick up new kubeconfig
docker compose restart k8s-worker pod-monitor
```
Verification:
```bash
# Check k3s is running
systemctl status k3s  # Should show "active (running)"

# Test API access
KUBECONFIG=/path/to/backend/kubeconfig.yaml kubectl get namespaces

# Check workers connected
docker logs k8s-worker 2>&1 | tail -5
docker logs pod-monitor 2>&1 | tail -5
```
VPN-specific notes:
When using VPN (e.g., NordVPN with WireGuard/NordLynx):
- LAN Discovery must be enabled: `nordvpn set lan-discovery enabled`
- VPN can interfere with Docker's `host` network mode and k3s flannel networking
- Consider using bridge networking for containers that need to reach k3s
Pre-built images¶
The CI pipeline automatically builds and pushes images to GitHub Container Registry on every merge to main. To use
pre-built images instead of building locally, set `IMAGE_TAG`:
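`IMAGE_TAG` selects the image tag at pull/run time, for example:

```shell
# Pull and run pre-built images from GHCR instead of building locally
IMAGE_TAG=latest docker compose pull
IMAGE_TAG=latest docker compose up -d
```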
Available tags¶
| Tag | Description |
|---|---|
| `latest` | Most recent build from main branch |
| `sha-abc1234` | Specific commit SHA |
| `2026.2.0` | CalVer release version |
Production deployment¶
Merges to main trigger automatic deployment to the production server via the
Release & Deploy workflow. The full pipeline chain is:
1. Stack Tests — unit tests, image build, E2E tests
2. Docker Scan & Promote — Trivy vulnerability scan, promote `sha-xxx` to `latest`
3. Release & Deploy — create CalVer tag + GitHub Release, SSH deploy to production
The deploy step pulls the latest images on the server and recreates containers with zero-downtime health checks. No manual intervention is required for normal merges.
Rollback¶
To roll back to a previous release, use a specific CalVer or SHA tag:
```bash
# On the production server
IMAGE_TAG=2026.2.0 docker compose pull
IMAGE_TAG=2026.2.0 docker compose up -d --remove-orphans
```
Or trigger the Release & Deploy workflow manually with `skip_deploy` enabled to create a release without deploying,
then deploy a specific version via SSH.
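With the GitHub CLI, a manual run might look like this (the `skip_deploy` input name follows the text above; the exact workflow inputs are defined in release-deploy.yml):

```shell
# Trigger the Release & Deploy workflow without deploying
gh workflow run release-deploy.yml -f skip_deploy=true
```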
First-time setup¶
To configure the production server and GitHub Secrets, follow the Required secrets section
in the CI/CD docs. You will need to generate an SSH key pair, create a GitHub PAT with `read:packages` scope, and add
all four secrets (`DEPLOY_HOST`, `DEPLOY_USER`, `DEPLOY_SSH_KEY`, `DEPLOY_GHCR_TOKEN`) to the repository settings.
Key files¶
| File | Purpose |
|---|---|
| `deploy.sh` | Deployment script |
| `docker-compose.yaml` | Full stack definition |
| `backend/Dockerfile.base` | Shared base image with deps |
| `.github/workflows/docker.yml` | CI/CD image build pipeline |
| `.github/workflows/release-deploy.yml` | Release + deploy pipeline |