PRD-004-2 — Codebase Vector Index: Full Implementation Spec¶
| Field | Value |
|---|---|
| Document ID | PRD-004-2 |
| Version | 1.0 |
| Status | DRAFT |
| Date | March 2026 |
| Parent Doc | PRD-004 |
| Related Docs | PRD-004-1 (Chain specs, retriever usage) |
1. Purpose & Scope¶
This document provides the full implementation spec for the codebase vector index used by the Codebase Search Agent. PRD-004 §Codebase Vector Index leaves four gaps that block a complete implementation:
- codebase_retriever — package, collection naming, persist directory, and how to select the correct collection per repository are all unspecified.
- Index build process — the ARQ job, clone mechanism, chunking API, and file filter are not shown.
- Incremental re-index — "webhook or manual trigger" is hand-waved; the GitHub push event, git-diff mechanism, and partial Chroma update are completely unspecified.
- Correct from_language() API — PRD-004 mentions language-aware splitting in prose but never uses the actual API.
Boundary with PRD-004: PRD-004 covers the strategic decision to use Chroma, the retriever
parameters (k=8, score_threshold=0.3), and the index lifecycle table. This document covers
only the concrete implementation details. Do not duplicate the overview here.
Boundary with PRD-004-1: PRD-004-1 imports get_codebase_retriever from this module and uses
it inside the Codebase Search Agent chain. The retriever interface is specified here; its usage in
the chain is specified there.
2. Package & Import¶
The correct package is langchain-chroma (a standalone package, not langchain-community):
# Package: langchain-chroma (pip install langchain-chroma)
from langchain_chroma import Chroma
# Text splitting with language-aware boundaries
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
# Embeddings
from langchain_openai import OpenAIEmbeddings
# Retriever type hint
from langchain_core.vectorstores import VectorStoreRetriever
Do not use langchain_community.vectorstores.Chroma — that import path is deprecated and will
be removed in a future LangChain release.
3. Collection Naming Convention¶
One Chroma collection is created per repository. The collection name must be:
- Deterministic (same repo URL always maps to the same name)
- URL-safe (no slashes, colons, or special characters)
- Collision-resistant (two repos with similar names must not collide)
Convention: repo_{sha256(repo_url)[:16]}
import hashlib
def collection_name(repo_url: str) -> str:
"""
Return a deterministic, URL-safe Chroma collection name for a repository.
Normalises the URL before hashing: strips trailing slashes and the .git
suffix so that https://github.com/owner/repo, .../repo/, and .../repo.git
all resolve to the same collection.
Examples:
"https://github.com/owner/repo" → "repo_4a2f1b9c8d3e7f02"
"https://github.com/owner/repo/" → "repo_4a2f1b9c8d3e7f02"
"https://github.com/owner/repo.git" → "repo_4a2f1b9c8d3e7f02"
"https://github.com/owner/other-repo" → "repo_9c1d3e5f7a2b4c8d"
"""
normalised = repo_url.rstrip("/")
if normalised.endswith(".git"):
normalised = normalised[:-4]
digest = hashlib.sha256(normalised.encode()).hexdigest()
return f"repo_{digest[:16]}"
Note: collection_name normalises the URL before hashing (trailing slashes and .git
suffixes are stripped), so https://github.com/owner/repo, .../repo/, and .../repo.git
all map to the same collection. Callers may pass the raw URL as-is; pre-normalisation
is unnecessary, and harmless, since the normalisation is idempotent.
4. Persist Directory & Docker Volume¶
Persist directory inside the container: /data/chroma
Docker volume: chroma_data mounted at /data/chroma. The volume persists across container
restarts and image upgrades.
All collections share one directory. Chroma manages internal subdirectories per collection — no manual directory partitioning is needed.
docker-compose Declaration¶
services:
agentops-codebase-search:
image: agentops-codebase-search:latest
volumes:
- chroma_data:/data/chroma
environment:
- CHROMA_PERSIST_DIR=/data/chroma
- OPENAI_API_KEY=${OPENAI_API_KEY}
agentops-arq-worker:
image: agentops-arq-worker:latest
volumes:
- chroma_data:/data/chroma # worker writes; search service reads
environment:
- CHROMA_PERSIST_DIR=/data/chroma
- OPENAI_API_KEY=${OPENAI_API_KEY}
- GITHUB_TOKEN=${GITHUB_TOKEN}
volumes:
chroma_data:
Important: Both the agentops-codebase-search service and the agentops-arq-worker service
mount the same chroma_data volume. The search service only reads; the ARQ worker only writes
(protected by a per-repo Redis lock — see Section 8).
5. codebase_retriever Instantiation¶
Function Signature¶
import chromadb
def get_codebase_retriever(repository: str) -> VectorStoreRetriever:
"""
Return a LangChain VectorStoreRetriever for the Chroma collection
associated with the given repository URL.
Args:
repository: The repository URL as stored in BugTriageState.repository.
e.g. "https://github.com/owner/repo"
Returns:
A VectorStoreRetriever configured with similarity search,
k=8 results, and a minimum score threshold of 0.3.
Raises:
chromadb.errors.InvalidCollectionException: If the collection does not exist
(repository has not been indexed yet). Propagates to the worker error handler.
"""
coll_name = collection_name(repository)
# Chroma(persist_directory=...) silently creates missing collections, so we
# must validate existence via the raw client first. get_collection() raises
# InvalidCollectionException if the name is not found — let it propagate.
chromadb.PersistentClient(path="/data/chroma").get_collection(coll_name)
vectorstore = Chroma(
collection_name=coll_name,
embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
persist_directory="/data/chroma",
)
return vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={"k": 8, "score_threshold": 0.3},
)
Call Site¶
get_codebase_retriever is called inside codebase_search_node (in the LangGraph orchestration
layer) before building the chain, using state["repository"]:
async def codebase_search_node(state: BugTriageState) -> dict:
retriever = get_codebase_retriever(state["repository"])
chain = build_codebase_chain(retriever) # see PRD-004-1 §6
# ... invoke chain and translate finding
6. Index Build Process¶
Trigger¶
The full index build is performed by an ARQ background job: build_codebase_index(repo_url, force=False).
Enqueue conditions (checked by the POST /jobs endpoint when a new job is submitted):
- The repository has never been indexed (repo_index_metadata row does not exist).
- The existing index is older than 24 hours (indexed_at < now() - interval '24 hours').
- force=True is passed explicitly (manual rebuild via admin API).
If a build_codebase_index job for the same repository is already queued or running, the enqueue
is skipped (ARQ deduplication via job ID = build:{collection_name}).
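The deduplication key can be derived from the Section 3 helper. A minimal sketch (the build_job_id name is illustrative; the commented enqueue call assumes arq's _job_id keyword, which refuses to enqueue a second job with the same ID while one is queued or running):

```python
import hashlib


def collection_name(repo_url: str) -> str:
    # Same normalisation as Section 3: strip trailing slashes and .git suffix.
    normalised = repo_url.rstrip("/")
    if normalised.endswith(".git"):
        normalised = normalised[:-4]
    return f"repo_{hashlib.sha256(normalised.encode()).hexdigest()[:16]}"


def build_job_id(repo_url: str) -> str:
    """Deterministic ARQ job ID: build:{collection_name}."""
    return f"build:{collection_name(repo_url)}"


# Hypothetical enqueue call using arq's built-in job-ID deduplication:
#   await pool.enqueue_job("build_codebase_index", repo_url,
#                          _job_id=build_job_id(repo_url))
```

Because the job ID is derived from the normalised URL, a push webhook and a manual admin rebuild for the same repo map to the same ID and cannot race.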
Database Table¶
CREATE TABLE repo_index_metadata (
repo_url TEXT PRIMARY KEY,
indexed_at TIMESTAMPTZ NOT NULL,
head_sha TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending' -- pending | building | ready | failed
);
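The _update_metadata helper referenced by the jobs below is not defined in PRD-004; its core is a single upsert against this table. A sketch of the intended SQL, demonstrated here against SQLite (which shares Postgres's ON CONFLICT syntax; the production version would run it through the project's Postgres driver with its own placeholder style):

```python
import sqlite3

# Upsert that _update_metadata is expected to run: insert on first build,
# update indexed_at / head_sha / status on every subsequent build.
UPSERT_SQL = """
INSERT INTO repo_index_metadata (repo_url, indexed_at, head_sha, status)
VALUES (?, ?, ?, ?)
ON CONFLICT (repo_url) DO UPDATE SET
    indexed_at = excluded.indexed_at,
    head_sha   = excluded.head_sha,
    status     = excluded.status
"""

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE repo_index_metadata ("
    "repo_url TEXT PRIMARY KEY, indexed_at TEXT NOT NULL, "
    "head_sha TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'pending')"
)
db.execute(UPSERT_SQL, ("https://github.com/owner/repo", "2026-03-01T00:00:00Z", "abc123", "building"))
db.execute(UPSERT_SQL, ("https://github.com/owner/repo", "2026-03-01T00:10:00Z", "def456", "ready"))
rows = db.execute("SELECT head_sha, status FROM repo_index_metadata").fetchall()
# A single row per repo, updated in place
```

The PRIMARY KEY on repo_url guarantees one metadata row per repository regardless of how many rebuilds run.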
ARQ Job Implementation¶
import os
import shutil
import subprocess
import tempfile
import logging
from pathlib import Path
import chardet
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
logger = logging.getLogger(__name__)
SUPPORTED_EXTENSIONS = {".py", ".js", ".ts", ".go", ".java"}
EXTENSION_TO_LANGUAGE = {
".py": Language.PYTHON,
".js": Language.JS,
".ts": Language.TS,
".go": Language.GO,
".java": Language.JAVA,
}
EXCLUDE_DIRS = {"node_modules", "dist", "__pycache__", ".git"}
EXCLUDE_FILENAME_PATTERNS = {"*.min.js", "*.min.css"}
MAX_REPO_SIZE_MB = 500
async def build_codebase_index(ctx, repo_url: str, force: bool = False):
"""
ARQ job: clone repo, chunk source files, embed, store in Chroma.
"""
col_name = collection_name(repo_url)
redis = ctx["redis"]
# Per-repo lock to prevent concurrent index builds
lock_key = f"index_lock:{col_name}"
async with redis.lock(lock_key, timeout=3600):
await _do_build(repo_url, col_name)
async def _do_build(repo_url: str, col_name: str):
with tempfile.TemporaryDirectory() as tmpdir:
# 1. Size check before clone
_check_repo_size(repo_url)
# 2. Clone (shallow)
github_token = os.environ["GITHUB_TOKEN"]
auth_url = repo_url.replace("https://", f"https://{github_token}@")
subprocess.run(
["git", "clone", "--depth=1", auth_url, tmpdir],
check=True, capture_output=True,
)
# 3. Get HEAD SHA
result = subprocess.run(
["git", "rev-parse", "HEAD"],
cwd=tmpdir, check=True, capture_output=True, text=True,
)
head_sha = result.stdout.strip()
# 4. Collect and chunk source files
docs = _chunk_repository(Path(tmpdir))
        # 5. Embed and store. Record the pre-build document IDs so stale
        # chunks can be deleted after the new ones are added; this removes
        # duplicates on rebuild (Chroma.from_documents appends to an existing
        # collection) while keeping the collection readable throughout.
        import chromadb
        client = chromadb.PersistentClient(path="/data/chroma")
        try:
            old_ids = client.get_collection(col_name).get()["ids"]
        except Exception:
            old_ids = []  # first build: collection does not exist yet
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        vectorstore = Chroma.from_documents(
            documents=docs,
            embedding=embeddings,
            collection_name=col_name,
            persist_directory="/data/chroma",
        )
        if old_ids:
            vectorstore.delete(ids=old_ids)
# 6. Update metadata
await _update_metadata(repo_url, head_sha, status="ready")
logger.info("index built: repo=%s collection=%s docs=%d", repo_url, col_name, len(docs))
File Filter¶
def _should_index(path: Path) -> bool:
"""Return True if the file should be indexed."""
import fnmatch
    # Exclude blocked directories (EXCLUDE_DIRS) and hidden directories/files
for part in path.parts:
if part in EXCLUDE_DIRS or part.startswith("."):
return False
# Exclude minified files and other glob-matched filename patterns
if any(fnmatch.fnmatch(path.name, pat) for pat in EXCLUDE_FILENAME_PATTERNS):
return False
# Extension allowlist
if path.suffix not in SUPPORTED_EXTENSIONS:
return False
# Skip binary files (heuristic: chardet confidence < 0.9 for text)
try:
raw = path.read_bytes()[:4096]
result = chardet.detect(raw)
if result["confidence"] < 0.9 or result["encoding"] is None:
return False
except OSError:
return False
return True
Language-Aware Chunking¶
def _chunk_repository(repo_root: Path) -> list:
"""Chunk all indexable source files in the repository."""
from langchain_core.documents import Document
docs = []
for path in repo_root.rglob("*"):
if not path.is_file() or not _should_index(path):
continue
language = EXTENSION_TO_LANGUAGE[path.suffix]
splitter = RecursiveCharacterTextSplitter.from_language(
language=language, # splits on language-specific syntax boundaries
chunk_size=1000,
chunk_overlap=200,
)
content = path.read_text(encoding="utf-8", errors="ignore")
rel_path = str(path.relative_to(repo_root))
chunks = splitter.create_documents(
texts=[content],
metadatas=[{"source": rel_path}],
)
docs.extend(chunks)
return docs
Supported Languages¶
| Extension | Language enum value | Notes |
|---|---|---|
| .py | Language.PYTHON | Splits on class/def boundaries |
| .js | Language.JS | Splits on function/arrow functions |
| .ts | Language.TS | Same as JS + type annotations |
| .go | Language.GO | Splits on func declarations |
| .java | Language.JAVA | Splits on class/method boundaries |
7. Incremental Re-index¶
Trigger: GitHub Push Webhook¶
Webhook endpoint: POST /webhooks/github on the main API server.
GitHub event type: push (configured in the repository's GitHub webhook settings with
content type application/json). Only push events to the default branch trigger a re-index.
Webhook secret: The X-Hub-Signature-256 header must be validated using
GITHUB_WEBHOOK_SECRET from the environment before processing any payload.
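Signature validation follows GitHub's documented HMAC-SHA256 scheme. A minimal sketch (the function name is illustrative; note the raw request body bytes must be hashed before any JSON parsing):

```python
import hashlib
import hmac


def verify_github_signature(payload: bytes, signature_header: str, secret: str) -> bool:
    """Validate the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking the expected digest via timing
    return hmac.compare_digest(expected, signature_header)
```

Requests failing this check should be rejected with 401 before the payload is deserialised.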
Webhook Payload Fields Used¶
{
"ref": "refs/heads/main",
"before": "abc123", // base SHA (previous HEAD)
"after": "def456", // new HEAD SHA
"repository": {
"clone_url": "https://github.com/owner/repo.git",
"default_branch": "main"
}
}
Default branch check: Only process if ref == f"refs/heads/{repository.default_branch}".
Push events to feature branches are ignored.
Incremental Update Flow¶
GitHub push → POST /webhooks/github
↓
Validate X-Hub-Signature-256
↓
Check: ref == default branch? No → ignore
↓ Yes
Enqueue ARQ job: update_codebase_index(repo_url, base_sha=before, head_sha=after)
↓
ARQ worker acquires per-repo Redis lock
↓
git diff --name-only {base_sha}..{head_sha} → list of changed file paths
↓
For each changed file path:
Delete Chroma documents where metadata["source"] == changed_file_path
Re-chunk and re-embed the changed file
Add new documents to Chroma collection
↓
Update repo_index_metadata: head_sha = after, indexed_at = now()
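The per-file branch in the flow above (old documents are always deleted; new ones are re-added only when the file still exists and is indexable) can be isolated as a pure function. A sketch with hypothetical names, useful for unit-testing the flow without Chroma:

```python
from pathlib import Path
from typing import Callable


def plan_incremental_update(
    changed_files: list[str],
    repo_root: Path,
    should_index: Callable[[Path], bool],
) -> tuple[list[str], list[str]]:
    """
    Split the git-diff file list into paths to re-chunk and re-embed, and
    paths whose documents should only be deleted (file removed or no longer
    indexable). Every changed path has its old documents deleted either way.
    """
    reindex, delete_only = [], []
    for rel_path in changed_files:
        abs_path = repo_root / rel_path
        if abs_path.exists() and should_index(abs_path):
            reindex.append(rel_path)
        else:
            delete_only.append(rel_path)
    return reindex, delete_only
```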
ARQ Job Implementation¶
async def update_codebase_index(ctx, repo_url: str, base_sha: str, head_sha: str):
"""
ARQ job: incrementally re-index only the files changed between base_sha and head_sha.
Falls back to full re-index if base_sha is not available in the shallow clone.
"""
col_name = collection_name(repo_url)
redis = ctx["redis"]
async with redis.lock(f"index_lock:{col_name}", timeout=3600):
with tempfile.TemporaryDirectory() as tmpdir:
# Clone the default branch (shallow)
github_token = os.environ["GITHUB_TOKEN"]
auth_url = repo_url.replace("https://", f"https://{github_token}@")
subprocess.run(
["git", "clone", "--depth=50", auth_url, tmpdir],
check=True, capture_output=True,
)
# Get changed files via git diff
diff_result = subprocess.run(
["git", "diff", "--name-only", f"{base_sha}..{head_sha}"],
cwd=tmpdir, capture_output=True, text=True,
)
if diff_result.returncode != 0:
# base_sha not found (force push or too shallow) — fall back to full re-index
logger.warning(
"base_sha %s not found for %s; falling back to full re-index",
base_sha, repo_url,
)
await _do_build(repo_url, col_name)
return
changed_files = [
f for f in diff_result.stdout.strip().splitlines() if f
]
if not changed_files:
logger.info("no changed files for %s; skipping re-index", repo_url)
return
repo_root = Path(tmpdir)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
collection_name=col_name,
embedding_function=embeddings,
persist_directory="/data/chroma",
)
for rel_path in changed_files:
abs_path = repo_root / rel_path
# Delete existing documents for this file
existing = vectorstore.get(where={"source": rel_path})
if existing["ids"]:
vectorstore.delete(ids=existing["ids"])
# Re-index if file still exists and is indexable
if abs_path.exists() and _should_index(abs_path):
language = EXTENSION_TO_LANGUAGE.get(abs_path.suffix)
if language is None:
continue
splitter = RecursiveCharacterTextSplitter.from_language(
language=language,
chunk_size=1000,
chunk_overlap=200,
)
content = abs_path.read_text(encoding="utf-8", errors="ignore")
chunks = splitter.create_documents(
texts=[content],
metadatas=[{"source": rel_path}],
)
vectorstore.add_documents(chunks)
logger.info(
"incremental re-index complete: repo=%s files_updated=%d",
repo_url, len(changed_files),
)
await _update_metadata(repo_url, head_sha, status="ready")
Fallback to Full Re-index¶
| Condition | Action |
|---|---|
| base_sha not in shallow clone (force push, rebase) | Full build_codebase_index |
| Diff returns non-zero exit code | Full build_codebase_index |
| Changed file list is empty | Skip (no-op) |
| File deleted in push | Documents deleted from Chroma; not re-added |
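The table rows collapse into a single decision helper. A sketch (the function name and string labels are illustrative; the deleted-file case is handled per-path inside the incremental update, not here):

```python
def pick_reindex_strategy(diff_returncode: int, changed_files: list[str]) -> str:
    """
    Map the git-diff outcome to an action:
    'full' runs build_codebase_index, 'skip' is a no-op,
    'incremental' does the per-file delete + re-add.
    """
    if diff_returncode != 0:
        # base_sha missing from the shallow clone (force push, rebase,
        # or history deeper than --depth=50)
        return "full"
    if not changed_files:
        return "skip"
    return "incremental"
```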
8. Known Limitations¶
Repository Size Limit¶
Maximum repository size: 500 MB (checked via the GitHub repos API before cloning; see _check_repo_size below).
def _check_repo_size(repo_url: str):
"""Raises ValueError if the remote repository is too large to index."""
# GitHub API: GET /repos/{owner}/{repo} returns "size" in KB
import httpx
    path = repo_url.rstrip("/").removesuffix(".git").replace("https://github.com/", "")  # "owner/repo"
resp = httpx.get(
f"https://api.github.com/repos/{path}",
headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
size_mb = resp.json()["size"] / 1024
if size_mb > MAX_REPO_SIZE_MB:
raise ValueError(
f"Repository {repo_url} is {size_mb:.0f} MB, "
f"exceeding the {MAX_REPO_SIZE_MB} MB limit."
)
Private Repositories¶
Private repositories require GITHUB_TOKEN with repo scope. The token is injected via the
GITHUB_TOKEN environment variable in the agentops-arq-worker container. Public repositories
work without authentication but the token is still sent (GitHub ignores it for public repos).
Concurrent Write Protection¶
The Chroma collection is not thread-safe for concurrent writes. The ARQ worker acquires a per-repo Redis lock before any index build or update:
lock_key = f"index_lock:{collection_name(repo_url)}"
async with redis.lock(lock_key, timeout=3600):
...
- Lock timeout: 3600 seconds (1 hour). If a build takes longer, the lock expires and a new build can start — acceptable because Chroma's worst case is a partially-updated collection, not corruption.
- The agentops-codebase-search service only reads from Chroma. Read/write isolation at the Chroma level is not needed, only write/write isolation.
Index Staleness¶
An index older than 24 hours triggers a background rebuild before the next job (see Section 6, enqueue conditions). Between the job submission and the rebuild completing, the search agent uses the stale index. This is acceptable — stale code embeddings are better than no embeddings.
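The enqueue-time staleness check reduces to a comparison against indexed_at. A sketch (the helper name is illustrative; it mirrors the 24-hour condition from the Section 6 enqueue list, treating a missing metadata row as "never indexed"):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALENESS_WINDOW = timedelta(hours=24)


def needs_rebuild(indexed_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """True if the repo was never indexed or the index is older than 24 hours."""
    now = now or datetime.now(timezone.utc)
    return indexed_at is None or indexed_at < now - STALENESS_WINDOW
```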
If the index is in status = 'building' when a search is requested, get_codebase_retriever
still returns the existing (stale) collection rather than failing. The new build will complete
in the background and future searches will use the updated index.