PRD-010: Evaluation Framework
| Field | Value |
|---|---|
| Document ID | PRD-010 |
| Version | 1.0 |
| Status | DRAFT |
| Date | March 2026 |
| Parent Doc | PRD-001 |
| Related Docs | PRD-004 (Agent Layer), PRD-005 (LangSmith Observability) |
Overview
This document specifies the evaluation framework for AgentOps Dashboard: how agent output quality is measured, what data is used as ground truth, how scores are computed, and how evaluations are gated in CI.
The framework is built on LangSmith's evaluation primitives (datasets, evaluators, experiments) and runs automatically on every prompt change and deployment.
Evaluation Dimensions
What Gets Evaluated
The eval framework measures four dimensions:
| Dimension | Question | Evaluator Type |
|---|---|---|
| Triage Accuracy | Does the root cause match what a human engineer would identify? | LLM-as-judge + human comparison |
| Report Usefulness | Is the final report helpful and actionable? | LLM-as-judge (rubric) |
| Question Quality | When the supervisor asks the user a question, is it a good question? | Human feedback (thumbs up/down in UI) |
| Agent Efficiency | Did the supervisor route optimally (no redundant agent calls)? | Automated: count supervisor hops vs. minimum path |
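The Agent Efficiency row only says that supervisor hops are counted against the minimum path. One way to turn that count into a score is sketched below. The ratio `minimum_hops / actual_hops` and both function names are assumptions made for illustration, not something this document specifies.

```python
def routing_efficiency(actual_hops: int, minimum_hops: int) -> float:
    """Return minimum_hops / actual_hops; 1.0 means the supervisor took the shortest route."""
    if minimum_hops < 1:
        raise ValueError("minimum_hops must be at least 1")
    if actual_hops < minimum_hops:
        raise ValueError("actual_hops cannot be smaller than minimum_hops")
    return minimum_hops / actual_hops


def redundant_hops(actual_hops: int, minimum_hops: int) -> int:
    """Number of supervisor hops beyond the minimum path."""
    return actual_hops - minimum_hops


# Hypothetical job: the minimum path needs 4 hops, the supervisor used 8.
print(routing_efficiency(actual_hops=8, minimum_hops=4))
print(redundant_hops(actual_hops=8, minimum_hops=4))
```

A ratio keeps the score comparable across jobs whose minimum paths differ in length, while the raw redundant-hop count is easier to read in a trace.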
LLM-as-Judge Setup
from typing import Literal

from pydantic_settings import BaseSettings
from langchain_anthropic import ChatAnthropic
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# run_triage_job is not defined in this snippet; it is assumed to be imported
# from the project's agent layer.


class EvalSettings(BaseSettings):
    langchain_project: Literal["agentops-staging"]  # fails fast if pointed at production
    langsmith_api_key: str  # required to submit eval results to LangSmith
    langserve_base_url: str  # must be a staging deployment URL
    openai_api_key: str  # separate eval project key for billing isolation


def main() -> None:
    """Run the LangSmith evaluation suite against the golden dataset."""
    settings = EvalSettings()  # raises ValidationError if env vars are missing or wrong

    # "labeled_criteria" passes the dataset's reference answer to the judge.
    # Plain "criteria" does not, and the accuracy criterion depends on it.
    helpfulness_evaluator = LangChainStringEvaluator(
        "labeled_criteria",
        config={
            "criteria": {
                "helpfulness": "Is this triage report specific, actionable, and correct?",
                "completeness": "Does the report cover severity, root cause, relevant files, and a fix suggestion?",
                "accuracy": "Does the root cause match the reference answer?",
            },
            "llm": ChatAnthropic(model="claude-sonnet-4-6", temperature=0),
        },
    )

    evaluate(
        lambda inputs: run_triage_job(inputs["issue_url"]),
        data="agentops-golden-dataset-v1",
        evaluators=[helpfulness_evaluator],
        experiment_prefix="prompt-change-2026-03",
    )


if __name__ == "__main__":
    main()
The judge uses a different model family (Anthropic) than the production agents (OpenAI GPT-4o) to reduce self-preference bias: a model grading its own outputs tends to score them higher, an effect reported to inflate scores by 10–25%.
Scoring Rubric
| Score | Meaning |
|---|---|
| 5 | Perfect: root cause is correct, files are exact, report is clear and actionable |
| 4 | Good: root cause is correct, minor gaps in files or report formatting |
| 3 | Partial: hypothesis is on the right track but root cause is incomplete or imprecise |
| 2 | Poor: wrong code area identified, or report is too vague to be actionable |
| 1 | Fail: completely wrong diagnosis or empty output |
Target: Average score ≥ 4.0 / 5.0 on the golden dataset before any prompt change is deployed to production.
Golden Dataset
Structure
The golden dataset is a collection of real GitHub issues with human-authored reference answers:
{
  "issue_url": "https://github.com/org/repo/issues/1042",
  "issue_title": "Auth token expiry causes 500 on /api/me",
  "issue_body": "...",
  "reference": {
    "severity": "HIGH",
    "root_cause": "JWT expiry check in auth/middleware.py:L142 uses local time instead of UTC",
    "relevant_files": ["auth/middleware.py", "tests/test_auth.py"],
    "expected_keywords": ["JWT", "UTC", "timezone", "token expiry"]
  }
}
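The record above carries an `expected_keywords` list, but this document does not say how it is consumed. As a minimal sketch, assuming the keywords act as a cheap automated check that runs alongside the LLM judge, coverage could be computed as the fraction of keywords found in the generated report. The function name `keyword_coverage` and the sample report text are hypothetical.

```python
def keyword_coverage(report_text: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the report (case-insensitive substring match)."""
    if not expected_keywords:
        return 1.0  # nothing to check, so nothing is missing
    text = report_text.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


# Reference keywords copied from the example record above.
expected = ["JWT", "UTC", "timezone", "token expiry"]

# Hypothetical agent output, used only to exercise the function.
report = "The JWT token expiry check compares against local time, not UTC."

print(keyword_coverage(report, expected))
```

Substring matching is deliberately crude: it cannot tell a correct diagnosis from a report that merely mentions the right terms, so it suits a quick sanity signal rather than a replacement for the judge.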
Dataset Growth Plan
| Phase | Dataset Size | Source |
|---|---|---|
| v1.0 launch | 20 issues | Manually authored from real repos |
| v1.1 | 50 issues | User feedback thumbs up/down on job outputs |
| v2.0 | 200+ issues | Crowdsourced from community contributors |
Dataset Management
The golden dataset is managed in LangSmith's Datasets UI. New examples can be added directly from a LangSmith trace: if a live production job produces a high-quality output, it can be added to the dataset in one click via LangSmith's "Add to Dataset" feature.
Automated Eval Pipeline
When Evals Run
| Trigger | Action |
|---|---|
| Any agent prompt change proposed in a PR to main | CI pipeline runs eval against golden dataset; fails PR if avg score drops > 0.3 |
| New LangServe agent version deployed to staging | Eval runs automatically; results posted to PR as a comment |
| Daily at 02:00 UTC | Production eval: random sample of 10 recent jobs scored and logged |
| Manual trigger | Developer can run evals on demand from LangSmith UI or CLI |
CI Integration
# .github/workflows/eval.yml
- name: Run LangSmith Evals
  run: |
    python scripts/run_evals.py \
      --dataset agentops-golden-dataset-v1 \
      --project agentops-staging \
      --min-score 4.0 \
      --fail-on-regression
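The contents of `scripts/run_evals.py` are not shown in this document. The sketch below is one hypothetical way the flags above could map to a pass/fail decision. It combines the absolute target (average of at least 4.0) with the regression rule from the trigger table (fail if the average drops by more than 0.3). The function names, the `baseline_avg` parameter, and the `max_drop` default are assumptions.

```python
import argparse
from statistics import mean


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in the workflow step."""
    parser = argparse.ArgumentParser(description="Run evals and gate on the result.")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--project", required=True)
    parser.add_argument("--min-score", type=float, default=4.0)
    parser.add_argument("--fail-on-regression", action="store_true")
    return parser


def gate(scores, baseline_avg, min_score=4.0, max_drop=0.3, fail_on_regression=True):
    """Return (passed, reason) for a list of 1-5 rubric scores.

    baseline_avg is the average from the previous accepted run, or None if
    no baseline exists yet (assumption: the caller looks it up).
    """
    if not scores:
        return False, "no scores were produced"
    avg = mean(scores)
    if avg < min_score:
        return False, f"average {avg:.2f} is below the minimum {min_score:.2f}"
    if fail_on_regression and baseline_avg is not None and baseline_avg - avg > max_drop:
        return False, f"average dropped {baseline_avg - avg:.2f} from baseline {baseline_avg:.2f}"
    return True, f"average {avg:.2f} passes"


# Parse the same flags the workflow passes, then gate a made-up set of scores.
args = build_parser().parse_args(
    ["--dataset", "agentops-golden-dataset-v1", "--project", "agentops-staging",
     "--min-score", "4.0", "--fail-on-regression"]
)
passed, reason = gate(
    [4, 5, 4, 4],
    baseline_avg=4.4,
    min_score=args.min_score,
    fail_on_regression=args.fail_on_regression,
)
print(passed, reason)
```

In CI the script would exit non-zero when `passed` is false, which is what fails the pull request check. Keeping the two conditions separate lets a run fail on a sharp regression even when the average is still above the absolute target.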
Eval Environment Requirements
CI-triggered evals (PR gate and staging-deploy trigger) call run_triage_job against the
golden dataset. They must run in a fully isolated environment:
| Resource | Production value | Eval (CI) value | Why |
|---|---|---|---|
| LangSmith project | agentops-prod | agentops-staging | Prevent eval traces from polluting production dashboards |
| LangServe URL | https://agents.prod/… | https://agents.staging/… | Prevent eval traffic from consuming production rate-limit budget |
| OpenAI API key | Production org/project | Separate org sub-account or project | Billing isolation; eval cost tracked separately from user traffic |
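The isolation rule in the table can also be checked mechanically before an eval run starts. The sketch below assumes the eval and production values are both available as plain dictionaries and reports any resource whose eval value equals the production one. The function name `find_isolation_violations`, the dictionary keys, and the placeholder key values are hypothetical. The URLs are kept elided exactly as in the table.

```python
def find_isolation_violations(eval_env: dict[str, str], prod_env: dict[str, str]) -> list[str]:
    """Return the names of resources whose eval value is identical to the production value."""
    return sorted(name for name, value in eval_env.items() if prod_env.get(name) == value)


# Placeholder values; real keys and full URLs are intentionally not filled in.
prod_env = {
    "langchain_project": "agentops-prod",
    "langserve_base_url": "https://agents.prod/…",
    "openai_api_key": "<production key>",
}
eval_env = {
    "langchain_project": "agentops-staging",
    "langserve_base_url": "https://agents.staging/…",
    "openai_api_key": "<eval key>",
}

violations = find_isolation_violations(eval_env, prod_env)
if violations:
    raise SystemExit(f"eval environment shares production resources: {violations}")
print("eval environment is isolated")
```

This complements the `Literal["agentops-staging"]` check in `EvalSettings`, which covers only the LangSmith project name.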
Staging deployment requirement: A dedicated staging deployment of all LangServe agents
(investigator, codebase-search, web-search, critic, writer) must be maintained and
kept in sync with main. CI evals fail if the staging deployment is unreachable.
Daily production eval (02:00 UTC) is exempt from the above: it scores completed job
outputs already stored in LangSmith and does not call run_triage_job on new inputs.
Production credentials are appropriate for that trigger.
CI env vars (set in GitHub Actions secrets):