PRD-005 — Observability & Evaluation: LangSmith¶
| Field | Value |
|---|---|
| Document ID | PRD-005 |
| Version | 1.0 |
| Status | DRAFT |
| Date | March 2026 |
| Parent Doc | PRD-001 |
| Related Docs | PRD-003 (Orchestration), PRD-004 (Agent Layer) |
Overview¶
Detailed specs: LangSmith API & Integration Spec — fills 7 implementation gaps in this document (LangSmith read API, in-flight cost tracking, user feedback SDK calls, Manual Review Queue, automation rules, deep-link URL construction, and daily eval score storage).
LangSmith is the observability and evaluation layer of AgentOps Dashboard. It wraps every layer of the stack automatically — LCEL chains inside each agent, LangGraph orchestration decisions, and full end-to-end job runs — and surfaces them in a unified tracing dashboard.
Unlike typical application monitoring (Datadog, New Relic), LangSmith is specifically designed for LLM applications. It understands the concept of tokens, prompts, model calls, tool invocations, and agent reasoning chains — making it the right tool for this use case.
LangSmith is used in three modes in this product:
- Development — debug why an agent produced a wrong answer by inspecting its exact prompt and model response
- Iteration — run eval datasets after every prompt change to catch regressions before deploying
- Production — monitor live job quality, cost per job, latency per agent, and error rates
Why LangSmith¶
| Need | LangSmith Solution |
|---|---|
| "Why did the investigator agent produce the wrong hypothesis?" | Full trace: exact prompt sent, token-by-token response, structured output parsed |
| "Did my prompt change improve agent quality?" | Dataset eval: run old vs. new prompt on golden dataset, compare scores side by side |
| "How much does one bug triage job cost?" | Automatic token counting and cost calculation per run, per agent, per job |
| "Which agent is the slowest bottleneck?" | Latency breakdown per node in the LangGraph trace |
| "Is agent quality degrading over time?" | Production monitoring with trend charts over rolling 7-day window |
| "Can I A/B test two different critic prompts?" | LangSmith experiments: split traffic or run parallel evals |
LangSmith is framework-agnostic — it instruments LCEL chains (via LangChain), LangGraph nodes (natively), and LangServe endpoints (automatically). Zero extra instrumentation code is required beyond setting the environment variables.
Trace Architecture¶
Trace Levels¶
LangSmith captures traces at three levels, all linked in a parent–child hierarchy:
```mermaid
flowchart TD
    JOB["Job Run\njob_id: uuid-1234\ntokens: 12,450 · cost: $0.043 · duration: 127s"]
    SUP1["supervisor [node 1]\nprompt: 840 tokens\ndecision: call investigator"]
    INV["investigator [node 2] — 12s"]
    LCEL_I["LCEL Chain: investigator\ngpt-4o-mini · 1,240 tokens\noutput: {hypothesis, confidence: 0.8}"]
    SUP3["supervisor [node 3]\ndecision: call codebase_search"]
    CS["codebase_search [node 4]"]
    RET["Retriever: VectorStoreRetriever\nquery: JWT token expiry UTC\n→ 3 chunks from auth/middleware.py"]
    LLM_C["LLM: gpt-4o · 3,200 tokens"]
    HI["human_input [node 5]\nq: Focus on auth or DB layer?\na: Auth layer — recent JWT lib change\nwait: 145s"]
    WR["writer [node N]"]
    LCEL_W["LCEL Chain: writer\nparallel: report_chain · comment_chain · ticket_chain"]
    JOB --> SUP1
    JOB --> INV --> LCEL_I
    JOB --> SUP3
    JOB --> CS --> RET
    CS --> LLM_C
    JOB --> HI
    JOB --> WR --> LCEL_W
```
What Gets Automatically Captured¶
No manual instrumentation is needed for:
- All LCEL chain inputs and outputs (LangChain native)
- All LangGraph node transitions, state diffs, and routing decisions (LangGraph native)
- All LangServe endpoint calls — token counts, latency, model used (LangServe native)
- Tool calls (Tavily, Chroma retriever) — query sent, results returned
Manual tagging is added for:
- `job_id` — links the LangSmith trace to the AgentOps job record
- `repository` — enables filtering by repo in the LangSmith dashboard
- `human_question` / `human_answer` — captured as metadata on the `human_input` node
Integration Setup¶
Environment Variables¶
```bash
# Set in all services (orchestration + all LangServe agents)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agentops-dashboard  # separates prod traces from dev
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
```
No other code changes are required. LangSmith auto-instruments all LangChain and LangGraph calls when `LANGSMITH_TRACING=true`.
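Because tracing degrades silently when a variable is missing, a startup sanity check is cheap insurance. The sketch below is our own addition; only the variable names come from this document:

```python
import os

# Variables every service must export for LangSmith tracing to work.
REQUIRED = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")

def missing_tracing_vars(env=os.environ) -> list[str]:
    """Return the names of required LangSmith variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

A service can call this at boot and refuse to start (or log loudly) if the returned list is non-empty.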
Project Separation¶
| Environment | LangSmith Project | Purpose |
|---|---|---|
| Development | `agentops-dev` | Local debugging; noisy traces acceptable |
| Staging | `agentops-staging` | Eval runs against golden dataset |
| Production | `agentops-prod` | Live monitoring; alerts configured here |
Tagging Runs¶
Each LangGraph job run is tagged with metadata and tags for filtering in the
LangSmith dashboard. Both are set in the combined config dict assembled in the
ARQ worker — see Accessing the Trace URL for the
full example.
Accessing the Trace URL¶
The run_id is generated before graph invocation and passed via config["run_id"].
LangGraph forwards it to LangSmith as the root run ID. Because the ID is pre-assigned,
it is stored in BugTriageState.langsmith_run_id and the DB before the graph runs —
no callback manager, no traced_runs access, no race condition.
```python
# worker.py — inside run_triage(), before astream_events
import os
import uuid

langsmith_run_id = uuid.uuid4()
config = {
    "configurable": {"thread_id": job_id},
    "run_id": langsmith_run_id,  # LangSmith root run ID; pre-assigned, never read back
    "metadata": {
        "job_id": job_id,
        "repository": initial_state["repository"],
        "issue_url": initial_state["issue_url"],
        "env": os.getenv("ENVIRONMENT", "dev"),
    },
    "tags": ["bug-triage", initial_state["repository"]],
}

# Persist the run ID in graph state so the checkpointer carries it through the job lifetime
initial_state["langsmith_run_id"] = str(langsmith_run_id)

async for event in graph.astream_events(initial_state, config=config, version="v2"):
    ...
```
Deep-link format:
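The authoritative URL construction is specified in PRD-005-1. As a non-authoritative sketch, assuming the common LangSmith trace URL shape (`/o/<org>/projects/p/<project>/r/<run_id>`), with `org_id` and `project_id` as placeholder parameters not defined in this document:

```python
def trace_url(org_id: str, project_id: str, run_id: str) -> str:
    """Build a LangSmith deep link for a root run (URL shape is an assumption)."""
    base = "https://smith.langchain.com"
    return f"{base}/o/{org_id}/projects/p/{project_id}/r/{run_id}"
```

Because the run ID is pre-assigned (see above), this link can be stored alongside the job record before the graph finishes.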
UI Integration — LangSmith Deep Links¶
"View in LangSmith" Button¶
Every completed or failed job in the AgentOps Dashboard UI shows a "View in LangSmith" button in the Output Panel (PRD-002, Zone 3). This button opens the full job trace in a new tab.
The trace shows:
- The complete LangGraph execution tree
- Every agent's exact prompt and response
- Token counts and costs broken down by node
- The human interrupt: question asked, time waited, answer given
- Any errors with full stack traces
Job-Level Trace Summary (In-App)¶
A lightweight summary of the LangSmith trace is shown directly in the AgentOps UI (no need to navigate to LangSmith for basic info):
```
┌─────────────────────────────────────────────────────────┐
│ JOB TRACE SUMMARY                                       │
│ Total tokens: 12,450      Estimated cost: $0.043        │
│ Total duration: 2m 7s     Nodes executed: 8             │
│ Slowest agent: codebase_search (18s)                    │
│ [View full trace ↗]                                     │
└─────────────────────────────────────────────────────────┘
```
This data is fetched from the LangSmith API after job completion and cached in the backend DB.
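A minimal sketch of that aggregation, assuming per-node run data has already been fetched from the LangSmith read API (the dict shape below is our assumption, not the API's):

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Collapse per-node runs into the Job Trace Summary fields shown in-app."""
    slowest = max(runs, key=lambda r: r["duration_s"])
    return {
        "total_tokens": sum(r["tokens"] for r in runs),
        "estimated_cost": round(sum(r["cost"] for r in runs), 4),
        # Summing node durations assumes sequential execution; the true
        # wall-clock figure would come from the root run instead.
        "total_duration_s": sum(r["duration_s"] for r in runs),
        "nodes_executed": len(runs),
        "slowest_agent": f'{slowest["name"]} ({slowest["duration_s"]:.0f}s)',
    }
```

The backend would run this once on job completion and cache the result, so the UI never blocks on LangSmith.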
Evaluation methodology — golden dataset, LLM-as-judge setup, scoring rubrics, and CI pipeline — is specified in PRD-010 §Evaluation Framework.
Cost and Latency Monitoring¶
Per-Job Cost Tracking¶
LangSmith automatically tracks token usage per run. The backend aggregates this into per-job cost estimates shown in the AgentOps UI:
| Agent | Avg Tokens/Job | Avg Cost/Job |
|---|---|---|
| Investigator | ~1,200 | ~$0.001 |
| Codebase Search | ~3,500 | ~$0.007 |
| Web Search | ~2,000 | ~$0.002 |
| Critic | ~2,500 | ~$0.005 |
| Writer | ~4,000 | ~$0.012 |
| Supervisor (all hops) | ~2,000 | ~$0.002 |
| Total (typical job) | ~15,200 | ~$0.029 |
Estimates based on GPT-4o-mini at $0.15/1M input tokens, GPT-4o at $2.50/1M input tokens.
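The table's arithmetic can be reproduced with a small pricing helper. This sketch uses input-token pricing only, matching the footnote; real per-job costs would also include output tokens, which is why the table's figures run slightly higher:

```python
# Prices per 1M input tokens, from the footnote above.
PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

def estimate_cost(model: str, tokens: int) -> float:
    """Estimated USD cost for a token count at the given model's input rate."""
    return tokens / 1_000_000 * PRICE_PER_1M[model]
```

For example, the codebase_search node's 3,200-token GPT-4o call works out to $0.008 of input cost.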
Cost Budget Alerts¶
Users can set a per-job cost limit in Settings. If a job's running cost exceeds the limit,
the supervisor is notified and moves toward the writer node to wrap up, rather than spawning more agents.
Correction: In-flight cost cannot be tracked via LangSmith's API — the API only exposes completed run data. The correct approach is to accumulate cost from `on_chat_model_end` events in `astream_events()` using a model pricing table. See PRD-005-1 §5 for the full spec, token extraction code, and the three new `BugTriageState` fields required.
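A hedged sketch of that accumulation over synthetic event dicts; the exact event key paths (`metadata.ls_model_name`, `data.usage.total_tokens`) are assumptions here and should be verified against PRD-005-1 §5:

```python
# Prices per 1M input tokens (same table as the cost estimates above).
PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

def accumulate_cost(events) -> float:
    """Running USD cost from a stream of astream_events-style event dicts."""
    running = 0.0
    for ev in events:
        if ev.get("event") != "on_chat_model_end":
            continue  # only model-call completions carry token usage
        model = ev["metadata"]["ls_model_name"]
        tokens = ev["data"]["usage"]["total_tokens"]
        running += tokens / 1_000_000 * PRICE_PER_1M[model]
    return running
```

In the worker, `running` would be compared against the user's per-job limit after each event, and the supervisor signalled once it is exceeded.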
Latency Dashboard¶
The Analytics page (v1.1) shows rolling 7-day charts from LangSmith data:
- Average job duration (P50, P95)
- Per-agent latency breakdown
- Human wait time (time between question asked and answer received)
- Jobs per day, error rate per day
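The P50/P95 figures can be computed from raw job durations with the standard library; a sketch, assuming durations in seconds pulled from LangSmith data:

```python
from statistics import quantiles

def latency_percentiles(durations: list[float]) -> dict:
    """P50 and P95 over a window of job durations (seconds)."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = quantiles(sorted(durations), n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

The same helper applies per agent for the latency breakdown, and to (answered_at − asked_at) deltas for the human wait time chart.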
Prompt Iteration Workflow¶
The workflow for safely improving agent quality using LangSmith:
1. OBSERVE
   - Identify a failing job in production via LangSmith traces
   - Note which agent produced the bad output and why
2. PROTOTYPE
   - Open the LangFlow canvas for that agent
   - Reproduce the failure with the same input
   - Iterate on the system prompt until the output improves
3. EVALUATE
   - Export the updated prompt to Python
   - Update the agent's LangServe service
   - Deploy to staging
4. RUN EVALS
   - Trigger the eval pipeline: `python scripts/run_evals.py`
   - Check the score vs. baseline on the golden dataset
   - If the score improves (or doesn't regress): proceed
5. DEPLOY
   - Merge to main
   - CI runs the eval one more time as a gate
   - Deploy to the production LangServe endpoint
6. MONITOR
   - The LangSmith production dashboard shows new scores
   - The daily eval confirms the improvement holds over time
Alerting and Anomaly Detection¶
Automated Alerts (v1.1)¶
Implementation note: Automation rules are configured in the LangSmith UI, not in Python code. The backend provides a webhook receiver endpoint (`POST /internal/langsmith-alert`) that LangSmith calls when a rule fires. See PRD-005-1 §8 for rule conditions, LangSmith UI paths, and the webhook receiver implementation.
LangSmith supports rule-based automations that trigger actions when conditions are met:
| Alert | Condition | Action |
|---|---|---|
| Quality degradation | 7-day rolling avg score drops below 3.5 | Slack notification to engineering team |
| High cost job | Single job exceeds $0.20 | Flag job in UI; notify via email |
| Agent error spike | Error rate > 10% in last hour | PagerDuty alert |
| Slow job | Job duration > 5 minutes | Warning badge in UI |
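Although the rules themselves live in the LangSmith UI, the table's thresholds can be expressed as plain predicates, useful for instance as sanity checks in the webhook receiver. The metric field names below are our assumption:

```python
def fired_alerts(m: dict) -> list[str]:
    """Which of the four alert conditions a metrics snapshot satisfies."""
    alerts = []
    if m["rolling_avg_score_7d"] < 3.5:
        alerts.append("quality-degradation")
    if m["job_cost_usd"] > 0.20:
        alerts.append("high-cost-job")
    if m["error_rate_1h"] > 0.10:
        alerts.append("agent-error-spike")
    if m["job_duration_s"] > 300:  # 5 minutes
        alerts.append("slow-job")
    return alerts
```

Keeping this mirror of the rules in code lets the receiver reject webhook payloads that don't actually satisfy the condition they claim.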
Manual Review Queue¶
Any job where:
- The final confidence score is < 0.5, OR
- The user gave a thumbs-down on the output, OR
- An agent errored and was skipped
…is automatically added to a Manual Review Queue in LangSmith for the team to inspect and potentially add to the golden dataset.
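The three conditions above collapse into a single predicate; a sketch with assumed field names for the job record:

```python
def needs_manual_review(job: dict) -> bool:
    """True if a job meets any of the Manual Review Queue conditions."""
    return (
        job["confidence"] < 0.5                        # low final confidence
        or job.get("user_feedback") == "thumbs_down"   # negative user feedback
        or job.get("agent_errors", 0) > 0              # an agent errored and was skipped
    )
```

The backend would evaluate this at job completion (and again on feedback submission, since a thumbs-down can arrive later) before enqueueing the run for review.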