PRD-005 — Observability & Evaluation: LangSmith¶
| Field | Value |
|---|---|
| Document ID | PRD-005 |
| Version | 1.0 |
| Status | DRAFT |
| Date | March 2026 |
| Parent Doc | PRD-001 |
| Related Docs | PRD-003 (Orchestration), PRD-004 (Agent Layer) |
Overview¶
Detailed specs: LangSmith API & Integration Spec — fills 7 implementation gaps in this document (LangSmith read API, in-flight cost tracking, user feedback SDK calls, Manual Review Queue, automation rules, deep-link URL construction, and daily eval score storage).
LangSmith is the observability and evaluation layer of AgentOps Dashboard. It wraps every layer of the stack automatically — LCEL chains inside each agent, LangGraph orchestration decisions, and full end-to-end job runs — and surfaces them in a unified tracing dashboard.
Unlike typical application monitoring (Datadog, New Relic), LangSmith is specifically designed for LLM applications. It understands the concept of tokens, prompts, model calls, tool invocations, and agent reasoning chains — making it the right tool for this use case.
LangSmith is used in three modes in this product:
- Development — debug why an agent produced a wrong answer by inspecting its exact prompt and model response
- Iteration — run eval datasets after every prompt change to catch regressions before deploying
- Production — monitor live job quality, cost per job, latency per agent, and error rates
Why LangSmith¶
| Need | LangSmith Solution |
|---|---|
| "Why did the investigator agent produce the wrong hypothesis?" | Full trace: exact prompt sent, token-by-token response, structured output parsed |
| "Did my prompt change improve agent quality?" | Dataset eval: run old vs. new prompt on golden dataset, compare scores side by side |
| "How much does one bug triage job cost?" | Automatic token counting and cost calculation per run, per agent, per job |
| "Which agent is the slowest bottleneck?" | Latency breakdown per node in the LangGraph trace |
| "Is agent quality degrading over time?" | Production monitoring with trend charts over rolling 7-day window |
| "Can I A/B test two different critic prompts?" | LangSmith experiments: split traffic or run parallel evals |
LangSmith is framework-agnostic — it instruments LCEL chains (via LangChain), LangGraph nodes (natively), and LangServe endpoints (automatically). Zero extra instrumentation code is required beyond setting the environment variables.
Trace Architecture¶
Trace Levels¶
LangSmith captures traces at three levels, all linked in a parent–child hierarchy:
```mermaid
flowchart TD
    JOB["Job Run\njob_id: uuid-1234\ntokens: 12,450 · cost: $0.043 · duration: 127s"]
    SUP1["supervisor [node 1]\nprompt: 840 tokens\ndecision: call investigator"]
    INV["investigator [node 2] — 12s"]
    LCEL_I["LCEL Chain: investigator\ngpt-4o-mini · 1,240 tokens\noutput: {hypothesis, confidence: 0.8}"]
    SUP3["supervisor [node 3]\ndecision: call codebase_search"]
    CS["codebase_search [node 4]"]
    RET["Retriever: VectorStoreRetriever\nquery: JWT token expiry UTC\n→ 3 chunks from auth/middleware.py"]
    LLM_C["LLM: gpt-4o · 3,200 tokens"]
    HI["human_input [node 5]\nq: Focus on auth or DB layer?\na: Auth layer — recent JWT lib change\nwait: 145s"]
    WR["writer [node N]"]
    LCEL_W["LCEL Chain: writer\nparallel: report_chain · comment_chain · ticket_chain"]
    JOB --> SUP1
    JOB --> INV --> LCEL_I
    JOB --> SUP3
    JOB --> CS --> RET
    CS --> LLM_C
    JOB --> HI
    JOB --> WR --> LCEL_W
```
What Gets Automatically Captured¶
No manual instrumentation is needed for:
- All LCEL chain inputs and outputs (LangChain native)
- All LangGraph node transitions, state diffs, and routing decisions (LangGraph native)
- All LangServe endpoint calls — token counts, latency, model used (LangServe native)
- Tool calls (Tavily, Chroma retriever) — query sent, results returned
Manual tagging is added for:
- `job_id` — links the LangSmith trace to the AgentOps job record
- `repository` — enables filtering by repo in the LangSmith dashboard
- `human_question` / `human_answer` — captured as metadata on the `human_input` node
Integration Setup¶
Environment Variables¶
```bash
# Set in all services (orchestration + all LangServe agents)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agentops-dashboard  # separates prod traces from dev
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
```
No other code changes are required. LangSmith auto-instruments all LangChain and LangGraph calls when `LANGSMITH_TRACING=true`.
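Because tracing degrades silently when a variable is missing, a startup sanity check is cheap insurance. The sketch below is our own addition; only the variable names come from this document:

```python
import os

# Variables every service must export for LangSmith tracing to work.
REQUIRED = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")

def missing_tracing_vars(env=os.environ) -> list[str]:
    """Return the names of required LangSmith variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

A service can call this at boot and refuse to start (or log loudly) if the returned list is non-empty.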
Project Separation¶
| Environment | LangSmith Project | Purpose |
|---|---|---|
| Development | `agentops-dev` | Local debugging; noisy traces acceptable |
| Staging | `agentops-staging` | Eval runs against golden dataset |
| Production | `agentops-prod` | Live monitoring; alerts configured here |
Tagging Runs¶
Each LangGraph job run is tagged with metadata and tags for filtering in the
LangSmith dashboard. Both are set in the combined config dict assembled in the
ARQ worker — see Accessing the Trace URL for the
full example.
Accessing the Trace URL¶
The run_id is generated before graph invocation and passed via config["run_id"].
LangGraph forwards it to LangSmith as the root run ID. Because the ID is pre-assigned,
it is stored in BugTriageState.langsmith_run_id and the DB before the graph runs —
no callback manager, no traced_runs access, no race condition.
```python
# worker.py — inside run_triage(), before astream_events
import os
import uuid

langsmith_run_id = uuid.uuid4()
config = {
    "configurable": {"thread_id": job_id},
    "run_id": langsmith_run_id,  # LangSmith root run ID; pre-assigned, never read back
    "metadata": {
        "job_id": job_id,
        "repository": initial_state["repository"],
        "issue_url": initial_state["issue_url"],
        "env": os.getenv("ENVIRONMENT", "dev"),
    },
    "tags": ["bug-triage", initial_state["repository"]],
}

# Persist the run ID in graph state so the checkpointer carries it through the job lifetime
initial_state["langsmith_run_id"] = str(langsmith_run_id)

async for event in graph.astream_events(initial_state, config=config, version="v2"):
    ...
```
Deep-link format:
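The authoritative URL construction is specified in PRD-005-1. As a non-authoritative sketch, assuming the common LangSmith trace URL shape (`/o/<org>/projects/p/<project>/r/<run_id>`), with `org_id` and `project_id` as placeholder parameters not defined in this document:

```python
def trace_url(org_id: str, project_id: str, run_id: str) -> str:
    """Build a LangSmith deep link for a root run (URL shape is an assumption)."""
    base = "https://smith.langchain.com"
    return f"{base}/o/{org_id}/projects/p/{project_id}/r/{run_id}"
```

Because the run ID is pre-assigned (see above), this link can be stored alongside the job record before the graph finishes.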
UI Integration — LangSmith Deep Links¶
"View in LangSmith" Button¶
Every completed or failed job in the AgentOps Dashboard UI shows a "View in LangSmith" button in the Output Panel (PRD-002, Zone 3). This button opens the full job trace in a new tab.
The trace shows:
- The complete LangGraph execution tree
- Every agent's exact prompt and response
- Token counts and costs broken down by node
- The human interrupt: question asked, time waited, answer given
- Any errors with full stack traces
Job-Level Trace Summary (In-App)¶
A lightweight summary of the LangSmith trace is shown directly in the AgentOps UI (no need to navigate to LangSmith for basic info):
```
┌─────────────────────────────────────────────────────────┐
│ JOB TRACE SUMMARY                                       │
│ Total tokens: 12,450      Estimated cost: $0.043        │
│ Total duration: 2m 7s     Nodes executed: 8             │
│ Slowest agent: codebase_search (18s)                    │
│ [View full trace ↗]                                     │
└─────────────────────────────────────────────────────────┘
```
This data is fetched from the LangSmith API after job completion and cached in the backend DB.
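A minimal sketch of that aggregation, assuming per-node run data has already been fetched from the LangSmith read API (the dict shape below is our assumption, not the API's):

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Collapse per-node runs into the Job Trace Summary fields shown in-app."""
    slowest = max(runs, key=lambda r: r["duration_s"])
    return {
        "total_tokens": sum(r["tokens"] for r in runs),
        "estimated_cost": round(sum(r["cost"] for r in runs), 4),
        # Summing node durations assumes sequential execution; the true
        # wall-clock figure would come from the root run instead.
        "total_duration_s": sum(r["duration_s"] for r in runs),
        "nodes_executed": len(runs),
        "slowest_agent": f'{slowest["name"]} ({slowest["duration_s"]:.0f}s)',
    }
```

The backend would run this once on job completion and cache the result, so the UI never blocks on LangSmith.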
Evaluation methodology — golden dataset, LLM-as-judge setup, scoring rubrics, and CI pipeline — is specified in PRD-010 §Evaluation Framework.
Cost and Latency Monitoring¶
Per-Job Cost Tracking¶
LangSmith automatically tracks token usage per run. The backend aggregates this into per-job cost estimates shown in the AgentOps UI:
| Agent | Avg Tokens/Job | Avg Cost/Job |
|---|---|---|
| Investigator | ~1,200 | ~$0.001 |
| Codebase Search | ~3,500 | ~$0.007 |
| Web Search | ~2,000 | ~$0.002 |
| Critic | ~2,500 | ~$0.005 |
| Writer | ~4,000 | ~$0.012 |
| Supervisor (all hops) | ~2,000 | ~$0.002 |
| Total (typical job) | ~15,200 | ~$0.029 |
Estimates based on GPT-4o-mini at $0.15/1M input tokens, GPT-4o at $2.50/1M input tokens.
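The table's arithmetic can be reproduced with a small pricing helper. This sketch uses input-token pricing only, matching the footnote; real per-job costs would also include output tokens, which is why the table's figures run slightly higher:

```python
# Prices per 1M input tokens, from the footnote above.
PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

def estimate_cost(model: str, tokens: int) -> float:
    """Estimated USD cost for a token count at the given model's input rate."""
    return tokens / 1_000_000 * PRICE_PER_1M[model]
```

For example, the codebase_search node's 3,200-token GPT-4o call works out to $0.008 of input cost.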
Cost Budget Alerts¶
Users can set a per-job cost limit in Settings. If a job's running cost exceeds the limit,
the supervisor is notified and moves toward the writer node to wrap up, rather than spawning more agents.
Correction: In-flight cost cannot be tracked via LangSmith's API — the API only exposes completed run data. The correct approach is to accumulate cost from `on_chat_model_end` events in `astream_events()` using a model pricing table. See PRD-005-1 §5 for the full spec, token extraction code, and the three new `BugTriageState` fields required.
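A hedged sketch of that accumulation over synthetic event dicts; the exact event key paths (`metadata.ls_model_name`, `data.usage.total_tokens`) are assumptions here and should be verified against PRD-005-1 §5:

```python
# Prices per 1M input tokens (same table as the cost estimates above).
PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

def accumulate_cost(events) -> float:
    """Running USD cost from a stream of astream_events-style event dicts."""
    running = 0.0
    for ev in events:
        if ev.get("event") != "on_chat_model_end":
            continue  # only model-call completions carry token usage
        model = ev["metadata"]["ls_model_name"]
        tokens = ev["data"]["usage"]["total_tokens"]
        running += tokens / 1_000_000 * PRICE_PER_1M[model]
    return running
```

In the worker, `running` would be compared against the user's per-job limit after each event, and the supervisor signalled once it is exceeded.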
Latency Dashboard¶
The Analytics page (v1.1) shows rolling 7-day charts from LangSmith data:
- Average job duration (P50, P95)
- Per-agent latency breakdown
- Human wait time (time between question asked and answer received)
- Jobs per day, error rate per day
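The P50/P95 figures can be computed from raw job durations with the standard library; a sketch, assuming durations in seconds pulled from LangSmith data:

```python
from statistics import quantiles

def latency_percentiles(durations: list[float]) -> dict:
    """P50 and P95 over a window of job durations (seconds)."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = quantiles(sorted(durations), n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

The same helper applies per agent for the latency breakdown, and to (answered_at − asked_at) deltas for the human wait time chart.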
Prompt Iteration Workflow¶
The workflow for safely improving agent quality using LangSmith:
1. OBSERVE
   - Identify a failing job in production via LangSmith traces
   - Note which agent produced the bad output and why
2. PROTOTYPE
   - Open the LangFlow canvas for that agent
   - Reproduce the failure with the same input
   - Iterate on the system prompt until the output improves
3. EVALUATE
   - Export the updated prompt to Python
   - Update the agent's LangServe service
   - Deploy to staging
4. RUN EVALS
   - Trigger the eval pipeline: `python scripts/run_evals.py`
   - Check the score vs. baseline on the golden dataset
   - If the score improves (or doesn't regress): proceed
5. DEPLOY
   - Merge to main
   - CI runs the eval one more time as a gate
   - Deploy to the production LangServe endpoint
6. MONITOR
   - The LangSmith production dashboard shows new scores
   - The daily eval confirms the improvement holds over time
Alerting and Anomaly Detection¶
Automated Alerts (v1.1)¶
Implementation note: Automation rules are configured in the LangSmith UI, not in Python code. The backend provides a webhook receiver endpoint (`POST /internal/langsmith-alert`) that LangSmith calls when a rule fires. See PRD-005-1 §8 for rule conditions, LangSmith UI paths, and the webhook receiver implementation.
LangSmith supports rule-based automations that trigger actions when conditions are met:
| Alert | Condition | Action |
|---|---|---|
| Quality degradation | 7-day rolling avg score drops below 3.5 | Slack notification to engineering team |
| High cost job | Single job exceeds $0.20 | Flag job in UI; notify via email |
| Agent error spike | Error rate > 10% in last hour | PagerDuty alert |
| Slow job | Job duration > 5 minutes | Warning badge in UI |
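Although the rules themselves live in the LangSmith UI, the table's thresholds can be expressed as plain predicates, useful for instance as sanity checks in the webhook receiver. The metric field names below are our assumption:

```python
def fired_alerts(m: dict) -> list[str]:
    """Which of the four alert conditions a metrics snapshot satisfies."""
    alerts = []
    if m["rolling_avg_score_7d"] < 3.5:
        alerts.append("quality-degradation")
    if m["job_cost_usd"] > 0.20:
        alerts.append("high-cost-job")
    if m["error_rate_1h"] > 0.10:
        alerts.append("agent-error-spike")
    if m["job_duration_s"] > 300:  # 5 minutes
        alerts.append("slow-job")
    return alerts
```

Keeping this mirror of the rules in code lets the receiver reject webhook payloads that don't actually satisfy the condition they claim.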
Manual Review Queue¶
Any job where:
- The final confidence score is < 0.5, OR
- The user gave a thumbs-down on the output, OR
- An agent errored and was skipped
…is automatically added to a Manual Review Queue in LangSmith for the team to inspect and potentially add to the golden dataset.
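The three conditions above collapse into a single predicate; a sketch with assumed field names for the job record:

```python
def needs_manual_review(job: dict) -> bool:
    """True if a job meets any of the Manual Review Queue conditions."""
    return (
        job["confidence"] < 0.5                        # low final confidence
        or job.get("user_feedback") == "thumbs_down"   # negative user feedback
        or job.get("agent_errors", 0) > 0              # an agent errored and was skipped
    )
```

The backend would evaluate this at job completion (and again on feedback submission, since a thumbs-down can arrive later) before enqueueing the run for review.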