
PRD-005 — Observability & Evaluation: LangSmith

| Field | Value |
|---|---|
| Document ID | PRD-005 |
| Version | 1.0 |
| Status | DRAFT |
| Date | March 2026 |
| Parent Doc | PRD-001 |
| Related Docs | PRD-003 (Orchestration), PRD-004 (Agent Layer) |

Overview

Detailed specs: LangSmith API & Integration Spec — fills 7 implementation gaps in this document (LangSmith read API, in-flight cost tracking, user feedback SDK calls, Manual Review Queue, automation rules, deep-link URL construction, and daily eval score storage).

LangSmith is the observability and evaluation layer of AgentOps Dashboard. It wraps every layer of the stack automatically — LCEL chains inside each agent, LangGraph orchestration decisions, and full end-to-end job runs — and surfaces them in a unified tracing dashboard.

Unlike typical application monitoring (Datadog, New Relic), LangSmith is specifically designed for LLM applications. It understands the concept of tokens, prompts, model calls, tool invocations, and agent reasoning chains — making it the right tool for this use case.

LangSmith is used in three modes in this product:

  1. Development — debug why an agent produced a wrong answer by inspecting its exact prompt and model response
  2. Iteration — run eval datasets after every prompt change to catch regressions before deploying
  3. Production — monitor live job quality, cost per job, latency per agent, and error rates

Why LangSmith

| Need | LangSmith Solution |
|---|---|
| "Why did the investigator agent produce the wrong hypothesis?" | Full trace: exact prompt sent, token-by-token response, structured output parsed |
| "Did my prompt change improve agent quality?" | Dataset eval: run old vs. new prompt on golden dataset, compare scores side by side |
| "How much does one bug triage job cost?" | Automatic token counting and cost calculation per run, per agent, per job |
| "Which agent is the slowest bottleneck?" | Latency breakdown per node in the LangGraph trace |
| "Is agent quality degrading over time?" | Production monitoring with trend charts over rolling 7-day window |
| "Can I A/B test two different critic prompts?" | LangSmith experiments: split traffic or run parallel evals |

LangSmith is framework-agnostic — it instruments LCEL chains (via LangChain), LangGraph nodes (natively), and LangServe endpoints (automatically). Zero extra instrumentation code is required beyond setting the environment variables.


Trace Architecture

Trace Levels

LangSmith captures traces at three levels, all linked in a parent–child hierarchy:

flowchart TD
    JOB["Job Run\njob_id: uuid-1234\ntokens: 12,450 · cost: $0.043 · duration: 127s"]

    SUP1["supervisor [node 1]\nprompt: 840 tokens\ndecision: call investigator"]
    INV["investigator [node 2] — 12s"]
    LCEL_I["LCEL Chain: investigator\ngpt-4o-mini · 1,240 tokens\noutput: {hypothesis, confidence: 0.8}"]

    SUP3["supervisor [node 3]\ndecision: call codebase_search"]
    CS["codebase_search [node 4]"]
    RET["Retriever: VectorStoreRetriever\nquery: JWT token expiry UTC\n→ 3 chunks from auth/middleware.py"]
    LLM_C["LLM: gpt-4o · 3,200 tokens"]

    HI["human_input [node 5]\nq: Focus on auth or DB layer?\na: Auth layer — recent JWT lib change\nwait: 145s"]

    WR["writer [node N]"]
    LCEL_W["LCEL Chain: writer\nparallel: report_chain · comment_chain · ticket_chain"]

    JOB --> SUP1
    JOB --> INV --> LCEL_I
    JOB --> SUP3
    JOB --> CS --> RET
    CS --> LLM_C
    JOB --> HI
    JOB --> WR --> LCEL_W

What Gets Automatically Captured

No manual instrumentation is needed for:

  • All LCEL chain inputs and outputs (LangChain native)
  • All LangGraph node transitions, state diffs, and routing decisions (LangGraph native)
  • All LangServe endpoint calls — token counts, latency, model used (LangServe native)
  • Tool calls (Tavily, Chroma retriever) — query sent, results returned

Manual tagging is added for:

  • job_id — links LangSmith trace to AgentOps job record
  • repository — enables filtering by repo in LangSmith dashboard
  • human_question / human_answer — captured as metadata on the human_input node

Integration Setup

Environment Variables

# Set in all services (orchestration + all LangServe agents)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agentops-dashboard   # separates prod traces from dev
LANGSMITH_ENDPOINT=https://api.smith.langchain.com

No other code changes are required. LangSmith auto-instruments all LangChain and LangGraph calls when LANGSMITH_TRACING=true.
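Because a mistyped or missing variable silently disables tracing rather than raising an error, a small startup sanity check is worthwhile. A minimal sketch — the helper name and the idea of failing fast are our own convention, not part of LangSmith:

```python
import os

# Tracing is opt-in: if any of these is missing, LangSmith silently records nothing.
REQUIRED_VARS = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")

def missing_tracing_vars() -> list[str]:
    """Return the names of required LangSmith variables that are unset or empty."""
    return [v for v in REQUIRED_VARS if not os.getenv(v)]
```

Calling this at service startup and logging (or raising on) a non-empty result catches the most common misconfiguration before any traces are lost.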

Project Separation

| Environment | LangSmith Project | Purpose |
|---|---|---|
| Development | agentops-dev | Local debugging; noisy traces acceptable |
| Staging | agentops-staging | Eval runs against golden dataset |
| Production | agentops-prod | Live monitoring; alerts configured here |

Tagging Runs

Each LangGraph job run is tagged with metadata and tags for filtering in the LangSmith dashboard. Both are set in the combined config dict assembled in the ARQ worker — see Accessing the Trace URL for the full example.


Trace Hierarchy

Accessing the Trace URL

The run_id is generated before graph invocation and passed via config["run_id"]. LangGraph forwards it to LangSmith as the root run ID. Because the ID is pre-assigned, it is stored in BugTriageState.langsmith_run_id and the DB before the graph runs — no callback manager, no traced_runs access, no race condition.

# worker.py — inside run_triage(), before astream_events
import os
import uuid

langsmith_run_id = uuid.uuid4()

config = {
    "configurable": {"thread_id": job_id},
    "run_id": langsmith_run_id,   # LangSmith root run ID; pre-assigned, never read back
    "metadata": {
        "job_id": job_id,
        "repository": initial_state["repository"],
        "issue_url": initial_state["issue_url"],
        "env": os.getenv("ENVIRONMENT", "dev"),
    },
    "tags": ["bug-triage", initial_state["repository"]],
}

# Persist run ID in graph state so checkpointer carries it through the job lifetime
initial_state["langsmith_run_id"] = str(langsmith_run_id)

async for event in graph.astream_events(initial_state, config=config, version="v2"):
    ...

Deep-link format:

https://smith.langchain.com/o/{org_id}/projects/p/{project_id}/r/{run_id}

"View in LangSmith" Button

Every completed or failed job in the AgentOps Dashboard UI shows a "View in LangSmith" button in the Output Panel (PRD-002, Zone 3). This button opens the full job trace in a new tab.

The trace shows:

  • The complete LangGraph execution tree
  • Every agent's exact prompt and response
  • Token counts and costs broken down by node
  • The human interrupt: question asked, time waited, answer given
  • Any errors with full stack traces

Job-Level Trace Summary (In-App)

A lightweight summary of the LangSmith trace is shown directly in the AgentOps UI (no need to navigate to LangSmith for basic info):

┌─────────────────────────────────────────────────────────┐
│  JOB TRACE SUMMARY                                       │
│  Total tokens: 12,450    Estimated cost: $0.043          │
│  Total duration: 2m 7s   Nodes executed: 8               │
│  Slowest agent: codebase_search (18s)                    │
│                              [View full trace ↗]         │
└─────────────────────────────────────────────────────────┘

This data is fetched from the LangSmith API after job completion and cached in the backend DB.
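A sketch of that post-completion fetch. The `read_run` call and the `total_tokens` / `total_cost` fields reflect the `langsmith` SDK as we understand it and should be verified against the installed version; `format_duration` is a local helper:

```python
def format_duration(seconds: int) -> str:
    """Render 127 -> '2m 7s' for the in-app summary card."""
    m, s = divmod(int(seconds), 60)
    return f"{m}m {s}s" if m else f"{s}s"

def fetch_trace_summary(run_id: str) -> dict:
    """Fetch root-run stats after job completion (network call; cache in DB)."""
    from langsmith import Client  # imported lazily so this module loads without the SDK
    run = Client().read_run(run_id)
    duration = (run.end_time - run.start_time).total_seconds()
    return {
        "total_tokens": run.total_tokens,  # field names assumed; check SDK version
        "cost_usd": run.total_cost,
        "duration": format_duration(duration),
    }
```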


Evaluation methodology — golden dataset, LLM-as-judge setup, scoring rubrics, and CI pipeline — is specified in PRD-010 §Evaluation Framework.


Cost and Latency Monitoring

Per-Job Cost Tracking

LangSmith automatically tracks token usage per run. The backend aggregates this into per-job cost estimates shown in the AgentOps UI:

| Agent | Avg Tokens/Job | Avg Cost/Job |
|---|---|---|
| Investigator | ~1,200 | ~$0.001 |
| Codebase Search | ~3,500 | ~$0.007 |
| Web Search | ~2,000 | ~$0.002 |
| Critic | ~2,500 | ~$0.005 |
| Writer | ~4,000 | ~$0.012 |
| Supervisor (all hops) | ~2,000 | ~$0.002 |
| Total (typical job) | ~15,200 | ~$0.029 |

Estimates based on GPT-4o-mini at $0.15/1M input tokens, GPT-4o at $2.50/1M input tokens.
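The cost column follows from a simple pricing lookup. A sketch using the two input-token prices quoted above (output tokens, which are billed at higher rates, are omitted here for brevity, so treat the result as a lower bound):

```python
# USD per 1M *input* tokens, from the note above. Real billing also prices
# output tokens separately, so this is a simplified lower-bound estimate.
PRICE_PER_M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

def estimate_job_cost(tokens_by_model: dict[str, int]) -> float:
    """tokens_by_model maps model name -> input tokens consumed in the job."""
    return sum(PRICE_PER_M_INPUT[model] * n / 1_000_000
               for model, n in tokens_by_model.items())
```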

Cost Budget Alerts

Users can set a per-job cost limit in Settings. If a job's running cost exceeds the limit, the supervisor is notified and moves toward the writer node to wrap up, rather than spawning more agents.

Correction: In-flight cost cannot be tracked via LangSmith's API — the API only exposes completed run data. The correct approach is to accumulate cost from on_chat_model_end events in astream_events() using a model pricing table. See PRD-005-1 §5 for the full spec, token extraction code, and the three new BugTriageState fields required.
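A sketch of that accumulation step. The pricing table and the shape of the `usage_metadata` payload (`input_tokens` / `output_tokens` keys) are assumptions to verify against the installed langchain version; PRD-005-1 §5 remains the authoritative spec:

```python
# (input, output) USD per 1M tokens — illustrative pricing table, not billing truth.
PRICE_PER_M = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}

def add_model_call_cost(running_cost: float, model: str, usage: dict) -> float:
    """Fold one on_chat_model_end usage payload into the job's running cost."""
    in_price, out_price = PRICE_PER_M.get(model, (0.0, 0.0))
    return (running_cost
            + usage.get("input_tokens", 0) * in_price / 1_000_000
            + usage.get("output_tokens", 0) * out_price / 1_000_000)
```

In the `astream_events` loop, this would be called on every `on_chat_model_end` event; once the running total crosses the user's limit, the supervisor is steered toward the writer node.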

Latency Dashboard

The Analytics page (v1.1) shows rolling 7-day charts from LangSmith data:

  • Average job duration (P50, P95)
  • Per-agent latency breakdown
  • Human wait time (time between question asked and answer received)
  • Jobs per day, error rate per day
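P50/P95 can be computed with a nearest-rank percentile over the durations in the 7-day window. A minimal sketch:

```python
import math

def percentile(durations: list[float], p: float) -> float:
    """Nearest-rank percentile: p=50 gives P50, p=95 the tail latency."""
    if not durations:
        raise ValueError("empty window")
    ranked = sorted(durations)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]
```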

Prompt Iteration Workflow

The workflow for safely improving agent quality using LangSmith:

1. OBSERVE
   Identify a failing job in production via LangSmith traces
   Note which agent produced the bad output and why

2. PROTOTYPE
   Open LangFlow canvas for that agent
   Reproduce the failure with the same input
   Iterate on the system prompt until output improves

3. STAGE
   Export updated prompt to Python
   Update the agent's LangServe service
   Deploy to staging

4. RUN EVALS
   Trigger eval pipeline: python scripts/run_evals.py
   Check score vs. baseline on golden dataset
   If score improves (or doesn't regress): proceed

5. DEPLOY
   Merge to main
   CI runs eval one more time as gate
   Deploy to production LangServe endpoint

6. MONITOR
   LangSmith production dashboard shows new scores
   Daily eval confirms improvement holds over time
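Step 4's gate ("improves or doesn't regress") can be expressed as a tiny check that both the local script and the CI gate in step 5 reuse. A sketch; the metric names and the tolerance value are illustrative:

```python
def passes_eval_gate(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.05) -> bool:
    """True if every baseline metric is matched within `tolerance` by the candidate."""
    return all(candidate.get(metric, 0.0) >= score - tolerance
               for metric, score in baseline.items())
```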

Alerting and Anomaly Detection

Automated Alerts (v1.1)

Implementation note: Automation rules are configured in the LangSmith UI, not in Python code. The backend provides a webhook receiver endpoint (POST /internal/langsmith-alert) that LangSmith calls when a rule fires. See PRD-005-1 §8 for rule conditions, LangSmith UI paths, and the webhook receiver implementation.

LangSmith supports rule-based automations that trigger actions when conditions are met:

| Alert | Condition | Action |
|---|---|---|
| Quality degradation | 7-day rolling avg score drops below 3.5 | Slack notification to engineering team |
| High cost job | Single job exceeds $0.20 | Flag job in UI; notify via email |
| Agent error spike | Error rate > 10% in last hour | PagerDuty alert |
| Slow job | Job duration > 5 minutes | Warning badge in UI |
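The core of the webhook receiver is a lookup from the fired rule to the action above. A sketch; the `rule_name` payload field and the channel identifiers are assumptions (PRD-005-1 §8 is authoritative):

```python
# Mirrors the alert table; keys and channel names are illustrative identifiers.
ALERT_ACTIONS = {
    "quality_degradation": "slack",
    "high_cost_job": "email",
    "agent_error_spike": "pagerduty",
    "slow_job": "ui_badge",
}

def route_alert(payload: dict) -> str:
    """Map a LangSmith webhook payload to a notification channel; unknown rules
    fall back to logging so a renamed rule never drops an alert silently."""
    return ALERT_ACTIONS.get(payload.get("rule_name", ""), "log_only")
```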

Manual Review Queue

Any job where:

  • The final confidence score is < 0.5, OR
  • The user gave a thumbs-down on the output, OR
  • An agent errored and was skipped

…is automatically added to a Manual Review Queue in LangSmith for the team to inspect and potentially add to the golden dataset.
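The three OR'd criteria translate directly into a predicate the backend can run on job completion (a sketch; the parameter names are ours):

```python
def needs_manual_review(confidence: float, thumbs_down: bool,
                        agent_errored: bool) -> bool:
    """True if the job should enter the Manual Review Queue."""
    return confidence < 0.5 or thumbs_down or agent_errored
```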