LangSmith: Trace, Debug, Evaluate, and Monitor LLM Applications
Your LangChain pipeline works on three test inputs. You ship it. Then a user reports nonsense output. You stare at the chain definition and have no idea which step broke, what the intermediate values were, or whether the prompt even received the right context.
I have been in this exact situation more times than I care to admit. LangSmith changed everything — it records every single step of every single run, so instead of guessing, you replay the failure and see exactly what happened.
What Is LangSmith and Why Does It Matter?
I think of it as the "print-statement debugger" you wish you had for LLM apps, except it works in production and keeps a permanent record. Every prompt that went in, every completion that came back, every tool call your agent made — all captured automatically.
The platform has four pillars that map directly to the lifecycle of an LLM application:
Without LangSmith:

```
# Your chain fails in production
# You add print() statements everywhere
# You try to reproduce locally
# You still can't find the bug
# Hours wasted
```

With LangSmith:

```
# Your chain fails in production
# Open LangSmith dashboard
# Click the red (errored) trace
# See exact input that caused failure
# Fix in 30 seconds
```

The four pillars:

- Tracing: capture every prompt, completion, and tool call of every run automatically
- Debugging: replay failures and inspect the exact inputs that caused them
- Evaluation: score your chain systematically against datasets of expected outputs
- Monitoring: track quality, latency, and errors on real production traffic
Setup and Your First Trace
The setup surprised me when I first tried LangSmith — it is genuinely just two environment variables. LangSmith tracing activates the moment LangChain sees LANGCHAIN_TRACING_V2 set to "true". No code changes to your existing chains.
Here is a simple chain — run it and then check your LangSmith dashboard to see the trace appear.
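A minimal version of what I mean (a sketch assuming `langchain-openai` is installed and an `OPENAI_API_KEY` is set; the model name and prompt are my own placeholders):

```python
import os

# These two variables are all LangSmith needs -- no code changes to the chain
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."  # your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"  # optional project name

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Answer concisely: {question}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# Every step of this call is traced automatically
print(chain.invoke({"question": "What is a Python generator?"}))
```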
Open smith.langchain.com and navigate to your project. You will see a new trace with the full execution tree: the prompt template rendering, the LLM call with exact input/output tokens, and the output parser step. Each node shows its latency, so you immediately know where time is spent.
Understanding Traces — The Execution Tree
A trace is a tree of "runs." The root run is your top-level chain invocation. Each step inside the chain — prompt formatting, LLM call, output parsing, tool invocation — becomes a child run. Agent sub-steps become grandchild runs.
This chain generates a quiz question and then evaluates it — two LLM calls that create a deeper trace tree.
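A sketch of that setup (the prompt wording and topic are placeholders of mine; both sub-chains follow the same ChatPromptTemplate → ChatOpenAI → StrOutputParser shape described below):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")
parser = StrOutputParser()

# Chain 1: generate a quiz question
generate = (
    ChatPromptTemplate.from_template("Write one quiz question about {topic}.")
    | llm
    | parser
)

# Chain 2: evaluate the generated question
grade = (
    ChatPromptTemplate.from_template(
        "Rate this quiz question from 1 to 10 and explain briefly:\n{question}"
    )
    | llm
    | parser
)

question = generate.invoke({"topic": "Python decorators"})  # trace 1
verdict = grade.invoke({"question": question})              # trace 2
```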
In LangSmith, this shows up as two separate root traces (one per .invoke() call). Each root has three child nodes: ChatPromptTemplate → ChatOpenAI → StrOutputParser. Click any node to see its exact input and output.
Tracing Non-LangChain Code with @traceable
Not everything in your pipeline is a LangChain runnable. I have a validation function, a post-processing step, and a custom retriever in most of my projects — all plain Python. The @traceable decorator from the langsmith package wraps any function and includes it in the trace tree.
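Here is roughly what that looks like (a sketch: the function bodies are my own illustrations, and `chain` is assumed to be a LangChain chain like the one from the earlier example):

```python
from langsmith import traceable

@traceable
def validate_input(question: str) -> str:
    # Plain Python validation -- still shows up in the trace tree
    if not question.strip():
        raise ValueError("Empty question")
    return question.strip()

@traceable
def format_response(answer: str) -> str:
    return answer.strip()

@traceable
def python_qa_pipeline(question: str) -> str:
    cleaned = validate_input(question)
    answer = chain.invoke({"question": cleaned})  # the LangChain chain from earlier
    return format_response(answer)
```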
In LangSmith, the trace tree now shows python_qa_pipeline as the root, with validate_input, the LangChain chain (and its sub-steps), and format_response as children — all in one unified trace. If validate_input raises a ValueError, you see the exception directly in the trace without any log-hunting.
Debugging Failing Chains
This is where LangSmith earns its keep. When a chain fails in production, you do not need to reproduce the bug locally. Open the trace, find the red (errored) node, and read the exact input that caused the failure.
This chain deliberately fails on certain inputs, so we can see what the error trace looks like.
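A sketch of such a chain (prompt wording is a placeholder; the failure mode is real, since models often wrap JSON in code fences):

```python
import json

from langsmith import traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

@traceable
def parse_json_output(text: str) -> dict:
    # Raises json.JSONDecodeError when the model wraps its JSON in ```json fences
    return json.loads(text)

prompt = ChatPromptTemplate.from_template(
    "Return a JSON object with keys 'term' and 'definition' for: {term}"
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

@traceable
def define_term(term: str) -> dict:
    raw = chain.invoke({"term": term})
    return parse_json_output(raw)  # the red node in the trace when parsing fails
```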
When the LLM wraps its JSON in markdown code fences (a common failure mode), json.loads() fails. In LangSmith, you see the parse_json_output node highlighted in red with the full exception traceback. More importantly, you can click the parent LLM node and read the exact output that caused the parsing failure.
Comparing Successful and Failed Runs
LangSmith lets you select two runs and compare them side by side. I use this constantly — find one successful trace and one failed trace for the same chain, compare them, and the difference is usually obvious.
Evaluation Datasets — Systematic Testing for LLM Apps
Spot-checking a few inputs is how most people test their LLM apps. It works until it doesn’t. I shipped a chain that passed my five manual tests, then discovered it hallucinated on a completely normal input the next day.
Evaluation datasets bring the discipline of unit testing to LLM development. Define a set of inputs with expected outputs, run your chain against all of them, and measure how well it performs.
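Creating a dataset from code looks roughly like this (a sketch using the `langsmith` client; the dataset name and examples are my own, and the exact `create_examples` signature varies somewhat across SDK versions):

```python
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="python-qa-eval",
    description="Q&A pairs for the Python docs assistant",
)

qa_pairs = [
    ("What is a list comprehension?",
     "A concise syntax for building a list from an iterable."),
    ("What does the yield keyword do?",
     "It turns a function into a generator that produces values lazily."),
]
client.create_examples(
    inputs=[{"question": q} for q, _ in qa_pairs],
    outputs=[{"answer": a} for _, a in qa_pairs],
    dataset_id=dataset.id,
)
```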
The dataset now lives in your LangSmith account. You can also add examples through the UI, or convert production traces into evaluation examples with one click — which is a fantastic way to build your test set from real user queries.
Running Evaluations with Custom and Built-in Evaluators
A dataset alone is just data. The power comes from evaluators — functions that score each chain output against the expected output. LangSmith supports two flavors: custom Python evaluators (fast, deterministic) and LLM-as-Judge evaluators (nuanced, costlier).
Custom Python Evaluator
The simplest evaluator is a plain function. It receives the run output and the reference (expected) output, and returns a score.
LLM-as-Judge Evaluator
Keyword overlap is crude. For nuanced quality assessment, use an LLM to judge whether the answer is correct, complete, and well-written. The langsmith and langchain packages provide built-in evaluator classes for this.
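A hand-rolled judge is also only a few lines. This is a sketch of the pattern rather than the built-in classes: I ask the model for a bare 0–10 score and parse it, and I assume the same `"output"`/`"answer"` keys as before.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

judge_prompt = ChatPromptTemplate.from_template(
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Submitted answer: {prediction}\n"
    "Reply with only a number from 0 to 10."
)
judge = judge_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0)

def correctness_judge(run, example) -> dict:
    result = judge.invoke({
        "question": example.inputs["question"],
        "reference": example.outputs["answer"],
        "prediction": run.outputs["output"],
    })
    score = float(result.content.strip()) / 10  # normalize to 0.0-1.0
    return {"key": "correctness", "score": score}
```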
With evaluators defined, we run the evaluation against our dataset. LangSmith orchestrates the whole process: it feeds each dataset example to your chain, collects the outputs, runs every evaluator, and aggregates the scores.
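A sketch of the run itself, assuming a chain and the two evaluators from the previous examples (in older SDK versions the import path is `langsmith.evaluation.evaluate`):

```python
from langsmith import evaluate

def target(inputs: dict) -> dict:
    # Adapt the chain to the dataset's input/output schema
    return {"output": chain.invoke({"question": inputs["question"]})}

results = evaluate(
    target,
    data="python-qa-eval",  # the dataset created earlier
    evaluators=[keyword_overlap_evaluator, correctness_judge],
    experiment_prefix="qa-baseline",
)
```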
The results page in LangSmith shows a table with each example, the chain output, and all evaluator scores side by side. You can sort by score to quickly find the weakest answers, then drill into the trace to understand what went wrong.
Write a function length_evaluator(predicted, reference) that scores how close the predicted answer's length is to the reference answer's length. Return 1.0 if they are the same length, and reduce the score proportionally as they diverge. Use the formula: 1 - abs(len(predicted) - len(reference)) / max(len(predicted), len(reference)). If both strings are empty, return 1.0.
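One possible solution, following the formula as stated:

```python
def length_evaluator(predicted: str, reference: str) -> float:
    """Score 1.0 for equal lengths, decreasing proportionally as they diverge."""
    if not predicted and not reference:
        return 1.0  # both empty: identical by definition
    longest = max(len(predicted), len(reference))
    return 1 - abs(len(predicted) - len(reference)) / longest
```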
Production Monitoring — Keeping Your LLM App Healthy
Evaluation tells you how your chain performs on a fixed test set. Monitoring tells you how it performs on real traffic — and real traffic always surprises you. LangSmith captures every production run and lets you build dashboards, set up alerts, and track trends.
Setting Up Production Tracing
The first thing I set up for any production deployment: a separate LangSmith project with explicit user and session tracking.
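In code, that looks roughly like this (the project name and wrapper function are my own; attaching metadata and tags via the `config` argument is standard LangChain behavior):

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-app-production"  # separate prod project

def answer_question(question: str, user_id: str, session_id: str) -> str:
    # metadata and tags are attached to the trace for later filtering
    return chain.invoke(
        {"question": question},
        config={
            "metadata": {"user_id": user_id, "session_id": session_id},
            "tags": ["production"],
        },
    )
```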
With user_id and session_id in the metadata, you can filter traces in LangSmith by user or session. When a user reports an issue, you search for their ID and see every interaction they had.
Online Evaluation with Automation Rules
LangSmith supports automation rules that run evaluators on production traces automatically. Configure a rule in the UI that triggers an LLM-as-Judge evaluator on every incoming trace (or a sample). Continuous quality monitoring, zero additional code.
Annotation Queues — Human Feedback at Scale
Automated evaluators are useful, but some quality judgments need human eyes. LangSmith’s annotation queues let you route traces to human reviewers who label them directly in the UI. Reviewers see the input, the chain output, and any evaluator scores, then add their own judgment.
This is where monitoring connects back to evaluation. Human-reviewed traces with corrected outputs become new examples in your evaluation dataset, closing the feedback loop.
The LangSmith SDK — Programmatic Access to Traces
Everything you can do in the LangSmith UI, you can do programmatically with the langsmith Python client. I use this for nightly quality reports, CI/CD gate checks before deployment, and custom Slack alerts when error rates spike.
You can filter runs by tags, metadata, error status, latency range, and more. Nightly quality reports and CI/CD gate checks become straightforward.
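For instance, pulling yesterday's errored production runs (a sketch; the project name is a placeholder):

```python
from datetime import datetime, timedelta

from langsmith import Client

client = Client()

# All errored runs in the last 24 hours for the production project
failed_runs = client.list_runs(
    project_name="my-app-production",
    start_time=datetime.now() - timedelta(days=1),
    error=True,
)
for run in failed_runs:
    print(run.id, run.name, run.error)
```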
Write a function compute_error_rate(runs) that takes a list of run dictionaries (each with a "status" key that is either "success" or "error") and returns the error rate as a float between 0.0 and 1.0, rounded to 2 decimal places. If the list is empty, return 0.0.
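One possible solution:

```python
def compute_error_rate(runs: list[dict]) -> float:
    """Fraction of runs whose status is "error", rounded to 2 decimal places."""
    if not runs:
        return 0.0
    errors = sum(1 for run in runs if run["status"] == "error")
    return round(errors / len(runs), 2)
```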
Real-World Example: End-to-End LLM App with LangSmith
Time for a complete example. We will build a Python documentation Q&A pipeline, trace it with LangSmith, create an evaluation dataset, and run evaluations — the full workflow in one script.
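The pipeline itself is short (a sketch; the system prompt and example questions are mine, and tracing is assumed enabled via the environment variables from the setup section):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import traceable

chain = (
    ChatPromptTemplate.from_template(
        "You answer questions about Python's documentation. Q: {question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def docs_qa(question: str) -> str:
    return chain.invoke({"question": question})

for q in [
    "What does functools.lru_cache do?",
    "When should I use dataclasses?",
    "How does pathlib differ from os.path?",
]:
    print(docs_qa(q))  # each call produces one traced run
```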
All three calls are now visible as traced runs in LangSmith. The next step is to formalize these into an evaluation dataset and measure quality.
Common Mistakes and How to Fix Them
I have seen each of these trip up teams adopting LangSmith. They are all easy to fix once you know about them.
Without tracing enabled:

```python
import os

os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
# Missing LANGCHAIN_TRACING_V2!

# This chain runs fine but sends NO traces
chain.invoke({"question": "What is Python?"})
```

With tracing enabled:

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"  # This enables tracing
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."

# Now traces are captured
chain.invoke({"question": "What is Python?"})
```

The most common mistake is simply forgetting to set LANGCHAIN_TRACING_V2. The chain works perfectly without it — you just get no traces, and you do not realize it until you need them.
Wrong:

```python
# Everything goes to one project
os.environ["LANGCHAIN_PROJECT"] = "my-app"
```

Right:

```python
# Dev traces and prod traces stay separate
env = os.getenv("APP_ENV", "development")
os.environ["LANGCHAIN_PROJECT"] = f"my-app-{env}"
```

Mixing dev and production traces makes filtering painful. Your production monitoring dashboards will include test runs, and your evaluations will accidentally use production data. Always separate by environment.
Another mistake I see frequently: shipping without user_id or session_id in the trace metadata. Without them, you cannot trace a user complaint back to a specific run.

Write a function filter_traces(traces, key, value) that takes a list of trace dictionaries (each with a "metadata" dict) and returns only the traces where metadata[key] equals value. If a trace does not have the specified key in its metadata, skip it.
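One possible solution:

```python
def filter_traces(traces: list[dict], key: str, value) -> list[dict]:
    """Keep only traces whose metadata contains key with the given value."""
    return [
        trace
        for trace in traces
        if key in trace.get("metadata", {}) and trace["metadata"][key] == value
    ]
```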
Pricing, Limits, and Best Practices
LangSmith offers a free tier that is generous enough for development and small-scale production. The free tier includes a monthly trace allowance with limited retention. Paid tiers remove volume limits and extend data retention.
These are the best practices I have settled on after using LangSmith across multiple production applications:
Complete Code
Here is a single script that demonstrates the core LangSmith workflow: setup, tracing, custom functions, evaluation dataset creation, and running evaluations. Replace the API keys with your own before running.
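A condensed version of that workflow (a sketch: model names, dataset contents, and prompt wording are placeholders of mine, and `create_examples`/`evaluate` signatures vary somewhat across langsmith SDK versions):

```python
"""End-to-end LangSmith workflow: tracing, custom functions, dataset, evaluation."""
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."   # LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"
os.environ["OPENAI_API_KEY"] = "sk-..."           # OpenAI API key

from langsmith import Client, evaluate, traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1. A traced chain
chain = (
    ChatPromptTemplate.from_template("Answer concisely: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

# 2. A custom function included in the trace tree
@traceable
def qa_pipeline(question: str) -> str:
    return chain.invoke({"question": question}).strip()

print(qa_pipeline("What is a Python generator?"))

# 3. An evaluation dataset
client = Client()
dataset = client.create_dataset(dataset_name="qa-smoke-test")
client.create_examples(
    inputs=[{"question": "What is a list comprehension?"}],
    outputs=[{"answer": "A concise syntax for building lists from iterables."}],
    dataset_id=dataset.id,
)

# 4. A simple evaluator plus an evaluation run
def keyword_overlap(run, example) -> dict:
    pred = set(run.outputs["output"].lower().split())
    ref = set(example.outputs["answer"].lower().split())
    return {"key": "keyword_overlap", "score": len(pred & ref) / len(ref)}

evaluate(
    lambda inputs: {"output": qa_pipeline(inputs["question"])},
    data="qa-smoke-test",
    evaluators=[keyword_overlap],
    experiment_prefix="smoke",
)
```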
LangSmith vs Alternatives — When to Use What
LangSmith is not the only LLM observability platform. I have tried several alternatives, and the choice depends on your stack and priorities.
| Platform | Best For | LangChain Integration | Self-Hosted | Open Source |
|---|---|---|---|---|
| LangSmith | LangChain/LangGraph apps, teams already in the LangChain ecosystem | Native (automatic) | Enterprise plan | No |
| Arize Phoenix | Framework-agnostic tracing, OpenTelemetry-native teams | Via OpenInference | Yes (free) | Yes |
| Weights & Biases Weave | Teams already using W&B for ML experiment tracking | Manual integration | No | Partially |
| Helicone | Simple request logging, cost tracking, rate limiting | Proxy-based | Yes | Yes |
| OpenTelemetry + Jaeger | Teams with existing OTel infrastructure | Manual instrumentation | Yes | Yes |
LangSmith Fetch — Debug Agents from Your Terminal
LangSmith Fetch is a newer CLI tool that brings trace data directly into your terminal. Instead of switching to the web UI to inspect a trace, you run langsmith-fetch and pipe the output to your coding agent or read it directly.
This is particularly powerful when paired with AI coding assistants. You can pipe trace data directly into Claude Code or Cursor, turning your coding agent into an expert LLM debugger that can see exactly what your agent did.
Frequently Asked Questions
Can I use LangSmith without LangChain?
Yes. The @traceable decorator from the langsmith package works with any Python function, regardless of whether it uses LangChain. You can trace plain OpenAI SDK calls, custom retrieval functions, or any Python code. You just lose the automatic sub-step tracing that LangChain runnables provide — you need to nest @traceable calls yourself.
Does tracing add latency to my application?
Trace data is serialized and uploaded asynchronously in a background thread. For typical LLM applications, where the LLM call itself takes hundreds of milliseconds to seconds, the single-digit milliseconds of tracing overhead are negligible.
How do I disable tracing in specific environments?
Either do not set LANGCHAIN_TRACING_V2, or set it to "false". LangChain checks this variable at runtime — if it is absent or not "true", no traces are sent. For fine-grained control, you can also set it in code based on conditions.
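For example (the `APP_ENV` variable name is my own convention):

```python
import os

# Enable tracing only in production; local runs send nothing
if os.getenv("APP_ENV") == "production":
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
else:
    os.environ["LANGCHAIN_TRACING_V2"] = "false"
```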
Can I self-host LangSmith?
LangSmith offers a self-hosted option for enterprise customers who cannot send trace data to external servers. The self-hosted version runs in your own infrastructure (Kubernetes) and provides the same UI and API as the cloud version. Check the LangSmith documentation for setup instructions and licensing details.