LangSmith: Trace, Debug, Evaluate, and Monitor LLM Applications
Your LangChain pipeline works on three test inputs. You ship it. Then a user reports nonsense output. You stare at the chain definition and have no idea which step broke, what the intermediate values were, or whether the prompt even received the right context.
I have been in this exact situation more times than I care to admit. LangSmith changed everything — it records every single step of every single run, so instead of guessing, you replay the failure and see exactly what happened.
What Is LangSmith and Why Does It Matter?
I think of it as the "print-statement debugger" you wish you had for LLM apps, except it works in production and keeps a permanent record. Every prompt that went in, every completion that came back, every tool call your agent made — all captured automatically.
The platform has four pillars that map directly to the lifecycle of an LLM application:
Without LangSmith:

```
# Your chain fails in production
# You add print() statements everywhere
# You try to reproduce locally
# You still can't find the bug
# Hours wasted
```

With LangSmith:

```
# Your chain fails in production
# Open LangSmith dashboard
# Click the red (errored) trace
# See exact input that caused failure
# Fix in 30 seconds
```

The four pillars:

- Tracing: capture every prompt, completion, and tool call of every run automatically
- Debugging: replay failures and inspect the exact inputs that caused them
- Evaluation: score your chain systematically against datasets of expected outputs
- Monitoring: track quality, latency, and errors on real production traffic
Setup and Your First Trace
The setup surprised me when I first tried LangSmith — it is genuinely just two environment variables. LangSmith tracing activates the moment LangChain sees LANGCHAIN_TRACING_V2 set to "true". No code changes to your existing chains.
Here is a simple chain — run it and then check your LangSmith dashboard to see the trace appear.
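A minimal version of what I mean (a sketch assuming `langchain-openai` is installed and an `OPENAI_API_KEY` is set; the model name and prompt are my own placeholders):

```python
import os

# These two variables are all LangSmith needs -- no code changes to the chain
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."  # your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"  # optional project name

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Answer concisely: {question}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# Every step of this call is traced automatically
print(chain.invoke({"question": "What is a Python generator?"}))
```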
Open smith.langchain.com and navigate to your project. You will see a new trace with the full execution tree: the prompt template rendering, the LLM call with exact input/output tokens, and the output parser step. Each node shows its latency, so you immediately know where time is spent.
Understanding Traces — The Execution Tree
A trace is a tree of "runs." The root run is your top-level chain invocation. Each step inside the chain — prompt formatting, LLM call, output parsing, tool invocation — becomes a child run. Agent sub-steps become grandchild runs.
This chain generates a quiz question and then evaluates it — two LLM calls that create a deeper trace tree.
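A sketch of that setup (the prompt wording and topic are placeholders of mine; both sub-chains follow the same ChatPromptTemplate → ChatOpenAI → StrOutputParser shape described below):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")
parser = StrOutputParser()

# Chain 1: generate a quiz question
generate = (
    ChatPromptTemplate.from_template("Write one quiz question about {topic}.")
    | llm
    | parser
)

# Chain 2: evaluate the generated question
grade = (
    ChatPromptTemplate.from_template(
        "Rate this quiz question from 1 to 10 and explain briefly:\n{question}"
    )
    | llm
    | parser
)

question = generate.invoke({"topic": "Python decorators"})  # trace 1
verdict = grade.invoke({"question": question})              # trace 2
```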
In LangSmith, this shows up as two separate root traces (one per .invoke() call). Each root has three child nodes: ChatPromptTemplate → ChatOpenAI → StrOutputParser. Click any node to see its exact input and output.
Tracing Non-LangChain Code with @traceable
Not everything in your pipeline is a LangChain runnable. I have a validation function, a post-processing step, and a custom retriever in most of my projects — all plain Python. The @traceable decorator from the langsmith package wraps any function and includes it in the trace tree.
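Here is roughly what that looks like (a sketch: the function bodies are my own illustrations, and `chain` is assumed to be a LangChain chain like the one from the earlier example):

```python
from langsmith import traceable

@traceable
def validate_input(question: str) -> str:
    # Plain Python validation -- still shows up in the trace tree
    if not question.strip():
        raise ValueError("Empty question")
    return question.strip()

@traceable
def format_response(answer: str) -> str:
    return answer.strip()

@traceable
def python_qa_pipeline(question: str) -> str:
    cleaned = validate_input(question)
    answer = chain.invoke({"question": cleaned})  # the LangChain chain from earlier
    return format_response(answer)
```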
In LangSmith, the trace tree now shows python_qa_pipeline as the root, with validate_input, the LangChain chain (and its sub-steps), and format_response as children — all in one unified trace. If validate_input raises a ValueError, you see the exception directly in the trace without any log-hunting.
Debugging Failing Chains
This is where LangSmith earns its keep. When a chain fails in production, you do not need to reproduce the bug locally. Open the trace, find the red (errored) node, and read the exact input that caused the failure.
This chain deliberately fails on certain inputs, so we can see what the error trace looks like.
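A sketch of such a chain (prompt wording is a placeholder; the failure mode is real, since models often wrap JSON in code fences):

```python
import json

from langsmith import traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

@traceable
def parse_json_output(text: str) -> dict:
    # Raises json.JSONDecodeError when the model wraps its JSON in ```json fences
    return json.loads(text)

prompt = ChatPromptTemplate.from_template(
    "Return a JSON object with keys 'term' and 'definition' for: {term}"
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

@traceable
def define_term(term: str) -> dict:
    raw = chain.invoke({"term": term})
    return parse_json_output(raw)  # the red node in the trace when parsing fails
```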
When the LLM wraps its JSON in markdown code fences (a common failure mode), json.loads() fails. In LangSmith, you see the parse_json_output node highlighted in red with the full exception traceback. More importantly, you can click the parent LLM node and read the exact output that caused the parsing failure.
Comparing Successful and Failed Runs
LangSmith lets you select two runs and compare them side by side. I use this constantly — find one successful trace and one failed trace for the same chain, compare them, and the difference is usually obvious.
Evaluation Datasets — Systematic Testing for LLM Apps
Spot-checking a few inputs is how most people test their LLM apps. It works until it doesn’t. I shipped a chain that passed my five manual tests, then discovered it hallucinated on a completely normal input the next day.
Evaluation datasets bring the discipline of unit testing to LLM development. Define a set of inputs with expected outputs, run your chain against all of them, and measure how well it performs.
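Creating a dataset from code looks roughly like this (a sketch using the `langsmith` client; the dataset name and examples are my own, and the exact `create_examples` signature varies somewhat across SDK versions):

```python
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="python-qa-eval",
    description="Q&A pairs for the Python docs assistant",
)

qa_pairs = [
    ("What is a list comprehension?",
     "A concise syntax for building a list from an iterable."),
    ("What does the yield keyword do?",
     "It turns a function into a generator that produces values lazily."),
]
client.create_examples(
    inputs=[{"question": q} for q, _ in qa_pairs],
    outputs=[{"answer": a} for _, a in qa_pairs],
    dataset_id=dataset.id,
)
```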
The dataset now lives in your LangSmith account. You can also add examples through the UI, or convert production traces into evaluation examples with one click — which is a fantastic way to build your test set from real user queries.
Running Evaluations with Custom and Built-in Evaluators
A dataset alone is just data. The power comes from evaluators — functions that score each chain output against the expected output. LangSmith supports two flavors: custom Python evaluators (fast, deterministic) and LLM-as-Judge evaluators (nuanced, costlier).
Custom Python Evaluator
The simplest evaluator is a plain function. It receives the run output and the reference (expected) output, and returns a score.
LLM-as-Judge Evaluator
Keyword overlap is crude. For nuanced quality assessment, use an LLM to judge whether the answer is correct, complete, and well-written. The langsmith and langchain packages provide built-in evaluator classes for this.
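A hand-rolled judge is also only a few lines. This is a sketch of the pattern rather than the built-in classes: I ask the model for a bare 0–10 score and parse it, and I assume the same `"output"`/`"answer"` keys as before.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

judge_prompt = ChatPromptTemplate.from_template(
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Submitted answer: {prediction}\n"
    "Reply with only a number from 0 to 10."
)
judge = judge_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0)

def correctness_judge(run, example) -> dict:
    result = judge.invoke({
        "question": example.inputs["question"],
        "reference": example.outputs["answer"],
        "prediction": run.outputs["output"],
    })
    score = float(result.content.strip()) / 10  # normalize to 0.0-1.0
    return {"key": "correctness", "score": score}
```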
With evaluators defined, we run the evaluation against our dataset. LangSmith orchestrates the whole process: it feeds each dataset example to your chain, collects the outputs, runs every evaluator, and aggregates the scores.
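A sketch of the run itself, assuming a chain and the two evaluators from the previous examples (in older SDK versions the import path is `langsmith.evaluation.evaluate`):

```python
from langsmith import evaluate

def target(inputs: dict) -> dict:
    # Adapt the chain to the dataset's input/output schema
    return {"output": chain.invoke({"question": inputs["question"]})}

results = evaluate(
    target,
    data="python-qa-eval",  # the dataset created earlier
    evaluators=[keyword_overlap_evaluator, correctness_judge],
    experiment_prefix="qa-baseline",
)
```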
The results page in LangSmith shows a table with each example, the chain output, and all evaluator scores side by side. You can sort by score to quickly find the weakest answers, then drill into the trace to understand what went wrong.
Write a function length_evaluator(predicted, reference) that scores how close the predicted answer's length is to the reference answer's length. Return 1.0 if they are the same length, and reduce the score proportionally as they diverge. Use the formula: 1 - abs(len(predicted) - len(reference)) / max(len(predicted), len(reference)). If both strings are empty, return 1.0.
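One possible solution, following the formula as stated:

```python
def length_evaluator(predicted: str, reference: str) -> float:
    """Score 1.0 for equal lengths, decreasing proportionally as they diverge."""
    if not predicted and not reference:
        return 1.0  # both empty: identical by definition
    longest = max(len(predicted), len(reference))
    return 1 - abs(len(predicted) - len(reference)) / longest
```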
Production Monitoring — Keeping Your LLM App Healthy
Evaluation tells you how your chain performs on a fixed test set. Monitoring tells you how it performs on real traffic — and real traffic always surprises you. LangSmith captures every production run and lets you build dashboards, set up alerts, and track trends.
Setting Up Production Tracing
The first thing I set up for any production deployment: a separate LangSmith project with explicit user and session tracking.
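In code, that looks roughly like this (the project name and wrapper function are my own; attaching metadata and tags via the `config` argument is standard LangChain behavior):

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-app-production"  # separate prod project

def answer_question(question: str, user_id: str, session_id: str) -> str:
    # metadata and tags are attached to the trace for later filtering
    return chain.invoke(
        {"question": question},
        config={
            "metadata": {"user_id": user_id, "session_id": session_id},
            "tags": ["production"],
        },
    )
```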
With user_id and session_id in the metadata, you can filter traces in LangSmith by user or session. When a user reports an issue, you search for their ID and see every interaction they had.
Online Evaluation with Automation Rules
LangSmith supports automation rules that run evaluators on production traces automatically. Configure a rule in the UI that triggers an LLM-as-Judge evaluator on every incoming trace (or a sample). Continuous quality monitoring, zero additional code.
Annotation Queues — Human Feedback at Scale
Automated evaluators are useful, but some quality judgments need human eyes. LangSmith’s annotation queues let you route traces to human reviewers who label them directly in the UI. Reviewers see the input, the chain output, and any evaluator scores, then add their own judgment.
This is where monitoring connects back to evaluation. Human-reviewed traces with corrected outputs become new examples in your evaluation dataset, closing the feedback loop.
The LangSmith SDK — Programmatic Access to Traces
Everything you can do in the LangSmith UI, you can do programmatically with the langsmith Python client. I use this for nightly quality reports, CI/CD gate checks before deployment, and custom Slack alerts when error rates spike.
You can filter runs by tags, metadata, error status, latency range, and more. Nightly quality reports and CI/CD gate checks become straightforward.
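For instance, pulling yesterday's errored production runs (a sketch; the project name is a placeholder):

```python
from datetime import datetime, timedelta

from langsmith import Client

client = Client()

# All errored runs in the last 24 hours for the production project
failed_runs = client.list_runs(
    project_name="my-app-production",
    start_time=datetime.now() - timedelta(days=1),
    error=True,
)
for run in failed_runs:
    print(run.id, run.name, run.error)
```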
Write a function compute_error_rate(runs) that takes a list of run dictionaries (each with a "status" key that is either "success" or "error") and returns the error rate as a float between 0.0 and 1.0, rounded to 2 decimal places. If the list is empty, return 0.0.
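One possible solution:

```python
def compute_error_rate(runs: list[dict]) -> float:
    """Fraction of runs whose status is "error", rounded to 2 decimal places."""
    if not runs:
        return 0.0
    errors = sum(1 for run in runs if run["status"] == "error")
    return round(errors / len(runs), 2)
```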
Real-World Example: End-to-End LLM App with LangSmith
Time for a complete example. We will build a Python documentation Q&A pipeline, trace it with LangSmith, create an evaluation dataset, and run evaluations — the full workflow in one script.
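The pipeline itself is short (a sketch; the system prompt and example questions are mine, and tracing is assumed enabled via the environment variables from the setup section):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import traceable

chain = (
    ChatPromptTemplate.from_template(
        "You answer questions about Python's documentation. Q: {question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def docs_qa(question: str) -> str:
    return chain.invoke({"question": question})

for q in [
    "What does functools.lru_cache do?",
    "When should I use dataclasses?",
    "How does pathlib differ from os.path?",
]:
    print(docs_qa(q))  # each call produces one traced run
```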
All three calls are now visible as traced runs in LangSmith. The next step is to formalize these into an evaluation dataset and measure quality.
Common Mistakes and How to Fix Them
I have seen each of these trip up teams adopting LangSmith. They are all easy to fix once you know about them.
Without tracing enabled:

```python
import os

os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
# Missing LANGCHAIN_TRACING_V2!

# This chain runs fine but sends NO traces
chain.invoke({"question": "What is Python?"})
```

With tracing enabled:

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"  # This enables tracing
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."

# Now traces are captured
chain.invoke({"question": "What is Python?"})
```

The most common mistake is simply forgetting to set LANGCHAIN_TRACING_V2. The chain works perfectly without it — you just get no traces, and you do not realize it until you need them.
Wrong:

```python
# Everything goes to one project
os.environ["LANGCHAIN_PROJECT"] = "my-app"
```

Right:

```python
# Dev traces and prod traces stay separate
env = os.getenv("APP_ENV", "development")
os.environ["LANGCHAIN_PROJECT"] = f"my-app-{env}"
```

Mixing dev and production traces makes filtering painful. Your production monitoring dashboards will include test runs, and your evaluations will accidentally use production data. Always separate by environment.
Another mistake I see frequently: shipping without user_id or session_id in the trace metadata. Without them, you cannot trace a user complaint back to a specific run.

Write a function filter_traces(traces, key, value) that takes a list of trace dictionaries (each with a "metadata" dict) and returns only the traces where metadata[key] equals value. If a trace does not have the specified key in its metadata, skip it.
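One possible solution:

```python
def filter_traces(traces: list[dict], key: str, value) -> list[dict]:
    """Keep only traces whose metadata contains key with the given value."""
    return [
        trace
        for trace in traces
        if key in trace.get("metadata", {}) and trace["metadata"][key] == value
    ]
```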
Pricing, Limits, and Best Practices
LangSmith offers a free tier that is generous enough for development and small-scale production. The free tier includes a monthly trace allowance with limited retention. Paid tiers remove volume limits and extend data retention.
These are the best practices I have settled on after using LangSmith across multiple production applications:
Complete Code
Here is a single script that demonstrates the core LangSmith workflow: setup, tracing, custom functions, evaluation dataset creation, and running evaluations. Replace the API keys with your own before running.
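A condensed version of that workflow (a sketch: model names, dataset contents, and prompt wording are placeholders of mine, and `create_examples`/`evaluate` signatures vary somewhat across langsmith SDK versions):

```python
"""End-to-end LangSmith workflow: tracing, custom functions, dataset, evaluation."""
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."   # LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"
os.environ["OPENAI_API_KEY"] = "sk-..."           # OpenAI API key

from langsmith import Client, evaluate, traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1. A traced chain
chain = (
    ChatPromptTemplate.from_template("Answer concisely: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

# 2. A custom function included in the trace tree
@traceable
def qa_pipeline(question: str) -> str:
    return chain.invoke({"question": question}).strip()

print(qa_pipeline("What is a Python generator?"))

# 3. An evaluation dataset
client = Client()
dataset = client.create_dataset(dataset_name="qa-smoke-test")
client.create_examples(
    inputs=[{"question": "What is a list comprehension?"}],
    outputs=[{"answer": "A concise syntax for building lists from iterables."}],
    dataset_id=dataset.id,
)

# 4. A simple evaluator plus an evaluation run
def keyword_overlap(run, example) -> dict:
    pred = set(run.outputs["output"].lower().split())
    ref = set(example.outputs["answer"].lower().split())
    return {"key": "keyword_overlap", "score": len(pred & ref) / len(ref)}

evaluate(
    lambda inputs: {"output": qa_pipeline(inputs["question"])},
    data="qa-smoke-test",
    evaluators=[keyword_overlap],
    experiment_prefix="smoke",
)
```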
LangSmith vs Alternatives — When to Use What
LangSmith is not the only LLM observability platform. I have tried several alternatives, and the choice depends on your stack and priorities.
| Platform | Best For | LangChain Integration | Self-Hosted | Open Source |
|---|---|---|---|---|
| LangSmith | LangChain/LangGraph apps, teams already in the LangChain ecosystem | Native (automatic) | Enterprise plan | No |
| Arize Phoenix | Framework-agnostic tracing, OpenTelemetry-native teams | Via OpenInference | Yes (free) | Yes |
| Weights & Biases Weave | Teams already using W&B for ML experiment tracking | Manual integration | No | Partially |
| Helicone | Simple request logging, cost tracking, rate limiting | Proxy-based | Yes | Yes |
| OpenTelemetry + Jaeger | Teams with existing OTel infrastructure | Manual instrumentation | Yes | Yes |
LangSmith Fetch — Debug Agents from Your Terminal
LangSmith Fetch is a newer CLI tool that brings trace data directly into your terminal. Instead of switching to the web UI to inspect a trace, you run langsmith-fetch and pipe the output to your coding agent or read it directly.
This is particularly powerful when paired with AI coding assistants. You can pipe trace data directly into Claude Code or Cursor, turning your coding agent into an expert LLM debugger that can see exactly what your agent did.
Frequently Asked Questions
Can I use LangSmith without LangChain?
Yes. The @traceable decorator from the langsmith package works with any Python function, regardless of whether it uses LangChain. You can trace plain OpenAI SDK calls, custom retrieval functions, or any Python code. You just lose the automatic sub-step tracing that LangChain runnables provide — you need to nest @traceable calls yourself.
Does tracing add latency to my application?
Trace data is serialized and uploaded asynchronously in a background thread. For typical LLM applications, where the LLM call itself takes hundreds of milliseconds to seconds, the single-digit milliseconds of tracing overhead are negligible.
How do I disable tracing in specific environments?
Either do not set LANGCHAIN_TRACING_V2, or set it to "false". LangChain checks this variable at runtime — if it is absent or not "true", no traces are sent. For fine-grained control, you can also set it in code based on conditions.
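For example (the `APP_ENV` variable name is my own convention):

```python
import os

# Enable tracing only in production; local runs send nothing
if os.getenv("APP_ENV") == "production":
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
else:
    os.environ["LANGCHAIN_TRACING_V2"] = "false"
```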
Can I self-host LangSmith?
LangSmith offers a self-hosted option for enterprise customers who cannot send trace data to external servers. The self-hosted version runs in your own infrastructure (Kubernetes) and provides the same UI and API as the cloud version. Check the LangSmith documentation for setup instructions and licensing details.