LangSmith: Trace, Debug, Evaluate, and Monitor LLM Applications
Your LangChain pipeline works on three test inputs. You ship it. Then a user reports nonsense output. You stare at the chain definition and have no idea which step broke, what the intermediate values were, or whether the prompt even received the right context. LangSmith records every single step of every single run, so instead of guessing, you replay the failure and see exactly what happened.
What Is LangSmith and Why Does It Matter?
LangSmith is an observability platform built by the LangChain team. It captures traces (the full execution log of a chain or agent), stores them in a dashboard, and gives you tools to evaluate and monitor your LLM application over time.
I think of it as the "print-statement debugger" you wish you had for LLM apps, except it works in production and keeps a permanent record. Every prompt that went in, every completion that came back, every tool call your agent made, every retry and fallback — all captured automatically.
The platform has four pillars that map directly to the lifecycle of an LLM application: tracing, debugging, evaluation, and monitoring.
Setup and Your First Trace
Before anything else, you need three environment variables. LangSmith tracing activates the moment LangChain sees LANGCHAIN_TRACING_V2 set to "true". No code changes required — your existing chains start sending traces automatically.
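A minimal setup looks like this (the project name my-llm-app is a placeholder; pick your own):

```python
import os

# Required: turns on tracing for every LangChain call in this process
os.environ["LANGCHAIN_TRACING_V2"] = "true"

# Required: your LangSmith API key (create one at smith.langchain.com)
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."

# Recommended: traces go to this project instead of "default"
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"
```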
That is the entire setup. Three environment variables. Every LangChain call you make from this point forward gets traced and uploaded to your LangSmith dashboard.
Let’s run a simple chain and see the trace appear.
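A minimal sketch of such a chain, assuming langchain-openai is installed and OPENAI_API_KEY is set (the model name and prompt are illustrative — any chat model works):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Three steps -> three nodes in the trace tree
prompt = ChatPromptTemplate.from_template("Answer in one sentence: {question}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

if __name__ == "__main__":
    print(chain.invoke({"question": "What is Python?"}))
```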
Open smith.langchain.com and navigate to your project. You will see a new trace with the full execution tree: the prompt template rendering, the LLM call with exact input/output tokens, and the output parser step. Each node shows its latency, so you immediately know where time is spent.
Understanding Traces — The Execution Tree
A trace is a tree of "runs." The root run is your top-level chain invocation. Each step inside the chain — prompt formatting, LLM call, output parsing, tool invocation — becomes a child run. If a step itself contains sub-steps (like an agent that calls multiple tools), those become grandchild runs.
Here is a more complex chain that demonstrates a deeper trace tree.
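A sketch under the same assumptions as before (langchain-openai installed, API keys set); each .invoke() call produces its own root trace:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical assistant."),
    ("human", "{question}"),
])
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

if __name__ == "__main__":
    # Two invocations -> two root traces in the dashboard
    chain.invoke({"question": "What is a Python decorator?"})
    chain.invoke({"question": "What does the GIL do?"})
```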
In LangSmith, this shows up as two separate root traces (one per .invoke() call). Each root has three child nodes: ChatPromptTemplate → ChatOpenAI → StrOutputParser. Click any node to see its exact input and output.
The data captured for each LLM call node includes:

| Field | What it tells you |
| --- | --- |
| Input messages | The exact prompt sent to the model |
| Output message | The full completion returned |
| Token usage | Prompt tokens, completion tokens, total |
| Latency | Time to first token and total duration |
| Model name | Which model was called |
| Parameters | Temperature, max_tokens, etc. |
Tracing Non-LangChain Code with @traceable
Not everything in your pipeline is a LangChain runnable. You might have a custom retrieval function, a validation step, or a post-processing function written in plain Python. The @traceable decorator from the langsmith package lets you wrap any function and include it in your trace tree.
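A sketch of a decorated pipeline (function names follow the trace tree described below; the chain definition and model name are illustrative):

```python
from langsmith import traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Answer briefly: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def validate_input(question: str) -> str:
    # Plain-Python step: reject empty questions before spending tokens
    if not question.strip():
        raise ValueError("Question must not be empty")
    return question.strip()

@traceable
def format_response(answer: str) -> str:
    return answer.strip()

@traceable
def python_qa_pipeline(question: str) -> str:
    # Root run: the decorated calls and the chain appear as children
    return format_response(chain.invoke({"question": validate_input(question)}))
```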
In LangSmith, the trace tree now shows python_qa_pipeline as the root, with validate_input, the LangChain chain (and its sub-steps), and format_response as children — all in one unified trace. If validate_input raises a ValueError, you see the exception directly in the trace without any log-hunting.
Debugging Failing Chains
This is where LangSmith earns its keep. When a chain fails in production, you do not need to reproduce the bug locally. You open the trace, find the red (errored) node, and read the exact input that caused the failure. I have caught bugs in 30 seconds that would have taken me an hour with print statements.
Here is a chain that deliberately fails on certain inputs, so we can see what the error trace looks like.
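A minimal sketch of the failure mode (the chain wiring is omitted; parse_json_output is the post-processing step that errors when the model wraps its JSON in fences):

```python
import json

def parse_json_output(text: str) -> dict:
    # Naive parser: assumes the model returned bare JSON.
    # Fails when the model wraps its answer in ```json ... ``` fences.
    return json.loads(text)

clean = '{"language": "Python", "typed": false}'
fenced = '```json\n{"language": "Python", "typed": false}\n```'

parse_json_output(clean)  # works fine

try:
    parse_json_output(fenced)  # raises json.JSONDecodeError
except json.JSONDecodeError as err:
    print(f"parse failed: {err}")
```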
When the LLM wraps its JSON in markdown code fences (a common failure mode), json.loads() fails. In LangSmith, you see the parse_json_output node highlighted in red with the full exception traceback. More importantly, you can click the parent LLM node and read the exact output that caused the parsing failure.
Comparing Successful and Failed Runs
LangSmith lets you select two runs and compare them side by side. I use this constantly — find one successful trace and one failed trace for the same chain, compare them, and the difference is usually obvious. Maybe the successful run got clean JSON while the failed run got JSON wrapped in triple backticks. That tells you exactly what to fix in your parsing code.
The comparison view highlights where the two runs diverge, step by step, so the difference usually stands out at a glance.
Evaluation Datasets — Systematic Testing for LLM Apps
Spot-checking a few inputs is how most people test their LLM apps. It works until it doesn’t. Evaluation datasets bring the discipline of unit testing to LLM development: define a set of inputs with expected outputs, run your chain against all of them, and measure how well it performs.
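Creating a dataset programmatically might look like this (the dataset name and example pairs are placeholders):

```python
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="python-qa-examples",
    description="Q&A pairs for the Python documentation assistant",
)

examples = [
    ("What is a list comprehension?", "A concise syntax for building lists from iterables."),
    ("What does the GIL do?", "The global interpreter lock serializes bytecode execution."),
]
client.create_examples(
    inputs=[{"question": q} for q, _ in examples],
    outputs=[{"answer": a} for _, a in examples],
    dataset_id=dataset.id,
)
```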
The dataset now lives in your LangSmith account. You can also add examples through the UI, or convert production traces into evaluation examples with one click — which is a fantastic way to build your test set from real user queries.
Running Evaluations with Custom and Built-in Evaluators
A dataset alone is just data. The power comes from evaluators — functions that score each chain output against the expected output. LangSmith supports custom Python evaluators and LLM-as-Judge evaluators out of the box.
Custom Python Evaluator
The simplest evaluator is a plain function. It receives the run output and the reference (expected) output, and returns a score.
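One plausible shape, using keyword overlap as the metric (recent langsmith versions also accept run/example-style evaluator signatures, so treat the parameter names as an assumption):

```python
def keyword_overlap(outputs: dict, reference_outputs: dict) -> dict:
    """Score = fraction of reference keywords that appear in the prediction."""
    predicted = outputs["answer"].lower()
    keywords = reference_outputs["answer"].lower().split()
    if not keywords:
        return {"key": "keyword_overlap", "score": 1.0}
    hits = sum(1 for word in keywords if word in predicted)
    return {"key": "keyword_overlap", "score": hits / len(keywords)}

# Quick local check, no LangSmith needed
result = keyword_overlap(
    {"answer": "Python uses indentation to delimit blocks"},
    {"answer": "indentation delimits blocks"},
)
print(result["score"])
```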
LLM-as-Judge Evaluator
Keyword overlap is crude. For nuanced quality assessment, use an LLM to judge whether the answer is correct, complete, and well-written. The langsmith and langchain packages provide built-in evaluator classes for this.
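One way to wire this up is through the langsmith SDK's wrapper around LangChain's built-in QA evaluators — a sketch, assuming a recent langsmith version:

```python
from langsmith.evaluation import LangChainStringEvaluator

# "cot_qa" asks a judge LLM to reason step by step about whether the
# chain's answer matches the reference answer, then emit a score.
qa_judge = LangChainStringEvaluator("cot_qa")
```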
With evaluators defined, we run the evaluation against our dataset. LangSmith orchestrates the whole process: it feeds each dataset example to your chain, collects the outputs, runs every evaluator, and aggregates the scores.
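A sketch of the evaluation run, with a trivial exact-match evaluator standing in for the ones defined earlier and a stubbed target in place of the real chain (dataset and experiment names are assumptions):

```python
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # In real use, call your chain here: chain.invoke(inputs)
    return {"answer": "Python is a dynamically typed language"}

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    score = float(outputs["answer"] == reference_outputs["answer"])
    return {"key": "exact_match", "score": score}

results = evaluate(
    target,
    data="python-qa-examples",        # the dataset created earlier
    evaluators=[exact_match],
    experiment_prefix="python-qa-baseline",
)
```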
The results page in LangSmith shows a table with each example, the chain output, and all evaluator scores side by side. You can sort by score to quickly find the weakest answers, then drill into the trace to understand what went wrong.
Write a function length_evaluator(predicted, reference) that scores how close the predicted answer's length is to the reference answer's length. Return 1.0 if they are the same length, and reduce the score proportionally as they diverge. Use the formula: 1 - abs(len(predicted) - len(reference)) / max(len(predicted), len(reference)). If both strings are empty, return 1.0.
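A solution matching the stated formula (the equal-length guard also covers the two-empty-strings case, avoiding division by zero):

```python
def length_evaluator(predicted: str, reference: str) -> float:
    # Equal lengths (including both empty) score a perfect 1.0
    if len(predicted) == len(reference):
        return 1.0
    # Score decays proportionally with the length gap
    return 1 - abs(len(predicted) - len(reference)) / max(len(predicted), len(reference))
```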
Production Monitoring — Keeping Your LLM App Healthy
Evaluation tells you how your chain performs on a fixed test set. Monitoring tells you how it performs on real traffic. LangSmith captures every production run and lets you build dashboards, set up alerts, and track trends.
Setting Up Production Tracing
For production, you typically want a separate project and explicit user/session tracking.
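A sketch of what that looks like — the project name, model, and prompt are placeholders; the config keys (metadata, tags) are standard LangChain RunnableConfig fields:

```python
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

os.environ["LANGCHAIN_PROJECT"] = "my-app-prod"  # keep prod traces separate

chain = (
    ChatPromptTemplate.from_template("Answer briefly: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

def answer(question: str, user_id: str, session_id: str) -> str:
    # metadata and tags ride along with the trace and are filterable in the UI
    return chain.invoke(
        {"question": question},
        config={
            "metadata": {"user_id": user_id, "session_id": session_id},
            "tags": ["production"],
        },
    )
```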
With user_id and session_id in the metadata, you can filter traces in LangSmith by user or session. When a user reports an issue, you search for their ID and see every interaction they had.
Online Evaluation with Automation Rules
LangSmith supports automation rules that run evaluators on production traces automatically. You configure a rule in the LangSmith UI that triggers an LLM-as-Judge evaluator on every incoming trace (or a sample). This gives you continuous quality monitoring without any additional code.
The typical monitoring setup I use combines a dedicated production project, user/session metadata on every call, and an automation rule that samples incoming traces for LLM-as-Judge scoring.
Annotation Queues — Human Feedback at Scale
Automated evaluators are useful, but some quality judgments need human eyes. LangSmith’s annotation queues let you route traces to human reviewers who label them directly in the UI. Reviewers see the input, the chain output, and any evaluator scores, then add their own judgment.
This is where monitoring connects back to evaluation. Human-reviewed traces with corrected outputs become new examples in your evaluation dataset, closing the feedback loop.
The LangSmith SDK — Programmatic Access to Traces
Everything you can do in the LangSmith UI, you can do programmatically with the langsmith Python client. This matters when you want to build custom dashboards, export data for analysis, or integrate LangSmith into CI/CD pipelines.
You can filter runs by tags, metadata, error status, latency range, and more. This makes it straightforward to build nightly reports or integrate quality checks into your deployment pipeline.
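For example, pulling yesterday's errored production runs might look like this (the project name is an assumption):

```python
from datetime import datetime, timedelta
from langsmith import Client

client = Client()

# Errored runs from the last 24 hours in the production project
failed_runs = client.list_runs(
    project_name="my-app-prod",
    error=True,
    start_time=datetime.now() - timedelta(days=1),
)
for run in failed_runs:
    print(run.name, run.inputs)
```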
Write a function compute_error_rate(runs) that takes a list of run dictionaries (each with a "status" key that is either "success" or "error") and returns the error rate as a float between 0.0 and 1.0, rounded to 2 decimal places. If the list is empty, return 0.0.
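One possible solution:

```python
def compute_error_rate(runs: list[dict]) -> float:
    # No runs means no errors to report
    if not runs:
        return 0.0
    errors = sum(1 for run in runs if run["status"] == "error")
    return round(errors / len(runs), 2)
```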
Real-World Example: End-to-End LLM App with LangSmith
Let’s put everything together. We will build a Python documentation Q&A pipeline, trace it with LangSmith, create an evaluation dataset, run evaluations, and set up the foundation for production monitoring.
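A sketch of the pipeline and the three initial calls (prompt, model, and questions are illustrative; tracing is assumed to be enabled as shown in the setup section):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

qa_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "You answer questions about the Python standard library."),
        ("human", "{question}"),
    ])
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

if __name__ == "__main__":
    # Three invocations -> three traced runs in the dashboard
    for q in [
        "How do I read a file line by line?",
        "What does functools.lru_cache do?",
        "How do I parse ISO 8601 dates?",
    ]:
        print(qa_chain.invoke({"question": q}))
```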
All three calls are now visible as traced runs in LangSmith. The next step is to formalize these into an evaluation dataset and measure quality.
Common Mistakes and How to Fix Them
I have seen each of these trip up teams adopting LangSmith. They are all easy to fix once you know about them.
```python
import os

os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
# Missing LANGCHAIN_TRACING_V2!

# This chain runs fine but sends NO traces
chain.invoke({"question": "What is Python?"})
```

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"  # This enables tracing
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."

# Now traces are captured
chain.invoke({"question": "What is Python?"})
```

The most common mistake is simply forgetting to set LANGCHAIN_TRACING_V2. The chain works perfectly without it — you just get no traces, and you do not realize it until you need them.
```python
# Everything goes to one project
os.environ["LANGCHAIN_PROJECT"] = "my-app"
```

```python
# Dev traces and prod traces stay separate
env = os.getenv("APP_ENV", "development")
os.environ["LANGCHAIN_PROJECT"] = f"my-app-{env}"
```

Mixing dev and production traces makes filtering painful. Your production monitoring dashboards will include test runs, and your evaluations will accidentally use production data. Always separate by environment.
Other mistakes I see frequently:
- Not tagging production traces with user_id or session_id, so you cannot trace a user complaint back to a specific run

Write a function filter_traces(traces, key, value) that takes a list of trace dictionaries (each with a "metadata" dict) and returns only the traces where metadata[key] equals value. If a trace does not have the specified key in its metadata, skip it.
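One possible solution:

```python
def filter_traces(traces: list[dict], key: str, value) -> list[dict]:
    # Keep only traces whose metadata contains the key AND matches the value;
    # traces missing the key are skipped entirely.
    return [
        trace for trace in traces
        if key in trace["metadata"] and trace["metadata"][key] == value
    ]
```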
Pricing, Limits, and Best Practices
LangSmith offers a free tier that is generous enough for development and small-scale production. The free tier includes a monthly trace allowance with limited retention. Paid tiers remove volume limits and extend data retention significantly.
Best practices I have settled on after using LangSmith across multiple production applications:
- Use one project per environment: my-app-dev, my-app-staging, my-app-prod

Complete Code
Here is a single script that demonstrates the core LangSmith workflow: setup, tracing, custom functions, evaluation dataset creation, and running evaluations. Replace the API keys with your own before running.
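The following is a sketch of that script under the same assumptions as the earlier snippets (langchain-openai and langsmith installed; project, dataset, and model names are placeholders):

```python
"""Core LangSmith workflow: setup, tracing, custom functions, dataset, evaluation."""
import os

# --- Setup: tracing on, keys in place (replace the placeholders) ---------
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."   # your LangSmith key
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"
os.environ["OPENAI_API_KEY"] = "sk-..."           # your OpenAI key

from langsmith import Client, traceable
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# --- A traced chain -------------------------------------------------------
chain = (
    ChatPromptTemplate.from_template("Answer in one sentence: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

# --- Custom Python steps in the same trace tree ---------------------------
@traceable
def qa_pipeline(question: str) -> str:
    if not question.strip():
        raise ValueError("empty question")
    return chain.invoke({"question": question.strip()})

# --- Evaluation dataset ----------------------------------------------------
client = Client()
dataset = client.create_dataset(dataset_name="demo-qa")
client.create_examples(
    inputs=[{"question": "What is a tuple?"}],
    outputs=[{"answer": "An immutable sequence type."}],
    dataset_id=dataset.id,
)

# --- Evaluator + evaluation run --------------------------------------------
def keyword_overlap(outputs: dict, reference_outputs: dict) -> dict:
    predicted = outputs["answer"].lower()
    keywords = reference_outputs["answer"].lower().split()
    hits = sum(1 for word in keywords if word in predicted)
    return {"key": "keyword_overlap", "score": hits / max(len(keywords), 1)}

if __name__ == "__main__":
    print(qa_pipeline("What is Python?"))
    evaluate(
        lambda inputs: {"answer": qa_pipeline(inputs["question"])},
        data="demo-qa",
        evaluators=[keyword_overlap],
        experiment_prefix="demo-baseline",
    )
```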
Frequently Asked Questions
Can I use LangSmith without LangChain?
Yes. The @traceable decorator from the langsmith package works with any Python function, regardless of whether it uses LangChain. You can trace plain OpenAI SDK calls, custom retrieval functions, or any Python code. You just lose the automatic sub-step tracing that LangChain runnables provide — you need to nest @traceable calls yourself.
Does tracing add latency to my application?
Trace data is serialized and uploaded asynchronously in a background thread. In practice, the overhead is negligible for typical LLM applications where the LLM call itself takes hundreds of milliseconds to seconds. The tracing adds single-digit milliseconds of overhead.
How do I disable tracing in specific environments?
Either do not set LANGCHAIN_TRACING_V2, or set it to "false". LangChain checks this variable at runtime — if it is absent or not "true", no traces are sent. For fine-grained control, you can also set it in code based on conditions.
Can I self-host LangSmith?
LangSmith offers a self-hosted option for enterprise customers who cannot send trace data to external servers. The self-hosted version runs in your own infrastructure (Kubernetes) and provides the same UI and API as the cloud version. Check the LangSmith documentation for setup instructions and licensing details.