
LangSmith: Trace, Debug, Evaluate, and Monitor LLM Applications

Intermediate · 90 min · 3 exercises · 45 XP

Your LangChain pipeline works on three test inputs. You ship it. Then a user reports nonsense output. You stare at the chain definition and have no idea which step broke, what the intermediate values were, or whether the prompt even received the right context. LangSmith records every single step of every single run, so instead of guessing, you replay the failure and see exactly what happened.

What Is LangSmith and Why Does It Matter?

LangSmith is an observability platform built by the LangChain team. It captures traces (the full execution log of a chain or agent), stores them in a dashboard, and gives you tools to evaluate and monitor your LLM application over time.

I think of it as the "print-statement debugger" you wish you had for LLM apps, except it works in production and keeps a permanent record. Every prompt that went in, every completion that came back, every tool call your agent made, every retry and fallback — all captured automatically.

The platform has four pillars that map directly to the lifecycle of an LLM application:

  • Tracing — Record every step of a chain run, including inputs, outputs, token counts, and latencies
  • Debugging — Inspect failing runs, compare successful vs. failed traces, and isolate the broken step
  • Evaluation — Build datasets of input/expected-output pairs and score your chain with automated evaluators (including LLM-as-Judge)
  • Monitoring — Track success rates, latency percentiles, cost, and custom metrics in production

    Setup and Your First Trace

    Before anything else, you need to set three environment variables. LangSmith tracing activates the moment LangChain sees LANGCHAIN_TRACING_V2 set to "true". No code changes required — your existing chains start sending traces automatically.

    Environment setup for LangSmith
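A minimal setup sketch; the API key is a placeholder, substitute your own, and the project name is an assumption:

```python
import os

# Turns tracing on: LangChain checks this variable at runtime
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Your LangSmith API key (placeholder value shown)
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
# Optional but recommended: group traces under a named project
os.environ["LANGCHAIN_PROJECT"] = "my-first-traces"
```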

    That is the entire setup. Three environment variables. Every LangChain call you make from this point forward gets traced and uploaded to your LangSmith dashboard.

    Let’s run a simple chain and see the trace appear.

    A simple chain that gets traced automatically
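A sketch of the kind of chain that gets traced, assuming langchain-openai is installed and OPENAI_API_KEY is set; the prompt and model name are assumptions:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Answer in one sentence: {question}")
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is an assumption
chain = prompt | llm | StrOutputParser()

# With LANGCHAIN_TRACING_V2="true", this call is traced automatically
print(chain.invoke({"question": "What is Python?"}))
```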

    Open smith.langchain.com and navigate to your project. You will see a new trace with the full execution tree: the prompt template rendering, the LLM call with exact input/output tokens, and the output parser step. Each node shows its latency, so you immediately know where time is spent.

    Understanding Traces — The Execution Tree

    A trace is a tree of "runs." The root run is your top-level chain invocation. Each step inside the chain — prompt formatting, LLM call, output parsing, tool invocation — becomes a child run. If a step itself contains sub-steps (like an agent that calls multiple tools), those become grandchild runs.

    Here is a more complex chain that demonstrates a deeper trace tree.

    Multi-step chain with a deeper trace
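One possible shape for such a pipeline (a sketch; the prompts and model are assumptions). The output of the first chain feeds the second, and each top-level .invoke() produces its own root trace:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
summarize = (
    ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
    | llm
    | StrOutputParser()
)
make_question = (
    ChatPromptTemplate.from_template("Write one quiz question about: {summary}")
    | llm
    | StrOutputParser()
)

# Two top-level invocations -> two root traces in LangSmith
summary = summarize.invoke({"text": "Python lists are mutable ordered sequences."})
question = make_question.invoke({"summary": summary})
print(question)
```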

    In LangSmith, this shows up as two separate root traces (one per .invoke() call). Each root has three child nodes: ChatPromptTemplate → ChatOpenAI → StrOutputParser. Click any node to see its exact input and output.

    The data captured for each LLM call node includes:
  • Input messages — the exact prompt sent to the model
  • Output message — the full completion returned
  • Token usage — prompt tokens, completion tokens, total
  • Latency — time to first token and total duration
  • Model name — which model was called
  • Parameters — temperature, max_tokens, etc.

    Tracing Non-LangChain Code with @traceable

    Not everything in your pipeline is a LangChain runnable. You might have a custom retrieval function, a validation step, or a post-processing function written in plain Python. The @traceable decorator from the langsmith package lets you wrap any function and include it in your trace tree.

    Using @traceable for custom functions
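A sketch of such a pipeline, using the function names described below; the chain definition itself is an assumption carried over from the earlier examples:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import traceable

chain = (
    ChatPromptTemplate.from_template("Answer concisely: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def validate_input(question: str) -> str:
    if not question.strip():
        raise ValueError("Question must not be empty")
    return question.strip()

@traceable
def format_response(answer: str) -> str:
    return f"Answer: {answer}"

@traceable
def python_qa_pipeline(question: str) -> str:
    cleaned = validate_input(question)            # child run
    answer = chain.invoke({"question": cleaned})  # child run with sub-steps
    return format_response(answer)                # child run

print(python_qa_pipeline("What is a decorator?"))
```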

    In LangSmith, the trace tree now shows python_qa_pipeline as the root, with validate_input, the LangChain chain (and its sub-steps), and format_response as children — all in one unified trace. If validate_input raises a ValueError, you see the exception directly in the trace without any log-hunting.

    Debugging Failing Chains

    This is where LangSmith earns its keep. When a chain fails in production, you do not need to reproduce the bug locally. You open the trace, find the red (errored) node, and read the exact input that caused the failure. I have caught bugs in 30 seconds that would have taken me an hour with print statements.

    Here is a chain that deliberately fails on certain inputs, so we can see what the error trace looks like.

    A chain with a deliberate failure point
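A sketch of one way to build such a chain; the prompt and model are assumptions, and the parser is deliberately naive:

```python
import json

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

def parse_json_output(text: str) -> dict:
    # Deliberately naive: raises if the model wraps the JSON in markdown fences
    return json.loads(text)

chain = (
    ChatPromptTemplate.from_template(
        "Return a JSON object with keys 'name' and 'year' for: {language}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
    | RunnableLambda(parse_json_output)
)

result = chain.invoke({"language": "Python"})  # fails when fences appear
```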

    When the LLM wraps its JSON in markdown code fences (a common failure mode), json.loads() fails. In LangSmith, you see the parse_json_output node highlighted in red with the full exception traceback. More importantly, you can click the parent LLM node and read the exact output that caused the parsing failure.

    Comparing Successful and Failed Runs

    LangSmith lets you select two runs and compare them side by side. I use this constantly — find one successful trace and one failed trace for the same chain, compare them, and the difference is usually obvious. Maybe the successful run got clean JSON while the failed run got JSON wrapped in triple backticks. That tells you exactly what to fix in your parsing code.

    The comparison view highlights differences in:

  • Input values (did the user send something unexpected?)
  • Prompt text (did template rendering produce a malformed prompt?)
  • LLM output (did the model deviate from the expected format?)
  • Latency (was the failure related to a timeout?)
  • Token usage (did the response get truncated by max_tokens?)

    Evaluation Datasets — Systematic Testing for LLM Apps

    Spot-checking a few inputs is how most people test their LLM apps. It works until it doesn’t. Evaluation datasets bring the discipline of unit testing to LLM development: define a set of inputs with expected outputs, run your chain against all of them, and measure how well it performs.

    Creating an evaluation dataset
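A sketch using the langsmith Client; the dataset name and example pairs are assumptions:

```python
from langsmith import Client

client = Client()
dataset = client.create_dataset(
    dataset_name="python-qa-eval",
    description="Q&A pairs for the Python docs assistant",
)

examples = [
    ("What is a list comprehension?",
     "A concise syntax for building a list from an iterable in one expression."),
    ("What does the zip() function do?",
     "It aggregates elements from multiple iterables into tuples."),
    ("What is a generator?",
     "A function that yields values lazily, one at a time."),
]
for question, answer in examples:
    client.create_example(
        inputs={"question": question},
        outputs={"answer": answer},
        dataset_id=dataset.id,
    )
```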

    The dataset now lives in your LangSmith account. You can also add examples through the UI, or convert production traces into evaluation examples with one click — which is a fantastic way to build your test set from real user queries.

    Running Evaluations with Custom and Built-in Evaluators

    A dataset alone is just data. The power comes from evaluators — functions that score each chain output against the expected output. LangSmith supports custom Python evaluators and LLM-as-Judge evaluators out of the box.

    Custom Python Evaluator

    The simplest evaluator is a plain function. It receives the run output and the reference (expected) output, and returns a score.

    A custom evaluator that checks keyword overlap
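A minimal sketch of such an evaluator, written against the (run, example) callable signature that LangSmith's evaluate() helper accepts; the metric name and output keys are assumptions:

```python
def keyword_overlap_evaluator(run, example):
    """Score the fraction of reference-answer keywords found in the output."""
    predicted = (run.outputs or {}).get("output", "").lower()
    reference = example.outputs.get("answer", "").lower()
    keywords = set(reference.split())
    if not keywords:
        return {"key": "keyword_overlap", "score": 1.0}
    hits = sum(1 for word in keywords if word in predicted)
    return {"key": "keyword_overlap", "score": hits / len(keywords)}
```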

    LLM-as-Judge Evaluator

    Keyword overlap is crude. For nuanced quality assessment, use an LLM to judge whether the answer is correct, complete, and well-written. The langsmith and langchain packages provide built-in evaluator classes for this.

    LLM-as-Judge evaluator
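One built-in option is a LangChain-backed string evaluator exposed through the langsmith SDK; a sketch, assuming the "cot_qa" (chain-of-thought Q&A) evaluator:

```python
from langsmith.evaluation import LangChainStringEvaluator

# An LLM judge reasons step by step about whether the prediction
# matches the reference answer, then emits a correctness score
qa_judge = LangChainStringEvaluator("cot_qa")
```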

    With evaluators defined, we run the evaluation against our dataset. LangSmith orchestrates the whole process: it feeds each dataset example to your chain, collects the outputs, runs every evaluator, and aggregates the scores.

    Running the evaluation
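A sketch of the evaluation call, reusing the chain, keyword_overlap_evaluator, and qa_judge names from the snippets above; the experiment prefix is an assumption:

```python
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # Wrap the chain so evaluate() can call it on each dataset example
    return {"output": chain.invoke({"question": inputs["question"]})}

results = evaluate(
    target,
    data="python-qa-eval",  # dataset name from the earlier snippet
    evaluators=[keyword_overlap_evaluator, qa_judge],
    experiment_prefix="python-qa-baseline",
)
```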

    The results page in LangSmith shows a table with each example, the chain output, and all evaluator scores side by side. You can sort by score to quickly find the weakest answers, then drill into the trace to understand what went wrong.

    Exercise 1: Build a Length-Based Evaluator
    Write Code

    Write a function length_evaluator(predicted, reference) that scores how close the predicted answer's length is to the reference answer's length. Return 1.0 if they are the same length, and reduce the score proportionally as they diverge. Use the formula: 1 - abs(len(predicted) - len(reference)) / max(len(predicted), len(reference)). If both strings are empty, return 1.0.


    Production Monitoring — Keeping Your LLM App Healthy

    Evaluation tells you how your chain performs on a fixed test set. Monitoring tells you how it performs on real traffic. LangSmith captures every production run and lets you build dashboards, set up alerts, and track trends.

    Setting Up Production Tracing

    For production, you typically want a separate project and explicit user/session tracking.

    Production tracing configuration
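A sketch of the production configuration; `chain` is assumed to be defined as in the earlier snippets, and the IDs and tags are placeholders:

```python
import os

# Send production traces to their own project
os.environ["LANGCHAIN_PROJECT"] = "my-app-production"

# Attach user/session metadata so traces can be filtered later
result = chain.invoke(
    {"question": "What is Python?"},
    config={
        "metadata": {"user_id": "user-123", "session_id": "sess-456"},
        "tags": ["production", "v2.1"],
    },
)
```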

    With user_id and session_id in the metadata, you can filter traces in LangSmith by user or session. When a user reports an issue, you search for their ID and see every interaction they had.

    Online Evaluation with Automation Rules

    LangSmith supports automation rules that run evaluators on production traces automatically. You configure a rule in the LangSmith UI that triggers an LLM-as-Judge evaluator on every incoming trace (or a sample). This gives you continuous quality monitoring without any additional code.

    The typical monitoring setup I use looks like this:

  • Log everything — all traces go to LangSmith with user/session metadata
  • Sample evaluation — an automation rule runs a quality evaluator on 10-20% of production traces
  • Alert on failures — set up threshold alerts (e.g., notify me when the error rate exceeds 5% in any 1-hour window)
  • Human review queue — route low-scoring traces to a review queue where a human can label them as good or bad
  • Feed back to eval dataset — add confirmed-bad production examples to your evaluation dataset so future versions are tested against real failure cases

    Annotation Queues — Human Feedback at Scale

    Automated evaluators are useful, but some quality judgments need human eyes. LangSmith’s annotation queues let you route traces to human reviewers who label them directly in the UI. Reviewers see the input, the chain output, and any evaluator scores, then add their own judgment.

    Creating an annotation queue and sending runs to it
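A sketch using the langsmith Client; the queue name, filter criteria, and the exact add_runs_to_annotation_queue signature are assumptions:

```python
from langsmith import Client

client = Client()
queue = client.create_annotation_queue(
    name="low-score-review",
    description="Production runs flagged for human review",
)

# Route recent errored runs to the queue for human labeling
runs = client.list_runs(project_name="my-app-production", error=True, limit=10)
client.add_runs_to_annotation_queue(queue.id, run_ids=[run.id for run in runs])
```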

    This is where monitoring connects back to evaluation. Human-reviewed traces with corrected outputs become new examples in your evaluation dataset, closing the feedback loop.

    The LangSmith SDK — Programmatic Access to Traces

    Everything you can do in the LangSmith UI, you can do programmatically with the langsmith Python client. This matters when you want to build custom dashboards, export data for analysis, or integrate LangSmith into CI/CD pipelines.

    Querying traces programmatically
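A sketch of a programmatic query; the project name is a placeholder, and the printed run attributes are assumptions:

```python
from datetime import datetime, timedelta

from langsmith import Client

client = Client()
# Fetch the last 24 hours of LLM-call runs from one project
runs = client.list_runs(
    project_name="my-app-production",
    run_type="llm",
    start_time=datetime.now() - timedelta(days=1),
)
for run in runs:
    print(run.name, run.status, run.total_tokens)
```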

    You can filter runs by tags, metadata, error status, latency range, and more. This makes it straightforward to build nightly reports or integrate quality checks into your deployment pipeline.

    Finding error runs and extracting details
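A sketch that narrows the query to errored runs and prints the failing inputs; the time window and project name are assumptions:

```python
from datetime import datetime, timedelta

from langsmith import Client

client = Client()
error_runs = client.list_runs(
    project_name="my-app-production",
    error=True,
    start_time=datetime.now() - timedelta(hours=1),
)
for run in error_runs:
    print(f"{run.name} failed: {run.error}")
    print(f"  inputs were: {run.inputs}")
```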
    Exercise 2: Compute Error Rate from Run Data
    Write Code

    Write a function compute_error_rate(runs) that takes a list of run dictionaries (each with a "status" key that is either "success" or "error") and returns the error rate as a float between 0.0 and 1.0, rounded to 2 decimal places. If the list is empty, return 0.0.


    Real-World Example: End-to-End LLM App with LangSmith

    Let’s put everything together. We will build a Python documentation Q&A pipeline, trace it with LangSmith, create an evaluation dataset, run evaluations, and set up the foundation for production monitoring.

    Complete Q&A pipeline with LangSmith integration
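One way the pipeline could look (a sketch; the prompt, model, project name, and sample questions are assumptions):

```python
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import traceable

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "python-docs-qa"

chain = (
    ChatPromptTemplate.from_template(
        "You are a Python documentation assistant. Answer concisely: {question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def docs_qa(question: str) -> str:
    return chain.invoke({"question": question})

for q in ["What does zip() do?",
          "What is a generator?",
          "How do sets differ from lists?"]:
    print(docs_qa(q))
```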

    All three calls are now visible as traced runs in LangSmith. The next step is to formalize these into an evaluation dataset and measure quality.

    Create eval dataset and run evaluation
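A sketch that builds a small dataset and evaluates against it, reusing the docs_qa function from the previous snippet; the dataset name and toy evaluator are assumptions:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()
dataset = client.create_dataset(dataset_name="python-docs-qa-eval")
client.create_example(
    inputs={"question": "What does zip() do?"},
    outputs={"answer": "It aggregates elements from multiple iterables into tuples."},
    dataset_id=dataset.id,
)

def mentions_topic_evaluator(run, example):
    # Toy check: does the prediction mention the function being asked about?
    predicted = (run.outputs or {}).get("output", "").lower()
    return {"key": "mentions_topic", "score": float("zip" in predicted)}

results = evaluate(
    lambda inputs: {"output": docs_qa(inputs["question"])},
    data="python-docs-qa-eval",
    evaluators=[mentions_topic_evaluator],
    experiment_prefix="docs-qa",
)
```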

    Common Mistakes and How to Fix Them

    I have seen each of these trip up teams adopting LangSmith. They are all easy to fix once you know about them.

    Forgetting to set the tracing env var
    import os
    os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
    # Missing LANGCHAIN_TRACING_V2!
    
    # This chain runs fine but sends NO traces
    chain.invoke({"question": "What is Python?"})
    Always set LANGCHAIN_TRACING_V2
    import os
    os.environ["LANGCHAIN_TRACING_V2"] = "true"  # This enables tracing
    os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
    
    # Now traces are captured
    chain.invoke({"question": "What is Python?"})

    The most common mistake is simply forgetting to set LANGCHAIN_TRACING_V2. The chain works perfectly without it — you just get no traces, and you do not realize it until you need them.

    Same project for dev and production
    # Everything goes to one project
    os.environ["LANGCHAIN_PROJECT"] = "my-app"
    Separate projects for each environment
    # Dev traces and prod traces stay separate
    env = os.getenv("APP_ENV", "development")
    os.environ["LANGCHAIN_PROJECT"] = f"my-app-{env}"

    Mixing dev and production traces makes filtering painful. Your production monitoring dashboards will include test runs, and your evaluations will accidentally use production data. Always separate by environment.

    Other mistakes I see frequently:

  • Not adding metadata to production runs — without user_id or session_id, you cannot trace a user complaint back to a specific run
  • Building huge evaluation datasets before shipping — start with 5-10 examples, ship, then grow the dataset from real production failures
  • Ignoring token usage data — LangSmith tracks tokens per run; a sudden spike in token usage usually means a prompt regression

    Exercise 3: Filter Traces by Metadata
    Write Code

    Write a function filter_traces(traces, key, value) that takes a list of trace dictionaries (each with a "metadata" dict) and returns only the traces where metadata[key] equals value. If a trace does not have the specified key in its metadata, skip it.


    Pricing, Limits, and Best Practices

    LangSmith offers a free tier that is generous enough for development and small-scale production. The free tier includes a monthly trace allowance with limited retention. Paid tiers remove volume limits and extend data retention significantly.

    Best practices I have settled on after using LangSmith across multiple production applications:

  • Separate projects per environment — my-app-dev, my-app-staging, my-app-prod
  • Always include user/session metadata — you will need it for debugging user-reported issues
  • Start your eval dataset small — 5-10 examples, then grow from production failures
  • Run evaluations in CI/CD — before deploying a new prompt version, run it against the eval dataset and compare to the previous version
  • Sample production evaluations — running LLM-as-Judge on every single trace is expensive; 10-20% sampling usually catches quality regressions
  • Tag traces with version numbers — when you deploy a new version, tag it so you can compare metrics before and after

    Complete Code

    Here is a single script that demonstrates the core LangSmith workflow: setup, tracing, custom functions, evaluation dataset creation, and running evaluations. Replace the API keys with your own before running.

    Complete LangSmith workflow script
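A condensed sketch of that workflow; the keys are placeholders, and the prompt, model, dataset name, and project name are assumptions:

```python
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import Client, traceable
from langsmith.evaluation import evaluate

# --- Setup: replace the placeholder keys with your own ---
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"

# --- A traced chain plus a custom traced step ---
chain = (
    ChatPromptTemplate.from_template("Answer concisely: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def qa(question: str) -> str:
    return chain.invoke({"question": question})

print(qa("What is a list comprehension?"))

# --- Evaluation dataset ---
client = Client()
dataset = client.create_dataset(dataset_name="demo-eval")
client.create_example(
    inputs={"question": "What is a list comprehension?"},
    outputs={"answer": "A concise syntax for building lists from iterables."},
    dataset_id=dataset.id,
)

# --- Custom evaluator and evaluation run ---
def keyword_overlap(run, example):
    predicted = (run.outputs or {}).get("output", "").lower()
    keywords = set(example.outputs["answer"].lower().split())
    hits = sum(1 for word in keywords if word in predicted)
    return {"key": "keyword_overlap", "score": hits / len(keywords)}

evaluate(
    lambda inputs: {"output": qa(inputs["question"])},
    data="demo-eval",
    evaluators=[keyword_overlap],
    experiment_prefix="demo",
)
```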

    Frequently Asked Questions

    Can I use LangSmith without LangChain?

    Yes. The @traceable decorator from the langsmith package works with any Python function, regardless of whether it uses LangChain. You can trace plain OpenAI SDK calls, custom retrieval functions, or any Python code. You just lose the automatic sub-step tracing that LangChain runnables provide — you need to nest @traceable calls yourself.

    Using LangSmith without LangChain
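A sketch of nested @traceable functions around a plain OpenAI SDK call; the stub retriever and model name are assumptions:

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def retrieve_context(question: str) -> str:
    # Stand-in for a real retrieval step
    return "Python is a high-level, general-purpose programming language."

@traceable
def answer_question(question: str) -> str:
    context = retrieve_context(question)  # nested call becomes a child run
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content

print(answer_question("What is Python?"))
```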

    Does tracing add latency to my application?

    Trace data is serialized and uploaded asynchronously in a background thread. In practice, the overhead is negligible for typical LLM applications where the LLM call itself takes hundreds of milliseconds to seconds. The tracing adds single-digit milliseconds of overhead.

    How do I disable tracing in specific environments?

    Either do not set LANGCHAIN_TRACING_V2, or set it to "false". LangChain checks this variable at runtime — if it is absent or not "true", no traces are sent. For fine-grained control, you can also set it in code based on conditions.

    Conditional tracing
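A sketch of conditional tracing; the APP_ENV naming convention is an assumption:

```python
import os

# Enable tracing only in production
is_production = os.getenv("APP_ENV", "development") == "production"
os.environ["LANGCHAIN_TRACING_V2"] = "true" if is_production else "false"
```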

    Can I self-host LangSmith?

    LangSmith offers a self-hosted option for enterprise customers who cannot send trace data to external servers. The self-hosted version runs in your own infrastructure (Kubernetes) and provides the same UI and API as the cloud version. Check the LangSmith documentation for setup instructions and licensing details.
