
LangSmith: Trace, Debug, Evaluate, and Monitor LLM Applications

Intermediate · 90 min · 3 exercises · 45 XP

Your LangChain pipeline works on three test inputs. You ship it. Then a user reports nonsense output. You stare at the chain definition and have no idea which step broke, what the intermediate values were, or whether the prompt even received the right context.

I have been in this exact situation more times than I care to admit. LangSmith changed everything — it records every single step of every single run, so instead of guessing, you replay the failure and see exactly what happened.

What Is LangSmith and Why Does It Matter?

I think of it as the "print-statement debugger" you wish you had for LLM apps, except it works in production and keeps a permanent record. Every prompt that went in, every completion that came back, every tool call your agent made — all captured automatically.

The platform has four pillars that map directly to the lifecycle of an LLM application:

Without LangSmith:

# Your chain fails in production
# You add print() statements everywhere
# You try to reproduce locally
# You still can't find the bug
# Hours wasted

With LangSmith:

# Your chain fails in production
# Open LangSmith dashboard
# Click the red (errored) trace
# See exact input that caused failure
# Fix in 30 seconds

The four pillars:

  • Tracing — Record every step of a chain run, including inputs, outputs, token counts, and latencies
  • Debugging — Inspect failing runs, compare successful vs. failed traces, and isolate the broken step
  • Evaluation — Build datasets of input/expected-output pairs and score your chain with automated evaluators (including LLM-as-Judge)
  • Monitoring — Track success rates, latency percentiles, cost, and custom metrics in production

Setup and Your First Trace

    The setup surprised me when I first tried LangSmith — it is genuinely just two environment variables. LangSmith tracing activates the moment LangChain sees LANGCHAIN_TRACING_V2 set to "true". No code changes to your existing chains.

    Environment setup for LangSmith
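A minimal sketch of that setup; the project name is illustrative, and the key shown is a placeholder:

```python
import os

# Turn on tracing for every LangChain call in this process.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Your API key from smith.langchain.com (placeholder shown).
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
# Optional: group traces under a named project instead of "default".
os.environ["LANGCHAIN_PROJECT"] = "my-first-traces"
```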

    Here is a simple chain — run it and then check your LangSmith dashboard to see the trace appear.

    A simple chain that gets traced automatically
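A sketch of such a chain, assuming the langchain-openai package is installed and OPENAI_API_KEY plus the tracing variables are set (the model name is illustrative):

```python
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

os.environ["LANGCHAIN_TRACING_V2"] = "true"  # tracing on; no other code changes

prompt = ChatPromptTemplate.from_template("Answer briefly: {question}")
llm = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | llm | StrOutputParser()

# Each .invoke() produces one trace in your LangSmith project.
print(chain.invoke({"question": "What is a Python decorator?"}))
```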

    Open smith.langchain.com and navigate to your project. You will see a new trace with the full execution tree: the prompt template rendering, the LLM call with exact input/output tokens, and the output parser step. Each node shows its latency, so you immediately know where time is spent.

    Understanding Traces — The Execution Tree

    A trace is a tree of "runs." The root run is your top-level chain invocation. Each step inside the chain — prompt formatting, LLM call, output parsing, tool invocation — becomes a child run. Agent sub-steps become grandchild runs.


    This chain generates a quiz question and then evaluates it — two LLM calls that create a deeper trace tree.

    Multi-step chain with a deeper trace
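A sketch along those lines, with the same package and key assumptions as above; both invocations are separate top-level calls:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")
parser = StrOutputParser()

generate = (
    ChatPromptTemplate.from_template("Write one quiz question about {topic}.")
    | llm | parser
)
grade = (
    ChatPromptTemplate.from_template(
        "Rate this quiz question from 1 to 5 and explain briefly:\n{question}"
    )
    | llm | parser
)

question = generate.invoke({"topic": "Python generators"})  # first root trace
review = grade.invoke({"question": question})               # second root trace
print(review)
```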

    In LangSmith, this shows up as two separate root traces (one per .invoke() call). Each root has three child nodes: ChatPromptTemplate → ChatOpenAI → StrOutputParser. Click any node to see its exact input and output.

    Tracing Non-LangChain Code with @traceable

    Not everything in your pipeline is a LangChain runnable. I have a validation function, a post-processing step, and a custom retriever in most of my projects — all plain Python. The @traceable decorator from the langsmith package wraps any function and includes it in the trace tree.

    Using @traceable for custom functions
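A sketch using the function names the trace description below refers to; the answer step is a plain-string stand-in so the example runs without a live chain:

```python
from langsmith import traceable

@traceable
def validate_input(question: str) -> str:
    # Plain Python step, now visible as a node in the trace tree.
    if not question.strip():
        raise ValueError("Question must not be empty")
    return question.strip()

@traceable
def format_response(answer: str) -> str:
    return answer.strip()

@traceable
def python_qa_pipeline(question: str) -> str:
    cleaned = validate_input(question)
    # In a real pipeline this would be your LangChain chain.invoke(...);
    # the stand-in keeps the sketch self-contained.
    answer = f"Answer to: {cleaned}"
    return format_response(answer)

print(python_qa_pipeline("What is a list comprehension?"))
```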

    In LangSmith, the trace tree now shows python_qa_pipeline as the root, with validate_input, the LangChain chain (and its sub-steps), and format_response as children — all in one unified trace. If validate_input raises a ValueError, you see the exception directly in the trace without any log-hunting.

    Debugging Failing Chains

    This is where LangSmith earns its keep. When a chain fails in production, you do not need to reproduce the bug locally. Open the trace, find the red (errored) node, and read the exact input that caused the failure.

    This chain deliberately fails on certain inputs, so we can see what the error trace looks like.

    A chain with a deliberate failure point
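The failure mode can be reproduced without a live model; this stand-in isolates the parsing step:

```python
import json

def parse_json_output(text: str) -> dict:
    # The step that turns red in the trace when the model misbehaves.
    return json.loads(text)

# A well-behaved completion parses cleanly:
print(parse_json_output('{"answer": "42"}'))  # → {'answer': '42'}

# But models often wrap JSON in markdown code fences, and json.loads raises:
fenced = "```json\n" + '{"answer": "42"}' + "\n```"
try:
    parse_json_output(fenced)
except json.JSONDecodeError as err:
    print(f"Parse failed: {err}")
```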

    When the LLM wraps its JSON in markdown code fences (a common failure mode), json.loads() fails. In LangSmith, you see the parse_json_output node highlighted in red with the full exception traceback. More importantly, you can click the parent LLM node and read the exact output that caused the parsing failure.

    Comparing Successful and Failed Runs

    LangSmith lets you select two runs and compare them side by side. I use this constantly — find one successful trace and one failed trace for the same chain, compare them, and the difference is usually obvious.

    Evaluation Datasets — Systematic Testing for LLM Apps

    Spot-checking a few inputs is how most people test their LLM apps. It works until it doesn’t. I shipped a chain that passed my five manual tests, then discovered it hallucinated on a completely normal input the next day.


    Evaluation datasets bring the discipline of unit testing to LLM development. Define a set of inputs with expected outputs, run your chain against all of them, and measure how well it performs.

    Creating an evaluation dataset
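A sketch using the langsmith client (requires LANGCHAIN_API_KEY in the environment; the dataset name and examples are illustrative):

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

dataset = client.create_dataset(
    dataset_name="python-qa-eval",
    description="Q&A pairs for the Python documentation assistant",
)

examples = [
    ("What is a list comprehension?",
     "A concise syntax for building a list from an iterable."),
    ("What does the yield keyword do?",
     "It turns a function into a generator that produces values lazily."),
]
for question, answer in examples:
    client.create_example(
        inputs={"question": question},
        outputs={"answer": answer},
        dataset_id=dataset.id,
    )
```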

    The dataset now lives in your LangSmith account. You can also add examples through the UI, or convert production traces into evaluation examples with one click — which is a fantastic way to build your test set from real user queries.

    Running Evaluations with Custom and Built-in Evaluators

    A dataset alone is just data. The power comes from evaluators — functions that score each chain output against the expected output. LangSmith supports two flavors: custom Python evaluators (fast, deterministic) and LLM-as-Judge evaluators (nuanced, costlier).

    Custom Python Evaluator

    The simplest evaluator is a plain function. It receives the run output and the reference (expected) output, and returns a score.

    A custom evaluator that checks keyword overlap
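A sketch of such an evaluator. Recent versions of the langsmith SDK can call plain functions whose parameters are selected by name (outputs, reference_outputs); treat the exact signature as version-dependent:

```python
def keyword_overlap(outputs: dict, reference_outputs: dict) -> dict:
    """Score = fraction of reference keywords found in the predicted answer."""
    predicted = outputs["answer"].lower()
    keywords = reference_outputs["answer"].lower().split()
    if not keywords:
        return {"key": "keyword_overlap", "score": 1.0}
    hits = sum(1 for word in keywords if word in predicted)
    return {"key": "keyword_overlap", "score": hits / len(keywords)}

# Deterministic and fast, ideal for catching gross regressions.
print(keyword_overlap(
    {"answer": "A list comprehension builds a list in one expression."},
    {"answer": "list comprehension"},
))  # → {'key': 'keyword_overlap', 'score': 1.0}
```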

    LLM-as-Judge Evaluator

    Keyword overlap is crude. For nuanced quality assessment, use an LLM to judge whether the answer is correct, complete, and well-written. The langsmith and langchain packages provide built-in evaluator classes for this.

    LLM-as-Judge evaluator
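One way to wire this up, assuming a recent langsmith SDK: LangChainStringEvaluator wraps LangChain's built-in string evaluators, and the judge model it uses by default is SDK-version-dependent.

```python
from langsmith.evaluation import LangChainStringEvaluator

# "cot_qa" asks a judge model to reason step by step about whether the
# predicted answer matches the reference before emitting a correctness score.
qa_judge = LangChainStringEvaluator("cot_qa")

# Pass it alongside custom evaluators when running an evaluation, e.g.:
# evaluate(run_chain, data="python-qa-eval", evaluators=[qa_judge, ...])
```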

    With evaluators defined, we run the evaluation against our dataset. LangSmith orchestrates the whole process: it feeds each dataset example to your chain, collects the outputs, runs every evaluator, and aggregates the scores.

    Running the evaluation
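A sketch of the evaluation run. The target and evaluator here are self-contained stubs so the example stands alone; swap in your real chain and evaluators, and note the dataset name assumes one already exists in your account:

```python
from langsmith.evaluation import evaluate

def run_chain(inputs: dict) -> dict:
    # Target function: replace the stub with your real chain.invoke(...).
    return {"answer": f"stub answer for {inputs['question']}"}

def contains_reference(outputs: dict, reference_outputs: dict) -> dict:
    # Minimal inline evaluator; add LLM-as-Judge evaluators to the list too.
    score = float(reference_outputs["answer"].lower() in outputs["answer"].lower())
    return {"key": "contains_reference", "score": score}

results = evaluate(
    run_chain,
    data="python-qa-eval",           # name of your dataset in LangSmith
    evaluators=[contains_reference],
    experiment_prefix="python-qa-baseline",
)
```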

    The results page in LangSmith shows a table with each example, the chain output, and all evaluator scores side by side. You can sort by score to quickly find the weakest answers, then drill into the trace to understand what went wrong.

    Exercise 1: Build a Length-Based Evaluator
    Write Code

    Write a function length_evaluator(predicted, reference) that scores how close the predicted answer's length is to the reference answer's length. Return 1.0 if they are the same length, and reduce the score proportionally as they diverge. Use the formula: 1 - abs(len(predicted) - len(reference)) / max(len(predicted), len(reference)). If both strings are empty, return 1.0.


    Production Monitoring — Keeping Your LLM App Healthy

    Evaluation tells you how your chain performs on a fixed test set. Monitoring tells you how it performs on real traffic — and real traffic always surprises you. LangSmith captures every production run and lets you build dashboards, set up alerts, and track trends.

    Setting Up Production Tracing

    The first thing I set up for any production deployment: a separate LangSmith project with explicit user and session tracking.

    Production tracing configuration
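A sketch of that configuration; the project name, IDs, and tags are illustrative:

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Keep production traces out of your development project.
os.environ["LANGCHAIN_PROJECT"] = "my-app-production"

# Metadata rides along with each run via the standard `config` argument
# accepted by any runnable.
request_config = {
    "metadata": {"user_id": "user-1234", "session_id": "sess-5678"},
    "tags": ["production", "qa-endpoint"],
}
# chain.invoke({"question": question}, config=request_config)
```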

    With user_id and session_id in the metadata, you can filter traces in LangSmith by user or session. When a user reports an issue, you search for their ID and see every interaction they had.

    Online Evaluation with Automation Rules

    LangSmith supports automation rules that run evaluators on production traces automatically. Configure a rule in the UI that triggers an LLM-as-Judge evaluator on every incoming trace (or a sample). Continuous quality monitoring, zero additional code.

    Annotation Queues — Human Feedback at Scale

    Automated evaluators are useful, but some quality judgments need human eyes. LangSmith’s annotation queues let you route traces to human reviewers who label them directly in the UI. Reviewers see the input, the chain output, and any evaluator scores, then add their own judgment.

    Creating an annotation queue and sending runs to it
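A sketch with the langsmith client; the queue name and project are illustrative, and the exact client methods may vary by SDK version:

```python
from langsmith import Client

client = Client()

queue = client.create_annotation_queue(
    name="qa-review",
    description="Production answers flagged for human review",
)

# Send the 20 most recent successful production runs to reviewers.
runs = client.list_runs(project_name="my-app-production", error=False, limit=20)
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run.id for run in runs],
)
```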

    This is where monitoring connects back to evaluation. Human-reviewed traces with corrected outputs become new examples in your evaluation dataset, closing the feedback loop.

    The LangSmith SDK — Programmatic Access to Traces

    Everything you can do in the LangSmith UI, you can do programmatically with the langsmith Python client. I use this for nightly quality reports, CI/CD gate checks before deployment, and custom Slack alerts when error rates spike.

    Querying traces programmatically
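A sketch of the client in action (requires LANGCHAIN_API_KEY; the project name is illustrative):

```python
from datetime import datetime, timedelta
from langsmith import Client

client = Client()  # uses LANGCHAIN_API_KEY from the environment

# Every run from the last 24 hours in the production project.
runs = client.list_runs(
    project_name="my-app-production",
    start_time=datetime.now() - timedelta(days=1),
)
for run in runs:
    print(run.name, run.status, run.total_tokens)
```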

    You can filter runs by tags, metadata, error status, latency range, and more. Nightly quality reports and CI/CD gate checks become straightforward.

    Finding error runs and extracting details
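Filtering down to errored runs follows the same pattern; a sketch:

```python
from langsmith import Client

client = Client()

# Only the runs that raised an exception.
for run in client.list_runs(project_name="my-app-production", error=True):
    print(run.name)
    print("  inputs:", run.inputs)
    print("  error:", run.error)
```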
    Exercise 2: Compute Error Rate from Run Data
    Write Code

    Write a function compute_error_rate(runs) that takes a list of run dictionaries (each with a "status" key that is either "success" or "error") and returns the error rate as a float between 0.0 and 1.0, rounded to 2 decimal places. If the list is empty, return 0.0.


    Real-World Example: End-to-End LLM App with LangSmith

    Time for a complete example. We will build a Python documentation Q&A pipeline, trace it with LangSmith, create an evaluation dataset, and run evaluations — the full workflow in one script.

    Complete Q&A pipeline with LangSmith integration
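A sketch of the pipeline, with the same package and API-key assumptions as before; the model, project name, and questions are illustrative:

```python
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import traceable

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "python-docs-qa"

chain = (
    ChatPromptTemplate.from_template(
        "You are a Python documentation assistant. Answer concisely:\n{question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def answer(question: str) -> str:
    return chain.invoke({"question": question})

for q in [
    "What is a list comprehension?",
    "What does the yield keyword do?",
    "How do context managers work?",
]:
    print(answer(q))
```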

    All three calls are now visible as traced runs in LangSmith. The next step is to formalize these into an evaluation dataset and measure quality.

    Create eval dataset and run evaluation
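A sketch of that step; the target and evaluator are placeholders so the example stands alone:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()
dataset = client.create_dataset(dataset_name="python-docs-qa-eval")
client.create_example(
    inputs={"question": "What is a list comprehension?"},
    outputs={"answer": "A concise syntax for building a list from an iterable."},
    dataset_id=dataset.id,
)

def target(inputs: dict) -> dict:
    # Swap the stub for your traced Q&A pipeline.
    return {"answer": f"stub answer for {inputs['question']}"}

def nonempty(outputs: dict) -> dict:
    # Placeholder check; substitute your real evaluators here.
    return {"key": "nonempty", "score": float(bool(outputs["answer"].strip()))}

evaluate(target, data="python-docs-qa-eval", evaluators=[nonempty])
```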

    Common Mistakes and How to Fix Them

    I have seen each of these trip up teams adopting LangSmith. They are all easy to fix once you know about them.

    Forgetting to set the tracing env var
    import os
    os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
    # Missing LANGCHAIN_TRACING_V2!
    
    # This chain runs fine but sends NO traces
    chain.invoke({"question": "What is Python?"})
    Always set LANGCHAIN_TRACING_V2
    import os
    os.environ["LANGCHAIN_TRACING_V2"] = "true"  # This enables tracing
    os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."
    
    # Now traces are captured
    chain.invoke({"question": "What is Python?"})

    The most common mistake is simply forgetting to set LANGCHAIN_TRACING_V2. The chain works perfectly without it — you just get no traces, and you do not realize it until you need them.

    Same project for dev and production
    # Everything goes to one project
    os.environ["LANGCHAIN_PROJECT"] = "my-app"
    Separate projects for each environment
    # Dev traces and prod traces stay separate
    env = os.getenv("APP_ENV", "development")
    os.environ["LANGCHAIN_PROJECT"] = f"my-app-{env}"

    Mixing dev and production traces makes filtering painful. Your production monitoring dashboards will include test runs, and your evaluations will accidentally use production data. Always separate by environment.

    Other mistakes I see frequently:

  • Not adding metadata to production runs — without user_id or session_id, you cannot trace a user complaint back to a specific run
  • Building huge evaluation datasets before shipping — start with 5-10 examples, ship, then grow the dataset from real production failures
  • Ignoring token usage data — LangSmith tracks tokens per run; a sudden spike in token usage usually means a prompt regression

Exercise 3: Filter Traces by Metadata
    Write Code

    Write a function filter_traces(traces, key, value) that takes a list of trace dictionaries (each with a "metadata" dict) and returns only the traces where metadata[key] equals value. If a trace does not have the specified key in its metadata, skip it.


    Pricing, Limits, and Best Practices

    LangSmith offers a free tier that is generous enough for development and small-scale production. The free tier includes a monthly trace allowance with limited retention. Paid tiers remove volume limits and extend data retention.

    These are the best practices I have settled on after using LangSmith across multiple production applications: keep a separate project per environment, attach user and session metadata to every production run, start with a small evaluation dataset and grow it from real failures, and watch per-run token usage for prompt regressions.

    Complete Code

    Here is a single script that demonstrates the core LangSmith workflow: setup, tracing, custom functions, evaluation dataset creation, and running evaluations. Replace the API keys with your own before running.

    Complete LangSmith workflow script
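A condensed sketch of that script; every name in it (project, dataset, model) is illustrative, and it needs live OpenAI and LangSmith keys to actually run:

```python
"""End-to-end LangSmith workflow: tracing, custom steps, dataset, evaluation."""
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import Client, traceable
from langsmith.evaluation import evaluate

# --- Setup: two environment variables enable tracing ---
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."  # replace with your key
os.environ["OPENAI_API_KEY"] = "sk-..."          # replace with your key
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"

# --- A traced chain plus a custom traced step ---
chain = (
    ChatPromptTemplate.from_template("Answer concisely: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

@traceable
def qa(question: str) -> str:
    return chain.invoke({"question": question})

# --- Evaluation dataset ---
client = Client()
dataset = client.create_dataset(dataset_name="langsmith-demo-eval")
client.create_example(
    inputs={"question": "What is a list comprehension?"},
    outputs={"answer": "A concise syntax for building a list from an iterable."},
    dataset_id=dataset.id,
)

# --- A simple custom evaluator ---
def keyword_overlap(outputs: dict, reference_outputs: dict) -> dict:
    predicted = outputs["answer"].lower()
    keywords = reference_outputs["answer"].lower().split()
    score = sum(w in predicted for w in keywords) / max(len(keywords), 1)
    return {"key": "keyword_overlap", "score": score}

# --- Run the evaluation ---
evaluate(
    lambda inputs: {"answer": qa(inputs["question"])},
    data="langsmith-demo-eval",
    evaluators=[keyword_overlap],
    experiment_prefix="demo",
)
```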

    LangSmith vs Alternatives — When to Use What

    LangSmith is not the only LLM observability platform. I have tried several alternatives, and the choice depends on your stack and priorities.

    Platform               | Best For                                                           | LangChain Integration  | Self-Hosted     | Open Source
    -----------------------|--------------------------------------------------------------------|------------------------|-----------------|------------
    LangSmith              | LangChain/LangGraph apps, teams already in the LangChain ecosystem | Native (automatic)     | Enterprise plan | No
    Arize Phoenix          | Framework-agnostic tracing, OpenTelemetry-native teams             | Via OpenInference      | Yes (free)      | Yes
    Weights & Biases Weave | Teams already using W&B for ML experiment tracking                 | Manual integration     | No              | Partially
    Helicone               | Simple request logging, cost tracking, rate limiting               | Proxy-based            | Yes             | Yes
    OpenTelemetry + Jaeger | Teams with existing OTel infrastructure                            | Manual instrumentation | Yes             | Yes

    LangSmith Fetch — Debug Agents from Your Terminal

    LangSmith Fetch is a newer CLI tool that brings trace data directly into your terminal. Instead of switching to the web UI to inspect a trace, you run langsmith-fetch and pipe the output to your coding agent or read it directly.

    Installing and using LangSmith Fetch

    This is particularly powerful when paired with AI coding assistants. You can pipe trace data directly into Claude Code or Cursor, turning your coding agent into an expert LLM debugger that can see exactly what your agent did.

    Frequently Asked Questions

    Can I use LangSmith without LangChain?

    Yes. The @traceable decorator from the langsmith package works with any Python function, regardless of whether it uses LangChain. You can trace plain OpenAI SDK calls, custom retrieval functions, or any Python code. You just lose the automatic sub-step tracing that LangChain runnables provide — you need to nest @traceable calls yourself.

    Using LangSmith without LangChain
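A sketch of that pattern, assuming the openai and langsmith packages and their API keys; run_type="llm" tells LangSmith to render the run as an LLM call:

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="llm")
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

@traceable
def pipeline(question: str) -> str:
    # Nest @traceable calls yourself to get sub-steps in the trace tree.
    return ask(question.strip())

print(pipeline("What is a Python decorator?"))
```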

    Does tracing add latency to my application?

    Trace data is serialized and uploaded asynchronously in a background thread, adding single-digit milliseconds of overhead. That is negligible for typical LLM applications, where the LLM call itself takes hundreds of milliseconds to seconds.

    How do I disable tracing in specific environments?

    Either do not set LANGCHAIN_TRACING_V2, or set it to "false". LangChain checks this variable at runtime — if it is absent or not "true", no traces are sent. For fine-grained control, you can also set it in code based on conditions.

    Conditional tracing
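A sketch of conditional tracing, keyed off a deployment environment variable (APP_ENV here is an illustrative name):

```python
import os

def configure_tracing(app_env: str) -> None:
    # Trace only in staging and production; keep local dev runs out.
    enabled = app_env in ("staging", "production")
    os.environ["LANGCHAIN_TRACING_V2"] = "true" if enabled else "false"

configure_tracing(os.getenv("APP_ENV", "development"))
```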

    Can I self-host LangSmith?

    LangSmith offers a self-hosted option for enterprise customers who cannot send trace data to external servers. The self-hosted version runs in your own infrastructure (Kubernetes) and provides the same UI and API as the cloud version. Check the LangSmith documentation for setup instructions and licensing details.

