
LangChain Callbacks: Custom Logging, Cost Tracking, and Streaming

Intermediate · 60 min · 3 exercises · 45 XP

You ship an LLM app, and it works. But then your boss asks: "How much did that cost last month?" You don't know. A user reports a weird answer. You can't reproduce it because you never logged the prompt. The model starts returning empty strings at 2 AM, and nobody notices until morning.

LangChain callbacks solve all three problems. They let you hook into every step of an LLM call — before the prompt is sent, after the response arrives, when tokens stream in, when something fails — and run your own code at each point. If you've built anything with LangChain, adding callbacks is one of the highest-impact improvements you can make to its observability.

What Are Callbacks and Why Do They Matter?

A callback is a function that LangChain calls automatically at a specific moment during execution. You don't call it yourself — you register it, and LangChain fires it at the right time.

To create a callback handler, you subclass BaseCallbackHandler and override the event methods you care about. The two most common are on_llm_start (fires before the API request) and on_llm_end (fires after the response arrives). LangChain provides no-op defaults for every method, so you only implement what you need.

A minimal callback handler
Loading editor...
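A minimal sketch of such a handler. The try/except fallback stub is ours, so the snippet runs even without LangChain installed; the attachment lines in the comments assume the langchain-openai package and an API key:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class MinimalHandler(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        # Fires just before the API request goes out
        print(f"LLM call starting with {len(prompts)} prompt(s)")

    def on_llm_end(self, response, **kwargs):
        # Fires once the full response has arrived
        print(f"LLM call finished. Got {len(response.generations)} generation(s)")

# Attach it to a real model (requires langchain-openai and an API key):
# llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[MinimalHandler()])
# print(llm.invoke("What is Python?").content)
```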

The first two lines come from our callback. The third is the model's response (yours will differ):

LLM call starting with 1 prompt(s)
LLM call finished. Got 1 generation(s)
Python is a high-level, interpreted programming language...

The Callback Lifecycle — Every Hook You Can Use

LangChain doesn't give you just two hooks. It gives you over a dozen, covering every phase of execution across LLMs, chains, tools, and retrievers. I find that most tutorials only show on_llm_start and on_llm_end. Knowing the full set opens up much better logging and debugging.

Here are the callback methods grouped by the component they monitor. You override only the ones you care about — BaseCallbackHandler provides no-op defaults for everything else:

LLM Events:

  • on_llm_start(serialized, prompts, **kwargs) — Before the LLM API call
  • on_llm_new_token(token, **kwargs) — Each streamed token arrives
  • on_llm_end(response, **kwargs) — After the complete response
  • on_llm_error(error, **kwargs) — When the LLM call fails
  • Chain Events:

  • on_chain_start(serialized, inputs, **kwargs) — Before a chain runs
  • on_chain_end(outputs, **kwargs) — After a chain completes
  • on_chain_error(error, **kwargs) — When a chain fails
  • Tool Events:

  • on_tool_start(serialized, input_str, **kwargs) — Before a tool executes
  • on_tool_end(output, **kwargs) — After a tool returns
  • on_tool_error(error, **kwargs) — When a tool fails
  • Retriever Events:

  • on_retriever_start(serialized, query, **kwargs) — Before a retriever searches
  • on_retriever_end(documents, **kwargs) — After documents are returned
  • on_retriever_error(error, **kwargs) — When retrieval fails
  • The lifecycle of a single LLM call flows like this: on_llm_start → (optionally) on_llm_new_token repeated for each token → on_llm_end. If something goes wrong at any point, on_llm_error fires instead of on_llm_end. For chains, the pattern nests: on_chain_starton_llm_start → ... → on_llm_endon_chain_end.

    Build a Logging Handler

    The first callback I build for any project is a logger. Not print() statements scattered through the code — a proper structured logger that captures the prompt, model name, latency, and token counts for every single LLM call. When something goes wrong in production, this log is what saves you.

This handler uses Python's logging and time modules to measure latency. The on_llm_start method records the timestamp and model name; on_llm_end computes the elapsed time and extracts token counts from the response metadata. The on_llm_error method captures failures with their duration:

    Structured logging callback
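A sketch of the handler, approximating the sample log format below. Two hedges: the import fallback stub is ours, and where the model name appears in on_llm_start's kwargs varies by LangChain version, so treat that lookup as an assumption. Note the single start_time attribute, which the production example later replaces with a run_id map:

```python
import logging
import time

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

logging.basicConfig(format="%(asctime)s | %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
logger = logging.getLogger("llm")

class LoggingHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None  # single slot — fine for serial calls only

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()
        # Assumption: the model name rides along in invocation_params;
        # fall back to "unknown" if this version of LangChain puts it elsewhere.
        model = kwargs.get("invocation_params", {}).get("model_name", "unknown")
        logger.info("LLM START | model=%s | prompt_chars=%d", model, len(prompts[0]))

    def on_llm_end(self, response, **kwargs):
        duration = time.time() - self.start_time
        usage = (response.llm_output or {}).get("token_usage", {})
        logger.info("LLM END | duration=%.2fs | prompt_tokens=%s | completion_tokens=%s",
                    duration, usage.get("prompt_tokens"), usage.get("completion_tokens"))

    def on_llm_error(self, error, **kwargs):
        duration = time.time() - (self.start_time or time.time())
        logger.error("LLM ERROR | duration=%.2fs | %s: %s",
                     duration, type(error).__name__, error)

# Attach per-call (requires langchain-openai and an API key):
# llm.invoke("Explain Python.", config={"callbacks": [LoggingHandler()]})
```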

To attach it, pass an instance in the config dict when you call .invoke(). Every method in the handler fires automatically at the right moment.


    Running this produces structured log lines with timestamps, latency, and token counts:

    2026-03-06 14:22:01 | LLM START | model=gpt-4o-mini | prompt_chars=52
    2026-03-06 14:22:02 | LLM END | duration=1.34s | prompt_tokens=18 | completion_tokens=45

    Build a Cost Tracker

    Knowing what your LLM calls cost is critical — especially when you're comparing models. I prefer tracking costs inside my app rather than relying solely on the provider dashboard, because it lets me attribute costs to specific features or users. If you want a broader comparison of API costs across providers, check out our dedicated guide.

    This CostTracker handler maintains a pricing dictionary mapping model names to per-million-token rates. Each time on_llm_end fires, it extracts the prompt and completion token counts from the response metadata, multiplies by the model-specific rate, and accumulates totals. The report() method returns a human-readable summary at any point:

    Token cost tracking callback
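A sketch of the tracker. The per-million-token rates below are illustrative examples, not authoritative pricing — check your provider's current price list — and the import fallback stub is ours:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class CostTracker(BaseCallbackHandler):
    # USD per 1M tokens — example rates; verify against current provider pricing
    PRICING = {
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4o": {"input": 2.50, "output": 10.00},
    }

    def __init__(self):
        self.total_cost = 0.0
        self.total_prompt_tokens = 0
        self.total_completion_tokens = 0
        self.call_count = 0

    def on_llm_end(self, response, **kwargs):
        llm_output = response.llm_output or {}
        model = llm_output.get("model_name", "unknown")
        usage = llm_output.get("token_usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        rates = self.PRICING.get(model, {"input": 0.0, "output": 0.0})
        self.total_cost += (prompt_tokens * rates["input"]
                            + completion_tokens * rates["output"]) / 1_000_000
        self.total_prompt_tokens += prompt_tokens
        self.total_completion_tokens += completion_tokens
        self.call_count += 1

    def report(self):
        total_tokens = self.total_prompt_tokens + self.total_completion_tokens
        return f"{self.call_count} calls | {total_tokens} tokens | ${self.total_cost:.6f}"

# Reuse ONE instance across calls so the totals accumulate:
# tracker = CostTracker()
# for q in questions:
#     llm.invoke(q, config={"callbacks": [tracker]})
# print(tracker.report())
```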

Reuse one CostTracker instance across all your calls so the counters accumulate — for example, run three questions through the same tracker and print a summary at the end.


    With gpt-4o-mini, three short questions typically cost well under a tenth of a cent. Swap the model to gpt-4o and the same calls cost roughly 17x more. The tracker makes that difference visible before your invoice does.

    Exercise 1: Build a Per-Model Cost Summary

    Complete the per_model_report() method that returns a dictionary mapping each model name to its total cost. The callback's on_llm_end method already tracks costs. You need to also store costs per model in self.model_costs (a defaultdict(float)) and implement per_model_report() to return that dictionary.

    The model name is available from response.llm_output.get("model_name", "unknown").


    Streaming Callbacks — Token-by-Token Output

    When ChatGPT shows text appearing word by word, that's streaming. With LangChain callbacks, you can intercept every single token as it arrives and do whatever you want with it — print it, send it over a WebSocket, feed it to a progress bar, or buffer it for post-processing.

    The key method is on_llm_new_token. It fires once per token, receiving just the new text fragment. This handler prints each token using sys.stdout.write() (instead of print(), which would add a newline after each partial word) and keeps a running count:

    Streaming token handler
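A sketch of the streaming handler (the import fallback stub is ours; the commented attachment assumes langchain-openai and an API key):

```python
import sys

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class StreamPrinter(BaseCallbackHandler):
    def __init__(self):
        self.token_count = 0

    def on_llm_new_token(self, token, **kwargs):
        # write() instead of print(): no newline after each partial word
        sys.stdout.write(token)
        sys.stdout.flush()
        self.token_count += 1

    def on_llm_end(self, response, **kwargs):
        print(f"\n--- {self.token_count} tokens streamed ---")

# Streaming must be enabled on the model or on_llm_new_token never fires:
# llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)
# result = llm.invoke("Write a haiku about Python.",
#                     config={"callbacks": [StreamPrinter()]})
```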

To enable streaming, set streaming=True on the LLM. Without this flag, on_llm_new_token never fires — LangChain waits for the full response and skips straight to on_llm_end.


    The output appears progressively in the terminal, token by token, ending with a count like --- 38 tokens streamed ---. The result variable still contains the full response after streaming completes.

    Combining Multiple Callbacks

    You're not limited to one handler. I typically run three simultaneously: a logger, a cost tracker, and an error alerter. Each has a single responsibility, and they all fire on the same LLM call. LangChain calls every handler in the list, in order, for each event.

    Stacking multiple handlers
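A sketch of stacking handlers. The three classes here are deliberately tiny stand-ins for the logger, cost tracker, and alerter; since running them for real needs an API key, the loop at the bottom simulates what LangChain does on one completed call, firing each handler in list order:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

events = []  # shared trace so the firing order is visible

class Logger(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        events.append("logger")

class CostCounter(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        events.append("cost")

class ErrorAlerter(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        events.append("alerter")

handlers = [Logger(), CostCounter(), ErrorAlerter()]

# Real usage (needs an API key):
# llm.invoke("...", config={"callbacks": handlers})

# Simulate one completed call — LangChain fires every handler, in order:
for h in handlers:
    h.on_llm_end(response=None)

print(events)  # → ['logger', 'cost', 'alerter']
```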

    Constructor-Level vs Invocation-Level Callbacks

    There are two places to attach callbacks, and the choice matters. Constructor-level callbacks attach to the LLM instance and fire on every call. Invocation-level callbacks attach to a single .invoke() call. This distinction applies to LCEL chains the same way it applies to individual LLM calls.

    Constructor-level (always active)
    # Attached at creation — fires on EVERY call
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        callbacks=[LoggingHandler()]
    )
    
    # Both of these calls trigger the handler
    llm.invoke("Question 1")
    llm.invoke("Question 2")
    Invocation-level (per-call)
    llm = ChatOpenAI(model="gpt-4o-mini")
    
    # Only this call triggers the handler
    llm.invoke("Question 1",
        config={"callbacks": [LoggingHandler()]})
    
    # This call has no callbacks
    llm.invoke("Question 2")

    My preference: constructor-level for production monitoring (logging, cost tracking) that should always be on. Invocation-level for temporary debugging or per-request features like streaming to a specific user's WebSocket.


    Error Handling with Callbacks

    API calls fail. Rate limits hit, networks time out, models return unexpected responses. Without callbacks, you're wrapping every .invoke() call in try/except blocks scattered across your codebase. An error-handling callback centralizes all failure logic in one place.

    This ErrorAlertHandler stores every failure as a structured record with a timestamp, error type, full message, and traceback. It hooks into both on_llm_error and on_chain_error to catch failures at every level. The get_error_summary method gives you a quick status check without digging through logs:

    Error alerting callback
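A sketch of the alerter (the import fallback stub is ours):

```python
import time
import traceback

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class ErrorAlertHandler(BaseCallbackHandler):
    def __init__(self):
        self.errors = []

    def _record(self, error, source):
        record = {
            "timestamp": time.time(),
            "source": source,
            "type": type(error).__name__,
            "message": str(error),
            "traceback": traceback.format_exc(),
        }
        self.errors.append(record)
        # In production, replace this print with a Slack webhook or PagerDuty call
        print(f"ALERT [{source}] {record['type']}: {record['message']}")

    def on_llm_error(self, error, **kwargs):
        self._record(error, "llm")

    def on_chain_error(self, error, **kwargs):
        self._record(error, "chain")

    def get_error_summary(self):
        return {
            "total_errors": len(self.errors),
            "last_error": self.errors[-1]["message"] if self.errors else None,
        }
```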

    In production, you'd replace the print() call with a real alerting mechanism — a Slack webhook, PagerDuty integration, or a metrics counter that triggers an alarm when the error rate spikes.

    Async Callbacks for High-Throughput Apps

    If your app handles multiple LLM calls concurrently — a FastAPI server, a streaming backend, or a batch pipeline — synchronous callbacks can become a bottleneck. LangChain supports async callback handlers through AsyncCallbackHandler.

    The interface is identical to BaseCallbackHandler, but methods are async def and can await I/O operations without blocking the event loop. This AsyncFileLogger writes each LLM event to a log file using aiofiles for non-blocking I/O:

    Async callback handler
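A sketch of the async logger. Both fallbacks are ours: the stub base class covers a missing LangChain install, and if aiofiles is absent the write degrades to a blocking call pushed off the event loop with asyncio.to_thread:

```python
import asyncio

try:
    from langchain_core.callbacks import AsyncCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class AsyncCallbackHandler:
        pass

try:
    import aiofiles  # true non-blocking file I/O
except ImportError:
    aiofiles = None

class AsyncFileLogger(AsyncCallbackHandler):
    def __init__(self, path="llm_events.log"):
        self.path = path

    async def _write(self, line):
        if aiofiles is not None:
            async with aiofiles.open(self.path, "a") as f:
                await f.write(line + "\n")
        else:
            # fallback: run the blocking write in a worker thread
            await asyncio.to_thread(self._blocking_write, line)

    def _blocking_write(self, line):
        with open(self.path, "a") as f:
            f.write(line + "\n")

    async def on_llm_start(self, serialized, prompts, **kwargs):
        await self._write(f"START | {len(prompts)} prompt(s)")

    async def on_llm_end(self, response, **kwargs):
        await self._write("END")

# Pair it with async invocation (needs an API key):
# result = await llm.ainvoke("...", config={"callbacks": [AsyncFileLogger()]})
```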

Use it with LangChain's async methods — ainvoke instead of invoke.


    Built-in Callback Handlers

    Before building everything from scratch, check whether LangChain's built-in handlers do what you need. These ship with langchain and langchain-core — no extra installation required.

  • StdOutCallbackHandler — prints all events to stdout (langchain_core.callbacks)
  • StreamingStdOutCallbackHandler — prints streamed tokens to stdout (langchain_core.callbacks)
  • FileCallbackHandler — writes all events to a file (langchain_community.callbacks)
  • ConsoleCallbackHandler — rich console output with colors (langchain_core.tracers)

    I reach for StdOutCallbackHandler when I just want to see what's happening during development. For production, I switch to a custom handler with structured logging. The built-in handlers are quick starting points — but custom handlers give you full control over format, filtering, and where the data goes.


    Real-World Example — A Production Monitoring Stack

    This handler combines logging, cost tracking, latency monitoring, and error counting into a single class with a dashboard-ready summary method. It's the kind of handler you'd wire up once and leave running in production.

    The key design choice is using run_id (a unique ID LangChain assigns to each invocation) to track concurrent calls. The earlier LoggingHandler used a single self.start_time attribute, which breaks if two calls overlap. Here, a dictionary maps each run_id to its start timestamp, so latencies stay accurate even under load:

    Production monitoring callback
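A sketch of the monitoring stack, with the dashboard() method folded in. Assumptions to flag: the import fallback stub is ours, and the flat per-token cost constant is a placeholder — a real deployment would use the per-model pricing table from the CostTracker section:

```python
import time
from collections import defaultdict

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class MonitoringHandler(BaseCallbackHandler):
    COST_PER_TOKEN = 0.60 / 1_000_000  # flat example rate — swap in per-model pricing

    def __init__(self):
        self._start_times = {}   # run_id -> start timestamp: safe under concurrency
        self.latencies = []
        self.total_tokens = 0
        self.total_cost = 0.0
        self.call_count = 0
        self.errors_by_type = defaultdict(int)

    def on_llm_start(self, serialized, prompts, *, run_id=None, **kwargs):
        self._start_times[run_id] = time.time()

    def on_llm_end(self, response, *, run_id=None, **kwargs):
        start = self._start_times.pop(run_id, None)
        if start is not None:
            self.latencies.append(time.time() - start)
        usage = (getattr(response, "llm_output", None) or {}).get("token_usage", {})
        tokens = usage.get("total_tokens", 0)
        self.total_tokens += tokens
        self.total_cost += tokens * self.COST_PER_TOKEN
        self.call_count += 1

    def on_llm_error(self, error, *, run_id=None, **kwargs):
        self._start_times.pop(run_id, None)
        self.errors_by_type[type(error).__name__] += 1
        self.call_count += 1

    def dashboard(self):
        lat = sorted(self.latencies)
        errors = sum(self.errors_by_type.values())
        return {
            "total_calls": self.call_count,
            "total_tokens": self.total_tokens,
            "total_cost_usd": round(self.total_cost, 4),
            "avg_latency_s": round(sum(lat) / len(lat), 2) if lat else 0,
            "p95_latency_s": round(lat[int(len(lat) * 0.95)], 2) if lat else 0,
            "error_rate_pct": round(errors / self.call_count * 100, 2)
                              if self.call_count else 0,
            "errors_by_type": dict(self.errors_by_type),
        }
```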

The dashboard() method aggregates the raw data into metrics you'd expose as a health-check API endpoint. It computes average latency, 95th-percentile latency (your worst-case user experience), and the error rate as a percentage.


    After a batch of calls, dashboard() returns a dictionary like this:

    {
        "total_calls": 150,
        "total_tokens": 45200,
        "total_cost_usd": 0.028,
        "avg_latency_s": 1.2,
        "p95_latency_s": 2.8,
        "error_rate_pct": 1.33,
        "errors_by_type": {"RateLimitError": 2}
    }
    Exercise 2: Compute Error Rate and Average Latency

    Complete the compute_stats() function. It receives a list of latencies (floats in seconds) and two integers: call_count and error_count. It should return a dictionary with three keys:

  • "avg_latency": the mean of the latencies list, rounded to 3 decimal places (0 if the list is empty)
  • "p95_latency": the value at the 95th percentile index (int(len * 0.95)), rounded to 3 decimal places (0 if the list is empty)
  • "error_rate": error_count / call_count * 100, rounded to 2 decimal places (0 if call_count is 0)

    Common Mistakes and How to Fix Them

    Callback bugs are subtle because callbacks fail silently — your app keeps working, but your monitoring is broken. These are the four mistakes I see most often.

    Mistake 1: Sharing State Across Concurrent Calls

    Wrong — single start_time attribute
    class BadHandler(BaseCallbackHandler):
        def __init__(self):
            self.start_time = None  # Overwritten by concurrent calls!
    
        def on_llm_start(self, serialized, prompts, **kwargs):
            self.start_time = time.time()
    
        def on_llm_end(self, response, **kwargs):
            elapsed = time.time() - self.start_time  # Wrong if another call started
    Correct — use run_id to track each call
    class GoodHandler(BaseCallbackHandler):
        def __init__(self):
            self._start_times = {}
    
        def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
            self._start_times[run_id] = time.time()
    
        def on_llm_end(self, response, *, run_id, **kwargs):
            elapsed = time.time() - self._start_times.pop(run_id, time.time())

    If two LLM calls overlap, the first call's start_time gets overwritten by the second. The run_id pattern keeps them isolated.

    Mistake 2: Raising Exceptions in Callbacks

An unhandled exception in a callback can crash your entire LLM call, even if the API response was fine. Always wrap callback logic in try/except to isolate monitoring failures from application logic.

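A sketch of the pattern: the handler below swallows its own failures (here triggered by response metadata that may legitimately be missing), so the LLM call it is monitoring always survives:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class SafeCostTracker(BaseCallbackHandler):
    def __init__(self):
        self.total_tokens = 0

    def on_llm_end(self, response, **kwargs):
        try:
            usage = response.llm_output["token_usage"]  # llm_output may be None
            self.total_tokens += usage["total_tokens"]
        except Exception as exc:
            # Monitoring must never take down the app: report and move on
            print(f"callback error (ignored): {exc}")
```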

    Mistake 3: Forgetting streaming=True for Token Callbacks

This one bites everyone at least once. You write a beautiful on_llm_new_token handler, wire it up, and nothing happens. The fix is a single parameter: pass streaming=True when you construct the model.


    Mistake 4: Creating a New Handler Instance per Call

If your callback tracks cumulative metrics (total cost, call count), creating a fresh instance for each .invoke() call resets the counters. Reuse the same instance.

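A sketch of the fix. Since a real run needs an API key, the wrong and right attachment patterns are shown in comments, and the loop simulates LangChain firing the event once per call against one shared instance:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class CallCounter(BaseCallbackHandler):
    def __init__(self):
        self.calls = 0

    def on_llm_end(self, response, **kwargs):
        self.calls += 1

# Wrong — a fresh instance per call, so the counter never exceeds 1:
# for q in questions:
#     llm.invoke(q, config={"callbacks": [CallCounter()]})

# Right — one shared instance accumulates across calls (simulated here):
counter = CallCounter()
for _ in range(3):
    counter.on_llm_end(response=None)  # LangChain would fire this once per call

print(counter.calls)  # → 3
```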
    Exercise 3: Write a Safe Callback Wrapper

    Complete the safe_callback function. It takes a callable callback_fn and returns a new function that:

    1. Calls callback_fn with the given arguments

    2. If callback_fn raises any exception, catches it and returns the string "callback_error: " followed by the exception message

    3. If callback_fn succeeds, returns its return value

    This pattern prevents a broken callback from crashing the main application.


    Callbacks vs LangSmith — When to Use Which

    A question I see constantly: should you use custom callbacks or LangSmith for observability? The short answer is they're complementary, not alternatives. LangSmith itself is built on the callback system — it uses a LangChainTracer callback handler under the hood.

  • Setup — custom callbacks: write your own handlers; LangSmith: set env vars, auto-traces
  • Data location — custom callbacks: wherever you send it (your DB, files, etc.); LangSmith: LangChain's cloud servers
  • Customization — custom callbacks: full control over format and logic; LangSmith: dashboard UI, limited custom logic
  • Cost — custom callbacks: free (your infrastructure); LangSmith: free tier + paid plans
  • Best for — custom callbacks: custom metrics, on-premises, alerting; LangSmith: quick debugging, trace visualization

    I prefer custom callbacks when I need specific metrics (cost attribution per user, custom alerting) or when data cannot leave my infrastructure. LangSmith is faster to set up when you just want trace visualization during development. Many teams run both: LangSmith for development visibility and custom callbacks for production monitoring.

    Frequently Asked Questions

    Can I use callbacks with LCEL chains?

Yes. LCEL chains accept callbacks through the same config parameter. The chain-level events (on_chain_start, on_chain_end) fire for the overall chain, while LLM events fire for each model call within it.

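A sketch of a handler that records the event order. Real chain usage is in the comments (it needs an API key and langchain-openai), so the bottom of the snippet simulates the nesting LangChain produces for one chain wrapping one LLM call:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class EventOrder(BaseCallbackHandler):
    def __init__(self):
        self.events = []

    def on_chain_start(self, serialized, inputs, **kwargs):
        self.events.append("chain_start")

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.events.append("llm_start")

    def on_llm_end(self, response, **kwargs):
        self.events.append("llm_end")

    def on_chain_end(self, outputs, **kwargs):
        self.events.append("chain_end")

# Real usage (needs an API key):
# chain = prompt | llm  # any LCEL chain
# chain.invoke({"topic": "Python"}, config={"callbacks": [EventOrder()]})

# Simulated firing order for one chain wrapping one LLM call:
h = EventOrder()
h.on_chain_start({}, {})
h.on_llm_start({}, ["prompt"])
h.on_llm_end(None)
h.on_chain_end({})
print(h.events)  # → ['chain_start', 'llm_start', 'llm_end', 'chain_end']
```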

    How do I access the original prompt inside on_llm_end?

The on_llm_end method doesn't receive the prompt directly. Store it in on_llm_start and retrieve it in on_llm_end using run_id as the correlation key.

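A sketch of the correlation pattern (the import fallback stub is ours); because each call's prompts are keyed by run_id, the pairing stays correct even when calls interleave:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class PromptCapture(BaseCallbackHandler):
    def __init__(self):
        self._pending = {}  # run_id -> prompts still awaiting their response
        self.pairs = []     # (prompts, response) tuples, completed calls

    def on_llm_start(self, serialized, prompts, *, run_id=None, **kwargs):
        self._pending[run_id] = prompts

    def on_llm_end(self, response, *, run_id=None, **kwargs):
        prompts = self._pending.pop(run_id, None)  # correlate by run_id
        self.pairs.append((prompts, response))
```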

    Do callbacks work with tools and agents?

    Yes. If you're using LangChain tools, the on_tool_start, on_tool_end, and on_tool_error events fire for each tool execution. For agents, the chain-level events capture the agent's reasoning loop, while LLM and tool events capture individual steps within it.

    Can I dispatch custom events from my own code?

    Since LangChain 0.3, you can use dispatch_custom_event(name, data) inside a Runnable to fire arbitrary events that your callback handler can listen for via on_custom_event. This is useful when you need to track application-specific milestones (like "retrieval complete" or "guardrail check passed") that don't map to the standard LLM/chain/tool events.

    What to Learn Next

    You now have a complete callback toolkit: logging, cost tracking, streaming, error handling, async handlers, and production monitoring. Here are the natural next steps in your LangChain journey:

  • LangSmith — LangChain's managed tracing platform, built on callbacks. See your callback data in a visual dashboard.
  • LLM Streaming — Deep dive into streaming patterns beyond basic token callbacks.
  • LangChain Chains — Build multi-step pipelines where chain-level callbacks give you end-to-end visibility.
  • LangChain Tools — Add tool-use callbacks to monitor function calling in agents.
  • Streaming AI API Backend — Wire streaming callbacks into a FastAPI server.
Complete Code

    Here is every handler from this tutorial combined into a single copy-paste script. It includes the logging handler, cost tracker, streaming handler, error alerter, and a demo that runs them together:

    Full script — copy-paste and run

    References

  • LangChain documentation — Callbacks. Link
  • LangChain documentation — Custom Callback Handlers. Link
  • LangChain API Reference — BaseCallbackHandler. Link
  • LangChain documentation — Async Callbacks. Link
  • OpenAI API documentation — Token Usage. Link
  • OpenAI Pricing. Link
  • Python logging module documentation. Link
  • LangSmith documentation — Tracing Overview. Link
