LangChain Callbacks: Build Custom Logging, Cost Tracking, and Streaming Handlers
You ship an LLM app, and it works. But then your boss asks: "How much did that cost last month?" You don't know. A user reports a weird answer. You can't reproduce it because you never logged the prompt. The model starts returning empty strings at 2 AM, and nobody notices until morning.
LangChain callbacks solve all three problems. They let you hook into every step of an LLM call — before the prompt is sent, after the response arrives, when tokens stream in, when something fails — and run your own code at each point. I've retrofitted callbacks into apps that were already in production, and every time I wished I had added them from day one.
What Are Callbacks and Why Do They Matter?
A callback is a function that LangChain calls automatically at a specific moment during execution. You don't call it yourself — you register it, and LangChain fires it at the right time.
Think of it like setting up motion-sensor lights in a hallway. You don't flip a switch every time someone walks by — you wire the sensor once, and it activates whenever it detects movement. LangChain callbacks work the same way: you wire them up once, and they fire automatically whenever the LLM starts generating, finishes a response, encounters an error, or produces a new token.
Here is the simplest possible callback — one that prints a message when an LLM call starts and when it ends:
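A sketch of that handler. The import fallback and the OPENAI_API_KEY guard are my additions so the snippet degrades gracefully; for a live run you need langchain-core and langchain-openai installed and a key configured:

```python
import os

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the handler definition works without langchain
    class BaseCallbackHandler:
        pass

class PrintHandler(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        # Fires before the API request goes out
        print(f"LLM call starting with {len(prompts)} prompt(s)")

    def on_llm_end(self, response, **kwargs):
        # Fires after the full response arrives; response.generations
        # holds one list of generations per prompt
        print(f"LLM call finished. Got {len(response.generations)} generation(s)")

if os.environ.get("OPENAI_API_KEY"):  # only call the API when a key is set
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[PrintHandler()])
    print(llm.invoke("What is Python?").content)
```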
The first two lines come from our callback. The third is the model's response (yours will differ):
LLM call starting with 1 prompt(s)
LLM call finished. Got 1 generation(s)
Python is a high-level, interpreted programming language...

Two methods, and suddenly you have visibility into every LLM call your app makes. The on_llm_start method fires before the API request goes out, and on_llm_end fires after the response arrives. Everything in between — the actual API call — happens normally.
The Callback Lifecycle — Every Hook You Can Use
LangChain doesn't give you just two hooks. It gives you a dozen, covering every phase of execution across LLMs, chains, tools, and retrievers. I spent my first week with callbacks only using on_llm_start and on_llm_end. Once I discovered the full set, my logging and debugging got dramatically better.
Here are the callback methods grouped by the component they monitor. You override only the ones you care about — BaseCallbackHandler provides no-op defaults for everything else:
LLM Events:

- on_llm_start(serialized, prompts, **kwargs) — Before the LLM API call
- on_llm_new_token(token, **kwargs) — Each streamed token arrives
- on_llm_end(response, **kwargs) — After the complete response
- on_llm_error(error, **kwargs) — When the LLM call fails

Chain Events:

- on_chain_start(serialized, inputs, **kwargs) — Before a chain runs
- on_chain_end(outputs, **kwargs) — After a chain completes
- on_chain_error(error, **kwargs) — When a chain fails

Tool Events:

- on_tool_start(serialized, input_str, **kwargs) — Before a tool executes
- on_tool_end(output, **kwargs) — After a tool returns
- on_tool_error(error, **kwargs) — When a tool fails

The lifecycle of a single LLM call flows like this: on_llm_start → (optionally) on_llm_new_token repeated for each token → on_llm_end. If something goes wrong at any point, on_llm_error fires instead of on_llm_end. For chains, the pattern nests: on_chain_start → on_llm_start → ... → on_llm_end → on_chain_end.
Build a Logging Handler
The first callback I build for every new project is a logger. Not print() statements scattered through the code — a proper structured logger that captures the prompt, model name, latency, and token counts for every single LLM call. When something goes wrong in production, this log is what saves you.
This handler uses Python's logging module and the time library to capture timing. The on_llm_start method records the start time; on_llm_end computes the elapsed time and logs the full details:
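A sketch of such a logger. The import fallback is my addition, and the invocation_params lookup for the model name is an assumption about the callback payload, which can vary across LangChain versions:

```python
import logging
import time

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger("llm")

class LoggingHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()
        # Where the model name lives in kwargs varies by version; hedge the lookup
        params = kwargs.get("invocation_params") or {}
        model = params.get("model_name", "unknown")
        logger.info("LLM START | model=%s | prompt_chars=%d", model, len(prompts[0]))

    def on_llm_end(self, response, **kwargs):
        elapsed = time.time() - self.start_time
        usage = (response.llm_output or {}).get("token_usage", {})
        logger.info(
            "LLM END | duration=%.2fs | prompt_tokens=%s | completion_tokens=%s",
            elapsed,
            usage.get("prompt_tokens"),
            usage.get("completion_tokens"),
        )
```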
Using it is the same pattern as before — pass an instance in the callbacks list:
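A wiring sketch, assuming the LoggingHandler class defined in this section and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    # Constructor-level: the handler fires on every call made with this instance
    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[LoggingHandler()])
    llm.invoke("What is Python?")
```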
Running this produces log lines like:
2025-03-15 14:22:01 | LLM START | model=gpt-4o-mini | prompt_chars=52
2025-03-15 14:22:02 | LLM END | duration=1.34s | prompt_tokens=18 | completion_tokens=45

Every LLM call in your app now leaves a trace. If a user reports a bad response, you search the logs by timestamp. If latency spikes, you spot it immediately in the duration field.
Build a Cost Tracker
This was the callback that made my manager finally care about LLM observability. Before this, "API costs" was a number on the billing dashboard nobody checked until the end of the month. With CostTracker, you see the cost of every operation in real time.
The pricing dictionary maps model names to per-token costs. When on_llm_end fires, we pull the token counts from the response metadata, multiply by the model-specific rate, and accumulate. The report() method gives you a summary at any point:
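A sketch of the tracker. The per-token prices below are illustrative; check your provider's current pricing before relying on the numbers:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

# Illustrative per-token rates (USD); verify against current provider pricing
PRICING = {
    "gpt-4o-mini": {"prompt": 0.15 / 1_000_000, "completion": 0.60 / 1_000_000},
    "gpt-4o": {"prompt": 2.50 / 1_000_000, "completion": 10.00 / 1_000_000},
}

class CostTracker(BaseCallbackHandler):
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0.0
        self.call_count = 0

    def on_llm_end(self, response, **kwargs):
        out = response.llm_output or {}
        usage = out.get("token_usage", {})
        model = out.get("model_name", "unknown")
        rates = PRICING.get(model, {"prompt": 0.0, "completion": 0.0})
        prompt_t = usage.get("prompt_tokens", 0)
        completion_t = usage.get("completion_tokens", 0)
        self.total_tokens += prompt_t + completion_t
        self.total_cost += prompt_t * rates["prompt"] + completion_t * rates["completion"]
        self.call_count += 1

    def report(self):
        return (f"{self.call_count} calls | {self.total_tokens} tokens | "
                f"${self.total_cost:.6f}")
```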
The output will show all three calls accumulated, with the total token count and cost. With gpt-4o-mini, three short questions typically cost well under a tenth of a cent. Swap the model to gpt-4o and the same calls cost roughly 17x more. The tracker makes that difference visible before your invoice does.
Complete the per_model_report() method that returns a dictionary mapping each model name to its total cost. The callback's on_llm_end method already tracks costs. You need to also store costs per model in self.model_costs (a defaultdict(float)) and implement per_model_report() to return that dictionary.
The model name is available from response.llm_output.get("model_name", "unknown").
Streaming Callbacks — Token-by-Token Output
When ChatGPT shows text appearing word by word, that's streaming. With LangChain callbacks, you can intercept every single token as it arrives and do whatever you want with it — print it, send it over a WebSocket, feed it to a progress bar, or buffer it for post-processing.
The key method is on_llm_new_token. It fires once per token, receiving just the new text fragment. Here's a handler that prints tokens as they arrive, plus counts them:
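A sketch of such a streaming handler; the import fallback is my addition:

```python
import sys

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class StreamingPrinter(BaseCallbackHandler):
    def __init__(self):
        self.token_count = 0

    def on_llm_new_token(self, token, **kwargs):
        sys.stdout.write(token)  # no trailing newline, so fragments flow together
        sys.stdout.flush()       # show each fragment immediately
        self.token_count += 1

    def on_llm_end(self, response, **kwargs):
        print(f"\n--- {self.token_count} tokens streamed ---")
```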
We use sys.stdout.write() instead of print() because print() adds a newline after each token. Since tokens can be partial words ("de", "cor", "ator"), you want them to flow together as continuous text.
To enable streaming, set streaming=True on the LLM. Without this flag, on_llm_new_token never fires:
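A wiring sketch, assuming a StreamingPrinter handler like the one above and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        model="gpt-4o-mini",
        streaming=True,  # without this flag, on_llm_new_token never fires
        callbacks=[StreamingPrinter()],
    )
    result = llm.invoke("Explain Python decorators in two sentences.")
```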
The output appears progressively in the terminal, token by token, ending with a count like --- 38 tokens streamed ---. The result variable still contains the full response after streaming completes.
Combining Multiple Callbacks
Here's where callbacks get really practical. You're not limited to one handler — you can stack as many as you need. I typically run three simultaneously: a logger, a cost tracker, and an error alerter. Each one has a single responsibility, and they all fire on the same LLM call.
LangChain calls every handler in the list, in order, for each event. If on_llm_start fires, both log_handler.on_llm_start() and cost_handler.on_llm_start() execute. If a handler doesn't override a method, the base class's no-op default runs, so nothing happens for that event.
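A stacking sketch. The handler names assume the logger, cost tracker, and error alerter classes sketched elsewhere in this article:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        model="gpt-4o-mini",
        callbacks=[LoggingHandler(), CostTracker(), ErrorAlertHandler()],
    )
    llm.invoke("What is a generator?")  # all three handlers fire for this one call
```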
Constructor-Level vs Invocation-Level Callbacks
There are two places to attach callbacks, and the choice matters. Constructor-level callbacks attach to the LLM instance and fire on every call. Invocation-level callbacks attach to a single .invoke() call.
# Attached at creation — fires on EVERY call
llm = ChatOpenAI(
    model="gpt-4o-mini",
    callbacks=[LoggingHandler()]
)

# Both of these calls trigger the handler
llm.invoke("Question 1")
llm.invoke("Question 2")

# Attached per invocation — fires ONLY on that call
llm = ChatOpenAI(model="gpt-4o-mini")

# Only this call triggers the handler
llm.invoke("Question 1",
           config={"callbacks": [LoggingHandler()]})

# This call has no callbacks
llm.invoke("Question 2")

My rule of thumb: use constructor-level callbacks for production monitoring (logging, cost tracking) that should always be on. Use invocation-level callbacks for temporary debugging or per-request features like streaming to a specific user's WebSocket connection.
Error Handling with Callbacks
API calls fail. Rate limits hit, networks time out, models return unexpected responses. Without callbacks, you're wrapping every .invoke() call in try/except blocks scattered across your codebase. With an error-handling callback, you centralize all failure logic in one place.
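A sketch of an error-alerting handler; the import fallback and the structure of the recorded entries are my additions:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class ErrorAlertHandler(BaseCallbackHandler):
    def __init__(self):
        self.errors = []  # in-memory error log

    def _record(self, source, error):
        entry = {
            "source": source,
            "type": type(error).__name__,
            "message": str(error),
        }
        self.errors.append(entry)
        # In production, replace print() with a Slack webhook, PagerDuty, etc.
        print(f"ALERT [{source}] {entry['type']}: {entry['message']}")

    def on_llm_error(self, error, **kwargs):
        self._record("llm", error)

    def on_chain_error(self, error, **kwargs):
        self._record("chain", error)
```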
This handler catches both LLM-level and chain-level errors. In production, you'd replace the print() call with a real alerting mechanism — a Slack webhook, PagerDuty integration, or a metrics counter that triggers an alarm when the error rate spikes.
The errors list acts as an in-memory error log. For a long-running service, you'd want to write these to a database or log aggregator instead.
Async Callbacks for High-Throughput Apps
If your app handles multiple LLM calls concurrently — a web server fielding many requests, a batch processing pipeline — synchronous callbacks can become a bottleneck. LangChain supports async callback handlers through AsyncCallbackHandler. The interface is identical, but methods are async def and can await I/O operations without blocking.
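A sketch of an async file logger. The asyncio.to_thread offloading and the log line format are my choices; the import fallback lets the sketch run without langchain:

```python
import asyncio

try:
    from langchain_core.callbacks import AsyncCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class AsyncCallbackHandler:
        pass

class AsyncFileLogger(AsyncCallbackHandler):
    def __init__(self, path="llm_calls.log"):
        self.path = path

    async def on_llm_start(self, serialized, prompts, **kwargs):
        # to_thread keeps the blocking file write off the event loop
        await asyncio.to_thread(self._append, f"START: {prompts[0][:80]}\n")

    async def on_llm_end(self, response, **kwargs):
        await asyncio.to_thread(self._append, "END\n")

    def _append(self, line):
        with open(self.path, "a") as f:
            f.write(line)
```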
The AsyncFileLogger writes to a file without blocking the event loop. In a FastAPI or aiohttp application, this means your callback logging doesn't slow down other request handlers. Use it with LangChain's async methods:
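An async wiring sketch, assuming the AsyncFileLogger above and a configured API key:

```python
import asyncio
import os

async def main():
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[AsyncFileLogger()])
    await llm.ainvoke("What does the asyncio event loop do?")

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    asyncio.run(main())
```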
Real-World Example — A Production Monitoring Stack
Let's bring everything together into a handler that you'd actually use in production. This combines logging, cost tracking, latency monitoring, and error counting into a single class with a dashboard-ready summary method. I based this on a handler I use in a customer-facing chatbot that processes around 10,000 calls per day.
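A sketch of such a monitor. The flat per-token rate is a placeholder (swap in real per-model pricing), and the import fallback is my addition:

```python
import time
from collections import defaultdict

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class ProductionMonitor(BaseCallbackHandler):
    COST_PER_TOKEN = 0.60 / 1_000_000  # placeholder flat rate in USD

    def __init__(self):
        self._start_times = {}  # run_id -> start timestamp
        self.latencies = []
        self.total_calls = 0
        self.total_tokens = 0
        self.total_cost = 0.0
        self.errors_by_type = defaultdict(int)

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start_times[run_id] = time.time()
        self.total_calls += 1

    def on_llm_end(self, response, *, run_id, **kwargs):
        start = self._start_times.pop(run_id, None)
        if start is not None:
            self.latencies.append(time.time() - start)
        usage = (response.llm_output or {}).get("token_usage", {})
        tokens = usage.get("total_tokens", 0)
        self.total_tokens += tokens
        self.total_cost += tokens * self.COST_PER_TOKEN

    def on_llm_error(self, error, *, run_id, **kwargs):
        self._start_times.pop(run_id, None)
        self.errors_by_type[type(error).__name__] += 1

    def dashboard(self):
        lat = sorted(self.latencies)
        errors = sum(self.errors_by_type.values())
        return {
            "total_calls": self.total_calls,
            "total_tokens": self.total_tokens,
            "total_cost_usd": round(self.total_cost, 6),
            "avg_latency_s": round(sum(lat) / len(lat), 3) if lat else 0,
            "p95_latency_s": round(lat[int(len(lat) * 0.95)], 3) if lat else 0,
            "error_rate_pct": round(errors / self.total_calls * 100, 2)
                              if self.total_calls else 0,
            "errors_by_type": dict(self.errors_by_type),
        }
```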
Notice the run_id parameter. Each LLM call gets a unique ID, so the handler can correctly match start and end events even when multiple calls run concurrently. I used a plain self.start_time in the earlier logging example for simplicity, but run_id is the robust approach.
The dashboard() method aggregates everything into a summary you can expose as an API endpoint or log periodically.
After a batch of calls, dashboard() returns something like:
{
"total_calls": 150,
"total_tokens": 45200,
"total_cost_usd": 0.028,
"avg_latency_s": 1.2,
"p95_latency_s": 2.8,
"error_rate_pct": 1.33,
"errors_by_type": {"RateLimitError": 2}
}

That dictionary tells you everything you need for a status page or Grafana dashboard. P95 latency above your SLA? Time to investigate. Error rate climbing? Check the errors_by_type breakdown.
Complete the compute_stats() function. It receives a list of latencies (floats in seconds) and two integers: call_count and error_count. It should return a dictionary with three keys:
"avg_latency": the mean of the latencies list, rounded to 3 decimal places (0 if the list is empty)"p95_latency": the value at the 95th percentile index (int(len * 0.95)), rounded to 3 decimal places (0 if the list is empty)"error_rate": error_count / call_count * 100, rounded to 2 decimal places (0 if call_count is 0)Common Mistakes and How to Fix Them
I've debugged every one of these in my own code or in code review. They're subtle because callbacks fail silently — your app keeps working, but your monitoring is broken.
Mistake 1: Sharing State Across Concurrent Calls
class BadHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None  # Overwritten by concurrent calls!

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()

    def on_llm_end(self, response, **kwargs):
        elapsed = time.time() - self.start_time  # Wrong if another call started

class GoodHandler(BaseCallbackHandler):
    def __init__(self):
        self._start_times = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start_times[run_id] = time.time()

    def on_llm_end(self, response, *, run_id, **kwargs):
        elapsed = time.time() - self._start_times.pop(run_id, time.time())

If two LLM calls overlap, the first call's start_time gets overwritten by the second. The run_id pattern keeps them isolated.
Mistake 2: Raising Exceptions in Callbacks
An unhandled exception in a callback can crash your entire LLM call, even if the API response was fine. Always wrap callback logic in try/except:
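A sketch of the pattern; the import fallback and the specific log message are my additions:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class SafeLoggingHandler(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        try:
            usage = (response.llm_output or {}).get("token_usage", {})
            print(f"tokens used: {usage.get('total_tokens', 'unknown')}")
        except Exception as exc:
            # Never let monitoring take down the pipeline: log and move on
            print(f"callback failed: {exc}")
```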
Your monitoring code should never take down your production LLM pipeline. Log the callback failure and move on.
Mistake 3: Forgetting streaming=True for Token Callbacks
This one bites everyone at least once. You write a beautiful on_llm_new_token handler, wire it up, and nothing happens. The fix is a single parameter:
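A before-and-after sketch, assuming a StreamingPrinter handler like the one from the streaming section and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    # Broken: the handler is registered, but no tokens ever arrive
    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[StreamingPrinter()])

    # Fixed: streaming=True makes on_llm_new_token fire for every token
    llm = ChatOpenAI(model="gpt-4o-mini", streaming=True,
                     callbacks=[StreamingPrinter()])
```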
Mistake 4: Creating a New Handler Instance per Call
If your callback tracks cumulative metrics (total cost, call count), creating a fresh instance for each .invoke() call resets the counters. Reuse the same instance:
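A sketch of the fix, assuming the CostTracker class from earlier and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini")

    # Bad: a fresh tracker per call, so counters reset every time
    llm.invoke("Question 1", config={"callbacks": [CostTracker()]})
    llm.invoke("Question 2", config={"callbacks": [CostTracker()]})

    # Good: one long-lived tracker accumulates across calls
    tracker = CostTracker()
    llm.invoke("Question 1", config={"callbacks": [tracker]})
    llm.invoke("Question 2", config={"callbacks": [tracker]})
    print(tracker.report())
```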
Complete the safe_callback function. It takes a callable callback_fn and returns a new function that:
1. Calls callback_fn with the given arguments
2. If callback_fn raises any exception, catches it and returns the string "callback_error: " followed by the exception message
3. If callback_fn succeeds, returns its return value
This pattern prevents a broken callback from crashing the main application.
Frequently Asked Questions
Can I use callbacks with LangChain Expression Language (LCEL) chains?
Yes. LCEL chains accept callbacks through the same config parameter. The chain-level events (on_chain_start, on_chain_end) fire for the overall chain, while LLM events fire for each model call within it:
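A wiring sketch, assuming a LoggingHandler like the one from the logging section and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_template("Summarize: {text}")
    chain = prompt | ChatOpenAI(model="gpt-4o-mini")

    # Chain-level and LLM-level events both reach the handler
    chain.invoke({"text": "LangChain callbacks hook into every step."},
                 config={"callbacks": [LoggingHandler()]})
```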
How do I access the original prompt text inside on_llm_end?
The on_llm_end method doesn't receive the prompt directly. The cleanest approach is to store it in on_llm_start and retrieve it in on_llm_end using run_id:
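A sketch of that pattern; the import fallback and the 60-character truncation are my additions:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class PromptCapturingHandler(BaseCallbackHandler):
    def __init__(self):
        self._prompts = {}  # run_id -> prompt text

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._prompts[run_id] = prompts[0]

    def on_llm_end(self, response, *, run_id, **kwargs):
        # pop() both retrieves the prompt and prevents unbounded growth
        prompt = self._prompts.pop(run_id, "<unknown>")
        print(f"prompt was: {prompt[:60]}")
```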
What is the difference between callbacks and LangSmith?
LangSmith is LangChain's managed observability platform. Under the hood, it uses callbacks to capture traces. Callbacks give you full control — you decide what to log, where to store it, and how to process it. LangSmith provides a ready-made dashboard with trace visualization, but requires sending data to LangChain's servers. Use callbacks when you need custom logic or must keep data on-premises. Use LangSmith when you want quick setup with a hosted UI.
Complete Code
References
Python logging module documentation: https://docs.python.org/3/library/logging.html