LangChain Callbacks: Build Custom Logging, Cost Tracking, and Streaming Handlers
You ship an LLM app, and it works. But then your boss asks: "How much did that cost last month?" You don't know. A user reports a weird answer. You can't reproduce it because you never logged the prompt. The model starts returning empty strings at 2 AM, and nobody notices until morning.
LangChain callbacks solve all three problems. They let you hook into every step of an LLM call — before the prompt is sent, after the response arrives, when tokens stream in, when something fails — and run your own code at each point. I've retrofitted callbacks into apps that were already in production, and every time I wished I had added them from day one.
What Are Callbacks and Why Do They Matter?
A callback is a function that LangChain calls automatically at a specific moment during execution. You don't call it yourself — you register it, and LangChain fires it at the right time.
Think of it like setting up motion-sensor lights in a hallway. You don't flip a switch every time someone walks by — you wire the sensor once, and it activates whenever it detects movement. LangChain callbacks work the same way: you wire them up once, and they fire automatically whenever the LLM starts generating, finishes a response, encounters an error, or produces a new token.
Here is the simplest possible callback — one that prints a message when an LLM call starts and when it ends:
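A sketch of that handler. The import fallback and the OPENAI_API_KEY guard are my additions so the snippet degrades gracefully; for a live run you need langchain-core and langchain-openai installed and a key configured:

```python
import os

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the handler definition works without langchain
    class BaseCallbackHandler:
        pass

class PrintHandler(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        # Fires before the API request goes out
        print(f"LLM call starting with {len(prompts)} prompt(s)")

    def on_llm_end(self, response, **kwargs):
        # Fires after the full response arrives; response.generations
        # holds one list of generations per prompt
        print(f"LLM call finished. Got {len(response.generations)} generation(s)")

if os.environ.get("OPENAI_API_KEY"):  # only call the API when a key is set
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[PrintHandler()])
    print(llm.invoke("What is Python?").content)
```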
The first two lines come from our callback. The third is the model's response (yours will differ):
LLM call starting with 1 prompt(s)
LLM call finished. Got 1 generation(s)
Python is a high-level, interpreted programming language...

Two methods, and suddenly you have visibility into every LLM call your app makes. The on_llm_start method fires before the API request goes out, and on_llm_end fires after the response arrives. Everything in between — the actual API call — happens normally.
The Callback Lifecycle — Every Hook You Can Use
LangChain doesn't give you just two hooks. It gives you a dozen, covering every phase of execution across LLMs, chains, tools, and retrievers. I spent my first week with callbacks only using on_llm_start and on_llm_end. Once I discovered the full set, my logging and debugging got dramatically better.
Here are the callback methods grouped by the component they monitor. You override only the ones you care about — BaseCallbackHandler provides no-op defaults for everything else:
LLM Events:

- on_llm_start(serialized, prompts, **kwargs) — Before the LLM API call
- on_llm_new_token(token, **kwargs) — Each streamed token arrives
- on_llm_end(response, **kwargs) — After the complete response
- on_llm_error(error, **kwargs) — When the LLM call fails

Chain Events:

- on_chain_start(serialized, inputs, **kwargs) — Before a chain runs
- on_chain_end(outputs, **kwargs) — After a chain completes
- on_chain_error(error, **kwargs) — When a chain fails

Tool Events:

- on_tool_start(serialized, input_str, **kwargs) — Before a tool executes
- on_tool_end(output, **kwargs) — After a tool returns
- on_tool_error(error, **kwargs) — When a tool fails

The lifecycle of a single LLM call flows like this: on_llm_start → (optionally) on_llm_new_token repeated for each token → on_llm_end. If something goes wrong at any point, on_llm_error fires instead of on_llm_end. For chains, the pattern nests: on_chain_start → on_llm_start → ... → on_llm_end → on_chain_end.
Build a Logging Handler
The first callback I build for every new project is a logger. Not print() statements scattered through the code — a proper structured logger that captures the prompt, model name, latency, and token counts for every single LLM call. When something goes wrong in production, this log is what saves you.
This handler uses Python's logging module and the time library to capture timing. The on_llm_start method records the start time; on_llm_end computes the elapsed time and logs the full details:
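A sketch of such a logger. The import fallback is my addition, and the invocation_params lookup for the model name is an assumption about the callback payload, which can vary across LangChain versions:

```python
import logging
import time

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger("llm")

class LoggingHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()
        # Where the model name lives in kwargs varies by version; hedge the lookup
        params = kwargs.get("invocation_params") or {}
        model = params.get("model_name", "unknown")
        logger.info("LLM START | model=%s | prompt_chars=%d", model, len(prompts[0]))

    def on_llm_end(self, response, **kwargs):
        elapsed = time.time() - self.start_time
        usage = (response.llm_output or {}).get("token_usage", {})
        logger.info(
            "LLM END | duration=%.2fs | prompt_tokens=%s | completion_tokens=%s",
            elapsed,
            usage.get("prompt_tokens"),
            usage.get("completion_tokens"),
        )
```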
Using it is the same pattern as before — pass an instance in the callbacks list:
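A wiring sketch, assuming the LoggingHandler class defined in this section and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    # Constructor-level: the handler fires on every call made with this instance
    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[LoggingHandler()])
    llm.invoke("What is Python?")
```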
Running this produces log lines like:
2025-03-15 14:22:01 | LLM START | model=gpt-4o-mini | prompt_chars=52
2025-03-15 14:22:02 | LLM END | duration=1.34s | prompt_tokens=18 | completion_tokens=45

Every LLM call in your app now leaves a trace. If a user reports a bad response, you search the logs by timestamp. If latency spikes, you spot it immediately in the duration field.
Build a Cost Tracker
This was the callback that made my manager finally care about LLM observability. Before this, "API costs" was a number on the billing dashboard nobody checked until the end of the month. With CostTracker, you see the cost of every operation in real time.
The pricing dictionary maps model names to per-token costs. When on_llm_end fires, we pull the token counts from the response metadata, multiply by the model-specific rate, and accumulate. The report() method gives you a summary at any point:
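A sketch of the tracker. The per-token prices below are illustrative; check your provider's current pricing before relying on the numbers:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

# Illustrative per-token rates (USD); verify against current provider pricing
PRICING = {
    "gpt-4o-mini": {"prompt": 0.15 / 1_000_000, "completion": 0.60 / 1_000_000},
    "gpt-4o": {"prompt": 2.50 / 1_000_000, "completion": 10.00 / 1_000_000},
}

class CostTracker(BaseCallbackHandler):
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0.0
        self.call_count = 0

    def on_llm_end(self, response, **kwargs):
        out = response.llm_output or {}
        usage = out.get("token_usage", {})
        model = out.get("model_name", "unknown")
        rates = PRICING.get(model, {"prompt": 0.0, "completion": 0.0})
        prompt_t = usage.get("prompt_tokens", 0)
        completion_t = usage.get("completion_tokens", 0)
        self.total_tokens += prompt_t + completion_t
        self.total_cost += prompt_t * rates["prompt"] + completion_t * rates["completion"]
        self.call_count += 1

    def report(self):
        return (f"{self.call_count} calls | {self.total_tokens} tokens | "
                f"${self.total_cost:.6f}")
```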
The output will show all three calls accumulated, with the total token count and cost. With gpt-4o-mini, three short questions typically cost well under a tenth of a cent. Swap the model to gpt-4o and the same calls cost roughly 17x more. The tracker makes that difference visible before your invoice does.
Complete the per_model_report() method that returns a dictionary mapping each model name to its total cost. The callback's on_llm_end method already tracks costs. You need to also store costs per model in self.model_costs (a defaultdict(float)) and implement per_model_report() to return that dictionary.
The model name is available from response.llm_output.get("model_name", "unknown").
Streaming Callbacks — Token-by-Token Output
When ChatGPT shows text appearing word by word, that's streaming. With LangChain callbacks, you can intercept every single token as it arrives and do whatever you want with it — print it, send it over a WebSocket, feed it to a progress bar, or buffer it for post-processing.
The key method is on_llm_new_token. It fires once per token, receiving just the new text fragment. Here's a handler that prints tokens as they arrive, plus counts them:
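A sketch of such a streaming handler; the import fallback is my addition:

```python
import sys

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class StreamingPrinter(BaseCallbackHandler):
    def __init__(self):
        self.token_count = 0

    def on_llm_new_token(self, token, **kwargs):
        sys.stdout.write(token)  # no trailing newline, so fragments flow together
        sys.stdout.flush()       # show each fragment immediately
        self.token_count += 1

    def on_llm_end(self, response, **kwargs):
        print(f"\n--- {self.token_count} tokens streamed ---")
```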
We use sys.stdout.write() instead of print() because print() adds a newline after each token. Since tokens can be partial words ("de", "cor", "ator"), you want them to flow together as continuous text.
To enable streaming, set streaming=True on the LLM. Without this flag, on_llm_new_token never fires:
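A wiring sketch, assuming a StreamingPrinter handler like the one above and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        model="gpt-4o-mini",
        streaming=True,  # without this flag, on_llm_new_token never fires
        callbacks=[StreamingPrinter()],
    )
    result = llm.invoke("Explain Python decorators in two sentences.")
```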
The output appears progressively in the terminal, token by token, ending with a count like --- 38 tokens streamed ---. The result variable still contains the full response after streaming completes.
Combining Multiple Callbacks
Here's where callbacks get really practical. You're not limited to one handler — you can stack as many as you need. I typically run three simultaneously: a logger, a cost tracker, and an error alerter. Each one has a single responsibility, and they all fire on the same LLM call.
LangChain calls every handler in the list, in order, for each event. If on_llm_start fires, both log_handler.on_llm_start() and cost_handler.on_llm_start() execute. If a handler doesn't override a method, the base class's no-op default runs, so nothing happens for that event.
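A stacking sketch. The handler names assume the logger, cost tracker, and error alerter classes sketched elsewhere in this article:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        model="gpt-4o-mini",
        callbacks=[LoggingHandler(), CostTracker(), ErrorAlertHandler()],
    )
    llm.invoke("What is a generator?")  # all three handlers fire for this one call
```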
Constructor-Level vs Invocation-Level Callbacks
There are two places to attach callbacks, and the choice matters. Constructor-level callbacks attach to the LLM instance and fire on every call. Invocation-level callbacks attach to a single .invoke() call.
# Attached at creation — fires on EVERY call
llm = ChatOpenAI(
    model="gpt-4o-mini",
    callbacks=[LoggingHandler()]
)

# Both of these calls trigger the handler
llm.invoke("Question 1")
llm.invoke("Question 2")

# Attached per invocation — fires ONLY on that call
llm = ChatOpenAI(model="gpt-4o-mini")

# Only this call triggers the handler
llm.invoke("Question 1",
           config={"callbacks": [LoggingHandler()]})

# This call has no callbacks
llm.invoke("Question 2")

My rule of thumb: use constructor-level callbacks for production monitoring (logging, cost tracking) that should always be on. Use invocation-level callbacks for temporary debugging or per-request features like streaming to a specific user's WebSocket connection.
Error Handling with Callbacks
API calls fail. Rate limits hit, networks time out, models return unexpected responses. Without callbacks, you're wrapping every .invoke() call in try/except blocks scattered across your codebase. With an error-handling callback, you centralize all failure logic in one place.
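A sketch of an error-alerting handler; the import fallback and the structure of the recorded entries are my additions:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class ErrorAlertHandler(BaseCallbackHandler):
    def __init__(self):
        self.errors = []  # in-memory error log

    def _record(self, source, error):
        entry = {
            "source": source,
            "type": type(error).__name__,
            "message": str(error),
        }
        self.errors.append(entry)
        # In production, replace print() with a Slack webhook, PagerDuty, etc.
        print(f"ALERT [{source}] {entry['type']}: {entry['message']}")

    def on_llm_error(self, error, **kwargs):
        self._record("llm", error)

    def on_chain_error(self, error, **kwargs):
        self._record("chain", error)
```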
This handler catches both LLM-level and chain-level errors. In production, you'd replace the print() call with a real alerting mechanism — a Slack webhook, PagerDuty integration, or a metrics counter that triggers an alarm when the error rate spikes.
The errors list acts as an in-memory error log. For a long-running service, you'd want to write these to a database or log aggregator instead.
Async Callbacks for High-Throughput Apps
If your app handles multiple LLM calls concurrently — a web server fielding many requests, a batch processing pipeline — synchronous callbacks can become a bottleneck. LangChain supports async callback handlers through AsyncCallbackHandler. The interface is identical, but methods are async def and can await I/O operations without blocking.
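A sketch of an async file logger. The asyncio.to_thread offloading and the log line format are my choices; the import fallback lets the sketch run without langchain:

```python
import asyncio

try:
    from langchain_core.callbacks import AsyncCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class AsyncCallbackHandler:
        pass

class AsyncFileLogger(AsyncCallbackHandler):
    def __init__(self, path="llm_calls.log"):
        self.path = path

    async def on_llm_start(self, serialized, prompts, **kwargs):
        # to_thread keeps the blocking file write off the event loop
        await asyncio.to_thread(self._append, f"START: {prompts[0][:80]}\n")

    async def on_llm_end(self, response, **kwargs):
        await asyncio.to_thread(self._append, "END\n")

    def _append(self, line):
        with open(self.path, "a") as f:
            f.write(line)
```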
The AsyncFileLogger writes to a file without blocking the event loop. In a FastAPI or aiohttp application, this means your callback logging doesn't slow down other request handlers. Use it with LangChain's async methods:
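An async wiring sketch, assuming the AsyncFileLogger above and a configured API key:

```python
import asyncio
import os

async def main():
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[AsyncFileLogger()])
    await llm.ainvoke("What does the asyncio event loop do?")

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    asyncio.run(main())
```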
Real-World Example — A Production Monitoring Stack
Let's bring everything together into a handler that you'd actually use in production. This combines logging, cost tracking, latency monitoring, and error counting into a single class with a dashboard-ready summary method. I based this on a handler I use in a customer-facing chatbot that processes around 10,000 calls per day.
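A sketch of such a monitor. The flat per-token rate is a placeholder (swap in real per-model pricing), and the import fallback is my addition:

```python
import time
from collections import defaultdict

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class ProductionMonitor(BaseCallbackHandler):
    COST_PER_TOKEN = 0.60 / 1_000_000  # placeholder flat rate in USD

    def __init__(self):
        self._start_times = {}  # run_id -> start timestamp
        self.latencies = []
        self.total_calls = 0
        self.total_tokens = 0
        self.total_cost = 0.0
        self.errors_by_type = defaultdict(int)

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start_times[run_id] = time.time()
        self.total_calls += 1

    def on_llm_end(self, response, *, run_id, **kwargs):
        start = self._start_times.pop(run_id, None)
        if start is not None:
            self.latencies.append(time.time() - start)
        usage = (response.llm_output or {}).get("token_usage", {})
        tokens = usage.get("total_tokens", 0)
        self.total_tokens += tokens
        self.total_cost += tokens * self.COST_PER_TOKEN

    def on_llm_error(self, error, *, run_id, **kwargs):
        self._start_times.pop(run_id, None)
        self.errors_by_type[type(error).__name__] += 1

    def dashboard(self):
        lat = sorted(self.latencies)
        errors = sum(self.errors_by_type.values())
        return {
            "total_calls": self.total_calls,
            "total_tokens": self.total_tokens,
            "total_cost_usd": round(self.total_cost, 6),
            "avg_latency_s": round(sum(lat) / len(lat), 3) if lat else 0,
            "p95_latency_s": round(lat[int(len(lat) * 0.95)], 3) if lat else 0,
            "error_rate_pct": round(errors / self.total_calls * 100, 2)
                              if self.total_calls else 0,
            "errors_by_type": dict(self.errors_by_type),
        }
```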
Notice the run_id parameter. Each LLM call gets a unique ID, so the handler can correctly match start and end events even when multiple calls run concurrently. I used a plain self.start_time in the earlier logging example for simplicity, but run_id is the robust approach.
The dashboard() method aggregates everything into a summary you can expose as an API endpoint or log periodically.
After a batch of calls, dashboard() returns something like:
{
"total_calls": 150,
"total_tokens": 45200,
"total_cost_usd": 0.028,
"avg_latency_s": 1.2,
"p95_latency_s": 2.8,
"error_rate_pct": 1.33,
"errors_by_type": {"RateLimitError": 2}
}

That dictionary tells you everything you need for a status page or Grafana dashboard. P95 latency above your SLA? Time to investigate. Error rate climbing? Check the errors_by_type breakdown.
Complete the compute_stats() function. It receives a list of latencies (floats in seconds) and two integers: call_count and error_count. It should return a dictionary with three keys:
"avg_latency": the mean of the latencies list, rounded to 3 decimal places (0 if the list is empty)"p95_latency": the value at the 95th percentile index (int(len * 0.95)), rounded to 3 decimal places (0 if the list is empty)"error_rate": error_count / call_count * 100, rounded to 2 decimal places (0 if call_count is 0)Common Mistakes and How to Fix Them
I've debugged every one of these in my own code or in code review. They're subtle because callbacks fail silently — your app keeps working, but your monitoring is broken.
Mistake 1: Sharing State Across Concurrent Calls
class BadHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None  # Overwritten by concurrent calls!

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()

    def on_llm_end(self, response, **kwargs):
        elapsed = time.time() - self.start_time  # Wrong if another call started

class GoodHandler(BaseCallbackHandler):
    def __init__(self):
        self._start_times = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start_times[run_id] = time.time()

    def on_llm_end(self, response, *, run_id, **kwargs):
        elapsed = time.time() - self._start_times.pop(run_id, time.time())

If two LLM calls overlap, the first call's start_time gets overwritten by the second. The run_id pattern keeps them isolated.
Mistake 2: Raising Exceptions in Callbacks
An unhandled exception in a callback can crash your entire LLM call, even if the API response was fine. Always wrap callback logic in try/except:
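A sketch of the pattern; the import fallback and the specific log message are my additions:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class SafeLoggingHandler(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        try:
            usage = (response.llm_output or {}).get("token_usage", {})
            print(f"tokens used: {usage.get('total_tokens', 'unknown')}")
        except Exception as exc:
            # Never let monitoring take down the pipeline: log and move on
            print(f"callback failed: {exc}")
```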
Your monitoring code should never take down your production LLM pipeline. Log the callback failure and move on.
Mistake 3: Forgetting streaming=True for Token Callbacks
This one bites everyone at least once. You write a beautiful on_llm_new_token handler, wire it up, and nothing happens. The fix is a single parameter:
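A before-and-after sketch, assuming a StreamingPrinter handler like the one from the streaming section and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    # Broken: the handler is registered, but no tokens ever arrive
    llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[StreamingPrinter()])

    # Fixed: streaming=True makes on_llm_new_token fire for every token
    llm = ChatOpenAI(model="gpt-4o-mini", streaming=True,
                     callbacks=[StreamingPrinter()])
```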
Mistake 4: Creating a New Handler Instance per Call
If your callback tracks cumulative metrics (total cost, call count), creating a fresh instance for each .invoke() call resets the counters. Reuse the same instance:
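A sketch of the fix, assuming the CostTracker class from earlier and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini")

    # Bad: a fresh tracker per call, so counters reset every time
    llm.invoke("Question 1", config={"callbacks": [CostTracker()]})
    llm.invoke("Question 2", config={"callbacks": [CostTracker()]})

    # Good: one long-lived tracker accumulates across calls
    tracker = CostTracker()
    llm.invoke("Question 1", config={"callbacks": [tracker]})
    llm.invoke("Question 2", config={"callbacks": [tracker]})
    print(tracker.report())
```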
Complete the safe_callback function. It takes a callable callback_fn and returns a new function that:
1. Calls callback_fn with the given arguments
2. If callback_fn raises any exception, catches it and returns the string "callback_error: " followed by the exception message
3. If callback_fn succeeds, returns its return value
This pattern prevents a broken callback from crashing the main application.
Frequently Asked Questions
Can I use callbacks with LangChain Expression Language (LCEL) chains?
Yes. LCEL chains accept callbacks through the same config parameter. The chain-level events (on_chain_start, on_chain_end) fire for the overall chain, while LLM events fire for each model call within it:
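A wiring sketch, assuming a LoggingHandler like the one from the logging section and a configured API key:

```python
import os

if os.environ.get("OPENAI_API_KEY"):  # runs only when a key is configured
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_template("Summarize: {text}")
    chain = prompt | ChatOpenAI(model="gpt-4o-mini")

    # Chain-level and LLM-level events both reach the handler
    chain.invoke({"text": "LangChain callbacks hook into every step."},
                 config={"callbacks": [LoggingHandler()]})
```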
How do I access the original prompt text inside on_llm_end?
The on_llm_end method doesn't receive the prompt directly. The cleanest approach is to store it in on_llm_start and retrieve it in on_llm_end using run_id:
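A sketch of that pattern; the import fallback and the 60-character truncation are my additions:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain
    class BaseCallbackHandler:
        pass

class PromptCapturingHandler(BaseCallbackHandler):
    def __init__(self):
        self._prompts = {}  # run_id -> prompt text

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._prompts[run_id] = prompts[0]

    def on_llm_end(self, response, *, run_id, **kwargs):
        # pop() both retrieves the prompt and prevents unbounded growth
        prompt = self._prompts.pop(run_id, "<unknown>")
        print(f"prompt was: {prompt[:60]}")
```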
What is the difference between callbacks and LangSmith?
LangSmith is LangChain's managed observability platform. Under the hood, it uses callbacks to capture traces. Callbacks give you full control — you decide what to log, where to store it, and how to process it. LangSmith provides a ready-made dashboard with trace visualization, but requires sending data to LangChain's servers. Use callbacks when you need custom logic or must keep data on-premises. Use LangSmith when you want quick setup with a hosted UI.
Complete Code
References
Python logging module documentation: https://docs.python.org/3/library/logging.html