Build a Chatbot with Memory: Buffer, Window, and Summary Memory in LangChain
Your LangChain chatbot answers questions perfectly — until the user says "wait, go back to what you said earlier." The bot has no idea what it said earlier. Every chain invocation starts from scratch, and without explicit memory management, even the most sophisticated prompt template produces a bot with amnesia.
By the end of this tutorial, you'll know three distinct memory strategies — buffer, window, and summary — and exactly when each one is the right choice. You'll also learn the modern RunnableWithMessageHistory pattern that replaced the legacy ConversationChain.
Why LLMs Forget — and How LangChain Memory Fixes It
Every LLM API call is stateless. You send a list of messages, the model generates a response, and then it forgets everything. What feels like "memory" in ChatGPT is actually the application resending the entire conversation with every request.
I spent an embarrassing amount of time debugging a chatbot that "forgot" user preferences before realizing the issue wasn't the model — it was my code. I was creating a fresh message list on every call instead of appending to a persistent one. LangChain's memory module exists to solve exactly this problem: it manages conversation history so you don't have to track it manually.
Here's a bare-bones chatbot with no memory. Two calls, and the second one has no idea about the first:
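A minimal sketch of the problem, assuming `langchain-openai` is installed and `OPENAI_API_KEY` is set (the model name is illustrative):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Call 1: the user introduces themselves
llm.invoke("Hi, my name is Priya.")

# Call 2: a fresh, stateless request — the model never saw call 1
response = llm.invoke("What's my name?")
print(response.content)
```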
The second call returns something like "I don't have access to your personal information." The model genuinely doesn't know — it never saw the first message.
LangChain provides three memory classes that solve this at different levels of sophistication. Each one answers the same question differently: what should we include in the message history?
| Memory Type | What It Stores | Token Usage | Best For |
|---|---|---|---|
| ConversationBufferMemory | Every message, verbatim | Grows linearly | Short conversations (<20 turns) |
| ConversationBufferWindowMemory | Last k message pairs | Fixed ceiling | Medium conversations with recent-context focus |
| ConversationSummaryMemory | Running summary of the conversation | Roughly constant | Long conversations (50+ turns) |
ConversationBufferMemory — Remember Everything
This is the simplest memory strategy and the one I reach for first when prototyping. ConversationBufferMemory stores every message in the conversation — every user input and every AI response — and resends all of them on every call. Nothing is trimmed, nothing is summarized.
The setup requires five pieces: the model, a prompt template with a placeholder for chat history, the chain, a session store, and the wrapper that ties them together. Here's the complete working example:
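A sketch of the full setup — the names `session_store`, `get_session_history`, and `chatbot` are this example's own choices:

```python
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI

# Piece 1: the model
llm = ChatOpenAI(model="gpt-4o-mini")

# Piece 2: a prompt with a placeholder where history will be injected
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

# Piece 3: the bare chain, composed with LCEL
chain = prompt | llm

# Piece 4: session storage — a dict mapping session IDs to histories
session_store = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in session_store:
        session_store[session_id] = ChatMessageHistory()
    return session_store[session_id]

# Piece 5: the wrapper that loads history before each call and saves after
chatbot = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)
```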
Five pieces, each with a clear job. The session_store dictionary maps session IDs to ChatMessageHistory objects. RunnableWithMessageHistory handles the plumbing: before each call it loads the history, after each call it saves the new messages. You never touch the history list directly.
The config dictionary is how you tell the chatbot which conversation session to use. Every invocation needs it:
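Assuming a wrapper named `chatbot` built as just described, three turns might look like this:

```python
config = {"configurable": {"session_id": "user-001"}}

# Turn 1: introduce a fact
print(chatbot.invoke({"input": "Hi, I'm Priya."}, config=config).content)

# Turn 2: introduce another fact
print(chatbot.invoke({"input": "I'm learning Python decorators."}, config=config).content)

# Turn 3: ask about both — the full history is resent, so the model can answer
print(chatbot.invoke({"input": "What's my name and what am I learning?"}, config=config).content)
```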
Turn 3 works because the entire conversation — all six messages (three from the user, three from the assistant) — gets sent to the model. The bot can answer "Your name is Priya and you're learning decorators" because it literally sees both facts in the history.
You can inspect what's stored at any time by pulling the session history:
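For example, assuming the `get_session_history` helper described above:

```python
history = get_session_history("user-001")
for message in history.messages:
    print(f"{message.type}: {message.content}")
```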
After three turns, you'll see six messages — alternating human and ai types. This is the full, uncompressed record of the conversation.
ConversationBufferWindowMemory — Keep Only the Last k Turns
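For context, the legacy class (deprecated in LangChain 0.2+, shown only for illustration) exposed the window as a single parameter:

```python
# Legacy API — kept here so the k parameter below has a concrete referent
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=5)
```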
The k=5 parameter means "keep the last 5 exchange pairs" (10 messages total — 5 human + 5 AI). Once the conversation exceeds 5 turns, the oldest pair gets dropped. The model never sees turn 1 when you're on turn 7.
This is the strategy I use for most production chatbots. The tradeoff is explicit: you lose distant context but gain predictable token costs and a guarantee you'll never blow past the context window. For customer support bots and FAQ assistants, the last 5-10 turns contain everything the model needs.
The modern approach uses RunnableWithMessageHistory with a custom trimming function instead of the legacy memory class. Here's how to implement window-style memory with the current LangChain patterns:
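One way to wire this up — `token_counter=len` is this sketch's trick for making `max_tokens` count messages rather than tokens, and it assumes the `prompt`, `llm`, and `get_session_history` pieces described in the buffer section:

```python
from langchain_core.messages import trim_messages
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory

# Keep roughly the last 5 exchange pairs: 10 messages with each
# message counted as 1 "token" by token_counter=len.
trimmer = trim_messages(
    strategy="last",
    max_tokens=10,
    token_counter=len,
    include_system=True,  # never trim the system prompt away
    start_on="human",     # trimmed history always starts on a human message
)

# Trim the injected history before the prompt template sees it
chain = (
    RunnablePassthrough.assign(history=lambda x: trimmer.invoke(x["history"]))
    | prompt
    | llm
)

chatbot = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)
```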
The trim_messages utility trims the history before it reaches the prompt. Setting strategy="last" keeps the most recent messages and discards older ones. The include_system=True flag ensures the system prompt is never trimmed away — a subtle but important detail.
One gotcha with window memory: if the user refers to something from turn 1 and you're on turn 15 with k=5, the model will confidently hallucinate or say it doesn't know. It genuinely doesn't — that message was trimmed. For some applications this is fine. For others, you need summary memory.
ConversationSummaryMemory — Compress the Past Into a Summary
What if the conversation is 50 turns long and the user references something from turn 3? Window memory lost it. Buffer memory would work but costs a fortune in tokens. Summary memory offers a third option: instead of storing raw messages, it asks the LLM to maintain a running summary of the conversation.
After each exchange, the summary gets updated to include the new information. The model receives this compressed summary plus the most recent messages, giving it both distant context and recent detail.
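The legacy `ConversationSummaryMemory` class (deprecated, but the clearest illustration of the idea) can be exercised directly, assuming an `llm` chat model — note that it calls the LLM itself to update the summary:

```python
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm)
memory.save_context(
    {"input": "Hi, I'm Priya, a data engineer at Spotify."},
    {"output": "Nice to meet you, Priya! How can I help?"},
)
# The stored "history" is now an LLM-written summary, not raw messages
print(memory.load_memory_variables({})["history"])
```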
Instead of storing six messages, the memory holds a condensed summary like: "The user is Priya, a data engineer at Spotify who needs to build a real-time ETL pipeline using Kafka with Avro schema evolution for backward compatibility." Three turns compressed into one sentence.
The real power of summary memory shows up in long conversations. After 50 turns, buffer memory is sending 100+ messages per call. Summary memory is still sending one paragraph plus the latest exchange.
Here's a comparison of what the LLM receives at turn 20 under each strategy:
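A back-of-envelope sketch in plain Python — the per-message and summary token counts are illustrative assumptions, not measurements:

```python
# Assumed costs: ~40 tokens per message, ~150-token running summary
TOKENS_PER_MESSAGE = 40
SUMMARY_TOKENS = 150

def tokens_at_turn(turn: int, strategy: str, k: int = 5) -> int:
    """Estimate the history tokens sent to the LLM at a given turn."""
    messages = turn * 2  # one human + one AI message per turn
    if strategy == "buffer":
        return messages * TOKENS_PER_MESSAGE          # everything, verbatim
    if strategy == "window":
        return min(messages, k * 2) * TOKENS_PER_MESSAGE  # last k pairs only
    if strategy == "summary":
        return SUMMARY_TOKENS + 2 * TOKENS_PER_MESSAGE    # summary + last pair
    raise ValueError(f"unknown strategy: {strategy}")

for strategy in ("buffer", "window", "summary"):
    print(f"{strategy:>8}: ~{tokens_at_turn(20, strategy)} tokens")
```

With these assumed numbers, turn 20 costs about 1600 tokens under buffer memory, 400 under window memory, and 230 under summary memory.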
The exact numbers depend on message length, but the pattern is consistent: buffer memory grows without bound, window memory plateaus, and summary memory stays nearly flat.
The gap widens as conversations get longer. At turn 100, buffer memory sends roughly 10,000 tokens per call while summary memory stays around 300.
RunnableWithMessageHistory — The Modern Pattern
If you've read LangChain tutorials from 2023, you'll see ConversationChain everywhere. That class is now deprecated. The replacement is RunnableWithMessageHistory, which is more flexible, composable, and fits naturally into LangChain's expression language (LCEL).
We've already used RunnableWithMessageHistory in the buffer memory section. Here I want to break down why it works the way it does and show the complete pattern with all the moving parts labeled.
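The wrapper call, with each argument annotated — `chain` and `get_session_history` are assumed from the buffer section:

```python
from langchain_core.runnables.history import RunnableWithMessageHistory

chatbot = RunnableWithMessageHistory(
    chain,                           # any LCEL runnable, e.g. prompt | llm
    get_session_history,             # callable: session_id -> message history
    input_messages_key="input",      # which input dict key is the user message
    history_messages_key="history",  # must match the MessagesPlaceholder name
)

# The session is chosen at invocation time, not construction time
config = {"configurable": {"session_id": "user-001"}}
chatbot.invoke({"input": "Hello!"}, config=config)
```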
Three things to notice. First, input_messages_key tells the wrapper which part of your input dictionary is the user's message — this is what gets saved to history. Second, history_messages_key must match the variable_name in your MessagesPlaceholder. Third, the session ID comes from the config at invocation time, not at construction time.
Multiple Sessions Running Simultaneously
A real chatbot serves many users at once. The session ID is how you keep their conversations separate. Each user gets their own history:
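For example, with the same `chatbot` wrapper:

```python
priya_config = {"configurable": {"session_id": "user-001"}}
carlos_config = {"configurable": {"session_id": "user-002"}}

chatbot.invoke({"input": "Hi, I'm Priya."}, config=priya_config)
chatbot.invoke({"input": "Hi, I'm Carlos."}, config=carlos_config)

# Each session sees only its own history — the model answering
# Priya's question never sees Carlos's messages, and vice versa
chatbot.invoke({"input": "What's my name?"}, config=priya_config)
chatbot.invoke({"input": "What's my name?"}, config=carlos_config)
```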
The store dictionary now has two entries: "user-001" with Priya's history and "user-002" with Carlos's history. In production, you'd replace the dictionary with Redis, a database, or another persistent backend.
Write a function get_or_create_session that takes a session_id (string) and a store (dictionary) as arguments. If the session ID exists in the store, return the existing list. If not, create a new empty list, add it to the store, and return it.
Then simulate two sessions:
1. Add the message "Hello from Priya" to session "s1"
2. Add the message "Hello from Carlos" to session "s2"
3. Add the message "Follow-up from Priya" to session "s1"
Finally, print the length of session "s1" and session "s2" on separate lines.
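One possible solution:

```python
def get_or_create_session(session_id: str, store: dict) -> list:
    # Return the existing history list, or register a new empty one
    if session_id not in store:
        store[session_id] = []
    return store[session_id]

store = {}
get_or_create_session("s1", store).append("Hello from Priya")
get_or_create_session("s2", store).append("Hello from Carlos")
get_or_create_session("s1", store).append("Follow-up from Priya")

print(len(store["s1"]))  # 2
print(len(store["s2"]))  # 1
```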
Combining Strategies — Summary Plus Recent Messages
In production, the most effective pattern combines summary and window memory. You keep the last k messages verbatim for detailed recent context, and prepend a summary of everything older for long-range awareness. This gives the model both precision (exact recent messages) and breadth (compressed older context).
LangChain provides ConversationSummaryBufferMemory for exactly this. It stores raw messages until the total token count exceeds a threshold, then summarizes the older messages and keeps the recent ones intact:
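A sketch using the legacy class (deprecated, but it maps directly onto the hybrid idea), assuming an `llm` chat model:

```python
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,              # the summarizer — reuses the chat model
    max_token_limit=300,  # summarize once history exceeds 300 tokens
)
memory.save_context(
    {"input": "Hi, I'm Priya, a data engineer at Spotify."},
    {"output": "Nice to meet you, Priya!"},
)
memory.save_context(
    {"input": "I need a real-time ETL pipeline with Kafka."},
    {"output": "Kafka is a great fit for that."},
)
# Below the token limit: raw messages. Above it: summary + recent messages.
print(memory.load_memory_variables({}))
```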
With only two turns, the token count is below 300, so you'll see the raw messages. Add 5-6 more turns and the older messages get compressed into a summary while recent ones stay verbatim.
Persistent Memory — Surviving Server Restarts
Everything we've built so far uses a Python dictionary for session storage. Restart the server and all conversations disappear. For production chatbots, you need persistent storage. LangChain supports several backends through ChatMessageHistory implementations.
The architecture stays the same — you just swap the get_session_history function. Here's how it looks with Redis as the backend:
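A sketch assuming the `langchain-community` package and a Redis server at `localhost:6379`:

```python
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

def get_session_history(session_id: str) -> RedisChatMessageHistory:
    # Histories now live in Redis and survive server restarts
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379")

# The chain and wrapper are unchanged — only the storage backend differs
chatbot = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)
```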
That's the elegance of the RunnableWithMessageHistory design. Your chain, prompt, and invocation code stay identical. Only the storage backend changes. LangChain's community package includes implementations for Redis, PostgreSQL, MongoDB, DynamoDB, Firestore, and more.
Here's a quick reference for the most common backends:
| Backend | Import | When to Use |
|---|---|---|
| In-memory dict | ChatMessageHistory | Development and testing |
| Redis | RedisChatMessageHistory | Low-latency production apps |
| PostgreSQL | PostgresChatMessageHistory | When you already have a Postgres DB |
| MongoDB | MongoDBChatMessageHistory | Document-oriented storage needs |
| SQLite | SQLChatMessageHistory | Single-server apps, local persistence |
Write a function trim_to_window(messages: list, k: int) -> list that takes a list of messages and returns only the last k messages. If the list has fewer than k messages, return all of them.
Then test it:
1. Create a list of 8 messages: ["msg1", "msg2", ..., "msg8"]
2. Trim to window size 5 and print the result
3. Trim the original list to window size 20 and print the result
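One possible solution, using Python's slicing:

```python
def trim_to_window(messages: list, k: int) -> list:
    if k <= 0:
        return []
    # Slicing never over-runs: messages[-20:] on an 8-item list returns all 8
    return messages[-k:]

messages = [f"msg{i}" for i in range(1, 9)]
print(trim_to_window(messages, 5))   # ['msg4', 'msg5', 'msg6', 'msg7', 'msg8']
print(trim_to_window(messages, 20))  # all 8 messages
```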
Real-World Example: Customer Support Bot with Memory
Let's put everything together into a customer support chatbot that caps history with message trimming, handles multiple sessions, and includes a system prompt tailored for support interactions. This is close to what I've deployed in production.
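Here's a sketch of how such a bot might be wired — the model name, session ID, system prompt, and sample dialogue are all illustrative:

```python
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.messages import trim_messages
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a patient customer support agent for an online store. "
               "Use details the customer has already given; never ask twice."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

# Keep at most the last 20 messages (token_counter=len counts messages)
trimmer = trim_messages(strategy="last", max_tokens=20,
                        token_counter=len, include_system=True)

chain = (
    RunnablePassthrough.assign(history=lambda x: trimmer.invoke(x["history"]))
    | prompt
    | llm
)

store = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    return store.setdefault(session_id, ChatMessageHistory())

support_bot = RunnableWithMessageHistory(
    chain, get_session_history,
    input_messages_key="input", history_messages_key="history",
)

config = {"configurable": {"session_id": "customer-42"}}
support_bot.invoke({"input": "Hi, my order TS-78432 hasn't arrived."}, config=config)
support_bot.invoke({"input": "I paid for express shipping, too."}, config=config)
support_bot.invoke({"input": "It was supposed to arrive Tuesday."}, config=config)
support_bot.invoke({"input": "So what are my options?"}, config=config)
```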
The chain combines trimming with the prompt and LLM in a single LCEL pipeline. The trimmer runs first, cutting the history to 20 messages before the prompt template formats everything.
Each turn builds on the previous context. By turn 4, the bot knows the customer's order number (TS-78432), that they paid for express shipping, and that the package hasn't arrived. Without memory, each turn would start from zero and the customer would have to repeat everything.
You can inspect the full history to verify what's been stored:
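For example, assuming an in-memory `store` dict keyed by session ID:

```python
for message in store["customer-42"].messages:
    print(f"[{message.type}] {message.content}")
```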
Common Mistakes and How to Fix Them
After building several LangChain chatbots, these are the bugs I see most often — including ones I've made myself.
Mistake 1: Forgetting the Config on Invocation
```python
# Wrong — raises a ValueError about missing "session_id"
response = chatbot.invoke({"input": "Hello"})
```

```python
# Right — always pass a config with the session_id
config = {"configurable": {"session_id": "user-001"}}
response = chatbot.invoke({"input": "Hello"}, config=config)
```

RunnableWithMessageHistory requires a session ID to know which conversation to load. Without the config, the call fails with a ValueError about missing configurable fields.
Mistake 2: Mismatched Placeholder and Key Names
```python
# Wrong — the placeholder name and history_messages_key don't match
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are helpful."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
])
bot = RunnableWithMessageHistory(
    prompt | llm,
    get_history,
    input_messages_key="input",
    history_messages_key="history",  # WRONG: doesn't match "chat_history"
)
```

```python
# Right — both names are "history"
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are helpful."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
bot = RunnableWithMessageHistory(
    prompt | llm,
    get_history,
    input_messages_key="input",
    history_messages_key="history",  # Matches "history" above
)
```

This is the sneakiest bug because it doesn't raise an error. The history placeholder just receives an empty list, and the bot behaves as if it has no memory. I've debugged this for other developers at least five times.
Mistake 3: Using the Deprecated ConversationChain
If you see from langchain.chains import ConversationChain in a tutorial, that tutorial is outdated. ConversationChain was deprecated in LangChain 0.2.7. The modern replacement is the RunnableWithMessageHistory pattern shown throughout this tutorial.
```python
# Deprecated — this still works, but it's the old pattern
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

chain = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory(),
)
```

```python
# Modern replacement — compose with LCEL, then wrap with history
from langchain_core.runnables.history import RunnableWithMessageHistory

chain = prompt | llm
bot = RunnableWithMessageHistory(
    chain,
    get_history,
    input_messages_key="input",
    history_messages_key="history",
)
```

Write a function summarize_conversation(messages: list[dict]) -> str that takes a list of message dictionaries (each with "role" and "content" keys) and returns a one-line summary string.
The summary should follow this format: "{n} messages: {roles}" where {n} is the total message count and {roles} is a comma-separated list of unique roles in the order they first appear.
Test with:
1. A conversation with 4 messages (roles: user, assistant, user, assistant)
2. A conversation with 1 message (role: system)
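One possible solution:

```python
def summarize_conversation(messages: list[dict]) -> str:
    # Collect unique roles in order of first appearance
    roles = []
    for message in messages:
        if message["role"] not in roles:
            roles.append(message["role"])
    return f"{len(messages)} messages: {', '.join(roles)}"

convo = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Help me with decorators"},
    {"role": "assistant", "content": "Sure."},
]
print(summarize_conversation(convo))  # 4 messages: user, assistant
print(summarize_conversation([{"role": "system", "content": "Be terse."}]))  # 1 messages: system
```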
Frequently Asked Questions
Can I use memory with streaming responses?
Yes. RunnableWithMessageHistory works with both .invoke() and .stream(). The wrapper saves messages to history after the stream completes:
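For example, assuming the `chatbot` wrapper from earlier:

```python
config = {"configurable": {"session_id": "user-001"}}

for chunk in chatbot.stream({"input": "Explain decorators briefly."}, config=config):
    print(chunk.content, end="", flush=True)
# Once the stream ends, both the question and the full answer are in history
```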
How do I clear a session's memory?
Call .clear() on the ChatMessageHistory object for that session:
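Assuming the `get_session_history` helper from earlier:

```python
# Wipes the stored messages; the session ID remains valid for future turns
get_session_history("user-001").clear()
```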
What happens when the conversation exceeds the model's context window?
With buffer memory, you get an API error when the total tokens (system prompt + history + new message) exceed the model's limit. GPT-4o-mini supports 128K tokens, so this is rare for text-only conversations. For models with smaller context windows (8K-32K), use window memory or summary buffer memory to prevent hitting the limit.
Can I add metadata to messages (timestamps, user roles)?
LangChain messages have an additional_kwargs field for arbitrary metadata. You can add timestamps, user IDs, or any other data without affecting how the LLM processes the message:
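A sketch — the metadata keys (`timestamp`, `user_id`) are this example's own choices, and it assumes the `get_session_history` helper from earlier:

```python
from datetime import datetime, timezone
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content="Where is my order?",
    additional_kwargs={
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": "user-001",
    },
)
# Stored alongside the message; the LLM prompt only uses message.content
get_session_history("user-001").add_message(message)
```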