Context at Scale: What to Keep, What to Cut
At production scale, context window management becomes a hard engineering problem. Every token costs money and adds latency. Here is how to build context management that holds up in production.
In development, you accumulate context and everything works. You have a handful of test conversations, they run cleanly, and you ship. In production, with thousands of users running long sessions, the context window becomes the most expensive line in your infrastructure budget and the most common source of silent quality degradation. This post is about how to manage it systematically.
Why Naive Context Accumulation Breaks Production Apps
The naive approach is to accumulate everything. Append each user message and assistant response to a list. Send the whole list on every turn. This works perfectly until it does not. Context windows have hard token limits. When you hit the limit, the API call fails with an error that is confusing to debug if you have not planned for it.
Cost is the most visible problem. In a 40-turn conversation at 500 tokens per turn, the context grows to 20,000 tokens by the final call. Because every call resends the full history, the session as a whole consumes roughly 410,000 input tokens, not 20,000. At scale, that cost multiplies across every active user. A product with 1,000 daily active users in long sessions can spend several times more on tokens than it needs to, purely because context is not being managed.
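To make the arithmetic concrete, here is a rough sketch. It assumes one API call per turn and a flat 500 tokens added per turn, both simplifications, and the function names are illustrative rather than part of any library.

```python
def session_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across a session when every call resends the full history."""
    # Call k sends k turns of history, so the total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


def windowed_input_tokens(turns: int, tokens_per_turn: int, window: int) -> int:
    """Same total when each call sends at most `window` recent turns."""
    return sum(tokens_per_turn * min(k, window) for k in range(1, turns + 1))


full = session_input_tokens(40, 500)
capped = windowed_input_tokens(40, 500, window=8)
print(full, capped, round(full / capped, 1))  # 410000 146000 2.8
```

Under these assumptions, capping the history at 8 turns cuts input tokens for the session by a factor of about 2.8. The gap widens as sessions get longer, because the unmanaged total grows quadratically with the number of turns while the capped total grows linearly.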
Latency is the second problem. Longer contexts mean slower responses. Every additional thousand tokens of context adds measurable latency to inference, and users notice delays above about 2 seconds for the first token. Trimming unnecessary context directly reduces time-to-first-token, which is the latency metric that matters most for perceived responsiveness.
Coherence degradation is the subtlest problem. Research on lost-in-the-middle effects shows that models pay less attention to content in the middle of a long context window than to content at the beginning and end. Counterintuitively, trimming the context to the most relevant parts often improves response quality.
The Four Strategies for Context Management
Sliding window: Keep only the most recent N turns. This is the simplest strategy and handles the majority of conversational assistant use cases. The downside is loss of early context. If the user set up a critical constraint in turn 1 (their budget, their role, the product they are asking about) and that turn falls out of the window, the model loses the original framing.
Summarization: Before dropping old turns, compress them into a summary. Inject the summary as a synthetic message at the start of the conversation history. This preserves the gist of early context at a fraction of the token cost. Summarization is more expensive to implement but substantially more robust for long sessions where early context matters.
Compression: Identify and remove low-value content from messages before including them in context. Boilerplate greetings, repeated confirmations, verbose tool output that has already been acted on, and filler phrases can all be stripped or condensed.
Selective retrieval: Do not include full conversation history at all. Index the conversation as it accumulates and retrieve only the turns most relevant to the current message. This handles arbitrarily long sessions without degrading quality, but requires embedding infrastructure and good retrieval quality to work reliably.
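Selective retrieval is the least obvious of the four strategies, so here is a minimal sketch of the idea. It uses word-count cosine similarity as a toy stand-in for real embeddings, and the names `retrieve_relevant`, `top_k`, and `keep_recent` are illustrative assumptions, not part of any library.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_relevant(messages, query, top_k=3, keep_recent=2):
    """Return the `top_k` older messages most similar to `query`,
    followed by the last `keep_recent` messages verbatim."""
    split = max(len(messages) - keep_recent, 0)
    older, recent = messages[:split], messages[split:]
    q = _vectorize(query)
    scored = [(_cosine(_vectorize(m["content"]), q), i) for i, m in enumerate(older)]
    # Keep only turns with some overlap, highest score first.
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    # Restore chronological order so the model sees turns in sequence.
    chosen = sorted(i for _, i in ranked[:top_k])
    return [older[i] for i in chosen] + recent


history = [
    {"role": "user", "content": "My budget is 500 dollars for a laptop."},
    {"role": "assistant", "content": "Noted. Any preferred screen size?"},
    {"role": "user", "content": "Fourteen inches. Also, what is the weather like in Lisbon?"},
    {"role": "assistant", "content": "Lisbon is mild and sunny most of the year."},
    {"role": "user", "content": "Thanks."},
    {"role": "assistant", "content": "You are welcome."},
]
selected = retrieve_relevant(history, "Which laptop fits my budget?", top_k=1)
```

Here the budget statement from the first turn is retrieved even though it is far outside the recent window. A production version would likely swap in a real embedding model and retrieve user and assistant turns as pairs, so the selected history still reads as a coherent exchange.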
Implementing a Context Budget in Code
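A context budget is a hard ceiling on the tokens your context will contain, enforced before every API call. The right place to enforce it is at message insertion time, not at call time: if the stored history is trimmed whenever a message is added, it can never exceed the budget by the time a call is made. The class below estimates tokens at about four characters per token, which is a coarse heuristic; use a real tokenizer when you need accurate counts.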
Prompt
"When building a production AI feature with multi-turn conversations, define a token budget before writing your first API call. Decide: what is the maximum context I will send on any single call? What happens when I hit that limit? Implement the enforcement logic first, then build the conversation logic on top of it."
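class ContextManager:
    """Holds conversation state and keeps it under a fixed token budget."""

    def __init__(self, max_tokens=8000, chars_per_token=4):
        self.max_tokens = max_tokens
        # Rough heuristic: about 4 characters per token for English text.
        self.chars_per_token = chars_per_token
        self.system_prompt = ""
        self.messages = []

    def set_system(self, prompt):
        self.system_prompt = prompt

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Enforce the budget at insertion time so stored history never exceeds it.
        self._enforce_budget()

    def token_count(self):
        # Estimated tokens for the system prompt plus every stored message.
        total = len(self.system_prompt) // self.chars_per_token
        for m in self.messages:
            total += len(m["content"]) // self.chars_per_token
        return total

    def _enforce_budget(self):
        # Drop the oldest messages first, always keeping the newest one.
        # Caveat: trimming can leave an assistant message first; adjust if
        # your API needs the history to start with a user turn.
        while self.token_count() > self.max_tokens:
            if len(self.messages) <= 1:
                break
            self.messages.pop(0)

    def build_payload(self):
        return {
            "system": self.system_prompt,
            "messages": self.messages
        }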
For the summarization strategy, extend the manager to compress old turns with a lightweight model call:
import anthropic


def compress_old_turns(client: anthropic.Anthropic, messages, keep_recent=6):
    """Replace all but the last `keep_recent` messages with a short summary."""
    if len(messages) <= keep_recent:
        return messages

    old_turns = messages[:-keep_recent]
    recent_turns = messages[-keep_recent:]

    # Truncate each old message to 200 characters to keep the summary call cheap.
    history_text = " | ".join(
        f"{m['role']}: {m['content'][:200]}"
        for m in old_turns
    )
    summary_prompt = (
        "Summarize this conversation history in 2-3 sentences. "
        "Focus on decisions made, facts established, and the user goal. "
        f"History: {history_text}"
    )

    # Use a small, fast model for the summary.
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=250,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    summary = response.content[0].text

    # Inject the summary as a synthetic message at the start of the history.
    pinned = {
        "role": "user",
        "content": f"[Earlier conversation summary: {summary}]"
    }
    return [pinned] + recent_turns
What to Keep vs Cut for Different Task Types
Conversation tasks (customer support, chat assistant): Keep the most recent 8 to 10 turns verbatim. Summarize anything older. The last few turns almost always contain the information needed for response coherence. Earlier turns matter mainly for persistent facts, which a good summary captures.
Task execution (writing, coding, analysis): Always pin the original task description. This is the highest-priority context and must never be dropped. Keep the most recent tool call results. Intermediate steps that have already produced their output can be summarized or removed.
Document analysis: The document itself is the most important context. Trim conversation history aggressively to make room for it. The user's question plus the relevant document chunk is more valuable than ten turns of prior conversation about a different section.
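As an illustration of the task-execution guidance above, the following sketch pins the first message (assumed to be the task description) and condenses all but the most recent tool results. The `"kind": "tool_result"` marker, the placeholder text, and the function name are hypothetical; adapt them to whatever message schema your application uses.

```python
def trim_task_history(
    messages,
    keep_recent_results=2,
    placeholder="[tool output omitted: already acted on]",
):
    """Condense stale tool results while leaving the first message untouched.

    Assumes each message is a dict with "role" and "content", and that tool
    results carry a hypothetical "kind": "tool_result" marker.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m.get("kind") == "tool_result"]
    if keep_recent_results > 0:
        stale = set(tool_indexes[:-keep_recent_results])
    else:
        stale = set(tool_indexes)

    trimmed = []
    for i, m in enumerate(messages):
        if i in stale and i != 0:
            # Copy the message so the caller's history is not mutated.
            trimmed.append({**m, "content": placeholder})
        else:
            trimmed.append(m)
    return trimmed


history = [
    {"role": "user", "content": "Refactor utils.py to remove duplicate helpers."},
    {"role": "user", "kind": "tool_result", "content": "x" * 5000},
    {"role": "assistant", "content": "Found three duplicates."},
    {"role": "user", "kind": "tool_result", "content": "y" * 5000},
    {"role": "user", "kind": "tool_result", "content": "z" * 5000},
]
trimmed = trim_task_history(history, keep_recent_results=2)
```

In this example only the oldest tool result is replaced by the placeholder; the task description and the two latest results stay verbatim. Whether a placeholder, a summary, or outright removal is the right treatment for stale output depends on the task, so treat this as a starting point.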
The meta-principle across all task types: rank context by relevance to the current inference and enforce a hard budget. The system prompt is always relevant. The user's current message is always relevant. Everything else is a tradeoff between completeness and cost, and that tradeoff should be made deliberately by your code rather than accidentally by context accumulation.
Aki Wijesundara