This guide covers strategies for managing context in LLM agents, including truncation, compaction, eviction, and other techniques to work within context window limits while preserving important information.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          CONTEXT WINDOW MANAGEMENT                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Context Window (e.g., 272K tokens)                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ RESERVED (5%)     │ USABLE CONTEXT (95%)                             │   │
│  │ ┌───────────────┐ │ ┌─────────────────────────────────────────────┐ │   │
│  │ │ System prompt │ │ │ Conversation history + Tool outputs         │ │   │
│  │ │ Tool overhead │ │ │                                             │ │   │
│  │ │ Output buffer │ │ │ ← Managed by context management strategies  │ │   │
│  │ └───────────────┘ │ └─────────────────────────────────────────────┘ │   │
│  └───────────────────┴──────────────────────────────────────────────────┘  │
│                                                                             │
│  Strategies:                                                                │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────────────────┐   │
│  │ Truncation │ │ Eviction   │ │ Compaction │ │ Selective Retention    │   │
│  │ (per-item) │ │ (oldest)   │ │ (summarize)│ │ (priority-based)       │   │
│  └────────────┘ └────────────┘ └────────────┘ └────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
// From Codex's openai_model_info.rs
pub fn get_model_info(model_family: &ModelFamily) -> Option<ModelInfo> {
    match model_family.slug.as_str() {
        "gpt-4.1" => Some(ModelInfo::new(1_047_576)),    // ~1M tokens
        "gpt-5-codex" => Some(ModelInfo::new(272_000)),  // 272K tokens
        "gpt-4o" => Some(ModelInfo::new(128_000)),       // 128K tokens
        "o3" => Some(ModelInfo::new(200_000)),           // 200K tokens
        "gpt-3.5-turbo" => Some(ModelInfo::new(16_385)), // ~16K tokens
        _ => None,
    }
}
Not all of the context window is usable for conversation:
pub struct ModelFamily {
    /// Percentage of the context window considered usable for inputs
    /// after reserving headroom for system prompts, tool overhead, and output
    pub effective_context_window_percent: i64, // Default: 95%
}
// Calculate the effective window
pub fn get_model_context_window(&self) -> Option<i64> {
    let model_family = self.get_model_family();
    let percent = model_family.effective_context_window_percent;
    self.config
        .model_context_window
        .or_else(|| get_model_info(&model_family).map(|info| info.context_window))
        .map(|window| (window * percent) / 100)
}
pub struct TokenUsageInfo {
    /// Usage from the last API response
    pub last_token_usage: TokenUsage,
    /// Total accumulated usage
    pub total_token_usage: TokenUsage,
    /// Known context window size
    pub model_context_window: Option<i64>,
}
impl TokenUsageInfo {
    /// Tokens currently in the context
    pub fn tokens_in_context_window(&self) -> i64 {
        self.total_token_usage.input_tokens + self.total_token_usage.output_tokens
    }

    /// Percentage of context remaining (accounting for baseline overhead)
    pub fn percent_of_context_window_remaining(&self, context_window: i64) -> i64 {
        const BASELINE_TOKENS: i64 = 5_000; // Reserved for system overhead
        if context_window <= BASELINE_TOKENS {
            return 0;
        }
        let effective_window = context_window - BASELINE_TOKENS;
        let used = (self.tokens_in_context_window() - BASELINE_TOKENS).max(0);
        let remaining = effective_window - used;
        (remaining * 100) / effective_window
    }
}
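As a rough worked example mirroring the method above (the numbers are hypothetical): with a 272K-token window and 150K tokens already consumed, about 45% of the usable window remains.

let context_window: i64 = 272_000;
let used_tokens: i64 = 150_000;          // input + output so far (hypothetical)
let effective = context_window - 5_000;  // 267_000 usable after baseline overhead
let used = (used_tokens - 5_000).max(0); // 145_000 counted against it
assert_eq!((effective - used) * 100 / effective, 45);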
Truncation prevents individual items (especially tool outputs) from consuming too much context.
#[derive(Debug, Clone, Copy)]
pub enum TruncationPolicy {
    Bytes(usize),  // Truncate by byte count
    Tokens(usize), // Truncate by token estimate
}

// Model families have default policies
model_family!(
    "gpt-5-codex",
    truncation_policy: TruncationPolicy::Tokens(10_000),
);

model_family!(
    "gpt-4o",
    truncation_policy: TruncationPolicy::Bytes(10_000),
);
fn truncate_with_byte_estimate(s: &str, policy: TruncationPolicy) -> String {
    let max_bytes = policy.byte_budget();
    if s.len() <= max_bytes {
        return s.to_string();
    }
    // Split budget: half for the beginning, half for the end
    let (left_budget, right_budget) = (max_bytes / 2, max_bytes - max_bytes / 2);
    // Split on UTF-8 boundaries
    let (removed_chars, left, right) = split_string(s, left_budget, right_budget);
    // Create the truncation marker
    let marker = format!("…{removed_chars} chars truncated…");
    format!("{left}{marker}{right}")
}
Example output:
Total output lines: 5000
drwxr-xr-x 5 user user 160 Jan 1 12:00 .
drwxr-xr-x 3 user user 96 Jan 1 11:00 ..
-rw-r--r-- 1 user user 234 Jan 1 12:00 package.json
…4850 chars truncated…
-rw-r--r-- 1 user user 1234 Jan 1 12:00 README.md
-rw-r--r-- 1 user user 567 Jan 1 12:00 tsconfig.json
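The truncation function above leans on two helpers that are not shown here: policy.byte_budget() and split_string. A minimal sketch of what they might look like, assuming the same ~4-bytes-per-token heuristic used later in this guide (the real implementations may differ):

impl TruncationPolicy {
    /// Hypothetical helper: convert the policy into a byte budget.
    pub fn byte_budget(&self) -> usize {
        match self {
            TruncationPolicy::Bytes(n) => *n,
            TruncationPolicy::Tokens(n) => n.saturating_mul(4), // ~4 bytes per token
        }
    }
}

/// Hypothetical helper: keep `left_budget` bytes from the start and
/// `right_budget` bytes from the end, snapping to UTF-8 boundaries,
/// and report how many characters were dropped in between.
fn split_string(s: &str, left_budget: usize, right_budget: usize) -> (usize, &str, &str) {
    let mut left_end = left_budget.min(s.len());
    while !s.is_char_boundary(left_end) {
        left_end -= 1;
    }
    let mut right_start = s.len().saturating_sub(right_budget).max(left_end);
    while !s.is_char_boundary(right_start) {
        right_start += 1;
    }
    let removed_chars = s[left_end..right_start].chars().count();
    (removed_chars, &s[..left_end], &s[right_start..])
}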
Tool outputs are truncated when recorded into history:
impl ContextManager {
    pub fn record_items<I>(&mut self, items: I, policy: TruncationPolicy) {
        for item in items {
            // Process (potentially truncate) the item
            let processed = self.process_item(&item, policy);
            self.items.push(processed);
        }
    }

    fn process_item(&self, item: &ResponseItem, policy: TruncationPolicy) -> ResponseItem {
        match item {
            ResponseItem::FunctionCallOutput { call_id, output } => {
                // Truncate the text content
                let truncated = truncate_text(&output.content, policy);
                // Truncate structured content items too
                let truncated_items = output.content_items.as_ref().map(|items| {
                    truncate_function_output_items_with_policy(items, policy)
                });
                ResponseItem::FunctionCallOutput {
                    call_id: call_id.clone(),
                    output: FunctionCallOutputPayload {
                        content: truncated,
                        content_items: truncated_items,
                        success: output.success,
                    },
                }
            }
            // Other items pass through unchanged
            _ => item.clone(),
        }
    }
}
Users can override default truncation:
# config.toml
tool_output_token_limit = 50000  # Override default limit

impl TruncationPolicy {
    pub fn new(config: &Config, default_policy: TruncationPolicy) -> Self {
        if let Some(token_limit) = config.tool_output_token_limit {
            // User override takes precedence
            Self::Tokens(token_limit)
        } else {
            default_policy
        }
    }
}
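Assuming a Config value loaded from the file above, the user's limit wins over the model family's default (illustrative sketch):

// Hypothetical: config.tool_output_token_limit == Some(50_000)
let policy = TruncationPolicy::new(&config, TruncationPolicy::Tokens(10_000));
// => TruncationPolicy::Tokens(50_000)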
When context is full, the oldest items are removed first to preserve recent context.
impl ContextManager {
    pub fn remove_first_item(&mut self) {
        if !self.items.is_empty() {
            // Remove the oldest item (front of the list)
            let removed = self.items.remove(0);
            // Maintain invariants: remove orphaned call/output pairs
            normalize::remove_corresponding_for(&mut self.items, &removed);
        }
    }
}
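The normalize::remove_corresponding_for helper is not shown in this guide. A sketch of the assumed behavior, which may differ from the real implementation: evicting one half of a call/output pair also drops the other half so the model never sees a dangling call.

// Sketch (assumed behavior): when a FunctionCall is evicted, drop its
// now-orphaned output, and vice versa.
pub fn remove_corresponding_for(items: &mut Vec<ResponseItem>, removed: &ResponseItem) {
    let call_id = match removed {
        ResponseItem::FunctionCall { call_id, .. } => call_id.clone(),
        ResponseItem::FunctionCallOutput { call_id, .. } => call_id.clone(),
        _ => return, // nothing to pair with
    };
    items.retain(|item| match item {
        ResponseItem::FunctionCall { call_id: id, .. }
        | ResponseItem::FunctionCallOutput { call_id: id, .. } => *id != call_id,
        _ => true,
    });
}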
If compaction itself exceeds context, items are evicted iteratively:
async fn run_compact_task(...) {
    let mut truncated_count = 0;
    loop {
        let turn_input = history.get_history_for_prompt();
        match drain_to_completed(&sess, &turn_context, &prompt).await {
            Ok(()) => {
                // Success - exit the loop
                if truncated_count > 0 {
                    sess.notify_background_event(
                        format!("Trimmed {truncated_count} older items before compacting")
                    ).await;
                }
                break;
            }
            Err(e @ CodexErr::ContextWindowExceeded) => {
                if turn_input.len() > 1 {
                    // Remove the oldest item and retry
                    history.remove_first_item();
                    truncated_count += 1;
                    continue;
                }
                // Can't remove more - fail
                return Err(e);
            }
            Err(e) => return Err(e),
        }
    }
}
Compaction summarizes the conversation history to reduce token usage while preserving essential information.
Codex automatically triggers compaction when token usage exceeds a threshold:
// Default: 90% of the context window
const fn default_auto_compact_limit(context_window: i64) -> i64 {
    (context_window * 9) / 10
}

// In the main turn loop
let limit = turn_context
    .client
    .get_auto_compact_token_limit()
    .unwrap_or(i64::MAX);
let total_usage_tokens = sess.get_total_token_usage().await;
let token_limit_reached = total_usage_tokens >= limit;

if token_limit_reached {
    // Trigger compaction
    run_inline_auto_compact_task(sess.clone(), turn_context.clone()).await;
    continue; // Retry the turn with compacted history
}
The summarization prompt instructs the model how to compress:
You are performing a CONTEXT CHECKPOINT COMPACTION. Create a handoff
summary for another LLM that will resume the task.
Include:
- Current progress and key decisions made
- Important context, constraints, or user preferences
- What remains to be done (clear next steps)
- Any critical data, examples, or references needed to continue
Be concise, structured, and focused on helping the next LLM seamlessly
continue the work.
pub fn build_compacted_history(
    initial_context: Vec<ResponseItem>,
    user_messages: &[String],
    summary_text: &str,
) -> Vec<ResponseItem> {
    const MAX_USER_MESSAGE_TOKENS: usize = 20_000;

    let mut history = initial_context;
    let mut remaining = MAX_USER_MESSAGE_TOKENS;
    let mut selected_messages = Vec::new();

    // Keep recent user messages (working backwards for recency)
    for message in user_messages.iter().rev() {
        if remaining == 0 {
            break;
        }
        let tokens = approx_token_count(message);
        if tokens <= remaining {
            selected_messages.push(message.clone());
            remaining -= tokens;
        } else {
            // Truncate and include the partial message
            let truncated = truncate_text(message, TruncationPolicy::Tokens(remaining));
            selected_messages.push(truncated);
            break;
        }
    }
    selected_messages.reverse();

    // Add the preserved user messages
    for message in &selected_messages {
        history.push(ResponseItem::Message {
            role: "user".to_string(),
            content: vec![ContentItem::InputText { text: message.clone() }],
        });
    }

    // Add the summary as the final message
    history.push(ResponseItem::Message {
        role: "user".to_string(),
        content: vec![ContentItem::InputText {
            text: format!("{SUMMARY_PREFIX}\n{summary_text}"),
        }],
    });

    history
}
| Preserved | Discarded |
|---|---|
| Initial context (system prompt, AGENTS.md, environment) | Full tool call/output pairs |
| Recent user messages (up to 20K tokens) | Assistant reasoning |
| Summary of conversation | Intermediate tool outputs |
| Ghost snapshots (for undo) | Old user messages |
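Putting it together, here is a hypothetical call to build_compacted_history (the message text and summary are made up; the initial context comes from the session, as shown later in this guide):

// Hypothetical inputs
let initial_context: Vec<ResponseItem> = vec![/* system prompt, AGENTS.md, environment */];
let user_messages = vec![
    "Please port the config loader to TOML.".to_string(),
    "Keep backwards compatibility with the old JSON files.".to_string(),
];
let summary = "Ported half the loader; the JSON fallback is still untested.";

let new_history = build_compacted_history(initial_context, &user_messages, summary);
// new_history = initial context + recent user messages + one summary message,
// and it replaces the full transcript for subsequent turns.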
For some configurations, compaction uses a dedicated API endpoint:
pub fn should_use_remote_compact_task(session: &Session) -> bool {
    session
        .auth_manager
        .auth()
        .is_some_and(|auth| auth.mode == AuthMode::ChatGPT)
        && session.enabled(Feature::RemoteCompaction)
}

async fn run_remote_compact_task_inner_impl(...) -> CodexResult<()> {
    // Send the full history to the remote endpoint
    let new_history = turn_context
        .client
        .compact_conversation_history(&prompt)
        .await?;
    // Replace history with the compacted version
    sess.replace_history(new_history).await;
    sess.recompute_token_usage(turn_context).await;
    Ok(())
}
Claude Code provides a PreCompact hook for custom compaction logic:
{
  "PreCompact": [
    {
      "matcher": "*",
      "hooks": [
        {
          "type": "command",
          "command": "${CLAUDE_PLUGIN_ROOT}/hooks/pre-compact.sh"
        }
      ]
    }
  ]
}
This allows custom logic to run before the conversation is compacted.
The context manager maintains invariants to ensure valid conversation structure.
impl ContextManager {
    fn normalize_history(&mut self) {
        // Ensure all calls have outputs
        normalize::ensure_call_outputs_present(&mut self.items);
        // Remove orphaned outputs
        normalize::remove_orphan_outputs(&mut self.items);
    }
}

use std::collections::HashSet;

pub fn ensure_call_outputs_present(items: &mut Vec<ResponseItem>) {
    let mut call_ids_needing_output: HashSet<String> = HashSet::new();

    // First pass: collect all call IDs that still need an output
    for item in items.iter() {
        match item {
            ResponseItem::FunctionCall { call_id, .. } => {
                call_ids_needing_output.insert(call_id.clone());
            }
            ResponseItem::FunctionCallOutput { call_id, .. } => {
                call_ids_needing_output.remove(call_id);
            }
            _ => {}
        }
    }

    // Second pass: add placeholder outputs for the missing ones
    for call_id in call_ids_needing_output {
        items.push(ResponseItem::FunctionCallOutput {
            call_id,
            output: FunctionCallOutputPayload {
                content: "(no output recorded)".to_string(),
                success: Some(false),
                ..Default::default()
            },
        });
    }
}
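Its counterpart, remove_orphan_outputs, is referenced above but not shown. A sketch of the assumed behavior (the real helper may differ):

/// Sketch (assumed behavior): drop any FunctionCallOutput whose originating
/// FunctionCall is no longer present in the history.
pub fn remove_orphan_outputs(items: &mut Vec<ResponseItem>) {
    use std::collections::HashSet;

    let call_ids: HashSet<String> = items
        .iter()
        .filter_map(|item| match item {
            ResponseItem::FunctionCall { call_id, .. } => Some(call_id.clone()),
            _ => None,
        })
        .collect();

    items.retain(|item| match item {
        ResponseItem::FunctionCallOutput { call_id, .. } => call_ids.contains(call_id),
        _ => true,
    });
}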
Accurate token counting is expensive, so agents use approximations:
const APPROX_BYTES_PER_TOKEN: usize = 4;

pub fn approx_token_count(text: &str) -> usize {
    let len = text.len();
    len.saturating_add(APPROX_BYTES_PER_TOKEN - 1) / APPROX_BYTES_PER_TOKEN
}

pub fn approx_tokens_from_byte_count(bytes: usize) -> u64 {
    (bytes as u64 + 3) / 4
}
impl ContextManager {
    pub fn estimate_token_count(&self, turn_context: &TurnContext) -> Option<i64> {
        let model_family = turn_context.client.get_model_family();

        // Base tokens from the system prompt
        let base_tokens = approx_token_count(&model_family.base_instructions) as i64;

        // Sum tokens from all items
        let items_tokens = self.items.iter().fold(0i64, |acc, item| {
            acc + match item {
                // Skip internal items
                ResponseItem::GhostSnapshot { .. } => 0,
                // Estimate encrypted reasoning content from its length
                ResponseItem::Reasoning { encrypted_content: Some(content), .. } => {
                    estimate_reasoning_length(content.len())
                }
                // Serialize everything else and estimate from the JSON size
                item => {
                    let serialized = serde_json::to_string(item).unwrap_or_default();
                    approx_token_count(&serialized) as i64
                }
            }
        });

        Some(base_tokens.saturating_add(items_tokens))
    }
}
When the context window is exceeded, the agent must handle it gracefully:
pub enum CodexErr {
    ContextWindowExceeded,
    // ...
}

// Detection from the API response
if response.error.code == "context_length_exceeded" {
    return Err(CodexErr::ContextWindowExceeded);
}

// Handling
match result {
    Err(CodexErr::ContextWindowExceeded) => {
        // Mark token usage as full
        sess.set_total_tokens_full(turn_context.as_ref()).await;
        // Notify the user
        let event = EventMsg::Error(ErrorEvent {
            message: "Your input exceeds the context window. Please adjust and try again.".into(),
            code: Some("context_window_exceeded".into()),
        });
        sess.send_event(&turn_context, event).await;
    }
}
impl TokenUsageInfo {
    pub fn fill_to_context_window(&mut self, context_window: i64) {
        let previous_total = self.last_token_usage.total_tokens;
        let delta = (context_window - previous_total).max(0);

        self.model_context_window = Some(context_window);
        self.last_token_usage = TokenUsage {
            total_tokens: context_window,
            input_tokens: self.last_token_usage.input_tokens + delta,
            output_tokens: self.last_token_usage.output_tokens,
        };
    }
}
Images consume significant context. Special handling is needed:
impl ContextManager {
    pub fn replace_last_turn_images(&mut self, placeholder: &str) {
        if let Some(last_item) = self.items.last_mut() {
            match last_item {
                ResponseItem::Message { role, content, .. } if role == "user" => {
                    for item in content.iter_mut() {
                        if matches!(item, ContentItem::InputImage { .. }) {
                            *item = ContentItem::InputText {
                                text: placeholder.to_string(),
                            };
                        }
                    }
                }
                // Also handle function call outputs containing images
                ResponseItem::FunctionCallOutput { output, .. } => {
                    if let Some(items) = output.content_items.as_mut() {
                        for item in items.iter_mut() {
                            if matches!(item, FunctionCallOutputContentItem::InputImage { .. }) {
                                *item = FunctionCallOutputContentItem::InputText {
                                    text: placeholder.to_string(),
                                };
                            }
                        }
                    }
                }
                _ => {}
            }
        }
    }
}
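A hypothetical call site; the placeholder text here is an example, not the string Codex actually uses:

// Once a screenshot has served its purpose, swap it for a cheap text placeholder.
context_manager.replace_last_turn_images("[image removed to free context]");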
// Truncate every tool output immediately
let processed = truncate_text(&output, policy);

// But only compact when necessary (90% threshold)
if total_tokens >= auto_compact_limit {
    run_compaction();
}

// Keep recent user messages during compaction
for message in user_messages.iter().rev().take(MAX_RECENT) {
    preserved.push(message);
}

// After compaction, re-inject the initial context
let initial_context = sess.build_initial_context(turn_context);
let new_history = build_compacted_history(
    initial_context, // System prompt, AGENTS.md, environment
    &user_messages,
    &summary,
);

// Update after every API response
sess.update_token_usage_info(turn_context, token_usage.as_ref()).await;

// Check before expensive operations
let usage = sess.get_total_token_usage().await;
if usage * 100 >= limit * 80 {
    warn!("Approaching context limit: {}% used", (usage * 100) / limit);
}

// A single item that exceeds the context window on its own
if turn_input.len() == 1 && matches!(err, CodexErr::ContextWindowExceeded) {
    // Can't evict more - must fail
    return Err(err);
}

// Fall back to a placeholder when the summary comes back empty
let summary_text = if summary.is_empty() {
    "(no summary available)".to_string()
} else {
    summary
};

// Warn about compaction
sess.send_event(&turn_context, EventMsg::ContextCompacted(ContextCompactedEvent {})).await;

// Warn about the accuracy impact
sess.send_event(&turn_context, EventMsg::Warning(WarningEvent {
    message: "Long conversations and multiple compactions can reduce accuracy. \
              Start a new conversation when possible.".into(),
})).await;
Context management in LLM agents involves multiple complementary strategies:
| Strategy | When Applied | What It Does |
|---|---|---|
| Truncation | Per-item, at recording | Limits individual items to prevent bloat |
| Eviction | When compaction fails | Removes oldest items to make room |
| Compaction | At token threshold | Summarizes history to reduce size |
| Normalization | Before sending | Ensures valid conversation structure |
| Re-injection | After compaction | Restores critical initial context |
The key insight is that context management is a continuous process, not a one-time operation: agents must monitor token usage after every response, truncate tool outputs as they are recorded, compact proactively before hitting the limit, and fall back to eviction when compaction alone is not enough.
This multi-layered approach ensures agents can maintain long, productive conversations while staying within model constraints.