Context Management: Compaction and Beyond

This guide covers strategies for managing context in LLM agents, including truncation, compaction, eviction, and other techniques to work within context window limits while preserving important information.

Overview: The Context Management Challenge

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CONTEXT WINDOW MANAGEMENT                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Context Window (e.g., 272K tokens)                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ RESERVED (5%)     │ USABLE CONTEXT (95%)                            │   │
│  │ ┌───────────────┐ │ ┌─────────────────────────────────────────────┐ │   │
│  │ │ System prompt │ │ │ Conversation history + Tool outputs        │ │   │
│  │ │ Tool overhead │ │ │                                             │ │   │
│  │ │ Output buffer │ │ │ ← Managed by context management strategies │ │   │
│  │ └───────────────┘ │ └─────────────────────────────────────────────┘ │   │
│  └───────────────────┴─────────────────────────────────────────────────┘   │
│                                                                             │
│  Strategies:                                                                │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────────────────┐   │
│  │ Truncation │ │ Eviction   │ │ Compaction │ │ Selective Retention    │   │
│  │ (per-item) │ │ (oldest)   │ │ (summarize)│ │ (priority-based)       │   │
│  └────────────┘ └────────────┘ └────────────┘ └────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Part 1: Understanding Context Windows

Model Context Window Sizes

// From Codex's openai_model_info.rs
pub fn get_model_info(model_family: &ModelFamily) -> Option<ModelInfo> {
    match model_family.slug.as_str() {
        "gpt-4.1" => Some(ModelInfo::new(1_047_576)),  // ~1M tokens
        "gpt-5-codex" => Some(ModelInfo::new(272_000)), // 272K tokens
        "gpt-4o" => Some(ModelInfo::new(128_000)),      // 128K tokens
        "o3" => Some(ModelInfo::new(200_000)),          // 200K tokens
        "gpt-3.5-turbo" => Some(ModelInfo::new(16_385)),// 16K tokens
        _ => None,
    }
}

Effective Context Window

Not all of the context window is usable for conversation:

pub struct ModelFamily {
    /// Percentage of context window considered usable for inputs
    /// after reserving headroom for system prompts, tool overhead, and output
    pub effective_context_window_percent: i64,  // Default: 95%
}

// Calculate effective window
pub fn get_model_context_window(&self) -> Option<i64> {
    let model_family = self.get_model_family();
    let percent = model_family.effective_context_window_percent;
    
    self.config.model_context_window
        .or_else(|| get_model_info(&model_family).map(|info| info.context_window))
        .map(|window| (window * percent) / 100)
}

Token Usage Tracking

pub struct TokenUsageInfo {
    /// Usage from the last API response
    pub last_token_usage: TokenUsage,
    
    /// Total accumulated usage
    pub total_token_usage: TokenUsage,
    
    /// Known context window size
    pub model_context_window: Option<i64>,
}

impl TokenUsageInfo {
    /// Tokens currently in the context
    pub fn tokens_in_context_window(&self) -> i64 {
        self.total_token_usage.input_tokens
            + self.total_token_usage.output_tokens
    }

    /// Percentage of context remaining (accounting for baseline overhead)
    pub fn percent_of_context_window_remaining(&self, context_window: i64) -> i64 {
        const BASELINE_TOKENS: i64 = 5_000;  // Reserved for system overhead
        
        if context_window <= BASELINE_TOKENS {
            return 0;
        }
        
        let effective_window = context_window - BASELINE_TOKENS;
        let used = (self.tokens_in_context_window() - BASELINE_TOKENS).max(0);
        let remaining = effective_window - used;
        
        (remaining * 100) / effective_window
    }
}
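
For intuition, here is the arithmetic for a hypothetical session with a 272K-token window and 150K tokens already in context (the numbers are illustrative):

// Worked example of percent_of_context_window_remaining (illustrative numbers)
let context_window: i64 = 272_000;
let tokens_in_context: i64 = 150_000;

let effective_window = context_window - 5_000;       // 267_000
let used = (tokens_in_context - 5_000).max(0);       // 145_000
let remaining = effective_window - used;             // 122_000
let percent = (remaining * 100) / effective_window;  // 45 (integer division)
assert_eq!(percent, 45);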

Part 2: Truncation (Per-Item)

Truncation prevents individual items (especially tool outputs) from consuming too much context.

Truncation Policies

#[derive(Debug, Clone, Copy)]
pub enum TruncationPolicy {
    Bytes(usize),   // Truncate by byte count
    Tokens(usize),  // Truncate by token estimate
}

// Model families have default policies
model_family!(
    "gpt-5-codex",
    truncation_policy: TruncationPolicy::Tokens(10_000),
)

model_family!(
    "gpt-4o",
    truncation_policy: TruncationPolicy::Bytes(10_000),
)
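
Token budgets are ultimately applied in bytes. The byte_budget conversion used by the truncation code below is not shown in this excerpt; a minimal sketch, assuming the ~4-bytes-per-token heuristic from Part 7, might look like:

impl TruncationPolicy {
    /// Approximate byte budget for this policy.
    /// (Sketch only; the real conversion factor may differ.)
    pub fn byte_budget(&self) -> usize {
        match self {
            TruncationPolicy::Bytes(n) => *n,
            // Assume ~4 bytes per token (see Part 7)
            TruncationPolicy::Tokens(n) => n.saturating_mul(4),
        }
    }
}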

Truncation Strategy: Preserve Head and Tail

fn truncate_with_byte_estimate(s: &str, policy: TruncationPolicy) -> String {
    let max_bytes = policy.byte_budget();
    
    if s.len() <= max_bytes {
        return s.to_string();
    }

    // Split budget: half for beginning, half for end
    let (left_budget, right_budget) = (max_bytes / 2, max_bytes - max_bytes / 2);

    // Split on UTF-8 boundaries
    let (removed_chars, left, right) = split_string(s, left_budget, right_budget);

    // Create truncation marker
    let marker = format!("…{removed_chars} chars truncated…");

    format!("{left}{marker}{right}")
}

Example output:

Total output lines: 5000

drwxr-xr-x  5 user user  160 Jan 1 12:00 .
drwxr-xr-x  3 user user   96 Jan 1 11:00 ..
-rw-r--r--  1 user user  234 Jan 1 12:00 package.json
…4850 chars truncated…
-rw-r--r--  1 user user 1234 Jan 1 12:00 README.md
-rw-r--r--  1 user user  567 Jan 1 12:00 tsconfig.json
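
The split_string helper used by truncate_with_byte_estimate is not shown in this excerpt. A minimal sketch that splits on UTF-8 character boundaries (the signature is inferred from the call site) could be:

/// Split `s` into a prefix of at most `left_budget` bytes and a suffix of at
/// most `right_budget` bytes, both on char boundaries, and report how many
/// characters were dropped in between. (Sketch; inferred from the call site.)
fn split_string(s: &str, left_budget: usize, right_budget: usize) -> (usize, &str, &str) {
    // Largest prefix that fits the left budget on a char boundary
    let mut left_end = left_budget.min(s.len());
    while !s.is_char_boundary(left_end) {
        left_end -= 1;
    }

    // Earliest suffix start that keeps the suffix within the right budget
    let mut right_start = s.len().saturating_sub(right_budget).max(left_end);
    while !s.is_char_boundary(right_start) {
        right_start += 1;
    }

    let removed_chars = s[left_end..right_start].chars().count();
    (removed_chars, &s[..left_end], &s[right_start..])
}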

Truncation at Recording Time

Tool outputs are truncated when recorded into history:

impl ContextManager {
    pub fn record_items<I>(&mut self, items: I, policy: TruncationPolicy) {
        for item in items {
            // Process (potentially truncate) the item
            let processed = self.process_item(&item, policy);
            self.items.push(processed);
        }
    }

    fn process_item(&self, item: &ResponseItem, policy: TruncationPolicy) -> ResponseItem {
        match item {
            ResponseItem::FunctionCallOutput { call_id, output } => {
                // Truncate content
                let truncated = truncate_text(&output.content, policy);
                
                // Truncate structured content items too
                let truncated_items = output.content_items.as_ref().map(|items| {
                    truncate_function_output_items_with_policy(items, policy)
                });
                
                ResponseItem::FunctionCallOutput {
                    call_id: call_id.clone(),
                    output: FunctionCallOutputPayload {
                        content: truncated,
                        content_items: truncated_items,
                        success: output.success,
                    },
                }
            }
            // Other items pass through unchanged
            _ => item.clone(),
        }
    }
}

Configurable Truncation Limits

Users can override default truncation:

# config.toml
tool_output_token_limit = 50000  # Override default limit

The override is applied when the truncation policy is constructed:

impl TruncationPolicy {
    pub fn new(config: &Config, default_policy: TruncationPolicy) -> Self {
        if let Some(token_limit) = config.tool_output_token_limit {
            // User override takes precedence
            Self::Tokens(token_limit)
        } else {
            default_policy
        }
    }
}

Part 3: Eviction (Oldest First)

When context is full, the oldest items are removed first to preserve recent context.

Simple Eviction

impl ContextManager {
    pub fn remove_first_item(&mut self) {
        if !self.items.is_empty() {
            // Remove oldest item (front of the list)
            let removed = self.items.remove(0);
            
            // Maintain invariants: remove orphaned call/output pairs
            normalize::remove_corresponding_for(&mut self.items, &removed);
        }
    }
}
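
The normalize::remove_corresponding_for call is what keeps call/output pairs intact after an eviction. It is not shown above; a sketch of the idea, using the item names from this guide, might be:

/// If the removed item was a function call or a function call output, drop its
/// counterpart so no orphaned half of the pair remains.
/// (Sketch; the real implementation may handle additional item kinds.)
pub fn remove_corresponding_for(items: &mut Vec<ResponseItem>, removed: &ResponseItem) {
    let orphan_call_id = match removed {
        ResponseItem::FunctionCall { call_id, .. }
        | ResponseItem::FunctionCallOutput { call_id, .. } => call_id.clone(),
        _ => return,
    };

    items.retain(|item| match item {
        ResponseItem::FunctionCall { call_id, .. }
        | ResponseItem::FunctionCallOutput { call_id, .. } => *call_id != orphan_call_id,
        _ => true,
    });
}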

Eviction During Compaction

If compaction itself exceeds context, items are evicted iteratively:

async fn run_compact_task(...) -> CodexResult<()> {
    let mut truncated_count = 0;
    
    loop {
        let turn_input = history.get_history_for_prompt();
        
        match drain_to_completed(&sess, &turn_context, &prompt).await {
            Ok(()) => {
                // Success - exit loop
                if truncated_count > 0 {
                    sess.notify_background_event(
                        format!("Trimmed {truncated_count} older items before compacting")
                    ).await;
                }
                break;
            }
            Err(e @ CodexErr::ContextWindowExceeded) => {
                if turn_input.len() > 1 {
                    // Remove oldest item and retry
                    history.remove_first_item();
                    truncated_count += 1;
                    continue;
                }
                // Can't remove more - fail
                return Err(e);
            }
            Err(e) => return Err(e),
        }
    }

    Ok(())
}

Part 4: Compaction (Summarization)

Compaction summarizes the conversation history to reduce token usage while preserving essential information.

Auto-Compaction Trigger

Codex automatically triggers compaction when token usage exceeds a threshold:

// Default: 90% of context window
const fn default_auto_compact_limit(context_window: i64) -> i64 {
    (context_window * 9) / 10
}

// In the main turn loop
let limit = turn_context.client
    .get_auto_compact_token_limit()
    .unwrap_or(i64::MAX);

let total_usage_tokens = sess.get_total_token_usage().await;
let token_limit_reached = total_usage_tokens >= limit;

if token_limit_reached {
    // Trigger compaction
    run_inline_auto_compact_task(sess.clone(), turn_context.clone()).await;
    continue;  // Retry the turn with compacted history
}
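
For the 272K-token window used as an example in Part 1, the default threshold works out to 244,800 tokens:

// 272_000 * 9 / 10 = 244_800
assert_eq!(default_auto_compact_limit(272_000), 244_800);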

Compaction Prompt

The summarization prompt instructs the model how to compress:

You are performing a CONTEXT CHECKPOINT COMPACTION. Create a handoff 
summary for another LLM that will resume the task.

Include:
- Current progress and key decisions made
- Important context, constraints, or user preferences
- What remains to be done (clear next steps)
- Any critical data, examples, or references needed to continue

Be concise, structured, and focused on helping the next LLM seamlessly 
continue the work.

Building Compacted History

pub fn build_compacted_history(
    initial_context: Vec<ResponseItem>,
    user_messages: &[String],
    summary_text: &str,
) -> Vec<ResponseItem> {
    const MAX_USER_MESSAGE_TOKENS: usize = 20_000;
    
    let mut history = initial_context;
    let mut remaining = MAX_USER_MESSAGE_TOKENS;
    let mut selected_messages = Vec::new();

    // Keep recent user messages (working backwards for recency)
    for message in user_messages.iter().rev() {
        if remaining == 0 { break; }
        
        let tokens = approx_token_count(message);
        if tokens <= remaining {
            selected_messages.push(message.clone());
            remaining -= tokens;
        } else {
            // Truncate and include partial
            let truncated = truncate_text(message, TruncationPolicy::Tokens(remaining));
            selected_messages.push(truncated);
            break;
        }
    }
    selected_messages.reverse();

    // Add preserved user messages
    for message in &selected_messages {
        history.push(ResponseItem::Message {
            role: "user".to_string(),
            content: vec![ContentItem::InputText { text: message.clone() }],
        });
    }

    // Add summary as final message
    history.push(ResponseItem::Message {
        role: "user".to_string(),
        content: vec![ContentItem::InputText { 
            text: format!("{SUMMARY_PREFIX}\n{summary_text}")
        }],
    });

    history
}

What's Preserved vs. Discarded

Preserved                                                 Discarded
Initial context (system prompt, AGENTS.md, environment)   Full tool call/output pairs
Recent user messages (up to 20K tokens)                   Assistant reasoning
Summary of conversation                                   Intermediate tool outputs
Ghost snapshots (for undo)                                Old user messages

Remote Compaction

For some configurations, compaction uses a dedicated API endpoint:

pub fn should_use_remote_compact_task(session: &Session) -> bool {
    session.auth_manager.auth()
        .is_some_and(|auth| auth.mode == AuthMode::ChatGPT)
        && session.enabled(Feature::RemoteCompaction)
}

async fn run_remote_compact_task_inner_impl(...) -> CodexResult<()> {
    // Send full history to remote endpoint
    let new_history = turn_context.client
        .compact_conversation_history(&prompt)
        .await?;
    
    // Replace history with compacted version
    sess.replace_history(new_history).await;
    sess.recompute_token_usage(turn_context).await;
}

Part 5: Claude Code's PreCompact Hook

Claude Code provides a PreCompact hook for custom compaction logic:

{
  "PreCompact": [
    {
      "matcher": "*",
      "hooks": [
        {
          "type": "command",
          "command": "${CLAUDE_PLUGIN_ROOT}/hooks/pre-compact.sh"
        }
      ]
    }
  ]
}

This allows custom logic to run just before compaction, for example backing up the full transcript before it is summarized or logging when and why compaction occurs.

Part 6: History Normalization

The context manager maintains invariants to ensure valid conversation structure.

Invariants

  1. Every tool call must have a corresponding output
  2. Every output must have a corresponding call
  3. No orphaned items

These invariants are enforced by a normalization pass:

impl ContextManager {
    fn normalize_history(&mut self) {
        // Ensure all calls have outputs
        normalize::ensure_call_outputs_present(&mut self.items);
        
        // Remove orphaned outputs
        normalize::remove_orphan_outputs(&mut self.items);
    }
}

Adding Missing Outputs

pub fn ensure_call_outputs_present(items: &mut Vec<ResponseItem>) {
    let mut call_ids_needing_output: HashSet<String> = HashSet::new();
    
    // First pass: collect all call IDs
    for item in items.iter() {
        match item {
            ResponseItem::FunctionCall { call_id, .. } => {
                call_ids_needing_output.insert(call_id.clone());
            }
            ResponseItem::FunctionCallOutput { call_id, .. } => {
                call_ids_needing_output.remove(call_id);
            }
            _ => {}
        }
    }
    
    // Second pass: add placeholder outputs for missing ones
    for call_id in call_ids_needing_output {
        items.push(ResponseItem::FunctionCallOutput {
            call_id,
            output: FunctionCallOutputPayload {
                content: "(no output recorded)".to_string(),
                success: Some(false),
                ..Default::default()
            },
        });
    }
}
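
The complementary remove_orphan_outputs pass (invariant 2) is not shown; a minimal sketch could be:

use std::collections::HashSet;

/// Drop any FunctionCallOutput whose call_id has no matching FunctionCall.
/// (Sketch; the real implementation may cover additional tool-call kinds.)
pub fn remove_orphan_outputs(items: &mut Vec<ResponseItem>) {
    let call_ids: HashSet<String> = items
        .iter()
        .filter_map(|item| match item {
            ResponseItem::FunctionCall { call_id, .. } => Some(call_id.clone()),
            _ => None,
        })
        .collect();

    items.retain(|item| match item {
        ResponseItem::FunctionCallOutput { call_id, .. } => call_ids.contains(call_id),
        _ => true,
    });
}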

Part 7: Token Estimation

Accurate token counting is expensive, so agents use approximations:

const APPROX_BYTES_PER_TOKEN: usize = 4;

pub fn approx_token_count(text: &str) -> usize {
    let len = text.len();
    len.saturating_add(APPROX_BYTES_PER_TOKEN - 1) / APPROX_BYTES_PER_TOKEN
}

pub fn approx_tokens_from_byte_count(bytes: usize) -> u64 {
    (bytes as u64 + 3) / 4
}
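
Both helpers use ceiling division, so any non-empty string counts as at least one token:

assert_eq!(approx_token_count(""), 0);
assert_eq!(approx_token_count("a"), 1);             // 1 byte   -> 1 token
assert_eq!(approx_token_count("hello world!"), 3);  // 12 bytes -> 3 tokens
assert_eq!(approx_tokens_from_byte_count(13), 4);   // 13 bytes -> 4 tokens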

Estimating History Size

impl ContextManager {
    pub fn estimate_token_count(&self, turn_context: &TurnContext) -> Option<i64> {
        let model_family = turn_context.client.get_model_family();
        
        // Base tokens from system prompt
        let base_tokens = approx_token_count(&model_family.base_instructions) as i64;

        // Sum tokens from all items
        let items_tokens = self.items.iter().fold(0i64, |acc, item| {
            acc + match item {
                // Skip internal items
                ResponseItem::GhostSnapshot { .. } => 0,
                
                // Estimate encrypted reasoning content
                ResponseItem::Reasoning { encrypted_content: Some(content), .. } => {
                    estimate_reasoning_length(content.len())
                }
                
                // Serialize and estimate others
                item => {
                    let serialized = serde_json::to_string(item).unwrap_or_default();
                    approx_token_count(&serialized) as i64
                }
            }
        });

        Some(base_tokens.saturating_add(items_tokens))
    }
}
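
The estimate_reasoning_length helper is not shown here. Purely as an assumption, one plausible sketch applies the same bytes-per-token approximation to the length of the encrypted payload:

// Assumption only: treat the encrypted reasoning payload like ordinary text.
// The actual heuristic used by Codex may differ.
fn estimate_reasoning_length(encrypted_len: usize) -> i64 {
    approx_tokens_from_byte_count(encrypted_len) as i64
}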

Part 8: Context Window Exceeded Handling

When the context window is exceeded, the agent must handle it gracefully:

pub enum CodexErr {
    ContextWindowExceeded,
    // ...
}

// Detection from API response
if response.error.code == "context_length_exceeded" {
    return Err(CodexErr::ContextWindowExceeded);
}

// Handling
match result {
    Err(CodexErr::ContextWindowExceeded) => {
        // Mark token usage as full
        sess.set_total_tokens_full(turn_context.as_ref()).await;
        
        // Notify user
        let event = EventMsg::Error(ErrorEvent {
            message: "Your input exceeds the context window. Please adjust and try again.".into(),
            code: Some("context_window_exceeded".into()),
        });
        sess.send_event(&turn_context, event).await;
    }
}

Filling Token Info on Overflow

impl TokenUsageInfo {
    pub fn fill_to_context_window(&mut self, context_window: i64) {
        let previous_total = self.last_token_usage.total_tokens;
        let delta = (context_window - previous_total).max(0);
        
        self.model_context_window = Some(context_window);
        self.last_token_usage = TokenUsage {
            total_tokens: context_window,
            input_tokens: self.last_token_usage.input_tokens + delta,
            output_tokens: self.last_token_usage.output_tokens,
        };
    }
}

Part 9: Image Handling

Images consume significant context. Special handling is needed:

Replace Invalid Images

impl ContextManager {
    pub fn replace_last_turn_images(&mut self, placeholder: &str) {
        if let Some(last_item) = self.items.last_mut() {
            match last_item {
                ResponseItem::Message { role, content, .. } if role == "user" => {
                    for item in content.iter_mut() {
                        if matches!(item, ContentItem::InputImage { .. }) {
                            *item = ContentItem::InputText {
                                text: placeholder.to_string(),
                            };
                        }
                    }
                }
                // Also handle function call outputs with images
                ResponseItem::FunctionCallOutput { output, .. } => {
                    if let Some(items) = output.content_items.as_mut() {
                        for item in items.iter_mut() {
                            if matches!(item, FunctionCallOutputContentItem::InputImage { .. }) {
                                *item = FunctionCallOutputContentItem::InputText {
                                    text: placeholder.to_string(),
                                };
                            }
                        }
                    }
                }
                _ => {}
            }
        }
    }
}
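
A hypothetical call site, for example after the API rejects an image, might look like this (the placeholder text is illustrative):

// Swap the last turn's images for a text placeholder before retrying.
history.replace_last_turn_images("[image removed: unsupported or too large]");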

Part 10: Best Practices

1. Truncate Aggressively, Compact Lazily

// Truncate every tool output immediately
let processed = truncate_text(&output, policy);

// But only compact when necessary (90% threshold)
if total_tokens >= auto_compact_limit {
    run_compaction();
}

2. Preserve Recent Context

// Keep recent user messages during compaction
for message in user_messages.iter().rev().take(MAX_RECENT) {
    preserved.push(message);
}

3. Re-inject Critical Context

// After compaction, re-inject initial context
let initial_context = sess.build_initial_context(turn_context);
let new_history = build_compacted_history(
    initial_context,  // System prompt, AGENTS.md, environment
    &user_messages,
    &summary,
);

4. Track Token Usage Continuously

// Update after every API response
sess.update_token_usage_info(turn_context, token_usage.as_ref()).await;

// Check before expensive operations
let usage = sess.get_total_token_usage().await;
if usage >= (limit * 8) / 10 {
    warn!("Approaching context limit: {}%", (usage * 100) / limit);
}

5. Handle Edge Cases

// Single item that exceeds context
if turn_input.len() == 1 && matches!(err, CodexErr::ContextWindowExceeded) {
    // Can't evict more - must fail
    return Err(err);
}

// Empty summary fallback
let summary_text = if summary.is_empty() {
    "(no summary available)".to_string()
} else {
    summary
};

6. Notify Users of Context Operations

// Warn about compaction
sess.send_event(&turn_context, EventMsg::ContextCompacted(ContextCompactedEvent {})).await;

// Warn about accuracy impact
sess.send_event(&turn_context, EventMsg::Warning(WarningEvent {
    message: "Long conversations and multiple compactions can reduce accuracy. \
              Start a new conversation when possible.".into(),
})).await;

Summary

Context management in LLM agents involves multiple complementary strategies:

Strategy         When Applied              What It Does
Truncation       Per-item, at recording    Limits individual items to prevent bloat
Eviction         When compaction fails     Removes oldest items to make room
Compaction       At token threshold        Summarizes history to reduce size
Normalization    Before sending            Ensures valid conversation structure
Re-injection     After compaction          Restores critical initial context

The key insight is that context management is a continuous process, not a one-time operation. Agents must:

  1. Track token usage constantly
  2. Truncate outputs as they're recorded
  3. Trigger compaction before hitting limits
  4. Preserve the most valuable context (recent messages, initial instructions)
  5. Handle failures gracefully (eviction, user notification)

This multi-layered approach ensures agents can maintain long, productive conversations while staying within model constraints.
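
To make the layering concrete, the sketch below combines these strategies in one simplified turn step. It reuses names from the snippets in this guide where possible; helpers such as truncation_policy, send_request, and ContextManager::len are placeholders, and the control flow is illustrative rather than the actual Codex implementation.

async fn record_and_send(
    sess: &Session,
    turn_context: &TurnContext,
    history: &mut ContextManager,
    tool_output: ResponseItem,
) -> CodexResult<()> {
    // 1. Truncate each tool output as it is recorded.
    let policy = turn_context.client.truncation_policy();
    history.record_items([tool_output], policy);

    // 2. Track usage and compact before the limit is hit.
    let limit = turn_context.client.get_auto_compact_token_limit().unwrap_or(i64::MAX);
    if sess.get_total_token_usage().await >= limit {
        run_inline_auto_compact_task(sess, turn_context).await;
    }

    // 3. If the request still overflows, evict the oldest items and retry.
    loop {
        match send_request(sess, turn_context, history).await {
            Ok(()) => return Ok(()),
            Err(CodexErr::ContextWindowExceeded) if history.len() > 1 => {
                history.remove_first_item();
            }
            // 4. Surface anything unrecoverable to the user.
            Err(e) => return Err(e),
        }
    }
}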