engineeringtoken-optimizationllm-infrastructureclaudeproduction

Reverse-Engineering Claude: Token Optimization Strategies from the Backend

Token optimization isn't just a prompt engineering trick — it is hardcoded into the Claude backend.

2026-04-07·5 min read

Use with AI

When building systems around large language models, token optimization isn't just a best practice — it's a core architectural constraint. As we shift from traditional CI/CD to Continuous Calibration / Continuous Development (CC/CD) for non-deterministic models, tracking and optimizing token costs becomes a first-class infrastructure concern, right alongside managing KV cache and quantization limits.

A look under the hood of the Claude Code harness reveals exactly how Anthropic handles this internally. Token optimization isn't just a prompt engineering trick; it is hardcoded into their backend services.

Here is an actionable breakdown of the methods you can use to optimize token usage and get better results for less spend, backed directly by the internal architecture of Claude Code.

1. Context Compaction: Prune the Sliding Window

The single highest-impact habit is actively compressing conversations before the context window fills up. Waiting for a context overflow means dragging dead weight into every subsequent API call.

The backend proof: The internal TranscriptStore uses a compact(keep_last: int) method to actively prune older turns from memory. The QueryEnginePort.compact_messages_if_needed() automatically triggers compaction on every turn once a predefined compact_after_turns threshold is hit.

Actionable: Proactively invoke /compact (or your own summarization logic) at logical task boundaries. Flush transcripts and snapshot the summarized state to cut input tokens dramatically on the next turn.

2. Surgical File Operations: Stop Dumping Full Files

Passing entire 1,000-line files into context when you only need to modify a single function is the fastest way to burn your token budget and degrade model focus.

The backend proof: Claude relies on GlobTool to find file paths and GrepTool to extract relevant lines, deliberately avoiding full-file reads. When reads are necessary, FileReadTool uses a strict limits.ts module to enforce line-range boundaries. For writes, FileEditTool applies minimal diffs via utils.ts instead of whole-file overwrites.

Actionable: Force your workflow into a search-and-diff pattern: Glob → Grep → Read (by line range) → Edit (by diff). Never use full-file writes for minor updates.

3. Strip the Tool Manifest Surface Area

Every tool you provide to the model requires token-heavy schema definitions in the system prompt. Passing a massive toolbox to an agent doing a simple task wastes input tokens before a single user message is processed.

The backend proof: The backend supports a simple_mode=True parameter that aggressively strips the tool pool down to just three core tools (BashTool, FileReadTool, FileEditTool). It also uses a ToolPermissionContext to surgically block specific tools or prefixes, and an --no-mcp flag to drop heavy Model Context Protocol tools.

Actionable: Scope your tool manifest precisely to the task. If a task doesn't require complex file manipulation, drop those tools from the request. For high-volume workflows, model the tool count vs. input token cost as part of your cost structure.

4. Isolate State via Sub-Agents and External Memory

Keeping static facts or entire planning sessions in the active conversation wastes tokens on every single turn.

The backend proof: The AgentTool suite uses forkSubagent.ts and spawnMultiAgent.ts to spin up parallel workers with strictly scoped, task-specific contexts. Instead of relying on conversational history, the system persists knowledge externally using a SessionMemory service and agentMemorySnapshot tools.

Actionable: Store stable facts — coding standards, project patterns, configuration — in external memory files and load them by reference. When executing complex tasks, spawn sub-agents with only the context they need for their specific slice of work. This prevents the "growing transcript" problem that causes costs to compound across turns.

5. Enforce Hard Budgets and Use Plan Mode

Agentic loops can spiral quickly. Executing trial-and-error code modifications using a heavy model with full context will drain budget fast.

The backend proof: QueryEngineConfig is hardcoded with a max_budget_tokens and max_turns ceiling that cleanly halts execution if exceeded. To prevent wasted loops, the system uses an EnterPlanModeTool (and a dedicated planAgent) to map out execution before committing to heavy tool-use turns.

Actionable: Set strict turn limits on your LLM loops. Use a plan mode — a cheaper, thinking-only pass — to determine exact steps before allowing the model to execute expensive, multi-turn file edits. Monitor token consumption in real-time using cost hooks. The cost difference between a plan pass and an execution pass is typically 5–10x.

The Underlying Pattern

These five strategies share a common design principle: minimize the tokens that don't contribute to the current decision. Historical turns, full file contents, unused tools, and static facts are all versions of the same problem — context that costs tokens on every call without improving the output for that specific call.

The Claude backend applies these constraints systematically. Treating token optimization as architecture — not as a late-stage cost reduction exercise — is how you build LLM systems that remain economically viable as usage scales. The teams that discover this early avoid the "worked in prototype, too expensive in production" failure mode that's increasingly common as agentic workloads move from demos to real deployments.

See how this applies to your stack

20-minute discovery call — no pitch, just specifics.

Book a Call