Production agents spend money on model calls, tool calls, retries, infrastructure, observability, and human review. Cost control starts with measuring every run and budgeting agent spend the way you would budget cloud infrastructure.
Agent cost compounds across the entire workflow, especially when agents loop, retry, and use external systems.
Input and output token spend accumulates across prompts, context windows, responses, and multi-step runs.
Searches, database queries, code execution, and API calls multiply with step count.
Orchestration, memory stores, vector databases, and runtimes add fixed and variable overhead.
Human review, guardrail checks, failure re-runs, and monitoring tools create hidden spend.
Start by profiling the task, then price each component, then add buffers for failures, context growth, and review overhead.
Measure average tokens, tool calls, and expected monthly run volume.
Calculate LLM token spend, tool API spend, and infrastructure overhead.
Account for retry overhead, growing context, human review, and QA.
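The three steps above (profile, price, buffer) can be sketched as a small estimator. This is a minimal illustration, not a pricing tool: the class names, the buffer parameters, and every number in the usage example are assumptions chosen for readability, and real unit prices have to come from your own vendor contracts.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Step 1: measured averages for one task type (hypothetical fields)."""
    input_tokens_per_run: float
    output_tokens_per_run: float
    tool_calls_per_run: float
    runs_per_month: int


@dataclass
class Prices:
    """Step 2: unit prices. Placeholders, not real vendor rates."""
    usd_per_1m_input_tokens: float
    usd_per_1m_output_tokens: float
    usd_per_tool_call: float
    fixed_infra_usd_per_month: float


def estimate_monthly_cost(profile, prices, retry_rate=0.10, context_growth=0.15,
                          review_rate=0.05, review_usd_per_run=2.00):
    """Step 3: apply buffers for retries, context growth, and human review.

    All four buffer defaults are illustrative assumptions.
    """
    # Context growth inflates input tokens only; output size is taken as measured.
    llm_per_run = (
        profile.input_tokens_per_run * (1 + context_growth) * prices.usd_per_1m_input_tokens
        + profile.output_tokens_per_run * prices.usd_per_1m_output_tokens
    ) / 1_000_000
    tools_per_run = profile.tool_calls_per_run * prices.usd_per_tool_call

    runs = profile.runs_per_month
    # Retries re-spend model and tool costs for a fraction of runs.
    llm = runs * llm_per_run * (1 + retry_rate)
    tools = runs * tools_per_run * (1 + retry_rate)
    # Only a fraction of runs is escalated to a human reviewer.
    review = runs * review_rate * review_usd_per_run
    infra = prices.fixed_infra_usd_per_month
    return {"llm": llm, "tools": tools, "review": review,
            "infrastructure": infra, "total": llm + tools + review + infra}


if __name__ == "__main__":
    profile = TaskProfile(20_000, 2_000, 5, 10_000)
    prices = Prices(3.0, 15.0, 0.002, 500.0)
    for component, usd in estimate_monthly_cost(profile, prices).items():
        print(f"{component:>15}: ${usd:,.2f}")
```

Keeping the components separate in the result makes it easier to see which line item dominates before deciding where to optimize.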
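Long contexts are billed on every step, so in a multi-step run the accumulated history is typically paid for again each time the agent calls the model. Larger models can cost many times more per token than smaller ones. Failures and retries quietly inflate the total bill.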
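To make the context-growth point concrete, here is a small arithmetic sketch. It assumes the agent resends its full history on every model call, with no summarization or prompt caching, and the token figures are made up for illustration.

```python
def cumulative_input_tokens(base_prompt_tokens, tokens_added_per_step, steps):
    """Total input tokens billed across one run when each step resends the full history."""
    total = 0
    context = base_prompt_tokens
    for _ in range(steps):
        total += context                  # the whole context is billed on this call
        context += tokens_added_per_step  # response and tool output join the history
    return total


if __name__ == "__main__":
    print(cumulative_input_tokens(2_000, 1_500, 10))  # 87500
    print(cumulative_input_tokens(2_000, 1_500, 20))  # 325000
```

Under these assumptions, doubling the step count from 10 to 20 raises billed input tokens by roughly 3.7 times, because each added step pays for everything that came before it.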
Cost governance works best when controls are built into the agent architecture from day one.
Summarize history, strip redundant context, and keep prompts lean.
Use smaller models for routing, classification, formatting, and simple QA.
Cache deterministic tool outputs and common sub-task results.
Cap agent steps and tool calls to avoid infinite or expensive loops.
Set daily caps and alert at 80% utilization before overruns happen.
Tag every call by task type, user, team, model, and tool path.
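Two of the controls above, the step cap and the daily cap with an 80% alert, can be sketched as a guard around the agent loop. Everything here is hypothetical: `DailyBudget`, `run_agent`, and the `step_fn` contract (it returns a done flag and the cost of the step in USD) are illustrative names, and a production version would also need a daily reset and shared state across processes.

```python
class BudgetExceeded(Exception):
    """Raised when the daily cap has been reached and no further calls are allowed."""


class DailyBudget:
    def __init__(self, cap_usd, alert_fraction=0.80, on_alert=print):
        self.cap_usd = cap_usd
        self.alert_fraction = alert_fraction
        self.on_alert = on_alert  # e.g. a pager or chat webhook in practice
        self.spent_usd = 0.0
        self._alerted = False

    def ensure_available(self):
        """Hard stop: refuse to start another step once the cap is reached."""
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(f"daily cap of ${self.cap_usd:.2f} reached")

    def charge(self, cost_usd):
        """Record spend and fire a single alert when utilization crosses the threshold."""
        self.spent_usd += cost_usd
        if not self._alerted and self.spent_usd >= self.alert_fraction * self.cap_usd:
            self._alerted = True
            self.on_alert(f"budget at {self.spent_usd / self.cap_usd:.0%} of daily cap")


def run_agent(step_fn, budget, max_steps=8):
    """Run step_fn until it reports done, the step cap is hit, or the budget stops it."""
    for step in range(max_steps):
        budget.ensure_available()       # check before spending, not after
        done, cost_usd = step_fn(step)  # step_fn is a stand-in for one model or tool call
        budget.charge(cost_usd)
        if done:
            return "completed", step + 1
    return "step_cap_reached", max_steps
```

The check runs before each step so that a stopped agent never starts a call it cannot pay for. The last step before the cap can still overshoot slightly, which is one reason to alert well before 100%.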
A complete model includes token cost, tool APIs, fixed infrastructure, retry overhead, human review, and a buffer for uncertainty.
Cost discipline prevents expensive architectural rewrites later.
Instrument every run with token counts, tool calls, latency, retries, and outcome status.
Use large models only where complex reasoning requires them.
Set per-team and per-task budgets, alert early, and enforce hard stops.
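One possible shape for the instrumentation, routing, and per-team budgets described above is sketched below. The record fields follow the text, but `RunRecord`, the routing table, and the model names `small-model` and `large-model` are placeholders, not real products or a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One row per agent run; emit it to whatever log or metrics store you use."""
    task_type: str
    team: str
    model: str
    input_tokens: int
    output_tokens: int
    tool_calls: int
    retries: int
    latency_s: float
    status: str  # e.g. "success", "failed", "escalated"
    cost_usd: float


# Hypothetical routing table: only tasks that need complex reasoning get the large model.
MODEL_BY_TASK = {
    "classification": "small-model",
    "formatting": "small-model",
    "planning": "large-model",
}


def choose_model(task_type, default="small-model"):
    """Route unknown task types to the cheaper model by default."""
    return MODEL_BY_TASK.get(task_type, default)


def spend_by(records, key):
    """Sum cost per tag value, e.g. key='team', 'task_type', or 'model'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


def teams_over_budget(records, team_budgets_usd):
    """Return the teams whose recorded spend exceeds their budget."""
    spent = spend_by(records, "team")
    return sorted(team for team, total in spent.items()
                  if total > team_budgets_usd.get(team, float("inf")))
```

Because every record carries the same tags, the same `spend_by` call answers questions per team, per task type, or per model without separate bookkeeping.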
Track every token, tool call, retry, and review step. Then use compression, routing, caching, step budgets, alerts, and dashboards to keep AI agent spend predictable.