Context Management¶
How Zorac manages the conversation context window — counting tokens, auto-summarizing older messages, and balancing between conversation memory and available context budget.
The Context Window Problem¶
What the Context Window Is¶
Every LLM has a context window — a fixed limit on the total number of tokens it can process at once. This window includes everything: the system prompt, the entire conversation history, and the model's response.
For Mistral-Small-24B as configured in Zorac:
```
Context window: 16,384 tokens (set by --max-model-len)
├── System prompt: ~200 tokens
├── Chat history: variable (grows with conversation)
├── Current message: variable
└── Model response: up to 4,000 tokens (MAX_OUTPUT_TOKENS)
```
The available space for conversation history is roughly 16,384 - 200 - 4,000 ≈ 12,000 tokens.
This is why Zorac defaults to MAX_INPUT_TOKENS=12000 — it reserves enough room for the model to generate a full response.
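As a quick sanity check, the budget can be expressed directly in code. The constants below simply restate Zorac's documented defaults rather than being imported from its configuration:

```python
# Token budget sanity check; values restate Zorac's documented defaults
MAX_MODEL_LEN = 16_384      # vLLM --max-model-len
MAX_OUTPUT_TOKENS = 4_000   # reserved for the model's response
MAX_INPUT_TOKENS = 12_000   # system prompt + conversation history

# The input budget plus the reserved output must fit in the context window
assert MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS <= MAX_MODEL_LEN
```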
Why It Has a Fixed Size¶
The context window size is determined by the model's architecture and the available GPU memory. The KV cache (which stores the attention state for all tokens in the context) consumes VRAM proportional to the context length.
On an RTX 4090 with a 24B AWQ model, the KV cache for 16k tokens uses approximately 8GB of VRAM — a significant chunk of the 24GB available. Doubling the context to 32k would require ~16GB for the KV cache alone, leaving barely enough for the model weights.
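For a rough sense of where that figure comes from, the KV-cache footprint can be estimated from the model architecture. The sketch below is an approximation only: the layer and head counts are placeholders, not Mistral-Small-24B's actual configuration.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size: one K and one V tensor per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Placeholder architecture numbers; substitute your model's real config
est = kv_cache_bytes(seq_len=16_384, num_layers=40, num_kv_heads=8, head_dim=128)
print(f"~{est / 1024**3:.1f} GiB for a 16k-token fp16 KV cache (one sequence)")
```

Note that vLLM typically pre-allocates cache blocks up front (controlled by --gpu-memory-utilization), so observed VRAM usage is usually higher than a single-sequence estimate like this.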
What Happens When You Exceed It¶
If the conversation history exceeds the context window, one of two things happens:
- API error — vLLM returns an error because the input exceeds --max-model-len
- Truncation — Some servers silently drop tokens from the beginning of the input, losing early conversation context without warning
Neither outcome is good. Zorac prevents both by proactively managing the context window through token counting and auto-summarization.
Token Counting in Zorac¶
Using tiktoken¶
Zorac counts tokens using tiktoken, the same tokenizer library used by OpenAI's APIs. The counting logic lives in zorac/utils.py:
```python
import tiktoken

def count_tokens(messages, encoding_name=None):
    # Fall back to the configured default encoding (cl100k_base)
    # when no encoding name is supplied
    resolved_encoding = encoding_name or "cl100k_base"
    encoding = tiktoken.get_encoding(resolved_encoding)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # Message envelope overhead
        for _, value in message.items():
            if isinstance(value, str):
                num_tokens += len(encoding.encode(value))
    num_tokens += 2  # Reply priming
    return num_tokens
```
The function follows OpenAI's token counting methodology:
- 4 tokens per message for the envelope: `<im_start>{role}\n{content}<im_end>\n`
- Encoded content — each string value is tokenized and counted
- 2 tokens at the end for reply priming: `<im_start>assistant`
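For a quick check, you can call the function on a small, made-up message list (the message text here is purely illustrative):

```python
from zorac.utils import count_tokens

messages = [
    {"role": "system", "content": "You are Zorac, a helpful assistant."},
    {"role": "user", "content": "Explain the context window in one sentence."},
]

print(count_tokens(messages))  # roughly 30 tokens with cl100k_base
```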
Encoding Families and Model Matching¶
Different model families use different tokenizers. Zorac defaults to cl100k_base (the encoding used by GPT-4 and most modern models), which provides a reasonable approximation for Mistral models.
If you're running a model with a significantly different tokenizer, you can change the encoding:
```
# Via /config command
/config set TIKTOKEN_ENCODING o200k_base

# Via environment variable
TIKTOKEN_ENCODING=o200k_base zorac
```
Approximation Is Sufficient
The token count doesn't need to be exact — it's used to trigger summarization (a gradual process) rather than to enforce a hard cutoff. Being off by a few percent is perfectly fine.
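To get a feel for how much the encoding choice matters, you can compare counts for the same text across encodings. The sample sentence below is arbitrary, o200k_base requires a reasonably recent tiktoken release, and neither encoding matches Mistral's native tokenizer exactly:

```python
import tiktoken

text = "Context windows are measured in tokens, not characters."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```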
Where Token Counts Appear in the UI¶
Token counts surface in several places in Zorac's interface:
| Location | What It Shows |
|---|---|
| Stats bar (after chat) | Response tokens, duration, tok/s, plus total conversation tokens |
| Stats bar (session loaded) | Message count and total token usage |
| /tokens command | Current token usage, limits, and remaining capacity |
| Auto-summarization warning | Appears when token count exceeds MAX_INPUT_TOKENS |
The stats bar is a Static widget pinned to the bottom of the TUI, updated in real-time during streaming and after each interaction.
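The exact widget wiring is internal to Zorac's TUI, but as a rough illustration of the pattern, a Textual Static widget can be refreshed like this (the #stats-bar id and the helper function are hypothetical, not taken from Zorac's source):

```python
from textual.app import App
from textual.widgets import Static

def update_stats_bar(app: App, response_tokens: int, duration_s: float,
                     total_tokens: int) -> None:
    """Format token stats and push them to a pinned Static widget."""
    tok_per_s = response_tokens / duration_s if duration_s else 0.0
    stats = (f"{response_tokens} tok | {duration_s:.1f}s | "
             f"{tok_per_s:.1f} tok/s | total {total_tokens}")
    app.query_one("#stats-bar", Static).update(stats)
```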
Auto-Summarization¶
When It Triggers¶
Before every chat message is sent to the LLM, Zorac checks the total token count:
```python
# In handle_chat() — zorac/main.py
current_tokens = count_tokens(self.messages)
if current_tokens > MAX_INPUT_TOKENS:
    await self._summarize_messages()
```
When the conversation exceeds MAX_INPUT_TOKENS (default: 12,000), auto-summarization kicks in. This threshold leaves a comfortable buffer below the actual context window (16,384 tokens) to account for the model's response.
You can also trigger summarization manually at any time with the /summarize command — useful if you want to compress the conversation proactively.
How the Summary Is Generated¶
Summarization works by sending the older messages to the LLM with a dedicated summarization prompt:
```python
summary_response = await client.chat.completions.create(
    model=VLLM_MODEL,
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant that creates concise summaries."},
        {"role": "user",
         "content": f"Please create a concise summary of this conversation "
                    f"history, preserving key facts, decisions, and context:"
                    f"\n\n{conversation_text}"},
    ],
    temperature=0.1,
    stream=False,
)
```
The summarization call:
- Uses a separate system prompt (focused on summarization, not being "Zorac")
- Uses low temperature (0.1) for consistent, factual summaries
- Runs without streaming because the complete summary is needed before proceeding
- Shows a spinner in the UI ("Summarizing...") because the API call takes a few seconds
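The conversation_text variable in the snippet above is assembled from the older messages before the call, and the summary is read off the completed response afterwards. A rough sketch of both steps, which may differ in detail from zorac/main.py:

```python
# Flatten the older messages into plain text for the summarization prompt
old_messages = self.messages[1:-KEEP_RECENT_MESSAGES]   # skip system, keep recent
conversation_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)

# ... summarization call shown above ...

# Non-streaming response: the full summary is in the first choice
summary_text = summary_response.choices[0].message.content
```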
What Gets Preserved (KEEP_RECENT_MESSAGES)¶
Not everything gets summarized. The message list is split into three segments:
Before summarization:

```
[system, msg1, msg2, msg3, ..., msgN-5, msgN-4, msgN-3, msgN-2, msgN-1, msgN]
└ kept ┘└─ old (summarized) ──┘└────── kept: KEEP_RECENT_MESSAGES (6) ─────┘
```

After summarization:

```
[system, summary, msgN-5, msgN-4, msgN-3, msgN-2, msgN-1, msgN]
```
- System message — Always preserved (first message, contains Zorac's identity and instructions)
- Old messages — Everything between the system message and the recent messages gets summarized into a single paragraph
- Recent messages — The last KEEP_RECENT_MESSAGES (default: 6) messages are kept verbatim for conversational continuity
The recent messages are preserved because they represent the immediate context of the conversation — what was just discussed. Summarizing them would lose the nuance needed for coherent follow-up responses.
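Putting this together, the rebuild step looks roughly like the following sketch (not verbatim from zorac/main.py):

```python
KEEP_RECENT_MESSAGES = 6

system_msg = self.messages[0]
recent = self.messages[-KEEP_RECENT_MESSAGES:]

self.messages = [
    system_msg,
    {"role": "system",
     "content": f"Previous conversation summary: {summary_text}"},
    *recent,
]
```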
The Summary Message Format¶
The summary is stored as a system message with a recognizable prefix:
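```python
# Shape of the summary message; the placeholder stands for the generated text
{"role": "system",
 "content": "Previous conversation summary: <generated summary text>"}
```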
This format allows:
- The LLM to understand it's reading a summary of prior conversation (not instructions)
- The /summary command to find and display the summary by checking for the "Previous conversation summary:" prefix
- Clear distinction from the main system message
If summarization fails (network error, model error), Zorac falls back to keeping only the system message plus recent messages. Losing old context is better than crashing.
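In code, that fallback amounts to a try/except around the summarization step; the _request_summary helper below is hypothetical and stands in for the API call shown earlier:

```python
try:
    summary_text = await self._request_summary(conversation_text)  # hypothetical helper
    middle = [{"role": "system",
               "content": f"Previous conversation summary: {summary_text}"}]
except Exception:
    middle = []  # losing old context beats crashing the session

self.messages = [self.messages[0], *middle, *recent]
```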
Configuration¶
MAX_INPUT_TOKENS¶
Default: 12000
The maximum number of tokens allowed for the system prompt plus conversation history. When this limit is exceeded, auto-summarization triggers.
This should be set lower than your model's actual context window (--max-model-len) to leave room for the model's response. A good rule of thumb: MAX_INPUT_TOKENS ≈ max-model-len - MAX_OUTPUT_TOKENS - a small safety margin.
For Zorac's defaults: 16384 - 4000 - 384 ≈ 12000
MAX_OUTPUT_TOKENS¶
Default: 4000
The maximum number of tokens the model can generate per response. This is passed directly to the API as max_tokens:
```python
stream_response = await self.client.chat.completions.create(
    model=self.vllm_model,
    messages=self.messages,
    max_tokens=self.max_output_tokens,  # ← This setting
    stream=True,
)
```
Higher values allow longer responses but consume more of the context budget. For coding tasks where you want complete code blocks, 4000 is a good default. For shorter Q&A, you could reduce this to 2000.
KEEP_RECENT_MESSAGES¶
Default: 6
The number of recent messages preserved during auto-summarization. This creates a "sliding window" effect:
- Higher values (8-10) — More immediate context preserved, but summarization saves less space
- Lower values (2-4) — More aggressive compression, but the model may lose track of recent topics
- Default (6) — Preserves the last 3 user-assistant exchanges, which is usually enough for conversational coherence
Choosing Values for Your Use Case¶
| Use Case | MAX_INPUT_TOKENS | MAX_OUTPUT_TOKENS | KEEP_RECENT_MESSAGES |
|---|---|---|---|
| General chat | 12000 | 4000 | 6 |
| Coding assistant | 12000 | 4000 | 8 |
| Quick Q&A | 8000 | 2000 | 4 |
| Long-form writing | 10000 | 6000 | 4 |
| Short context (8k model) | 4000 | 2000 | 4 |
All three settings can be changed at runtime via the /config command:
```
/config set MAX_INPUT_TOKENS 10000
/config set MAX_OUTPUT_TOKENS 6000
/config set KEEP_RECENT_MESSAGES 4
```
Or in your .env file for persistent defaults:
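```
# Example values; adjust to your own model and context window
MAX_INPUT_TOKENS=10000
MAX_OUTPUT_TOKENS=6000
KEEP_RECENT_MESSAGES=4
```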