Prompt Engineering

How Zorac constructs its system prompt, injects command awareness into the LLM, and designs specialized prompts for different tasks — turning a generic language model into a context-aware chat assistant.


Why the System Prompt Matters

Every conversation with an LLM starts with a system message — the first message in the messages array that establishes the model's identity, behavior, and context. The system prompt is the single most influential piece of text in any LLM application because it shapes every response that follows.

For Zorac, the system prompt needs to accomplish several things in a tight token budget (~200 tokens):

  1. Establish identity — The model should know it's "Zorac", not a generic assistant
  2. Provide temporal context — The model needs today's date to answer time-sensitive questions
  3. Enable command awareness — The model should understand and suggest Zorac's interactive commands
  4. Stay compact — Every token in the system prompt is a token unavailable for conversation

This is a balancing act. A richer system prompt produces better-behaved responses, but it shrinks the context window left for actual conversation.


Anatomy of the System Prompt

Zorac's system prompt is built in zorac/commands.py by the get_initial_system_message() function. It has two distinct parts:

Part 1: Identity and Date

import datetime

today = datetime.date.today().strftime("%A, %B %d, %Y")
base_message = f"You are Zorac, a helpful AI assistant. Today's date is {today}."

This produces something like:

You are Zorac, a helpful AI assistant. Today's date is Monday, February 18, 2026.

Why include the date? LLMs have a training cutoff — they don't inherently know what day it is. Including the date lets the model answer questions like "what day is it?" and provides temporal context for the conversation. The full weekday-month-day-year format (%A, %B %d, %Y) hands the model the weekday, month name, day, and year directly, so it can answer questions about any of those components without having to derive them.

Why "You are Zorac"? Naming the assistant serves two purposes. It gives the model a consistent identity to respond from, and it lets users ask "what are you?" and get a coherent answer. Without a name, the model might identify itself as ChatGPT, Claude, or whatever it encountered most during training.

Part 2: Command Awareness

The second part is generated by get_system_prompt_commands() and appended directly after the identity line:

The user is interacting with you through Zorac, a terminal-based chat client
for local LLMs.

Available Commands:
The following commands are available to the user:

/help - Display a list of all available interactive commands with descriptions.
This helps users discover and understand the functionality available in Zorac.

/quit or /exit - Save the current conversation session to disk and exit Zorac.
The session will be automatically restored on the next run.

/clear - Clear the entire conversation history and reset to a fresh session
with only the initial system message. The cleared session is automatically
saved to disk.

...

When users ask about functionality, help them understand these commands
naturally. Suggest relevant commands when appropriate and provide usage
examples when helpful.

This block gives the model enough context to:

  • Answer "how do I save?" with "You can use /save to manually save your session"
  • Proactively suggest /tokens when a user mentions running out of context
  • Explain the difference between /save and /quit (both save, but /quit also exits)

The closing instruction — "suggest relevant commands when appropriate" — is a soft behavioral nudge. It doesn't force the model to mention commands in every response, but it makes the model aware that doing so is desirable.


Two Audiences, Two Formats

The command registry (COMMANDS list) serves both humans and the LLM, but each audience needs a different format. This is why each command entry has two description fields:

{
    "triggers": ["/tokens"],
    "description": "Display current token usage statistics",
    "detailed": "Show detailed token usage information including current token count, "
                "token limit, remaining capacity, and message count. Helps users monitor "
                "conversation size and predict when auto-summarization will occur.",
}
Field         Audience               Format                   Purpose
description   Human (/help)          Short, imperative        Quick scanning in a command list
detailed      LLM (system prompt)    Thorough, explanatory    Deep understanding for natural suggestions

get_help_text() formats commands with Rich markup for terminal display — color-coded triggers, aligned columns, and special handling for /config subcommands.

get_system_prompt_commands() formats commands as plain text for the LLM. Rich markup like [cyan] would confuse the model and waste tokens on syntax it can't render.

This dual-format approach means adding a new command only requires updating one place — the COMMANDS list in commands.py. Both the /help display and the system prompt stay in sync automatically.
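
To make this concrete, here is a minimal sketch of how the two generators might read the registry. It is not the actual implementation (as noted above, the real functions also handle column alignment and /config subcommands); it only shows the shape of deriving both formats from the one COMMANDS list:

def get_system_prompt_commands():
    # Plain text for the LLM: triggers joined with "or", followed by
    # the thorough "detailed" description.
    lines = ["Available Commands:",
             "The following commands are available to the user:", ""]
    for cmd in COMMANDS:
        lines.append(f"{' or '.join(cmd['triggers'])} - {cmd['detailed']}")
    return "\n".join(lines)

def get_help_text():
    # Rich markup for the terminal: color-coded triggers next to the
    # short "description" field.
    return "\n".join(
        f"[cyan]{', '.join(cmd['triggers'])}[/cyan]  {cmd['description']}"
        for cmd in COMMANDS
    )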


The Summarization Prompt

When the conversation exceeds the token limit, Zorac needs to compress older messages into a summary. This requires a completely different prompt strategy than the main chat.

Why a Separate System Prompt?

The summarization call in zorac/llm.py uses its own system message:

summary_response = await client.chat.completions.create(
    model=VLLM_MODEL,
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant that creates concise summaries."},
        {"role": "user",
         "content": f"Please create a concise summary of this conversation "
                    f"history, preserving key facts, decisions, and context:"
                    f"\n\n{conversation_text}"},
    ],
    temperature=0.1,
    stream=False,
)
# Extract the summary text used below (standard non-streaming response shape)
summary = summary_response.choices[0].message.content

The summarization prompt deliberately does not include the Zorac identity or command awareness. This is intentional:

  • Focus — The model's only job here is summarization. Including Zorac's personality and commands would dilute focus and waste tokens on irrelevant context.
  • Consistency — A neutral "helpful assistant" identity produces more predictable summaries than a personality-driven one. "Zorac" might add conversational flair to the summary that isn't useful as compressed context.
  • Token efficiency — The summarization prompt is ~30 tokens. The full Zorac system prompt is ~200 tokens. Those saved tokens can be used for the actual conversation text being summarized.

Prompt Design Choices

"Preserving key facts, decisions, and context" — This instruction tells the model what information to prioritize. Without it, the model might produce a generic "they talked about X" summary instead of preserving actionable details like "the user prefers Python 3.13" or "they decided to use PostgreSQL."

Low temperature (0.1) — Summarization should be factual and deterministic, not creative. A temperature of 0.1 minimizes randomness, producing consistent summaries across runs. This contrasts with the main chat where the default temperature allows more natural variation.

No streaming — The summary must be complete before the conversation can continue (the old messages need to be replaced atomically). Streaming would add complexity without benefit — the user can't meaningfully read a summary in progress.

How the Summary Enters the Conversation

The completed summary is wrapped in a system message with a recognizable prefix:

summary_message = {
    "role": "system",
    "content": f"Previous conversation summary: {summary}",
}

This format serves multiple purposes:

  1. The LLM recognizes it as context — The "Previous conversation summary:" prefix tells the model this is compressed history, not instructions. The model treats it differently than it would a user message or a system instruction.
  2. The /summary command can find it — The command handler searches for the "Previous conversation summary:" prefix to display the current summary (see the sketch after this list).
  3. It's clearly distinct from the main system message — Zorac's identity prompt always sits at position 0; the summary, if present, is a second system message at position 1.
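
The prefix search itself is simple. Here is a sketch of what the /summary handler's lookup might look like (the actual handler lives in commands.py and is not shown here):

SUMMARY_PREFIX = "Previous conversation summary: "

def find_summary(messages):
    # Scan for the summary's recognizable prefix; return the summary
    # text, or None if the conversation hasn't been summarized yet.
    for msg in messages:
        if msg["role"] == "system" and msg["content"].startswith(SUMMARY_PREFIX):
            return msg["content"][len(SUMMARY_PREFIX):]
    return None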

The Messages Array

Understanding how the messages array is structured helps explain the prompt engineering decisions.

OpenAI Chat Format

Zorac uses the OpenAI-compatible chat completions API, which expects a messages array where each message has a role and content:

messages = [
    {"role": "system",    "content": "You are Zorac..."},        # Always first
    {"role": "user",      "content": "Hello!"},                   # User's message
    {"role": "assistant", "content": "Hi! How can I help?"},      # Model's response
    {"role": "user",      "content": "What commands are there?"}, # Next user message
    {"role": "assistant", "content": "You can use /help to..."},  # Next response
]

The three roles serve distinct purposes:

Role        Purpose                                    Who generates it
system      Instructions and context for the model     Application code
user        Human input                                The person using Zorac
assistant   Model responses                            The LLM

How Messages Accumulate

Each conversation turn adds two messages — one user and one assistant. This means the messages array grows linearly:

Turn 1: [system, user1, assistant1]                    → 3 messages
Turn 2: [system, user1, assistant1, user2, assistant2] → 5 messages
Turn 5: [system, ...10 chat messages...]               → 11 messages
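
In code, one turn reduces to two appends. A sketch, with get_completion() as a hypothetical stand-in for the actual chat completions call:

async def run_turn(messages, user_input):
    # Each turn grows the array by exactly two entries.
    messages.append({"role": "user", "content": user_input})
    reply = await get_completion(messages)  # hypothetical API wrapper
    messages.append({"role": "assistant", "content": reply})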

After auto-summarization, the structure becomes:

[system, summary, recent_user, recent_assistant, ...]

The summary replaces all the old user/assistant pairs with a single system message, compressing potentially dozens of messages into one paragraph.
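
The replacement is a structural operation on the array. A minimal sketch, assuming the identity prompt is always at position 0 and using keep_recent as a hypothetical cutoff for how many recent messages survive compression:

def compact_history(messages, summary, keep_recent=4):
    # Keep the identity prompt, insert the summary at position 1,
    # and retain only the most recent messages verbatim.
    summary_message = {
        "role": "system",
        "content": f"Previous conversation summary: {summary}",
    }
    return [messages[0], summary_message] + messages[-keep_recent:]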

Session Persistence

The entire messages array is what gets saved to ~/.zorac/session.json and restored on the next run. This means the system prompt, any summary, and all recent messages persist across sessions. When you restart Zorac, the conversation continues exactly where you left off — including the model's understanding of prior context.
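
Because the array is plain dicts and strings, persistence is a straightforward JSON round-trip. A sketch of the idea (Zorac's actual save/load code is not shown in this section):

import json
from pathlib import Path

SESSION_FILE = Path.home() / ".zorac" / "session.json"

def save_session(messages):
    # Persist the full array: system prompt, summary, and recent messages.
    SESSION_FILE.parent.mkdir(parents=True, exist_ok=True)
    SESSION_FILE.write_text(json.dumps(messages, indent=2))

def load_session():
    # Restore the previous conversation, or None on a fresh start.
    if SESSION_FILE.exists():
        return json.loads(SESSION_FILE.read_text())
    return None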


Token Budget Tradeoffs

The system prompt competes with conversation history for space in the context window. Here's how the budget breaks down:

Total context window:     16,384 tokens
├── System prompt:          ~200 tokens (identity + commands)
├── Summary (if present):   ~200 tokens (compressed history)
├── Recent messages:       variable (the active conversation)
├── Current user message:  variable
└── Response budget:       4,000 tokens (MAX_OUTPUT_TOKENS)
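
As a back-of-envelope check (using the ~200-token estimates above, not exact Zorac constants):

CONTEXT_WINDOW = 16_384
MAX_OUTPUT_TOKENS = 4_000

def history_budget(system_tokens=200, summary_tokens=200):
    # Tokens left for recent messages plus the current user message,
    # after reserving the fixed overhead and the response budget.
    return CONTEXT_WINDOW - system_tokens - summary_tokens - MAX_OUTPUT_TOKENS

print(history_budget())  # -> 11984 tokens for the active conversation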

What If You Wanted a Richer System Prompt?

You could add personality traits, response formatting rules, or domain-specific instructions to the system prompt. But each addition has a cost:

Addition                       Tokens   Tradeoff
Basic identity + date          ~30      Minimal — essential for coherent behavior
Command awareness              ~170     Worthwhile — enables natural command suggestions
Response formatting rules      ~100     Moderate — trades conversation space for consistency
Domain-specific instructions   ~200     Expensive — significantly reduces available context
Full personality description   ~300     Very expensive — may not be worth the context cost

Zorac's ~200-token system prompt hits a sweet spot: enough context for the model to be helpful and command-aware, without consuming so much of the budget that conversations get summarized prematurely.

The Compound Effect

System prompt size has a compound effect on user experience. A larger prompt means:

  1. Less room for conversation before summarization triggers
  2. More frequent summarization calls (each adding latency)
  3. More context lost to compression over a long conversation

For a 16k context window with a 4,000-token response reserve, roughly 12,400 tokens remain on the prompt side. Growing the system prompt by 500 tokens consumes about 4% of that budget, so summarization triggers correspondingly sooner — which might mean one fewer user-assistant exchange before context is compressed. Over a long session with multiple summarization cycles, this compounds.


Design Principles

Looking across Zorac's prompt engineering, several principles emerge:

1. One prompt, one job. The main system prompt handles identity and command awareness. The summarization prompt handles compression. Mixing concerns would dilute both.

2. Plain text for LLMs, rich text for humans. The command registry generates two formats from one data source, ensuring they stay in sync without duplicating information.

3. Compact over comprehensive. In a 16k context window, every token in the system prompt is a token unavailable for conversation. Zorac's prompt includes what's necessary and nothing more.

4. Graceful degradation. If summarization fails, the conversation continues with reduced context rather than crashing. The prompt engineering doesn't assume every LLM call will succeed (see the sketch at the end of this section).

5. Single source of truth. Adding a command means updating one list in commands.py. The /help display, system prompt, and command awareness all derive from the same registry. This eliminates the class of bugs where the help text says one thing but the system prompt says another.
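
Principle 4 maps onto a simple control-flow pattern. A minimal sketch, with summarize() as a hypothetical wrapper around the summarization call and compact_history() as sketched earlier:

async def maybe_summarize(messages):
    # If summarization fails, keep the uncompressed history and carry on;
    # the conversation degrades gracefully instead of crashing.
    try:
        summary = await summarize(messages)  # hypothetical wrapper
    except Exception:
        return messages
    return compact_history(messages, summary)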