🍲 The cheapest token is the one you never spend

A local AI assistant that knows when not to use AI.

Runs on your own computer. Does the cheap, deterministic work with scripts — for free — and saves the model for what actually needs thinking.


Self-hosted: your machine
Multi-LLM: Claude · GPT · Gemini · Groq · Ollama · any OpenAI-compatible
Windows · macOS · Linux
Reach it from browser · terminal · Telegram · Discord
Apache-2.0

🔒 Privacy by design

Don't want an AI using your personal data?

TokenSoup redacts names, IDs, IBANs, cards and emails before anything is sent to a model.

The safest data is the data that never left your computer.

⚡ 0 tokens

Why pay a model to run `git status`?

Deterministic work runs as a local script — and reuses the scripts you already trust. No round trip, no cost.

The cheapest token is the one you never spend.

🛡️ Human in the loop

Nothing destructive happens without your yes.

Reject a command and it stays rejected — the model can't bypass you by switching tools or rewording the path.

You stay in control of your own machine.

🧠 Multi-model teams

One task, the right model for each part.

A coordinator splits the work across parallel agents: a frontier model to plan, cheap or local models for the legwork.

Orchestrate many models, not just pick one.

🔗 Over MCP · beta

Works with your other tools.

Any MCP client — Claude Code, Cursor, Cline — can call TokenSoup's tools; TokenSoup can use other MCP servers too.

A good neighbor in your toolchain.

The point

Deterministic work and scripts you already trust shouldn't cost a single token.

Most cloud agents answer “git status” or “is this valid JSON” by shipping your context to a model and paying for the round trip. TokenSoup recognizes those, runs a local routine instead, and only reaches for a model — the right-sized one — when the task genuinely needs reasoning. It runs on your machine, redacts personal data before anything leaves it, and asks before doing anything destructive.

What it is

A finished, self-hosted assistant you launch with a double-click and talk to in your browser, terminal, or a chat app. One API key to start. Built for computer-comfortable people, not just developers.

What it isn't

Not a developer library you import into your code, and not a competitor to one. Those libraries are toolkits for engineers building agent systems; TokenSoup is the finished assistant itself.

How it saves tokens

A few ideas, one principle.

Spend nothing where nothing needs spending; spend modestly where a small model suffices; spend fully only on the hard parts.

⚡

Indie Agents

0 tokens

Deterministic tasks — git status, file counts, JSON checks, format checks — run as built-in indie scripts, not model calls. The smart router recognizes them and runs the local routine automatically. The assistant can also author a new reusable script when it meets a repeatable task — shown to you for approval before anything is saved.
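The source does not show how the smart router recognizes deterministic requests. A minimal sketch of the idea, assuming a hypothetical regex table (`INDIE_ROUTES`, `match_indie`, and `handle` are illustrative names, not TokenSoup's API): a matching message runs a local command, and only everything else reaches the paid model.

```python
import re
import subprocess

# Hypothetical routing table: pattern -> local command. Illustrative only.
INDIE_ROUTES = [
    (re.compile(r"^\s*(what'?s the )?git status\??\s*$", re.I), ["git", "status", "--short"]),
    (re.compile(r"^\s*(list|show) (the )?git log\s*$", re.I), ["git", "log", "--oneline", "-n", "10"]),
]


def match_indie(message: str):
    """Return the local command for a deterministic request, or None."""
    for pattern, command in INDIE_ROUTES:
        if pattern.search(message):
            return command
    return None


def handle(message: str, call_model) -> str:
    command = match_indie(message)
    if command is not None:
        # 0 tokens: run the local routine, no model round trip.
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        return result.stdout or result.stderr
    # Only requests that need reasoning reach the (paid) model.
    return call_model(message)
```

A real router would need a far more robust classifier than two regexes; the point is that the match is decided locally, before any token is spent.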

🧩

Reuse what works

your scripts, 0 tokens

Already have battle-tested glue in Bash, PowerShell, Python, or Node — or a compiled binary? Register it as an indie agent and TokenSoup calls it directly instead of re-implementing, and re-paying for, the same logic in an LLM. Your legacy automation becomes part of the assistant.
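How `/indie register <name> <path>` stores and launches a script is not documented here. A minimal sketch, assuming a hypothetical extension-to-interpreter table (`INTERPRETERS`, `REGISTRY`, `register`, and `run_indie` are illustrative names): the existing script is invoked as-is, so none of its logic is re-implemented in a prompt.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical extension -> interpreter table (an assumption, not TokenSoup's config).
INTERPRETERS = {
    ".sh": ["bash"],
    ".ps1": ["powershell", "-NoProfile", "-File"],  # "pwsh" on macOS/Linux
    ".py": [sys.executable],
    ".js": ["node"],
}

REGISTRY: dict[str, list[str]] = {}


def register(name: str, path: str) -> list[str]:
    """Store the argv prefix used to run `path`; unknown suffixes run as binaries."""
    p = Path(path)
    prefix = INTERPRETERS.get(p.suffix.lower())
    argv = [*prefix, str(p)] if prefix else [str(p)]
    REGISTRY[name] = argv
    return argv


def run_indie(name: str, args: tuple[str, ...] = ()) -> tuple[int, str]:
    """Run a registered script directly: no prompt, no model call, no tokens."""
    result = subprocess.run([*REGISTRY[name], *args],
                            capture_output=True, text=True, check=False)
    return result.returncode, result.stdout
```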

🎯

Smart Router

right-sized model

When a model is needed, the router classifies the task and routes it: simple text to a cheap/fast model, standard work to a mid model, heavy work to a frontier model. A semantic cache returns repeated answers with no new call.
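The real classifier is not described in detail. As a minimal sketch of tiered routing, assuming a crude keyword-and-length heuristic and hypothetical model ids (the tier names mirror `/router set <tier> <model>`; everything else is illustrative):

```python
# Hypothetical tier -> model mapping; real ids would come from /router set.
TIER_MODELS = {
    "cheap": "small-fast-model",
    "mid": "mid-model",
    "frontier": "frontier-model",
}

# Illustrative hints that a task needs heavy reasoning.
HEAVY_HINTS = ("refactor", "architecture", "prove", "debug", "design")


def classify(task: str) -> str:
    """Rough heuristic standing in for a real classifier."""
    text = task.lower()
    words = len(text.split())
    if any(hint in text for hint in HEAVY_HINTS) or words > 200:
        return "frontier"
    if words <= 20:
        return "cheap"
    return "mid"


def route(task: str) -> str:
    return TIER_MODELS[classify(task)]
```

Keeping the mapping in a table means the tier decision and the model choice can be changed independently.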

🗺️

SoupGraph

codebase map, 0 tokens

An AST knowledge graph of your code, built locally with no model calls. Ask about your codebase and TokenSoup feeds the model the exact symbols and file locations instead of dumping whole files — far fewer tokens. Build it with one click in the web UI (or /project map); once built, relevant context is added automatically. It can auto-refresh itself after edits (opt-in, incremental, in the background).
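SoupGraph's internals are not shown. A minimal sketch of the underlying idea using Python's standard `ast` module (`build_symbol_graph` and `lookup` are hypothetical names): parse the code locally, record where each function or class lives and which plain names it calls, then answer a question with one compact line instead of a whole file.

```python
import ast


def build_symbol_graph(source: str, filename: str = "<memory>") -> dict:
    """Map each function/class (nested ones included) to its location and callees.

    No model calls: this only parses the code.
    """
    tree = ast.parse(source, filename=filename)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            calls = sorted({
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            })
            graph[node.name] = {"file": filename, "line": node.lineno, "calls": calls}
    return graph


def lookup(graph: dict, name: str) -> str:
    """Compact context line fed to the model instead of the whole file."""
    info = graph[name]
    return f"{name} @ {info['file']}:{info['line']} calls {', '.join(info['calls']) or 'nothing'}"
```

Method calls such as `text.split()` are attributes rather than plain names, so this sketch ignores them; a fuller graph would resolve those too.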

📚

Skills

loaded only when relevant

Short instruction sets — testing, git, security, and your own — are injected into the prompt only when a message is actually relevant to them, never all at once. Add, remove, and toggle each one from the web UI; a disabled skill is never detected or injected, so it costs nothing.
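How relevance is detected is not specified. A minimal sketch, assuming simple keyword triggers (`Skill`, `relevant_skills`, and `build_prompt` are hypothetical): only matching, enabled skills are appended to the prompt, and a disabled skill is skipped before its triggers are even checked.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    instructions: str
    triggers: tuple[str, ...]  # hypothetical keyword triggers
    enabled: bool = True


def relevant_skills(message: str, skills: list[Skill]) -> list[Skill]:
    text = message.lower()
    # Disabled skills short-circuit: never detected, never injected.
    return [s for s in skills if s.enabled and any(t in text for t in s.triggers)]


def build_prompt(message: str, skills: list[Skill],
                 base: str = "You are a helpful assistant.") -> str:
    parts = [base] + [s.instructions for s in relevant_skills(message, skills)]
    return "\n\n".join(parts)
```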

✂️

Context narrowing

skeleton, not whole files

For large code files, TokenSoup sends a structural skeleton — imports and function/class signatures with line numbers — instead of the whole file, and the assistant asks for exact line ranges when it needs the body. A Full / Balanced / Lean switch tunes how aggressive this is, and the UI shows how many tokens it saved.
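For Python sources, a skeleton of this kind can be sketched with the standard `ast` module (Python 3.9+ for `ast.unparse`). This is an illustration of the technique, not TokenSoup's implementation; `skeleton` and `line_range` are hypothetical names, and return annotations and decorators are dropped for brevity.

```python
import ast


def skeleton(source: str) -> str:
    """Imports plus function/class signatures with line numbers; bodies omitted."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append((node.lineno, ast.unparse(node)))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            lines.append((node.lineno, f"{prefix} {node.name}({ast.unparse(node.args)})"))
        elif isinstance(node, ast.ClassDef):
            lines.append((node.lineno, f"class {node.name}"))
    return "\n".join(f"L{n}: {text}" for n, text in sorted(lines))


def line_range(source: str, start: int, end: int) -> str:
    """Return 1-indexed, inclusive lines start..end, as the model would request."""
    return "\n".join(source.splitlines()[start - 1:end])
```

The model first sees only the skeleton, then asks for something like `line_range(src, 4, 5)` when it needs a body.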

Scenario	Typical baseline	With TokenSoup
Git operations	~500 tok / query	0 tok · indie
Repeated queries	~1000 tok	0 tok · cache
Simple tasks	~800 tok · large model	~200 tok · small model
Email triage	~2000 tok / email	~200 tok · small model
Large code file read	~36k tok · whole file	skeleton + ranges

⚠️ The table figures are illustrative; the end-to-end result is measured. A reproducible benchmark suite ships in benchmark/. The deterministic mechanisms (skeletons, skill gating, SoupGraph context) are measured directly: run python3 benchmark/suites/run_all.py. An end-to-end token/cost A/B against a direct frontier-model call, benchmark/suites/bench_e2e.py, needs an API key to run; it shows ~59% lower dollar cost with accuracy preserved. Each test also checks correctness, since saved tokens only count if the answer is still right.
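The benchmark scripts are not reproduced here. A minimal sketch of the arithmetic behind a dollar-cost A/B, assuming prices quoted per million input and output tokens and accuracy measured as the fraction of correct answers (both assumptions; the function names are hypothetical): savings are credited only when the cheaper run is at least as accurate.

```python
def dollar_cost(input_tokens: int, output_tokens: int,
                price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Cost in dollars, with prices given per million tokens."""
    return (input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok) / 1_000_000


def cost_reduction(baseline_cost: float, candidate_cost: float,
                   baseline_accuracy: float, candidate_accuracy: float):
    """Percent of cost saved, or None if the candidate answered worse."""
    if candidate_accuracy < baseline_accuracy:
        return None  # saved tokens don't count if the answers got worse
    return 100.0 * (baseline_cost - candidate_cost) / baseline_cost
```

Input and output tokens are priced separately because most providers charge more for output, so token counts alone don't determine the dollar figure.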

Security-first

An assistant on your machine shouldn't be able to quietly do harm.

Safety is layered, on the assumption that a tool which can run commands needs guardrails — not blind trust. This security work is complete and shipped in v1.0, validated by an end-to-end benchmark. See SECURITY.md for the honest trust model & limits.

0

PII Redaction

Emails, IDs, IBANs, phone numbers replaced with placeholders before anything is sent to a model.

1

Injection Scan

User messages checked for prompt-injection patterns.

2

External Content Guard

Web and file content passes the same guard before it can influence the model.

3

Canary Tokens

Secret markers in context detect exfiltration attempts.

4·5

Response Validation

The final response is scanned before it's shown to you.

🔐

Encrypted KeyVault

API keys encrypted at rest (AES-256-GCM); localhost-only by default.
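TokenSoup's detectors are not shown in the source. The sketch below illustrates two of the layers under stated assumptions: regex placeholders for layer 0 (emails, IBANs, and phone numbers only; names and national IDs need more than a regex) and a random canary string for layer 3 that must never appear in a response. Patterns, placeholder format, and function names are all hypothetical.

```python
import re
import secrets

# Illustrative patterns only; production PII detection needs far more care.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,4})?\b"),
    "PHONE": re.compile(r"\+?\d[\d ()-]{7,}\d"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with numbered placeholders; the map stays on this machine."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into the model's answer, locally."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


def make_canary() -> str:
    return f"CANARY-{secrets.token_hex(8)}"


def leaks_canary(response: str, canary: str) -> bool:
    # If the secret marker shows up in output, context is being exfiltrated.
    return canary in response
```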

HITL — Human-in-the-Loop Path Guard

Dangerous operations require your explicit approval. Reject a file operation and that path is protected against every tool — bash, write_file, edit_file, python -c, shell redirects — for the rest of the turn. The model can't bypass your rejection by switching tools or rewording the path, and an already-rejected command won't re-prompt until you ask for it again yourself. Destructive writes that would clobber an existing file require confirmation.
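A minimal sketch of a turn-scoped path guard, assuming paths are compared after `os.path.realpath` and `os.path.normcase` so that rewording (`./a/../a/x`, different case on Windows) still matches. `PathGuard` and its methods are hypothetical, and the path extraction here is deliberately naive; TokenSoup's handling of `python -c`, aliases, and .NET commands is not shown.

```python
import os
import shlex


class PathGuard:
    """Remembers paths rejected this turn and blocks them for every tool."""

    def __init__(self):
        self.rejected: set[str] = set()

    @staticmethod
    def _normalize(path: str) -> str:
        # Resolve ".", "..", symlinks, and (on Windows) case, so rewording still matches.
        return os.path.normcase(os.path.realpath(path))

    def reject(self, path: str) -> None:
        self.rejected.add(self._normalize(path))

    def is_blocked(self, path: str) -> bool:
        target = self._normalize(path)
        # Rejecting a directory also blocks everything inside it.
        return any(target == r or target.startswith(r + os.sep) for r in self.rejected)

    def check_tool_call(self, tool: str, argument: str) -> bool:
        """True if the call may proceed. Naive per-tool path extraction."""
        if tool in ("write_file", "edit_file"):
            candidates = [argument]
        else:  # "bash" and similar: every shell word, redirect targets included
            candidates = [word.lstrip(">") for word in shlex.split(argument)]
        return not any(c and self.is_blocked(c) for c in candidates)

    def new_turn(self) -> None:
        self.rejected.clear()
```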

Trust model & honest limits

TokenSoup assumes a single trusted operator per instance. Everything the agent reads from the outside — web pages, files, messages — is treated as untrusted. It is not a hostile-multi-tenant boundary: to run code for mutually distrusting users, isolate at the infrastructure level (separate instances, credentials, and ideally hosts or VMs).

The layers above reduce risk and limit blast radius — they do not make it zero. OS-level isolation (bubblewrap / WSL2 / containers, Enterprise) shares the host kernel and is not a hypervisor; the soft folder sandbox is a parser and can be defeated by obfuscation; prompt injection can attempt to hijack the agent. Channels deny by default until you set their allowlist or pair a user in (or deliberately open them), and an optional audited break-glass can run a command unsandboxed — but only for the owner (solo) or an RBAC tools.elevated role (Enterprise), never a paired guest. For adversarial, multi-tenant, or regulated workloads, run TokenSoup inside a dedicated VM and treat these as an inner layer, not the boundary.

Provided under Apache-2.0, “AS IS”, without warranty; this is not legal advice. Full details: SECURITY.md.

Under the hood

How a request flows.

Unlike a gateway that fans messages out to channels, TokenSoup sends every request down through layers — one engine applies redaction, caching, and routing, then branches to a free local script or a right-sized model, with your approval guarding anything destructive.

Works with your other tools — over MCP beta

TokenSoup speaks the Model Context Protocol in both directions: any MCP client (Claude Code, Cursor, Cline, …) can call TokenSoup's tools, and TokenSoup can call out to other MCP servers.

Polyglot — port a slow function to a faster language alpha

A separate, experimental layer: hand it a slow Python function and it benchmarks, proposes a port to Rust/C/Go/COBOL, waits for your approval, compiles, and verifies the result matches. Registering the artifact as an indie agent is a deliberate second step.
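The source says Polyglot verifies that a port matches the original but not how. One common approach is differential testing on random inputs, sketched below. `verify_port` and `speedup` are hypothetical names, and the "port" in the usage is a pure-Python stand-in; a compiled Rust/C/Go artifact would be loaded through some binding not shown here.

```python
import random
import timeit


def verify_port(original, port, make_input, trials=200, seed=0):
    """Differential check: the port must match the original on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = make_input(rng)
        if original(*args) != port(*args):
            return False, args  # first counterexample found
    return True, None


def speedup(original, port, args, number=1000):
    """Ratio of original to port runtime on the same arguments."""
    t_orig = timeit.timeit(lambda: original(*args), number=number)
    t_port = timeit.timeit(lambda: port(*args), number=number)
    return t_orig / t_port


def slow_sum_squares(n):
    return sum(i * i for i in range(n))


def fast_sum_squares(n):  # stand-in for a compiled port
    return (n - 1) * n * (2 * n - 1) // 6
```

Returning the failing input makes a rejected port easy to debug before anything is registered.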

Multi-model orchestration — teams of agents

A coordinator breaks a complex task into subtasks and runs workers in parallel — and each agent can run a different model, including a local one. Use a frontier model to plan, cheap or local models for the legwork, a strong model to synthesize.
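A minimal sketch of the fan-out/fan-in pattern with `asyncio`, assuming the plan (the list of subtasks) was already produced by the planning model. `call_model` is a hypothetical stand-in for a provider call, and the model names are placeholders.

```python
import asyncio


async def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call (hypothetical)."""
    await asyncio.sleep(0)  # where the network request would go
    return f"{model}: {prompt}"


async def run_team(subtasks: list[str], worker_models: list[str], synth_model: str) -> str:
    # Workers run concurrently; each subtask can go to a different (e.g. local) model.
    results = await asyncio.gather(
        *(call_model(model, sub) for model, sub in zip(worker_models, subtasks))
    )
    # A stronger model merges the partial results.
    return await call_model(synth_model, " | ".join(results))
```

`asyncio.gather` returns results in submission order, so the synthesizer sees them in the order the plan listed them.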

And there's more under the hood — durable multi-step flows that survive a session restart, an autonomous agent mode, scheduled cron tasks, batch processing, a semantic cache, and per-agent security profiles. TokenSoup is built to bend to how you work, not the other way around.

Get started

Install & run.

First run installs dependencies, asks for one API key, and opens the web UI at http://localhost:7070.

👆 Easiest on Windows: open the tokensoup folder and double-click TokenSoup.bat (or TokenSoup-PS.cmd). That's it.

…or from a terminal:

Windows

# From the extracted folder
> cd path\to\tokensoup
> python soup.py

Linux

$ cd tokensoup
$ bash install.sh        # or: python3 soup.py

macOS — creates ~/Applications/TokenSoup.app

$ cd tokensoup
$ bash install.sh        # or: python3 soup.py

Flag	What it does
python soup.py	Web UI at localhost:7070 (default)
--cli	Terminal chat instead of the web UI
--reset	Re-enter API keys
--check	Check installation state
--lang en|sr	Force English or Serbian
--port 8080	Custom port

Reference

Every command.

All commands are typed directly in chat. Arguments in [brackets] are optional. Full reference also lives in COMMANDS.md.


💬 Sessions

/new [name]	New session
/sessions	List all sessions
/status	Session status: tokens, model, cost
/context	Context window details
/clear	Clear conversation history
/compact	Compress context
/memory	Show MEMORY.md
/resume [name]	Resume previous session

🤖 Model

/model <name>	Switch model
/models	List available models
/lang <code>	Interface language

⚙️ Configuration

/config get <key>	Read setting
/config set <key> <val>	Set value
/config list	All settings
/security status	Security layer overview
/security pii on|off	PII redaction
/security set <name>	Security profile (full/standard/minimal/monitor/off)
/vault set <key> <val>	Store encrypted value
/vault get <key>	Get encrypted value

🔑 API Keys

/config key <provider> <key>	Set/update an API key
/doctor	Diagnostic: checks keys & providers
web UI · Settings	Manage keys in the interface

🚀 Agent Mode

/agent <task>	Run autonomous agent
/agent stop	Stop agent
/agent status	Agent status
/agent sandbox on|off	Sandbox isolation
/agent capabilities	Allowed tools
/hitl on|off	Require approval before each step
/hitl strict	Approval for ALL bash commands
/hitl trust <category> [session|always|never]	Trust policy for a risky-op category
/hitl memory [clear]	Show/clear remembered trust decisions
/approve	Approve pending step
/observe on|off	Monitor agent activity
/router status	Smart router status
/router set <tier> <model>	Model for a tier
/si <task>	Self-improve agent

⚡ Indie Agents · 0 tokens

Local scripts that run without API calls.

/indie list	All registered agents
/indie run <name> [args]	Run an agent
/indie new <name>	Create template
/indie register <name> <path>	Register a script
/indie create "<desc>"	LLM generates + registers
/indie remove <name>	Remove agent
/indie info <name>	Agent details

Built-in agents

git_status · git_diff · git_log · git_commit · file_stats · file_find · json_validate · json_format · system_info · disk_usage · web_scrape · session_summary · soupgraph_query

📊 Semantic Cache

/cache on|off	Toggle semantic cache
/cache status	Hit rate and stats
/cache semantic	Semantic matching mode
/cache exact	Exact-match mode only
/cache model	Show/set embedding model
/cache install	Install sentence-transformers (once)
/cache threshold <0-1>	Similarity threshold (default 0.85)
/cache ttl <sec>	Entry time-to-live
/cache refresh	Re-embed cached entries
/cache flush	Clear cache

📋 Batch Processing

Batch API costs 50% less; results within 24h.

/batch create <name>	New batch job
/batch add <name> <prompt>	Add item
/batch run <name>	Submit
/batch status <id>	Check status
/batch results <id>	Get results

👥 Teams (Multi-Agent)

/team create <name>	Create team
/team add <team> <agent>	Add agent
/team run <name> <task>	Run team task
/team status <name>	Team status
/team list	All teams
/economy on|off	Economy mode: minimize cost
/budget set <type> <amt>	Spend limits (session/daily/tokens)
/ratelimit set <provider> …	Per-provider rate limits

💼 Flows (Durable Tasks)

/flow start <name>	Start flow
/flow list	Active flows
/flow resume <id>	Resume paused flow
/flow stop <id>	Stop flow
/goal <desc>	Autonomous run toward a goal
/cron list	Schedule recurring jobs
/notify	Notification channels
/changelog	Show recent changelog

📧 Email & Content · optional

/email triage	Triage inbox
/email draft <to> <subject>	Draft email
/email send <id>	Send drafted email
/email config	Email settings
/generate <image|video|audio> <model> <prompt>	Generate media
/marketing <topic>	Marketing content pipeline
/media	List media generations

🗺️ SoupGraph

/project map	Build AST knowledge graph (0 tokens)
/project map <query>	Search codebase
/project load	Load project context (uses graph)
/indie soupgraph_query <query>	Search via indie agent

📚 Skills

/skills	List available skills
/skill <name>	Show a skill
web UI · Agents → Skills	Add, remove, enable/disable

🔧 Development

/commit	Stage & commit with an AI message
/diff	Show working-tree diff
/review	AI review of current changes
/pr	Draft a pull request from commits
/verify on|off	Verify model output (read or execute)
/improve <task>	Propose & apply code improvements
/effort low|medium|max	Reasoning effort (cost knob)
/ultraplan <task>	Deep multi-step planning session
/audit	Security audit + history
/tasks	Project task/TODO manager
/polyglot scan	Perf layer: compiler scan & optimize
/project init	Initialize project

🔗 MCP beta

/mcp connect <url>	Connect MCP server
/mcp list	Connected servers
/mcp tools	Available MCP tools
/mcp disconnect <id>	Disconnect server

💡 Other

/search <query>	Web search
/fetch <url>	Fetch URL content
/save	Save session (memory + snapshot)
/dream	Consolidate session memory now
/voice	Voice mode
/vim	Vim input mode
/doctor	Full diagnostic check
/help [command]	List commands / command help
/cost	Token cost summary
/export	Export session to Markdown
/version	TokenSoup version

Who it's for

Made for people, not just terminals.

🧑‍💻

Computer-comfortable, not a coder

You want a capable assistant you launch with a double-click and talk to in a browser — not a CLI tool that assumes you live in a terminal.

🔒

Privacy-minded

You want the assistant on your own machine, with personal data redacted before anything leaves it, and a clear say over destructive actions.

🏢

Small teams · Enterprise (in development)

One self-hosted helper for everyday work. An Enterprise module (in development) will add team features, stronger controls, auditable oversight, and the option of a fully local model behind your firewall — all on the same core engine.

Project status & honesty

The HITL security path is well-tested — path guard, repeated-rejection blocking, .NET/alias-aware destructive-command detection, and destructive-write detection all have dedicated coverage.
The token-savings numbers are now measured, not just estimated: an end-to-end A/B (same context, model-locked vs auto router, accuracy checked) shows ~59% lower dollar cost with accuracy preserved. Reproduce with benchmark/suites/bench_e2e.py.
Some areas remain opt-in or post-v1.0: the LLM query classifier (off by default for privacy), the semantic-cache embedding model (install to match paraphrases; exact-match works without it), and a self-trained micro-model classifier.
The security model has not yet had an independent third-party audit. For high-stakes deployments, commission one before relying on it.