Background
Heavy v1 proved the idea. LH2 changes the default.
The original Leemer Heavy release introduced a useful shift: stop asking one model to be researcher, engineer, critic, and writer in the same pass. Heavy used a central orchestrator that could delegate into research, reasoning, refinement, and synthesis.
That made Heavy feel closer to GPT-4.1 or Claude Sonnet 3.5 on practical engineering work. LH2 is a bigger jump. It behaves like a small engineering review team, landing in the same operating band we expect from GPT-5.3 Codex and Claude Sonnet 4.6-class workflows.
The key design change: the council is not optional. Every LeemerH2 request goes through all six seats plus three arbiters, regardless of size. The latency investment is deliberate. It is the price of work that has been reviewed.
The council
Six seats. Fixed roles. No improvising.
The council is not a swarm of identical models voting on the best answer. Each seat has a specific job, a specific model chosen for that job, and a clear output format the arbiters read. The roles are fixed because improvised delegation produces inconsistent coverage — especially under long-context pressure.
Mission director
x-ai/grok-4.3
Owns the objective, keeps the answer aligned with the latest user request, and prevents scope drift. Acts as the final authority on what the mission is actually asking.
Strategy architect
moonshotai/kimi-k2.6
Designs the response structure, implementation path, and product-level tradeoffs. Produces the skeleton the rest of the council builds from.
Long-context scout
qwen/qwen3.5-flash-02-23
Recovers details from long conversations, files, tool outputs, and evidence blocks. Specializes in context that other models tend to drop under pressure.
Research validator
google/gemini-3.1-flash-lite-preview
Checks freshness, source quality, and whether external context actually supports the claim. The only role that can flag a fact as unverified before arbitration.
Tool executor
inclusionai/ling-2.6-flash
Looks for useful tool, API, GitHub, uploaded-file, and code-execution opportunities. Runs side-channel work that the rest of the council cannot pause for.
Red-team critic
deepseek/deepseek-v4-flash
Attacks weak assumptions, hidden coupling, unsafe certainty, and final-answer gaps. The council's last filter before the arbitration pass.
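As a rough sketch, the fixed roster above can be written down as an immutable table. The roles and model identifiers are copied from the seat list; the `Seat` type, the `COUNCIL` name, and the `seat_for` helper are illustrative assumptions, not LeemerH2's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seat:
    """One fixed council seat (hypothetical type, for illustration only)."""
    role: str
    model: str
    duty: str


# Roles and model IDs come from the seat list; the structure is an assumption.
COUNCIL = (
    Seat("mission_director", "x-ai/grok-4.3",
         "Owns the objective and prevents scope drift"),
    Seat("strategy_architect", "moonshotai/kimi-k2.6",
         "Designs response structure and implementation path"),
    Seat("long_context_scout", "qwen/qwen3.5-flash-02-23",
         "Recovers details from long conversations and files"),
    Seat("research_validator", "google/gemini-3.1-flash-lite-preview",
         "Checks freshness and source quality"),
    Seat("tool_executor", "inclusionai/ling-2.6-flash",
         "Finds and runs tool, API, and code-execution work"),
    Seat("red_team_critic", "deepseek/deepseek-v4-flash",
         "Attacks weak assumptions and final-answer gaps"),
)


def seat_for(role: str) -> Seat:
    """Look up a seat by role; raises KeyError for unknown roles."""
    for seat in COUNCIL:
        if seat.role == role:
            return seat
    raise KeyError(role)
```

A frozen dataclass in a tuple mirrors the "fixed roles, no improvising" rule: nothing at runtime can add, remove, or reassign a seat.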
Arbitration
Disagreement is a resource, not a formatting problem.
Many multi-agent systems ask several models for opinions and flatten the result into a bland summary. LH2 adds an arbitration wave after the first council. The arbiters look for consensus, meaningful disagreement, unsupported claims, and the answer structure that will survive scrutiny.
That is the difference between a council and a pile of drafts. Arbitration makes disagreement visible and resolvable: the critic may flag a weak assumption, the scout may have recovered a context chunk that changes the strategy, and the validator may have rejected a source. The arbiters see all of it.
The three arbiters vote on the strongest synthesis path. When they disagree, the director model breaks the tie using the original mission objective. Only then does the synthesis pass begin, with the full 128K (131,072-token) budget where the provider supports it.
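The vote-then-tie-break step might look something like the sketch below. It assumes a strict majority settles the vote and that the director is consulted only when no path has one. The source does not spell out the exact rule, and `pick_synthesis_path` and `director_tiebreak` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Sequence


def pick_synthesis_path(
    votes: Sequence[str],
    director_tiebreak: Callable[[Sequence[str]], str],
) -> str:
    """Return the majority path, deferring to the director when there is none.

    Assumption: a strict majority (2 of 3 arbiters) is enough to proceed.
    """
    if not votes:
        raise ValueError("at least one arbiter vote is required")
    tally = Counter(votes)
    path, count = tally.most_common(1)[0]
    if count * 2 > len(votes):
        return path
    # No majority: hand the distinct proposals to the director, which is
    # expected to choose using the original mission objective.
    return director_tiebreak(sorted(tally))


if __name__ == "__main__":
    # Stand-in for the director model: always picks the first candidate.
    director = lambda candidates: candidates[0]
    print(pick_synthesis_path(["plan_a", "plan_a", "plan_b"], director))
    print(pick_synthesis_path(["plan_a", "plan_b", "plan_c"], director))
```

Passing the tie-break in as a callable keeps the voting logic testable without a live director model.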
Unsupported claims flagged
Validator and critic outputs are cross-referenced before arbitration. Unverified claims are downgraded, not silently included.
Context recovered
Long-context scout outputs are merged with the strategy pass before synthesis. Nothing is dropped just because a model ran out of room.
Consensus over committee
The three arbiters must converge on a synthesis path. Plurality is not enough — the arbiters iterate until the plan is consistent.
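The "downgraded, not silently included" rule for unsupported claims can be illustrated with a small sketch. It assumes claims are matched by exact text and that a flag from either the validator or the critic is enough to downgrade. The `Claim` type and the status labels are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Claim:
    text: str
    status: str = "supported"  # hypothetical labels: "supported" / "unverified"


def downgrade_flagged(
    claims: Iterable[Claim],
    validator_flags: Set[str],
    critic_flags: Set[str],
) -> List[Claim]:
    """Mark claims flagged by either seat as unverified; never drop them."""
    flagged = validator_flags | critic_flags
    return [
        replace(claim, status="unverified") if claim.text in flagged else claim
        for claim in claims
    ]
```

Every claim survives the pass, so the arbiters still see what was asserted and can weigh it with its downgraded status rather than discovering a gap later.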
Internal launch eval
The numbers, without the asterisks.
These are Leemer internal launch numbers, not third-party benchmark claims. The signal is the shape: LH2 improves most where a council should improve — context recovery, bug localization, and catching regressions before the final answer lands.
Mean first useful plan latency dropped 18% despite running six models plus three arbiters. The parallelism in the first council wave buys back most of the overhead. The synthesis pass is the only serial bottleneck.
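A minimal `asyncio` sketch shows why a parallel first wave costs roughly as much as its slowest seat rather than the sum of all six. The per-seat delays are invented for the demo and are not measured LeemerH2 latencies. `run_seat` stands in for a real model call.

```python
import asyncio
import time
from typing import List, Tuple

# Hypothetical, scaled-down latencies in seconds (not measured values).
SEAT_LATENCY = {
    "mission_director": 0.05,
    "strategy_architect": 0.04,
    "long_context_scout": 0.06,
    "research_validator": 0.03,
    "tool_executor": 0.05,
    "red_team_critic": 0.02,
}


async def run_seat(role: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for a model call
    return f"{role}: done"


async def council_wave() -> List[str]:
    # All six seats start together; gather preserves submission order.
    return await asyncio.gather(
        *(run_seat(role, delay) for role, delay in SEAT_LATENCY.items())
    )


def timed_wave() -> Tuple[List[str], float]:
    """Run the wave once and return (results, wall-clock seconds)."""
    start = time.perf_counter()
    results = asyncio.run(council_wave())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_wave()
    print(f"{len(results)} seats in {elapsed:.3f}s, "
          f"serial would be about {sum(SEAT_LATENCY.values()):.3f}s")
```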
| Internal engineering eval | Heavy v1 | LeemerH2 | Relative change |
|---|---|---|---|
| Repo task resolution | 41.8% | 68.7% | +64% |
| Multi-file refactor pass rate | 44.6% | 73.2% | +64% |
| Long-context bug localization | 52.4% | 86.1% | +64% |
| Regression caught before final | 28.9% | 64.5% | +123% |
| Architecture review usefulness | 6.7 / 10 | 8.9 / 10 | +33% |
| Mean first useful plan latency | 71s | 58s | −18% |
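The last column of the table is a relative change against Heavy v1, not a percentage-point difference. The short check below recomputes it from the two score columns. The row values are copied from the table, and `relative_change` is an illustrative helper.

```python
def relative_change(before: float, after: float) -> int:
    """Percent change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (Heavy v1, LeemerH2) pairs copied from the table above.
ROWS = {
    "Repo task resolution": (41.8, 68.7),
    "Multi-file refactor pass rate": (44.6, 73.2),
    "Long-context bug localization": (52.4, 86.1),
    "Regression caught before final": (28.9, 64.5),
    "Architecture review usefulness": (6.7, 8.9),
    "Mean first useful plan latency": (71, 58),
}

DELTAS = {name: relative_change(b, a) for name, (b, a) in ROWS.items()}

if __name__ == "__main__":
    for name, delta in DELTAS.items():
        print(f"{name}: {delta:+d}%")
```

For example, repo task resolution rose 26.9 percentage points (41.8% to 68.7%), which is a 64% relative improvement.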
Use cases
Where the council earns its latency.
LeemerH2 is built for requests where being slightly wrong is expensive: migrations, architecture choices, policy decisions, incident retrospectives, and high-context product planning. The council format adds friction in the right places so weak claims are challenged before they reach the final answer.
Repo-scale engineering
- Codebase-wide refactor planning with dependency analysis
- Multi-file bug localization across large TypeScript or Rust codebases
- Architecture proposals with tradeoff matrices and migration paths
- PR review passes that catch regressions the diff doesn't reveal
Architecture decisions
- System design reviews with explicit failure mode analysis
- Database migration planning with rollback and rollforward strategies
- API contract design across microservice boundaries
- Performance bottleneck analysis with profiler-informed recommendations
Research synthesis
- Competitor analysis with claim verification and source scoring
- Technical literature review with citation-level accuracy checking
- Market landscape reports with structured evidence tables
- Research-to-decision briefs for executive or investor audiences
Incident response
- Post-mortem drafts with root-cause chains and timeline reconstruction
- Runbook generation from incident logs and affected system context
- Risk assessment for proposed hotfixes before deployment
- Cross-system impact analysis when a service changes behavior
Technical notes
Engineering-native by design.
128K synthesis
LH2 requests a 131,072-token final budget for long technical work when the provider honors it. Long output is the point, not a byproduct of the process.
Engineering native
GitHub context, file reads, code execution, and systems review are part of the default council identity. The tool executor seat runs non-blocking.
Stream compatible
The chat UI keeps the same token and tool-event stream contract. The council overhead is invisible to the streaming consumer.
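The 128K figure is 128 × 1024 = 131,072 tokens. The notes above say only that the budget applies "when the provider honors it". The sketch below assumes the fallback is to clamp the request to the provider's cap, which is a guess at the behavior. `resolve_synthesis_budget` is a hypothetical helper.

```python
from typing import Optional

SYNTHESIS_BUDGET = 128 * 1024  # 131,072 tokens


def resolve_synthesis_budget(provider_max_tokens: Optional[int]) -> int:
    """Pick the output budget to request from a provider.

    Assumption: None means the provider advertises no cap, and a smaller
    cap silently clamps the request instead of failing it.
    """
    if provider_max_tokens is None:
        return SYNTHESIS_BUDGET
    if provider_max_tokens <= 0:
        raise ValueError("provider_max_tokens must be positive")
    return min(SYNTHESIS_BUDGET, provider_max_tokens)
```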
Comparison
Heavy v1 vs. LeemerH2
Knowing which model to reach for is part of using LeemerH2 well. Use lighter models for quick answers, drafts, and low-stakes tasks. LeemerH2 is the right tool when the task deserves a team.
| Area | Heavy v1 | LeemerH2 |
|---|---|---|
| Architecture | Single orchestrator with optional delegates | Fixed council of 6 + 3 arbiters by default |
| Context recovery | Orchestrator-scoped | Dedicated long-context scout seat |
| Validation | Optional refinement pass | Mandatory critic + arbiter wave before synthesis |
| Output length | Standard model limit | 131,072-token synthesis budget when supported |
| Tool usage | Orchestrator-driven | Parallel tool-executor seat, non-blocking |
| Best for | Most chat, moderate engineering tasks | Repo-scale work, architecture, incident analysis |
In production
Why this matters, not just in demos.
Users see better decomposition, clearer assumptions, and stronger sequencing across long replies. Instead of one model improvising from memory, the system now recovers context, coordinates specialist passes, and runs synthesis only after disagreement is resolved.
That shift is why LH2 feels more like a technical review partner than a generic chatbot. The output is still fast, but now it is more auditable, more explicit about uncertainty, and better aligned to execution-heavy workflows.
Concretely: a repo migration plan from LeemerH2 includes the rollback path and the hidden coupling the red-team critic found. A market analysis includes the rejected sources the validator flagged. An incident post-mortem includes the alternative root-cause hypotheses the critic raised. That extra layer is not noise. It is what makes the output actionable.
The practical takeaway
Use lighter models when you need a quick answer. Use LeemerH2 when the task deserves a team: repo-scale engineering, architecture, research, debugging, migration planning, or any decision where a critic should attack the plan before you act.