Heavy v1 proved the idea. LH2 changes the default.
The original Leemer Heavy release introduced a useful shift: stop asking one model to be researcher, engineer, critic, and writer in the same pass. Heavy used a central orchestrator that could delegate into research, reasoning, refinement, and synthesis.
That made Heavy feel closer to GPT-4.1 or Claude Sonnet 3.5 on practical engineering work. LH2 is a bigger jump. It behaves like a small engineering review team, landing in the same operating band we expect from GPT-5.3 Codex and Claude Sonnet 4.6-class workflows.
The council is not optional.
| Role | Model | Focus |
|---|---|---|
| Mission director | x-ai/grok-4.3 | Owns the objective, keeps the answer aligned with the latest user request, and prevents scope drift. |
| Strategy architect | moonshotai/kimi-k2.6 | Designs the response structure, implementation path, and product-level tradeoffs. |
| Long-context scout | qwen/qwen3.5-flash-02-23 | Recovers details from long conversations, files, tool outputs, and evidence blocks. |
| Research validator | google/gemini-3.1-flash-lite-preview | Checks freshness, source quality, and whether external context supports the claim. |
| Tool executor | inclusionai/ling-2.6-flash | Looks for useful tool, API, GitHub, uploaded-file, and code-execution opportunities. |
| Red-team critic | deepseek/deepseek-v4-flash | Attacks weak assumptions, hidden coupling, unsafe certainty, and final-answer gaps. |
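A roster like this is easy to picture as plain configuration: a fixed set of seats, each binding a role brief to a provider model ID. The sketch below is illustrative only; the class and field names are hypothetical, not LeemerChat's actual schema, though the model IDs come from the roster above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CouncilSeat:
    role: str   # what this seat is accountable for
    model: str  # provider/model ID the orchestrator routes to
    focus: str  # one-line brief for the seat's system prompt

# Hypothetical roster mirroring the council described above.
LH2_COUNCIL = [
    CouncilSeat("mission-director", "x-ai/grok-4.3",
                "Keep the answer aligned with the latest user request."),
    CouncilSeat("strategy-architect", "moonshotai/kimi-k2.6",
                "Design response structure and implementation path."),
    CouncilSeat("long-context-scout", "qwen/qwen3.5-flash-02-23",
                "Recover details from long conversations and files."),
    CouncilSeat("research-validator", "google/gemini-3.1-flash-lite-preview",
                "Check freshness and source quality."),
    CouncilSeat("tool-executor", "inclusionai/ling-2.6-flash",
                "Find tool, API, and code-execution opportunities."),
    CouncilSeat("red-team-critic", "deepseek/deepseek-v4-flash",
                "Attack weak assumptions and final-answer gaps."),
]
```

Making the seats immutable data rather than prompt boilerplate is what makes the council "not optional": every request fans out to the same six roles.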
Arbitration is where LH2 gets sharper.
Many multi-agent systems ask several models for opinions and flatten the result into a bland summary. LH2 adds an arbitration wave after the first council. The arbiters look for consensus, meaningful disagreement, unsupported claims, and the answer structure that will survive scrutiny.
That is the difference between a swarm and a pile of drafts. Disagreement becomes a resource instead of a formatting problem.
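One way to see why arbitration beats flattening: instead of averaging the council's drafts into one summary, an arbiter can separate claims most seats agree on from claims that remain contested, and surface the disagreement explicitly. This is a toy sketch of that idea, not LeemerChat's actual arbitration logic; the draft shape is an assumption.

```python
from collections import Counter

def arbitrate(drafts):
    """Split council claims into consensus vs. disputed.

    Each draft is assumed to be a dict with a 'claims' list.
    A claim reaches consensus when a strict majority of seats
    assert it; everything else is kept as open disagreement
    rather than silently averaged away.
    """
    counts = Counter(c for d in drafts for c in set(d["claims"]))
    majority = len(drafts) // 2 + 1
    return {
        "consensus": [c for c, n in counts.items() if n >= majority],
        "disputed": [c for c, n in counts.items() if n < majority],
    }

# Hypothetical drafts from three council seats.
drafts = [
    {"claims": ["cache layer is the bottleneck", "add an index"]},
    {"claims": ["cache layer is the bottleneck", "rewrite the query"]},
    {"claims": ["cache layer is the bottleneck", "add an index"]},
]
verdict = arbitrate(drafts)
# "cache layer is the bottleneck" lands in consensus;
# "rewrite the query" stays visible as a disputed claim.
```

The disputed bucket is the payoff: it is the material a final answer can present as "the team disagreed here," which a flattening summarizer would have erased.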
Internal launch eval
These are Leemer internal launch numbers, not third-party benchmark claims. The signal is the shape: LH2 improves most where a council should improve, especially context recovery, bug localization, and catching regressions before the final answer.
| Internal engineering eval | Heavy v1 | LeemerH2 |
|---|---|---|
| Repo task resolution | 41.8% | 68.7% |
| Multi-file refactor pass rate | 44.6% | 73.2% |
| Long-context bug localization | 52.4% | 86.1% |
| Regression caught before final | 28.9% | 64.5% |
| Architecture review usefulness | 6.7 / 10 | 8.9 / 10 |
| Mean time to first useful plan | 71s | 58s |
128K synthesis
LH2 requests a 131,072-token (128K) final synthesis budget for long technical work when the provider honors it.
Engineering native
GitHub context, file reads, code execution, and systems review are part of the default identity.
Still compatible
The chat UI keeps the same token and tool-event stream contract.
The practical takeaway
Use lighter models when you need a quick answer. Use LeemerH2 when the task deserves a team: repo-scale engineering, architecture, research, debugging, migration planning, or any decision where a critic should attack the plan before you act.
Open LeemerChat