Introducing LeemerH2: The Model Council

Heavy v1 proved the idea. LeemerH2 (LH2) changes the default.

The original Leemer Heavy release introduced a useful shift: stop asking one model to be researcher, engineer, critic, and writer in the same pass. Heavy used a central orchestrator that could delegate to separate research, reasoning, refinement, and synthesis stages.

That made Heavy feel closer to GPT-4.1 or Claude 3.5 Sonnet on practical engineering work. LH2 is a bigger jump. It behaves like a small engineering review team and lands in the operating band we expect from GPT-5.3 Codex-class and Claude Sonnet 4.6-class workflows.

The council is not optional.

Mission director

x-ai/grok-4.3

Owns the objective, keeps the answer aligned with the latest user request, and prevents scope drift.

Strategy architect

moonshotai/kimi-k2.6

Designs the response structure, implementation path, and product-level tradeoffs.

Long-context scout

qwen/qwen3.5-flash-02-23

Recovers details from long conversations, files, tool outputs, and evidence blocks.

Research validator

google/gemini-3.1-flash-lite-preview

Checks freshness, source quality, and whether external context supports the claim.

Tool executor

inclusionai/ling-2.6-flash

Looks for useful tool, API, GitHub, uploaded-file, and code-execution opportunities.

Red-team critic

deepseek/deepseek-v4-flash

Attacks weak assumptions, hidden coupling, unsafe certainty, and final-answer gaps.
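
The roster above can be written down as plain data. The sketch below is illustrative only: the `CouncilSeat` type and the `seats_by_provider` helper are hypothetical names, not part of any published LH2 interface. Only the role names, model identifiers, and duty descriptions come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CouncilSeat:
    """One council seat. Hypothetical type, used here for illustration only."""
    role: str
    model: str  # model identifier exactly as listed in this post
    duty: str   # short summary of the seat's responsibility


COUNCIL = (
    CouncilSeat("Mission director", "x-ai/grok-4.3",
                "Owns the objective and prevents scope drift"),
    CouncilSeat("Strategy architect", "moonshotai/kimi-k2.6",
                "Designs response structure and implementation path"),
    CouncilSeat("Long-context scout", "qwen/qwen3.5-flash-02-23",
                "Recovers details from long conversations and files"),
    CouncilSeat("Research validator", "google/gemini-3.1-flash-lite-preview",
                "Checks freshness and source quality"),
    CouncilSeat("Tool executor", "inclusionai/ling-2.6-flash",
                "Looks for tool, API, and code-execution opportunities"),
    CouncilSeat("Red-team critic", "deepseek/deepseek-v4-flash",
                "Attacks weak assumptions and final-answer gaps"),
)


def seats_by_provider(council):
    """Group role names by the provider prefix of each model identifier."""
    grouped = {}
    for seat in council:
        provider = seat.model.split("/", 1)[0]
        grouped.setdefault(provider, []).append(seat.role)
    return grouped


for provider, roles in seats_by_provider(COUNCIL).items():
    print(provider, roles)
```

Grouping by provider prefix shows that each of the six seats is backed by a different model family.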

Arbitration is where LH2 gets sharper.

Many multi-agent systems ask several models for opinions and flatten the result into a bland summary. LH2 adds an arbitration wave after the first council. The arbiters look for consensus, meaningful disagreement, unsupported claims, and the answer structure that will survive scrutiny.

That is the difference between a swarm and a pile of drafts. Disagreement becomes a resource instead of a formatting problem.
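
To make the arbitration idea concrete, here is a minimal sketch of the sorting step. It assumes each council seat returns findings as a claim, a stance, and a flag for cited evidence. The `Finding` type, the `arbitrate` function, and the sample claims are all hypothetical and say nothing about how LH2 is actually implemented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One seat's position on one claim. Hypothetical shape for illustration."""
    role: str           # council seat that produced the finding
    claim: str          # normalized claim text
    stance: str         # "support" or "oppose"
    has_evidence: bool  # whether the seat cited supporting context


def arbitrate(findings):
    """Sort claims into consensus, disagreement, and unsupported buckets."""
    by_claim = defaultdict(list)
    for finding in findings:
        by_claim[finding.claim].append(finding)

    report = {"consensus": [], "disagreement": [], "unsupported": []}
    for claim, group in by_claim.items():
        stances = {f.stance for f in group}
        if not any(f.has_evidence for f in group):
            # Nobody backed the claim with evidence, whatever their stance.
            report["unsupported"].append(claim)
        elif len(stances) > 1:
            # Seats took opposing stances: keep the conflict visible.
            report["disagreement"].append(claim)
        else:
            # Every seat that addressed the claim took the same stance.
            report["consensus"].append(claim)
    return report


findings = [
    Finding("Strategy architect", "migrate in two phases", "support", True),
    Finding("Red-team critic", "migrate in two phases", "oppose", True),
    Finding("Long-context scout", "bug is in the retry path", "support", True),
    Finding("Tool executor", "bug is in the retry path", "support", False),
    Finding("Research validator", "library v3 is stable", "support", False),
]
report = arbitrate(findings)
print(report)
```

The point of the design is that the disagreement bucket is kept as its own output instead of being averaged away, so the final synthesis can address the conflict directly.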

Internal launch eval

These are Leemer internal launch numbers, not third-party benchmark claims. Read them for the pattern of gains, not the absolute values: LH2 improves most where a council should help, especially context recovery and bug localization (52.4% to 86.1%) and catching regressions before the final answer (28.9% to 64.5%).

Internal engineering eval	Heavy v1	LeemerH2
Repo task resolution	41.8%	68.7%
Multi-file refactor pass rate	44.6%	73.2%
Long-context bug localization	52.4%	86.1%
Regression caught before final	28.9%	64.5%
Architecture review usefulness	6.7 / 10	8.9 / 10
Mean first useful plan latency	71s	58s
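
For readers who want the gains as deltas, the short script below recomputes them from the four pass-rate rows of the table. The figures are copied from the table. The `gains` helper is just an illustrative name.

```python
# Rows copied from the internal eval table: (metric, Heavy v1 %, LeemerH2 %).
PASS_RATES = [
    ("Repo task resolution", 41.8, 68.7),
    ("Multi-file refactor pass rate", 44.6, 73.2),
    ("Long-context bug localization", 52.4, 86.1),
    ("Regression caught before final", 28.9, 64.5),
]


def gains(rows):
    """Return (metric, percentage-point gain, relative gain in %) per row."""
    return [
        (name, round(new - old, 1), round((new - old) / old * 100, 1))
        for name, old, new in rows
    ]


for name, points, relative in gains(PASS_RATES):
    print(f"{name}: +{points} points ({relative}% relative)")
```

On these numbers the largest gain is in regressions caught before the final answer (+35.6 points), followed by long-context bug localization (+33.7 points).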

128K synthesis

LH2 requests a 131072-token final budget for long technical work when the provider honors it.
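
The "128K" label and the 131072 figure are the same number (128 × 1024). As a rough sketch of what "when the provider honors it" could mean in practice, the helper below caps the request at a provider limit when one is known. The function name and the idea of a known per-provider cap are assumptions made for illustration, not a description of LH2 internals.

```python
REQUESTED_FINAL_BUDGET = 128 * 1024  # 131072 tokens, the figure quoted above


def effective_budget(provider_max_output, requested=REQUESTED_FINAL_BUDGET):
    """Return the budget to send: the request, capped at the provider limit.

    provider_max_output is a hypothetical per-provider cap; None means unknown.
    """
    if provider_max_output is None:
        # Limit unknown: send the full request and let the provider decide.
        return requested
    return min(requested, provider_max_output)


print(effective_budget(None))
print(effective_budget(65_536))
```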

Engineering native

GitHub context, file reads, code execution, and systems review are part of the default identity.

Still compatible

The chat UI keeps the same token and tool-event stream contract.

The practical takeaway

Use lighter models when you need a quick answer. Use LeemerH2 when the task deserves a team: repo-scale engineering, architecture, research, debugging, migration planning, or any decision where a critic should attack the plan before you act.

LeemerH2 turns Heavy into a model council.