
Heavy successor · May 7, 2026

LeemerH2 turns Heavy into a model council.

Leemer Heavy proved that union models work. LeemerH2 makes the team the product: a fixed council of frontier and flash models planning, researching, executing, arguing, verifying, and then streaming one final answer through a 128K synthesis pass. The council is not optional — it runs on every request.

10 min read · Internal engineering eval included

10 · First-wave agents
3 · Arbiter seats
8 · Default tool tasks
131,072 · Max output tokens
6 · Council model families
−18% · Latency vs Heavy v1

Background

Heavy v1 proved the idea. LH2 changes the default.

The original Leemer Heavy release introduced a useful shift: stop asking one model to be researcher, engineer, critic, and writer in the same pass. Heavy used a central orchestrator that could delegate into research, reasoning, refinement, and synthesis.

That made Heavy feel closer to GPT-4.1 or Claude 3.5 Sonnet on practical engineering work. LH2 is a bigger jump. It behaves like a small engineering review team, landing in the same operating band we expect from GPT-5.3 Codex and Claude Sonnet 4.6-class workflows.

The key design change: the council is not optional. Every LeemerH2 request goes through all six seats plus three arbiters, regardless of size. The latency investment is deliberate. It is the price of work that has been reviewed.

The council

Six seats. Fixed roles. No improvising.

The council is not a swarm of identical models voting on the best answer. Each seat has a specific job, a specific model chosen for that job, and a clear output format the arbiters read. The roles are fixed because improvised delegation produces inconsistent coverage — especially under long-context pressure.

Mission director

x-ai/grok-4.3

Owns the objective, keeps the answer aligned with the latest user request, and prevents scope drift. Acts as the final arbiter of what the mission is actually asking.

Strategy architect

moonshotai/kimi-k2.6

Designs the response structure, implementation path, and product-level tradeoffs. Produces the skeleton the rest of the council builds from.

Long-context scout

qwen/qwen3.5-flash-02-23

Recovers details from long conversations, files, tool outputs, and evidence blocks. Specializes in context that other models tend to drop under pressure.

Research validator

google/gemini-3.1-flash-lite-preview

Checks freshness, source quality, and whether external context actually supports the claim. The only role that can flag a fact as unverified before arbitration.

Tool executor

inclusionai/ling-2.6-flash

Looks for useful tool, API, GitHub, uploaded-file, and code-execution opportunities. Executes side-channel work the main council cannot interrupt for.

Red-team critic

deepseek/deepseek-v4-flash

Attacks weak assumptions, hidden coupling, unsafe certainty, and final-answer gaps. The council's last filter before the arbitration pass.
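
The six seats above can be written down as a small lookup table. The sketch below is only an illustration: the `Seat` dataclass, its field names, the hyphenated role keys, and the `COUNCIL` constant are hypothetical. Only the role names and model identifiers come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seat:
    """One fixed council seat (hypothetical structure, for illustration only)."""
    role: str
    model: str
    duty: str


# Role names and model IDs are copied from the post; the layout is an assumption.
COUNCIL = (
    Seat("mission-director", "x-ai/grok-4.3", "own the objective, prevent scope drift"),
    Seat("strategy-architect", "moonshotai/kimi-k2.6", "design structure and tradeoffs"),
    Seat("long-context-scout", "qwen/qwen3.5-flash-02-23", "recover long-context details"),
    Seat("research-validator", "google/gemini-3.1-flash-lite-preview", "check freshness and sources"),
    Seat("tool-executor", "inclusionai/ling-2.6-flash", "find and run tool opportunities"),
    Seat("red-team-critic", "deepseek/deepseek-v4-flash", "attack weak assumptions"),
)


def model_for(role: str) -> str:
    """Return the model ID assigned to a role; raise KeyError for unknown roles."""
    for seat in COUNCIL:
        if seat.role == role:
            return seat.model
    raise KeyError(role)


# Distinct provider prefixes (the part of each model ID before the slash).
families = {seat.model.split("/")[0] for seat in COUNCIL}
```

Counting distinct provider prefixes gives six, which lines up with the "Council model families" figure, although the post does not say how families are counted.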

Arbitration

Disagreement is a resource, not a formatting problem.

Many multi-agent systems ask several models for opinions and flatten the result into a bland summary. LH2 adds an arbitration wave after the first council. The arbiters look for consensus, meaningful disagreement, unsupported claims, and the answer structure that will survive scrutiny.

That is the difference between a council and a pile of drafts. Arbitration makes disagreement visible and resolvable: the critic may flag a weak assumption, the scout may have recovered a context chunk that changes the strategy, and the validator may have rejected a source. The arbiters see all of it.

The three arbiters vote on the strongest synthesis path. When they disagree, the director model breaks the tie using the original mission objective. Only then does the synthesis pass begin, with the full 128K token budget.
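
The vote-then-tie-break step can be sketched in a few lines. This is a minimal illustration, not the LH2 implementation: the function name, the string-valued "paths", and the reading of "disagree" as "no strict majority" are all assumptions. The director is modeled as a plain callback.

```python
from collections import Counter
from typing import Callable, Sequence


def pick_synthesis_path(
    arbiter_votes: Sequence[str],
    director_tiebreak: Callable[[Sequence[str]], str],
) -> str:
    """Return the path a strict majority chose, else defer to the director."""
    if not arbiter_votes:
        raise ValueError("at least one arbiter vote is required")
    counts = Counter(arbiter_votes)
    path, votes = counts.most_common(1)[0]
    if votes * 2 > len(arbiter_votes):
        return path
    # No strict majority: hand the distinct candidates to the director.
    candidates = sorted(counts)
    return director_tiebreak(candidates)


def first_candidate(candidates: Sequence[str]) -> str:
    """Stand-in director that simply picks the first candidate."""
    return candidates[0]


majority = pick_synthesis_path(["plan-a", "plan-a", "plan-b"], first_candidate)
split = pick_synthesis_path(["plan-c", "plan-a", "plan-b"], first_candidate)
```

Here `majority` is `"plan-a"` because two of three arbiters agree, while `split` falls through to the stand-in director. A real director would weigh the candidates against the mission objective rather than pick by position.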

Unsupported claims flagged

Validator and critic outputs are cross-referenced before arbitration. Unverified claims are downgraded, not silently included.
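
A rough sketch of that downgrade step follows. The `Claim` type, the status labels, and the choice to treat a flag from either reviewer as enough to downgrade are assumptions made for illustration. The post only says the two outputs are cross-referenced.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Claim:
    text: str
    status: str = "supported"  # "supported" or "unverified" (hypothetical labels)


def downgrade_unverified(
    claims: Iterable[Claim],
    validator_flags: Iterable[str],
    critic_flags: Iterable[str],
) -> List[Claim]:
    """Mark claims flagged by either reviewer as unverified; never drop a claim."""
    flagged = set(validator_flags) | set(critic_flags)
    return [
        replace(claim, status="unverified") if claim.text in flagged else claim
        for claim in claims
    ]


draft = [Claim("index rebuild is online"), Claim("vendor supports rollback")]
reviewed = downgrade_unverified(draft, {"vendor supports rollback"}, set())
```

Every claim survives the pass, and only its status changes. That matches the "downgraded, not silently included" wording above.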

Context recovered

Long-context scout outputs are merged with the strategy pass before synthesis. Nothing is dropped because the model ran out of room.

Consensus over committee

The three arbiters must converge on a synthesis path. Plurality is not enough — the arbiters iterate until the plan is consistent.

Internal launch eval

The numbers, without the asterisks.

These are Leemer internal launch numbers, not third-party benchmark claims. The signal is the shape: LH2 improves most where a council should improve — context recovery, bug localization, and catching regressions before the final answer lands.

Mean plan latency dropped 18% despite running six models plus three arbiters. The parallelism in the first council wave buys back most of the overhead. The synthesis pass is the only serial bottleneck.
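
To see why a parallel first wave costs roughly the time of its slowest seat instead of the sum of all seats, consider this toy model. It is a sketch under stated assumptions: `run_seat` and `run_council` are hypothetical names, the sleeps stand in for model calls, and the arbitration wave is left out for brevity.

```python
import asyncio
import time
from typing import Dict, List, Tuple


async def run_seat(name: str, seconds: float) -> str:
    """Stand-in for one council seat; sleeps instead of calling a model."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_council(
    seat_delays: Dict[str, float], synthesis_delay: float
) -> Tuple[List[str], float]:
    """Run all seats concurrently, then a single serial synthesis step."""
    start = time.perf_counter()
    # First wave: every seat starts at once, so wall time tracks the slowest seat.
    outputs = await asyncio.gather(
        *(run_seat(name, delay) for name, delay in seat_delays.items())
    )
    # Synthesis waits for the whole wave; this is the serial part.
    await asyncio.sleep(synthesis_delay)
    return list(outputs), time.perf_counter() - start


if __name__ == "__main__":
    delays = {"scout": 0.03, "validator": 0.02, "critic": 0.01}
    outputs, elapsed = asyncio.run(run_council(delays, 0.01))
    print(outputs, f"{elapsed:.2f}s")
```

`asyncio.gather` returns results in the order the coroutines were passed, so seat outputs stay aligned with their roles even when a faster seat finishes first.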

| Internal engineering eval | Heavy v1 | LeemerH2 | Delta |
| --- | --- | --- | --- |
| Repo task resolution | 41.8% | 68.7% | +64% |
| Multi-file refactor pass rate | 44.6% | 73.2% | +64% |
| Long-context bug localization | 52.4% | 86.1% | +64% |
| Regression caught before final | 28.9% | 64.5% | +123% |
| Architecture review usefulness | 6.7 / 10 | 8.9 / 10 | +33% |
| Mean first useful plan latency | 71s | 58s | −18% |
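
The Delta column reads as relative change against the Heavy v1 baseline, not a percentage-point difference. Repo task resolution, for example, rises 26.9 points, which is about 64% of the 41.8% starting value. The short check below recomputes each delta from the table's own figures. The function and variable names are illustrative.

```python
def relative_delta(before: float, after: float) -> int:
    """Percentage change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (Heavy v1, LeemerH2) pairs copied from the eval table.
ROWS = {
    "Repo task resolution": (41.8, 68.7),
    "Multi-file refactor pass rate": (44.6, 73.2),
    "Long-context bug localization": (52.4, 86.1),
    "Regression caught before final": (28.9, 64.5),
    "Architecture review usefulness": (6.7, 8.9),
    "Mean first useful plan latency": (71, 58),
}

deltas = {name: relative_delta(before, after) for name, (before, after) in ROWS.items()}
# deltas values, in row order: 64, 64, 64, 123, 33, -18
```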

Use cases

Where the council earns its latency.

LeemerH2 is built for requests where being slightly wrong is expensive: migrations, architecture choices, policy decisions, incident retrospectives, and high-context product planning. The council format adds friction in the right places so weak claims are challenged before they reach the final answer.

Repo-scale engineering

  • Codebase-wide refactor planning with dependency analysis
  • Multi-file bug localization across large TypeScript or Rust codebases
  • Architecture proposals with tradeoff matrices and migration paths
  • PR review passes that catch regressions the diff doesn't reveal

Architecture decisions

  • System design reviews with explicit failure mode analysis
  • Database migration planning with rollback and rollforward strategies
  • API contract design across microservice boundaries
  • Performance bottleneck analysis with profiler-informed recommendations

Research synthesis

  • Competitor analysis with claim verification and source scoring
  • Technical literature review with citation-level accuracy checking
  • Market landscape reports with structured evidence tables
  • Research-to-decision briefs for executive or investor audiences

Incident response

  • Post-mortem drafts with root-cause chains and timeline reconstruction
  • Runbook generation from incident logs and affected system context
  • Risk assessment for proposed hotfixes before deployment
  • Cross-system impact analysis when a service changes behavior

Technical notes

Engineering-native by design.

128K synthesis

LH2 requests a 131,072-token final budget for long technical work when the provider honors it. Long output is the point, not a byproduct of the process.
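
"128K" here means 128 × 1,024 = 131,072 tokens. How the backend reacts when a provider does not honor that request is not described in this post. The sketch below simply assumes the effective budget is the smallest of the requested figure and any provider or account caps. The function name and parameters are hypothetical.

```python
from typing import Optional

REQUESTED_MAX_OUTPUT_TOKENS = 128 * 1024  # 131,072, the figure from the post


def effective_output_budget(
    provider_limit: Optional[int] = None,
    account_limit: Optional[int] = None,
    requested: int = REQUESTED_MAX_OUTPUT_TOKENS,
) -> int:
    """Assumed rule: the tightest known cap wins; None means no cap is known."""
    caps = [requested]
    caps.extend(cap for cap in (provider_limit, account_limit) if cap is not None)
    return min(caps)


full_budget = effective_output_budget()
capped_budget = effective_output_budget(provider_limit=65_536)
```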

Engineering native

GitHub context, file reads, code execution, and systems review are part of the default council identity. The tool executor seat runs non-blocking.

Stream compatible

The chat UI keeps the same token and tool-event stream contract. The council overhead is invisible to the streaming consumer.

Comparison

Heavy v1 vs. LeemerH2

Knowing which model to reach for is part of using LeemerH2 well. Use lighter models for quick answers, drafts, and low-stakes tasks. LeemerH2 is the right tool when the task deserves a team.

| Area | Heavy v1 | LeemerH2 |
| --- | --- | --- |
| Architecture | Single orchestrator with optional delegates | Fixed council of 6 + 3 arbiters by default |
| Context recovery | Orchestrator-scoped | Dedicated long-context scout seat |
| Validation | Optional refinement pass | Mandatory critic + arbiter wave before synthesis |
| Output length | Standard model limit | 131,072-token synthesis budget when supported |
| Tool usage | Orchestrator-driven | Parallel tool-executor seat, non-blocking |
| Best for | Most chat, moderate engineering tasks | Repo-scale work, architecture, incident analysis |

In production

Why this matters, not just in demos.

Users see better decomposition, clearer assumptions, and stronger sequencing across long replies. Instead of one model improvising from memory, the system now recovers context, coordinates specialist passes, and runs synthesis only after disagreement is resolved.

That shift is why LH2 feels more like a technical review partner than a generic chatbot. The output is still fast, but now it is more auditable, more explicit about uncertainty, and better aligned to execution-heavy workflows.

Concretely: a repo migration plan from LeemerH2 includes the rollback path and the hidden coupling the strategy architect found. A market analysis includes the rejected sources the validator flagged. An incident post-mortem includes the alternative root-cause hypotheses the critic raised. That extra layer is not noise — it is what makes the output actionable.

The practical takeaway

Use lighter models when you need a quick answer. Use LeemerH2 when the task deserves a team: repo-scale engineering, architecture, research, debugging, migration planning, or any decision where a critic should attack the plan before you act.

FAQ

Frequently asked questions

What is LeemerH2?

LeemerH2, also called LH2, is LeemerChat's successor to Leemer Heavy. It is a multi-model council that plans, runs safe read-only tools, fans out specialist agents, arbitrates disagreements, verifies with a critic, and streams a final answer.

How is LeemerH2 different from Leemer Heavy v1?

Leemer Heavy v1 was a single orchestrator with optional delegates. LeemerH2 always runs a fixed council of models and adds an arbitration wave before final synthesis, making it stronger for engineering, architecture, research, and long-context work.

Can LeemerH2 produce 128K-token answers?

The LeemerH2 backend requests up to 131,072 output tokens for final synthesis. Actual output still depends on provider limits, account limits, context size, and upstream model behavior.

Is LeemerH2 better for software engineering?

LeemerH2 is designed to be much stronger for engineering work because it combines a mission director, strategy architect, long-context scout, research validator, tool executor, and red-team critic with an arbitration wave and a final synthesis pass before answering.
