Archived issue · 05-04-2026

The Frontier · Issue 05-04-2026

The frontier widens — humanoids ship, agents stop pretending

This week the goalposts moved on two fronts. Frontier model gains were incremental: the headline releases are about agent capabilities and tool-use reliability, not raw IQ. Meanwhile, robotics quietly crossed a line: more than one humanoid program is now in paid production rather than pilot decks. We're watching three things: whether agent-shaped products retain users at week 4, what on-prem deployment really costs at the new model tier, and how much EU AI Act enforcement shows up as concrete fines instead of guidance letters.

Published: Monday, May 4, 2026
Entries: 11
Cadence: Weekly · Sundays
Curator: Brad Anderson

Wire

arxiv.org · New paper on tool-use generalization across model families

huggingface.co · Trending: open-weights vision-language model passes 70% on MMMU

anthropic.com · MCP server registry surpasses 1,200 published servers

deepmind.google · Gemini Robotics paper updates with new manipulation benchmarks

figure.ai · Figure publishes monthly humanoid uptime telemetry

arxiv.org · Mech-interp finding: refusal vector universal across families

whitehouse.gov · New EO draft on federal agency AI procurement circulating

eu.europa.eu · AI Act guidance v3 published, with a focus on systemic-risk thresholds

01

Frontier Models

releases · benchmarks · weights

3 entries

anthropic.com 1mo

▲ headline

Claude Opus 4.7 ships with extended 1M-token context as default

Anthropic's flagship moves to Opus 4.7. The headline isn't a benchmark jump — it's that the long-context tier is no longer behind a beta flag, and tool-use reliability on long-running agent tasks is materially better than 4.6.

Fruition take

The interesting line in the release notes isn't capabilities — it's pricing. The cost per cached token at the 1M tier is what makes long-running Claude Code sessions actually viable at scale. We're recalculating our internal agent budgets accordingly.

deepmind.google 2mo

Gemini 3 reclaims top spot on multimodal benchmarks

Google's Gemini 3 release leads MMMU and Video-MME by meaningful margins. Live screen-share and real-time video reasoning are the headline use cases.

deepseek.com 2mo

DeepSeek R3 open-weights drop closes the gap on reasoning benchmarks

DeepSeek released R3 weights under a permissive license. Independent reproductions on AIME and GPQA show it within a few points of the leading proprietary reasoners at roughly 1/8th the inference cost when self-hosted.

Fruition take

For internal-only enterprise use cases — code review bots, document QA, classification pipelines — the open-weights tier is now genuinely competitive. The question is no longer 'can we run it' but 'do we have the GPU capacity and the ops maturity to bother.'

02

Agents & Tooling

protocols · SDKs · runtime

2 entries

modelcontextprotocol.io 2mo

▲ headline

MCP 2.0 spec finalized — streaming, capabilities negotiation, sandboxing

The Model Context Protocol 2.0 spec is final. Notable: streaming responses are now first-class, capability negotiation replaces ad-hoc feature flags, and there's an actual sandboxing model for tool execution.

Fruition take

MCP is quietly winning the 'how do agents talk to tools' fight. Every serious vendor has shipped a server in the last six weeks. This is the protocol layer of the agent stack — bet accordingly.

Fruition 1mo

The week-4 agent retention cliff is the real benchmark

Most published agent demos are evaluated on a single task or a single session. The number we actually care about — and the one nobody publishes — is what percentage of users still rely on the agent in week four after the novelty curve flattens. Until that number is in releases, treat capability claims as ceiling values, not floor values.

Fruition take

When we evaluate agent products for clients, week-4 sustained usage is the only metric we trust. Demo-day numbers and day-one engagement consistently overstate the steady-state value by 3-5x in our experience.

— Brad Anderson
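
For readers who want to compute the week-4 number on their own usage logs, here is a minimal sketch. It assumes "week four" means days 21 through 27 after a user's first session, and that any session in that window counts as still relying on the agent. Both are our illustrative assumptions, not a published standard, and the function and argument names are hypothetical.

```python
from datetime import date, timedelta


def week4_retention(first_use, activity):
    """Fraction of users with at least one session on days 21-27 after first use.

    first_use: dict mapping user id -> date of first session
    activity:  dict mapping user id -> set of dates with at least one session
    (The 21-27 day window is an assumed definition of "week four".)
    """
    if not first_use:
        return 0.0
    retained = 0
    for user, start in first_use.items():
        # Build the seven calendar days that make up this user's fourth week.
        window = {start + timedelta(days=d) for d in range(21, 28)}
        if window & activity.get(user, set()):
            retained += 1
    return retained / len(first_use)


# Toy data: four users start on the same day, one comes back in week four.
start = date(2026, 1, 5)
first_use = {"a": start, "b": start, "c": start, "d": start}
activity = {
    "a": {start, start + timedelta(days=22)},
    "b": {start, start + timedelta(days=3)},
    "c": {start},
}
print(week4_retention(first_use, activity))
```

Comparing this figure with day-one engagement for the same cohort gives a rough sense of how far a demo-day number overstates steady-state use.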

03

Robotics & Embodied

humanoids · manipulation · field deployments

2 entries

▲ headline

Figure 03 enters revenue production at second BMW plant

Figure announced the 03 platform is now in revenue-generating production at a second BMW facility, performing chassis assembly subtasks alongside human workers. Reported uptime: 89% over a rolling 30-day window.

Fruition take

Two plants is the inflection from 'pilot' to 'scaling.' The 89% uptime number is the one to watch: humanoid programs that get stuck at uptime in the 60% range rarely make it to a third site.
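
Figure's announcement, as summarized here, does not say how uptime is defined. As a rough sketch, assuming uptime means operational hours divided by scheduled hours over the most recent 30 days, the calculation might look like this. The function name and the data layout are hypothetical.

```python
def rolling_uptime(daily, window=30):
    """Uptime fraction over the most recent `window` days.

    daily: list of (scheduled_hours, operational_hours) tuples, oldest first.
    Assumes uptime = operational hours / scheduled hours (our definition).
    """
    recent = daily[-window:]
    scheduled = sum(s for s, _ in recent)
    if scheduled == 0:
        raise ValueError("no scheduled hours in window")
    operational = sum(o for _, o in recent)
    return operational / scheduled


# Toy data: a rough first five days, then thirty steadier ones.
history = [(20, 10)] * 5 + [(20, 18)] * 30
print(rolling_uptime(history))  # only the last 30 days count
```

Because the window rolls, a bad early week drops out of the figure after a month, which is worth remembering when comparing a mature site against a newly launched one.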

1x.tech 2mo

1X Neo home trial expands to 500 households with teleop disclosure

1X expanded its Neo home trial to 500 households. Notable shift from prior phases: the company now publicly discloses what percentage of tasks are teleoperated vs. autonomous in monthly reports.

04

Research

papers · interp · alignment · scaling

1 entry

Mech-interp paper isolates 'refusal direction' across model families

A paper from the Anthropic interpretability team demonstrates that a single linear direction in the residual stream mediates refusal behavior across multiple frontier models. Ablating that direction reliably suppresses refusals, a jailbreak effect that replicates across Claude, Llama, and an open-weights Mistral checkpoint.
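
To make "ablating a direction" concrete, here is a toy sketch of the usual linear-algebra move: subtract from an activation vector its projection onto the direction. The summary above does not describe the paper's exact procedure, so treat this as an assumption about the general technique. The short lists stand in for residual-stream activations, and the function names are hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [c / norm for c in vector]


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`.

    Computes x - (x . r) r, where r is the unit vector along `direction`.
    """
    r = unit(direction)
    coeff = sum(a * b for a, b in zip(activation, r))
    return [a - coeff * b for a, b in zip(activation, r)]


# Toy example: the "refusal direction" is the second axis.
activation = [3.0, 4.0, 1.0]
refusal_direction = [0.0, 2.0, 0.0]
print(ablate_direction(activation, refusal_direction))  # [3.0, 0.0, 1.0]
```

After ablation the vector has no component along the chosen direction, while everything orthogonal to it is left unchanged.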

Fruition take

If a single direction controls refusals, the alignment story is more fragile than vendor messaging suggests. Worth pairing this with the Apollo Research paper from March on deceptive alignment.

05

Policy & Governance

enforcement · frameworks · safety

1 entry

digital-strategy.ec.europa.eu 2mo

▲ headline

First EU AI Act enforcement fines hit GPAI providers

The European Commission issued the first concrete fines under the AI Act's general-purpose AI provisions, targeting two providers for incomplete model documentation and missing systemic-risk assessments. Combined penalties reportedly exceed €40M.

Fruition take

Fines against the model providers themselves were always going to land before fines against deployers. If you're building on top of GPAI APIs, your immediate risk is contract clauses being rewritten to push compliance burden downstream — read your renewals carefully this quarter.

06

Field Deployments

what actually shipped in production

2 entries

Fruition 1mo

What we're seeing on actual cost-per-resolved-ticket

Across three Fruition managed deployments this quarter, the median cost per resolved ticket, including all infra, model spend, and engineering ops, is landing between $0.18 and $0.34. That's well below the $1-2 range vendor case studies cite, but only because we're aggressive about routing easy tickets to small models. Running the full-fat reasoning model on every request is still the wrong default.

— Brad Anderson
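
The routing effect on cost per resolved ticket can be sketched with a few lines of arithmetic. Every number below is invented for illustration (per-ticket model costs, overhead, and the share of easy tickets) and none of it comes from the deployments described above. The function name is hypothetical.

```python
def cost_per_resolved_ticket(tickets, small_cost, large_cost, fixed_overhead):
    """Total spend divided by the number of resolved tickets.

    tickets: list of (is_easy, resolved) pairs.
    Easy tickets are routed to the small model, the rest to the large one.
    fixed_overhead covers infra and engineering ops for the period.
    """
    model_spend = sum(small_cost if easy else large_cost for easy, _ in tickets)
    resolved = sum(1 for _, ok in tickets if ok)
    if resolved == 0:
        raise ValueError("no resolved tickets")
    return (model_spend + fixed_overhead) / resolved


# Hypothetical workload: 100 resolved tickets, 80 of them easy.
routed = [(True, True)] * 80 + [(False, True)] * 20
# Same workload with routing disabled: every ticket goes to the large model.
unrouted = [(False, True)] * 100

# Hypothetical costs: $0.02 small model, $0.50 large model, $10 overhead.
print(cost_per_resolved_ticket(routed, 0.02, 0.50, 10.0))    # roughly 0.22
print(cost_per_resolved_ticket(unrouted, 0.02, 0.50, 10.0))  # roughly 0.60
```

With these made-up inputs, routing cuts the per-ticket figure by well over half. The real ratio depends entirely on how many tickets are easy and on the price gap between the two model tiers.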

Klarna publishes 18-month retrospective on customer service AI

Klarna published a detailed retrospective on its 18-month deployment of LLM-driven customer service. Headline: handle time down 40%, but resolution-quality scores recovered slowly after an initial dip and required a redesigned escalation path.

Fruition take

The candor in this retro is rare and valuable. The 'initial dip' chart should be required reading for any executive sold on instant CSAT gains from chatbot deployments — the J-curve is real, and most pilots get killed during the dip.