AI Weekly Review 2026-05-25

Sam Atkins

25 May 2026

Week In Review

This was a week defined by two flagship events held on the same day: Google I/O 2026 in Mountain View and Anthropic’s Code with Claude London. Together they advanced a single thesis: that the frontier of AI is no longer about answering questions but about acting reliably on a user’s behalf. Google’s Gemini 3.5 Flash shipped to general availability on launch day and, by Google’s reported benchmarks, surpasses last generation’s Pro model on coding and agentic tasks while running roughly 4x faster than comparable frontier models, and up to 12x faster inside Google’s Antigravity agent environment. The accompanying Antigravity 2.0 developer platform and the consumer-facing Gemini Spark personal agent show the same playbook applied at both ends of the stack.

The infrastructure underneath that vision matured visibly. Anthropic introduced self-hosted sandboxes and MCP tunnels so that Claude-driven agents can execute tools inside customer networks without exposing internal systems to the public internet — a direct answer to enterprises that have held back from agent deployments over data-perimeter concerns. NVIDIA, on the same day, published its Verified Agent Skills framework, treating reusable agent capabilities like signed software packages with provenance metadata and automated risk scanning. OpenAI’s partnership with Dell Technologies to bring Codex on-premises closes a similar loop on the hardware side. The common pattern: agents are becoming first-class enterprise software, with the security plumbing to match.

Science quietly had its biggest AI week of the year. DeepMind’s Co-Scientist graduated from research preview to Nature publication and now anchors the new Gemini for Science suite, alongside hypothesis-tournament tools and parallel computational discovery agents. A separate line of physics research at Penn opened a possible escape from the energy ceiling that constrains all of these workloads: hybrid light-matter particles that perform optical switching at roughly 4 femtojoules per operation — orders of magnitude below current electronic logic. And on the application side, an FDA-cleared chest X-ray AI from Qure.ai reported a 26.7% increase in lung-nodule detection in a study presented at ARRS 2026.

Regulation moved in step rather than against the technology. On May 19 the European Commission published its draft guidelines on high-risk AI classification and opened a public consultation through June 23 — the first concrete operational guidance for the most consequential tier of the EU AI Act. Taken together, the week made the case that the agentic era is not arriving as a single product launch but as a coordinated build-out of frontier models, deployment infrastructure, scientific applications, hardware foundations, and governance scaffolding.

Items

Gemini 3.5 Flash Launches at I/O 2026

Google opened I/O 2026 with the general availability of Gemini 3.5 Flash, the first model in the new 3.5 family and Google’s biggest Flash-tier release to date. According to Google’s reported benchmarks, the model outperforms last generation’s Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo), and MCP Atlas (83.6%), and leads in multimodal understanding with 84.2% on CharXiv Reasoning. The framing is significant: a tier traditionally optimized for cost and latency has now caught the previous generation’s reasoning-tier flagship.

Speed is the other half of the pitch. Google reports that 3.5 Flash runs roughly 4x faster than comparable frontier models in general use, and up to 12x faster inside its Antigravity agent environment, where throughput matters because agents make many sequential model calls per task. The economics of agentic workflows depend heavily on this combination: faster, cheaper inference at near-frontier quality changes which problems can be tackled by spawning fleets of subagents rather than by asking a single heavyweight model.

Availability is broad on day one. Gemini 3.5 Flash is accessible via Google Antigravity, the Gemini API in AI Studio and Android Studio, the Gemini Enterprise platform, and the consumer Gemini app and AI Mode in Search. Google noted that Gemini 3.5 Pro, the frontier reasoning tier of the new family, is in testing and expected next month. For users and developers, the takeaway this week is that the workhorse tier of a major frontier family now matches the previous flagship — the kind of price-performance step-up that tends to redraw deployment plans across the industry.

Source: Google Blog

Antigravity 2.0 and the Agentic Gemini Era

Sundar Pichai opened I/O by framing 2026 as “the agentic Gemini era,” and Antigravity 2.0 is the developer-facing centerpiece. The platform is now a standalone desktop application acting as a central hub for agent interaction, supporting parallel subagent execution, scheduled tasks that run automation in the background, and ecosystem integrations across AI Studio, Android, and Firebase. Where the first version of Antigravity was an IDE-like experience for working with a single agent, 2.0 is built for orchestrating many agents at once.

The demonstration Google chose for the keynote was deliberately provocative: Antigravity plus Gemini 3.5 Flash building a working operating system in twelve hours using 93 parallel subagents, more than 15,000 model requests, 2.6 billion tokens, and under $1,000 in API credits. Whatever one makes of the artifact, the cost and time figures matter — they describe a regime in which serious software construction is at least plausible as an autonomously planned, massively parallel computation.
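The keynote figures can be turned into per-unit numbers with a few lines of arithmetic. The sketch below uses only the values quoted above and treats "more than 15,000" and "under $1,000" as bounds, so the derived numbers are bounds too. The variable names are illustrative, and the per-token price is a blended figure, since the source does not split input tokens from output tokens.

```python
# Figures quoted for the keynote demo. "More than" and "under" make two of
# them bounds rather than exact values.
subagents = 93
model_requests = 15_000       # lower bound ("more than 15,000")
total_tokens = 2.6e9
api_cost_usd = 1_000          # upper bound ("under $1,000")
wall_clock_hours = 12

# Derived per-unit figures. Each inherits the direction of its input bounds.
tokens_per_request = total_tokens / model_requests               # at most
requests_per_subagent = model_requests / subagents               # at least
usd_per_million_tokens = api_cost_usd / (total_tokens / 1e6)     # below
usd_per_request = api_cost_usd / model_requests                  # below
tokens_per_second = total_tokens / (wall_clock_hours * 3600)     # aggregate

print(f"tokens per request (at most):     {tokens_per_request:,.0f}")
print(f"requests per subagent (at least): {requests_per_subagent:.0f}")
print(f"blended USD per 1M tokens (<):    {usd_per_million_tokens:.3f}")
print(f"USD per request (<):              {usd_per_request:.4f}")
print(f"aggregate tokens per second:      {tokens_per_second:,.0f}")
```

On these numbers the average request carries well over 100,000 tokens at a blended price below forty cents per million tokens, which suggests that long contexts and cheap inference, rather than request count alone, are what make the demo's economics work.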

Pichai’s broader announcement positioned this alongside Project Astra (a universal AI assistant with persistent memory and real-world perception), Project Mariner (a browser-controlling agent), Jules (an autonomous coding agent that operates on GitHub), and an expanded Agent Mode in the consumer Gemini app. The unifying thesis is that long-horizon, multi-step work — the kind of task humans use a project manager for — is becoming the default expectation rather than the frontier capability. The architecture being built to deliver it (dedicated VMs, durable memory, explicit approval gates for high-risk actions) is starting to look more like an operating system for agents than a chat product.

Source: Google Blog – Sundar Pichai

Gemini Spark: A 24/7 Personal Agent with a Gmail Address

Spark is the consumer-side companion to Antigravity 2.0. Built on Gemini base models and the same agentic harness as Antigravity, it is a 24/7 personal agent that operates on dedicated Google Cloud virtual machines rather than the user’s device, which means it can carry out long-running tasks in the background (drafting reports, monitoring inboxes, scheduling work) without needing a phone or laptop awake.

The most distinctive design choice is that Spark has its own Gmail address. Users can email it directly, much as they would message a human colleague, and it pulls context from Gmail, Google Docs, and the wider Workspace suite without manual setup. Existing Gemini Enterprise connectors — Microsoft SharePoint, OneDrive, ServiceNow, and others — let the agent reach across the tools where work actually happens, rather than confining it to a single-vendor ecosystem.

Spark allows users to set recurring tasks, teach it new skills over time, and approve sensitive actions explicitly. High-risk actions such as sending external emails require user confirmation, and Spark proactively notifies users when something important happens. Rollout is in stages: trusted testers this week, with beta for U.S. Google AI Ultra subscribers expected the following week. The product is the closest mainstream realization yet of the long-promised “agent that works for you while you sleep,” and its rough edges — billing, governance, error recovery — are about to be tested at scale.
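The confirmation requirement described above is an instance of a general approval-gate pattern, which can be sketched in a few lines. This is not Spark's implementation, which the source does not describe. The `Action` type, the `HIGH_RISK_KINDS` set, and the `confirm` callback are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical risk policy: action kinds that need explicit user approval.
HIGH_RISK_KINDS = {"send_external_email"}


@dataclass
class Action:
    kind: str
    description: str


def execute(action: Action, confirm: Callable[[Action], bool]) -> str:
    """Run low-risk actions directly; ask the user before high-risk ones."""
    if action.kind in HIGH_RISK_KINDS and not confirm(action):
        return "blocked"
    # A real agent would perform the action here; the sketch only reports it.
    return "executed"


def deny_all(action: Action) -> bool:
    """Stand-in for a user who declines every confirmation prompt."""
    return False


draft = Action("draft_report", "Summarize this week's inbox")
email = Action("send_external_email", "Send the summary to a client")

print(execute(draft, deny_all))   # low-risk: runs without asking
print(execute(email, deny_all))   # high-risk: stopped by the gate
```

Keeping the policy in one table and the prompt behind a callback means the same gate can serve a chat prompt, a push notification, or an email approval without changes to the agent loop.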

Source: TechCrunch

Co-Scientist Graduates to Nature, Anchors Gemini for Science

Figure: Co-Scientist multi-agent workflow diagram

DeepMind’s Co-Scientist made the jump from early-stage research project to formally published Nature paper this week, and was simultaneously released to working researchers as part of the new Gemini for Science suite. The system orchestrates specialized agents (Generation, Reflection, Ranking, Evolution, Meta-review, Proximity, and Supervisor) that propose, debate, and refine scientific hypotheses against the literature and structured biological databases.

The architecture is unusually explicit about disagreement. Generated hypotheses are scored by Reflection and Ranking agents, evolved into competing variants, and reviewed by a Meta-review process before being presented to the human researcher. DeepMind has been testing Co-Scientist with external groups on problems including antimicrobial resistance, plant immunity, and liver fibrosis; the Nature paper documents enough validated cases to move it out of the demo category.
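One common way to rank competing candidates through pairwise comparison is an Elo-style tournament, and a minimal sketch shows how such a ranking step could work. The use of Elo updates here is an assumption for illustration, not a description of DeepMind's code. The `judge` function, the `quality` table, and the hypothesis labels are hypothetical stand-ins for an LLM-based comparison between two hypotheses.

```python
import itertools


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Move rating points from loser to winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def run_tournament(hypotheses, judge, rounds: int = 3, start: float = 1200.0):
    """Play every pair once per round and return (ranking, ratings)."""
    ratings = {h: start for h in hypotheses}
    for _ in range(rounds):
        for a, b in itertools.combinations(hypotheses, 2):
            winner = judge(a, b)
            loser = b if winner == a else a
            update(ratings, winner, loser)
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    return ranking, ratings


# Hypothetical stand-in for a debate between two hypotheses: a hidden
# quality score decides each match deterministically.
quality = {"H1": 0.9, "H2": 0.5, "H3": 0.2}


def judge(a: str, b: str) -> str:
    return a if quality[a] >= quality[b] else b


ranking, ratings = run_tournament(list(quality), judge)
print(ranking)
```

Because every update moves the same number of points from loser to winner, the total rating stays constant, so a hypothesis can only rise by beating others rather than by being evaluated more often.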

Co-Scientist anchors the broader Gemini for Science suite announced at I/O, which bundles Hypothesis Generation (Co-Scientist), Computational Discovery (built on AlphaEvolve and the Empirical Research Assistance toolset), and Literature Insights (built with NotebookLM). Access is rolling out via Google Labs starting this month. The significance is less about any single tool and more about an emerging template: scientific AI as a coordinated stack of specialized agents rather than a monolithic chatbot, with the human researcher kept squarely in the loop as the final arbiter of which hypotheses are worth testing.

Source: Google DeepMind

Anthropic Adds Self-Hosted Sandboxes and MCP Tunnels to Managed Agents

At Code with Claude London on May 19, Anthropic shipped two features that address the most persistent enterprise objection to autonomous agents: that running them means letting Anthropic-controlled processes touch your sensitive data. Self-hosted sandboxes, now in public beta, move tool execution onto infrastructure the customer controls — either directly or through managed providers such as Cloudflare, Daytona, Modal, and Vercel. The agent loop itself (orchestration, context handling, error recovery) still runs on Anthropic’s servers, but files, code, and network egress stay inside the customer’s environment, governed by their own audit logging and network policies.

MCP tunnels, released as a research preview, solve the complementary problem of letting agents reach into private systems without exposing them publicly. Rather than opening inbound firewall rules to allow an agent to query an internal database or ticketing system, organizations deploy a lightweight gateway that establishes an outbound encrypted connection to Anthropic infrastructure. Claude-managed agents and the Messages API can then call those internal MCP servers through the tunnel while keeping the underlying servers off the public internet.
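The outbound-connection pattern described above can be illustrated with plain sockets on one machine. This is a generic reverse-tunnel sketch and does not reflect Anthropic's actual protocol or API: the line-delimited JSON framing, the `gateway` and `internal_mcp_server` functions, and the `lookup_ticket` tool are all hypothetical. The point it shows is that the private side never opens a listening port; it dials out once and then answers requests over that same connection.

```python
import json
import socket
import threading


def internal_mcp_server(request: dict) -> dict:
    """Stand-in for a private tool server with no public listener."""
    if request.get("tool") == "lookup_ticket":
        return {"ticket": request["id"], "status": "open"}
    return {"error": "unknown tool"}


def gateway(relay_addr) -> None:
    """Runs inside the customer network: dials OUT, then serves requests."""
    with socket.create_connection(relay_addr) as conn:
        reader = conn.makefile("r", encoding="utf-8")
        writer = conn.makefile("w", encoding="utf-8")
        for line in reader:                      # one JSON request per line
            reply = internal_mcp_server(json.loads(line))
            writer.write(json.dumps(reply) + "\n")
            writer.flush()


# Provider-side relay: the only listening socket in the whole sketch.
relay = socket.create_server(("127.0.0.1", 0))   # port 0 picks a free port
threading.Thread(target=gateway, args=(relay.getsockname(),), daemon=True).start()

tunnel, _ = relay.accept()                       # the gateway's outbound link
to_gateway = tunnel.makefile("w", encoding="utf-8")
from_gateway = tunnel.makefile("r", encoding="utf-8")


def call_through_tunnel(request: dict) -> dict:
    """Send a request down the established link and wait for the reply."""
    to_gateway.write(json.dumps(request) + "\n")
    to_gateway.flush()
    return json.loads(from_gateway.readline())


result = call_through_tunnel({"tool": "lookup_ticket", "id": 42})
print(result)
```

A production tunnel would add encryption, authentication, and request multiplexing; the sketch only demonstrates the direction of the connection, which is what lets the firewall stay closed to inbound traffic.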

Together the two features draw a cleaner line between the parts of an agent system that belong to the model provider and the parts that belong to the customer. That separation has been the implicit blocker on a long list of enterprise deployments, particularly in healthcare, finance, and other regulated industries, where data residency and network exposure questions cannot be answered with “trust the vendor.” With this release, Anthropic positions Claude agents as something that can run inside the customer’s existing security perimeter rather than alongside it.

Source: The New Stack

OpenAI and Dell Partner to Bring Codex On-Premises

At Dell Technologies World on May 18, OpenAI and Dell announced a partnership to bring Codex into hybrid and on-premises enterprise environments. The deal makes Codex available alongside the Dell AI Data Platform — the data infrastructure many large organizations already use to store, organize, and govern enterprise data inside their own facilities — and explores integration with the Dell AI Factory for hardware and orchestration.

The market context is striking. Codex has quietly become one of OpenAI’s fastest-growing enterprise products, with more than 4 million developers using it weekly across code review, test coverage, incident response, and reasoning across large repositories. According to OpenAI, the platform is also expanding beyond software development into workflows such as gathering context across tools, preparing reports, routing product feedback, qualifying leads, and coordinating work across business systems — territory traditionally occupied by horizontal business automation, now reached through coding-agent infrastructure.

The on-prem story matters because a large class of enterprise buyers — banks, hospitals, defense contractors, regulated manufacturers — have data they cannot move to public cloud, even for the AI inference step. Pairing OpenAI’s coding agent with Dell’s data and hardware footprint gives those customers a path to deploy Codex-powered agents next to systems of record without redrawing their compliance architecture. It is the same problem Anthropic addressed this week with self-hosted sandboxes, attacked from the hardware end of the stack.

Source: OpenAI

Penn Physicists Build Light-Matter Particles That Could Cut AI Compute Energy by Orders of Magnitude

Figure: Hybrid exciton-polariton optical chip schematic

A team led by Bo Zhen at the University of Pennsylvania has demonstrated optical switching using hybrid light-matter particles called exciton-polaritons, with energy consumption of roughly 4 femtojoules per switching event — far below the energy needed to briefly power even a tiny LED, and many orders of magnitude below the energy cost of equivalent electronic logic. The result, published in Physical Review Letters and detailed by Penn this week, is one of the strongest demonstrations yet that polariton physics can support the kind of strong nonlinear interactions needed for computation, not just signal routing.

The trick is that exciton-polaritons combine light’s speed with matter’s ability to interact. Pure photons can move information at the speed of light but barely interact with each other, which is why optical computing has historically struggled to do logic. Polaritons inherit the matter component’s interactions while preserving most of the photon’s speed, which means an optical pulse can switch another optical pulse without first being converted into electronic charge. Each conversion in a conventional chip costs energy and time; removing them removes both.

If the technique can be scaled — still a substantial “if” — the downstream implications for AI are direct. Training and inference for large models are now hard-bounded by electrical power delivered to GPUs and the heat that must then be removed. A photonic substrate that performs the same operations at femtojoule energies would not just lower the bill; it would change which models are economically trainable. The same approach could enable processing of information directly from cameras without round-tripping through electrical sensors, and the researchers suggest it could support basic quantum computing functions on future chips. As a near-term technology it is years away; as a research milestone it is the kind of result that reshapes long-horizon hardware roadmaps.

Source: Penn Today

European Commission Opens Public Consultation on High-Risk AI Guidelines

On May 19 the European Commission published its draft guidelines on the classification of high-risk AI systems under Article 6 of the EU AI Act, and opened a public consultation through June 23. The publication is one of the most operationally consequential pieces of guidance issued under the Act so far: the high-risk tier carries the bulk of the Act’s compliance obligations, and until now providers and deployers have had limited official guidance on which systems actually fall into it.

The draft guidelines set out the Commission’s interpretation of the relevant legal concepts and, importantly, include worked examples across sectors of AI systems that should and should not be considered high-risk. Stakeholders — AI providers, deployers, public authorities, academia, and civil society — are invited to submit feedback during the consultation window. The publication follows a delay from the Commission’s original timetable, which had targeted guidance for early February.

For the broader industry, the consultation is the first concrete handle on what “high-risk” will mean in practice within the EU’s compliance regime. Decisions about whether a given AI product needs to register with national authorities, undergo conformity assessment, and meet detailed transparency and human-oversight requirements all hinge on this classification. With foundation model providers and downstream deployers both racing toward the Act’s main 2026 compliance milestones, the guidelines and the feedback they elicit will shape how aggressive Europe’s regulatory posture turns out to be in practice.

Source: European Commission – Shaping Europe’s Digital Future

Qure.ai’s FDA-Cleared AI Catches Lung Cancers Missed on Routine X-Rays

Figure: Chest X-ray with AI-detected nodule overlay

A retrospective study presented at the American Roentgen Ray Society’s 2026 annual meeting and highlighted this week showed that Qure.ai’s FDA-cleared qXR-LN tool identified lung cancers that had been initially missed on routine chest X-rays. Conducted at University Hospitals Cleveland Medical Center using a resident education database of difficult cases, the study reported a 26.7% increase in nodule detection rates with the AI assist. The detected nodules were tied to cancers spanning all four clinical stages, including a meaningful share — 46.6% — at Stage I, where five-year survival is highest.

The clinical stakes are concrete. Chest X-ray remains the most common imaging exam in medicine, and missed lung cancers on routine X-rays are a known and persistent source of late-stage diagnoses. An AI second-reader that runs on the existing X-ray pipeline does not change the imaging itself; it changes which cases get pulled forward for follow-up CT or biopsy. Catching tumors earlier on the modality patients are already getting, rather than scaling up access to a more expensive screening pathway, is the kind of leverage point health systems have been searching for.

The study is also a marker of how AI imaging is graduating from controlled trials to evaluation on real clinical cases. qXR-LN is FDA-cleared, was tested against historical cases at a U.S. academic medical center rather than a synthetic dataset, and reports gains that are clinically interpretable rather than just statistically significant. The pattern this week (frontier model news in Mountain View and London, clinical results quietly reported from Cleveland) is increasingly characteristic of where applied AI now lives.

Source: Qure.ai Press Coverage

NVIDIA Ships Verified Agent Skills with Cryptographic Signing and Risk Scanning

Figure: NVIDIA Verified Agent Skills pipeline diagram

On May 19, NVIDIA published a Verified Agent Skills framework that treats reusable agent capabilities — the snippets of code, tools, and instructions that let an agent do a specific job — as signed, cataloged software packages with explicit trust metadata. Each verified skill is paired with a machine-readable “skill card” that documents what the skill does, who built it, what licenses and dependencies apply, and what technical or safety limitations have been identified.

Before reaching the public NVIDIA Skills catalog, each candidate is run through a publication pipeline that includes human review, automated policy checks, and a security scanner called SkillSpector. SkillSpector inspects both conventional software risks — vulnerable dependencies, credential access, suspicious scripts, data exfiltration paths — and agent-specific risks such as prompt injection, hidden instructions, trigger abuse, tool poisoning, and mismatches between a skill’s stated purpose and its actual requested permissions. Approved skills are cryptographically signed across every file and subdirectory, giving downstream developers a verifiable way to check that a skill is authentic and unmodified.
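The idea of signing "across every file and subdirectory" can be sketched as a manifest of per-file hashes with a signature over the manifest. This is an illustration of the general technique, not NVIDIA's format: the file names are hypothetical, and HMAC with a shared demo key stands in for the asymmetric signature a real catalog would presumably use so that verifiers never hold the signing key.

```python
import hashlib
import hmac
import json
import tempfile
from pathlib import Path


def build_manifest(skill_dir: Path) -> dict:
    """Map every file under skill_dir, recursively, to its SHA-256 digest."""
    return {
        p.relative_to(skill_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(skill_dir.rglob("*"))
        if p.is_file()
    }


def sign(manifest: dict, key: bytes) -> str:
    """Sign the canonical JSON form of the manifest (HMAC as a stand-in)."""
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(skill_dir: Path, signature: str, key: bytes) -> bool:
    """Recompute the manifest from disk and compare signatures."""
    return hmac.compare_digest(sign(build_manifest(skill_dir), key), signature)


key = b"demo-key-not-a-real-secret"  # assumption: illustrative shared key

with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp)
    (skill / "scripts").mkdir()
    (skill / "SKILL.md").write_text("Summarize a CSV file.\n")
    (skill / "scripts" / "run.py").write_text("print('ok')\n")

    signature = sign(build_manifest(skill), key)
    ok_before = verify(skill, signature, key)

    # Tamper with a file in a subdirectory after signing.
    (skill / "scripts" / "run.py").write_text("print('tampered')\n")
    ok_after = verify(skill, signature, key)

print(ok_before, ok_after)
```

Because the manifest keys are relative paths, adding, removing, or renaming any file changes the signed payload just as editing file contents does, which is what makes the check cover the whole directory tree rather than a single entry point.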

The framing matters as much as the mechanism. NVIDIA is treating agent skills the way the open-source software ecosystem treats packages — with provenance, signatures, scanning, and a published catalog. As agent ecosystems begin to include thousands of community-built skills and tool integrations, the failure mode is identical to the supply-chain attacks that have plagued package registries, but with much more direct access to a model’s reasoning and actions. The Verified Agent Skills framework is one of the first concrete proposals for what an agent-era equivalent of a signed-package ecosystem might look like.

Source: NVIDIA Developer Blog