Imagine asking an AI to help you bake brownies. The system understands the basic request—but what if organic eggs are sold out in your area? What if the nearest store with dark chocolate chips is too far for fresh delivery? And how does it know whether your 8-year-old will even like the result?
This isn’t a hypothetical. It’s the ‘brownie recipe problem,’ a real-world test for AI in grocery delivery. The challenge reveals why today’s large language models (LLMs) still stumble when faced with real-time constraints: they lack the granular, dynamic context needed to bridge intent and execution.
For Instacart, where every second counts, the stakes are higher. A 15-second response delay would drive users to abandon their orders midway. The solution? A modular AI architecture that breaks reasoning into specialized micro-models—each handling a slice of the puzzle—while integrating with external protocols to manage everything from perishable items to third-party system quirks.
Beyond ‘I Want Brownies’
The gap between a user’s vague request and a deliverable outcome isn’t just about vocabulary. It’s about layering real-world data—like regional stock levels, shelf life, and dietary preferences—into AI decision-making. Instacart’s approach starts with a foundational LLM to parse intent (e.g., ‘healthy snacks for kids’). But the heavy lifting happens downstream.
Small language models (SLMs) take over for catalog context: matching products, suggesting substitutions when items are unavailable, and even inferring what ‘healthy’ means for an 8-year-old. Meanwhile, another layer handles logistics—like calculating whether ice cream can survive a 30-minute delivery in summer heat.
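The perishability check described above can be sketched as a simple feasibility rule. The class, function, and the 50% hot-weather penalty below are illustrative assumptions, not Instacart's actual logic:

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    max_unrefrigerated_minutes: int  # how long the item tolerates no refrigeration

def delivery_is_feasible(item: Item, eta_minutes: int, ambient_temp_f: float) -> bool:
    """Return True if the item should survive the trip.

    Hot weather shrinks the tolerance window; the 50% penalty above
    85°F is a made-up heuristic for illustration only.
    """
    tolerance = item.max_unrefrigerated_minutes
    if ambient_temp_f > 85:
        tolerance *= 0.5
    return eta_minutes <= tolerance

ice_cream = Item("vanilla ice cream", max_unrefrigerated_minutes=45)
delivery_is_feasible(ice_cream, eta_minutes=30, ambient_temp_f=95)  # → False
```

In a real system the tolerance and weather data would come from catalog metadata and a forecast API rather than constants, but the shape of the decision—item attributes plus live context—stays the same.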
‘The problem isn’t just reasoning,’ says the company’s CTO. ‘It’s reasoning plus state, plus personalization.’ Loading all that into one model would create a computational beast. Instead, Instacart’s system distributes tasks: intent → categorization → substitution → delivery feasibility. Each step is optimized for speed, with SLMs fine-tuned to handle niche domains (e.g., ‘what replaces gluten-free pasta if it’s out of stock?’).
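The staged pipeline can be sketched as a chain of small, swappable functions. The stage names and stubbed logic below are hypothetical; in production each stage would call a fine-tuned model rather than a lookup table:

```python
from typing import Callable

# Each stage is one specialized step; chained together they stand in
# for a monolithic model: intent → categorization → substitution → …
Stage = Callable[[dict], dict]

def parse_intent(order: dict) -> dict:
    # A foundational LLM would parse free-text intent; stubbed here.
    order["category"] = "baking" if "brownies" in order["request"] else "general"
    return order

def substitute(order: dict) -> dict:
    # An SLM fine-tuned on the catalog would propose replacements.
    subs = {"organic eggs": "cage-free eggs"}
    order["items"] = [subs.get(i, i) for i in order["items"]]
    return order

def run_pipeline(order: dict, stages: list[Stage]) -> dict:
    for stage in stages:
        order = stage(order)
    return order

result = run_pipeline(
    {"request": "brownies", "items": ["organic eggs", "flour"]},
    [parse_intent, substitute],
)
# result["items"] → ["cage-free eggs", "flour"]
```

The benefit the article points at is visible even in the stub: each stage can be optimized, cached, or replaced independently, keeping per-step latency low.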
Avoiding the ‘Monolith Trap’
Many AI systems treat complexity as a single problem to solve. Instacart took a different path, inspired by Unix’s modular design: smaller, focused agents for payment processing, inventory checks, and third-party integrations. Why? Because real-world systems don’t behave uniformly. A point-of-sale API might update hourly, while a merchant catalog refreshes daily. Agents specialize in these quirks.
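One way to picture those specialized agents is a thin wrapper per external source, each with its own refresh cadence—hourly for a point-of-sale feed, daily for a merchant catalog. The class and cadences below are an illustrative sketch, not Instacart's code:

```python
import time

class SourceAgent:
    """A small agent wrapping one external data source, caching its
    data and refreshing on that source's own schedule."""

    def __init__(self, name: str, fetch, refresh_seconds: int):
        self.name = name
        self._fetch = fetch          # callable hitting the real API
        self.refresh_seconds = refresh_seconds
        self._cache = None
        self._last = 0.0

    def get(self):
        now = time.monotonic()
        if self._cache is None or now - self._last >= self.refresh_seconds:
            self._cache = self._fetch()
            self._last = now
        return self._cache

# Different sources, different cadences—no special-casing in a monolith.
pos = SourceAgent("pos", lambda: {"sku123": 4}, refresh_seconds=3600)
catalog = SourceAgent("catalog", lambda: {"sku123": "dark chocolate chips"},
                      refresh_seconds=86400)
```

Keeping the quirks of each upstream system inside its own agent is the Unix-style separation the article describes: small pieces, each doing one thing.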
But even modularity has trade-offs. Instacart adopted Anthropic’s Model Context Protocol (MCP) and Google’s Universal Commerce Protocol (UCP) to standardize interactions with external tools. The catch? Failure modes. ‘Two-thirds of our time is spent fixing errors,’ the CTO notes. A POS system might time out. A merchant feed could return stale data. Latency spikes when agents misinterpret service capabilities.
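Handling those failure modes typically means retrying with backoff and falling back to a degraded answer rather than blocking the user. The helper below is a generic sketch of that pattern; the function names and backoff values are assumptions:

```python
import time

def call_with_fallback(primary, fallback, retries: int = 2, backoff: float = 0.05):
    """Try an external tool call, retrying on timeout with exponential
    backoff, then return a fallback result (e.g., cached-but-stale data)."""
    for attempt in range(retries):
        try:
            return primary()
        except TimeoutError:
            time.sleep(backoff * (2 ** attempt))
    return fallback()

def flaky_pos():
    # Stand-in for a POS API that never responds in time.
    raise TimeoutError("POS did not respond")

result = call_with_fallback(flaky_pos, fallback=lambda: {"in_stock": None, "stale": True})
# result → {"in_stock": None, "stale": True}
```

The design choice mirrors the article's point: a timeout or stale feed should degrade one agent's answer, not stall the whole order.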
Tool discovery isn’t the only hurdle. Agents must also learn which tools suit which tasks—a challenge when APIs behave unpredictably. The result? A hybrid system where micro-models handle the heavy lifting, but human oversight remains critical for edge cases.
For Instacart, the brownie test isn’t just about baking. It’s about proving AI can adapt to the messy reality of grocery delivery—where every second and every substitution matters.