Laurent Giraid is a seasoned technologist whose work sits at the intersection of high-performance machine learning and the evolving ethics of enterprise AI. With a career dedicated to deconstructing how natural language processing integrates with hardware, he has become a leading voice for architects trying to bridge the gap between cloud-scale power and local privacy. In this conversation, we explore the seismic shift introduced by Apple’s third-generation foundation models, specifically focusing on how new memory architectures are dismantling the “DRAM wall” that has long stifled on-device intelligence. We delve into the mechanics of prompt-time expert routing, the strategic partnership with Google Cloud for agentic workloads, and the lingering transparency concerns that keep enterprise compliance officers up at night.
Traditional on-device AI has long been hamstrung by the physical limits of RAM, but we are seeing a shift where 20-billion-parameter models can now run locally. How exactly is the AFM 3 architecture rewriting the rules of what consumer hardware can handle?
The shift we are seeing with the AFM 3 family, particularly the Core Advanced model, is nothing short of an architectural rebellion against the “memory wall.” For years, we were stuck in a loop where the entire weight set of a model had to reside in DRAM, which effectively capped on-device models at a very small fraction of what server-side GPUs could handle. By moving the weight set off DRAM and storing the full 20-billion-parameter model in NAND flash, Apple has effectively turned the device’s storage into an active participant in inference. This is a massive leap from the days when enterprise architects had to settle for “lite” models that felt more like toys than tools. Now, the DRAM acts merely as a working buffer, while the Instruction-Following Pruning (IFP) mechanism allows the system to pull only what is necessary from the flash memory, fundamentally changing the performance ceiling for local agents.
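To make the "DRAM as working buffer" idea concrete, here is a rough back-of-the-envelope sketch. The 20-billion total and 4-billion maximum active parameter counts come from this conversation; the 4-bit weight precision and the 8 GB DRAM budget are illustrative assumptions, not published figures for AFM 3 or any specific device.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Size of a weight set in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 20e9        # full pool, as stated in the interview
MAX_ACTIVE_PARAMS = 4e9    # upper end of the active range, as stated in the interview
ASSUMED_BITS_PER_WEIGHT = 4    # assumption: quantization level is not given in the source
ASSUMED_DRAM_BUDGET_GB = 8.0   # assumption: hypothetical DRAM available to the model

full_gb = weight_footprint_gb(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
active_gb = weight_footprint_gb(MAX_ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT)

for label, size_gb in (("full weight set", full_gb), ("largest active subset", active_gb)):
    fits = size_gb <= ASSUMED_DRAM_BUDGET_GB
    print(f"{label}: {size_gb:.1f} GB, fits in assumed DRAM budget: {fits}")
```

Under these assumed numbers the full weight set would not fit in DRAM while the largest active subset would, which is the gap that flash-resident weights are meant to close.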
You mentioned that this architecture is quite “exotic” compared to standard Mixture of Experts models. Could you walk us through why routing experts once per prompt, rather than per token, is such a critical pivot for on-device performance?
If you look at a conventional Mixture of Experts (MoE) model, the router is working overtime, selecting different experts for every single token generated. In a server environment with massive memory bandwidth, that is fine, but on consumer silicon, the NAND-to-DRAM bandwidth simply cannot keep up with that level of data swapping. To solve this, Apple’s researchers implemented a system where the routing decision happens just once, right at the start of the query. A small, specialized model predicts which experts from the 20-billion-parameter pool in NAND are needed for the specific prompt and loads them into the active memory alongside the shared experts. This “one-and-done” approach means you aren’t constantly fighting the clock to move weights back and forth during generation, allowing for a fluid user experience that would otherwise be bogged down by hardware latency.
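The difference between the two routing schedules can be illustrated with a toy simulation that counts how many experts must be read from flash. This is only a sketch: the uniformly random router, the LRU eviction policy, and every number used below are hypothetical simplifications, not a description of Apple's implementation.

```python
import random
from collections import OrderedDict


def per_token_flash_reads(num_tokens: int, num_experts: int, top_k: int,
                          dram_slots: int, seed: int = 0) -> int:
    """Count expert loads from flash when a router picks top_k experts per token.

    DRAM is modeled as an LRU cache holding at most dram_slots experts.
    The uniform random router is an assumption made for illustration only.
    """
    if top_k > dram_slots:
        raise ValueError("DRAM must hold at least the experts used by one token")
    rng = random.Random(seed)
    resident = OrderedDict()  # expert id -> True, ordered by recency of use
    reads = 0
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            if expert in resident:
                resident.move_to_end(expert)   # cache hit: refresh recency
                continue
            if len(resident) >= dram_slots:
                resident.popitem(last=False)   # evict the least recently used expert
            resident[expert] = True
            reads += 1                         # cache miss: one read from flash
    return reads


def per_prompt_flash_reads(top_k: int) -> int:
    """With one routing decision per prompt, the chosen experts are loaded once."""
    return top_k


if __name__ == "__main__":
    tokens, pool, k, slots = 200, 64, 4, 8   # hypothetical sizes
    print("per-token routing reads: ", per_token_flash_reads(tokens, pool, k, slots))
    print("per-prompt routing reads:", per_prompt_flash_reads(k))
```

In this toy model the per-prompt count is fixed at `top_k` regardless of response length, while the per-token count grows with every cache miss during generation. A real router would likely show more locality than a uniform random one, so the gap here should be read as illustrative rather than as a measurement.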
There is a fascinating elasticity in how these models operate, with active parameters scaling between 1 billion and 4 billion. What does this variability look like in practice, and what are the trade-offs involved in this sparse activation?
The genius of AFM 3 Core Advanced lies in its ability to read the room; it doesn’t fire on all cylinders for a simple task like setting a timer or summarizing a short text. The system intelligently scales its active parameter count from 1 billion for those lightweight operations up to 4 billion for high-complexity reasoning, all drawn from that massive 20-billion-parameter reservoir in flash memory. This sparse activation is designed to conserve thermal headroom and battery life, the very resources on which the current documentation remains quiet. However, the trade-off is the “hidden” cost of the initial load from NAND; while it is much faster than reloading weights for every token, there is still a physical reality to moving gigabytes of data. This is why the summer technical report is so highly anticipated, as we need to see actual benchmarks on how this scaling affects the responsiveness of the device under heavy workloads.
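The cost of that initial load can be estimated with simple arithmetic: bytes to move divided by sustained flash read throughput. The 1-billion and 4-billion active parameter counts come from this conversation; the 4-bit weight precision and the 2 GB/s throughput are placeholder assumptions, since no benchmarks have been published yet.

```python
def load_time_seconds(active_params: float, bits_per_weight: float,
                      flash_gb_per_s: float) -> float:
    """Time to stream the active weights from flash into DRAM."""
    bytes_to_move = active_params * bits_per_weight / 8
    return bytes_to_move / (flash_gb_per_s * 1e9)


ASSUMED_BITS_PER_WEIGHT = 4    # assumption: quantization level is not given in the source
ASSUMED_FLASH_GB_PER_S = 2.0   # assumption: hypothetical sustained NAND read throughput

light = load_time_seconds(1e9, ASSUMED_BITS_PER_WEIGHT, ASSUMED_FLASH_GB_PER_S)
heavy = load_time_seconds(4e9, ASSUMED_BITS_PER_WEIGHT, ASSUMED_FLASH_GB_PER_S)

print(f"1B active parameters: ~{light:.2f} s to load")
print(f"4B active parameters: ~{heavy:.2f} s to load")
```

The estimate ignores anything already resident in DRAM and any overlap between loading and computation, so it is closer to an upper bound under these assumptions than to a prediction of real latency.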
For enterprise architects, the boundary between local and cloud processing is becoming a high-stakes decision. How should organizations navigate the dependency on Google Cloud and the lack of transparency regarding when a task is offloaded?
This is where the excitement of the technology meets the cold reality of corporate governance and compliance. While the AFM 3 Cloud Pro offers incredible agentic tool use and complex reasoning by running on Nvidia GPUs within Google Cloud, it creates a dependency that many “local-first” advocates might find jarring. The Private Cloud Compute boundary is designed to ensure data privacy, but as Marco Abis has pointed out, there is a notable gap in the documentation regarding the “trigger” for offloading. For a regulated industry, not knowing whether a specific inference ran on-device or was silently routed to a server is a direct compliance problem that can’t be ignored. Architects need to be very careful here; they are essentially choosing between a 20-billion-parameter local model that is still proving its production viability and a cloud-tier agent that, while powerful, lives behind a curtain that Apple hasn’t fully pulled back yet.
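One way an organization might reason about this gap is a fail-closed audit check in its own application layer. The sketch below is entirely hypothetical: `Venue`, `InferenceRecord`, and `is_compliant` are invented names, and nothing here corresponds to an Apple API, since the source notes that the offloading trigger is not documented.

```python
from dataclasses import dataclass
from enum import Enum


class Venue(Enum):
    """Where an inference ran, as far as the application can tell (hypothetical)."""
    ON_DEVICE = "on_device"
    CLOUD = "cloud"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class InferenceRecord:
    request_id: str
    venue: Venue
    data_class: str  # hypothetical labels, e.g. "public" or "regulated"


def is_compliant(record: InferenceRecord, cloud_allowed: frozenset) -> bool:
    """Fail closed: an unknown venue is treated as a violation for every data class."""
    if record.venue is Venue.ON_DEVICE:
        return True
    if record.venue is Venue.CLOUD:
        return record.data_class in cloud_allowed
    return False  # Venue.UNKNOWN


if __name__ == "__main__":
    policy = frozenset({"public"})
    for rec in (
        InferenceRecord("r1", Venue.ON_DEVICE, "regulated"),
        InferenceRecord("r2", Venue.CLOUD, "regulated"),
        InferenceRecord("r3", Venue.UNKNOWN, "public"),
    ):
        print(rec.request_id, rec.venue.value, is_compliant(rec, policy))
```

The design choice worth noting is the final branch: if the platform cannot report where a request ran, the check treats that as a failure rather than assuming the request stayed local.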
What is your forecast for the future of hybrid AI ecosystems now that the DRAM constraint has been significantly weakened?
I expect we are entering an era where the “device” is no longer a standalone processor but a dynamic gateway that manages a spectrum of intelligence. Within the next year, the success of the AFM 3 architecture will likely trigger a race among hardware manufacturers to optimize NAND-to-DRAM throughput specifically for these “load-on-demand” model structures. We will see the 20-billion-parameter mark become the new baseline for “high-end” local AI, but the real innovation will be in the transparency tools: developers will demand granular control over that cloud-routing boundary to satisfy legal and ethical requirements. Ultimately, the “DRAM wall” is crumbling, and as it falls, the distinction between what your phone knows and what the cloud knows will become almost entirely invisible to the end user, though it will remain the primary obsession for the engineers building behind the scenes.
