The rapid evolution of large language models has created a significant challenge for the mobile industry, as running these sophisticated AI systems directly on devices like smartphones and laptops without constant reliance on data centers has remained a complex and elusive goal. A groundbreaking collaboration between Google and MediaTek has produced the LiteRT NeuroPilot Accelerator, a pivotal technology designed to make on-device generative AI a practical reality. This innovative stack directly integrates the LiteRT runtime with MediaTek’s NeuroPilot Neural Processing Unit (NPU), providing developers with a single, unified API surface. The result is a streamlined development process that eliminates the need for per-chip custom code, effectively turning MediaTek’s Dimensity NPUs into first-class targets for deploying powerful LLMs and embedding models directly on consumer hardware. This move signals a major shift towards more private, responsive, and efficient AI experiences that operate independently of the cloud.
1. The Core Technology Explained
LiteRT stands as the modern successor to TensorFlow Lite, engineered as a high-performance runtime that operates directly on the device. It is optimized to execute models in the standardized .tflite FlatBuffer format and is capable of targeting a range of processing units—including the CPU, GPU, and now, NPUs—through a sophisticated and unified hardware acceleration layer. This architecture allows developers to build an application once and deploy it across diverse hardware configurations with minimal changes, ensuring that the model can leverage the most efficient processor available for a given task. By providing this abstraction, LiteRT simplifies the complex process of hardware acceleration, allowing developers to focus on model optimization and application logic rather than the intricate details of low-level hardware interaction. The runtime is designed for efficiency and speed, which is crucial for the demanding workloads presented by generative AI models running in resource-constrained mobile environments.
The LiteRT NeuroPilot Accelerator introduces a new, dedicated NPU path specifically for MediaTek hardware, marking a significant advancement over the previous TFLite NeuroPilot delegate. Instead of treating the NPU as a simple delegate, LiteRT now employs a direct integration with the NeuroPilot compiler and runtime. This deeper connection is managed through a comprehensive Compiled Model API that natively understands both ahead-of-time (AOT) and on-device compilation methods. Both compilation strategies are exposed through consistent C++ and Kotlin APIs, granting developers flexibility in their deployment approach. On the hardware front, this integration currently targets a broad spectrum of MediaTek’s System-on-Chips (SoCs), including the Dimensity 7300, 8300, 9000, 9200, 9300, and 9400 series. This wide range of support ensures that the technology is accessible across a large segment of the Android device market, from mid-range smartphones to flagship models, democratizing access to high-performance on-device AI.
2. Streamlining Development for a Fragmented Ecosystem
Historically, the landscape of on-device machine learning has been dominated by CPU- and GPU-first development stacks, which often treated specialized NPUs as an afterthought. NPU Software Development Kits (SDKs) were typically shipped as vendor-specific toolchains, forcing developers into separate and often convoluted compilation flows for each target SoC. This fragmentation required the creation of custom delegates and manual runtime packaging, leading to a combinatorial explosion of binaries that were difficult to manage and test. Developers faced the daunting task of debugging device-specific issues across a wide array of hardware, a process that consumed significant time and resources while hindering the widespread adoption of NPU acceleration. This ecosystem complexity created a major barrier to entry for developers who wanted to leverage the power and efficiency of dedicated AI hardware in their applications.
The LiteRT NeuroPilot Accelerator directly addresses these challenges by replacing the fragmented workflow with a standardized, three-step process that remains consistent regardless of the underlying MediaTek NPU. First, a developer converts or loads a .tflite model as usual. Second, they can optionally use the LiteRT Python tools to perform AOT compilation, which generates an AI Pack specifically optimized for one or more target SoCs. Finally, this AI Pack is shipped through Play for On-device AI (PODAI), and the application simply selects Accelerator.NPU at runtime. LiteRT transparently handles device targeting and runtime loading, and falls back to the GPU or CPU if the NPU is unavailable. This new paradigm shifts device-targeting logic from messy application code into a structured configuration file and the streamlined Play delivery system. For LLMs, AOT compilation is the recommended approach, as on-device compilation of a model like Gemma-3-270M can exceed one minute, making AOT the only realistic choice for a smooth user experience in production.
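To ground the runtime half of this flow, the minimal C++ sketch below shows the third step: the application requests the NPU and lets LiteRT handle targeting and fallback. The header paths, the LITERT_ASSIGN_OR_RETURN helper, and the kLiteRtHwAcceleratorNpu constant are assumptions about the SDK's exact spelling rather than guarantees.

```cpp
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_macros.h"
#include "litert/cc/litert_model.h"

// Step three of the flow: the application simply asks for the NPU.
// LiteRT resolves the SoC-specific artifacts delivered in the AI Pack
// and, as described above, falls back to the GPU or CPU when no
// compatible NPU is available.
litert::Expected<litert::CompiledModel> CompileForNpu(
    litert::Environment& env, litert::Model& model) {
  LITERT_ASSIGN_OR_RETURN(
      auto compiled,
      litert::CompiledModel::Create(env, model, kLiteRtHwAcceleratorNpu));
  return compiled;
}
```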
3. Supported Models and On-Device Performance
A key aspect of the LiteRT NeuroPilot stack is its focus on open-weight models rather than being locked into a single proprietary ecosystem. This approach provides developers with greater flexibility and access to a diverse range of state-of-the-art models. Google and MediaTek have announced explicit, production-oriented support for several popular models, each tailored for specific use cases. These include Qwen3 0.6B, a model designed for text generation in markets such as mainland China, and Gemma-3-270M, a compact base model that is ideal for fine-tuning on tasks like sentiment analysis and entity extraction. Also supported are Gemma-3-1B, a multilingual text-only model suited for summarization and general reasoning tasks, and Gemma-3n E2B, a powerful multimodal model capable of handling text, audio, and vision for applications like real-time translation and visual question answering. Finally, EmbeddingGemma 300M is included as a text embedding model designed for retrieval-augmented generation (RAG), semantic search, and classification workloads.
The performance gains achieved by leveraging the NPU are substantial. On a device equipped with the latest Dimensity 9500 SoC, such as a Vivo X300 Pro, the Gemma-3n-E2B model demonstrates impressive speed, reaching over 1600 tokens per second in the prefill stage and 28 tokens per second during decode with a 4K context length. These figures represent a massive leap in on-device AI capability: measured LLM throughput is up to 12 times faster than on the CPU and 10 times faster than on the GPU. To manage these different workloads, the software stack provides tailored solutions. For text generation use cases, LiteRT-LM sits atop LiteRT, exposing a stateful engine with a simple text-in, text-out API. For embedding tasks, models like EmbeddingGemma utilize the lower-level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, with the NPU selected through the same straightforward hardware accelerator options, ensuring both power and ease of use.
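As a rough illustration of that tensor-in, tensor-out path, the sketch below pushes token ids through an already compiled EmbeddingGemma model via the CompiledModel API. The single-input/single-output signature, the int32 token-id type, the embedding-dimension parameter, and the LITERT_* helper macros are assumptions made for illustration, not guarantees from the SDK.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include "absl/types/span.h"
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_macros.h"
#include "litert/cc/litert_tensor_buffer.h"

// Tensor-in, tensor-out embedding inference with an already compiled
// EmbeddingGemma model: token ids go in, a fixed-length float vector
// comes out. The NPU was requested when the model was compiled, through
// the same accelerator option used for the CPU and GPU.
litert::Expected<std::vector<float>> EmbedTokens(
    litert::CompiledModel& embedder, absl::Span<const int32_t> token_ids,
    size_t embedding_dim) {
  LITERT_ASSIGN_OR_RETURN(auto inputs, embedder.CreateInputBuffers());
  LITERT_ASSIGN_OR_RETURN(auto outputs, embedder.CreateOutputBuffers());

  // Write the token ids into the model's (assumed single) input tensor.
  LITERT_RETURN_IF_ERROR(inputs[0].Write<int32_t>(token_ids));

  // Run inference; the accelerator choice was fixed at compile time.
  LITERT_RETURN_IF_ERROR(embedder.Run(inputs, outputs));

  // Read the embedding vector out of the (assumed single) output tensor.
  std::vector<float> embedding(embedding_dim);
  LITERT_RETURN_IF_ERROR(outputs[0].Read<float>(absl::MakeSpan(embedding)));
  return embedding;
}
```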
4. Enhancing the Developer Experience
LiteRT introduces a modernized C++ API that moves away from older C entry points, providing a more intuitive and object-oriented development experience. This new API is designed around explicit objects such as Environment, Model, CompiledModel, and TensorBuffer, which makes the code more readable and manageable, and less prone to errors. For developers working with MediaTek NPUs, this API integrates tightly with Android’s native graphics and hardware buffer systems, including AHardwareBuffer and standard GPU buffers. A critical feature of this integration is the ability to construct input TensorBuffer instances directly from OpenGL or OpenCL buffers using functions like TensorBuffer::CreateFromGlBuffer. This capability allows image processing code to feed inputs to the NPU without an intermediate and costly copy through CPU memory, which is a common performance bottleneck in traditional pipelines.
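A hedged sketch of that zero-copy path follows. The parameter list shown for TensorBuffer::CreateFromGlBuffer (environment, tensor type, GL target, buffer id, byte size, offset), along with the RankedTensorType name and the header paths, is an assumption to be checked against the actual LiteRT headers.

```cpp
#include <cstddef>

#include <GLES3/gl31.h>

#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_macros.h"
#include "litert/cc/litert_model.h"
#include "litert/cc/litert_tensor_buffer.h"

// Wrap an existing OpenGL buffer (for example, one filled by a camera
// preprocessing shader) as a LiteRT input tensor, avoiding a copy
// through CPU memory on the way to the NPU.
// NOTE: the argument list used here is an assumption; consult the
// LiteRT headers for the exact CreateFromGlBuffer signature.
litert::Expected<litert::TensorBuffer> WrapGlBuffer(
    litert::Environment& env, const litert::RankedTensorType& input_type,
    GLuint gl_buffer_id, size_t size_bytes) {
  LITERT_ASSIGN_OR_RETURN(
      auto tensor_buffer,
      litert::TensorBuffer::CreateFromGlBuffer(
          env, input_type, GL_SHADER_STORAGE_BUFFER, gl_buffer_id,
          size_bytes, /*offset=*/0));
  return tensor_buffer;
}
```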
This tight integration enables zero-copy buffer management, a feature that is especially important for real-time applications involving camera and video processing. In such scenarios, copying each frame of data through CPU memory can quickly saturate memory bandwidth, leading to dropped frames and a poor user experience. By allowing direct data transfer between the GPU and NPU, LiteRT eliminates this bottleneck, preserving bandwidth for other critical system operations. A typical high-level C++ workflow on the device involves loading a model, creating options to specify the NPU as the hardware accelerator, generating a compiled model instance, and then allocating input and output buffers to run inference. Significantly, this same Compiled Model API is used whether the target is the CPU, GPU, or the MediaTek NPU. This consistency drastically reduces the amount of conditional logic required in application code, allowing developers to maintain a single, clean code path for multiple hardware targets.
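A compact version of that workflow might look like the following, using the same LITERT_* helpers as in the earlier sketches; the model file name, the output-size parameter, and the accelerator constant are placeholders rather than exact API guarantees.

```cpp
#include <cstddef>
#include <vector>

#include "absl/types/span.h"
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_macros.h"
#include "litert/cc/litert_model.h"

litert::Expected<std::vector<float>> RunOnNpu(absl::Span<const float> input,
                                              size_t output_size) {
  // 1. Set up the runtime environment and load the .tflite model
  //    (the file name here is a placeholder).
  LITERT_ASSIGN_OR_RETURN(auto env, litert::Environment::Create({}));
  LITERT_ASSIGN_OR_RETURN(auto model,
                          litert::Model::CreateFromFile("model.tflite"));

  // 2. Compile for the NPU. Swapping this constant for its CPU or GPU
  //    counterpart is the only change needed to retarget the model.
  LITERT_ASSIGN_OR_RETURN(
      auto compiled,
      litert::CompiledModel::Create(env, model, kLiteRtHwAcceleratorNpu));

  // 3. Allocate input and output tensor buffers from the model signature.
  LITERT_ASSIGN_OR_RETURN(auto inputs, compiled.CreateInputBuffers());
  LITERT_ASSIGN_OR_RETURN(auto outputs, compiled.CreateOutputBuffers());

  // 4. Write the input, run inference, and read the result back.
  LITERT_RETURN_IF_ERROR(inputs[0].Write<float>(input));
  LITERT_RETURN_IF_ERROR(compiled.Run(inputs, outputs));

  std::vector<float> result(output_size);
  LITERT_RETURN_IF_ERROR(outputs[0].Read<float>(absl::MakeSpan(result)));
  return result;
}
```

Because steps three and four are identical for every accelerator, retargeting this code to the GPU or CPU only touches step two.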
5. A Unified Path Forward
The introduction of the LiteRT NeuroPilot Accelerator represents a significant turning point for the on-device AI landscape. The collaboration tackles the long-standing hardware fragmentation that has hindered widespread adoption of NPU acceleration. The unified Compiled Model API, coupled with support for both AOT and on-device compilation, gives developers a standardized, efficient workflow and lets them deploy sophisticated LLMs at scale without writing custom, device-specific code for each SoC, a task that has long been a major development bottleneck. The integration does more than boost raw performance; it changes the development paradigm by abstracting away the underlying hardware complexity, so developers can focus on building innovative AI features while the runtime handles the intricate details of hardware optimization. With this initiative, Google and MediaTek set a precedent for how deep hardware and software co-design can unlock the potential of edge AI, making it possible to deliver powerful, private, and highly responsive AI experiences directly into the hands of consumers worldwide.
