How Can You Harness MediaTek’s On-Device AI?

The proliferation of on-device artificial intelligence has fundamentally shifted user expectations, with sophisticated generative AI models no longer confined to the cloud but running directly on smartphones, laptops, and smart home devices. Central to this revolution is the Neural Processing Unit (NPU), a specialized processor designed to deliver tens of Tera Operations Per Second (TOPS) with remarkable power efficiency, making it the critical engine for the next generation of intelligent edge computing. However, this powerful hardware has presented significant challenges for developers. The fragmented landscape of System-on-a-Chip (SoC) variants, each with unique requirements, has created a complex web of compilers and runtimes to manage. Furthermore, existing machine learning infrastructure, primarily built for CPUs and GPUs, often lacks the deep integration necessary to fully exploit the capabilities of specialized NPUs, leading to cumbersome and ad-hoc deployment workflows that hinder broad adoption and innovation.

1. A Unified Solution for NPU Acceleration

To address these complexities, a new solution has emerged: the LiteRT NeuroPilot Accelerator, a ground-up successor to the TFLite NeuroPilot delegate designed to streamline AI deployment on MediaTek NPUs. This accelerator moves beyond basic operator delegation to provide a cohesive and powerful development environment. It offers a unified API that abstracts away the underlying hardware complexities, allowing developers to target a wide range of MediaTek NPUs without needing to manage disparate SDKs. This unified workflow is complemented by a choice of compilation strategies—offline (Ahead-of-Time, or AOT) and online (on-device)—giving developers the flexibility to optimize for either minimal first-run latency or platform-agnostic model distribution. This approach is engineered to democratize access to high-performance NPU hardware, enabling a broader range of applications to leverage sophisticated AI capabilities.

The LiteRT NeuroPilot Accelerator also brings state-of-the-art support for Large Language Models (LLMs) and generative AI, unlocking the full potential of advanced open-weight models like the Gemma family directly on the NPU. This support enables the development and deployment of complex features, from advanced text generation to novel multimodal applications, that run efficiently on edge devices. For developers creating real-time applications, such as those involving camera and video streams, the accelerator introduces a new, simplified C++ API and Native Hardware Buffer Interoperability. The latter is critical for building high-throughput ML pipelines: it allows zero-copy data passing from an AHardwareBuffer to the NPU and provides automatic conversion from OpenGL/OpenCL buffers. By eliminating the need to shuttle data through the CPU, this significantly reduces latency and improves overall performance, making real-time, on-device AI practical for millions of users worldwide.

2. A Simplified Three-Step Deployment Process

Deploying models with NPU acceleration has been simplified into a straightforward, three-step workflow that removes the traditional barriers of hardware fragmentation. The first step, optional but highly recommended, is Ahead-of-Time (AOT) compilation. Using the LiteRT Python library, developers can compile their .tflite models for specific target SoCs before the application is distributed. This offline process is particularly beneficial for large, complex models, as it significantly reduces the on-device initialization time and memory footprint when the user first launches the application. By pre-compiling the model, the computational heavy lifting is done beforehand, ensuring a smoother and faster user experience. While this step is not mandatory, since models can instead be compiled on-device, its advantages in production environments for larger models are substantial, making it a best practice for performance-critical applications.

The second step involves packaging and distributing the AI assets for Android applications. LiteRT facilitates the export of model assets and the necessary runtime libraries into a format known as an “AI Pack.” This pack is then integrated into the Android app project. When a user downloads the application from Google Play, the Play for On-device AI (PODAI) service analyzes the user’s device hardware. It then automatically delivers the correctly compiled model and runtime tailored for that specific device, ensuring compatibility and optimal performance without developer intervention. The final step is running inference using the LiteRT Runtime. This is where the abstraction of hardware complexity becomes most apparent. Developers simply load the model and specify Accelerator.NPU in the options. LiteRT handles the rest, automatically directing the workload to the NPU. The system also includes a robust fallback mechanism; developers can specify the GPU or CPU as secondary options, and if the NPU is unavailable for any reason, LiteRT will seamlessly switch to the next best accelerator.
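
As a rough illustration of that final step, the sketch below loads a .tflite model with the LiteRT C++ CompiledModel API and requests NPU execution, mirroring the Accelerator.NPU option described above. It is a minimal sketch rather than official sample code: the class names and the kLiteRtHwAcceleratorNpu flag follow the published LiteRT C++ headers but may differ between releases, the model path and output size are hypothetical, and error handling beyond the Expected-returning macros is omitted.

```cpp
#include <vector>

#include "absl/types/span.h"
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_expected.h"
#include "litert/cc/litert_macros.h"
#include "litert/cc/litert_model.h"
#include "litert/cc/litert_tensor_buffer.h"

using ::litert::CompiledModel;
using ::litert::Environment;
using ::litert::Expected;
using ::litert::Model;

// Runs one "tensor-in, tensor-out" inference on the NPU. The 768-float output
// size is illustrative (e.g. a text-embedding model); use your model's shape.
Expected<std::vector<float>> RunOnNpu(const std::vector<float>& input) {
  // Load the portable .tflite model and create the runtime environment.
  LITERT_ASSIGN_OR_RETURN(auto model,
                          Model::CreateFromFile("embedding_model.tflite"));
  LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));

  // Ask LiteRT to target the NPU; the runtime can also be configured with
  // GPU/CPU fallbacks, as described in the article.
  LITERT_ASSIGN_OR_RETURN(
      auto compiled_model,
      CompiledModel::Create(env, model, kLiteRtHwAcceleratorNpu));

  // Pre-allocate tensor buffers, write the input, run, and read the output.
  LITERT_ASSIGN_OR_RETURN(auto inputs, compiled_model.CreateInputBuffers());
  LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());
  inputs[0].Write<float>(absl::MakeConstSpan(input));
  compiled_model.Run(inputs, outputs);

  std::vector<float> result(768);
  outputs[0].Read<float>(absl::MakeSpan(result));
  return result;
}
```

Whether the compilation to NPU code happens here at load time or was done ahead of time depends on the strategy chosen in the next section; the inference call itself looks the same either way.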

3. Choosing the Right Compilation Strategy

The direct, native integration with the NeuroPilot compiler and runtime has unlocked a powerful Ahead-of-Time (AOT) compilation workflow, giving developers unprecedented flexibility in their deployment strategy. The offline AOT compilation method is best suited for scenarios involving large, complex models where the target SoC is known in advance. By compiling the model ahead of time, developers can dramatically reduce the initialization costs and lower the memory usage when the application is launched on a user’s device. This pre-processing step shifts the computational burden from the user’s device to the developer’s environment, resulting in a significantly faster and more responsive initial experience. Even for a comparatively small model like Gemma 3 270M, on-device compilation can take over a minute, which is often an unacceptable delay in a production application. In contrast, AOT compilation eliminates this first-run latency, making it the more practical and professional choice for deploying sophisticated AI features.

On the other hand, the online (on-device) compilation strategy is ideal for platform-agnostic distribution, particularly for smaller models. With this approach, the model is compiled on the user’s device during the application’s initialization phase. This method eliminates the need for developers to pre-compile for every possible hardware configuration, simplifying the distribution process and allowing a single model package to work across a wide range of devices. However, this flexibility comes at the cost of a higher first-run computational expense. While this initial delay may be negligible for smaller models, it becomes a significant factor for larger ones. Therefore, the choice between AOT and on-device compilation depends on a trade-off between deployment simplicity and first-run performance. Developers must consider the model’s size, the target audience’s hardware diversity, and the user experience requirements to select the most appropriate strategy for their application.

4. Unleashing Generative AI Capabilities

For developers looking to integrate deeply customized AI features or operate in markets where certain cloud-based solutions are not available, the platform now unlocks the full potential of open-weight models. This includes Google’s Gemma model family, a collection of lightweight, state-of-the-art open models optimized specifically for on-device use cases. Through a collaboration announced at MediaTek’s recent Dimensity 9500 event, optimized, production-ready support has been introduced for several key models on the latest chipsets. Among them are Qwen3 0.6B, a foundational model powering new AI experiences in Mainland China; Gemma 3 270M, a hyper-efficient base model for high-speed, low-latency tasks like sentiment analysis; Gemma 3 1B, a lightweight text model for on-device reasoning and content creation; and Gemma 3n E2B, a powerful multimodal model that natively understands audio, vision, and text for real-time applications. Also supported is EmbeddingGemma 300M, a text embedding model ideal for Retrieval Augmented Generation (RAG) and semantic search.

Leveraging specialized optimizations that target the MediaTek NPU, these Gemma models achieve acceleration of up to 12 times compared to CPU execution and 10 times compared to GPU execution. This dramatic performance boost enables impressively fast inference speeds, as demonstrated by benchmarks on the latest MediaTek Dimensity 9500. For instance, the Gemma 3n E2B model achieves over 1600 tokens per second for prefill and 28 tokens per second for decode with a 4K context on the NPU. In practical terms, those figures mean the NPU can ingest a full 4K-token prompt in roughly two and a half seconds and then stream out each generated token in about 35 milliseconds. Such speeds are essential for enabling sophisticated and responsive multimodal use cases, such as a real-time on-device assistant that can recognize objects in the camera view, identify plants and provide care tips, or generate a travel itinerary based on a user’s verbal request. This level of performance transforms what is possible on an edge device, moving beyond simple tasks to rich, interactive AI experiences that were previously exclusive to cloud-based systems.

5. Efficient Development and Integration

To facilitate the creation of rich, real-time applications across a variety of platforms and devices, significant improvements have been made to the developer experience and data pipeline efficiency. This begins with a new, simplified C++ API that supersedes the previous C API, making it easier to build high-performance, cross-platform ML applications. The modern API is designed to work directly with native hardware buffers through Native Hardware Buffer Interoperability, which enables two critical efficiencies for real-time applications. First, it allows zero-copy data passing with AHardwareBuffer, meaning data can be shared between different hardware components without being copied to and from main system memory. Second, it provides zero-copy interoperability between OpenGL/OpenCL buffers and AHardwareBuffer, which is crucial for applications that use the GPU for image processing.
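
To make that zero-copy path concrete, here is a minimal sketch of wrapping an existing AHardwareBuffer, such as one backed by a camera frame, as a LiteRT tensor buffer the NPU can consume directly. It assumes the TensorBuffer::CreateFromAhwb factory and the RankedTensorType helpers from the LiteRT C++ headers; the exact constructor shapes may differ by release, and the 1x224x224x3 float input type is hypothetical.

```cpp
#include <android/hardware_buffer.h>

#include "litert/cc/litert_expected.h"
#include "litert/cc/litert_model.h"
#include "litert/cc/litert_tensor_buffer.h"

using ::litert::ElementType;
using ::litert::Expected;
using ::litert::Layout;
using ::litert::RankedTensorType;
using ::litert::TensorBuffer;

// Wraps a camera-produced AHardwareBuffer as a model input without copying it
// through the CPU. The tensor type is constructed inline here for brevity; in
// practice it should match (or be queried from) the model's input signature.
Expected<TensorBuffer> WrapCameraFrame(AHardwareBuffer* frame) {
  RankedTensorType input_type(ElementType::Float32,
                              Layout({1, 224, 224, 3}));
  // Zero-copy: the resulting TensorBuffer aliases the AHardwareBuffer's
  // memory, so the NPU reads the frame in place.
  return TensorBuffer::CreateFromAhwb(input_type, frame, /*ahwb_offset=*/0);
}
```

The returned TensorBuffer can then be passed to the compiled model in place of a CPU-allocated input buffer, which is precisely the data flow described in the next paragraph.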

Instead of inefficiently converting input and output data through the CPU, developers can now pass camera frames or video streams directly from other ML pipeline components, such as a GPU-based pre-processing stage, to the NPU via LiteRT. For text generation tasks using models like Gemma 3 270M, developers can utilize LiteRT-LM, a high-level library that provides a stateful “text-in, text-out” API, simplifying the entire inference process. For models like EmbeddingGemma, which operate on tensors, the standard LiteRT “tensor-in, tensor-out” API is a perfect fit. This streamlined data flow is essential for building the high-throughput, low-latency camera and video applications that are a key goal of this release. By minimizing data movement and leveraging hardware-level integrations, developers can build more responsive and power-efficient AI-driven features, significantly enhancing the end-user experience on millions of MediaTek-powered devices.
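
For the LLM path, a sketch of the stateful “text-in, text-out” flow with LiteRT-LM might look like the following. The Engine, Session, and GenerateContent names follow the LiteRT-LM project’s C++ example, but the header paths, exact signatures, NPU backend selection, and model path shown here are assumptions rather than verified API.

```cpp
#include <iostream>

// Assumed LiteRT-LM headers; actual paths depend on how the project is vendored.
#include "runtime/engine/engine.h"
#include "runtime/engine/engine_settings.h"

using ::litert::lm::Engine;
using ::litert::lm::EngineSettings;
using ::litert::lm::InputText;
using ::litert::lm::ModelAssets;
using ::litert::lm::SessionConfig;

int main() {
  // Point the engine at an NPU-ready Gemma 3 270M package (path is hypothetical).
  auto assets = ModelAssets::Create("/data/local/tmp/gemma3_270m.litertlm");
  auto settings =
      EngineSettings::CreateDefault(*assets, litert::lm::Backend::NPU);

  // The engine owns the model; each session keeps its own conversation state.
  auto engine = Engine::CreateEngine(*settings);
  auto session = (*engine)->CreateSession(SessionConfig::CreateDefault());

  // Text in, text out: tokenization, KV-cache management, and decoding are
  // handled by the library.
  auto responses = (*session)->GenerateContent(
      {InputText("Summarize today's notes in two sentences.")});
  std::cout << *responses << std::endl;
  return 0;
}
```

For EmbeddingGemma and other tensor-to-tensor models, the CompiledModel sketch shown earlier in the deployment section is the more natural fit, matching the standard “tensor-in, tensor-out” API the article describes.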

A New Era of Accessible On-Device AI

The introduction of the LiteRT NeuroPilot Accelerator marks a significant step forward in making NPU-accelerated machine learning accessible to developers targeting MediaTek devices. By simplifying complex deployment workflows and unlocking advanced generative AI capabilities directly on edge devices, the initiative improves the user experience for a massive global audience. Through a combination of unified APIs, flexible compilation options, and deep hardware integration, the barriers to entry for creating high-performance, real-time AI applications are substantially lowered. This advancement gives developers the tools and documentation needed to harness the full power of on-device NPUs, fostering a new wave of innovation in mobile and IoT applications. The availability of optimized open-weight models further empowers developers to build customized and sophisticated AI features that were previously impractical to deploy outside the cloud.
