On-Device Language Models Boosted by New Hardware Solution


Laurent Giraid, an artificial intelligence expert focused on machine learning and the ethics of AI, joins us today to shed light on breakthrough developments in hardware acceleration for transformer models. In this interview, he walks through a new hardware solution designed to make on-device execution of large language models both feasible and efficient, and explains the innovations and technical details behind the advance.

Can you start by explaining what the Scalable Transformer Accelerator Unit (STAU) is and what problem it aims to solve?

The Scalable Transformer Accelerator Unit, or STAU, is an innovative hardware solution designed to tackle the challenges posed by the size and complexity of large language models like BERT and GPT. Traditionally, these models demand extensive computational resources and often require cloud-based infrastructure. The STAU overcomes these limitations by executing these models directly on embedded systems, enabling real-time, on-device AI without the need for powerful servers.

How does the Variable Systolic Array (VSA) architecture work, and why is it particularly suited for handling transformer models?

The Variable Systolic Array architecture is at the core of the STAU, streamlining matrix operations vital to transformer models. It dynamically adapts to varying input sizes and model structures by feeding input data in rows and loading model weights in parallel. This approach minimizes memory stalls and enhances throughput, crucial for efficiently handling the diverse sequence lengths and complex token structures in transformer tasks.
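
As a rough illustration of that dataflow (not the STAU's actual implementation), the sketch below models a weight-stationary systolic multiply in C: weights are preloaded into the processing-element grid, and input rows then stream through, accumulating partial sums as they pass each column. The tile size and function names are assumptions made for this example.

```c
#include <stdio.h>

#define TILE 4  /* assumed processing-element grid: TILE x TILE */

/* Software model of a weight-stationary systolic matrix multiply.
   Weights stay resident in the PE grid; each input row streams through,
   so no weight refetch occurs between rows (the source of the reduced
   memory stalls described above). */
void vsa_matmul(int n_rows, const float x[][TILE],
                const float w[TILE][TILE], float y[][TILE]) {
    for (int r = 0; r < n_rows; r++) {        /* stream one input row per step */
        for (int c = 0; c < TILE; c++) {
            float acc = 0.0f;
            for (int k = 0; k < TILE; k++)
                acc += x[r][k] * w[k][c];     /* MAC performed by PE(k, c) */
            y[r][c] = acc;
        }
    }
}

int main(void) {
    float x[2][TILE] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    float w[TILE][TILE] = {{1, 0, 0, 0}, {0, 1, 0, 0},
                           {0, 0, 1, 0}, {0, 0, 0, 1}};
    float y[2][TILE];
    vsa_matmul(2, x, w, y);                   /* identity weights: y == x */
    printf("%g %g %g %g\n", y[1][0], y[1][1], y[1][2], y[1][3]);
    return 0;
}
```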

What are some of the key differences between the processing with and without hardware accelerators?

Without hardware accelerators, processing large language models is time-consuming and reliant on high-power servers, often resulting in delays and inefficiencies. With the STAU, these models can be processed much faster, directly on the device, with reduced computation time and improved efficiency. This not only speeds up operations but also lowers the reliance on external computational resources.

You mentioned a 3.45× speedup with the STAU over CPU-only execution. What factors contribute to this increased speed?

Several factors contribute to the STAU’s impressive speedup. Firstly, the VSA architecture optimizes the way matrix operations are handled, while parallel data processing reduces bottlenecks. Additionally, the tailored hardware innovations, such as the re-imagined softmax function and custom floating-point formatting, significantly streamline the computation process, enhancing overall speed.

How does the STAU maintain over 97% numerical accuracy while providing such performance improvements?

Maintaining high numerical accuracy was a central focus in STAU’s design. This was achieved through precision-aware optimization techniques that ensure transformations and computations occur with minimal error. The custom 16-bit floating-point format was specifically crafted for transformer workloads, balancing efficient computations with precision to consistently deliver high accuracy.
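
The interview does not specify which metric underlies the 97% figure; one common choice for comparing a reduced-precision output against a float32 reference is cosine similarity, sketched here purely for illustration.

```c
#include <math.h>

/* Illustrative metric only: cosine similarity between a full-precision
   reference output and the accelerator's reduced-precision output.
   A value near 1.0 means the low-precision result closely tracks
   the float32 reference. */
float cosine_similarity(const float *ref, const float *test, int n) {
    float dot = 0.0f, nr = 0.0f, nt = 0.0f;
    for (int i = 0; i < n; i++) {
        dot += ref[i] * test[i];
        nr  += ref[i] * ref[i];
        nt  += test[i] * test[i];
    }
    return dot / (sqrtf(nr) * sqrtf(nt));  /* caller ensures nonzero norms */
}
```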

Can you elaborate on the recent optimizations that led to achieving a speedup of up to 5.18×?

The recent optimizations build on the foundational VSA architecture, with further refinements targeting data handling and processing efficiency. Advances in the algorithmic implementation, particularly in handling longer sequences and in exploiting the array's parallelism, were pivotal in pushing the speedup to 5.18×.

The softmax function is known to be a bottleneck. How did you redesign it using a Radix-2 approach, and what benefits does this offer?

The softmax function traditionally involves complex exponentiation and normalization tasks. By adopting a Radix-2 approach, we simplified these operations through lightweight shift-and-add calculations, drastically reducing hardware complexity. This redesign not only alleviates the bottleneck but also maintains the quality of output while optimizing resource use.
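
A minimal sketch of the idea: rewrite e^x as 2^(x·log2 e), handle the integer part of the exponent as a pure binary shift, and approximate the fractional part f with the piecewise-linear term (1 + f), which hardware realizes with a shift and an add. The actual STAU datapath may use a finer correction than this.

```c
#include <math.h>

/* Approximate 2^x: the integer part k becomes a binary shift (ldexpf),
   and the fractional part f in [0, 1) is approximated by (1 + f),
   a single add in hardware. */
static float exp2_shift_add(float x) {
    int   k = (int)floorf(x);     /* integer exponent -> shift  */
    float f = x - (float)k;       /* fractional part in [0, 1)  */
    return ldexpf(1.0f + f, k);   /* (1 + f) * 2^k              */
}

/* Radix-2 softmax: subtract the max for numerical stability, then
   use base-2 exponentials instead of e^x. */
void softmax_radix2(const float *in, float *out, int n) {
    const float LOG2E = 1.4426950408889634f;  /* log2(e) */
    float max = in[0];
    for (int i = 1; i < n; i++)
        if (in[i] > max) max = in[i];

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = exp2_shift_add((in[i] - max) * LOG2E);
        sum += out[i];
    }
    for (int i = 0; i < n; i++)
        out[i] /= sum;
}
```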

Why did you choose a custom 16-bit floating-point format, and how does it eliminate the need for layer normalization?

The custom 16-bit floating-point format was tailored to the needs of transformer models, allowing efficient computation without compromising accuracy. The reduced bit width lowers the computational load, and the format's numerical behavior removes the need for layer normalization, which is typically resource-intensive, further simplifying the processing pipeline.
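
The exact bit layout of the STAU's format is not given in the interview; as a stand-in, the sketch below shows a 16-bit truncation in the same spirit (bfloat16-style: 1 sign, 8 exponent, 7 mantissa bits), which keeps float32's dynamic range while halving storage and datapath width.

```c
#include <stdint.h>
#include <string.h>

/* Illustration only: truncate an IEEE-754 float32 to its top 16 bits
   (1 sign, 8 exponent, 7 mantissa). This is NOT the published STAU
   format, just a representative 16-bit floating-point scheme. */
static uint16_t f32_to_custom16(float v) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);   /* type-pun via memcpy (defined behavior) */
    return (uint16_t)(bits >> 16);
}

static float custom16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float v;
    memcpy(&v, &bits, sizeof v);
    return v;
}
```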

Could you explain how implementing the STAU on a Xilinx FPGA with an embedded Arm Cortex-R5 processor benefits developers?

Deploying the STAU on a Xilinx FPGA provides an adaptable platform where developers can exploit the full potential of hardware-software co-design. The hybrid architecture lets them support a wide array of transformer models with simple software updates, avoiding repeated hardware changes, while the embedded Arm Cortex-R5 processor handles control of the accelerator, giving developers flexibility and ease in implementation and scaling.
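
The control pattern in such a co-design is typically simple: the Cortex-R5 configures the accelerator through memory-mapped registers, starts a job, and polls (or takes an interrupt) for completion. The register map below is entirely hypothetical, invented for this sketch; the real STAU interface is not described in the interview.

```c
#include <stdint.h>

/* Hypothetical register map: addresses and fields are assumptions
   for illustration, not the STAU's published interface. */
#define STAU_BASE    0x43C00000u          /* assumed AXI base address  */
#define REG_CTRL     (STAU_BASE + 0x00u)  /* bit 0: start              */
#define REG_STATUS   (STAU_BASE + 0x04u)  /* bit 0: done               */
#define REG_SRC_ADDR (STAU_BASE + 0x08u)  /* input buffer address      */
#define REG_DST_ADDR (STAU_BASE + 0x0Cu)  /* output buffer address     */
#define REG_LEN      (STAU_BASE + 0x10u)  /* sequence length in tokens */

static inline void reg_write(uint32_t addr, uint32_t val) {
    *(volatile uint32_t *)(uintptr_t)addr = val;
}
static inline uint32_t reg_read(uint32_t addr) {
    return *(volatile uint32_t *)(uintptr_t)addr;
}

/* Launch one accelerator job from the embedded CPU and busy-wait
   until the done bit is set. */
void stau_run(uint32_t src, uint32_t dst, uint32_t n_tokens) {
    reg_write(REG_SRC_ADDR, src);
    reg_write(REG_DST_ADDR, dst);
    reg_write(REG_LEN, n_tokens);
    reg_write(REG_CTRL, 1u);                   /* start */
    while ((reg_read(REG_STATUS) & 1u) == 0u)  /* poll done */
        ;
}
```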

What types of transformer models can the STAU support, and how are software updates managed in this system?

The STAU is designed to support the diverse range of transformer architectures used in today's large language models. Thanks to its flexible architecture, updates are managed in software on the embedded processor: by updating that software, developers can keep pace with evolving models and optimizations, maintaining compatibility and performance without touching the hardware.

How do you see the STAU making advanced language models more accessible for mobile devices and other platforms?

The STAU's architecture is a leap forward in democratizing AI capabilities across a range of devices. By enabling advanced language models to run directly on mobile devices, wearables, and edge platforms, it delivers real-time AI while maintaining low latency and privacy. This makes sophisticated AI, previously accessible only via the cloud, readily available and usable at the edge.

In what ways does the STAU ensure privacy and low-latency response in real-time AI applications?

The STAU enhances privacy by processing data locally on the device, minimizing data transfer between endpoints and servers. Real-time applications benefit from its low-latency responses because computation happens on-site, without network round-trip delays. The architecture is built to meet privacy requirements while delivering the quick, contextual responses that real-time use cases demand.

Could you share some insights on the potential impact of the STAU for voice assistant applications and other real-time applications?

The STAU has the potential to significantly elevate the performance and capabilities of voice assistant applications. With its ability to process language models efficiently on-device, users can experience faster response times and more accurate interactions. Its low-latency design also benefits numerous real-time applications, encouraging developers to create more seamless and responsive AI solutions that cater to a broad array of industries.

Looking ahead, what further optimizations or developments do you envision for the STAU?

As AI and embedded systems evolve, I foresee iterative enhancements in the STAU's processing efficiency and compatibility with ever-expanding model architectures. Future developments might include broader support for even more complex neural architectures and further reductions in power consumption, keeping these capabilities accessible across next-generation computing platforms. Continuing research will certainly bring new breakthroughs that refine the STAU's effectiveness and adaptability.
