ONNX and NPU Acceleration for Speech on ARM
Introduction
Automatic Speech Recognition (ASR) enables machines to process human speech and convert it to written text, and it is widely used in many areas today. However, slow inference remains one of the major challenges in ASR. This project investigates different ways of accelerating inference for the Whisper model, leveraging the Open Neural Network Exchange (ONNX) format together with different runtimes and specialised hardware accelerators such as NPUs. The project is part of the IXN (Industry Exchange Network) programme, a teaching framework that offers master's and undergraduate students opportunities to work with industry partners. It is supported by UCL CS, Microsoft, and Intel, and supervised by Professor Dean Mohamedally from the UCL CS department, Mr. Lee Jonathan Stott, and Mr. Chris Noring from Microsoft. Intel provided technical support in the form of an Intel PC equipped with an AI accelerator (NPU). The project ran for three months, from June 2024 to September 2024.
Project Overview
This project explores the benefits of ONNX and NPU accelerators for speeding up inference of Whisper models, and develops a local Whisper model that leverages these techniques on ARM-based systems. The objectives include: investigating the end-to-end speech model Whisper and its variants; evaluating the performance of different approaches to using the Whisper model; finding different ways of converting Whisper from PyTorch to the ONNX format; applying various optimisation techniques to the ONNX model; comparing the performance of PyTorch and ONNX models; utilising the NPU for inference and comparing its performance with other hardware devices; and compiling the ONNX model for ARM and testing its compatibility on ARM-based systems, so that the model can be used in IoT or embedded devices such as phones or chatbots.
Technical Details
The Whisper model was initially built and evaluated in PyTorch. To improve inference speed, the model was converted into ONNX format using tools like PyTorch’s torch.onnx.export, the Optimum library, and Microsoft’s Olive tool. These models were then optimized using graph optimizations and quantization techniques to reduce the model size and speed up inference.
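As an illustration, the snippet below is a minimal sketch of one conversion route, using the Optimum library to export a Whisper checkpoint to ONNX. The checkpoint name and output directory are placeholders, and the project also experimented with torch.onnx.export and Olive, whose exact invocations are not shown here.

```python
# Minimal sketch: exporting a Whisper checkpoint to ONNX with the Optimum library.
# The model id and output directory are illustrative placeholders.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_id = "openai/whisper-tiny"

# export=True converts the PyTorch weights to ONNX on the fly
onnx_model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
processor = WhisperProcessor.from_pretrained(model_id)

# Save the exported encoder/decoder ONNX graphs for later optimisation steps
onnx_model.save_pretrained("whisper-tiny-onnx")
processor.save_pretrained("whisper-tiny-onnx")
```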
To further boost performance, the team explored various inference engines such as ONNX Runtime and OpenVINO, which significantly accelerated the model’s execution. The project also examined the use of NPUs, which, although not delivering the expected performance boost, still offered insights into future improvements.
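The snippet below is a minimal sketch of how an execution provider can be selected in ONNX Runtime; the file path is illustrative, and the OpenVINO provider is only available in builds that ship it (for example the onnxruntime-openvino package).

```python
# Minimal sketch: choosing an execution provider in ONNX Runtime.
# Provider availability depends on the installed onnxruntime build and hardware.
import onnxruntime as ort

# Inspect which providers the installed build exposes
print(ort.get_available_providers())

# Load one of the exported graphs (path is a placeholder) and prefer OpenVINO,
# falling back to the default CPU provider if it is unavailable.
session = ort.InferenceSession(
    "whisper-tiny-onnx/encoder_model.onnx",
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)
```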
Visuals and Media
The project included various visualizations and performance charts comparing the inference speeds of the Whisper models with different levels of optimisation across hardware accelerators (CPU, GPU, NPU) and inference engines (ONNX Runtime, OpenVINO). These visual aids were key to understanding the improvements and trade-offs in model performance across platforms.
Figure 1: Accuracy and speed of different Whisper checkpoints
Figure 2: Accuracy and speed of different invocation methods for Whisper-tiny
Figure 3: Average Inference Time per Sample per Second Across Acceleration Methods and Datasets
Figure 4: Inference time using CPU, GPU, and NPU on the Song Clip dataset
Results and Outcomes
Figure 1
As the model size increases from tiny to large, the inference time increases gradually, with values of 0.98 s, 1.37 s, 5.15 s, 17.12 s, and 29.09 s respectively, while the word error rate (WER) decreases, with values of 6.78%, 4.39%, 3.49%, 2.43%, and 2.22% in the same order.
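For context, the sketch below shows one way such per-sample inference time and WER figures could be measured. It assumes the jiwer package for WER and leaves the transcription function and dataset as placeholders, so it is not the exact evaluation harness used in the project.

```python
# Minimal sketch of measuring average per-sample inference time and WER.
# Assumes the jiwer package; transcribe() and samples are placeholders.
import time
from jiwer import wer

def evaluate(transcribe, samples):
    """transcribe: callable mapping audio -> text; samples: list of (audio, reference) pairs."""
    times, hypotheses, references = [], [], []
    for audio, reference in samples:
        start = time.perf_counter()
        hypotheses.append(transcribe(audio))
        times.append(time.perf_counter() - start)
        references.append(reference)
    avg_time = sum(times) / len(times)      # average inference time per sample (seconds)
    error_rate = wer(references, hypotheses)  # word error rate over the whole set
    return avg_time, error_rate

# avg_time, error_rate = evaluate(my_transcribe_fn, test_samples)
```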
Figure 2
Among the four methods, the Hugging Face API endpoint had the longest average inference time per sample at 1.82 seconds, while the other three took less than 1 second: 0.91, 0.78, and 0.97 seconds for the Pipeline, the OpenAI Whisper library, and the Hugging Face Transformers library respectively. For accuracy, apart from the Hugging Face Transformers library, which achieved the lowest word error rate of 6.78%, the other three methods produced slightly higher error rates, all around 7%.
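For reference, the snippet below sketches two of these invocation methods, the Hugging Face pipeline and the OpenAI Whisper library; the audio file name is a placeholder and the calls shown are generic examples rather than the project's benchmarking code.

```python
# Minimal sketch of two invocation methods compared in Figure 2.
# "sample.wav" is a placeholder audio file.
from transformers import pipeline
import whisper  # OpenAI's whisper package

# Hugging Face pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
text_hf = asr("sample.wav")["text"]

# OpenAI Whisper library
model = whisper.load_model("tiny")
text_openai = model.transcribe("sample.wav")["text"]
```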
Figure 3
The project achieved up to a 5x increase in inference speed for the Whisper model when using the ONNX format optimised with OpenVINO, compared with its PyTorch counterpart. It also demonstrated the feasibility of deploying these optimised models on ARM-based systems, opening the door to their use in IoT devices and real-time speech applications on edge devices.
Figure 4
The plot shows the inference time of Whisper-tiny under different device configurations, tested on the Song Clip dataset. CPU, GPU, and the specialised hardware accelerator NPU were used in this part of the study. The results are sorted in descending order from top to bottom, from the configuration with the longest inference time to the configuration with the shortest. Inference times range from 1.371 s down to 0.182 s; the fastest configuration is the CPU alone, while the slowest is MULTI: GPU, NPU. Overall, the CPU outperforms the NPU, and the NPU outperforms the GPU. For certain configurations, using MULTI or HETERO mode gives faster inference than using the GPU or NPU alone: for example, HETERO: CPU, NPU, HETERO: CPU, GPU, and MULTI: GPU, NPU, CPU all have lower inference times than the NPU and the GPU. For the other three cases, the results are similar: either the CPU or MULTI: GPU, NPU, CPU has the shortest inference time, while MULTI: GPU, NPU has the longest.
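For illustration, the sketch below shows how such device configurations can be requested through the OpenVINO runtime; the model path is a placeholder, it assumes a recent OpenVINO release, and the available devices depend on the machine.

```python
# Minimal sketch: compiling an OpenVINO model for different device configurations.
# The model path is illustrative; GPU/NPU availability depends on the hardware and drivers.
import openvino as ov

core = ov.Core()
model = core.read_model("whisper-tiny-encoder.xml")

compiled_cpu    = core.compile_model(model, "CPU")
compiled_npu    = core.compile_model(model, "NPU")
compiled_multi  = core.compile_model(model, "MULTI:GPU,NPU")    # run across several devices
compiled_hetero = core.compile_model(model, "HETERO:CPU,NPU")   # split the graph across devices
```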
Lessons Learned
Several critical lessons emerged from this project:
ONNX format is a game changer: Converting models to ONNX significantly improves cross-platform compatibility and inference speed, making it easier to deploy on various devices.
NPUs need further exploration: While promising, NPUs didn’t deliver as much performance improvement as expected. However, this points toward the need for more refined optimization in future work.
Quantization requires careful management: Reducing the precision of model data helped speed up performance, but it also introduced a trade-off in accuracy. Future iterations will need to balance this more effectively.
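As an example of the kind of quantisation applied, the sketch below shows post-training dynamic quantisation with ONNX Runtime; the file names are placeholders and the exact quantisation settings used in the project may differ, so accuracy should be re-checked after quantising.

```python
# Minimal sketch: post-training dynamic quantisation with ONNX Runtime.
# File names are placeholders; re-evaluate WER after quantisation to check the accuracy trade-off.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="whisper-tiny-onnx/encoder_model.onnx",
    model_output="whisper-tiny-onnx/encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,  # quantise weights to 8-bit integers
)
```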
Collaboration and Teamwork
This project’s success was a testament to the power of collaboration. The combination of academic expertise from UCL and industry knowledge from Microsoft ensured that the project stayed on track and met its goals. Team roles were clearly defined, which allowed for a smooth and productive workflow.
Future Development
The project has laid a solid foundation for further research and development. Several areas are ripe for future exploration:
NPU-specific optimisation: The team plans to explore more advanced methods to harness the full potential of NPUs.
Deployment on real ARM devices: While the models ran successfully in a Docker container, the next step is to deploy them on actual ARM-based devices, such as smartphones or IoT gadgets.
Multilingual testing: Thus far, the project has focused primarily on English speech. Expanding the model to other languages will open up new possibilities for global applications.
Conclusion
This project successfully demonstrated how cutting-edge tools and optimisations can significantly improve the performance of ASR systems by leveraging ONNX, NPUs, and ARM-based platforms. The project's outcomes highlight the importance of optimising models for resource-constrained environments and provide a clear path forward for deploying advanced ASR models in the field.