ENERZAi is an Edge AI startup that recently developed a Spoken Language Understanding (SLU) pipeline using 1.58-bit quantization via Quantization-Aware Training (QAT). This paper walks through the key challenges we encountered and how we addressed them, offering practical insights for those pursuing sub-4-bit model deployment.
First, we show why Post-Training Quantization (PTQ) breaks down at extreme low bit widths and why QAT is the more viable path. Next, we explore the difficulty of replicating the training dataset used for the Whisper STT model (which is not publicly released), especially for non-English voice commands, and share our approach to dataset collection.
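To make the 1.58-bit QAT idea concrete, the sketch below shows a minimal ternary fake-quantization step (weights restricted to {-1, 0, +1} times a per-tensor scale) with a straight-through estimator in PyTorch. It is an illustrative example assuming BitNet-style absmean scaling, not our actual training code, and the function name `ternary_fake_quant` is our own.

```python
import torch

def ternary_fake_quant(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Fake-quantize weights to {-1, 0, +1} * scale during QAT.

    Forward pass uses the quantized values; backward pass passes gradients
    through unchanged (straight-through estimator), which is what lets
    training adapt to the extreme quantization where PTQ fails.
    """
    scale = w.abs().mean().clamp(min=eps)            # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary values, rescaled
    return w + (w_q - w).detach()                    # STE: forward w_q, backward identity
```

In a QAT setup, a layer would call this on its weight tensor inside `forward`, so the optimizer still updates full-precision shadow weights while the loss is computed against their ternarized counterparts.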
We also address accuracy issues such as Out-of-Vocabulary, Out-of-Domain, and Negation errors, and demonstrate how synthetic data generated by LLMs improved robustness. Furthermore, we highlight the lack of inference backends that support 1.58-bit models and share practical tips for writing your own efficient kernels and optimizing memory usage in embedded environments.
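As an illustration of the memory side of that problem, the sketch below packs ternary weights at 2 bits each (four weights per byte), the kind of compact layout a custom 1.58-bit kernel would read from in an embedded deployment. This is a simplified NumPy example under our own assumptions, not our production kernel, and the helper names are hypothetical.

```python
import numpy as np

def pack_ternary(w_q: np.ndarray) -> np.ndarray:
    """Pack ternary weights {-1, 0, +1} into 2 bits each (4 weights per byte)."""
    codes = (w_q + 1).astype(np.uint8)              # map {-1, 0, +1} -> {0, 1, 2}
    codes = np.pad(codes, (0, (-len(codes)) % 4))   # pad to a multiple of 4
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n ternary weights from the packed byte stream."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1)[:n].astype(np.int8) - 1
```

Relative to fp32 weights this layout is a 16x reduction (and 4x versus int8), which is roughly the headroom that makes on-device deployment of such models plausible.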
Finally, we conclude by emphasizing the importance of extreme low-bit inference and its applicability to broader edge use cases such as Retrieval-Augmented Generation (RAG).