ENERZAi is an Edge AI startup that recently developed a Spoken Language Understanding (SLU) pipeline using 1.58-bit quantization via Quantization-Aware Training (QAT). This paper walks through the key challenges we encountered and how we addressed them, offering practical insights for those pursuing sub-4-bit model deployment.
First, we show why Post-Training Quantization (PTQ) breaks down at extreme low bit widths and why QAT is the more viable path. Next, we explore the difficulty of replicating the training dataset used for the Whisper STT model (which is not publicly released), especially for non-English voice commands, and share our approach to dataset collection.
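To make the 1.58-bit QAT idea concrete, the sketch below shows a minimal ternary fake-quantization step (weights restricted to {-1, 0, +1} times a per-tensor scale) with a straight-through estimator in PyTorch. It is an illustrative example assuming BitNet-style absmean scaling, not our actual training code, and the function name `ternary_fake_quant` is our own.

```python
import torch

def ternary_fake_quant(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Fake-quantize weights to {-1, 0, +1} * scale during QAT.

    Forward pass uses the quantized values; backward pass passes gradients
    through unchanged (straight-through estimator), which is what lets
    training adapt to the extreme quantization where PTQ fails.
    """
    scale = w.abs().mean().clamp(min=eps)            # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary values, rescaled
    return w + (w_q - w).detach()                    # STE: forward w_q, backward identity
```

In a QAT setup, a layer would call this on its weight tensor inside `forward`, so the optimizer still updates full-precision shadow weights while the loss is computed against their ternarized counterparts.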
We also address accuracy issues such as Out-of-Vocabulary, Out-of-Domain, and Negation errors, and demonstrate how synthetic data generated by LLMs improved robustness. Furthermore, we highlight the lack of inference backends that support 1.58-bit models and share practical tips for writing your own efficient kernels and optimizing memory usage in embedded environments.
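As an illustration of the memory side of that problem, the sketch below packs ternary weights at 2 bits each (four weights per byte), the kind of compact layout a custom 1.58-bit kernel would read from in an embedded deployment. This is a simplified NumPy example under our own assumptions, not our production kernel, and the helper names are hypothetical.

```python
import numpy as np

def pack_ternary(w_q: np.ndarray) -> np.ndarray:
    """Pack ternary weights {-1, 0, +1} into 2 bits each (4 weights per byte)."""
    codes = (w_q + 1).astype(np.uint8)              # map {-1, 0, +1} -> {0, 1, 2}
    codes = np.pad(codes, (0, (-len(codes)) % 4))   # pad to a multiple of 4
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n ternary weights from the packed byte stream."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1)[:n].astype(np.int8) - 1
```

Relative to fp32 weights this layout is a 16x reduction (and 4x versus int8), which is roughly the headroom that makes on-device deployment of such models plausible.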
Finally, we conclude by emphasizing the importance of extreme low-bit inference and its applicability to broader edge use cases such as Retrieval-Augmented Generation (RAG).