embedded world NA 2025

Speaker‑Aware Local Linformer Masking for Ultra‑Lightweight Noise‑Robust Speech Front‑Ends (Room 303A)

04 Nov 25
3:25 PM - 3:50 PM

Tracks: Embedded Software - HMI

Speaker(s): Naohiro Kohmu
Hands‑free voice interfaces on factory floors and in warehouses must recognize a single operator’s speech despite intense background noise, all within tight energy budgets. We present Speaker‑Aware Local Linformer Masking (SALM), a ≤400 kB neural front‑end that pushes state‑of‑the‑art noise suppression and speaker selectivity onto microcontrollers.
SALM first extracts a 128‑D embedding from a few seconds of enrolment audio and broadcasts this vector into a locally restricted Linformer self‑attention layer that operates along the mel‑frequency axis. The layer learns low‑rank query/key/value projections within narrow frequency windows, generating a dynamic mask M(t,b) that both attenuates noise (M < 1) and enhances speaker‑specific energy (M > 1) in real time.
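The mechanism above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the window width, low‑rank dimension, and random placeholder weights are assumptions, and the real layer would use learned parameters. It shows the broadcast of the speaker embedding to every mel band, attention restricted to a local frequency window, and an exponential output so the mask can fall below or rise above 1.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MEL, EMB = 40, 128          # from the abstract: 40 mel bands, 128-D embedding
WIN, RANK = 8, 4              # window width and low-rank dim are assumptions

# Random placeholder projections; in SALM these would be learned low-rank weights.
Wq = rng.standard_normal((1 + EMB, RANK)) * 0.1
Wk = rng.standard_normal((1 + EMB, RANK)) * 0.1
Wv = rng.standard_normal((1 + EMB, 1)) * 0.1

def salm_mask(frame, spk_emb):
    """One frame: locally restricted low-rank attention along the mel axis -> mask M(t, b)."""
    # Broadcast the speaker embedding to every mel band and concatenate.
    feats = np.concatenate(
        [frame[:, None], np.tile(spk_emb, (N_MEL, 1))], axis=1)   # (N_MEL, 1+EMB)
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    mask = np.empty(N_MEL)
    for b in range(N_MEL):
        lo, hi = max(0, b - WIN // 2), min(N_MEL, b + WIN // 2 + 1)
        scores = k[lo:hi] @ q[b] / np.sqrt(RANK)   # attention within the local window only
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        # exp keeps the mask positive: values < 1 attenuate, > 1 boost
        mask[b] = np.exp(attn @ v[lo:hi, 0])
    return mask

frame = rng.standard_normal(N_MEL)     # one 40-mel log-spectral frame
emb = rng.standard_normal(EMB)         # 128-D enrolment embedding
M = salm_mask(frame, emb)
enhanced = frame * M                   # apply the dynamic mask to the frame
```

Restricting attention to a narrow window keeps the per-frame cost linear in the number of mel bands, which is what makes the layer viable on an MCU.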
On an 8 kHz LibriSpeech + NoiseX‑92 corpus spanning –5 dB to 15 dB SNR, SALM cuts the mean‑squared error between noisy and clean spectrograms by 29% on average and up to 42.7% at 15 dB SNR. INT8 quantization compresses the speaker embedder and masking module to ≈226 kB and ≈171 kB, respectively, with sub‑10 ms latency for 40‑mel frames, enabling always‑listening operation within the static RAM and power envelope of typical embedded MCUs.
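The ≈4× footprint reduction from INT8 quantization can be illustrated with a simple symmetric per-tensor scheme; this is a generic sketch (the abstract does not specify SALM's exact quantization recipe), using an arbitrary weight matrix for demonstration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Example float32 weight tensor (shape chosen arbitrarily for illustration).
w = np.random.default_rng(1).standard_normal((128, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes, "->", q.nbytes)   # 4 bytes/weight -> 1 byte/weight
err = np.abs(w - q.astype(np.float32) * scale).max()
print(err <= scale / 2)           # rounding error bounded by half a step
```

Storing one byte per weight instead of four is what brings the embedder and masking module down to the ≈226 kB and ≈171 kB figures quoted above, at the cost of a bounded per-weight rounding error.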
The talk details the network architecture, training pipeline, deployment tool‑chain, and integration with CTC‑based recognizers, providing a practical recipe for battery‑operated, noise‑robust speech interfaces in industrial environments.