Audio & Speech at the Edge on Embedded Systems Development

Audio Feature Extraction for ML

Raw PCM audio is a poor direct input to machine learning models. A single second of 16 kHz mono audio produces 16,000 samples — a 16,000-dimensional input vector that contains temporal structure the model cannot easily exploit. Adjacent samples are highly correlated, frequency information is implicit rather than explicit, and the sheer dimensionality demands unnecessarily large models. The standard approach is to transform raw audio into a compact frequency-domain representation — typically a mel spectrogram or a set of MFCCs — that captures the perceptually relevant information in far fewer dimensions.
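To make the size reduction concrete, the sketch below counts the values in a mel spectrogram for one second of 16 kHz audio. The 25 ms window, 10 ms hop, and 40 mel bands are assumed here because they are common choices; the text does not specify them. The helper names are illustrative, and the mel conversion uses the widely used HTK-style formula.

```python
import math

def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def num_frames(num_samples: int, frame_len: int, hop_len: int) -> int:
    """Number of full analysis frames that fit without padding."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop_len

def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

SAMPLE_RATE = 16_000   # from the text: 16 kHz mono
FRAME_LEN = 400        # assumed: 25 ms analysis window
HOP_LEN = 160          # assumed: 10 ms hop between frames
N_MELS = 40            # assumed: number of mel bands

frames = num_frames(SAMPLE_RATE, FRAME_LEN, HOP_LEN)
feature_count = frames * N_MELS
centers = mel_center_frequencies(N_MELS, 0.0, SAMPLE_RATE / 2)

print(f"{frames} frames x {N_MELS} bands = {feature_count} values")
print(f"lowest band center: {centers[0]:.1f} Hz, highest: {centers[-1]:.1f} Hz")
```

Under these assumptions the representation is a small 2-D grid of frames by bands rather than a flat vector of samples, with frequency content made explicit and band spacing that is dense at low frequencies and sparse at high ones.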

Keyword Spotting & Wake Words

Keyword spotting (KWS) is the task of continuously monitoring an audio stream for a small set of target words — typically 1 to 10 commands — while rejecting all other speech, noise, and silence. It is the foundation of voice-activated systems: the always-on listener that decides whether a full speech recognition pipeline should activate. The engineering challenge is not classification accuracy in isolation — it is achieving high accuracy while running continuously at microwatt-level power for months on a coin cell or years on a small battery.
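Continuous monitoring usually means a small classifier emits a keyword score for every short frame, and a separate decision step turns that noisy score stream into discrete detections. The sketch below shows one simple way to do that decision step: a moving average followed by a threshold and a cooldown. The class name, window length, threshold, and cooldown are all hypothetical values chosen for illustration, not taken from any particular KWS system.

```python
from collections import deque

class SmoothedKeywordDetector:
    """Turns per-frame keyword scores into detections (illustrative sketch)."""

    def __init__(self, window: int = 5, threshold: float = 0.8, refractory: int = 10):
        self.scores = deque(maxlen=window)  # most recent per-frame scores
        self.threshold = threshold          # assumed decision threshold
        self.refractory = refractory        # frames to stay silent after firing
        self.cooldown = 0

    def update(self, score: float) -> bool:
        """Feed one frame score in [0, 1]; return True when a keyword fires."""
        self.scores.append(score)
        if self.cooldown > 0:
            # Suppress repeated triggers for the same utterance.
            self.cooldown -= 1
            return False
        if len(self.scores) < self.scores.maxlen:
            # Not enough history yet to smooth over a full window.
            return False
        mean = sum(self.scores) / len(self.scores)
        if mean >= self.threshold:
            self.cooldown = self.refractory
            return True
        return False

def detect(stream, **kwargs):
    """Return the frame indices at which the detector fires."""
    detector = SmoothedKeywordDetector(**kwargs)
    return [i for i, s in enumerate(stream) if detector.update(s)]

# A sustained run of high scores versus a single-frame spike.
sustained = [0.1] * 10 + [0.95] * 8 + [0.1] * 10
spike = [0.1] * 10 + [0.99] + [0.1] * 10
print(detect(sustained), detect(spike))
```

Averaging trades a few frames of latency for fewer false accepts: a one-frame spike from noise never lifts the window mean over the threshold, while a real keyword that spans several frames does.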

Speech Recognition

Speech recognition at the edge spans a vast range of complexity — from a Cortex-M4 classifying one of 20 fixed commands in 50 milliseconds, to an NVIDIA Jetson running the full OpenAI Whisper model for open-vocabulary transcription in multiple languages. The right approach depends on vocabulary size, acceptable latency, available hardware, and whether network connectivity is reliable enough for a cloud fallback.

The architecture decision is not just about accuracy. A 100-command voice interface that responds in 200 ms with 95% accuracy feels better to use than one that achieves 99% accuracy but takes 3 seconds. Latency, power, and reliability — not just word error rate — determine whether a speech recognition system is practical at the edge.

Audio Event Classification

Audio event classification identifies environmental sounds — not speech — from an audio stream: glass breaking, dog barking, siren wailing, machine operating normally versus abnormally. The task differs from speech recognition in several important ways: events are often short and non-stationary, multiple events can overlap in time, and the “vocabulary” of sounds is defined by the physical environment rather than a language. The dominant approach is a convolutional neural network operating on mel spectrograms, either trained from scratch for a narrow domain or fine-tuned from a pre-trained model like YAMNet.
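Because events can overlap, one common way to read out such a classifier is to treat each sound class as an independent yes/no decision (a sigmoid per class) rather than picking a single winner. The sketch below illustrates that readout only. It assumes the network produces one raw logit per class; the label names, logit values, and the 0.5 threshold are made up for the example.

```python
import math
from typing import Dict, List

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def active_events(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every label whose independent probability meets the threshold.

    Each class is scored on its own, so zero, one, or several events
    can be reported for the same audio window.
    """
    return sorted(label for label, z in logits.items() if sigmoid(z) >= threshold)

# Hypothetical logits for one audio window containing two overlapping sounds.
window_logits = {"glass_break": 2.0, "dog_bark": 1.0, "siren": -3.0}
print(active_events(window_logits))
```

A softmax readout would force these classes to compete and report only the strongest one, which is why per-class thresholds tend to suit overlapping environmental sounds better.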