PDM & I2S Audio Interfaces#
Streaming audio data from a digital microphone to an MCU requires a well-defined serial protocol. Two dominate embedded audio: PDM (Pulse Density Modulation), which carries a 1-bit sigma-delta bitstream at megahertz clock rates, and I2S (Inter-IC Sound), which carries multi-bit PCM samples in a synchronous frame format. The choice between them affects peripheral selection, CPU load, memory bandwidth, and firmware complexity. PDM is simpler in hardware (two wires) but requires decimation filtering that consumes either a dedicated hardware peripheral or significant CPU cycles. I2S delivers ready-to-use PCM samples but requires three signal lines and a more complex peripheral configuration.
PDM Protocol#
Signal Lines#
PDM uses exactly two signals:
| Signal | Direction | Description |
|---|---|---|
| CLK | MCU to microphone | Clock signal, typically 1.024-3.072 MHz |
| DAT | Microphone to MCU | 1-bit data output, synchronized to CLK |
The microphone samples its internal sigma-delta modulator on each clock edge and drives the DAT line high or low. The density of high bits over any given window is proportional to the instantaneous audio amplitude β hence “pulse density modulation.”
Stereo on a Single Data Line#
Two PDM microphones share one CLK and one DAT line. Channel selection is set by a hardware pin (typically labeled SEL or L/R) on each microphone:
- Left channel microphone (SEL tied low): outputs data on the falling edge of CLK
- Right channel microphone (SEL tied high): outputs data on the rising edge of CLK
The MCU peripheral captures both edges and de-interleaves the two channels. This means the effective per-channel data rate is half the clock frequency. With a 3.072 MHz clock, each channel receives 1.536 Mbit/s of PDM data.
Clock Rate and Output Sample Rate#
The PDM clock frequency and decimation ratio together determine the final PCM sample rate:
| PDM Clock (MHz) | Decimation Ratio | PCM Sample Rate (kHz) | Typical Use |
|---|---|---|---|
| 1.024 | 64 | 16 | Voice capture, keyword detection |
| 1.536 | 64 | 24 | Wideband voice |
| 2.048 | 64 | 32 | Enhanced voice quality |
| 3.072 | 64 | 48 | Music-quality audio |
| 2.048 | 128 | 16 | Higher oversampling, better SNR |
| 3.072 | 128 | 24 | High-quality voice |
Higher oversampling ratios (128x vs 64x) improve noise shaping performance at the cost of more computation in the decimation filter.
PDM-to-PCM Decimation#
The raw PDM bitstream cannot be used directly for audio processing. Conversion to PCM requires a decimation filter β a multi-stage digital filter that low-pass filters the bitstream and downsamples it to the target sample rate.
CIC Filter (Cascaded Integrator-Comb)#
The workhorse of PDM decimation is the CIC filter, specifically a sincΒ³ (third-order) variant. A CIC filter requires no multiplications β only additions and subtractions β making it efficient for hardware implementation and suitable for resource-constrained MCUs.
A sincΒ³ CIC filter with decimation ratio R consists of three integrator stages operating at the PDM clock rate, a rate reduction by factor R, and three comb stages operating at the output rate. The frequency response has nulls at multiples of the output sample rate, which suppresses the shaped quantization noise from the sigma-delta modulator.
The passband of a sincΒ³ filter is not flat β it droops by approximately 3.4 dB at the Nyquist frequency of the output rate. A compensation FIR filter (typically 16-32 taps) applied after the CIC stage corrects this droop and provides a sharper transition band.
Hardware Decimation: STM32 DFSDM#
STM32 families with a DFSDM (Digital Filter for Sigma-Delta Modulators) peripheral handle PDM decimation in hardware. The DFSDM includes:
- A serial input that accepts the PDM bitstream directly
- A configurable sinc filter (sinc1 through sinc5, with selectable order and decimation ratio)
- DMA output of filtered PCM samples to memory
/* STM32 HAL β DFSDM channel and filter configuration for MP34DT05 PDM mic */
#include "stm32l4xx_hal.h"
DFSDM_Channel_HandleTypeDef hdfsdm_ch;
DFSDM_Filter_HandleTypeDef hdfsdm_flt;
static int32_t pcm_buffer[PCM_BUFFER_SIZE];
void dfsdm_pdm_init(void)
{
/* Channel configuration β PDM input on DATIN0 */
hdfsdm_ch.Instance = DFSDM1_Channel0;
hdfsdm_ch.Init.OutputClock.Activation = ENABLE;
hdfsdm_ch.Init.OutputClock.Selection = DFSDM_CHANNEL_OUTPUT_CLOCK_SYSTEM;
hdfsdm_ch.Init.OutputClock.Divider = 24; /* 80 MHz / 24 = 3.33 MHz PDM clock */
hdfsdm_ch.Init.Input.Multiplexer = DFSDM_CHANNEL_EXTERNAL_INPUTS;
hdfsdm_ch.Init.Input.DataPacking = DFSDM_CHANNEL_STANDARD_MODE;
hdfsdm_ch.Init.Input.Pins = DFSDM_CHANNEL_SAME_CHANNEL_PINS;
hdfsdm_ch.Init.SerialInterface.Type = DFSDM_CHANNEL_SPI_RISING;
hdfsdm_ch.Init.SerialInterface.SpiClock = DFSDM_CHANNEL_SPI_CLOCK_INTERNAL;
hdfsdm_ch.Init.Awd.FilterOrder = DFSDM_CHANNEL_SINC1_ORDER;
hdfsdm_ch.Init.Awd.Oversampling = 1;
hdfsdm_ch.Init.Offset = 0;
hdfsdm_ch.Init.RightBitShift = 2;
HAL_DFSDM_ChannelInit(&hdfsdm_ch);
/* Filter configuration β sinc3, decimation 64 */
hdfsdm_flt.Instance = DFSDM1_Filter0;
hdfsdm_flt.Init.RegularParam.Trigger = DFSDM_FILTER_SW_TRIGGER;
hdfsdm_flt.Init.RegularParam.FastMode = ENABLE;
hdfsdm_flt.Init.RegularParam.DmaMode = ENABLE;
hdfsdm_flt.Init.FilterParam.SincOrder = DFSDM_FILTER_SINC3_ORDER;
hdfsdm_flt.Init.FilterParam.Oversampling = 64;
hdfsdm_flt.Init.FilterParam.IntOversampling = 1;
HAL_DFSDM_FilterInit(&hdfsdm_flt);
/* Associate channel with filter */
HAL_DFSDM_FilterConfigRegChannel(&hdfsdm_flt,
DFSDM_CHANNEL_0, DFSDM_CONTINUOUS_CONV_ON);
}
void dfsdm_start_capture(void)
{
/* Start DMA-driven continuous conversion */
HAL_DFSDM_FilterRegularStart_DMA(&hdfsdm_flt,
pcm_buffer, PCM_BUFFER_SIZE);
}
/* DMA half-complete and complete callbacks for double-buffering */
void HAL_DFSDM_FilterRegConvHalfCpltCallback(
DFSDM_Filter_HandleTypeDef *hdfsdm_filter)
{
/* Process first half of pcm_buffer */
}
void HAL_DFSDM_FilterRegConvCpltCallback(
DFSDM_Filter_HandleTypeDef *hdfsdm_filter)
{
/* Process second half of pcm_buffer */
}I2S Protocol#
Signal Lines#
I2S (Inter-IC Sound, also written IΒ²S) uses three signals for basic operation:
| Signal | Alias | Direction | Description |
|---|---|---|---|
| SCK | BCLK | MCU generates | Serial clock / bit clock |
| WS | LRCLK | MCU generates | Word select β low = left channel, high = right channel |
| SD | DOUT/DIN | Data source to sink | Serial data, MSB first |
An optional fourth signal, MCLK (master clock), runs at 256x or 384x the sample rate and provides a reference for audio codec PLLs. MEMS microphones typically do not require MCLK β they derive their internal timing from BCLK.
Timing and Frame Format#
The I2S Philips standard defines the following timing:
- WS transitions one BCLK cycle before the first bit of each channel’s data
- Data is transmitted MSB first
- Each channel occupies one half of the WS period (left when WS is low, right when WS is high)
- Data bits are valid on the rising edge of BCLK (for standard Philips mode)
The bit clock frequency is: BCLK = sample_rate Γ bits_per_slot Γ 2 (the factor of 2 accounts for left and right channels).
| Sample Rate | Bit Depth | BCLK Frequency |
|---|---|---|
| 8 kHz | 16-bit | 256 kHz |
| 16 kHz | 16-bit | 512 kHz |
| 16 kHz | 32-bit | 1.024 MHz |
| 44.1 kHz | 16-bit | 1.411 MHz |
| 48 kHz | 16-bit | 1.536 MHz |
| 48 kHz | 32-bit | 3.072 MHz |
I2S Modes#
Several variants of the I2S frame format exist:
| Mode | WS Transition | Data Alignment | Common Use |
|---|---|---|---|
| Philips (standard) | 1 BCLK before MSB | Left-justified in slot | Most MEMS microphones, codec defaults |
| Left-justified (MSB) | Aligned with MSB | Left-justified, WS high = left | Japanese audio ICs, some older codecs |
| Right-justified (LSB) | N/A β data at end of slot | Right-justified in slot | Rare, legacy DACs |
| TDM | Multiple slots per frame | Slot-addressed | Multi-channel audio, codec arrays |
Most I2S MEMS microphones use Philips standard. Configuring the MCU peripheral for the wrong mode produces garbled audio β the bits are read at incorrect positions within the frame, resulting in either very quiet output (bits shifted down) or clipped/distorted output (bits shifted up).
I2S Configuration on STM32 (SAI Peripheral)#
STM32F4 and later families provide the SAI (Serial Audio Interface) peripheral, which is more flexible than the basic SPI/I2S peripheral. The SAI supports I2S Philips, left-justified, right-justified, TDM, and PDM modes.
/* STM32 HAL β SAI configured as I2S master receiver, 16 kHz, 16-bit */
#include "stm32f4xx_hal.h"
SAI_HandleTypeDef hsai;
static int16_t audio_dma_buf[DMA_BUF_SIZE];
void sai_i2s_init(void)
{
hsai.Instance = SAI1_Block_A;
hsai.Init.AudioMode = SAI_MODEMASTER_RX;
hsai.Init.Synchro = SAI_ASYNCHRONOUS;
hsai.Init.OutputDrive = SAI_OUTPUTDRIVE_DISABLE;
hsai.Init.NoDivider = SAI_MASTERDIVIDER_ENABLE;
hsai.Init.FIFOThreshold = SAI_FIFOTHRESHOLD_1QF;
hsai.Init.AudioFrequency = SAI_AUDIO_FREQUENCY_16K;
hsai.Init.SynchroExt = SAI_SYNCEXT_DISABLE;
hsai.Init.MonoStereoMode = SAI_STEREOMODE;
hsai.Init.CompandingMode = SAI_NOCOMPANDING;
/* Frame configuration β Philips I2S */
hsai.FrameInit.FrameLength = 32; /* 16 bits per channel x 2 */
hsai.FrameInit.ActiveFrameLength = 16;
hsai.FrameInit.FSDefinition = SAI_FS_CHANNEL_IDENTIFICATION;
hsai.FrameInit.FSPolarity = SAI_FS_ACTIVE_LOW;
hsai.FrameInit.FSOffset = SAI_FS_BEFOREFIRSTBIT;
/* Slot configuration */
hsai.SlotInit.FirstBitOffset = 0;
hsai.SlotInit.SlotSize = SAI_SLOTSIZE_16B;
hsai.SlotInit.SlotNumber = 2;
hsai.SlotInit.SlotActive = SAI_SLOTACTIVE_0 | SAI_SLOTACTIVE_1;
HAL_SAI_Init(&hsai);
}
void sai_start_dma_capture(void)
{
HAL_SAI_Receive_DMA(&hsai, (uint8_t *)audio_dma_buf, DMA_BUF_SIZE);
}
void HAL_SAI_RxHalfCpltCallback(SAI_HandleTypeDef *hsai_ptr)
{
/* Process first half of audio_dma_buf */
}
void HAL_SAI_RxCpltCallback(SAI_HandleTypeDef *hsai_ptr)
{
/* Process second half of audio_dma_buf */
}I2S Configuration on ESP32 (ESP-IDF)#
The ESP32’s I2S peripheral supports both standard I2S and PDM modes. The newer ESP-IDF v5.x driver uses a channel-based API:
#include "driver/i2s_std.h"
#define SAMPLE_RATE 48000
#define DMA_BUF_LEN 512
static i2s_chan_handle_t rx_chan = NULL;
void esp32_i2s_init(void)
{
i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
I2S_NUM_0, I2S_ROLE_MASTER);
chan_cfg.dma_desc_num = 6;
chan_cfg.dma_frame_num = DMA_BUF_LEN;
i2s_new_channel(&chan_cfg, NULL, &rx_chan);
i2s_std_config_t std_cfg = {
.clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(SAMPLE_RATE),
.slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
.gpio_cfg = {
.mclk = I2S_GPIO_UNUSED,
.bclk = GPIO_NUM_26,
.ws = GPIO_NUM_25,
.dout = I2S_GPIO_UNUSED,
.din = GPIO_NUM_34,
.invert_flags = { false, false, false },
},
};
i2s_channel_init_std_mode(rx_chan, &std_cfg);
i2s_channel_enable(rx_chan);
}DMA Double-Buffer for Continuous Audio Streaming#
Audio streaming requires uninterrupted data flow. A single DMA buffer causes gaps β while the CPU processes a filled buffer, no DMA target is available, and incoming samples are lost. The standard solution is a double-buffer (also called ping-pong buffer) scheme:
- DMA fills buffer A while the CPU processes buffer B
- When buffer A is full, DMA switches to buffer B and signals the CPU (half-complete interrupt)
- The CPU processes buffer A while DMA fills buffer B
- The cycle repeats indefinitely
Both the STM32 HAL and ESP-IDF I2S drivers support this natively. The STM32 HAL uses a single large buffer with half-complete and complete callbacks β the first half is one “ping” buffer and the second half is the “pong” buffer. The ESP-IDF driver manages multiple DMA descriptors internally.
The critical constraint is that processing must complete within one buffer period. For a 512-sample buffer at 16 kHz, the buffer period is 32 ms. Any processing that takes longer causes the DMA to overwrite data before it is consumed, producing audio glitches (clicks, pops, or gaps).
| Buffer Size (samples) | Sample Rate | Buffer Period | Latency |
|---|---|---|---|
| 128 | 16 kHz | 8 ms | 16 ms (double-buffered) |
| 256 | 16 kHz | 16 ms | 32 ms |
| 512 | 48 kHz | 10.7 ms | 21.3 ms |
| 1024 | 48 kHz | 21.3 ms | 42.7 ms |
Smaller buffers reduce latency but increase the interrupt rate and leave less time for processing per buffer. Larger buffers provide more processing headroom but add latency β relevant for real-time applications like voice effects or echo cancellation.
Sample Rates and Bit Depth#
Common Sample Rates#
| Sample Rate | Bandwidth | Typical Application |
|---|---|---|
| 8 kHz | 4 kHz | Telephone-quality voice, wake-word detection |
| 16 kHz | 8 kHz | Wideband voice, speech recognition |
| 22.05 kHz | 11 kHz | AM radio quality |
| 44.1 kHz | 22.05 kHz | CD audio, music playback |
| 48 kHz | 24 kHz | Professional audio, video soundtracks |
For embedded voice applications (keyword spotting, voice command), 16 kHz is the standard β it captures the full speech bandwidth (300 Hz to 8 kHz) with margin. Music applications require 44.1 or 48 kHz to capture the full audible spectrum.
Bit Depth#
| Bit Depth | Dynamic Range | Notes |
|---|---|---|
| 16-bit | 96 dB | Sufficient for most embedded audio; CD quality |
| 24-bit | 144 dB | Exceeds any microphone’s dynamic range; used for processing headroom |
| 32-bit | 192 dB | Typically 24-bit data padded to 32-bit alignment for DMA efficiency |
Most I2S MEMS microphones output 24-bit data in a 32-bit frame. The lower 8 bits are padding (zeros or noise). Processing pipelines that operate on 32-bit integers should account for this β the valid data occupies bits [31:8] of each word, with bit 31 as the sign bit.
Tips#
- Always verify the I2S mode (Philips vs left-justified) in both the microphone datasheet and the MCU peripheral configuration β a mode mismatch is the most common cause of garbled or silent audio
- Use DMA for all audio streaming β polling the I2S data register in a loop cannot sustain the required throughput at 48 kHz and introduces jitter that corrupts the sample stream
- Size DMA buffers to match the processing block size of downstream algorithms (e.g., 512 samples for a 512-point FFT) to avoid unnecessary copying between buffers
- When using PDM microphones on STM32, prefer the DFSDM peripheral over software decimation β hardware filtering runs at zero CPU cost and produces consistent timing
- For debugging audio issues, capture raw DMA buffer contents to a log or SD card and inspect them in a tool like Audacity β this reveals whether the problem is in the capture path or the processing path
Caveats#
- Clock accuracy affects audio pitch β An I2S bit clock derived from an imprecise internal RC oscillator shifts the effective sample rate, causing audio to play back slightly fast or slow. For audio-critical applications, derive the I2S clock from a crystal-referenced PLL
- MCLK requirements vary by device β Some audio codecs require MCLK at exactly 256x the sample rate. MEMS microphones generally do not need MCLK, but connecting one to a codec on the same I2S bus may require generating it
- DMA buffer alignment β On ARM Cortex-M, DMA buffers must be aligned to the data width (4-byte alignment for 32-bit transfers). Misaligned buffers cause hard faults or silently corrupt adjacent memory
- I2S peripheral clock tree limitations β Not all sample rates are achievable with integer division from the MCU’s PLL. 44.1 kHz is notoriously difficult on STM32 β the achievable rate may be 44.117 kHz or 44.089 kHz, introducing a small pitch error. 48 kHz divides cleanly from most PLL configurations
- PDM clock must be continuous β Stopping the PDM clock and restarting it causes the microphone’s sigma-delta modulator to lose its operating point, producing a burst of noise (settling transient) that lasts 1-10 ms depending on the part
In Practice#
- An I2S capture that produces audio at half the expected pitch and double the duration indicates the MCU is configured for stereo but the microphone is mono β the driver interleaves zero-valued samples for the empty channel, effectively halving the apparent sample rate when played back as mono
- Audio that sounds correct but contains periodic clicks at a regular interval (e.g., every 32 ms) points to a DMA buffer underrun β the processing callback is not completing before the next buffer is ready, causing a brief gap or repeated segment in the audio stream
- A PDM microphone that produces usable audio at 16 kHz but excessive noise at 48 kHz suggests the decimation filter order is too low for the higher output rate β the quantization noise that the sigma-delta modulator pushes above the audio band is not being sufficiently attenuated at the wider bandwidth
- Raw I2S data from an INMP441 that appears as very small values (near zero) when interpreted as 16-bit integers is almost always a bit-width mismatch β the 24-bit data is left-justified in a 32-bit frame, and reading it as 16-bit captures only the MSBs, which for quiet signals are all zeros or sign-extension bits
- Connecting a logic analyzer to the I2S bus and verifying that WS transitions occur exactly one BCLK before the first data bit is the definitive way to diagnose mode configuration issues β many hours of firmware debugging can be saved by a five-minute logic analyzer capture