Error Recovery & Bus Reset#
I²C errors fall into two categories: protocol-level errors that the hardware peripheral detects (NACK, arbitration loss, bus error) and bus-level failures that the peripheral cannot self-recover from (stuck SDA, phantom busy flag). The first category is handled by error callbacks and retry logic. The second requires manual intervention — toggling GPIOs, resetting the peripheral, or power-cycling the bus. Firmware that does not handle both categories will eventually lock up in the field, because every I²C bus will eventually encounter a stuck condition, and no amount of software reset will clear a slave device that is holding SDA low mid-byte.
Common I²C Error Conditions#
| Error | Flag (STM32) | Cause | Recoverable? |
|---|---|---|---|
| Address NACK | AF (Acknowledge Failure) | Device not present, wrong address, device busy | Yes — retry |
| Data NACK | AF | Device rejected data (write protect, invalid register) | Yes — check data |
| Bus Error | BERR | START/STOP at illegal position, noise glitch | Yes — peripheral reset |
| Arbitration Lost | ARLO | Another master won arbitration | Yes — retry later |
| Bus Busy | BUSY flag stuck | Slave holding SDA low, incomplete transaction | Requires bus recovery |
| Overrun/Underrun | OVR | DMA or interrupt latency | Yes — peripheral reset |
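
The table can be read as a priority order: a stuck BUSY flag outranks everything, errors that need a peripheral reset come next, and plain retries come last. The sketch below encodes that reading as a small decision function. The `ERR_*` bit names, the `i2c_action_t` enum, and `i2c_choose_action` are hypothetical stand-ins for illustration; real firmware would test the `HAL_I2C_ERROR_*` constants from the HAL headers instead.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the HAL error bits (illustration only). */
#define ERR_BERR (1u << 0)  /* bus error */
#define ERR_ARLO (1u << 1)  /* arbitration lost */
#define ERR_AF   (1u << 2)  /* acknowledge failure (NACK) */
#define ERR_OVR  (1u << 3)  /* overrun/underrun */

typedef enum {
    ACTION_NONE,
    ACTION_RETRY,
    ACTION_RESET_PERIPHERAL,
    ACTION_BUS_RECOVERY
} i2c_action_t;

/* Return the strongest action demanded by the error bits and BUSY state. */
i2c_action_t i2c_choose_action(uint32_t err, int busy_stuck) {
    if (busy_stuck) {
        return ACTION_BUS_RECOVERY;       /* peripheral cannot self-recover */
    }
    if (err & (ERR_BERR | ERR_OVR)) {
        return ACTION_RESET_PERIPHERAL;   /* state machine may be confused */
    }
    if (err & (ERR_AF | ERR_ARLO)) {
        return ACTION_RETRY;              /* transient: try again later */
    }
    return ACTION_NONE;
}
```

When several bits are set at once, the function returns the most invasive action, on the assumption that a peripheral reset also covers whatever a retry would have fixed.
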
NACK Detection and Handling#
A NACK on the address byte means the slave did not respond — the device is missing, powered down, or the address is wrong. A NACK on a data byte means the slave received the address but rejected the data. The STM32 HAL reports both through HAL_I2C_ERROR_AF:
// Error counters updated from interrupt context
static volatile uint32_t i2c_nack_count;
static volatile uint32_t i2c_arb_lost_count;
void HAL_I2C_ErrorCallback(I2C_HandleTypeDef *hi2c) {
    uint32_t err = hi2c->ErrorCode;
    if (err & HAL_I2C_ERROR_AF) {
        // Acknowledge failure: device did not respond
        // Check: is the device address correct? Is it powered?
        i2c_nack_count++;
    }
    if (err & HAL_I2C_ERROR_BERR) {
        // Bus error: reset the peripheral
        HAL_I2C_DeInit(hi2c);
        HAL_I2C_Init(hi2c);
    }
    if (err & HAL_I2C_ERROR_ARLO) {
        // Arbitration lost: retry after delay
        i2c_arb_lost_count++;
    }
    if (err & HAL_I2C_ERROR_OVR) {
        // Overrun: reset peripheral, increase interrupt priority
        HAL_I2C_DeInit(hi2c);
        HAL_I2C_Init(hi2c);
    }
}
The Stuck SDA Problem#
The most notorious I²C failure mode occurs when a slave device holds SDA low indefinitely. This happens when a transaction is interrupted mid-byte — for example, by a master reset during a read transfer. The slave was sending a data byte with SDA low (a zero bit), saw the clock stop, and is now waiting for SCL transitions to finish the byte. From the slave’s perspective, the transaction is still active. From the master’s perspective (after reset), the bus is busy and cannot be used.
The I²C specification defines a recovery procedure: the master must toggle SCL manually (as a GPIO, not through the I²C peripheral) to clock the slave through the remainder of its byte. After at most 9 clock pulses, the slave will have shifted out all remaining bits and released SDA. At that point, the master can generate a STOP condition to reset the bus.
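
To see why a bounded number of clock pulses is enough, the following host-side toy model may help. It is not driver code: `stuck_slave_t`, `slave_sda_high`, `slave_clock`, and `recover` are hypothetical names, and the model assumes a slave that shifts one bit (MSB first) per SCL pulse and releases SDA once its byte is finished.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a slave interrupted mid-read (hypothetical, illustration only). */
typedef struct {
    uint8_t byte;      /* data byte the slave was transmitting */
    int     bits_sent; /* bits already clocked out, 0..8 */
} stuck_slave_t;

/* SDA as the master sees it: low only while the slave drives a 0 bit. */
bool slave_sda_high(const stuck_slave_t *s) {
    if (s->bits_sent >= 8) {
        return true;   /* byte finished: slave releases SDA */
    }
    return ((s->byte >> (7 - s->bits_sent)) & 1) != 0;   /* MSB first */
}

/* One full SCL pulse moves the slave on to its next bit. */
void slave_clock(stuck_slave_t *s) {
    if (s->bits_sent < 8) {
        s->bits_sent++;
    }
}

/* Pulse SCL until SDA reads high. Returns pulses used, or -1 after 9 tries. */
int recover(stuck_slave_t *s) {
    for (int pulses = 1; pulses <= 9; pulses++) {
        slave_clock(s);
        if (slave_sda_high(s)) {
            return pulses;
        }
    }
    return -1;
}
```

In this model the worst case is an all-zero byte interrupted on its first bit, which needs eight pulses; recovery code budgets nine so the acknowledge clock is covered as well. SDA going high on a 1 bit does not mean the slave has finished its byte, which is why the STOP condition afterwards is still needed.
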
Bus Recovery Implementation#
A complete bus recovery function for STM32:
#define I2C_SCL_PIN GPIO_PIN_6
#define I2C_SDA_PIN GPIO_PIN_7
#define I2C_GPIO_PORT GPIOB
void i2c_bus_recovery(I2C_HandleTypeDef *hi2c) {
    GPIO_InitTypeDef gpio = {0};
    // Step 1: Disable the I2C peripheral
    HAL_I2C_DeInit(hi2c);
    // Step 2: Configure SCL as open-drain output, SDA as input.
    // Open-drain means the master never drives a line high against a slave.
    gpio.Pin = I2C_SCL_PIN;
    gpio.Mode = GPIO_MODE_OUTPUT_OD;
    gpio.Pull = GPIO_NOPULL;   // relies on the external bus pull-ups
    gpio.Speed = GPIO_SPEED_FREQ_HIGH;
    HAL_GPIO_Init(I2C_GPIO_PORT, &gpio);
    gpio.Pin = I2C_SDA_PIN;
    gpio.Mode = GPIO_MODE_INPUT;
    gpio.Pull = GPIO_PULLUP;
    HAL_GPIO_Init(I2C_GPIO_PORT, &gpio);
    // Step 3: Toggle SCL up to 9 times, checking SDA after each
    for (int i = 0; i < 9; i++) {
        HAL_GPIO_WritePin(I2C_GPIO_PORT, I2C_SCL_PIN, GPIO_PIN_RESET);
        HAL_Delay(1); // ~1 ms half-period (well below 100 kHz)
        HAL_GPIO_WritePin(I2C_GPIO_PORT, I2C_SCL_PIN, GPIO_PIN_SET);
        HAL_Delay(1);
        // Check if SDA is released
        if (HAL_GPIO_ReadPin(I2C_GPIO_PORT, I2C_SDA_PIN) == GPIO_PIN_SET) {
            break;
        }
    }
    // Step 4: Generate a STOP condition (SDA low->high while SCL high)
    gpio.Pin = I2C_SDA_PIN;
    gpio.Mode = GPIO_MODE_OUTPUT_OD;
    HAL_GPIO_Init(I2C_GPIO_PORT, &gpio);
    HAL_GPIO_WritePin(I2C_GPIO_PORT, I2C_SDA_PIN, GPIO_PIN_RESET);
    HAL_Delay(1);
    HAL_GPIO_WritePin(I2C_GPIO_PORT, I2C_SCL_PIN, GPIO_PIN_SET);
    HAL_Delay(1);
    HAL_GPIO_WritePin(I2C_GPIO_PORT, I2C_SDA_PIN, GPIO_PIN_SET); // STOP
    HAL_Delay(1);
    // Step 5: Reinitialize. HAL_I2C_Init calls HAL_I2C_MspInit, which is
    // assumed to restore the I2C alternate-function pin configuration
    // (as CubeMX-generated code does).
    HAL_I2C_Init(hi2c);
}
The HAL_Delay(1) calls clock the bus at roughly 500 Hz or slower, far below the 100 kHz standard-mode maximum. This is intentional: the recovery sequence needs to be slow enough for any slave to track, and faster toggling risks missing a slave that is in an unknown internal state. Both pins are driven open-drain so that the master never forces a line high while a slave is still pulling it low.
Timeout-Based Error Detection#
The HAL’s built-in timeout (the last parameter in HAL_I2C_Mem_Read and similar functions) is the first layer of defense. It bounds how long a blocking call can wait on a transfer that stalls, but on its own it does not tell the caller that the bus was already stuck busy before the transfer began.
A second layer checks the BUSY flag before initiating any transaction:
#define I2C_BUSY_TIMEOUT_MS 50
HAL_StatusTypeDef i2c_wait_until_ready(I2C_HandleTypeDef *hi2c) {
    uint32_t start = HAL_GetTick();
    while (__HAL_I2C_GET_FLAG(hi2c, I2C_FLAG_BUSY)) {
        if ((HAL_GetTick() - start) > I2C_BUSY_TIMEOUT_MS) {
            return HAL_TIMEOUT;
        }
    }
    return HAL_OK;
}
When this function returns HAL_TIMEOUT, the bus recovery procedure should be triggered.
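
One detail of the loop above is worth spelling out: the expression `HAL_GetTick() - start` stays correct when the 32-bit millisecond counter wraps, because unsigned subtraction is performed modulo 2^32. The sketch below isolates that comparison so it can be checked on a host machine; `tick_timed_out` is a hypothetical helper name, not part of the HAL.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once more than timeout_ms has elapsed since start, even if the
   32-bit millisecond counter wrapped around between start and now. */
bool tick_timed_out(uint32_t now, uint32_t start, uint32_t timeout_ms) {
    return (uint32_t)(now - start) > timeout_ms;
}
```

Comparing absolute tick values instead (for example `now > start + timeout_ms`) would misbehave near the wrap point, so the subtract-then-compare form is the one to keep.
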
Retry Logic with Backoff#
A single NACK or bus error often resolves on its own: the device was temporarily busy, or a noise glitch caused a bus error. A simple retry loop with a short doubling backoff (2, 4, 8, 16, 32 ms) handles these transient conditions:
#define I2C_MAX_RETRIES 5
#define I2C_RETRY_BASE_MS 2
HAL_StatusTypeDef i2c_read_with_retry(I2C_HandleTypeDef *hi2c,
                                      uint16_t dev_addr,
                                      uint16_t reg_addr,
                                      uint8_t *data,
                                      uint16_t len) {
    HAL_StatusTypeDef status;
    for (int attempt = 0; attempt < I2C_MAX_RETRIES; attempt++) {
        // Check bus availability
        if (i2c_wait_until_ready(hi2c) != HAL_OK) {
            i2c_bus_recovery(hi2c);
            continue;
        }
        status = HAL_I2C_Mem_Read(hi2c, dev_addr, reg_addr,
                                  I2C_MEMADD_SIZE_8BIT,
                                  data, len, 100);
        if (status == HAL_OK) {
            return HAL_OK;
        }
        // Delay before retry: 2, 4, 8, 16, 32 ms
        HAL_Delay(I2C_RETRY_BASE_MS << attempt);
        // Reset peripheral on bus error or overrun
        if (hi2c->ErrorCode & (HAL_I2C_ERROR_BERR | HAL_I2C_ERROR_OVR)) {
            HAL_I2C_DeInit(hi2c);
            HAL_I2C_Init(hi2c);
        }
    }
    // All retries exhausted: attempt full bus recovery
    i2c_bus_recovery(hi2c);
    return HAL_ERROR;
}
Long exponential backoff sequences are sometimes recommended but rarely necessary for I²C: the failure modes are typically either transient (resolved in 1-2 retries) or persistent (requiring bus recovery, not longer delays). A short linear or doubling schedule that tops out at about 32 ms is sufficient. If a transfer still fails after waiting around 50 ms, the bus is most likely stuck, and no amount of further waiting will resolve it.
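
The delay schedule can also be computed by a small helper that caps the result, which keeps the shift from growing without bound if the retry count is ever raised. This is a minimal sketch; `i2c_retry_delay_ms`, `RETRY_BASE_MS`, and `RETRY_CAP_MS` are hypothetical names chosen to mirror the 2 ms base and roughly 32 ms ceiling discussed above.

```c
#include <stdint.h>

#define RETRY_BASE_MS 2u    /* first retry delay (assumed, mirrors the text) */
#define RETRY_CAP_MS  32u   /* ceiling for any single delay */

/* Delay before retry number `attempt` (0-based): doubles each time, capped. */
uint32_t i2c_retry_delay_ms(unsigned attempt) {
    uint32_t delay = RETRY_BASE_MS;
    while (attempt-- > 0 && delay < RETRY_CAP_MS) {
        delay *= 2u;
    }
    return delay > RETRY_CAP_MS ? RETRY_CAP_MS : delay;
}
```

With five attempts the delays are 2, 4, 8, 16, and 32 ms, so the whole retry sequence spends 62 ms waiting before it escalates to bus recovery.
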
Peripheral Software Reset#
The STM32 I2C peripheral has a software reset bit (SWRST in I2C_CR1 on F4, or the RCC peripheral reset register) that resets the internal state machine without affecting GPIO configuration:
// Method 1: SWRST bit (STM32F4)
I2C1->CR1 |= I2C_CR1_SWRST;
HAL_Delay(1);
I2C1->CR1 &= ~I2C_CR1_SWRST;
// Must reconfigure all I2C registers after SWRST
// Method 2: RCC reset (works on all STM32)
__HAL_RCC_I2C1_FORCE_RESET();
HAL_Delay(1);
__HAL_RCC_I2C1_RELEASE_RESET();
// Must reconfigure all I2C registers after RCC reset
// Method 3: HAL DeInit/Init (safest, handles all cleanup)
HAL_I2C_DeInit(&hi2c1);
HAL_I2C_Init(&hi2c1);
The RCC reset method is the most thorough: it resets every register in the peripheral to its default value. The SWRST method resets the state machine but may leave some configuration registers intact depending on the silicon revision. HAL_I2C_DeInit followed by HAL_I2C_Init is the safest approach from application code.
ESP-IDF Error Recovery#
On ESP32, the I2C driver provides timeout handling internally, but bus recovery requires manual intervention:
#include "driver/gpio.h"
#include "driver/i2c.h"
#include "esp_rom_sys.h"  // esp_rom_delay_us()
esp_err_t i2c_bus_recover(i2c_port_t port) {
    // Uninstall the driver to release GPIO control
    i2c_driver_delete(port);
    // Configure SCL (GPIO 22 here) as open-drain output and
    // SDA (GPIO 21 here) as input with pull-up
    gpio_set_direction(GPIO_NUM_22, GPIO_MODE_OUTPUT_OD);
    gpio_set_direction(GPIO_NUM_21, GPIO_MODE_INPUT);
    gpio_set_pull_mode(GPIO_NUM_21, GPIO_PULLUP_ONLY);
    // Clock out up to 9 pulses, stopping once SDA is released
    for (int i = 0; i < 9; i++) {
        gpio_set_level(GPIO_NUM_22, 0);
        esp_rom_delay_us(5);
        gpio_set_level(GPIO_NUM_22, 1);
        esp_rom_delay_us(5);
        if (gpio_get_level(GPIO_NUM_21) == 1) {
            break;
        }
    }
    // Generate STOP condition (SDA low->high while SCL high)
    gpio_set_direction(GPIO_NUM_21, GPIO_MODE_OUTPUT_OD);
    gpio_set_level(GPIO_NUM_21, 0);
    esp_rom_delay_us(5);
    gpio_set_level(GPIO_NUM_22, 1);
    esp_rom_delay_us(5);
    gpio_set_level(GPIO_NUM_21, 1);
    // Reinstall the I2C driver
    i2c_config_t conf = { /* ... original config ... */ };
    i2c_param_config(port, &conf);
    return i2c_driver_install(port, conf.mode, 0, 0, 0);
}
The pattern is identical to the STM32 approach: take GPIO control away from the peripheral, clock out the stuck byte, generate STOP, then reinitialize.
Tips#
- Always implement a bus recovery function — even if the bus works perfectly during development, field conditions (power glitches, ESD events, connector vibration) will eventually produce a stuck bus condition
- Keep the BUSY flag timeout short (25-50 ms): on a single-master bus at 100 kHz, each byte takes 9 clocks (90 µs), so even a 100-byte transfer completes in under 10 ms. A BUSY flag that persists beyond 50 ms is a stuck condition, not a legitimate transfer
- Limit retries to 3-5 attempts before escalating to bus recovery — infinite retry loops mask real hardware problems and make the firmware appear to hang
- Maintain error counters (NACK count, bus error count, recovery count) and expose them through a debug interface — these counters provide early warning of degrading bus health before hard failures occur
- After bus recovery, re-read any cached sensor configuration registers — some devices reset their internal configuration when they see the recovery STOP condition, reverting to power-on defaults
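
The error-counter tip can be sketched as a small statistics structure. Everything here is illustrative: `i2c_stats_t`, `i2c_event_t`, `i2c_stats_record`, and `i2c_stats_get` are hypothetical names, and the counters saturate rather than wrap so that a long-running device never reports a misleadingly small count.

```c
#include <stdint.h>

typedef enum {
    I2C_EV_NACK,
    I2C_EV_BUS_ERROR,
    I2C_EV_ARB_LOST,
    I2C_EV_RECOVERY,
    I2C_EV_COUNT       /* number of event kinds, not an event itself */
} i2c_event_t;

typedef struct {
    uint32_t count[I2C_EV_COUNT];
} i2c_stats_t;

/* Increment the counter for one event, saturating at UINT32_MAX. */
void i2c_stats_record(i2c_stats_t *s, i2c_event_t ev) {
    if ((unsigned)ev < I2C_EV_COUNT && s->count[ev] != UINT32_MAX) {
        s->count[ev]++;
    }
}

/* Read a counter; unknown event kinds read as zero. */
uint32_t i2c_stats_get(const i2c_stats_t *s, i2c_event_t ev) {
    return ((unsigned)ev < I2C_EV_COUNT) ? s->count[ev] : 0u;
}
```

In firmware, `i2c_stats_record` would be called from the error paths and the recovery function, and a debug command or telemetry frame would dump the array. If the counters are updated from interrupt context, they would also need to be `volatile` or read with interrupts masked.
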
Caveats#
- A software reset of the I²C peripheral does not release a stuck slave: The peripheral reset clears the master’s state machine, but the slave is a separate device still holding SDA low. Bus recovery (SCL toggling) is required to unstick the slave. Calling HAL_I2C_Init repeatedly without GPIO-level recovery is a common mistake that never resolves the stuck condition
- The STM32F4 BUSY flag can become permanently set due to a silicon bug: Errata sheet ES0182 documents that certain interrupt sequences can leave the BUSY flag asserted even with no activity on the bus. The only workaround is the GPIO-level SCL toggle sequence followed by a peripheral reset via the RCC register
- Bus recovery while other slaves are mid-transaction can corrupt their state: The 9 SCL pulses that unstick one device look like valid clock transitions to every other device on the bus. Devices that were idle are unaffected, but any device that was in the middle of a transaction (e.g., an EEPROM write in progress) may interpret the recovery clocks as data. In multi-device systems, bus recovery should be followed by re-reading the status of all devices
- HAL_I2C_ErrorCallback is only called in interrupt/DMA mode: In polling mode (the blocking calls such as HAL_I2C_Mem_Read), errors are reported only through the return value and hi2c->ErrorCode. Code that relies on the error callback for polling-mode error handling will never see it fire
- Timeout values in milliseconds hide the real timing: A 100 ms timeout at 400 kHz allows 40,000 clock cycles of bus activity. Genuine timeouts should be 2-5x the expected transaction duration, not an order of magnitude larger
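
The last caveat can be turned into arithmetic. The sketch below estimates the duration of a memory read with an 8-bit register address and derives a timeout from it. It is a rough model under stated assumptions: 9 SCL clocks per byte, three overhead bytes (address+W, register, address+R), and no allowance for START/STOP timing or clock stretching. `i2c_mem_read_clocks` and `i2c_timeout_ms` are hypothetical helper names.

```c
#include <stdint.h>

/* Approximate SCL clocks for an 8-bit-register memory read of len bytes:
   address+W, register, address+R, then len data bytes, 9 clocks each. */
uint32_t i2c_mem_read_clocks(uint32_t len) {
    return 9u * (len + 3u);
}

/* Timeout in ms: expected duration times `margin`, rounded up, minimum 1.
   scl_hz must be non-zero. */
uint32_t i2c_timeout_ms(uint32_t len, uint32_t scl_hz, uint32_t margin) {
    uint64_t clocks = i2c_mem_read_clocks(len);
    uint64_t us = (clocks * 1000000u + scl_hz - 1u) / scl_hz;  /* ceil */
    uint64_t ms = (us * margin + 999u) / 1000u;                /* ceil */
    return ms == 0u ? 1u : (uint32_t)ms;
}
```

For a 6-byte sensor read at 400 kHz the transfer takes about 203 µs, so even a 3x margin rounds up to the 1 ms minimum that a millisecond-granular API can express. A 256-byte read at 100 kHz takes about 23.3 ms and gets a 70 ms timeout at the same margin.
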
In Practice#
- A bus that works for hours and then locks up permanently — requiring a power cycle to recover — is the classic stuck SDA symptom. The logic analyzer shows SDA held low with no SCL activity. This commonly appears after a firmware update or debugging session that reset the master while a sensor was mid-read
- Intermittent NACK errors that correlate with a specific device on a multi-device bus usually indicate that the failing device is slow to release clock stretching. The master’s timeout expires before the device releases SCL, and the master interprets the incomplete transfer as a NACK. Increasing the timeout or reducing the I²C clock speed resolves this, but the root cause is insufficient clock-stretch tolerance in the master configuration
- A bus that shows continuous traffic on the logic analyzer — repeating START conditions every few milliseconds — without ever completing a transaction indicates a firmware retry loop that is not escalating to bus recovery. Each retry encounters the same stuck condition, fails, and retries immediately. This shows up as 100% bus utilization with zero successful transactions
- An I2C peripheral that reports BUSY immediately after initialization, before any transaction is attempted, indicates that SDA or SCL is sitting low. The most common cause is a missing pull-up resistor: with no pull-up, nothing drives the open-drain lines high, so they float and can read low, and the peripheral sees a continuous bus-busy condition. External hardware holding a line low produces the same symptom. Checking SDA and SCL levels with a multimeter (both should read close to VCC with pull-ups) is the fastest diagnostic