fix(firmware): node_id early capture + SPI cache crash fix + provision.py compat #397
The original defensive copy in csi_collector_init() (line 172 of main.c)
runs AFTER wifi_init_sta() (line 147), which on some ESP32-S3 devices
corrupts g_nvs_config.node_id back to the Kconfig default of 1.
Reproduced on device 80:b5:4e:c1:be:b8 (ESP32-S3 QFN56 rev v0.2):
- NVS provisioned with node_id=5
- Release firmware (no fix): seed receives node_id=1 (clobbered)
- This patch: seed receives node_id=5 (correct)
Changes:
- Add csi_collector_set_node_id() called from main.c immediately
after nvs_config_load(), before wifi_init_sta() runs
- csi_collector_init() now detects and logs the clobber if early
capture disagrees with current g_nvs_config value
- Fallback path preserved: if set_node_id() is never called,
init() still captures from g_nvs_config (backwards compatible)
Co-Authored-By: claude-flow <ruv@ruv.net>
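The early-capture pattern in this commit can be sketched roughly as follows. This is a host-compilable illustration, not the firmware source: the one-field `nvs_config_t`, the `bool` return value of `csi_collector_init()`, and the helper `csi_collector_get_node_id()` are assumptions made for the example, and on the device the warning would presumably go through the ESP-IDF logging macros rather than `fprintf`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the firmware's NVS-backed config (assumption: only node_id modelled). */
typedef struct {
    uint8_t node_id;
} nvs_config_t;

nvs_config_t g_nvs_config = { .node_id = 1 }; /* 1 = Kconfig default */

static uint8_t s_node_id;          /* module-local copy used for outgoing frames */
static bool    s_node_id_captured; /* true once an early capture has happened */

/* Intended to be called right after nvs_config_load(), before wifi_init_sta(). */
void csi_collector_set_node_id(uint8_t node_id)
{
    s_node_id = node_id;
    s_node_id_captured = true;
}

/* Returns true if the live config disagrees with the early capture
 * (hypothetical return value, added so the sketch is testable). */
bool csi_collector_init(void)
{
    if (!s_node_id_captured) {
        /* Fallback path: no early capture, so read the global as before. */
        s_node_id = g_nvs_config.node_id;
        s_node_id_captured = true;
        return false;
    }
    if (g_nvs_config.node_id != s_node_id) {
        fprintf(stderr, "node_id clobbered: captured=%u current=%u\n",
                (unsigned)s_node_id, (unsigned)g_nvs_config.node_id);
        return true;
    }
    return false;
}

/* Hypothetical accessor: the value the collector would stamp on frames. */
uint8_t csi_collector_get_node_id(void)
{
    return s_node_id;
}
```

The point of the design is that the value used at runtime never depends on the state of `g_nvs_config` after WiFi init; the global is only consulted again to report a mismatch.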
The CSI callback reads g_nvs_config.filter_mac_set and filter_mac on every invocation (100-500 Hz). If wifi_init_sta() corrupts g_nvs_config (same root cause as the node_id clobber), the callback reads garbage from the struct, leading to a Core 0 LoadProhibited panic after ~2400 callbacks (~70 seconds of operation).
Extends the early-capture pattern from the node_id fix to also copy filter_mac_set and filter_mac into module-local statics before WiFi init runs. Adds canary logging to detect filter_mac corruption.
Observed on device 80:b5:4e:c1:be:b8 via serial:
CSI cb #2400 → Guru Meditation Error: Core 0 panic'ed (LoadProhibited) → TG0WDT_SYS_RST → reboot → crash again at ~2900 callbacks
Refs ruvnet#232 ruvnet#375 ruvnet#385 ruvnet#386 ruvnet#390
Co-Authored-By: Ruflo & AQE
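A minimal sketch of the filter_mac defensive copy and canary check, assuming a two-field subset of the config struct. The names `nvs_filter_config_t`, `csi_filter_capture()`, `csi_filter_corrupted()`, and `csi_filter_accepts()` are hypothetical and chosen for illustration; only `filter_mac_set` and `filter_mac` come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the NVS config: just the two fields the callback reads. */
typedef struct {
    bool    filter_mac_set;
    uint8_t filter_mac[6];
} nvs_filter_config_t;

/* Module-local copies, filled once before WiFi init. */
static bool    s_filter_mac_set;
static uint8_t s_filter_mac[6];

/* Capture the filter settings while the config is still known to be good. */
void csi_filter_capture(const nvs_filter_config_t *cfg)
{
    s_filter_mac_set = cfg->filter_mac_set;
    memcpy(s_filter_mac, cfg->filter_mac, sizeof s_filter_mac);
}

/* Canary: true if the live struct no longer matches the captured copy. */
bool csi_filter_corrupted(const nvs_filter_config_t *cfg)
{
    return cfg->filter_mac_set != s_filter_mac_set ||
           memcmp(cfg->filter_mac, s_filter_mac, sizeof s_filter_mac) != 0;
}

/* Hot-path check for the CSI callback: reads only the local statics. */
bool csi_filter_accepts(const uint8_t src_mac[6])
{
    if (!s_filter_mac_set) {
        return true; /* no filter configured, accept every source */
    }
    return memcmp(src_mac, s_filter_mac, sizeof s_filter_mac) == 0;
}
```

Keeping the per-callback check on module-local statics means a later corruption of the global struct can be logged by the canary without affecting what the callback dereferences.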
The WiFi driver's wDev_ProcessFiq interrupt handler crashes with LoadProhibited in cache_ll_l1_resume_icache when promiscuous mode captures MGMT+DATA frames (100-500 interrupts/sec). The high interrupt rate races with SPI flash cache operations, corrupting cache state.
Changes:
- Promiscuous filter: MGMT+DATA → MGMT-only (~10 Hz beacons)
- CSI config: disable htltf_en and stbc_htltf2_en (LLTF-only)
LLTF provides 64 subcarriers (HT20), which is sufficient for presence, breathing, and fall detection. The 10 Hz beacon rate eliminates the SPI flash cache contention that caused the crash.
Verified on device 80:b5:4e:c1:be:b8:
- Before: LoadProhibited crash at ~1600-2400 callbacks (every ~70s)
- After: 2700+ callbacks over 4.7 minutes, zero crashes
Backtrace decode confirmed crash in ESP-IDF closed-source WiFi blob:
_xt_lowint1 → wDev_ProcessFiq → spi_flash_restore_cache → cache_ll_l1_resume_icache → EXCVADDR=0x00000004 (NULL deref)
Co-Authored-By: Ruflo & AQE
esptool v5+ rejects hyphenated subcommands. The provision script used 'write-flash', which fails with "invalid choice". Changed to 'write_flash' (underscore), which works with both old and new esptool.
Co-Authored-By: Ruflo & AQE
- Add early rate gate in wifi_csi_callback at 50 Hz (defense-in-depth; does not prevent the crash alone but reduces callback execution time)
- Add null-data injection timer infrastructure (disabled: TX adds interrupt pressure that triggers the SPI cache crash, RuView#396)
- sdkconfig.defaults: add CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y
- sdkconfig.defaults: document SPIRAM XIP attempt (crashes differently)
Co-Authored-By: Ruflo & AQE
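The 50 Hz early rate gate could look something like the sketch below. It is an illustration under assumptions rather than the code in this commit: `csi_rate_gate()` and the two macros are hypothetical names, and the timestamp is passed in as a parameter so the logic can run off-target (on the device it would plausibly come from `esp_timer_get_time()`, which returns microseconds since boot).

```c
#include <stdbool.h>
#include <stdint.h>

#define CSI_MAX_RATE_HZ     50
#define CSI_MIN_INTERVAL_US (1000000 / CSI_MAX_RATE_HZ) /* 20000 us between accepted callbacks */

/* Time of the last accepted callback; initialised so the first call passes. */
static int64_t s_last_accept_us = -CSI_MIN_INTERVAL_US;

/* Returns true if the callback should do its full processing, false if it
 * should return immediately. Intended as the first statement of the callback. */
bool csi_rate_gate(int64_t now_us)
{
    if (now_us - s_last_accept_us < CSI_MIN_INTERVAL_US) {
        return false; /* too soon after the previous accepted callback */
    }
    s_last_accept_us = now_us;
    return true;
}
```

For example, feeding the gate one second of callbacks at 500 Hz (one every 2000 us) lets through one callback per 20000 us window. As the commit notes, this only shortens the time spent in rejected callbacks; it does not reduce the interrupt rate seen by the WiFi driver.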
Post-merge investigation: corrected stability data and v5.4.4 / v5.5.4 tests
While Ruv was deciding what to do with this PR, we ran a much longer soak than the 4-minute window we reported originally. The earlier claim "Stable — 3 nodes, 1.44M+ frames, zero crashes / 4+ minutes per node" did not hold up to longer observation. Corrected data below.
What was wrong with the original test
Real 30-min soak — PR #397 firmware (MGMT-only + fix #1 stack-static) on ESP-IDF v5.4.0
4 crashes in 30 min → ~1 crash every 7.5 min at 10 Hz CSI. Recovery is <2 s, and edge processing re-calibrates within 1200 frames. Acceptable for product use, but not "stable".
Additional experiments today (all Ruv-only baseline, MGMT+DATA promiscuous)
Same backtrace across 3 IDF versions:
Fix attempts that didn't work
Conclusion
The bug is in the ESP-IDF WiFi binary blob on ESP32-S3 + QSPI-display hardware (Waveshare AMOLED 1.8″). It reproduces on v5.4.0, v5.4.4, and v5.5.4 with identical signatures. Application-level mitigations (filter reduction, stack hygiene, IRAM pinning) only change when it fires, not whether it fires.
Recommendation: merge PR #397 as-is. The MGMT-only filter is the only mitigation that keeps crash frequency tolerable for product use (1 crash per ~7 min at 10 Hz CSI, 2 s recovery). Seeds' edge-processing adaptive calibration handles the brief gaps. We'll add a CSI-starvation watchdog in a follow-up PR to catch the silent TG0WDT hangs and turn them into faster reboots. Upstream bug report to Espressif + detailed crash dumps → #396 updated.
What I got wrong
Happy to backfill longer soaks or run any additional configs if useful.
Summary
Five fixes for the ESP32-S3 CSI firmware, tested on a 3-node fleet with 3 Pi Zero seeds.
Builds on top of merged PR #393 (v0.6.1). Addresses two bugs:
- node_id clobber that #393 (fix(firmware): defensive node_id capture prevents runtime clobber (#390)) didn't fully fix (late capture after WiFi init)
- LoadProhibited crash in promiscuous mode (RuView#396)
Commits
1.
fix(firmware): move defensive node_id capture before wifi_init_sta()
PR #393's defensive copy at csi_collector_init() runs AFTER wifi_init_sta(), which corrupts g_nvs_config on our hardware (MAC 80:b5:4e:c1:be:b8). Adds csi_collector_set_node_id(), called immediately after nvs_config_load(), before WiFi init.
Verified: NVS node_id=5 → seed receives node_id=5 (was receiving 1 with #393's fix).
2.
fix(firmware): defensive copy of filter_mac to prevent callback crash
The CSI callback reads g_nvs_config.filter_mac_set on every invocation (100-500 Hz). Same struct corruption from WiFi init. Extends the defensive-copy pattern to filter_mac.
3.
fix(firmware): MGMT-only promiscuous filter to prevent SPI cache crash
The core crash fix. wDev_ProcessFiq (ESP-IDF WiFi blob) crashes in cache_ll_l1_resume_icache when promiscuous mode captures MGMT+DATA frames at 100-500 Hz. Reduces filter to MGMT-only (~10 Hz beacons). See #396 for the full 10-test investigation.
Also re-enables htltf_en and stbc_htltf2_en for full CSI quality (128/256/384 byte frames with LLTF+HT-LTF+STBC).
4.
fix(provision): write-flash → write_flash for esptool v5 compat
esptool v5+ rejects hyphenated subcommands.
5.
fix(firmware): 50 Hz callback rate gate + sdkconfig extra IRAM opt
Defense-in-depth: early rate gate drops excess callbacks before processing. CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y in sdkconfig.defaults. Includes disabled null-data injection timer infrastructure for future use.
Test results
Tested on 3x Waveshare ESP32-S3 AMOLED 1.8" (QFN56 rev v0.2, 8MB PSRAM, 16MB flash).
Full test matrix (10 configurations tested) in #396.
Impact on CSI rate
CSI rate drops from ~500 Hz to ~10 Hz (beacons only). This matches the cog sample rate (10 Hz) and satisfies Nyquist for heart rate (2.0 Hz) and breathing (0.5 Hz), since a 10 Hz sample rate can represent signals up to 5 Hz. The sample_rate constant in edge_processing.c:718 should be updated from 20.0 to 10.0 to match; this is left for a separate commit since it's in Ruv's code.
Refs
Test plan
Co-Authored-By: Ruflo & AQE