Skip to content

Fix tend-thread busy-loop on NodeValidator silent failure#16

Merged
zuchmanski merged 1 commit into
v5-developfrom
inf-684-fix-tend-thread-crash
Apr 27, 2026
Merged

Fix tend-thread busy-loop on NodeValidator silent failure#16
zuchmanski merged 1 commit into
v5-developfrom
inf-684-fix-tend-thread-crash

Conversation

@zuchmanski
Copy link
Copy Markdown

@zuchmanski zuchmanski commented Apr 27, 2026

Fixes INF-684.

TL;DR

Production incident on 2026-04-24: a single Ruby pid's Aerospike tend thread busy-looped at ~50 errors/sec for ~1 hour during a peer-node flap, returning 5XX from /v1/risk until the pid was rolled. Three layered fixes restore self-healing.

Root cause

A bare rescue in NodeValidator#get_hosts silently swallows connection / info-request errors. When a server hangs, the constructor returns a half-initialized nv (@name = nil, @aliases = []) that escapes into Cluster#create_node → produces a Node with @host = nil → next tend cycle calls host.name on nil → NoMethodError.

The tend loop's sleep sat on the success path of an implicit begin/rescue, so the rescue path skipped backoff and pegged CPU at ~50 iterations/sec.

Production log signals

Two lines from pid 361 (tend thread 26450112) captured the bug:

# Single occurrence — half-init nv entering @cluster_nodes (peers.rb:47):
[2026-04-24T18:08:58.190Z] WARN -- : Peer node BB92118BAB14E12 is different than actual node ␣ for host 172.31.73.252:3000

# ~50/sec for ~1 hour, 6,532 occurrences in a 2.5-min capture (cluster.rb:376):
[2026-04-24T18:08:59.394Z] ERROR -- : Exception occured during tend: undefined method 'name' for nil

is annotation; the raw log has a literal space — "actual node #{nv.name}" with nv.name = nil renders as a double-space gap, invisible in any rendered view. That's why the warning slipped past triage: it looked like a benign spacing artifact, not a state-corruption signal.

Four silent-failure paths in NodeValidator

@name is set only inside get_hosts, after a successful info round-trip. Four points can leave it nil — all silent:

# Failure point Resulting state
1 Cluster::CreateConnection raises name=nil, aliases=[]
2 Info.request raises (incident's path) name=nil, aliases=[]
3 info_map['node'] is nil name=nil, aliases=[]
4 info_map[address_command].split raises name=set, aliases=[]

Path 4 is special. @name is assigned before the NoMethodError, so the existing nv.name != peer.node_name mismatch check at peers.rb:46 does not fire and the half-init nv falls through to cluster.create_node. This is why Fix B's guard tests both predicates and must run before the mismatch check.

Fixes

A — lib/aerospike/node_validator.rb

Post-condition raising Aerospike::Exceptions::InvalidNode when half-initialized; bare rescue becomes rescue => e with debug-level logging. Closes all four paths at the source. InvalidNode is already an Aerospike::Exceptions::Aerospike subclass, so the existing rescue at peers.rb:60 catches it for free.

B — lib/aerospike/node/refresh/peers.rb

Defense-in-depth guard before the mismatch check:

if nv.name.nil? || nv.aliases.empty?
  ::Aerospike.logger.warn("Skipping peer #{peer.node_name} at #{host}: ...")
  next
end

next (not break) — other hosts of the same peer might validate fine.

C — lib/aerospike/cluster.rb

Restructure launch_tend_thread from implicit begin/rescue (with sleep on the success path) to explicit begin/rescue with sleep outside, so backoff runs unconditionally. Without this, any future tend pathology — not just this bug — pegs CPU.

Self-healing

With these fixes, a hung peer no longer corrupts @cluster_nodes. Each tend cycle re-attempts validation; the client recovers automatically once the server-side cluster prunes the bad peer. No process restart needed.

Tests

8 new examples; each verified to fail when its corresponding fix is reverted:

Fix Spec Cases Smoking gun on revert
A node_validator_spec.rb 2 (paths 1–3 + path 4) expected InvalidNode but nothing raised
B peers_spec.rb (new) 4 (happy + 2 guard cases + next vs break) peers.nodes contained {"BB92118BAB14E12" => nil}
C cluster_spec.rb 1 56,698 tend calls vs 0 sleep calls in 2s — production busy-loop reproduced

CI's global Support.client tend thread caused intermittent mock conflicts in 4 pre-existing node_validator_spec tests; switched those to allow(...) (from expect(...).to receive) to tolerate background callers without losing assertions.

Notes

  • Bug exists in aerospike/aerospike-client-ruby master too — an upstream version bump would not fix it.
  • Operational mitigation if it recurs before merge: roll the affected pids.
  • Full incident analysis (timeline, why-only-one-pid, log evidence) in INF-684.

@zuchmanski zuchmanski force-pushed the inf-684-fix-tend-thread-crash branch from 5179ed3 to 8fb3ac9 Compare April 27, 2026 09:12
A bare rescue in NodeValidator#get_hosts lets a half-initialized
validator escape to Cluster#create_node, producing a Node with
@host = nil that crashes every subsequent tend cycle. The tend
loop's sleep on the success path of begin/rescue turned that into
a CPU-pegged busy-loop at ~50 errors/sec until pid restart.

- NodeValidator: post-condition raises InvalidNode if @name nil or
  @Aliases empty; log swallowed errors at debug instead of bare rescue.
- Node::Refresh::Peers: integration-point guard skips half-initialized
  validators before cluster.create_node (covers the @name-set,
  aliases-empty path the existing mismatch check cannot catch).
- Cluster#launch_tend_thread: explicit begin/rescue with outer sleep
  prevents busy-loop on persistent tend failures.

Specs cover all four NodeValidator silent-failure paths, both half-
init shapes in Peers refresh, and rescue-path-sleep behavior; each
fix verified by revert-and-rerun.

INF-684
@zuchmanski zuchmanski force-pushed the inf-684-fix-tend-thread-crash branch from 8fb3ac9 to 0d667dc Compare April 27, 2026 09:36
@zuchmanski zuchmanski requested a review from afterdesign April 27, 2026 10:30
@zuchmanski zuchmanski marked this pull request as ready for review April 27, 2026 10:30
@zuchmanski zuchmanski requested a review from mknapik April 27, 2026 10:32
@zuchmanski zuchmanski merged commit 96f8236 into v5-develop Apr 27, 2026
4 checks passed
@afterdesign afterdesign deleted the inf-684-fix-tend-thread-crash branch April 28, 2026 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants