fix(validator): serialize hotkey contract writes to prevent nonce collisions#457

Open
anderdc wants to merge 1 commit into test from fix/serialize-contract-writes-nonce

anderdc (Collaborator) commented Jun 8, 2026

What

Share one write lock across the forward-loop contract client and the axon-handler contract client, held from nonce-fetch through inclusion, so the validator hotkey's nonce sequence can't collide.

Why

The validator signs contract writes with the same hotkey over two separate substrate connections — the forward loop (self.contract_client on self.subtensor: confirm/timeout/extensions) and the axon handlers (self.axon_contract_client on self.axon_subtensor: vote_reserve/vote_activate).

create_signed_extrinsic auto-fetches the nonce via AccountNonceApi.account_nonce — the best-block nonce, which doesn't count pending pool txs. Within a block window every fetch returns the same number. With two uncoordinated signers on one account, both can grab nonce N; one lands, the other is rejected and the tx-pool bans it (1012 Transaction is temporarily banned).
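The collision described above can be reproduced with a toy model. This is a minimal sketch, not the substrate API: `ToyChain`, `uncoordinated`, and `serialized` are hypothetical names, and the pool logic is reduced to "reject a nonce that is stale or already pending".

```python
class ToyChain:
    """Toy stand-in for a node. Hypothetical; not the substrate interface."""

    def __init__(self):
        self.best_block_nonce = 0  # nonce as of the best block
        self.pool = []             # pending (nonce, label) pairs

    def account_nonce(self):
        # Mirrors a best-block nonce query: pending pool txs are ignored.
        return self.best_block_nonce

    def submit(self, nonce, label):
        # Reject a nonce that is stale or already pending in the pool.
        if nonce < self.best_block_nonce or any(n == nonce for n, _ in self.pool):
            return False
        self.pool.append((nonce, label))
        return True

    def produce_block(self):
        # Include pending txs in nonce order and advance the best-block nonce.
        for nonce, _ in sorted(self.pool):
            if nonce == self.best_block_nonce:
                self.best_block_nonce += 1
        self.pool.clear()


def uncoordinated(chain):
    # Both signers fetch before either submits, inside one block window.
    n_forward = chain.account_nonce()
    n_axon = chain.account_nonce()
    return chain.submit(n_forward, "confirm_swap"), chain.submit(n_axon, "vote_reserve")


def serialized(chain):
    # Each signer fetches, submits, and waits for inclusion before the next starts.
    results = []
    for label in ("confirm_swap", "vote_reserve"):
        nonce = chain.account_nonce()
        results.append(chain.submit(nonce, label))
        chain.produce_block()
    return tuple(results)


collided = uncoordinated(ToyChain())  # second submit reuses nonce 0 and is rejected
in_order = serialized(ToyChain())     # both submits are accepted
```

In the uncoordinated case both signers compose with nonce 0, so only one write is accepted. In the serialized case the second signer sees the advanced nonce.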

During a halt this became constant: the axon side flooded vote_reserve attempts, which contended for and advanced the nonce, starving the forward loop's confirm_swap votes. Delivered swaps blew past timeout_block and got slashed (e.g. swaps 3728/3729). The contention is not halt-specific; sustained reserve volume can reproduce it.

A dedicated subtensor node does not fix this: the node returns the same best-block nonce to both connections, so it can detect the duplicate but cannot prevent it. The fix is client-side coordination.

How

  • AllwaysContractClient takes an optional shared write_lock; exec_contract_raw holds it across nonce-fetch → submit → (wait_for_inclusion) so the best-block nonce advances before the next signer composes.
  • neurons/validator.py creates one write_lock and passes it to both clients.
  • Reads stay on their per-connection recv locks, so they remain parallel — no change to axon responsiveness.
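The wiring in the list above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `SketchContractClient`, `FakeSubstrate`, `conn_lock`, `fetch_nonce`, and `submit_and_wait` are hypothetical names, the private-lock default is an assumption, and the lock nesting is simplified rather than a reproduction of the validator's actual lock order.

```python
import threading


class FakeSubstrate:
    """Hypothetical connection; both instances share one account state dict."""

    def __init__(self, state):
        self.state = state

    def query(self, key):
        return self.state[key]

    def fetch_nonce(self):
        return self.state["nonce"]

    def submit_and_wait(self, call, nonce):
        # "Inclusion" here just advances the shared nonce.
        self.state["nonce"] = nonce + 1
        return call, nonce


class SketchContractClient:
    """Hypothetical stand-in for AllwaysContractClient."""

    def __init__(self, substrate, write_lock=None):
        self.substrate = substrate
        self.conn_lock = threading.Lock()  # per-connection lock
        # Shared when provided; otherwise a private lock (assumed default).
        self.write_lock = write_lock if write_lock is not None else threading.Lock()

    def read(self, key):
        # Reads take only the per-connection lock, so they stay parallel.
        with self.conn_lock:
            return self.substrate.query(key)

    def exec_write(self, call):
        # Hold the shared lock from nonce fetch through inclusion.
        with self.write_lock:
            with self.conn_lock:
                nonce = self.substrate.fetch_nonce()
                return self.substrate.submit_and_wait(call, nonce)


state = {"nonce": 0}
shared = threading.Lock()
forward = SketchContractClient(FakeSubstrate(state), write_lock=shared)
axon = SketchContractClient(FakeSubstrate(state), write_lock=shared)

used = []  # nonces consumed by both writers


def worker(client, label):
    for _ in range(50):
        used.append(client.exec_write(label)[1])


threads = [
    threading.Thread(target=worker, args=(forward, "confirm_swap")),
    threading.Thread(target=worker, args=(axon, "vote_reserve")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because fetch and submit happen under one shared lock, the two writers never consume the same nonce, while `read` on either client never waits on the other client's writes.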

Lock order is axon_lock → write_lock (axon writes) and write_lock → main lock (forward writes); no path takes write_lock → axon_lock, so no cycle.
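The no-cycle claim can be checked mechanically by treating each "A is held while B is acquired" pair as a directed edge and looking for a cycle. A minimal sketch; `has_cycle` is a hypothetical helper, and the lock names are copied from the sentence above.

```python
def has_cycle(orders):
    """Return True if the lock-acquisition orders contain a cycle."""
    graph = {}
    for first, second in orders:
        graph.setdefault(first, set()).add(second)
        graph.setdefault(second, set())

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: a cycle exists
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


# Orders stated in the PR description.
pr_orders = [("axon_lock", "write_lock"), ("write_lock", "main_lock")]
safe = has_cycle(pr_orders)

# Adding the path the PR rules out would create a cycle.
unsafe = has_cycle(pr_orders + [("write_lock", "axon_lock")])
```

With only the two stated orders the graph is a chain, so no two threads can each hold a lock the other is waiting for.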

Scope / tradeoff

Small, contained safeguard: core swap logic is untouched, and only the low-level submit path gains a lock. Writes now serialize across connections (about 1 write per block with wait-for-inclusion), a ceiling that is far above current load. The planned force_batch single-writer flush is the throughput follow-up; this lands first as the safeguard.

Tests

  • TestWriteLockSerialization: shared vs private lock wiring, and that the lock is held during submit but not during the account read (reads stay parallel).
  • Full suite: 692 passed; ruff check + ruff format clean.
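A test of the "held during submit, not during the read" property might look like the sketch below. It is not the PR's TestWriteLockSerialization: `RecordingSubstrate`, `exec_write`, and `TestWriteLockSketch` are hypothetical, and the check relies only on `threading.Lock.locked()`.

```python
import threading
import unittest


class RecordingSubstrate:
    """Hypothetical fake that records whether the lock was held at each step."""

    def __init__(self, lock):
        self.lock = lock
        self.locked_during = {}

    def read_account(self):
        self.locked_during["read"] = self.lock.locked()

    def submit(self):
        self.locked_during["submit"] = self.lock.locked()


def exec_write(substrate, write_lock):
    # Pre-flight read stays outside the lock so reads remain parallel.
    substrate.read_account()
    with write_lock:
        substrate.submit()


class TestWriteLockSketch(unittest.TestCase):
    def test_lock_held_for_submit_only(self):
        lock = threading.Lock()
        substrate = RecordingSubstrate(lock)
        exec_write(substrate, lock)
        self.assertEqual(substrate.locked_during, {"read": False, "submit": True})
        self.assertFalse(lock.locked())  # released after the write
```

The file can be run with `python -m unittest`.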

…lisions

The forward loop and axon handlers both sign contract writes with the
validator hotkey over separate substrate connections. Each auto-fetches
the best-block nonce independently, so two concurrent writes can grab the
same nonce; one lands and the other is rejected and pool-banned (1012).
A reserve flood (e.g. during a halt) made this constant and starved
confirm/timeout votes until swaps blew past their deadline.

Share one write lock across both clients, held across nonce-fetch ->
submit -> inclusion so the nonce advances before the next signer composes.
Reads stay on their per-connection locks and remain parallel.
jaso0n0818 added a commit to jaso0n0818/allways that referenced this pull request Jun 9, 2026
…lisions

The validator signs contract writes with the same hotkey over two separate
substrate connections: the forward loop (contract_client / self.subtensor)
and the axon handlers (axon_contract_client / axon_subtensor). Both call
create_signed_extrinsic which auto-fetches the nonce via AccountNonceApi —
the best-block nonce, which does not count pending pool txs. When both
clients race within the same block window they fetch the same nonce N, one
tx lands and the other is banned (1012 Transaction is temporarily banned),
starving the forward loop and causing delivered swaps to be slashed.

Add an optional write_lock parameter to AllwaysContractClient.__init__.
exec_contract_raw acquires the lock across nonce-fetch + submit + inclusion,
so the best-block nonce is guaranteed to advance before the sibling client
composes its next extrinsic. The pre-flight balance read is intentionally
left outside the lock so reads remain parallel.

In neurons/validator.py, create one threading.Lock as self._write_lock and
pass it to both contract_client and axon_contract_client at construction.

Lock ordering: axon_lock -> write_lock (axon handlers) and write_lock ->
substrate_lock (forward loop). No path takes write_lock -> axon_lock, so
no deadlock cycle.

Backward compat: write_lock defaults to None; omitting it produces a
_NullContext no-op so existing call sites and tests are unaffected.
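The backward-compatible default can be sketched with a hand-rolled no-op context manager. This is an assumption about its shape, not the commit's code: `NullContext`, `resolve_write_lock`, and `guarded_submit` are hypothetical names. The standard library's `contextlib.nullcontext` provides the same behavior.

```python
import threading


class NullContext:
    """No-op context manager; hypothetical stand-in for the commit's _NullContext."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions


def resolve_write_lock(write_lock=None):
    # A caller that passes nothing gets a no-op instead of a real lock.
    return write_lock if write_lock is not None else NullContext()


def guarded_submit(write_lock=None):
    # The same `with` statement works for both a real lock and the no-op.
    with resolve_write_lock(write_lock):
        return "submitted"


without_lock = guarded_submit()
real_lock = threading.Lock()
with_lock = guarded_submit(real_lock)
```

Keeping one `with` path avoids branching at every call site on whether a lock was supplied.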

New tests/test_write_lock_serialization.py (6 tests) verifies wiring
(shared lock stored on both clients), lock held during submit, no error
without write_lock, and balance read fires before write_lock is acquired.

Closes entrius#457