fix(validator): serialize hotkey contract writes to prevent nonce collisions #457
Open
anderdc wants to merge 1 commit into
…lisions The forward loop and axon handlers both sign contract writes with the validator hotkey over separate substrate connections. Each auto-fetches the best-block nonce independently, so two concurrent writes can grab the same nonce; one lands and the other is rejected and pool-banned (1012). A reserve flood (e.g. during a halt) made this constant and starved confirm/timeout votes until swaps blew past their deadline. Share one write lock across both clients, held across nonce-fetch -> submit -> inclusion so the nonce advances before the next signer composes. Reads stay on their per-connection locks and remain parallel.
This was referenced Jun 9, 2026
jaso0n0818 added a commit to jaso0n0818/allways that referenced this pull request Jun 9, 2026
…lisions
The validator signs contract writes with the same hotkey over two separate substrate connections: the forward loop (contract_client / self.subtensor) and the axon handlers (axon_contract_client / axon_subtensor). Both call create_signed_extrinsic, which auto-fetches the nonce via AccountNonceApi. That is the best-block nonce, which does not count pending pool txs. When both clients race within the same block window they fetch the same nonce N, one tx lands and the other is banned (1012 Transaction is temporarily banned), starving the forward loop and causing delivered swaps to be slashed.
Add an optional write_lock parameter to AllwaysContractClient.__init__. exec_contract_raw acquires the lock across nonce-fetch + submit + inclusion, so the best-block nonce is guaranteed to advance before the sibling client composes its next extrinsic. The pre-flight balance read is intentionally left outside the lock so reads remain parallel.
In neurons/validator.py, create one threading.Lock as self._write_lock and pass it to both contract_client and axon_contract_client at construction.
Lock ordering: axon_lock -> write_lock (axon handlers) and write_lock -> substrate_lock (forward loop). No path takes write_lock -> axon_lock, so no deadlock cycle.
Backward compat: write_lock defaults to None; omitting it produces a _NullContext no-op so existing call sites and tests are unaffected.
New tests/test_write_lock_serialization.py (6 tests) verifies wiring (shared lock stored on both clients), lock held during submit, no error without write_lock, and balance read fires before write_lock is acquired.
Closes entrius#457
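The lock-ordering argument in the commit message can be checked mechanically. The following is a minimal sketch, not code from the commit: it treats each "held lock -> next lock" pair as a directed edge and looks for a cycle with a depth-first search. The helper name has_cycle and the edge-list representation are assumptions made for illustration; only the lock names come from the commit message.

```python
def has_cycle(edges):
    """Return True if the directed lock-order graph contains a cycle.

    `edges` is a list of (held_lock, lock_acquired_next) pairs.
    Hypothetical helper for illustration only.
    """
    graph = {}
    for held, acquired in edges:
        graph.setdefault(held, set()).add(acquired)
        graph.setdefault(acquired, set())

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return True  # back edge: a lock-order cycle exists
            if state[nxt] == UNVISITED and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNVISITED and visit(node) for node in graph)


# Orderings stated in the commit message.
lock_order = [
    ("axon_lock", "write_lock"),       # axon handlers
    ("write_lock", "substrate_lock"),  # forward loop
]
current_is_cyclic = has_cycle(lock_order)

# A hypothetical path taking write_lock -> axon_lock would close a cycle.
with_reverse_path = has_cycle(lock_order + [("write_lock", "axon_lock")])
```

With only the two stated orderings the graph is acyclic; adding the reverse edge is what would make a deadlock possible.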
What
Share one write lock across the forward-loop contract client and the axon-handler contract client, held from nonce-fetch through inclusion, so the validator hotkey's nonce sequence can't collide.
Why
The validator signs contract writes with the same hotkey over two separate substrate connections: the forward loop (self.contract_client on self.subtensor: confirm/timeout/extensions) and the axon handlers (self.axon_contract_client on self.axon_subtensor: vote_reserve/vote_activate). create_signed_extrinsic auto-fetches the nonce via AccountNonceApi.account_nonce, the best-block nonce, which doesn't count pending pool txs. Within a block window every fetch returns the same number. With two uncoordinated signers on one account, both can grab nonce N; one lands, the other is rejected and the tx-pool bans it (1012 Transaction is temporarily banned).
During a halt this became constant: the axon side flooded vote_reserve attempts, which contended for and advanced the nonce and starved the forward loop's confirm_swap votes; delivered swaps blew past timeout_block and got slashed (e.g. swaps 3728/3729). This contention is not halt-specific; sustained reserve volume can reproduce it.
A dedicated subtensor node does not fix this: the node returns the same best-block nonce to both connections; it detects the duplicate, it doesn't prevent it. The fix is client-side coordination.
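To make the collision concrete, here is a toy model of the race. It is a sketch under stated assumptions, not the substrate API: FakeChain, account_nonce, submit and produce_block are hypothetical names, and the pool is reduced to "one pending tx per nonce". It only illustrates why two signers that read the best-block nonce in the same block window collide, and why waiting for inclusion between writes avoids it.

```python
class FakeChain:
    """Toy stand-in for a node (hypothetical, for illustration only).

    The best-block nonce advances only when a block includes a tx;
    pending pool txs are not counted, mirroring the behaviour described above.
    """

    def __init__(self):
        self.best_block_nonce = 0
        self.pool = {}       # nonce -> signer label of the pending tx
        self.rejected = []   # (signer, nonce) pairs the pool refused

    def account_nonce(self):
        # Best-block nonce: ignores anything still pending in the pool.
        return self.best_block_nonce

    def submit(self, signer, nonce):
        # A second tx with an already-used nonce is refused.
        if nonce in self.pool or nonce < self.best_block_nonce:
            self.rejected.append((signer, nonce))
            return False
        self.pool[nonce] = signer
        return True

    def produce_block(self):
        # Include pending txs in nonce order and advance the account nonce.
        while self.best_block_nonce in self.pool:
            del self.pool[self.best_block_nonce]
            self.best_block_nonce += 1


# Uncoordinated: both signers fetch the nonce before either tx is included.
racy = FakeChain()
n_forward = racy.account_nonce()
n_axon = racy.account_nonce()
racy_forward_ok = racy.submit("forward", n_forward)
racy_axon_ok = racy.submit("axon", n_axon)
racy.produce_block()

# Serialized: each write runs fetch -> submit -> inclusion before the next.
serial = FakeChain()
serial_results = []
for signer in ("forward", "axon"):
    nonce = serial.account_nonce()
    serial_results.append(serial.submit(signer, nonce))
    serial.produce_block()
```

In the uncoordinated run both signers read the same nonce and the second submit is refused; in the serialized run each signer sees an advanced nonce and both writes are accepted.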
How
AllwaysContractClient takes an optional shared write_lock; exec_contract_raw holds it across nonce-fetch → submit → (wait_for_inclusion) so the best-block nonce advances before the next signer composes. neurons/validator.py creates one write_lock and passes it to both clients.
Lock order is axon_lock → write_lock (axon writes) and write_lock → main lock (forward writes); no path takes write_lock → axon_lock, so no cycle.
Scope / tradeoff
Small, contained safeguard: core swap logic is untouched, only the low-level submit path gains a lock. Writes now serialize across connections (~1 write/block with wait-for-inclusion), a ceiling far above current load. The planned force_batch single-writer flush is the throughput follow-up; this lands first as the safeguard.
Tests
TestWriteLockSerialization: shared vs private lock wiring, and that the lock is held during submit but not during the account read (reads stay parallel). ruff check + ruff format clean.
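The wiring described under How, and the properties the tests check, can be sketched roughly as follows. This is not the PR's code: SketchContractClient is a hypothetical stand-in for AllwaysContractClient, the chain interaction is replaced by appending to a list, and contextlib.nullcontext is used here as the no-op default in place of whatever the real client uses when write_lock is omitted.

```python
import threading
from contextlib import nullcontext


class SketchContractClient:
    """Hypothetical stand-in for AllwaysContractClient (illustration only)."""

    def __init__(self, name, log, write_lock=None):
        self.name = name
        self.log = log
        # Assumption: omitting write_lock falls back to a no-op context
        # manager, so existing call sites keep working unchanged.
        self.write_lock = write_lock if write_lock is not None else nullcontext()

    def _lock_is_held(self):
        # threading.Lock exposes locked(); the no-op context does not.
        locked = getattr(self.write_lock, "locked", None)
        return locked() if locked is not None else False

    def _read_balance(self):
        # Pre-flight read: deliberately outside the write lock.
        self.log.append((self.name, "balance_read", self._lock_is_held()))

    def exec_contract_raw(self, method):
        self._read_balance()
        # Held across nonce-fetch -> submit -> inclusion in the real flow;
        # here the whole write is collapsed into one log entry.
        with self.write_lock:
            self.log.append((self.name, "submit:" + method, self._lock_is_held()))


log = []
shared_lock = threading.Lock()  # one lock for both clients

forward = SketchContractClient("forward", log, write_lock=shared_lock)
axon = SketchContractClient("axon", log, write_lock=shared_lock)
legacy = SketchContractClient("legacy", log)  # no write_lock passed

forward.exec_contract_raw("confirm_swap")
axon.exec_contract_raw("vote_reserve")
legacy.exec_contract_raw("confirm_swap")
```

Each log entry records whether the shared lock was held at that moment, so the submit entries for the two wired clients report the lock as held while every balance read reports it as free.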