…sure versions are generated differently from independent HLC's
Pull request overview
This PR aims to make HLV-based topology tests deterministic on Windows by spacing peer writes to avoid ties caused by Windows’ lower wall-clock precision (independent HLCs producing identical versions).
Changes:
- Add Windows-only time.Sleep delays before peer document mutations (create/update/delete) to help ensure unique HLV versions across peers.
- Introduce runtime/time usage in the topology HLV test helpers to gate this behavior by OS.
As discussed offline, see if we can use
torcolvin left a comment
I like this fix but I think I'd like to move the hlc implementation to sg-bucket since we use this in rosmar too!
@bbrks can you have a look at this review? This is just uptaking the change from sg-bucket. I originally reviewed this code in sync_gateway that Greg did and then I merged this to sg-bucket.
CBG-5460
Changes Windows builds to get HLC time from GetSystemTimePreciseAsFileTime, reading the high-resolution system clock live on every call.
The Hybrid Logical Clock derives a version's physical component from the wall clock, then clears the low 16 logical bits — making each physical "slot" ~65 µs wide. On Linux/macOS, time.Now() resolves to ~1 ns, so this is a non-issue. On Windows, time.Now() is backed by the coarse system timer (~0.5–15 ms). In topology tests, peers writing in quick succession produced identical HLC versions, because many writes fell inside a single coarse tick.
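To make the slot arithmetic concrete, here is a minimal sketch of the masking step described above. The names `logicalBits` and `physicalSlot` are illustrative and are not taken from the actual HLC code; the only assumption is that the low 16 bits of a nanosecond timestamp are cleared.

```go
package main

import "fmt"

// logicalBits is the number of low bits reserved for the logical counter
// (16, per the description above). The name is hypothetical.
const logicalBits = 16

// physicalSlot clears the low 16 bits of a nanosecond timestamp, leaving
// only the physical component. Hypothetical helper, not the real HLC API.
func physicalSlot(unixNanos uint64) uint64 {
	const mask = uint64(1)<<logicalBits - 1
	return unixNanos &^ mask
}

func main() {
	// Slot width in nanoseconds: 2^16 = 65536 ns, roughly 65.5 µs.
	fmt.Println("slot width (ns):", uint64(1)<<logicalBits)

	base := uint64(1) << 40 // a timestamp whose low 16 bits are already zero

	// Two reads 10 µs apart share a slot; two reads 70 µs apart do not.
	fmt.Println(physicalSlot(base) == physicalSlot(base+10_000))
	fmt.Println(physicalSlot(base) == physicalSlot(base+70_000))
}
```

A clock that only advances every 0.5 to 15 ms reports the same nanosecond value for every read inside one tick, so all of those reads fall into the same slot.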
The initial fix used QueryPerformanceCounter (QPC): snapshot a (QPC ticks, time.Now()) anchor pair once at init(), then compute anchorNanos + elapsedTicks (converted to nanoseconds) on each read. QPC resolves to ~100 ns, which removes the precision problem.
However, the two clocks ended up with a fixed, never-correcting offset of up to ~15 ms between them, and because the offset depends on startup timing, it is random per process run. This constantly tripped the cv.ver <= cas re-stamp path and produced intermittent topology-test failures.
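The arithmetic behind that fixed offset can be sketched without any Windows APIs. The `anchor` type, its fields, and the numbers below are hypothetical; the sketch only assumes that the wall-clock half of the anchor was some amount stale when it was captured, and shows that every later reading then carries exactly that same error.

```go
package main

import "fmt"

// anchor pairs a counter reading with the wall-clock reading captured at
// the same moment. Hypothetical type, for illustration only.
type anchor struct {
	ticks     int64 // performance-counter ticks at startup
	wallNanos int64 // wall-clock nanoseconds at startup, from a coarse clock
}

// now extrapolates wall time as anchorNanos + elapsed ticks in nanoseconds.
// Whole seconds and the remainder are converted separately so the
// multiplication cannot overflow int64 on long-running processes.
func (a anchor) now(ticks, ticksPerSecond int64) int64 {
	elapsed := ticks - a.ticks
	sec := elapsed / ticksPerSecond
	rem := elapsed % ticksPerSecond
	return a.wallNanos + sec*1_000_000_000 + rem*1_000_000_000/ticksPerSecond
}

func main() {
	const freq = 10_000_000 // assumed 10 MHz counter, i.e. 100 ns per tick

	trueStart := int64(1_000_007_000_000) // actual time at startup (made up)
	staleBy := int64(7_000_000)           // coarse clock was 7 ms behind
	a := anchor{ticks: 500, wallNanos: trueStart - staleBy}

	// One second and 2.5 seconds after startup, the error is unchanged.
	for _, elapsedTicks := range []int64{10_000_000, 25_000_000} {
		trueNow := trueStart + elapsedTicks*100
		fmt.Println("offset (ns):", trueNow-a.now(a.ticks+elapsedTicks, freq))
	}
}
```

Because nothing ever re-reads the wall clock after the anchor is taken, the startup error is baked in for the lifetime of the process.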
Switched to GetSystemTimePreciseAsFileTime, which fixes the issue: it offers the same precision, and both base and rosmar now read the same system clock.
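GetSystemTimePreciseAsFileTime returns a FILETIME, which counts 100 ns intervals since 1601-01-01 UTC in two 32-bit halves. The sketch below shows only the conversion to Unix nanoseconds; the Win32 call itself is not invoked so the example runs on any OS, and `filetimeToUnixNanos` is a hypothetical helper rather than the function added in this PR.

```go
package main

import "fmt"

// filetimeUnixEpochOffset is the number of 100 ns intervals between the
// FILETIME epoch (1601-01-01 UTC) and the Unix epoch (1970-01-01 UTC).
const filetimeUnixEpochOffset = 116444736000000000

// filetimeToUnixNanos combines the two 32-bit halves of a FILETIME and
// converts the result to nanoseconds since the Unix epoch.
// Hypothetical helper for illustration.
func filetimeToUnixNanos(low, high uint32) int64 {
	ft := int64(high)<<32 | int64(low)
	return (ft - filetimeUnixEpochOffset) * 100
}

func main() {
	// A FILETIME ten intervals (1 µs) after the Unix epoch.
	ft := uint64(filetimeUnixEpochOffset) + 10
	fmt.Println(filetimeToUnixNanos(uint32(ft), uint32(ft>>32)))
}
```

Since the value is read fresh on every call, there is no startup anchor that can go stale, and every reader of the same system clock sees the same timeline.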
Sync Gateway's HLC and Couchbase Server's CAS are genuinely independent clocks on separate nodes, and the system is designed to converge under that skew — the CAS re-stamp exists precisely for it. When the topology test failed, the data still converged correctly (every peer agreed on the same cv and pv); only the test harness's predicted version was wrong.