perf: add join-hash probe-chain validation before kc<=2 hash specialization #807

Description

@justinjoy

Background

Issue #739 records a negative finding from PR #731: replacing the current FNV-1a join hash path for kc <= 2 with a splitmix64 finalizer regressed CRDT W=1 by +24%, even though a 16-bit chi-squared distribution test looked acceptable.

The current code still uses the FNV-1a path in wirelog/columnar/ops.c::col_join_hash_rel_keys().

Problem

The failed experiment showed that a synthetic 16-bit bucket-distribution test is not sufficient evidence. Any future join-hash specialization must demonstrate its behavior at the actual table sizes and key distributions used by the engine.

Proposed direction

Add a validation/benchmark harness for candidate join hash functions before attempting another engine swap:

  • Measure bucket distribution at actual join table sizes used by target workloads, especially CRDT's 18-19 bit hash tables.
  • Measure probe-chain length, including mean, p95/p99, and max.
  • Run on structured workload keys, not only random small-range pairs.
  • Compare against the current FNV-1a implementation as the baseline.
  • Only after the harness shows a clear win should an engine hash specialization be attempted.
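
As a starting point, the chain-length measurement could be sketched roughly as below. This is a minimal standalone sketch, not the engine's code: `hash2_fn`, `chain_stats`, and `measure_chains` are hypothetical names, the table is assumed to be chained and indexed by masking the low bits of the hash, and the FNV-1a variant here simply feeds the 16 key bytes one at a time, which may differ from what `col_join_hash_rel_keys()` does. Percentiles use the nearest-rank method over non-empty buckets.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical signature for a kc == 2 candidate: hash two 64-bit key columns. */
typedef uint64_t (*hash2_fn)(int64_t k0, int64_t k1);

/* 64-bit FNV-1a over the 16 key bytes (assumed byte order: low byte first). */
static uint64_t hash2_fnv1a(int64_t k0, int64_t k1) {
    uint64_t h = 0xcbf29ce484222325ULL;              /* FNV offset basis */
    uint64_t k[2] = { (uint64_t)k0, (uint64_t)k1 };
    for (int c = 0; c < 2; c++) {
        for (int b = 0; b < 8; b++) {
            h ^= (k[c] >> (8 * b)) & 0xffu;
            h *= 0x100000001b3ULL;                   /* FNV prime */
        }
    }
    return h;
}

/* splitmix64 finalizer; the way the two keys are combined is an assumption. */
static uint64_t hash2_splitmix(int64_t k0, int64_t k1) {
    uint64_t z = (uint64_t)k0 * 0x9e3779b97f4a7c15ULL + (uint64_t)k1;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

typedef struct {
    double   mean;           /* mean chain length over non-empty buckets */
    uint32_t p95, p99, max;  /* nearest-rank percentiles and maximum */
    size_t   used;           /* number of non-empty buckets */
} chain_stats;

/* Bucket n key pairs into a 2^bits table and summarize chain lengths.
 * Returns 0 on success, -1 on allocation failure. */
static int measure_chains(hash2_fn h, const int64_t *k0, const int64_t *k1,
                          size_t n, unsigned bits, chain_stats *out) {
    assert(bits >= 1 && bits <= 30);
    size_t nb = (size_t)1 << bits;
    uint32_t *len = calloc(nb, sizeof *len);
    if (!len) return -1;

    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = ++len[h(k0[i], k1[i]) & (nb - 1)];
        if (c > max) max = c;
    }

    /* hist[l] = number of buckets whose chain has exactly l entries */
    size_t *hist = calloc((size_t)max + 1, sizeof *hist);
    if (!hist) { free(len); return -1; }
    size_t used = 0;
    for (size_t b = 0; b < nb; b++) {
        hist[len[b]]++;
        if (len[b]) used++;
    }

    /* Nearest-rank percentiles: smallest length covering ceil(p * used) buckets. */
    size_t r95 = (used * 95 + 99) / 100, r99 = (used * 99 + 99) / 100;
    size_t cum = 0;
    out->p95 = out->p99 = 0;
    for (size_t l = 1; l <= max; l++) {
        cum += hist[l];
        if (!out->p95 && cum >= r95) out->p95 = (uint32_t)l;
        if (!out->p99 && cum >= r99) out->p99 = (uint32_t)l;
    }
    out->mean = used ? (double)n / (double)used : 0.0;
    out->max  = max;
    out->used = used;

    free(hist);
    free(len);
    return 0;
}
```

A harness built on this would call `measure_chains()` once per candidate with `bits` set to 18 and 19, feeding it key columns dumped from a real CRDT run rather than random pairs, and compare the resulting `chain_stats` against the FNV-1a baseline.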

Acceptance criteria

  • Do not reintroduce the splitmix64 kc <= 2 specialization recorded in #739 (do-not-retry: kc=2 splitmix64 join hash specialization, CRDT W=1 +24%) without this proof.
  • Add repeatable measurements for CRDT and at least one non-CRDT workload such as DOOP or CSPA.
  • Report tuple-count preservation and wall-time impact.
  • If a new hash candidate is proposed, show bucket/probe-chain metrics at the actual table sizes, not only 16-bit chi-squared results.
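
The tuple-count and wall-time criteria could be reported with something like the following sketch. The `run_result` struct and both helpers are hypothetical, and the harness is assumed to have already collected an output tuple count and a wall time (for example, a median over repeated runs) for each workload.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-run record produced by the benchmark harness. */
typedef struct {
    const char *workload;  /* e.g. "CRDT W=1" */
    uint64_t    tuples;    /* output tuple count */
    double      wall_s;    /* wall time in seconds */
} run_result;

/* Relative wall-time change of candidate vs. baseline, in percent.
 * Positive means the candidate is slower. */
static double wall_delta_pct(const run_result *base, const run_result *cand) {
    assert(base->wall_s > 0.0);
    return (cand->wall_s - base->wall_s) / base->wall_s * 100.0;
}

/* Print one report line; returns 1 if the tuple count is preserved, else 0. */
static int report_run(FILE *f, const run_result *base, const run_result *cand) {
    int same = base->tuples == cand->tuples;
    fprintf(f, "%s: tuples %s, wall %+.1f%%\n", base->workload,
            same ? "preserved" : "MISMATCH", wall_delta_pct(base, cand));
    return same;
}
```

A tuple-count mismatch would disqualify a candidate regardless of its wall-time delta, since it means the join produced different results.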
