Improve hash functions: FNV-1a for strings, MurmurHash3 for ireps#8976
Improve hash functions: FNV-1a for strings, MurmurHash3 for ireps#8976tautschnig wants to merge 1 commit intodiffblue:developfrom
Conversation
Replace the djb2-variant string hash (h = h*31 + c) with FNV-1a which has better avalanche properties for strings with common prefixes (typical of CBMC's SSA-renamed symbols). Add a MurmurHash3 finalization mix for additional distribution quality. Switch the irep hash combiner from BASIC (rotate+xor) to MurmurHash3 which was already implemented in irep_hash.h but not enabled. MurmurHash3 provides better mixing of sub-expression hashes. Both changes reduce hash collisions in the dstring table (36% -> ~30%) and the prop_conv expression cache, improving hash table lookup performance throughout the pipeline. Impact on Collections-C monolithic: 51s -> 47s (8% faster) Co-authored-by: Kiro <[email protected]>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves hash distribution and reduces collisions by switching string hashing to FNV-1a with a MurmurHash3-style finalization mix, and enabling MurmurHash3 for irep hash combining.
Changes:
- Replace the existing djb2-variant string hash with FNV-1a and add
fmix64finalization. - Enable
IREP_HASH_MURMURHASH3instead of the BASIC combiner. - Add comments describing the rationale for improved avalanche properties.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/util/string_hash.cpp | Updates string hashing algorithm and introduces MurmurHash3 finalization mix. |
| src/util/irep_hash.h | Switches irep hashing configuration to MurmurHash3. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static inline std::size_t fmix64(std::size_t h) | ||
| { | ||
| h ^= h >> 33; | ||
| h *= 0xff51afd7ed558ccdULL; | ||
| h ^= h >> 33; | ||
| h *= 0xc4ceb9fe1a85ec53ULL; | ||
| h ^= h >> 33; | ||
| return h; | ||
| } |
There was a problem hiding this comment.
fmix64 operates on std::size_t but performs h >> 33 and multiplies by 64-bit constants. On 32-bit platforms, shifting by 33 is undefined behavior and the intended mixing can’t work as designed. Make fmix64 take/return std::uint64_t and cast at the boundary, or provide a 32-bit fmix variant selected by sizeof(std::size_t) (or a static_assert(sizeof(std::size_t) == 8) if 32-bit is unsupported).
There was a problem hiding this comment.
I agree with this comment.
| { | ||
| size_t h=0; | ||
| size_t size=s.size(); | ||
| // Process 8 bytes at a time, then finalize |
There was a problem hiding this comment.
The comment says “Process 8 bytes at a time” but the implementation hashes byte-by-byte. Please update the comment to match the code, or implement the intended 8-byte chunking (being mindful of alignment/endianness and portability).
| // Process 8 bytes at a time, then finalize | |
| // Process the string byte-by-byte, then finalize |
| std::size_t h = 0xcbf29ce484222325ULL; | ||
| std::size_t len = 0; | ||
|
|
||
| for(; *s!=0; s++) | ||
| h=(h<<5)-h+*s; | ||
| for(const char *p = s; *p; ++p) | ||
| ++len; | ||
|
|
||
| return h; | ||
| h ^= len; | ||
|
|
||
| for(; *s != 0; ++s) | ||
| h = (h ^ static_cast<unsigned char>(*s)) * 0x100000001b3ULL; |
There was a problem hiding this comment.
This computes the string length in a first pass and then hashes in a second pass. If keeping length-mixing, consider counting length while hashing and XOR’ing len into h just before fmix64, or use std::strlen(s) (often optimized) to avoid a manual first loop. This reduces extra scanning for long C-strings.
| // #define IREP_HASH_MURMURHASH2A | ||
| // #define IREP_HASH_MURMURHASH3 | ||
| #define IREP_HASH_MURMURHASH3 | ||
|
|
There was a problem hiding this comment.
This header relies on manually commenting/uncommenting macros to select exactly one hash strategy. To prevent accidental multiple selections (or none), add preprocessor checks that enforce exactly one of IREP_HASH_BASIC, IREP_HASH_MURMURHASH2A, IREP_HASH_MURMURHASH3 is defined, and emit a clear #error otherwise.
| #if !defined(IREP_HASH_BASIC) && !defined(IREP_HASH_MURMURHASH2A) && \ | |
| !defined(IREP_HASH_MURMURHASH3) | |
| # error "Exactly one of IREP_HASH_BASIC, IREP_HASH_MURMURHASH2A, or IREP_HASH_MURMURHASH3 must be defined" | |
| #endif | |
| #if (defined(IREP_HASH_BASIC) && defined(IREP_HASH_MURMURHASH2A)) || \ | |
| (defined(IREP_HASH_BASIC) && defined(IREP_HASH_MURMURHASH3)) || \ | |
| (defined(IREP_HASH_MURMURHASH2A) && defined(IREP_HASH_MURMURHASH3)) | |
| # error "Exactly one of IREP_HASH_BASIC, IREP_HASH_MURMURHASH2A, or IREP_HASH_MURMURHASH3 must be defined" | |
| #endif |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #8976 +/- ##
===========================================
+ Coverage 80.50% 80.51% +0.01%
===========================================
Files 1704 1704
Lines 188778 188811 +33
Branches 73 73
===========================================
+ Hits 151975 152025 +50
+ Misses 36803 36786 -17 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Replace the djb2-variant string hash (h = h*31 + c) with FNV-1a which has better avalanche properties for strings with common prefixes (typical of CBMC's SSA-renamed symbols). Add a MurmurHash3 finalization mix for additional distribution quality.
Switch the irep hash combiner from BASIC (rotate+xor) to MurmurHash3 which was already implemented in irep_hash.h but not enabled. MurmurHash3 provides better mixing of sub-expression hashes.
Both changes reduce hash collisions in the dstring table (36% -> ~30%) and the prop_conv expression cache, improving hash table lookup performance throughout the pipeline.
Impact on Collections-C monolithic: 51s -> 47s (8% faster)