Conversation
Pull request overview
Adds support for a new RecSplit index format (v2) that uses a sharded fusefilter existence filter to reduce false positives and potentially improve lookup efficiency, while extending test coverage to exercise the new version.
Changes:
- Introduce sharded fusefilter writer/reader (256 shards by top byte of hash) and wire it into RecSplit/Index for dataStructureVersion ≥ 2.
- Refactor fusefilter blob header handling and add regression tests around header field offsets and sharded round-trips.
- Extend RecSplit tests to run the existing two-layer index test suite against v2.
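The shard routing in the first bullet (256 shards keyed by the top byte of the hash) can be sketched as follows; `shardOf` is a hypothetical helper name for illustration, not part of the fusefilter API:

```go
package main

import "fmt"

// Illustrative sketch of the sharding described above: the sharded
// fusefilter splits keys into 256 shards by the top byte of each
// 64-bit key hash.
func shardOf(hash uint64) uint8 {
	return uint8(hash >> 56) // top 8 bits select one of 256 shards
}

func main() {
	for _, h := range []uint64{0x00abcdef01234567, 0x7f123456789abcde, 0xff00000000000001} {
		fmt.Printf("hash %#016x -> shard %3d\n", h, shardOf(h))
	}
}
```

Routing by the top byte keeps shard selection branch-free and gives a roughly uniform spread when the hash function is well distributed.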
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| db/recsplit/recsplit_test.go | Runs the two-layer index test for RecSplit version 2. |
| db/recsplit/recsplit.go | Creates/uses a sharded existence filter writer for v2 and closes it on cleanup. |
| db/recsplit/index.go | Adds sharded existence filter reader initialization and lookup gating for v2+. |
| db/datastruct/fusefilter/fusefilter_writer_test.go | Adds round-trip, truncation, and regression tests for sharded writer/reader and header parsing. |
| db/datastruct/fusefilter/fusefilter_writer.go | Adds sharded writer implementation and refactors header serialization helpers. |
| db/datastruct/fusefilter/fusefilter_reader.go | Adds sharded reader implementation and refactors header parsing. |
Comments suppressed due to low confidence (1)
db/datastruct/fusefilter/fusefilter_reader.go:101
- `NewReaderOnBytes` slices `m` into header/data without checking `len(m) >= filterBlobHeaderSize`, and then slices `data[:fingerprintsLen]` without validating `fingerprintsLen` against `len(data)`. With truncated/corrupt input this will panic instead of returning an error; please add explicit bounds checks and return a descriptive error.
```go
func NewReaderOnBytes(m []byte, fName string) (*Reader, int, error) {
	filter := &xorfilter.BinaryFuse[uint8]{}
	const headerSize = filterBlobHeaderSize
	header, data := m[:headerSize], m[headerSize:]
	v, features, err := parseHeaderFeatures(header, fName)
	if err != nil {
		return nil, 0, err
	}
	filter.SegmentCount = binary.BigEndian.Uint32(header[4:])
	filter.SegmentCountLength = binary.BigEndian.Uint32(header[4+4:])
	filter.Seed = binary.BigEndian.Uint64(header[4+4+4:])
	filter.SegmentLength = binary.BigEndian.Uint32(header[4+4+4+8:])
	filter.SegmentLengthMask = binary.BigEndian.Uint32(header[4+4+4+8+4:])
	fingerprintsLen := int(binary.BigEndian.Uint64(header[4+4+4+8+4+4:]))
	filter.Fingerprints = data[:fingerprintsLen]
	return &Reader{inner: filter, version: v, features: features, m: m}, headerSize + fingerprintsLen, nil
}
```
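A minimal sketch of the suggested validation, assuming the 36-byte header layout visible in the snippet (4 version/feature bytes, two uint32 segment-count fields, a uint64 seed, two uint32 segment-length fields, and a trailing uint64 fingerprints length); `validateFilterBlob` is a hypothetical helper, not the existing API:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Header layout inferred from the snippet above:
// 4 (version/features) + 4 + 4 (segment counts) + 8 (seed)
// + 4 + 4 (segment length/mask) + 8 (fingerprints length) = 36 bytes.
const filterBlobHeaderSize = 36

// validateFilterBlob is a hypothetical helper showing the two bounds checks
// the review asks for: the blob must hold a full header, and the fingerprints
// length claimed by the header must fit in the remaining bytes.
func validateFilterBlob(m []byte, fName string) (int, error) {
	if len(m) < filterBlobHeaderSize {
		return 0, fmt.Errorf("fusefilter %s: blob too small: %d < header size %d",
			fName, len(m), filterBlobHeaderSize)
	}
	fingerprintsLen := int(binary.BigEndian.Uint64(m[4+4+4+8+4+4:]))
	if fingerprintsLen < 0 || filterBlobHeaderSize+fingerprintsLen > len(m) {
		return 0, fmt.Errorf("fusefilter %s: fingerprints length %d does not fit in blob of %d bytes",
			fName, fingerprintsLen, len(m))
	}
	return fingerprintsLen, nil
}

func main() {
	// Truncated blob: an error instead of an out-of-bounds panic.
	if _, err := validateFilterBlob(make([]byte, 10), "shard.fl"); err != nil {
		fmt.Println("rejected:", err)
	}
	// Minimal well-formed blob: full header, zero fingerprints.
	n, err := validateFilterBlob(make([]byte, filterBlobHeaderSize), "shard.fl")
	fmt.Println(n, err) // 0 <nil>
}
```

Running both checks before any slicing means corrupt or truncated input surfaces as a returned error that callers (including a sharded reader iterating over shard blobs) can propagate.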
Pull request overview
This PR extends RecSplit’s “less false positives” existence filtering by adding a sharded fusefilter implementation and wiring it into RecSplit/Index for data structure version 2.
Changes:
- Add v2 existence filter plumbing in `recsplit` (writer side) and `index` (reader side) using a new sharded fusefilter format.
- Implement `WriterSharded`/`ReaderSharded` for fusefilter blobs and refactor common header/feature parsing.
- Expand test coverage to exercise RecSplit v2 and the new sharded fusefilter round-trips.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| db/recsplit/recsplit_test.go | Adds TestTwoLayerIndex coverage for version 2. |
| db/recsplit/recsplit.go | Adds v2 existence filter writer and writes it during build. |
| db/recsplit/index.go | Adds v2 existence filter reader and uses it during Lookup/ForceInMem. |
| db/datastruct/fusefilter/fusefilter_writer_test.go | Adds new sharded writer/reader tests and a regression test for segment count fields. |
| db/datastruct/fusefilter/fusefilter_writer.go | Refactors header feature init; implements WriterSharded and shared serialization helper. |
| db/datastruct/fusefilter/fusefilter_reader.go | Refactors header parsing; implements ReaderSharded. |
Comments suppressed due to low confidence (1)
db/datastruct/fusefilter/fusefilter_reader.go:100
`NewReaderOnBytes` slices `m[:headerSize]` and later `data[:fingerprintsLen]` without any length checks. A truncated/corrupted fusefilter blob (or a shard blob inside `NewReaderShardedOnBytes`) will panic with an out-of-bounds slice instead of returning an error. Please add upfront bounds checks (`len(m) >= headerSize`, and `headerSize+fingerprintsLen <= len(m)`) and return a descriptive error when the blob is too small or claims an impossible fingerprints length.
```go
func NewReaderOnBytes(m []byte, fName string) (*Reader, int, error) {
	filter := &xorfilter.BinaryFuse[uint8]{}
	const headerSize = filterBlobHeaderSize
	header, data := m[:headerSize], m[headerSize:]
	v, features, err := parseHeaderFeatures(header, fName)
	if err != nil {
		return nil, 0, err
	}
	filter.SegmentCount = binary.BigEndian.Uint32(header[4:])
	filter.SegmentCountLength = binary.BigEndian.Uint32(header[4+4:])
	filter.Seed = binary.BigEndian.Uint64(header[4+4+4:])
	filter.SegmentLength = binary.BigEndian.Uint32(header[4+4+4+8:])
	filter.SegmentLengthMask = binary.BigEndian.Uint32(header[4+4+4+8+4:])
	fingerprintsLen := int(binary.BigEndian.Uint64(header[4+4+4+8+4+4:]))
	filter.Fingerprints = data[:fingerprintsLen]
	return &Reader{inner: filter, version: v, features: features, m: m}, headerSize + fingerprintsLen, nil
}
```
No description provided.