Skip to content

[wip] rs: sharded fusefilter v2#20644

Open
AskAlexSharov wants to merge 18 commits intomainfrom
alex/sharded_fuse2_35
Open

[wip] rs: sharded fusefilter v2#20644
AskAlexSharov wants to merge 18 commits intomainfrom
alex/sharded_fuse2_35

Conversation

@AskAlexSharov
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for a new RecSplit index format (v2) that uses a sharded fusefilter existence filter to reduce false positives and potentially improve lookup efficiency, while extending test coverage to exercise the new version.

Changes:

  • Introduce sharded fusefilter writer/reader (256 shards by top byte of hash) and wire it into RecSplit/Index for dataStructureVersion ≥ 2.
  • Refactor fusefilter blob header handling and add regression tests around header field offsets and sharded round-trips.
  • Extend RecSplit tests to run the existing two-layer index test suite against v2.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
db/recsplit/recsplit_test.go Runs the two-layer index test for RecSplit version 2.
db/recsplit/recsplit.go Creates/uses a sharded existence filter writer for v2 and closes it on cleanup.
db/recsplit/index.go Adds sharded existence filter reader initialization and lookup gating for v2+.
db/datastruct/fusefilter/fusefilter_writer_test.go Adds round-trip, truncation, and regression tests for sharded writer/reader and header parsing.
db/datastruct/fusefilter/fusefilter_writer.go Adds sharded writer implementation and refactors header serialization helpers.
db/datastruct/fusefilter/fusefilter_reader.go Adds sharded reader implementation and refactors header parsing.
Comments suppressed due to low confidence (1)

db/datastruct/fusefilter/fusefilter_reader.go:101

  • NewReaderOnBytes slices m into header/data without checking len(m) >= filterBlobHeaderSize, and then slices data[:fingerprintsLen] without validating fingerprintsLen against len(data). With truncated/corrupt input this will panic instead of returning an error; please add explicit bounds checks and return a descriptive error.
func NewReaderOnBytes(m []byte, fName string) (*Reader, int, error) {
	filter := &xorfilter.BinaryFuse[uint8]{}

	const headerSize = filterBlobHeaderSize
	header, data := m[:headerSize], m[headerSize:]

	v, features, err := parseHeaderFeatures(header, fName)
	if err != nil {
		return nil, 0, err
	}

	filter.SegmentCount = binary.BigEndian.Uint32(header[4:])
	filter.SegmentCountLength = binary.BigEndian.Uint32(header[4+4:])
	filter.Seed = binary.BigEndian.Uint64(header[4+4+4:])
	filter.SegmentLength = binary.BigEndian.Uint32(header[4+4+4+8:])
	filter.SegmentLengthMask = binary.BigEndian.Uint32(header[4+4+4+8+4:])
	fingerprintsLen := int(binary.BigEndian.Uint64(header[4+4+4+8+4+4:]))

	filter.Fingerprints = data[:fingerprintsLen]
	return &Reader{inner: filter, version: v, features: features, m: m}, headerSize + fingerprintsLen, nil
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/recsplit/index.go Outdated
Comment thread db/datastruct/fusefilter/fusefilter_reader.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends RecSplit’s “less false positives” existence filtering by adding a sharded fusefilter implementation and wiring it into RecSplit/Index for data structure version 2.

Changes:

  • Add v2 existence filter plumbing in recsplit (writer side) and index (reader side) using a new sharded fusefilter format.
  • Implement WriterSharded / ReaderSharded for fusefilter blobs and refactor common header/feature parsing.
  • Expand test coverage to exercise RecSplit v2 and the new sharded fusefilter round-trips.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
db/recsplit/recsplit_test.go Adds TestTwoLayerIndex coverage for version 2.
db/recsplit/recsplit.go Adds v2 existence filter writer and writes it during build.
db/recsplit/index.go Adds v2 existence filter reader and uses it during Lookup/ForceInMem.
db/datastruct/fusefilter/fusefilter_writer_test.go Adds new sharded writer/reader tests and a regression test for segment count fields.
db/datastruct/fusefilter/fusefilter_writer.go Refactors header feature init; implements WriterSharded and shared serialization helper.
db/datastruct/fusefilter/fusefilter_reader.go Refactors header parsing; implements ReaderSharded.
Comments suppressed due to low confidence (1)

db/datastruct/fusefilter/fusefilter_reader.go:100

  • NewReaderOnBytes slices m[:headerSize] and later data[:fingerprintsLen] without any length checks. A truncated/corrupted fusefilter blob (or a shard blob inside NewReaderShardedOnBytes) will panic with an out-of-bounds slice instead of returning an error. Please add upfront bounds checks (len(m) >= headerSize, and headerSize+fingerprintsLen <= len(m)) and return a descriptive error when the blob is too small or claims an impossible fingerprints length.
func NewReaderOnBytes(m []byte, fName string) (*Reader, int, error) {
	filter := &xorfilter.BinaryFuse[uint8]{}

	const headerSize = filterBlobHeaderSize
	header, data := m[:headerSize], m[headerSize:]

	v, features, err := parseHeaderFeatures(header, fName)
	if err != nil {
		return nil, 0, err
	}

	filter.SegmentCount = binary.BigEndian.Uint32(header[4:])
	filter.SegmentCountLength = binary.BigEndian.Uint32(header[4+4:])
	filter.Seed = binary.BigEndian.Uint64(header[4+4+4:])
	filter.SegmentLength = binary.BigEndian.Uint32(header[4+4+4+8:])
	filter.SegmentLengthMask = binary.BigEndian.Uint32(header[4+4+4+8+4:])
	fingerprintsLen := int(binary.BigEndian.Uint64(header[4+4+4+8+4+4:]))

	filter.Fingerprints = data[:fingerprintsLen]
	return &Reader{inner: filter, version: v, features: features, m: m}, headerSize + fingerprintsLen, nil

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/datastruct/fusefilter/fusefilter_reader.go Outdated
Comment thread db/datastruct/fusefilter/fusefilter_reader.go Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants