Add support for Equality Deletes on DeleteFileIndex #3285
rambleraptor wants to merge 4 commits into apache:main
@rambleraptor I think we should add a regression test for schema evolution here. This pruning path assumes that the current table type of an equality field is the same type that was in use when the data file and the equality delete were written, which is not always true after a legal type promotion. For reference, Iceberg Java had to address the same schema-evolution issue in apache/iceberg#15268, where the fix was to stop assuming that the current schema is always the right one for resolving equality-delete fields.
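To make the concern concrete, here is a minimal sketch of one way a type mismatch can surface. It is not PyIceberg code: `decode_bound` is a hypothetical helper, and the `int` to `long` promotion is an assumed example. It relies only on the fact that Iceberg serializes `int` bounds as 4 little-endian bytes and `long` bounds as 8.

```python
import struct

# Hypothetical helper for illustration only (not a PyIceberg function).
# Iceberg single-value serialization: int -> 4 bytes LE, long -> 8 bytes LE.
_FORMATS = {"int": "<i", "long": "<q"}


def decode_bound(raw: bytes, type_name: str) -> int:
    """Decode a serialized column bound using the given type name."""
    return struct.unpack(_FORMATS[type_name], raw)[0]


# Assumed scenario: the file was written while the column was an int,
# and the column was later promoted to long in the table schema.
written_type = "int"
current_type = "long"
lower_bound = struct.pack("<i", 42)

# Decoding with the type in effect at write time works.
value_with_written_type = decode_bound(lower_bound, written_type)

# Decoding with the current table type fails: 4 bytes cannot be read as a long.
try:
    decode_bound(lower_bound, current_type)
    decoded_with_current_type = True
except struct.error:
    decoded_with_current_type = False
```

A regression test along these lines would write data and an equality delete under the old type, promote the column, and then check that pruning still resolves the equality field correctly.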

@geruh @kevinjqliu @Fokko please take a look when you can!

I've successfully tried this out with Flink (thanks @Fokko for the tip!) and it's working as I expect it to. Is it worth checking in the files created by Flink?
geruh
left a comment
Nice, thanks for opening @rambleraptor!!! Left some comments below.
Also, +1 to adding Flink testing; I believe there were talks about adding this to the TCK! While working on #2255, we tested all delete file combinations with Flink.
@@ -1693,7 +1693,12 @@ def _task_to_record_batches(
def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> dict[str, list[ChunkedArray]]:
Yeah, we'll see some crashes if we don't have this here. Later on (in to_record_batches, I believe), we try to read fields that aren't valid for them, since that part of the codebase doesn't expect equality deletes.
Handling that feels more natural as part of the follow-up for FileScanTasks. In the meantime, we filter them out here to avoid a crash. This doesn't affect how the indexes work.
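A rough sketch of the filtering idea described above, using hypothetical stand-in types rather than PyIceberg's own classes. The `Content` enum, `DeleteFile` dataclass, and `positional_deletes_only` helper are assumptions made for illustration; only the content IDs (0, 1, 2) come from the Iceberg spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Content(Enum):
    """Stand-in for the manifest content type (IDs follow the Iceberg spec)."""

    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass(frozen=True)
class DeleteFile:
    """Hypothetical, minimal view of a delete file attached to a scan task."""

    file_path: str
    content: Content


def positional_deletes_only(delete_files: Iterable[DeleteFile]) -> list[DeleteFile]:
    """Keep only positional deletes so the reader never sees equality deletes."""
    return [f for f in delete_files if f.content == Content.POSITION_DELETES]


files = [
    DeleteFile("pos-1.parquet", Content.POSITION_DELETES),
    DeleteFile("eq-1.parquet", Content.EQUALITY_DELETES),
    DeleteFile("pos-2.parquet", Content.POSITION_DELETES),
]
readable = positional_deletes_only(files)
```

The index can still hold the equality deletes; the filter only guards the read path that does not understand them yet.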
78800f7 to 3a7413a

@geruh thanks for the review! Could not agree more on the Flink testing! I'll leave that for a follow-up PR if that's alright, since we haven't stood up Flink testing yet. I don't want to pollute this PR too much.
Part of #3270
Rationale for this change
This adds support for getting equality deletes in the DeleteFileIndex.
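As a simplified sketch of what such a lookup involves (not the implementation in this PR): per the Iceberg spec, an equality delete applies to a data file only when the data file's data sequence number is strictly less than the delete's. The `EqualityDelete` and `EqualityDeleteIndex` names below are hypothetical, and partition matching is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EqualityDelete:
    """Hypothetical record for an equality delete file."""

    file_path: str
    sequence_number: int
    equality_ids: tuple[int, ...]


@dataclass
class EqualityDeleteIndex:
    """Illustrative index; ignores partitions and spec IDs for brevity."""

    _deletes: list[EqualityDelete] = field(default_factory=list)

    def add(self, delete: EqualityDelete) -> None:
        self._deletes.append(delete)

    def for_data_file(self, data_sequence_number: int) -> list[EqualityDelete]:
        # Equality deletes apply only to strictly older data files.
        return [d for d in self._deletes if d.sequence_number > data_sequence_number]


index = EqualityDeleteIndex()
index.add(EqualityDelete("eq-a.parquet", sequence_number=3, equality_ids=(1,)))
index.add(EqualityDelete("eq-b.parquet", sequence_number=5, equality_ids=(1, 2)))

# A data file written at sequence number 3 is only affected by the later delete.
applicable = index.for_data_file(3)
```

Note the strict inequality: positional deletes also apply to data files with the same sequence number, while equality deletes do not.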
I'm very purposefully ignoring them in _read_all_delete_files because they will crash.
Are these changes tested?
I made some equality deletes by hand and had PyIceberg read them to check the indexes. It worked as expected. If you know a way to create equality deletes, I can test with those as well.
Are there any user-facing changes?