Core: Support DVs in DeleteLoader#11481
Conversation
| if (containsDVs(deleteFiles)) { | ||
| DeleteFile dv = Iterables.getOnlyElement(deleteFiles); | ||
| validateDV(dv, filePath); | ||
| return readDV(dv); // TODO: support caching entire DV files |
There was a problem hiding this comment.
A bit of context about how caching works for V2 deletes. If we estimate the content of the entire file to fit into the cache (its in-memory representation), we read the entire file and cache the result. For position delete files, we cache a bitmap for each referenced data file. We can do similar stuff for Puffin. I need to explore the performance impact of not knowing the footer size upfront.
Any early feedback is welcome!
There was a problem hiding this comment.
I am also not sure it should be a blocker.
There was a problem hiding this comment.
I am not sure we can efficiently implement caching with the current Puffin reader, as it would require at least 3 requests (one to find the footer size, one to read the footer, and one for the content). In theory, we can do this in a single read, given that we know the overall size of the Puffin file. Therefore, I suggest that we keep this TODO and revisit it later.
| LOG.trace("Opening DV file {}", dv.location()); | ||
| InputFile inputFile = loadInputFile.apply(dv); | ||
| long offset = dv.contentOffset(); | ||
| int length = dv.contentSizeInBytes().intValue(); |
There was a problem hiding this comment.
The cast is safe as it is covered by validateDV.
| try (SeekableInputStream stream = inputFile.newStream()) { | ||
| byte[] bytes = new byte[length]; | ||
|
|
||
| if (stream instanceof RangeReadable) { |
There was a problem hiding this comment.
@danielcweeks, could you double check this part as you added these interfaces?
There was a problem hiding this comment.
the only minor comment I have here is that the same code exists in PuffinReader, so it might be eventually good to re-use the code but this shouldn't be a blocker for this PR
There was a problem hiding this comment.
I'll follow up to refactor the common part.
| // verify correctness with DVs and position delete files | ||
| assertRows(ImmutableList.of(toRow(3, "aaa"), toRow(6, "aaa"))); | ||
|
|
||
| // verify the position delete file applies only to the data file without the DV |
There was a problem hiding this comment.
Love the context comments. This makes it very easy to follow what's going on in the test.
| @Override | ||
| public PositionDeleteIndex loadPositionDeletes( | ||
| Iterable<DeleteFile> deleteFiles, CharSequence filePath) { | ||
| if (containsDVs(deleteFiles)) { |
There was a problem hiding this comment.
The containsDVs check here seems more broad than the assumption that there is one and only one DV. Should we make the contains check assert those expectations? (e.g. containsSingleDV?)
There was a problem hiding this comment.
Eduard had the same question when reviewing the prototype PR. Let me think how to make the code more obvious. Our Iterables.getOnlyElement below will fail if there are multiple files.
jbonofre
left a comment
There was a problem hiding this comment.
just a nit on containsDV method.
| return schema.columns().stream().mapToInt(TypeUtil::estimateSize).sum(); | ||
| } | ||
|
|
||
| private boolean containsDVs(Iterable<DeleteFile> deleteFiles) { |
There was a problem hiding this comment.
Maybe this method should be named containsSingleDV() ?
|
@nastra @danielcweeks @jbonofre, what do you folks think about #11486 as an alternative implementation? Am I overcomplicating or is that one cleaner? |
|
@aokolnychyi Thanks for this great work ! I prefer this PR over #11486 : imho, this impl is more "straight forward". |
|
Thanks for reviewing, @jbonofre @nastra @danielcweeks! |
This PR adds support for reading DVs in
BaseDeleteLoader, the only loader implementation we have.This work is part of #11122.