Skip to content

Core: Support DVs in DeleteLoader#11481

Merged
aokolnychyi merged 3 commits into
apache:mainfrom
aokolnychyi:dv-reader
Nov 8, 2024
Merged

Core: Support DVs in DeleteLoader#11481
aokolnychyi merged 3 commits into
apache:mainfrom
aokolnychyi:dv-reader

Conversation

@aokolnychyi

Copy link
Copy Markdown
Contributor

This PR adds support for reading DVs in BaseDeleteLoader, the only loader implementation we have.

This work is part of #11122.

@github-actions github-actions Bot added the data label Nov 6, 2024
if (containsDVs(deleteFiles)) {
DeleteFile dv = Iterables.getOnlyElement(deleteFiles);
validateDV(dv, filePath);
return readDV(dv); // TODO: support caching entire DV files

@aokolnychyi aokolnychyi Nov 6, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of context about how caching works for V2 deletes. If we estimate the content of the entire file to fit into the cache (its in-memory representation), we read the entire file and cache the result. For position delete files, we cache a bitmap for each referenced data file. We can do similar stuff for Puffin. I need to explore the performance impact of not knowing the footer size upfront.

Any early feedback is welcome!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not sure it should be a blocker.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we can efficiently implement caching with the current Puffin reader, as it would require at least 3 requests (one to find the footer size, one to read the footer, and one for the content). In theory, we can do this in a single read, given that we know the overall size of the Puffin file. Therefore, I suggest that we keep this TODO and revisit it later.

LOG.trace("Opening DV file {}", dv.location());
InputFile inputFile = loadInputFile.apply(dv);
long offset = dv.contentOffset();
int length = dv.contentSizeInBytes().intValue();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast is safe as it is covered by validateDV.

Comment thread data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java
try (SeekableInputStream stream = inputFile.newStream()) {
byte[] bytes = new byte[length];

if (stream instanceof RangeReadable) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielcweeks, could you double check this part as you added these interfaces?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only minor comment I have here is that the same code exists in PuffinReader, so it might be eventually good to re-use the code but this shouldn't be a blocker for this PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll follow up to refactor the common part.

// verify correctness with DVs and position delete files
assertRows(ImmutableList.of(toRow(3, "aaa"), toRow(6, "aaa")));

// verify the position delete file applies only to the data file without the DV

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the context comments. This makes it very easy to follow what's going on in the test.

@Override
public PositionDeleteIndex loadPositionDeletes(
Iterable<DeleteFile> deleteFiles, CharSequence filePath) {
if (containsDVs(deleteFiles)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The containsDVs check here seems more broad than the assumption that there is one and only one DV. Should we make the contains check assert those expectations? (e.g. containsSingleDV?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eduard had the same question when reviewing the prototype PR. Let me think how to make the code more obvious. Our Iterables.getOnlyElement below will fail if there are multiple files.

@jbonofre jbonofre left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit on containsDV method.

return schema.columns().stream().mapToInt(TypeUtil::estimateSize).sum();
}

private boolean containsDVs(Iterable<DeleteFile> deleteFiles) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this method should be named containsSingleDV() ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

@nastra @danielcweeks @jbonofre, what do you folks think about #11486 as an alternative implementation? Am I overcomplicating or is that one cleaner?

@github-actions github-actions Bot added the core label Nov 8, 2024
@jbonofre

jbonofre commented Nov 8, 2024

Copy link
Copy Markdown
Member

@aokolnychyi Thanks for this great work ! I prefer this PR over #11486 : imho, this impl is more "straight forward".

@aokolnychyi aokolnychyi merged commit 166edc7 into apache:main Nov 8, 2024
@aokolnychyi

Copy link
Copy Markdown
Contributor Author

Thanks for reviewing, @jbonofre @nastra @danielcweeks!

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

4 participants