Skip to content

Core: Adjust Jackson settings to handle large metadata json#12224

Merged
nastra merged 4 commits into
apache:mainfrom
bryanck:jackson-setting
Feb 13, 2025
Merged

Core: Adjust Jackson settings to handle large metadata json#12224
nastra merged 4 commits into
apache:mainfrom
bryanck:jackson-setting

Conversation

@bryanck

@bryanck bryanck commented Feb 11, 2025

Copy link
Copy Markdown
Contributor

With very large table metadata json, for example, those with many snapshots with partition summaries, we sometimes encounter errors involving hash collisions when loading the metadata. This PR disables that hash collision check so the metadata can be parsed without error. We have had this set in our internal fork for a while.

In addition, this PR disable string interning of field names which has lead to performance problems for us when parsing metadata. Given partition summary field names and other snapshot properties are often not reused across different metadata, the interning causes more harm than good. This is especially true when using Iceberg in a server which is loading metadata for many tables.

This also fixes a test classpath issue. The Avatica driver is a shadow jar that bundles an old unshaded version of Jackson.

@github-actions github-actions Bot added the core label Feb 11, 2025
@bryanck bryanck marked this pull request as draft February 11, 2025 16:44
@bryanck bryanck marked this pull request as ready for review February 11, 2025 17:38
@stevenzwu

Copy link
Copy Markdown
Contributor

Given partition summary field names and other snapshot properties are often not reused across different metadata, the interning causes more harm than good.

@bryanck I didn't quite get the partition summary field names. were you referring to PartitionFieldSummaryParser? it seems to have just 4 field names.

String.intern can be helpful for some use cases while harmful for some (like the one you encountered). Disabling interning seems to be a safer option considering diverse scenarios that the code can be used (like REST catalog server).

hash collision check

I definitely understand the situation you described. maybe reach out to the Jackson authors too according to the doc?
https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features

In unlikely event that the exception is triggered for valid data, it may make sense to either disable this feature, or to disable canonicalization. However, Jackson authors would also like to be notified for such usage as it may point to an issue with hashing scheme -- so please file an issue if you encounter this problem.

The doc also mentioned that hash collision check is Only relevant if canonicalization is enabled. wondering if CANONICALIZE_FIELD_NAMES should be disabled too. I imagined it can cause similar memory footprint issue as String interning.

@bryanck

bryanck commented Feb 11, 2025

Copy link
Copy Markdown
Contributor Author

@bryanck I didn't quite get the partition summary field names. were you referring to PartitionFieldSummaryParser? it seems to have just 4 field names.

String.intern can be helpful for some use cases while harmful for some (like the one you encountered). Disabling interning seems to be a safer option considering diverse scenarios that the code can be used (like REST catalog server).

The information for each partition key has a field name unique to the partition (with the prefix partitions.). There is some discussion around intern here with more links. TL;DR is that intern was disabled by default for Jackson 3 (whenever that is released).

I definitely understand the situation you described. maybe reach out to the Jackson authors too according to the doc? https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features

Sure sounds good, I'll reach out.

The doc also mentioned that hash collision check is Only relevant if canonicalization is enabled. wondering if CANONICALIZE_FIELD_NAMES should be disabled too. I imagined it can cause similar memory footprint issue as String interning.

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

@stevenzwu

Copy link
Copy Markdown
Contributor

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

@bryanck

bryanck commented Feb 11, 2025

Copy link
Copy Markdown
Contributor Author

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)

@bryanck

bryanck commented Feb 11, 2025

Copy link
Copy Markdown
Contributor Author

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)

Actually that doesn't seem correct, it is for any parser created by the same factory, so we should probably turn canonicalization off instead.

@bryanck

bryanck commented Feb 11, 2025

Copy link
Copy Markdown
Contributor Author

I made the change to disable canonicalization instead.

@bryanck

bryanck commented Feb 11, 2025

Copy link
Copy Markdown
Contributor Author

I switched back to the original change, to just disable intern and the hash collision check. Disabling canonicalization altogether can impact performance significantly.

@stevenzwu

Copy link
Copy Markdown
Contributor

@bryanck thanks for the experimentation with canonicalization. do you have any micro/jmh benchmark for the parser performance? if yes, maybe it would be useful to add it to the Iceberg repo.

@singhpk234 singhpk234 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryanck !

do you have any micro/jmh benchmark for the parser performance

+1, size and number of snapshot tuple would be great to experiment with and have it commited.

[for my understanding] I thought we had a way to lazy load metadata in REST, the complete metadata parsing would only be required at the time of commit ? Are all the tables write heavy ?

@bryanck

bryanck commented Feb 12, 2025

Copy link
Copy Markdown
Contributor Author

[for my understanding] I thought we had a way to lazy load metadata in REST, the complete metadata parsing would only be required at the time of commit ? Are all the tables write heavy ?

We have a very high write load and generally have partition summaries turned on.

@nastra nastra merged commit 80a009a into apache:main Feb 13, 2025
@bryanck bryanck added this to the Iceberg 1.8.1 milestone Feb 18, 2025
nastra pushed a commit to nastra/iceberg that referenced this pull request Feb 19, 2025
nastra added a commit that referenced this pull request Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

5 participants