Core: Adjust Jackson settings to handle large metadata json by bryanck · Pull Request #12224 · apache/iceberg

bryanck · 2025-02-11T16:22:11Z

With very large table metadata JSON (for example, tables with many snapshots that carry partition summaries), we sometimes encounter hash collision errors when loading the metadata. This PR disables the hash collision check so the metadata can be parsed without error. We have had this set in our internal fork for a while.

In addition, this PR disables string interning of field names, which has led to performance problems for us when parsing metadata. Given partition summary field names and other snapshot properties are often not reused across different metadata, the interning causes more harm than good. This is especially true when using Iceberg in a server which is loading metadata for many tables.
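For illustration, a minimal sketch of what the two settings described above could look like on a shared Jackson factory, assuming Jackson 2.10 or later (where `JsonFactoryBuilder` is available). The class name, field names, and sample JSON are hypothetical; the actual diff is not reproduced in this thread.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonFactoryBuilder;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LenientJsonFactoryExample {
  // Shared factory with field-name interning and the symbol-table
  // hash overflow check both turned off (hypothetical placement).
  static final JsonFactory FACTORY =
      new JsonFactoryBuilder()
          .configure(JsonFactory.Feature.INTERN_FIELD_NAMES, false)
          .configure(JsonFactory.Feature.FAIL_ON_SYMBOL_HASH_OVERFLOW, false)
          .build();

  // Every parser created through this mapper inherits the factory settings.
  static final ObjectMapper MAPPER = new ObjectMapper(FACTORY);

  public static void main(String[] args) throws Exception {
    // Made-up summary entry, only to exercise the parser.
    JsonNode node = MAPPER.readTree("{\"partitions.date=2025-02-11\": \"added-data-files=1\"}");
    System.out.println(node.fieldNames().next());
  }
}
```

Canonicalization stays enabled in this sketch, so field names repeated by parsers from the same factory can still share `String` instances.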

This also fixes a test classpath issue. The Avatica driver is a shadow jar that bundles an old unshaded version of Jackson.

stevenzwu · 2025-02-11T18:29:09Z

Given partition summary field names and other snapshot properties are often not reused across different metadata, the interning causes more harm than good.

@bryanck I didn't quite get the partition summary field names. were you referring to PartitionFieldSummaryParser? it seems to have just 4 field names.

String.intern can be helpful for some use cases while harmful for some (like the one you encountered). Disabling interning seems to be a safer option considering diverse scenarios that the code can be used (like REST catalog server).
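As a small standard-library illustration of the trade-off (not taken from the PR), the sketch below shows that interned strings resolve to one canonical instance held in a JVM-wide pool. That sharing helps when names repeat and is pure overhead when every name is unique. The key value is a made-up example.

```java
public class InternDemo {
  // Reference comparison, used here only to show instance sharing.
  public static boolean sameInstance(String a, String b) {
    return a == b;
  }

  public static void main(String[] args) {
    // Two distinct String objects with equal contents.
    String a = new String("partitions.day=1");
    String b = new String("partitions.day=1");

    // Separate heap objects until interned.
    System.out.println(sameInstance(a, b));
    // After interning, both resolve to the same pooled instance.
    System.out.println(sameInstance(a.intern(), b.intern()));
  }
}
```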

hash collision check

I definitely understand the situation you described. maybe reach out to the Jackson authors too according to the doc?
https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features

In unlikely event that the exception is triggered for valid data, it may make sense to either disable this feature, or to disable canonicalization. However, Jackson authors would also like to be notified for such usage as it may point to an issue with hashing scheme -- so please file an issue if you encounter this problem.

The doc also mentioned that hash collision check is Only relevant if canonicalization is enabled. wondering if CANONICALIZE_FIELD_NAMES should be disabled too. I imagined it can cause similar memory footprint issue as String interning.

bryanck · 2025-02-11T18:43:39Z

@bryanck I didn't quite get the partition summary field names. were you referring to PartitionFieldSummaryParser? it seems to have just 4 field names.

String.intern can be helpful for some use cases while harmful for some (like the one you encountered). Disabling interning seems to be a safer option considering diverse scenarios that the code can be used (like REST catalog server).

The information for each partition key has a field name unique to the partition (with the prefix partitions.). There is some discussion around intern here with more links. TL;DR is that intern was disabled by default for Jackson 3 (whenever that is released).
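To make the point about unique field names concrete, here is a rough sketch that counts how many distinct field names a server would see across many tables. Only the `partitions.` prefix comes from the comment above; the suffix layout (`tenant=.../day=...`) and all names are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionSummaryKeys {
  // Hypothetical key layout: the "partitions." prefix plus an assumed partition path.
  public static String summaryKey(String table, int day) {
    return "partitions." + "tenant=" + table + "/day=" + day;
  }

  // Counts the distinct field names seen across all tables.
  public static int distinctKeys(int tables, int partitionsPerTable) {
    Set<String> keys = new HashSet<>();
    for (int t = 0; t < tables; t++) {
      for (int p = 0; p < partitionsPerTable; p++) {
        keys.add(summaryKey("t" + t, p));
      }
    }
    return keys.size();
  }

  public static void main(String[] args) {
    // Under this layout no key repeats, so interning would add one
    // pool entry per key without ever getting a pool hit.
    System.out.println(distinctKeys(100, 1000));
  }
}
```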

I definitely understand the situation you described. maybe reach out to the Jackson authors too according to the doc? https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features

Sure sounds good, I'll reach out.

The doc also mentioned that hash collision check is Only relevant if canonicalization is enabled. wondering if CANONICALIZE_FIELD_NAMES should be disabled too. I imagined it can cause similar memory footprint issue as String interning.

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

stevenzwu · 2025-02-11T19:08:07Z

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

bryanck · 2025-02-11T19:14:17Z

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)

bryanck · 2025-02-11T19:15:51Z

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)

Actually that doesn't seem correct, it is for any parser created by the same factory, so we should probably turn canonicalization off instead.

bryanck · 2025-02-11T19:26:35Z

I made the change to disable canonicalization instead.
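A sketch of what disabling canonicalization on the factory might look like, assuming Jackson 2.10 or later; the class name is hypothetical and this is not the contents of the commit itself. As the doc quoted earlier notes, the hash collision check is only relevant while canonicalization is enabled.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonFactoryBuilder;

public class NoCanonicalizationExample {
  // With canonicalization off, parsers from this factory do not share
  // a symbol table of field names.
  static final JsonFactory FACTORY =
      new JsonFactoryBuilder()
          .configure(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES, false)
          .build();

  public static void main(String[] args) {
    // Confirms the feature state on the built factory.
    System.out.println(FACTORY.isEnabled(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES));
  }
}
```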

bryanck · 2025-02-11T20:39:31Z

I switched back to the original change, to just disable intern and the hash collision check. Disabling canonicalization altogether can impact performance significantly.

This reverts commit 3c5d438.

stevenzwu · 2025-02-11T22:39:31Z

@bryanck thanks for the experimentation with canonicalization. do you have any micro/jmh benchmark for the parser performance? if yes, maybe it would be useful to add it to the Iceberg repo.

singhpk234

Thanks @bryanck !

do you have any micro/jmh benchmark for the parser performance

+1, the (size, number of snapshots) tuple would be great to experiment with, and it would be good to have the benchmark committed.
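One possible starting point for such an experiment is a generator for synthetic input, parameterized by snapshot count and partitions per snapshot. This is only a sketch: the JSON shape is a simplified stand-in rather than the real table metadata format, and every name and value in it is hypothetical.

```java
public class SyntheticSummaries {
  // Builds a JSON array of `snapshots` summary objects, each carrying
  // `partitionsPerSnapshot` unique "partitions."-prefixed field names.
  public static String generate(int snapshots, int partitionsPerSnapshot) {
    StringBuilder sb = new StringBuilder("[");
    for (int s = 0; s < snapshots; s++) {
      if (s > 0) {
        sb.append(',');
      }
      sb.append("{\"operation\":\"append\"");
      for (int p = 0; p < partitionsPerSnapshot; p++) {
        // Assumed key layout; only the prefix comes from the discussion.
        sb.append(",\"partitions.snap=").append(s)
            .append("/part=").append(p)
            .append("\":\"added-data-files=1\"");
      }
      sb.append('}');
    }
    return sb.append(']').toString();
  }

  public static void main(String[] args) {
    // Vary both parameters to see how parse time scales with input size.
    String json = generate(1_000, 100);
    System.out.println(json.length());
  }
}
```

The generated string could then be fed to the parser under different factory settings, for example from a JMH benchmark.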

[for my understanding] I thought we had a way to lazy load metadata in REST, the complete metadata parsing would only be required at the time of commit ? Are all the tables write heavy ?

bryanck · 2025-02-12T00:32:07Z

[for my understanding] I thought we had a way to lazy load metadata in REST, the complete metadata parsing would only be required at the time of commit ? Are all the tables write heavy ?

We have a very high write load and generally have partition summaries turned on.

Core: Adjust Jackson settings for large metadata

42e3f04

github-actions Bot added the core label Feb 11, 2025

bryanck marked this pull request as draft February 11, 2025 16:44

fix test classpath issue

ceb6a17

github-actions Bot added MR build labels Feb 11, 2025

bryanck marked this pull request as ready for review February 11, 2025 17:38

dramaticlly approved these changes Feb 11, 2025

bryanck force-pushed the jackson-setting branch from 7c48a51 to 5cffecd Compare February 11, 2025 19:29

Disable canonicalization instead

3c5d438

bryanck force-pushed the jackson-setting branch from 5cffecd to 3c5d438 Compare February 11, 2025 19:41

Revert "Disable canonicalization instead"

6053374

This reverts commit 3c5d438.

stevenzwu approved these changes Feb 11, 2025

singhpk234 approved these changes Feb 12, 2025

nastra approved these changes Feb 13, 2025

nastra merged commit 80a009a into apache:main Feb 13, 2025

bryanck added this to the Iceberg 1.8.1 milestone Feb 18, 2025

nastra pushed a commit to nastra/iceberg that referenced this pull request Feb 19, 2025

Core: Adjust Jackson settings to handle large metadata json (apache#1…

e1f783b

…2224)

nastra mentioned this pull request Feb 19, 2025

[1.8.x] Core: Adjust Jackson settings to handle large metadata json (#12224) #12330

Merged

nastra added a commit that referenced this pull request Feb 19, 2025

Core: Adjust Jackson settings to handle large metadata json (#12224) (#…

30d4a93

…12330) Co-authored-by: Bryan Keller <bryanck@gmail.com>

nastra merged 4 commits into apache:main from bryanck:jackson-setting
