(WIP) Cleaned Flux2 Klein Implementation, with benchmarking done on v6 TPU by amepas · Pull Request #434 · AI-Hypercomputer/maxdiffusion

amepas · 2026-06-29T20:49:38Z

(WIP) - this will be updated with multi-chip latency and support for Flux2 Klein 9B!

Draft PR for the Flux2 Klein model. It includes a custom implementation of the Qwen3-4B model for computing text embeddings. The VAE decoder, RoPE positional embedder, and flow-matching step schedule are all reused. The transformer/attention blocks are lightly modified.

Latency for a batch of 4 images at 1024 by 1024 (bfloat16):

Prompt Encoding (Qwen3): 57.67 ms (1.58% of total)
Denoising Loop (Flux 4 steps): 3,181.20 ms (87.09% of total)
- Per-Step Transformer Time: 795.30 ms
VAE Decoding (VAE): 413.77 ms (11.33% of total)
Total: 3.65 seconds
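
Per-stage numbers like these only reflect real device time if the timer waits for asynchronously dispatched work to finish. Below is a minimal sketch of how such a breakdown could be measured and summarized. `time_stage`, `breakdown`, and the `sync` hook are hypothetical names, not code from this PR; with JAX, `sync` would typically be `jax.block_until_ready`.

```python
import time
from statistics import mean


def time_stage(fn, *args, warmup=1, iters=3, sync=None):
    """Return the mean wall-clock latency of fn(*args) in milliseconds.

    `sync` is called on each result so the timer stops only after the work
    is done (hypothetical hook; pass jax.block_until_ready when using JAX).
    """
    sync = sync or (lambda result: result)
    for _ in range(warmup):
        sync(fn(*args))  # warm-up runs absorb one-time compilation cost
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        sync(fn(*args))
        samples.append((time.perf_counter() - start) * 1e3)
    return mean(samples)


def breakdown(stage_ms):
    """Map each stage name to its share of the total latency, in percent."""
    total = sum(stage_ms.values())
    return {name: round(100 * ms / total, 2) for name, ms in stage_ms.items()}


# The stage latencies reported above, in milliseconds.
reported = {"prompt_encoding": 57.67, "denoising": 3181.20, "vae_decoding": 413.77}
shares = breakdown(reported)
total_s = sum(reported.values()) / 1e3
per_step_ms = reported["denoising"] / 4  # 4 denoising steps
```

Applied to the reported figures, the stage shares, the per-step time (3,181.20 ms / 4 = 795.30 ms), and the 3.65 s total are consistent with each other.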

The PR includes code for verifying the accuracy of the implementation. Model sharding is implemented but not tested.

Only image generation is supported so far.

github-actions · 2026-06-29T20:49:45Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

entrpn · 2026-06-30T00:14:32Z

+
+class GenerateFlux2KleinE2ETest(unittest.TestCase):
+
+    def test_end_to_end_parity_and_offloading(self):


This test is very likely to fail in the GitHub runner with the hardcoded values. We usually don't run e2e tests on the GitHub runner; you can mark it so that it doesn't run there.
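
One way to mark the test is sketched below, assuming the check keys off the `GITHUB_ACTIONS` environment variable, which GitHub Actions sets to `"true"` on its runners. The `IN_GITHUB_RUNNER` name and the skip message are illustrative, and the test body is a placeholder; the repository may prefer an equivalent pytest marker.

```python
import os
import unittest

# GitHub Actions sets GITHUB_ACTIONS="true" inside its runners.
IN_GITHUB_RUNNER = os.getenv("GITHUB_ACTIONS") == "true"


class GenerateFlux2KleinE2ETest(unittest.TestCase):

    @unittest.skipIf(IN_GITHUB_RUNNER, "e2e tests are not run on the GitHub runner")
    def test_end_to_end_parity_and_offloading(self):
        # Placeholder: the real parity checks from the PR would go here.
        pass
```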

entrpn · 2026-06-30T00:14:41Z

+        every single stage against the golden PyTorch reference.
+        """
+        # Set highest precision for strict mathematical parity checks
+        jax.config.update("jax_default_matmul_precision", "highest")


What is the reason for using "highest" here?

Perseus14

Hey @amepas, I see this is WIP; I've added a few comments to help polish the PR.

Perseus14 · 2026-07-02T19:18:05Z

Do we need a separate generate file for the 9B model? Can this be combined with generate_flux2klein.py or even generate_flux.py?

Perseus14 · 2026-07-02T19:20:23Z

We don't need benchmark code in the tests/ folder. We could benchmark the results and mention them in the PR or a doc, but there is no need to add the code to the repo.

Perseus14 · 2026-07-02T19:21:45Z

      extra_one_step: bool = False,
      reverse_sigmas: bool = False,
+      use_dynamic_shifting: bool = False,
+      time_shift_type: str = "linear",


What are the possible time_shift_type values here? Would moving to an Enum data type be better?
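
A sketch of what the Enum could look like. It assumes the two values accepted by the diffusers `FlowMatchEulerDiscreteScheduler` (`"exponential"` and `"linear"`); the set this PR actually supports may differ. `TimeShiftType` and `parse_time_shift_type` are hypothetical names.

```python
from enum import Enum


class TimeShiftType(str, Enum):
    """Allowed time-shift modes (values assumed, see note above)."""

    LINEAR = "linear"
    EXPONENTIAL = "exponential"


def parse_time_shift_type(value):
    """Accept a TimeShiftType or its string value; reject anything else."""
    try:
        return TimeShiftType(value)
    except ValueError:
        allowed = ", ".join(member.value for member in TimeShiftType)
        raise ValueError(
            f"Unknown time_shift_type {value!r}; expected one of: {allowed}"
        ) from None
```

Because the Enum mixes in `str`, existing call sites that pass or compare against the plain string `"linear"` keep working, while a typo fails when the scheduler is constructed instead of deep inside it.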

Perseus14 · 2026-07-02T19:29:25Z

I suspect you can refactor this file and move some of the functions into files under other folders, such as pipeline, models/flux, or max_utils.py.

Refer to the WAN model files.

Cleaned Flux2 Klein Implementation, with benchmarking done on v6 TPU

379c1a0

amepas requested review from chandrasekhard2 and eltsai June 29, 2026 20:49

amepas requested a review from entrpn as a code owner June 29, 2026 20:49

amepas marked this pull request as draft June 29, 2026 20:53

chandrasekhard2 requested review from Perseus14 and mbohlool June 29, 2026 23:16

entrpn reviewed Jun 30, 2026


entrpn requested changes Jun 30, 2026


amepas changed the title ~~Cleaned Flux2 Klein Implementation, with benchmarking done on v6 TPU~~ Jun 30, 2026

amepas added 8 commits June 30, 2026 18:00

Fixes to support FSDP

a86a6fe

minor tweaks for hf_cache location

59ee14a

more tiny fixes

20427f2

more tiny fixes

1f83f6d

9B model variant support

ebef267

remove extra file

3a73c2a

Fixes to 9B implementation

f8fac50

tiny fixes to normalization

7204719

Perseus14 reviewed Jul 2, 2026


amepas wants to merge 9 commits into main from flux2klein-onboarding

3 participants

