Bert Maher
432 posts
@tensorbert
I’m a software engineer building high-performance kernels and compilers at Anthropic! Previously at Facebook/Meta (PyTorch, HHVM, ReDex)
Joined December 2022
402 Following · 2,762 Followers

  • Bert Maher
    @tensorbert
    Aug 25, 2025
    Hello Anthropic! After 12 awesome years at Meta where I contributed to foundational technology (HHVM, ReDex, PyTorch, Triton) I’ve decided it’s time for a new adventure. I’ve joined the inference team — I’m looking forward to making Claude even faster! 🚀
    191K
  • Bert Maher
    @tensorbert
    Aug 17, 2025
    I love @rasbt’s LLMs from scratch - they’re so hackable. A quick optimization project for someone: modify the weight loader so it doesn’t use double memory. I think there’s a way to do it without changing the model FQNs
    Sebastian Raschka
    @rasbt
    Aug 17, 2025
    Couldn't resist. Here's a pure PyTorch from-scratch re-implementation of Gemma 3 270M in a Jupyter Notebook (uses about 1.49 GB RAM): github.com/rasbt/LLMs-fro…
    15K
  • Bert Maher
    @tensorbert
    Jul 18, 2023
    I just finished reading this great article by @jeremyphoward that refreshes the matrix/vector calculus behind neural networks. As someone who spent the last several years doing systems programming rather than calculus, it was a welcome read!
    The Matrix Calculus You Need For Deep Learning
    From explained.ai
    15K
  • Bert Maher
    @tensorbert
    Aug 24, 2025
    I am curious: folks who know a lot about parallelism in model scaling, do you mainly get your insights from pencil-and-paper reasoning, from actually running lots of measurements, or both? I have a good feel for single-device performance, but not large scale yet
    Horace He
    Thinking Machines
    @cHHillee
    Aug 23, 2025
    Replying to @JingyuanLiu123
    This is the advantage of large nvlink domains or TPUs topology - the main reason to do PP is that you are bottlenecked on your DP comms and cannot scale TP further. But if you have high enough bandwidth across a large enough domain (like TPUs or NVL72), you don't need to do PP
    28K
  • Bert Maher
    @tensorbert
    Dec 4, 2023
    Three basic tricks for non-matmul ML CUDA kernels:
    - vectorize loads/stores to maximize memory bandwidth
    - use warp shuffle instructions and shared memory for reductions within a block (avoid cross block)
    - tile and use shared memory for fast transpose (beware bank conflicts)
    3.9K
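The warp-shuffle reduction mentioned in the post above can be sketched without a GPU. The following is a minimal pure-Python simulation, not CUDA code and not the author's implementation: the helper names `shfl_down` and `warp_reduce_sum` are hypothetical, and `shfl_down` only mimics the data movement of CUDA's `__shfl_down_sync` (each lane reads the value held by the lane `delta` positions higher, and lanes with no such neighbor keep their own value).

```python
WARP_SIZE = 32  # number of lanes in an NVIDIA warp


def shfl_down(values, delta):
    """Simulate a shuffle-down: lane i reads the value of lane i + delta.

    Lanes whose source would fall outside the warp keep their own value,
    mirroring the behavior of CUDA's __shfl_down_sync.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree-reduce one warp's values; the full sum ends up in lane 0."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds the value it pulled from its higher neighbor.
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # only lane 0 is guaranteed to hold the complete sum


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(WARP_SIZE))))  # 496
```

The loop takes five steps (offsets 16, 8, 4, 2, 1) for 32 lanes. In a real kernel each step would be a register-to-register exchange, and shared memory would only be needed to combine the per-warp results within a block.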
  • Bert Maher
    @tensorbert
    Dec 1, 2022
    I’m finally on Twitter for real! I’m a long-time compiler hacker, working at FB/Meta for the last 10 years and on ML compilers for the last 5. I realized there’s so much ML action here I have to join! Who should I follow to stay up-to-date with all the best ML insights?
  • Bert Maher
    @tensorbert
    Apr 21, 2023
    Our blog post about optimizing nanoGPT w/ PyTorch 2.0 is up! Includes our new, super fast attention kernels. My contribution: padding matmuls to be GPU friendly. A 3-character change for a 25% win 😁 pytorch.org/blog/accelerat…
    2.1K
  • Bert Maher
    @tensorbert
    Aug 17, 2025
    This is, in fact, a fantastic syllabus for ML perf engineering. Also required, one debugging war story where training loss was just inexplicably too high
    Aaron
    @Norapom04
    Aug 16, 2025
    job descriptions are asking for a bit too much nowadays... (side note this is a pretty good index for ML system ball knowledge)
    2.7K
  • Bert Maher
    @tensorbert
    Aug 27, 2025
    Replying to @cHHillee @rvarm1 and @JingyuanLiu123
    Can you convince someone at Meta to publish that note to the PyTorch blog or something? It was sooo good, I referred to it so many times while learning distributed scaling
    1.8K
  • Bert Maher
    @tensorbert
    Sep 5, 2025
    Kind of excited to see references to this in the wild!
    typedfemale
    @typedfemale
    Sep 5, 2025
    meta's gluon at home
    TLX - Triton Low-level Language Extensions
Introduction
TLX (Triton Low-level Language Extensions) is a low-level, warp-aware, hardware-near extension of the Triton DSL. It offers intrinsics and warp-specialized operations for fine-grained GPU control, hardware-oriented primitives for advanced kernel development, and explicit constructs for GPU memory, computation, and asynchronous control flow. TLX is designed for expert users pushing Triton closer to the metal.

Primarily targeting NVIDIA GPUs (for now), TLX extends Triton to support:

- Hardware-specific intrinsics (e.g., wgmma, async_copy, barrier)
- Shared and local memory allocation
- Instruction-level scheduling and control
- Cross-warpgroup synchronization

While this approach places more responsibility on the user, it reduces the compiler's role as a performance bottleneck. Although it may introduce divergence across hardware platforms, it empowers users to perform deeper, architecture-specific optimizations without relying solely on c
    3.4K
  • Bert Maher
    @tensorbert
    Oct 1, 2023
    The work of making an open-source library usable and popular also improves the quality of the code (specifically the API) in ways that help its users move faster. Internal-only code tends to accumulate “glop” because people have no choice but to use it.
    clem 🤗
    @ClementDelangue
    Sep 28, 2023
    Meta starts open-sourcing a lot and is now becoming one of the best companies in the world at shipping AI features. Coincidence? I don’t think so. Contrary to popular belief, a company (or a country) sharing their research, models and datasets publicly in open-source makes them
    11K
  • Bert Maher
    @tensorbert
    Dec 14, 2022
    🧵Maybe it's a mundane use of an amazing AI, but I've been finding ChatGPT amazingly useful for breaking through writer's block. Literally just asking it to make me an outline of something helps get me unblocked and writing. (1/6)
  • Bert Maher
    @tensorbert
    Feb 11, 2023
    Replying to @soumithchintala and @jiayq
    It’s also pretty similar to talking to a child. Like, a lot of parenting books are basically prompt engineering for children!
    9.6K
  • Bert Maher
    @tensorbert
    Feb 4, 2023
    At the risk of tooting my own horn, I stumbled on this optimization during a PyTorch 2.0 team chat with @karpathy 😁
    Andrej Karpathy
    @karpathy
    Feb 3, 2023
    The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.
    3.6K
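The padding arithmetic in the quoted post (50257 rounded up to 50304, the nearest multiple of 64) can be checked with a few lines of Python. This is a small sketch only: the function name `round_up_to_multiple` is hypothetical and is not taken from nanoGPT or PyTorch.

```python
def round_up_to_multiple(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    vocab_size = 50257  # original vocab size from the post
    padded = round_up_to_multiple(vocab_size, 64)
    print(padded)               # 50304
    print(padded - vocab_size)  # 47 unused padding entries
```

The 47 extra rows are the "useless dimensions" the post refers to: they cost a little extra compute, and the post reports that the aligned shape is what unlocks the faster kernel path.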