Bert Maher (@tensorbert) / X

Bert Maher

@tensorbert

I’m a software engineer building high-performance kernels and compilers at Anthropic! Previously at Facebook/Meta (PyTorch, HHVM, ReDex)

Joined December 2022

Bert Maher
@tensorbert
Aug 25, 2025
Hello Anthropic! After 12 awesome years at Meta where I contributed to foundational technology (HHVM, ReDex, PyTorch, Triton) I’ve decided it’s time for a new adventure. I’ve joined the inference team — I’m looking forward to making Claude even faster! 🚀
191K
Bert Maher
@tensorbert
Aug 17, 2025
I love @rasbt’s LLMs from scratch - they’re so hackable. A quick optimization project for someone: modify the weight loader so it doesn’t use double memory. I think there’s a way to do it without changing the model FQNs
Sebastian Raschka
@rasbt
Aug 17, 2025
Couldn't resist. Here's a pure PyTorch from-scratch re-implementation of Gemma 3 270M in a Jupyter Notebook (uses about 1.49 GB RAM): github.com/rasbt/LLMs-fro…
15K
Bert Maher
@tensorbert
Jul 18, 2023
I just finished reading this great article by @jeremyphoward that refreshes the matrix/vector calculus behind neural networks. As someone who spent the last several years doing systems programming rather than calculus, it was a welcome read!
The Matrix Calculus You Need For Deep Learning
From explained.ai
15K
Bert Maher
@tensorbert
Aug 24, 2025
I am curious: folks who know a lot about parallelism in model scaling, do you mainly get your insights from pencil-and-paper reasoning, from actually running lots of measurements, or both? I have a good feel for single-device performance, but not large scale yet
Horace He
@cHHillee
Aug 23, 2025
Replying to @JingyuanLiu123
This is the advantage of large NVLink domains or TPU topologies - the main reason to do PP is that you are bottlenecked on your DP comms and cannot scale TP further. But if you have high enough bandwidth across a large enough domain (like TPUs or NVL72), you don't need to do PP
28K
Bert Maher
@tensorbert
Dec 4, 2023
Three basic tricks for non-matmul ML CUDA kernels:
- vectorize loads/stores to maximize memory bandwidth
- use warp shuffle instructions and shared memory for reductions within a block (avoid cross block)
- tile and use shared memory for fast transpose (beware bank conflicts)
3.9K
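To make the second trick in the post above concrete, here is a minimal sketch that simulates a shuffle-based block reduction in plain Python. It is an illustration only, not CUDA and not the author's code: `shfl_down`, `warp_reduce_sum`, and `block_reduce_sum` are hypothetical helper names, the list `warp_sums` stands in for shared memory, and the block size is assumed to be a multiple of 32 with at most 32 warps.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def shfl_down(values, offset):
    """Simulate a shuffle-down: lane i reads the value held by lane i + offset.

    Lanes whose source falls outside the warp keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree-reduce one warp in log2(32) = 5 steps; lane 0 ends up with the sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds the value it pulled from the lane `offset` above it.
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # only lane 0 is guaranteed to hold the full sum


def block_reduce_sum(values):
    """Reduce a whole block: per-warp shuffles, then one warp reduces the partials."""
    assert len(values) % WARP_SIZE == 0
    assert len(values) // WARP_SIZE <= WARP_SIZE
    # Step 1: each warp reduces its own 32 values with shuffles only.
    warp_sums = [
        warp_reduce_sum(values[start:start + WARP_SIZE])
        for start in range(0, len(values), WARP_SIZE)
    ]
    # Step 2: `warp_sums` plays the role of shared memory; the first warp
    # loads the partial sums (zero-padded to 32 lanes) and reduces them.
    padded = warp_sums + [0] * (WARP_SIZE - len(warp_sums))
    return warp_reduce_sum(padded)


if __name__ == "__main__":
    data = list(range(256))  # one 256-thread block
    print(block_reduce_sum(data) == sum(data))
```

In this scheme only one value per warp passes through shared memory, and nothing leaves the block, which is the point of the "avoid cross block" advice.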
Bert Maher
@tensorbert
Dec 1, 2022
I’m finally on Twitter for real! I’m a long-time compiler hacker, working at FB/Meta for the last 10 years and on ML compilers for the last 5. I realized there’s so much ML action here I have to join! Who should I follow to stay up-to-date with all the best ML insights?
Bert Maher
@tensorbert
Apr 21, 2023
Our blog post about optimizing nanoGPT w/ PyTorch 2.0 is up! Includes our new, super fast attention kernels. My contribution: padding matmuls to be GPU friendly. A 3-character change for a 25% win 😁 pytorch.org/blog/accelerat…
2.1K
Bert Maher
@tensorbert
Aug 17, 2025
This is, in fact, a fantastic syllabus for ML perf engineering. Also required, one debugging war story where training loss was just inexplicably too high
Aaron
@Norapom04
Aug 16, 2025
job descriptions are asking for a bit too much nowadays... (side note this is a pretty good index for ML system ball knowledge)
2.7K
Bert Maher
@tensorbert
Aug 27, 2025
Replying to @cHHillee @rvarm1 and @JingyuanLiu123
Can you convince someone at Meta to publish that note to the PyTorch blog or something? It was sooo good, I referred to it so many times while learning distributed scaling
1.8K
Bert Maher
@tensorbert
Sep 5, 2025
Kind of excited to see references to this in the wild!
typedfemale
@typedfemale
Sep 5, 2025
meta's gluon at home
3.4K
Bert Maher
@tensorbert
Oct 1, 2023
The work of making an open-source library usable and popular also improves the quality of the code (specifically the API) in ways that help its users move faster. Internal-only code tends to accumulate “glop” because people have no choice but to use it.
clem 🤗
@ClementDelangue
Sep 28, 2023
Meta starts open-sourcing a lot and is now becoming one of the best companies in the world at shipping AI features. Coincidence? I don’t think so. Contrary to popular belief, a company (or a country) sharing their research, models and datasets publicly in open-source makes them
11K
Bert Maher
@tensorbert
Dec 14, 2022
🧵Maybe it's a mundane use of an amazing AI, but I've been finding ChatGPT amazingly useful for breaking through writer's block. Literally just asking it to make me an outline of something helps get me unblocked and writing. (1/6)
Bert Maher
@tensorbert
Feb 11, 2023
Replying to @soumithchintala and @jiayq
It’s also pretty similar to talking to a child. Like, a lot of parenting books are basically prompt engineering for children!
9.6K
Bert Maher
@tensorbert
Feb 4, 2023
At the risk of tooting my own horn, I stumbled on this optimization during a PyTorch 2.0 team chat with @karpathy 😁
Andrej Karpathy
@karpathy
Feb 3, 2023
The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.
3.6K
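The padding arithmetic in the quoted post can be checked with a short sketch. The helper name `round_up` is hypothetical; the numbers 50257 and 64 come from the post, and the code only reproduces the rounding, not the kernel behavior behind the speedup.

```python
def round_up(n, multiple):
    """Round n up to the nearest multiple of `multiple` (both positive ints)."""
    return ((n + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    vocab_size = 50257  # GPT-2 vocabulary size cited in the post
    padded = round_up(vocab_size, 64)
    # The extra rows are unused padding in the embedding and output matrices.
    print(padded, padded - vocab_size)
```

Rounding 50257 up to a multiple of 64 gives 50304, which adds 47 unused vocabulary entries.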