Latency numbers everyone should know

April 15, 2026 read post

There's a Google PDF with a bunch of useful latency numbers that software engineers should know. I asked ChatGPT to make it into a markdown file for me.

What, exactly, is the GIL?

April 10, 2026 read post

Most ML code is Python. This is surprising to many performance-oriented engineers coming from non-ML communities. Python is notably slow and has the GIL, which forces it to execute only a single thread at a time. The GIL primarily exists to make CPython’s memory management thread-safe. It has also become a load-bearing part of Python's API: in a classic example of Hyrum's Law, much Python code relies on the thread safety that it has created.

[1]: Funnily enough, if you google "when does Ray release the GIL?" one of the top results is from the Beyblade wiki.
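To make the single-thread claim concrete, here is a minimal sketch, not taken from the post, that runs the same CPU-bound loop serially and then across threads. The function names (`cpu_bound`, `run_serial`, `run_threaded`) and the work sizes are illustrative assumptions. On a standard CPython build with the GIL, the threaded run would be expected to take roughly as long as the serial one; on a free-threaded build the timings may differ.

```python
import threading
import time


def cpu_bound(n):
    """Pure-Python CPU-bound work: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(n, repeats):
    """Run the workload `repeats` times, one after another."""
    return [cpu_bound(n) for _ in range(repeats)]


def run_threaded(n, num_threads):
    """Run the workload once per thread and collect the results."""
    results = [None] * num_threads

    def worker(idx):
        results[idx] = cpu_bound(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    n, workers = 2_000_000, 4  # hypothetical sizes, tune for your machine
    _, serial_s = timed(run_serial, n, workers)
    _, threaded_s = timed(run_threaded, n, workers)
    print(f"serial:   {serial_s:.2f}s")
    print(f"threaded: {threaded_s:.2f}s")
```

Both runs compute identical results; only the wall-clock time is of interest. Timings are machine-dependent, so no expected output is shown.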

Making RL Fast

April 03, 2026 read post

For Olmo 3, I was put in charge of our post-training infrastructure. We made the decision to move from a synchronous RL setup to an asynchronous one to enable us to scale. In doing that work, I was fortunate enough to find a series of optimizations which made our RL setup 4x faster. As we used roughly 250k H100 hours running RL on Olmo 3, these optimizations saved us approximately 750k H100 hours (~$1.5M at current market prices). These changes were detailed in the paper, but I wanted to write more about them here.
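The savings figure follows from simple arithmetic, sketched below. The helper name `hours_saved` is hypothetical, and the price of about $2 per H100 hour is not stated in the excerpt; it is only what the quoted ~$1.5M over 750k hours implies.

```python
def hours_saved(hours_used, speedup):
    """GPU hours avoided, given the hours actually used and the speedup factor.

    Without the speedup, the same work would have taken hours_used * speedup.
    """
    baseline_hours = hours_used * speedup
    return baseline_hours - hours_used


saved = hours_saved(250_000, 4)       # 1,000,000 baseline - 250,000 used
implied_price = 1_500_000 / saved     # assumption: derived from the ~$1.5M figure
print(saved, implied_price)
```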

The Bitter Lesson

June 26, 2025 read post

The Bitter Lesson is an excellent essay which is overwhelmingly misunderstood. The point of the bitter lesson is that, over time, methods which scale with compute will outperform methods that do not.

Request for research: Monte Carlo Tree Search for reasoning, with PUCT

January 21, 2025 read post

In the recent wave of research studying reasoning models, by which we mean models like O1 which are able to use long streams of tokens to "think" and thereby generate better results, MCTS has been discussed a lot as a potentially useful tool. However, some papers, like the DeepSeek R1 paper, have tried MCTS without any success.

RESP advice

November 08, 2024 read post

A few people have asked me for RESP advice, so here is my generic answer, with the disclaimer that I'm not a financial advisor.

Installing Docker on a new VM

February 02, 2024 read post

I consistently run into the same issue when installing Docker on a new Ubuntu VM.

Five years of GPT progress --- Amii talk

May 29, 2023 read post

I recently gave a talk at Amii about the history of GPT models.

Deriving the DALL-E lower bound

April 05, 2023 read post

Five years of GPT progress

March 27, 2023 read post

If you want to read more of my writing, I have a Substack.

How is LLaMa.cpp possible?

March 16, 2023 read post

If you want to read more of my writing, I have a Substack. Articles will be posted simultaneously to both places.

A step towards self-improving LLMs

March 07, 2023 read post

There's a Substack version of this post, if you prefer that over my ~~amateurish~~ artisan HTML.

Papers I've read this week (March 4th, 2023)

March 04, 2023 read post

I’m going to try to write a weekly summary of the most interesting papers I’ve read that week. I’d love to hear what papers you’ve been reading, if you agree/disagree about my conclusions for each paper, and/or suggestions for what papers I should read next!

The Sigmoid: a metaphor for technological progress

March 02, 2023 read post

I regularly reference the “s-curve”, or sigmoid, as a metaphor for progress. Here, I explain what I mean, so that I can just link to this post.

Large language models aren't trained enough.

February 27, 2023 read post

I have a Substack if you want to be notified when I write.

A pure Python (well, Numpy) implementation of back-propagation

January 29, 2023 read post

I realized over the weekend that, unfortunately, I didn't know how back-propagation actually works (I just relied on JAX to do it for me).

Pointer Networks

September 20, 2017 read post

Link to paper [arXiv], [code].

Do deep networks generalise or just memorise?

July 04, 2017 read post

There's a brilliant paper out of Google Brain [1] which claims that DNNs just memorise the training data, and a response [2] which claims that they don't.

Outrageously Large Neural Networks: The sparsely-gated Mixture-of-Experts layer

July 01, 2017 read post

Abstract

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

June 20, 2017 read post

Abstract

Random Search for Hyper-Parameter Optimization

March 01, 2017 read post

Abstract

Useful Bash One-liners

January 20, 2017 read post

I have a file in my home folder that contains Bash one-liners that I use regularly (I'm a huge nerd, naturally). I found most of them elsewhere online; I wrote very few of these from scratch.

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

January 03, 2017 read post

Abstract

Larry Ellison on consulting costs

December 06, 2016 read post

I'm currently reading Softwar, a book about Oracle's rise. The book is brilliant, and it describes at length Larry Ellison's sales process. There was a passage describing a meeting that Larry had that explains far more about enterprise sales than it should:

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5mb model size

November 10, 2016 read post

Abstract

Conditional image synthesis with auxiliary classifier GANs

November 08, 2016 read post

Abstract

Generative Adversarial Imitation Learning

November 08, 2016 read post

Abstract

Minimal example of how to do model selection in Python

October 26, 2016 read post

I've had a few people ask me how to do model selection correctly. Here's a minimal example with sklearn in Python.

Representation Learning: A Review and New Perspectives

October 20, 2016 read post

Abstract

Full Resolution Image Compression with Recurrent Neural Networks

October 19, 2016 read post

Abstract

Generative Adversarial Networks and Actor-Critic methods

October 19, 2016 read post

Abstract

Using simulated data to train robots

October 18, 2016 read post

Abstract

Safe and Efficient Off-Policy Reinforcement Learning

October 18, 2016 read post

Abstract

XGBoost: A scalable tree boosting system

September 20, 2016 read post

Abstract

Excellent description of how hashtables work

August 15, 2015 read post

I'm working through the Algorithm Design Manual to improve the efficiency of my coding.

Full example for using JSONcpp on Unix

September 06, 2014 read post

I've been trying to parse JSON files with C++, and I've found a distinct lack of full examples on how to do so. Specifically, I've struggled to find the proper commands to actually compile the code. For future reference (and to help any beginners out), here's a full example of how to use JSONcpp in your code (N.B. You're supposed to enter all of the following code in your terminal).

ARIMA, ARMA, what's the difference?

April 21, 2014 read post

I'm working through TSA, and I noticed that some of my classmates are struggling to understand the difference between an ARIMA process, an AR process, and an MA process, not to mention seasonal versions of the above.

Solving Partial Autocorrelation Functions

March 03, 2014 read post

I've been studying time series through TSA. The book presents a structured approach to time series analysis, and covers the material fairly well; I was impressed with the description of what a partial autocorrelation function (PACF) is, as the book explained it more intuitively than the lecture notes did. I did find the description of how to actually solve for the PACF a bit confusing, so I wrote my own explanation.
