Planet Musings

March 28, 2026

John Baez: Geometry and the Exceptional Jordan Algebra

I’m giving a talk online tomorrow at the 2026 Spring Southeastern Sectional Meeting of the American Mathematical Society, in the Special Session on Non-Associative Rings and Algebras. The organizers are Layla Sorkatti and Kenneth Price. I doubt the talk will be recorded, but here are my slides:

Projective geometry and the exceptional Jordan algebra.

Abstract. Dubois-Violette and Todorov noticed that the gauge group of the Standard Model of particle physics is the intersection of two maximal subgroups of \text{F}_4, which is the automorphism group of the exceptional Jordan algebra \mathfrak{h}_3(\mathbb{O}). Here we conjecture that these can be taken to be any subgroups preserving copies of \mathfrak{h}_2(\mathbb{O}) and \mathfrak{h}_3(\mathbb{C}) that intersect in a copy of \mathfrak{h}_2(\mathbb{C}). Given this, we show that the Standard Model gauge group consists of all isometries of the octonionic projective plane that preserve an octonionic projective line and a complex projective plane intersecting in a complex projective line. This is joint work with Paul Schwahn.

This is an introductory talk for mathematicians. Physicists may prefer the two talks here. Those go much further in some ways, but they don’t cover the new ideas that Paul Schwahn and I are in the midst of working on.

March 27, 2026

Matt von Hippel: Trust Is a Tree

Scientists trust what they think they can verify.

In principle, you can work your way through the proof of every mathematical theorem. With enough money and time, you could replicate every experiment. For every expert opinion, you could dig through the literature and find how it was justified.

And while a scientist can’t actually do that for every field, they might be able to for the ones they care about most. In your specialty, you probably can check the logic behind every claim. And you know that enough people try that you can trust your colleagues’ work.

As a science journalist, most of the time, you can’t do those checks. You don’t even pretend you can. Instead, you build trust, like a tree.

You start with a grounding. A former scientist might trust their former colleagues, people they trusted, as a scientist, to do (and know) good work. A non-scientist has to start somewhere else. They might use prestige, looking up those tenured folks at Harvard or Princeton or Stanford. They might look to who other journalists trusted, scientists who’ve already been in the news. They might track journals or roles, assuming that a publication in Nature, or a position on a national grant committee, has a special meaning.

And if things stopped there, it would be a pretty elitist system. It still can be, and often is. But there is another step, which softens it.

The trust builds.

When I want to know if a paper in an unfamiliar field makes sense, if it’s worth covering, I try to ask someone I trust. Sometimes, they don’t know, and shrug. Other, more useful, times, they don’t know, but they have a suggestion: someone they trust, who can give me the answer.

And so I ask the new person, and now I trust someone more.

And suppose the new person says the new paper is good, and worth covering, good science and all that jazz.

Well, now I can trust its authors too, right?

So when the next paper comes, I now don’t just have that first someone. I have the person they recommended, and the authors of the previous paper.

The trust builds out, and up, like branches on a tree.

March 26, 2026

Tim Gowers: Group and semigroup puzzles and a possible Polymath project

An Artin-Tits group is a group with a finite set of generators a_1,\dots,a_k in which every relation is of the form (ab)^r=(ba)^r or (ab)^ra=(ba)^rb for some positive integer r, where a and b are two of the generators. In particular, the commutation relation ab=ba is allowed (it is the case r=1 of the first type of relation) and so is the braid relation aba=bab (it is the case r=1 of the second type of relation). This means that Artin-Tits groups include free groups, free Abelian groups, and braid groups: for example, the braid group on k+1 strands has a presentation with generators a_1,\dots,a_k, where a_i represents a twist of the ith and (i+1)st strands, and relations a_ia_j=a_ja_i if |i-j|>1 and a_ia_{i+1}a_i=a_{i+1}a_ia_{i+1}.

A few weeks ago, I asked ChatGPT for a simple example of a word problem for groups or semigroups that was not known to be decidable and also not known to be undecidable. It turns out that the word problem for many Artin-Tits groups comes into that category: the simplest example where the status is not known is the group with four generators a,b,c and d where c and d commute and all other pairs of generators satisfy the braid relation xyx=yxy.

My interest in this was initially that I was looking for a toy model from which one could learn something about how mathematicians judge theorems to be interesting. I started with a remarkable semigroup discovered by G. S. Tseytin that has five generators, seven very simple relations, and an undecidable word problem. From that I created a puzzle game that you might be interested to play. (NB the puzzles were not showing up in the public version, but that should be sorted out now.) Each puzzle is equivalent to an instance of the word problem for Tseytin’s semigroup, but the interface makes it much more convenient to change words using the relations than it would be to do it with pen and paper.

I was hoping that because every mathematical problem can in principle be encoded as a puzzle in this game, one might be able to build a sort of “alien mathematics”, where a theorem was an equality between words, and a definition was a decision to introduce a new (and redundant) generator g, together with a relation of the form g = w, where w is a word in the current alphabet. Theorems would be particularly interesting if they were equalities between words that could be established only by a chain of equalities that went through much longer words, and definitions would be useful if the new generators satisfied particularly concise relations (which would allow one to build “theories” within the system). I still hope to find a word problem that will allow such a project to take off, but in the end, after a lot of playing around with the game linked to above, I have decided that Tseytin’s semigroup is not suitable. The reason is that it is designed so that arbitrary instances of the word problem for groups can be encoded as word problems in this semigroup, and once one gets used to the game, one starts to see how that encoding can work. Furthermore, one seems to be driven towards the encoding — I don’t get the impression that there’s a whole other region of this semigroup to explore that has nothing to do with the kinds of words that come up in the encoding. And if that impression is correct, then one might as well start with the word problem for some group in unencoded form, or alternatively look for another semigroup. Nevertheless, I find the Tseytin game quite enjoyable: I won’t say more about it here but have written a fairly comprehensive tutorial that you can open up and read if you follow the link above.

This is perhaps the moment to say that the words “I created a puzzle game” are slightly misleading. For one thing, I discussed the idea of gamifying Tseytin’s semigroup about three years ago with Mirek Olšák, a former member of my automatic theorem proving group, and he created a basic prototype in Python. But the main point is that I do not have the programming skills to create a game that can be played in a web browser — I vibecoded it using ChatGPT.

After that experience, I thought that maybe I would have better luck if instead of looking for a group or semigroup with undecidable word problem, which might well have been explicitly designed with some encoding in mind, I looked for a word problem for which the decidability status was unknown. That way, it wouldn’t have been designed to be undecidable, but might nevertheless just happen to be undecidable and provide a nice playground of the kind I was (and still am) after. And that is what led me to the Artin-Tits groups.

However, those don’t seem to be suitable either, because it is conjectured that they all have decidable word problems. I have created a game for the Artin-Tits group mentioned above, which you can also play if you want, but I have found it very hard to create interesting puzzles. (NB again there was a problem with the puzzles showing up, with only a tiny handful being there, but that should now be sorted out.) That is, I found it difficult to find words that are equivalent to the identity but that are not easily shown to be equivalent to the identity. One nice example comes from the fact that the subgroup generated by a,c and d is isomorphic to the braid group B_4 with four strands. The late Patrick Dehornoy found a very nice example of a braid with four strands that is equal to the identity but not in a completely trivial way. A picture of it can be found on page 4 of this paper.

This is where a potential Polymath project comes in. An initial goal would be to determine the decidability status of this one small Artin-Tits group: if we managed that, then we could consider the more general problem. And the way I envisage approaching this initial goal is an iterative process that runs as follows.

  1. Devise an algorithm A_0 for solving word problems in the group.
  2. If the current algorithm is A_i, then search for a puzzle P that A_i fails to solve.
  3. If the search is successful, then devise an algorithm A_{i+1} that solves all the puzzles that A_i solves, and also solves P. Make A_{i+1} the current algorithm and return to step 2. Devising A_{i+1} can be done by playing the game with puzzle P several times until one has the feeling that one has solved it in a systematic way.
  4. If the search is unsuccessful and the current algorithm is A, then attempt to prove inductively that A always succeeds: that is, to prove that if A solves P and Q is obtained from P by applying a single relation in the group, then A solves Q.
  5. (Maybe) If the inductive step doesn’t seem to work, then try to use that failure to devise a puzzle that defeats A and return to step 3.

I have done a couple of iterations already, with the result that I now have an algorithm — let’s call it A_0 — that solves (basically instantaneously) every puzzle I throw at it, including the one derived from Dehornoy’s braid mentioned above. So a subgoal of the initial goal is to find a puzzle that has a solution but that A_0 doesn’t solve. If we can’t, then maybe it is worth trying to prove that A_0 solves the word problem for this group. One comment about the algorithm A_0 is that it never increases the length of a word, though it often does moves that preserve the length. I was told by ChatGPT that it is an unsolved problem whether every braid that is equal to the identity can be reduced to the identity without ever increasing the number of crossings. Of course that isn’t a guarantee that it really is unsolved, but if it is, then that’s another interesting problem. It also means that I would be extremely interested if someone found an example of a word in the Artin-Tits group that can be reduced to the identity but only if one starts by lengthening it.

I’m hoping that this will be an enjoyable project for people who like both mathematics and programming. On the maths side, there is an unsolved problem to think about, and on the programming side there are lots of possibilities: for example, one could write programs that explore the space of words equal to the identity, trying to do so in such a way that there is a reasonable chance of reaching a word where it isn’t obvious how to reverse all the steps and get back to the identity. The game comes with a “sandbox” that has a few tools for generating words, but at the moment it is fairly primitive, and I would welcome suggestions for how to improve it.

It seems to me that as Polymath projects go, one can’t really lose: either we find an algorithm for the word problem, which establishes an unknown (at least according to ChatGPT) case of a decidability problem, or we find a suite of harder and harder puzzles, creating a more and more challenging and entertaining game and obtaining a deeper understanding of the Artin-Tits group in the process.

This Artin-Tits game has only a rather rudimentary and not very good tutorial created by ChatGPT. It’s easy to play once you’ve got the hang of it, and in the end I think the easiest way to get the hang of it is to watch someone else play it for a couple of minutes. I have therefore created a video tutorial, which you can find here. The video itself lasts about 25 minutes, but the tutorial part is under 10 minutes: what makes the video longer is an explanation of various features for making it easier to create puzzles, and also an explanation of how the algorithm works, which again is much easier to explain if I can demonstrate it on screen than if I have to write down some text. (I do have some text, since that is what I used as a prompt for ChatGPT to implement the algorithm, but it took a few iterations to get it working properly, so now I’m not sure what the quality of the code will be like, or even whether it is doing exactly what I want, though it appears to be.) Although I have embedded the video into this post, if you actually want to see what is going on you will probably need to watch it on YouTube using the full screen. I recommend not watching the explanation of the algorithm until you have played the game a few times first. Also, I should warn you that the games in the Advanced category are not all soluble, but games 1, 5 and 6 definitely are. Another thing I forgot to say on the video is that if you want you can “rotate” a word by clicking on one of its letters and dragging it to the right or left. If the word is equal to the identity, then its rotations will be as well, so this is a valid move. There is also a button for disabling this rotation facility if you want the puzzle to be slightly more challenging.

A quick note about the availability of the two games. They are hosted on the Netlify platform. I am on their free plan, which gives me a certain number of “credits” each month. I’m not sure how quickly these will run out, largely because I have no idea how many people will actually think it interesting to play the games. If they do run out, then the games will cease to be available until the credits are renewed, which for me happens on the 10th of each month. If this has happened and you are keen to play one of the games, another option is to download the html files and open them in your browser. Here is a link to the Artin-Tits game and here is one to the Tseytin game. If you are feeling particularly public-spirited, especially if you think you will play quite a lot, then you might consider doing that anyway, so that the Netlify credits run down more slowly. If the running out of the credits is quick enough for all this to be a real issue, then I may move to a paid plan.

I’ll finish with a quick tip for playing the Artin-Tits game, which I mentioned on the video but perhaps didn’t stress enough. Many of the moves consist in selecting three consecutive letters and using a relation of the form xyx=yxy to change them. Easy examples of this are replacement rules such as aba\to bab. But what if inverses are involved? I’ll represent inverses of generators with upper-case letters, so for example abA represents aba^{-1}, which in the game would be a white a followed by a white b followed by a black a. The word abA turns out to equal Bab. To remember this, a simple rule is that two letters of the same colour can be bracketed together and “pushed past” the third letter, which retains its colour but changes its value. Here, for example, we write abA as (ab)A and then swap them over, changing A into B in the process, getting B(ab), or Bab without the brackets. In group theoretic terms, this is of course saying that A and B are conjugates, and that ab is what conjugates one to the other. But when playing the game it is convenient to remember it by thinking that when you see a subword such as bDB, you can push the DB (and in particular the D) to the left, getting DBd.
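
To make the rewriting moves above concrete, here is a minimal Python sketch (my own illustration, not the vibecoded game itself and much cruder than the algorithm described in the post): it searches for a reduction of a word to the identity using only never-lengthening moves, namely free cancellation, the c/d commutation, the braid move such as aba \to bab (and its all-inverse version), and the conjugation pushes such as abA \to Bab. As in the post, lower-case letters are generators and upper-case letters their inverses.

# A brute-force, never-lengthening search for the word problem in the Artin-Tits
# group with generators a, b, c, d, where c and d commute and every other pair
# satisfies the braid relation xyx = yxy.  Because no move ever lengthens a word,
# this sketch cannot find solutions that would require first lengthening a word,
# which is exactly the open issue mentioned in the post.
from collections import deque

def inv(ch):
    return ch.swapcase()

def braid(x, y):
    s = {x.lower(), y.lower()}
    return len(s) == 2 and s != {"c", "d"}   # every pair braids except {c, d}

def commute(x, y):
    return {x.lower(), y.lower()} == {"c", "d"}

def moves(w):
    """All words reachable from w by one legal, never-lengthening move."""
    out = set()
    for i in range(len(w) - 1):
        if w[i + 1] == inv(w[i]):                       # xX -> (empty)
            out.add(w[:i] + w[i + 2:])
        if commute(w[i], w[i + 1]):                     # cd <-> dc (any signs)
            out.add(w[:i] + w[i + 1] + w[i] + w[i + 2:])
    for i in range(len(w) - 2):
        u, v, t = w[i], w[i + 1], w[i + 2]
        if not braid(u, v):
            continue
        same_case = u.islower() == v.islower()
        if same_case and t == u:                        # xyx -> yxy, XYX -> YXY
            out.add(w[:i] + v + u + v + w[i + 3:])
        if t == inv(u):
            if same_case:                               # e.g. abA -> Bab
                out.add(w[:i] + inv(v) + u + v + w[i + 3:])
            else:                                       # e.g. aBA -> BAb
                out.add(w[:i] + v + inv(u) + inv(v) + w[i + 3:])
    return out

def reduces_to_identity(word, max_states=200_000):
    """Breadth-first search over never-lengthening moves."""
    seen, queue = {word}, deque([word])
    while queue and len(seen) < max_states:
        w = queue.popleft()
        if w == "":
            return True
        for nxt in moves(w):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False   # not found within the search limit (not a proof of non-triviality)

# The identity abA = Bab from the tip above, written as "abA times the inverse of Bab":
print(reduces_to_identity("abABAb"))   # expected: True
# The commutator of the commuting pair c, d:
print(reduces_to_identity("cdCD"))     # expected: True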

Doug Natelson: AI and the practice of theoretical physics

Matthew Schwartz of Harvard has made a big recent splash, between his public Aspen talk "10000 Einsteins" a year ago about the role of AI and the future of physics, his talk last week at the APS Global Physics Summit on the same topic, and now this essay, "Vibe Physics: The AI Grad Student", on the website of Anthropic (producers of the AI tool Claude).

The essay talks about how Prof. Schwartz used Claude to write this paper, and he states that the AI tool functions roughly like a 2nd year grad student (one who also doesn't get tired or complain, but does need close checking and supervision). The claim is that with this approach to doing calculations and writing papers, he was able to come out with a piece of work that would've taken literally ten times longer if done by working with a human student. Note that he's not exactly unbiased, and he concludes his essay (on Anthropic's site) saying you should spend the $20/month Claude subscription fee and it will change your life.

There is no doubt that AI tools can speed up certain kinds of work, and there is every hope that applying this in science will lead to an increased pace of progress. That said, right now these tools are (unsurprisingly) best at working in areas that are well-known and explored - one of my colleagues has tried applying these to really underexplored higher-dimensional problems, and they're much less effective there. The essay's claim that "LLMs are profoundly creative" is provocative. There is also no discussion here about the cost of these tools, in financial, energy, and environmental terms.

Still, Schwartz raises many questions about the future of the field and graduate education in general. (His paragraph about how human beings will still be needed in science for getting experimental data, at least for a while, is really something.) University research is not just about answering scholarly questions; it's about educating people. Maybe some faculty will revel in writing papers without that kind of interaction, but somehow I don't think we're quite at the stage yet where we don't need to worry anymore about training experts in technical fields. I do agree that it's good advice for everyone to pay close attention to where these capabilities are going. We certainly live in interesting times.


March 25, 2026

Terence Tao: Local Bernstein theory, and lower bounds for Lebesgue constants

I’ve just uploaded to the arXiv my paper “Local Bernstein theory, and lower bounds for Lebesgue constants”. This paper was initially motivated by a problem of Erdős on Lagrange interpolation, but in the course of solving that problem, I ended up modifying some very classical arguments of Bernstein and his contemporaries (Boas, Duffin, Schaeffer, Riesz, etc.) to obtain “local” versions of these classical “Bernstein-type inequalities” that may be of independent interest.

Bernstein proved many estimates concerning the derivatives of polynomials, trigonometric polynomials, and entire functions of exponential type, but perhaps his most famous inequality in this direction is:

Lemma 1 (Bernstein’s inequality for trigonometric polynomials) Let {P: {\bf R} \rightarrow {\bf C}} be a trigonometric polynomial of degree at most {n}, with {|P(x)| \leq A} for all {x}. Then {|P'(x)| \leq n A} for all {x}.

Similar inequalities concerning {L^p} norms of derivatives of Littlewood-Paley components of functions are now ubiquitous in the modern theory of nonlinear dispersive PDE (where they are also called Bernstein estimates), but this will not be the focus of this current post.
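
As a quick numerical sanity check of Lemma 1 (my own illustration, not part of the paper), one can build a random trigonometric polynomial of degree n and compare the maximum of |P'| with n times the maximum of |P| on a fine grid; the grid maxima only approximate the true suprema, so this is heuristic rather than a proof.

import numpy as np

rng = np.random.default_rng(0)
n = 12
a, b = rng.normal(size=n + 1), rng.normal(size=n + 1)   # random Fourier coefficients
x = np.linspace(0, 2 * np.pi, 20001)
k = np.arange(n + 1)[:, None]

P  = (a[:, None] * np.cos(k * x) + b[:, None] * np.sin(k * x)).sum(axis=0)
dP = (-a[:, None] * k * np.sin(k * x) + b[:, None] * k * np.cos(k * x)).sum(axis=0)

print(np.abs(dP).max(), "<=", n * np.abs(P).max())   # Bernstein: |P'| <= n * max|P|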

A trigonometric polynomial {P} of degree {n} is of exponential type {n} in the sense that {P(z) = O(\exp(n|z|))} for complex {z}. Bernstein in fact proved a more general result:

Lemma 2 (Bernstein’s inequality for functions of exponential type) Let {f: {\bf C} \rightarrow {\bf C}} be an entire function of exponential type at most {\lambda}, with {|f(x)| \leq A} for all {x \in {\bf R}}. Then {|f'(x)| \leq \lambda A} for all {x \in {\bf R}}.

There are several proofs of this lemma – see for instance this survey of Queffélec and Zarouf. In the case that {f} is real-valued on {{\bf R}}, there is a nice proof by Duffin and Schaeffer, which we sketch as follows. Suppose we normalize {A=\lambda=1}, and adjust {f} by a suitable damping factor so that {f(z)} actually decays slower than {\exp(|z|)} as {z \rightarrow \infty}. Then, for any {0 < \alpha < 1} and {x_0 \in {\bf R}}, one can use Rouché’s theorem to show that the function {\cos(x-x_0) - \alpha f(x)} has the same number of zeroes as {\cos(x-x_0)} in a suitable large rectangle; but on the other hand one can use the intermediate value theorem to show that {\cos(x-x_0) - \alpha f(x)} has at least as many zeroes as {\cos(x-x_0)} in the same rectangle. Among other things, this prevents double zeroes from occurring, which turns out to give the desired claim {|f'(x)| \leq 1} after some routine calculations (in fact one obtains the stronger bound {|f(x)|^2 + |f'(x)|^2 \leq 1} for all real {x}).

The first main result of the paper is to obtain localized versions of Lemma 2 (as well as some related estimates). Roughly speaking, these estimates assert that if {f} is holomorphic on a wide thin rectangle passing through the real axis, is bounded by {A} on the intersection of the real axis with this rectangle, and is “locally of exponential type” in the sense that it is bounded by {O(\exp( \lambda |\mathrm{Im} z|))} on the upper and lower edges of this rectangle (and obeys some very mild growth conditions on the remaining sides of this rectangle), then {|f'(x)|} can be bounded by {\lambda A} plus small errors on the real line, with some additional estimates away from the real line also available. The proof proceeds by a modification of the Duffin–Schaeffer argument, together with the two-constant theorem of Nevanlinna (and some standard estimates of harmonic measures on rectangles) to deal with the effect of the localization. (As a side note, this latter argument was provided to me by ChatGPT, as I was not previously aware of the Nevanlinna two-constant theorem.)

Once one localizes this “Bernstein theory”, it becomes suitable for the analysis of (real-rooted, monic) polynomials {P} of a high degree {n}, which are not bounded globally on {{\bf R}} (and grow polynomially rather than exponentially at infinity), but which can exhibit “local exponential type” behavior on various intervals, particularly in regions where the logarithmic potential

\displaystyle  U_\mu(z) := \frac{1}{n} \log \frac{1}{|P(z)|} = \int \log \frac{1}{|z-x|}\ d\mu(x)

behaves like a smooth function (here {\mu = \frac{1}{n} \sum_{j=1}^n \delta_{x_j}} is the empirical measure of the roots {x_1,\dots,x_n} of {P}). A key example is the (monic) Chebyshev polynomials {2^{1-n} T_n(x)}, which locally behave like sinusoids on the interval {[-1,1]} (and are locally of exponential type above and below this interval).

This becomes relevant in the theory of Lagrange interpolation. Recall that if {x_1 < \dots < x_n} are real numbers and {Q} is a polynomial of degree less than {n} then one has the interpolation formula

\displaystyle  Q(z) = \sum_{k=1}^n Q(x_k) \ell_k(z)

where the Lagrange basis functions {\ell_k(z)} are defined by the formula

\displaystyle  \ell_k(z) := \prod_{i \neq k} \frac{z - x_i}{x_k - x_i}.

In terms of the monic polynomial {P(z) := \prod_{j=1}^n (z-x_j)}, we can write

\displaystyle  \ell_k(z) = \frac{P(z)}{(z-x_k) P'(x_k)}.

The stability and convergence properties of Lagrange interpolation are closely related to the Lebesgue function

\displaystyle  \Lambda(z) := \sum_{k=1}^n |\ell_k(z)|,

and for a given interval {I}, the quantity {\sup_{x \in I} \Lambda(x)} is known as the Lebesgue constant for that interval.
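
As a small numerical illustration of these definitions (mine, not from the paper), one can compute the Lebesgue function for the Chebyshev nodes on [-1,1] and compare its maximum on a grid with the {\frac{2}{\pi} \log n} main term discussed below; the grid maximum only approximates the true Lebesgue constant.

import numpy as np

n = 40
j = np.arange(1, n + 1)
nodes = np.cos((2 * j - 1) * np.pi / (2 * n))           # roots of the Chebyshev polynomial T_n
x = np.linspace(-1, 1, 20001)

Lambda = np.zeros_like(x)
for k in range(n):
    others = np.delete(nodes, k)
    ell_k = np.prod((x[:, None] - others) / (nodes[k] - others), axis=1)   # Lagrange basis function
    Lambda += np.abs(ell_k)

print(Lambda.max(), (2 / np.pi) * np.log(n))            # grid maximum vs. the (2/pi) log n main term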

If one chooses the interpolation points {x_1,\dots,x_n} poorly, then the Lebesgue constant can be extremely large. However, if one selects these points to be the roots of the aforementioned monic Chebyshev polynomials, then it is known that {\sup_{x \in I} \Lambda(x) = \frac{2}{\pi} \log n - O(1)} for all fixed intervals {I} in {[-1,1]}. In the case {I = [-1,1]}, it was shown by Erdős that this is the best possible value of the Lebesgue constant up to {O(1)} errors for interpolation on {[-1,1]}, thus

\displaystyle  \sup_{x \in [-1,1]} \Lambda(x) \geq \frac{2}{\pi} \log n - O(1)

whenever {-1 \leq x_1 < \dots < x_n \leq 1} (a more precise bound was later shown by Vertesi). Erdős and Turán then asked if the same lower bound

\displaystyle  \sup_{x \in I} \Lambda(x) \geq \frac{2}{\pi} \log n - O(1) \ \ \ \ \ (1)

held for more general intervals {I}. This is shown in our paper; a variant integral bound

\displaystyle  \int_I \Lambda(x)\ dx \geq \frac{4}{\pi^2} |I| \log n - o(\log n) \ \ \ \ \ (2)

is also established, answering a separate question of Erdős. These lower bounds had previously been obtained up to constants by Erdős and Szabados; the main issue was to obtain the sharp constant in the main term.

In terms of the monic polynomial {P}, these two estimates can be written as

\displaystyle  \sup_{x \in I} \sum_{k=1}^n \frac{|P(x)|}{|x-x_k||P'(x_k)|} \geq \frac{2}{\pi} \log n - O(1)

and

\displaystyle  \int_I \sum_{k=1}^n \frac{|P(x)|}{|x-x_k||P'(x_k)|}\ dx \geq \frac{4}{\pi^2} |I| \log n - o(\log n).

Using the intuition that {P} should behave locally like a trigonometric polynomial, and performing some renormalizations, one can extract the following toy problem to work with first:

Problem 3 Let {P: {\bf R} \rightarrow {\bf R}} be a trigonometric polynomial of degree {n} with {2n} roots {x_1 < \dots < x_{2n}} in {[0, 2\pi)}.
  • (i) Show that

    \displaystyle  \sup_{x \in [0,2\pi)} \sum_{k=1}^{2n} \frac{|P(x)|}{|P'(x_k)|} \geq 2. \ \ \ \ \ (3)

  • (ii) Show that

    \displaystyle  \int_0^{2\pi} \sum_{k=1}^{2n} \frac{|P(x)|}{|P'(x_k)|}\ dx \geq 8. \ \ \ \ \ (4)

It is easy to check that the lower bounds of {2} and {8} are sharp by considering the case when {P} is a sinusoid {P(x) = A \sin(n(x-x_0))}.
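
Here is a quick numerical check of that sharpness claim (my own illustration): for the sinusoid {P(x) = A \sin(n(x-x_0))} one has {|P'(x_k)| = An} at each of the {2n} roots, and the quantities in (3) and (4) come out to {2} and {8} on a fine grid.

import numpy as np

A, n, x0 = 1.7, 5, 0.3
roots = x0 + np.arange(2 * n) * np.pi / n          # the 2n roots of sin(n(x - x0)) in one period
x = np.linspace(0, 2 * np.pi, 100001)
P = A * np.sin(n * (x - x0))
dP_at_roots = A * n * np.cos(n * (roots - x0))     # equals +-A*n at every root

summand = np.abs(P)[:, None] / np.abs(dP_at_roots)
total = summand.sum(axis=1)                        # sum_k |P(x)| / |P'(x_k)|
print(total.max())                                 # close to 2, matching (3)
print(total.sum() * (x[1] - x[0]))                 # Riemann sum over [0, 2*pi], close to 8, matching (4)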

The bound (3) is immediate from Bernstein’s inequality (Lemma 1). By applying a local version of this inequality, I was able to get a weak version of the claim (1) in which {O(1)} was replaced with {o(\log n)}; see this early version of the paper, which was developed through conversations with Nat Sothanaphan and Aron Bhalla. By combining this argument with ideas from the older work of Erdős, I was able to establish (1).

The bound (2) took me longer to establish, and involved a non-trivial amount of playing around with AI tools, the story of which I would like to share here. I had discovered the toy problem (4), but initially was not able to establish this inequality; AlphaEvolve seemed to confirm it numerically (with sinusoids appearing to be the extremizer), but did not offer direct clues on how to prove this rigorously. At some point I realized that the left-hand side factorized into the expressions {\int_0^{2\pi} |P(x)|\ dx} and {\sum_{k=1}^{2n} \frac{1}{|P'(x_k)|}}, and tried to bound these expressions separately. Perturbing around a sinusoid {A \sin(n(x-x_0))}, I was able to show that the {L^1} norm {\int_0^{2\pi} |P(x)|\ dx} was a local minimum as long as one only perturbed by lower order Fourier modes, keeping the frequency {n} coefficients unchanged. Guessing that this local minimum was actually a global minimum, this led me to conjecture the general lower bound

\displaystyle  \int_0^{2\pi} |P(x)|\ dx \geq 4 |a_n+ib_n|

whenever {P} was a degree {n} trigonometric polynomial with highest frequency components {a_n \cos(nx) + b_n \sin(nx)}. AlphaEvolve numerically confirmed this inequality to be likely to be true also. I still did not see how to prove this inequality, but I decided to try my luck giving it to ChatGPT Pro, which recognized it as an {L^1} approximation problem and gave me a duality-based proof (based ultimately on the Fourier expansion of the square wave). With some further discussion, I was able to adapt this proof to functions of global exponential type (replacing the Fourier manipulations with contour shifting arguments, in the spirit of the Paley-Wiener theorem), which roughly speaking gave me half of what I needed to establish (2). However, I still needed the matching lower bound

\displaystyle  \sum_{k=1}^{2n} \frac{1}{|P'(x_k)|} \geq \frac{2}{|a_n+ib_n|}

on the other factor in the toy problem. Again, AlphaEvolve could numerically confirm that this inequality was likely to be true, but now the quantity I was trying to control did not look convex or linear in {P}, and so the previous duality method did not seem to apply. At this point I switched to pen and paper; eventually I realized that the expression almost looked like a sum of residues, and eventually after playing around with contour integrals of {\frac{e^{inz}}{P(z)}} using the residue theorem I was able to establish (4), and then with a bit more (human) effort I could move from the toy problem back to the original problem to obtain (2). Quite possibly AI tools would also have been able to assist with these steps, but they were not necessary here; their main value for me was in quickly confirming that the approach I had in mind was numerically plausible, and in recognizing the right technique to solve one part of the toy problem I had isolated. (I also used AI tools for several other secondary tasks, such as literature review, proofreading, and generating pictures, but these applications of AI have matured to the point where using them for this purpose is almost mundane.)
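
As a small numerical illustration (mine, not from the paper) of the first of the two displayed bounds above, one can check the {L^1} lower bound {\int_0^{2\pi} |P(x)|\ dx \geq 4|a_n+ib_n|} on a few random trigonometric polynomials of degree {n}.

import numpy as np

rng = np.random.default_rng(4)
n = 8
x = np.linspace(0, 2 * np.pi, 200001)
k = np.arange(n + 1)[:, None]

for _ in range(5):
    a, b = rng.normal(size=n + 1), rng.normal(size=n + 1)
    P = (a[:, None] * np.cos(k * x) + b[:, None] * np.sin(k * x)).sum(axis=0)
    lhs = np.abs(P).sum() * (x[1] - x[0])           # Riemann sum for the integral of |P| over a period
    rhs = 4 * abs(a[n] + 1j * b[n])                 # 4 |a_n + i b_n|
    print(lhs >= rhs, round(lhs, 3), round(rhs, 3))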

March 24, 2026

John Preskill: Has quantum advantage been achieved? Part 2: Considering the evidence

Welcome back to: Has quantum advantage been achieved?

In Part 1 of this mini-series on quantum advantage demonstrations, I told you about the idea of random circuit sampling (RCS) and the experimental implementations thereof. In this post, Part 2 out of 3, I will discuss the arguments and evidence for why I am convinced that the experiments demonstrate a quantum advantage.

Recall from Part 1 that to assess an experimental quantum advantage claim we need to check three criteria:

  1. Does the experiment correctly solve a computational task?
  2. Does it achieve a scalable advantage over classical computation?
  3. Does it achieve an in-practice advantage over the best classical attempt at solving the task?

What’s the issue?

When assessing these criteria for the RCS experiments there is an important problem: The early quantum computers we ran them on were very far from being reliable and the computation was significantly corrupted by noise. How should we interpret this noisy data? Or more concisely:

  1. Is random circuit sampling still classically hard even when we allow for whatever amount of noise the actual experiments had?
  2. Can we be convinced from the experimental data that this task has actually been solved?

I want to convince you today that we have developed a very good understanding of these questions that gives a solid underpinning to the advantage claim. Developing that understanding required a mix of methodologies from different areas of science, including theoretical computer science, algorithm design, and physics, and has been an exciting journey over the past years.

The noisy sampling task

Let us start by answering the base question. What computational task did the experiments actually solve?

Recall that, in the ideal RCS scenario, we are given a random circuit C on n qubits and the task is to sample from the output distribution of the state \ket{C} obtained from applying the circuit C to a simple reference state. The output probability distribution of this state is determined by the Born rule when I measure every qubit in a fixed choice of basis.

Now what does a noisy quantum computer do when I execute all the gates on it and apply them to its state? Well, it prepares a noisy version \rho_C of the intended state \ket{C}, and once I measure the qubits, I obtain samples from the output distribution of that noisy state.

We should not make the task dependent on the specifics of that state or the noise that determined it, but we can define a computational task based on this observation by fixing how accurate that noisy state preparation is. The natural way to do this is to use the fidelity

F(C) = \bra{C} \rho_C \ket{C},

which is just the overlap between the ideal state and the noisy state. The fidelity is 1 if the noisy state is equal to the ideal state, and 0 if it is perfectly orthogonal to it.
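
As a tiny concrete example (my own, not from the post): under global depolarizing noise, \rho_C = (1-\lambda)\ket{C}\bra{C} + \lambda I/2^n, this fidelity works out to F(C) = (1-\lambda) + \lambda/2^n, interpolating between 1 for no noise and the trivial value 2^{-n} for the fully mixed state.

import numpy as np

n, lam = 4, 0.3
D = 2 ** n
rng = np.random.default_rng(2)
psi = rng.normal(size=D) + 1j * rng.normal(size=D)
psi /= np.linalg.norm(psi)                    # stand-in for the ideal state |C>

rho = (1 - lam) * np.outer(psi, psi.conj()) + lam * np.eye(D) / D   # globally depolarized state
F = np.real(np.conj(psi) @ rho @ psi)         # fidelity <C| rho_C |C>
print(F, (1 - lam) + lam / D)                 # the two values agree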

Finite-fidelity random circuit sampling
Given a typical random circuit C, sample from the output distribution of any quantum state whose fidelity with the ideal output state \ket{C} is at least \delta.

Note that finite-fidelity RCS does not demand success for every circuit, but only for typical circuits from the random circuit ensemble. This matches what the experiments do: they draw random circuits and need the device to perform well on the overwhelming majority of those draws. Accordingly, when the experiments quote a single number as “fidelity”, it is really the typical (or, more precisely, circuit-averaged) fidelity that I will just call F.

The noisy experiments claim to have solved finite-fidelity RCS for values of \delta around 0.1%. What is more, they consistently achieve this value even as the circuit sizes are increased in the later experiments. Both the actual value and the scaling will be important later.

What is the complexity of finite-fidelity RCS?

Quantum advantage of finite-fidelity RCS

Let’s start off by supposing that the quantum computation is (nearly) perfectly executed, so the required fidelity \delta is quite large, say, 90%. In this scenario, we have very good evidence based on computational complexity theory that there is a scalable and in-practice quantum advantage for RCS. This evidence is very strong, comparable to the evidence we have for the hardness of factoring and simulating quantum systems. The intuition behind it is that quantum output probabilities are extremely hard to compute because of a mechanism behind quantum advantages: destructive interference. If you are interested in the subtleties and the open questions, take a look at our survey.

The question now is how far down in fidelity this classical hardness persists. Intuitively, the smaller we make \delta, the easier finite-fidelity RCS should become for a classical algorithm (and a quantum computer, too), since the freedom we have in deviating from the ideal state in our simulation becomes larger and larger. This increases the possibility of finding a state that turns out to be easy to simulate within the fidelity constraint.

Somewhat surprisingly, though, finite-fidelity RCS seems to remain hard even for very small values of \delta. I am not aware of any efficient classical algorithm that achieves the finite-fidelity task for \delta significantly away from the baseline trivial value of 2^{-n}. This is the value a maximally mixed or randomly picked state achieves because a random state has no correlation with the ideal state (or any other state), and 2^{-n} is exactly what you expect in that case (while 0 would correspond to perfect anti-correlation).

One can save some classical runtime compared to solving near-ideal RCS by exploiting a reduced fidelity, but the costs remain exponential. To classically solve finite-fidelity RCS, the best known approaches are reported in the papers that performed classical simulations of finite-fidelity RCS with the parameters of the first Google and USTC experiments (classSim1, classSim2). To achieve this, however, they needed to approximately simulate the ideal circuits at an immense cost. To the best of my knowledge, all but those first two experiments are far out of reach for these algorithms.

Getting the scaling right: weak noise and low depth

So what is the right value of \delta at which we can hope for a scalable and in-practice advantage of RCS experiments?

When thinking about this question, it is helpful to keep in mind a model of the circuit that a noisy experiment runs. So, let us consider a noisy circuit on n qubits with d layers of gates and single-qubit noise of strength \varepsilon on every qubit in every layer. In this scenario, the typical fidelity with the ideal state will decay as F \sim \exp(-\varepsilon n d).

Any reasonably testable value of the fidelity needs to scale as 1/\mathsf{poly}(n), since eventually we need to estimate the average fidelity F from the experimental samples and this typically requires at least 1/F^2 samples, so exponentially small fidelities are experimentally invisible. The polynomial fidelity \delta is also much closer to the near-ideal scenario (\delta \geq 90%) than the trivial scenario (\delta = 2^{-n}). While we cannot formally pin this down, the intuition behind the complexity-theoretic evidence for the hardness of near-ideal RCS persists into the \delta \sim 1/\mathsf{poly}(n) regime: to sample up to such high precision, we still need a reasonably accurate estimate of the ideal probabilities, and getting this is computationally extremely difficult. Scalable quantum advantage in this regime is therefore a pretty safe bet.

How do the parameters of the experiment and the RCS instances need to scale with the number of qubits n to experimentally achieve the fidelity regime? The limit to consider is one in which the noise rate decreases with the number of qubits, while the circuit depth is only allowed to increase very slowly. It depends on the circuit architecture, i.e., the choice of circuit connectivity, and the gate set, through a constant c_A, as I will explain in more detail below.

Weak-noise and low-depth scaling
(Weak noise) The local noise rate of the quantum device scales as \varepsilon < c_A/n.
(Low depth) The circuit depth scales as d \lesssim \log n.

This limit is such that we have a scaling of the fidelity as F \gtrsim n^{-c} for some constant c. It is also a natural scaling limit for noisy devices whose error rates gradually improve through better engineering. You might be worried about the fact that the depth needs to be quite low, but it turns out that there is a solid quantum advantage even for \log(n)-depth circuits.
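
Some illustrative arithmetic (the specific numbers here are my own, not taken from the experiments): with F \sim \exp(-\varepsilon n d), roughly 1/F^2 samples needed to estimate F, and depth growing like \log n, a constant noise rate quickly makes the fidelity unmeasurable, while \varepsilon \sim 1/n keeps the sample cost modest.

import numpy as np

for n in [20, 50, 100, 1000]:
    d = np.log2(n)                           # low-depth scaling, d ~ log n
    F_const = np.exp(-0.01 * n * d)          # constant local noise rate, eps = 1%
    F_weak = np.exp(-(0.2 / n) * n * d)      # weak-noise scaling, eps = 0.2 / n
    print(n, 1 / F_const**2, 1 / F_weak**2)  # rough number of samples needed in each case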

The precise definition of the weak-noise regime is motivated by the following observation. It turns out to be crucial for assessing the noisy data from the experiment.

Fidelity versus XEB: a phase transition

Remember from Part 1 that the experiments measured a quantity called the cross-entropy benchmark (XEB)

\chi = 2^{2n} \mathbb{E}_C \mathbb{E}_x \, p_C(x) - 1.

The XEB averages the ideal probabilities p_C(x) corresponding to the sampled outcomes x from experiments on random circuits C. Thus, it correlates the experimental and ideal output distributions of those random circuits. You can think of it as a “classical version” of the fidelity: If the experimental distribution is correct, the XEB will essentially be 1. If it is uniformly random, the XEB is 0.
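
Here is a toy Monte Carlo sketch of such an estimator (my own illustration, not from the post). I model the ideal output distribution of a random circuit on n qubits by the squared amplitudes of a random Gaussian state, a stand-in for the Porter-Thomas statistics of random circuits, and use the common normalization 2^n E[p(x)] - 1, under which samples drawn from the ideal distribution score close to 1 and uniformly random samples score close to 0, as described above.

import numpy as np

rng = np.random.default_rng(1)
n = 12
D = 2 ** n

amps = rng.normal(size=D) + 1j * rng.normal(size=D)   # random state standing in for a random circuit
p = np.abs(amps) ** 2
p /= p.sum()                                           # "ideal" output distribution p_C(x)

def xeb(samples):
    return D * p[samples].mean() - 1

ideal_samples = rng.choice(D, size=50_000, p=p)        # what a noiseless device would return
uniform_samples = rng.integers(D, size=50_000)         # what a completely depolarized device would return

print(xeb(ideal_samples))     # close to 1
print(xeb(uniform_samples))   # close to 0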

The experiments claimed that the XEB is a good proxy for the circuit-averaged fidelity given by F = \mathbb{E}_C F(C), and so we need to understand when this is true. Fortunately, in the past few years, alongside the improved experiments, we have developed a very good understanding of this question (WN, Spoof2, PT1, PT2).

It turns out that the quality of correspondence between XEB and average fidelity depends strongly on the noise in the experimental quantum state. In fact, there is a sharp phase transition: there is an architecture-dependent constant c_A such that when the experimental local noise rate \varepsilon < c_A/n, then the XEB is a good and reliable proxy for the average fidelity for any system size n and circuit depth d. This is exactly the weak-noise regime. Above that threshold, in the strong-noise regime, the XEB is an increasingly bad proxy for the fidelity (PT1, PT2).

Let me be more precise: In the weak-noise regime, when we consider the decay of the XEB as a function of circuit depth d, the rate of decay is given by \varepsilon n, i.e., the XEB decays as \exp(-\varepsilon n d). Meanwhile, in the strong-noise regime the rate of decay is constant, giving an XEB decay as \exp(-C d) for a constant C. At the same time, the fidelity decays as \exp(-\varepsilon n d) regardless of the noise regime. Hence, in the weak-noise regime, the XEB is a good proxy of the fidelity, while in the strong-noise regime, there is an exponentially increasing gap between the XEB (which remains large) and the fidelity (which continues to decay exponentially).

This is what the following plot shows. We computed it from an exact mapping of the behavior of the XEB to the dynamics of a statistical-mechanics model that can be evaluated efficiently. Using this mapping, we can also compute the noise threshold c_A for whichever random circuit family and architecture you are interested in.

Figure from (PT2): the y-axis label \Delta(\ln \chi) is the decay rate of the XEB \chi, N = n is the number of qubits, and \varepsilon is the local noise rate.

Where are the experiments?

We are now ready to take a look at the crux when assessing the noisy data: Can we trust the reported XEB values as an estimator of the fidelity? If so, do the experiments solve finite-fidelity RCS in the solidly hard regime where \delta \geq 1/\mathsf{poly}(n)?

In their more recent paper (PT1), the Google team explicitly verified that the experiment is well below the phase transition, and it turns out that the first experiment was just at the boundary. The USTC experiments had comparable noise rates, and the Quantinuum experiment much better ones. Since fidelity decays as \exp(-\varepsilon n d), but the reported XEB values stayed consistently around 0.1% as n was increased, the experimental error rate \varepsilon of the experiments improved even faster than the 1/n scaling required for the weak-noise regime, namely, more like \varepsilon \sim 1/(nd). Altogether, the experiments are therefore in the weak-noise regime both in terms of absolute numbers relative to c_A/n and the required scaling.

Of course, to derive the transition, we made some assumptions about the noise such as that the noise is local, and that it does not depend much on the circuit itself. In the advantage experiments, these assumptions about the noise are characterized and tested. This is done through a variety of means at increasing levels of complexity, including detailed characterization of the noise in individual gates, gates run in parallel, and eventually in a larger circuit. The importance of understanding the noise shows in the fact that a significant portion of the supplementary materials of the advantage experiments is dedicated to getting this right. All of this contributes to the experimental justification for using the XEB as a proxy for the fidelity!

The data shows that the experiments solved finite-fidelity RCS for values of \delta above the constant value of roughly 0.1% as the experiments grew. In the following plot, I compare the experimental fidelity values to the near-ideal scenario on the one hand, and the trivial 2^{-n} value on the other hand. Viewed at this scale, the values of \delta for which the experiment solved finite-fidelity RCS are indeed vastly closer to the near-ideal value than the trivial baseline, which should boost our confidence that reproducing samples at a similar fidelity is extremely challenging.

The phase transition matters!

You might be tempted to say: “Well, but is all this really so important? Can’t I just use XEB and forget all about fidelity?”

The phase transition shows why that would change the complexity of the problem: in the strong-noise regime, XEB can stay high even when fidelity is exponentially small. And indeed, this discrepancy can be exploited by so-called spoofers for the XEB. These are efficient classical algorithms which can be used to succeed at a quantum advantage test even though they clearly do not achieve the intended advantage. These spoofers (Spoof1, Spoof2) can achieve high XEB scores comparable to those of the experiments and scaling like \exp(-cd) in the circuit depth d for some constant c.

Their basic idea is to introduce strong, judiciously chosen noise at specific circuit locations that has the effect of breaking the simulation task up into smaller, much easier components, but at the same time still gives a high XEB score. In doing so, they exploit the strong-noise regime in which the XEB is a really bad proxy for the fidelity. This allows them to sample from states with exponentially low fidelity while achieving a high XEB value.

The discovery of the phase transition and the associated spoofers highlights the importance of modeling when assessing—and even formulating—the advantage claim based on noisy data.

But we can’t compute the XEB!

You might also be worried that the experiments did not actually compute the XEB in the advantage regime because to estimate it they would have needed to compute ideal probabilities—a task that is hard by definition of the advantage regime. Instead, they used a bunch of different ways to extrapolate the true XEB from XEB proxies (proxy of a proxy of the fidelity). Is this a valid way of getting an estimate of the true XEB?

It totally is! Different extrapolations—from easy-to-simulate to hard-to-simulate, from small system to large system, etc.—all gave consistent answers for the experimental XEB value of the supremacy circuits. Think of this as having several lines that cross at the same point. For that crossing to be a coincidence, something crazy, conspiratorial must happen exactly when you move to the supremacy circuits from different directions. That is why it is reasonable to trust the reported value of the XEB.

That’s exactly how experiments work!

All of this is to say that establishing that the experiments correctly solved finite-fidelity RCS and therefore show quantum advantage involved a lot of experimental characterization of the noise as well as theoretical work to understand the effects of noise on the quantity we care about—the fidelity between the experimental and ideal states.

In this respect (and maybe also in the scale of the discovery), the quantum advantage experiments are similar to the recent experiments reporting discovery of the Higgs boson and gravitational waves. While I do not claim to understand any of the details, what I do understand is that in both experiments, there is an unfathomable amount of data that could not be interpreted without preselection and post-processing of the data, theories, extrapolations and approximations that model the experiment and measurement apparatus. All of those enter the respective smoking-gun plots that show the discoveries.

If you believe in the validity of experimental physics methodology, you should therefore also believe in the type of evidence underlying the experimental claim of the quantum advantage demonstrations: that they sampled from the output distribution of a quantum state with the reported fidelities.

Put succinctly: If you believe in the Higgs boson and gravitational waves, you should probably also believe in the experimental demonstration of quantum advantage.

What are the counter-arguments?

The theoretical computer scientist

“The weak-noise limit is not physical. The appropriate scaling limit is one in which the local noise rate of the device is constant while the system size grows, and in that case, there is a classical simulation algorithm for RCS (SimIQP, SimRCS).”

In theoretical computer science, asking how time or system size scales with the input size is considered very natural: We say an algorithm is efficient if its runtime and space usage depend only polynomially on the input size.

But all scaling arguments are hypothetical concepts, and we only care about the scaling at relevant sizes. In the end, every scaling limit is going to hit the wall of physical reality—be it the amount of energy or human lifetime that limits the time of an algorithm, or the physical resources that are required to build larger and larger computers. To keep the scaling limit going as we increase the size of our computations, we need innovation that makes the components smaller and less noisy.

At the scales relevant to RCS, the 1/n scaling of the noise is benign and even natural. Why? Well, currently, the actual noise in quantum computers is not governed by the fundamental limit, but by engineering challenges. Realizing this limit therefore amounts to engineering improvements in the system size and noise rate that are achieved over time. Sure, at some point that scaling limit is also going to hit a fundamental barrier below which the noise cannot be improved. But we are surely still far away from that limit. What is more, already now logical qubits are starting to work and achieve beyond-breakeven fidelities. So even if the engineering improvements should flatten out from here onward, QEC will keep the 1/n noise limit going and even accelerate it in the intermediate future.

The complexity maniac

“All the hard complexity-theoretic evidence for quantum advantage is in the near-ideal regime, but now you are claiming advantage for the low-fidelity version of that task.”

This is probably the strongest counter-argument in my opinion, and I gave my best response above. Let me just add that this is a question about computational complexity. In the end, all of complexity theory is based on belief. The only real evidence we have for the hardness of any task is the absence of an efficient algorithm, or the reduction to a paradigmatic, well-studied task for which there is no efficient algorithm.

I am not sure how much I would bet that you cannot find an efficient algorithm for finite-fidelity RCS in the regime of the experiments, but it is certainly a pizza.

The enthusiastic skeptic

“There is no verification test that just depends on the classical samples, is efficient and does not make any assumptions about the device. In particular, you cannot unconditionally verify fidelity just from the classical samples. Why should I believe the data?”

Yes, sure, the current advantage demonstrations are not device-independent. But the comparison you should have in mind is with Bell tests. The first proper Bell tests of Aspect and others in the 80s were not free of loopholes. They still allowed for contrived explanations of the data that did not violate local realism. Still, I can hardly believe that anyone would argue that Bell inequalities were not violated already back then.

As the years passed, these remaining loopholes were closed. To be a skeptic of the data, people needed to come up with more and more adversarial scenarios that could explain the data. We are working to make the same happen with quantum advantage demonstrations: come up with better schemes and better tests that require fewer and fewer assumptions or less knowledge about the specifics of the device.

The “this is unfair” argument

“When you chose the gates and architecture of the circuit dependent on your device, you tailored the task too much to the device and that is unfair. Not even the different RCS experiments solve exactly the same task.”

This is not really an argument against the achievement of quantum advantage but more against the particular choices of circuit ensembles in the experiments. Sure, the specific computations solved are still somewhat tailored to the hardware itself and in this sense the experiments are not hardware-independent yet, but they still solve fine computational tasks. Moving away from such hardware-tailored task specifications is another important next step and we are working on it.


In the third and last part of this mini-series I will address next steps in quantum advantage that aim at closing some of the remaining loopholes. The most important—and theoretically interesting—one is to enable efficient verification of quantum advantage using less or even no specific knowledge about the device that was used, but just the measurement outcomes.

References

(survey) Hangleiter, D. & Eisert, J. Computational advantage of quantum random sampling. Rev. Mod. Phys. 95, 035001 (2023).

(classSim1) Pan, F., Chen, K. & Zhang, P. Solving the sampling problem of the Sycamore quantum circuits. Phys. Rev. Lett. 129, 090502 (2022).

(classSim2) Kalachev, G., Panteleev, P., Zhou, P. & Yung, M.-H. Classical Sampling of Random Quantum Circuits with Bounded Fidelity. arXiv:2112.15083 (2021).

(WN) Dalzell, A. M., Hunter-Jones, N. & Brandão, F. G. S. L. Random Quantum Circuits Transform Local Noise into Global White Noise. Commun. Math. Phys. 405, 78 (2024).

(PT1) Morvan, A. et al. Phase transitions in random circuit sampling. Nature 634, 328–333 (2024).

(PT2) Ware, B. et al. A sharp phase transition in linear cross-entropy benchmarking. arXiv:2305.04954 (2023).

(Spoof1) Barak, B., Chou, C.-N. & Gao, X. Spoofing Linear Cross-Entropy Benchmarking in Shallow Quantum Circuits. in 12th Innovations in Theoretical Computer Science Conference (ITCS 2021) (ed. Lee, J. R.) vol. 185 30:1-30:20 (2021).

(Spoof2) Gao, X. et al. Limitations of Linear Cross-Entropy as a Measure for Quantum Advantage. PRX Quantum 5, 010334 (2024).

(SimIQP) Bremner, M. J., Montanaro, A. & Shepherd, D. J. Achieving quantum supremacy with sparse and noisy commuting quantum computations. Quantum 1, 8 (2017).

(SimRCS) Aharonov, D., Gao, X., Landau, Z., Liu, Y. & Vazirani, U. A polynomial-time classical algorithm for noisy random circuit sampling. in Proceedings of the 55th Annual ACM Symposium on Theory of Computing 945–957 (2023).

John Preskill: Has quantum advantage been achieved?

Recently, I gave a couple of perspective talks on quantum advantage, one at the annual retreat of the CIQC and one at a recent KITP programme. I started off by polling the audience on who believed quantum advantage had been achieved. Just this one, simple question.

The audience was mostly experimental and theoretical physicists with a few CS theory folks sprinkled in. I was sure that these audiences would be overwhelmingly convinced of the successful demonstration of quantum advantage. After all, more than half a decade has passed since the first experimental claim (G1) of “quantum supremacy”, as the patron of this blog’s institute called the idea “to perform tasks with controlled quantum systems going beyond what can be achieved with ordinary digital computers” (Preskill, p. 2) back in 2012. Yes, this first experiment by the Google team may have been simulated in the meantime, but it was only the first in an impressive series of similar demonstrations that became bigger and better with every year that passed. Surely, so I thought, a significant part of my audiences would have been convinced of quantum advantage even before Google’s claim, when so-called quantum simulation experiments claimed to have performed computations that no classical computer could do (e.g. (qSim)).

I could not have been more wrong.

In both talks, less than half of the people in the audience thought that quantum advantage had been achieved.

In the discussions that ensued, I came to understand what folks criticized about the experiments that have been performed and even the concept of quantum advantage to begin with. But more on that later. Most of all, it seemed to me, the community had dismissed Google’s advantage claim because of the classical simulation shortly after. It hadn’t quite kept track of all the advances—theoretical and experimental—since then.

In a mini-series of three posts, I want to remedy this and convince you that the existing quantum computers can perform tasks that no classical computer can do. Let me caution, though, that the experiments I am going to talk about solve a (nearly) useless task. Nothing of what I say implies that you should (yet) be worried about your bank accounts.

I will start off by recapping what quantum advantage is and how it has been demonstrated in a set of experiments over the past few years.

Part 1: What is quantum advantage and what has been done?

To state the obvious: we are now fairly convinced that noiseless quantum computers would be able to solve problems efficiently that no classical computer could solve. In fact, we have been convinced of that since the mid-1990s, when Lloyd and Shor discovered two basic quantum algorithms: simulating quantum systems and factoring large numbers. Both are tasks where we are as certain as we can be that no classical computer can solve them efficiently. So why talk about quantum advantage 20 and 30 years later?

The idea of a quantum advantage demonstration—even if on a completely useless task—emerged as a milestone for the field in the 2010s. Achieving quantum advantage would finally demonstrate that quantum computing was not just a random idea of a bunch of academics who took quantum mechanics too seriously. It would show that quantum speedups are real: we can actually build quantum devices, control their states and the noise in them, and use them to solve tasks which not even the largest classical supercomputers could do—and these are very large.

What is quantum advantage?

But what exactly do we mean by “quantum advantage”? It is a vague concept, for sure. But some essential criteria that a demonstration should satisfy are the following.

  1. The quantum device needs to solve a pre-specified computational task. This means that there needs to be an input to the quantum computer. Given the input, the quantum computer must then be programmed to solve the task for the given input. This may sound trivial. But it is crucial because it delineates programmable computing devices from just experiments on any odd physical system.
  2. There must be a scaling difference in the time it takes for a quantum computer to solve the task and the time it takes for a classical computer. As we make the problem or input size larger, the difference between the quantum and classical solution times should increase disproportionately, ideally exponentially.
  3. And finally: the actual task solved by the quantum computer should not be solvable by any classical machine (at the time).

Achieving this last criterion using imperfect, noisy quantum devices is the challenge that the idea of quantum supremacy set for the field. After all, running any of our favourite quantum algorithms in a classically hard regime on these devices is completely out of the question. They are too small and too noisy. So the field had to come up with the smallest and most noise-robust quantum algorithm conceivable that still has a significant scaling advantage over classical computation.

Random circuits are really hard to simulate!

The idea is simple: we just run a random computation, constructed in a way that is as favorable as we can make it to the quantum device while being as hard as possible classically. This may strike you as a pretty unfair way to come up with a computational task—it is just built to be hard for classical computers without any other purpose. But: it is a fine computational task. There is an input: the description of the quantum circuit, drawn randomly. The device needs to be programmed to run this exact circuit. And there is a task: just return whatever this quantum computation would return. These are strings of 0s and 1s drawn from a certain distribution. Getting the distribution of the strings right for a given input circuit is the computational task.

This task, dubbed random circuit sampling, can be solved on a classical as well as a quantum computer, but there is a (presumably) exponential advantage for the quantum computer. More on that in Part 2.

For now, let me tell you about the experimental demonstrations of random circuit sampling. Allow me to be slightly more formal. The task solved in random circuit sampling is to produce bit strings x \in \{0,1\}^n distributed according to the Born-rule outcome distribution

p_C(x) = |\bra{x} C \ket{0}|^2

of a sequence of elementary quantum operations (unitary rotations of one or two qubits at a time) which is drawn randomly according to certain rules. This circuit C is applied to a reference state \ket{0} on the quantum computer and then measured, giving the string x as an outcome.
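To make this concrete, here is a toy illustration in Python (not the experimental protocol, and with hypothetical gate choices rather than the experiments’ gate sets): build a tiny random circuit, compute its Born-rule distribution p_C, and sample bit strings from it.

# Toy random circuit sampling on a handful of qubits (illustration only).
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 8     # small enough to store the full state vector

def haar_2q(rng):
    """A Haar-random 4x4 unitary, a stand-in for a generic two-qubit gate."""
    z = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    q, r = np.linalg.qr(z)
    return q * (np.diagonal(r) / np.abs(np.diagonal(r)))

def apply_2q(state, gate, i, j, n):
    """Apply a two-qubit gate to qubits i and j of an n-qubit state vector."""
    psi = np.moveaxis(state.reshape([2] * n), (i, j), (0, 1)).reshape(4, -1)
    psi = (gate @ psi).reshape([2, 2] + [2] * (n - 2))
    return np.moveaxis(psi, (0, 1), (i, j)).reshape(-1)

state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                                   # reference state |0...0>
for _ in range(depth):                           # brick-work pattern of random gates
    for i in range(0, n - 1, 2):
        state = apply_2q(state, haar_2q(rng), i, i + 1, n)
    for i in range(1, n - 1, 2):
        state = apply_2q(state, haar_2q(rng), i, i + 1, n)

p = np.abs(state) ** 2                           # Born-rule distribution p_C(x)
p /= p.sum()
samples = rng.choice(2**n, size=10, p=p)         # the task: draw strings x ~ p_C
print([format(int(s), f"0{n}b") for s in samples])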

The breakthrough: classically hard programmable quantum computations in the real world

In the first quantum supremacy experiment (G1) by the Google team, the quantum computer was built from 53 superconducting qubits arranged in a 2D grid. The operations were randomly chosen simple one-qubit gates (\sqrt{X}, \sqrt{Y}, \sqrt{X+Y}) and deterministic two-qubit gates called fSim applied in the 2D pattern, and repeated a certain number of times (the depth of the circuit). The limiting factor in these experiments was the quality of the two-qubit gates and the measurements, with error probabilities around 0.6 % and 4 %, respectively.

A very similar experiment was performed by the USTC team on 56 qubits (U1) and both experiments were repeated with better fidelities (0.4 % and 1 % for two-qubit gates and measurements) and slightly larger system sizes (70 and 83 qubits, respectively) in the past two years (G2,U2).

Using a trapped-ion architecture, the Quantinuum team also demonstrated random circuit sampling on 56 qubits but with arbitrary connectivity (random regular graphs) (Q). There, the two-qubit gates were \pi/2-rotations around Z \otimes Z, the single-qubit gates were uniformly random and the error rates much better (0.15 % for both two-qubit gate and measurement errors).

All the experiments ran random circuits on varying system sizes and circuit depths, and collected thousands to millions of samples from a few random circuits at a given size. To benchmark the quality of the samples, the now widely accepted figure of merit is the linear cross-entropy benchmark (XEB), defined as

\chi = 2^{n}\, \mathbb{E}_C\, \mathbb{E}_{x}\, p_C(x) - 1,

for an n-qubit circuit. The expectation over C is over the random choice of circuit and the expectation over x is over the experimental distribution of the bit strings. In other words, to compute the XEB given a list of samples, you ‘just’ need to compute the ideal probability of obtaining that sample from the circuit C and average the outcomes.
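In code, the estimator is a one-liner, under the assumption that the ideal probability vector is available (which is only true classically for modest sizes, such as the toy circuit above); in the experiments one would also average over several random circuits C:

import numpy as np

def linear_xeb(samples: np.ndarray, p: np.ndarray, n: int) -> float:
    """Linear XEB for one circuit: samples are measured bit strings encoded
    as integers in [0, 2**n); p is the ideal probability vector of length 2**n."""
    return 2**n * float(np.mean(p[samples])) - 1.0

# With the toy circuit above:
#   linear_xeb(rng.choice(2**n, size=100_000, p=p), p, n)   # ideal samples
#   linear_xeb(rng.integers(0, 2**n, size=100_000), p, n)   # uniformly random samples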

The XEB is nice because it gives 1 for ideal samples from sufficiently random circuits and 0 for uniformly random samples, and it can be estimated accurately from just a few samples. Under the right conditions, it turns out to be a good proxy for the many-body fidelity of the quantum state prepared just before the measurement.

This tells us that we should expect an XEB score of (1-\text{error per gate})^{\#\text{gates}} \sim c^{-nd} for some noise- and architecture-dependent constant c. All of the experiments achieved a value of the XEB that was significantly (in the statistical sense) far away from 0 as you can see in the plot below. This shows that something nontrivial is going on in the experiments, because the fidelity we expect for a maximally mixed or random state is 2^{-n}, which is less than 10^{-14} % for all the experiments.
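For a rough sense of scale (with purely illustrative numbers, not those of any particular experiment): a circuit with a thousand gates at an average error rate of 0.5 % per gate would be expected to give an XEB of about (1-0.005)^{1000} \approx e^{-5}, i.e. roughly 0.7 %—tiny compared to 1, but many orders of magnitude above the 2^{-n} of a random state.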

The complexity of simulating these experiments is roughly governed by an exponential in either the number of qubits or the maximum bipartite entanglement generated. Figure 5 of the Quantinuum paper has a nice comparison.

It is not easy to say how much leverage an XEB significantly lower than 1 gives a classical spoofer. But one can certainly use it to judiciously change the circuit a tiny bit to make it easier to simulate.

Even then, reproducing the low scores between 0.05 % and 0.2 % of the experiments is extremely hard on classical computers. To the best of my knowledge, producing samples that match the experimental XEB score has only been achieved for the first experiment from 2019 (PCZ). This simulation already exploited the relatively low XEB score to simplify the computation, but even for the slightly larger 56-qubit experiments these techniques may not be feasible to run. So the only one of the experiments that may actually have been simulated is the 2019 experiment by the Google team.

If there are better methods, or computers, or more willingness to spend money on simulating random circuits today, though, I would be very excited to hear about it!

Proxy of a proxy of a benchmark

Now, you may be wondering: “How do you even compute the XEB or fidelity in a quantum advantage experiment in the first place? Doesn’t it require computing outcome probabilities of the supposedly hard quantum circuits?” And that is indeed a very good question. After all, the quantum advantage of random circuit sampling is based on the hardness of computing these probabilities. This is why, to get an estimate of the XEB in the advantage regime, the experiments needed to use proxies and extrapolation from classically tractable regimes.

This will be important for Part 2 of this series, where I will discuss the evidence we have for quantum advantage, so let me give you some more detail. To extrapolate, one can just run smaller circuits of increasing sizes and extrapolate to the size in the advantage regime. Alternatively, one can run circuits with the same number of gates but with added structure that makes them classically simulatable and extrapolate to the advantage circuits. Either way, the extrapolation is based on samples from experiments other than the quantum advantage experiments themselves. All of the experiments did this.
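As a sketch of the extrapolation idea (the numbers below are hypothetical, not data from any of the experiments): measure the XEB on sizes that can still be verified classically, fit an exponential decay, and read off the prediction at the advantage size.

import numpy as np

sizes = np.array([12, 16, 20, 24, 28])            # classically verifiable qubit numbers
xeb = np.array([0.30, 0.17, 0.09, 0.05, 0.03])    # hypothetical measured XEB scores

# Fit log(XEB) = intercept + slope * n and extrapolate.
slope, intercept = np.polyfit(sizes, np.log(xeb), 1)
predicted_xeb = np.exp(intercept + slope * 53)    # roughly 1e-3 for these numbers
print(predicted_xeb)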

A separate estimate of the XEB score is based on proxies. An XEB proxy uses the samples from the advantage experiments, but computes a different quantity than the XEB, one that can actually be computed and for which one can collect independent numerical and theoretical evidence that it matches the XEB in the relevant regime. For example, the Google experiments averaged outcome probabilities of modified circuits that were related to the true circuits but easier to simulate.

The Quantinuum experiment did something entirely different, which is to estimate the fidelity of the advantage experiment by inverting the circuit on the quantum computer and measuring the probability of coming back to the initial state.

All of the methods used to estimate the XEB of the quantum advantage experiments required some independent verification based on numerics on smaller sizes and induction to larger sizes, as well as theoretical arguments.

In the end, the advantage claims are thus based on a proxy of a proxy of the quantum fidelity. This is not to say that the advantage claims do not hold. In fact, I will argue in my next post that this is just the way science works. I will also tell you more about the evidence that the experiments I described here actually demonstrate quantum advantage and discuss some skeptical arguments.


Let me close this first post with a few notes.

In describing the quantum supremacy experiments, I focused on random circuit sampling which is run on programmable digital quantum computers. What I neglected to talk about is boson sampling and Gaussian boson sampling, which are run on photonic devices and have also been experimentally demonstrated. The reason for this is that I think random circuits are conceptually cleaner since they are run on processors that are in principle capable of running an arbitrary quantum computation while the photonic devices used in boson sampling are much more limited and bear more resemblance to analog simulators.

I want to continue my poll here, so feel free to write in the comments whether or not you believe that quantum advantage has been demonstrated (by these experiments) and if not, why.


Continue reading Part 2: Considering the evidence.

References

[G1] Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).

[Preskill] Preskill, J. Quantum computing and the entanglement frontier. arXiv:1203.5813 (2012).

[qSim] Choi, J. et al. Exploring the many-body localization transition in two dimensions. Science 352, 1547–1552 (2016).

[U1] Wu, Y. et al. Strong Quantum Computational Advantage Using a Superconducting Quantum Processor. Phys. Rev. Lett. 127, 180501 (2021).

[G2] Morvan, A. et al. Phase transitions in random circuit sampling. Nature 634, 328–333 (2024).

[U2] Gao, D. et al. Establishing a New Benchmark in Quantum Computational Advantage with 105-qubit Zuchongzhi 3.0 Processor. Phys. Rev. Lett. 134, 090601 (2025).

[Q] DeCross, M. et al. Computational Power of Random Quantum Circuits in Arbitrary Geometries. Phys. Rev. X 15, 021052 (2025).

[PCZ] Pan, F., Chen, K. & Zhang, P. Solving the sampling problem of the Sycamore quantum circuits. Phys. Rev. Lett. 129, 090502 (2022).

March 21, 2026

Tommaso DorigoTravel With Two Infants

The other day I traveled with Kalliopi and our two newborns to Padova from Lulea. After six full months in Lapland - a full autumn and winter, in fact - I needed to get back to my original office, and take care of other business at what has become my second home now. Meanwhile, travel has become considerably more complicated for me: traveling with two infants is no easy matter.

read more

March 20, 2026

n-Category Café The Agent That Doesn't Know Itself

guest post by William Waites

The previous post introduced the plumbing calculus: typed channels, structural morphisms, two forms of composition, and agents as stateful morphisms with a protocol for managing their state. The examples were simple. This post is about what happens when the algebra handles something genuinely complex.

To get there, we need to understand a little about how large language models work.

Large language models are sequence-to-sequence transducers: a sequence of tokens comes in, a sequence comes out. Text is tokenised and the model operates on the tokens.

From the outside, the morphism is simple: !string → !string. A message goes in, a message comes out. But the client libraries (the code that calls the LLM provider) maintain the conversation history and send it back with every call. The actual morphism is (!string, ![Message]) → (!string, ![Message]): the input message and the accumulated history go in, the response and the updated history come out. The history feeds back. This is a trace in the sense of traced monoidal categories: the feedback channel is hidden from the user, who sees only !string → !string.
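Here is a minimal Python sketch of that hidden trace; `call_llm` is a hypothetical stand-in for whichever provider API is in use, not a real client-library call:

from typing import List, Tuple

Message = dict  # e.g. {"role": "user" | "assistant", "content": str}

def call_llm(messages: List[Message]) -> str:
    """Stand-in for the provider call: takes the whole history, returns a reply."""
    raise NotImplementedError

def step(user_input: str, history: List[Message]) -> Tuple[str, List[Message]]:
    """The actual morphism: (!string, ![Message]) -> (!string, ![Message])."""
    history = history + [{"role": "user", "content": user_input}]
    reply = call_llm(history)
    return reply, history + [{"role": "assistant", "content": reply}]

class Agent:
    """The trace: the history channel is hidden, leaving !string -> !string."""
    def __init__(self):
        self.history: List[Message] = []

    def __call__(self, user_input: str) -> str:
        reply, self.history = step(user_input, self.history)
        return reply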

Crucially, the model has a limited amount of memory. It is not a memoryless process, but the memory it has is not large: 200,000 tokens for current models, perhaps a million for the state of the art. This sounds like a lot. It is not. An academic paper is roughly 10,000 tokens. A literature review that needs to work with thirty papers has already exceeded the context window of most models, and that is before the model has produced a single word of output.

If you have used any of these agent interfaces, you will have noticed that after talking back and forth for a while, the agent will compact. This is a form of memory management. What is happening is that some supervisory process has noticed the context window filling up, and has intervened to shorten its contents. A naïve approach is to truncate: discard everything before the last N exchanges. A better approach is to feed the entire context to another language model and ask it to summarise, then put the summary back.

This is normally done by specialised code outside the agent, invisible to it.
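As a deliberately naive illustration of what that outside code might do (a Python sketch, not the scheme built later in this post), the two approaches look like this; the tag wrapping the summary is an assumed format:

from typing import Callable, List

Message = dict  # {"role": ..., "content": ...}

def truncate(history: List[Message], keep_last: int = 20) -> List[Message]:
    """Discard everything before the last N messages."""
    return history[-keep_last:]

def summarise_and_replace(history: List[Message],
                          summarise: Callable[[List[Message]], str]) -> List[Message]:
    """Feed the whole context to another model and keep only its summary."""
    summary = summarise(history)   # another LLM call
    return [{"role": "user",
             "content": f"<context-summary>{summary}</context-summary>"}]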

How to manage agent memory well is an active research area. We do not, in general, do it very well. Truncation loses information. Summarisation loses nuance. Pinning helps but the right pinning strategy depends on the task. These are open questions, and to make progress we need to be able to experiment with different schemes and mechanisms: express a memory management strategy, test it, swap it for another, compare. Not by recompiling specialised code or hardcoding behaviour, but by writing it down in a language designed for exactly this kind of composition. Memory management should be a plumbing program: modular, type-checked, swappable.

So we built an implementation of compaction using the plumbing calculus, and the first thing we did was test it. I ran the protocol on a very short cycle: a single message caused a compaction, because the threshold was set to zero for testing. The compressor fired, produced a summary, rebuilt the agent’s context. The logs showed [compressor] 3404 in / 541 out. The protocol worked.

Then I asked the agent: "have you experienced compaction?"

The agent said no. It explained what compaction is, accurately. Said it hadn’t happened yet. It was confident.

I asked: "do you have a context summary in your window?"

Yes, it said, and described the contents accurately.

"How did that context summary get there if you have not yet compacted?"

The agent constructed a plausible, confident, and completely wrong explanation: the summary was "provided to me by the system at the start of this conversation" as a "briefing or recap." When pressed, it doubled down:

"The context-summary is not evidence that compaction has occurred. It’s more like a briefing or recap that the system gives me at the start of a conversation session to provide continuity."

The agent was looking at the direct evidence of its own compaction and confidently explaining why it was not compaction. We will return to why it gets this wrong, and how to fix it. But first: how do we build this?

The compacting homunculus

At a high level, it works like this. An agent is running: input comes in, output goes out. Together with the output, the agent emits a telemetry report. The telemetry includes token counts: with each transaction, the entire train of messages and responses is sent to the LLM provider, and back comes a response together with a count of the tokens that went in and the tokens that came out. Our agent implementation sends this telemetry out of the telemetry port to anybody who is listening.

The construction involves a second agent. This second agent is a homunculus: the little man who sits on your shoulder and watches what your mind is doing. Here is the topology:

Topology of the compacting homunculus. Two boxes: Agent (large, bottom) and Compressor (smaller, top). The Agent has input and output ports for the main data flow (dark blue arrows). Three channels connect the Agent to the Compressor: telemetry flows up from the Agent (the Compressor watches token counts), ctrl_out flows up from the Agent (the Compressor receives acknowledgements), and ctrl_in flows down from the Compressor to the Agent (the Compressor sends commands). The Agent does not know the Compressor exists. It just receives control messages and responds to them.

The homunculus listens to the telemetry and says: the memory is filling up. The token count has crossed a threshold. It is time to compact. And then it acts:

• Send pause to the agent’s control port. Stop accepting input.

• Send get memory. The agent produces the contents of its context window.

• Summarise that memory (using another LLM call).

• Send set memory with the compacted version.

• Send resume. The agent continues processing input.

Each step requires an acknowledgement before the next can proceed. This is a protocol: pause, acknowledge, get memory, here is the memory, set memory, acknowledge, resume, acknowledge.
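Before writing it down formally, here is a rough Python sketch of that loop; the channel objects and message names are illustrative, and the real protocol is the session type below:

THRESHOLD = 150_000   # input-token threshold, matching the wiring below

def homunculus(telemetry, ctrl_in, ctrl_out, summarise):
    """telemetry, ctrl_in, ctrl_out are channel-like objects with send/recv."""
    for report in telemetry:                            # watch the token counts
        if report.get("kind") != "usage" or report["input_tokens"] <= THRESHOLD:
            continue
        ctrl_in.send("Pause")
        assert ctrl_out.recv() == "PauseAck"            # each step waits for its ack
        ctrl_in.send("GetMemory")
        memory = ctrl_out.recv()                        # MemoryDump: the context window
        ctrl_in.send(("SetMemory", summarise(memory)))  # another LLM call compresses it
        assert ctrl_out.recv() == "SetMemoryAck"
        ctrl_in.send("Resume")
        assert ctrl_out.recv() == "ResumeAck"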

It is possible to express this directly in the plumbing calculus, but it would be painfully verbose. Instead, we use session types to describe the protocol. This is not pseudocode. There is a compiler and a runtime for this language. Here is the protocol:

protocol Compaction =
  send Pause . recv PauseAck .
  send GetMemory . recv MemoryDump .
  send SetMemory . recv SetMemoryAck .
  send Resume . recv ResumeAck . end

The protocol is eight lines. It reads as a sequence of steps: send, receive, send, receive, and so on. The compiler knows what types each step carries. Now we wire it up:

let compact : (!CtrlResp, !json) -> !CtrlCmd =
  plumb(ctrl_out, telemetry, ctrl_in) {
 
    (ctrl_out, ctrl_in) <-> Compaction as session
 
    telemetry
      ; filter(kind = "usage" && input_tokens > 150000)
      ; map(null) ; session@trigger
 
    session@trigger ; map({pause: true})
      ; session@send(Pause)
    session@done(PauseAck) ; map({get_memory: true})
      ; session@send(GetMemory)
    session@recv(MemoryDump) ; compressor
      ; session@send(SetMemory)
    session@done(SetMemoryAck) ; map({resume: true})
      ; session@send(Resume)
}

The first line binds the protocol to the agent’s control ports:

(ctrl_out, ctrl_in) <-> Compaction as session.

This says: the Compaction protocol runs over the control channel, and we refer to it as session. The telemetry line is the trigger: when token usage crosses a threshold, the protocol begins. Each subsequent line is one step of the protocol, wired to the appropriate control messages.

Here is a direct depiction of the protocol as wired. You can trace it through:

Diagram of the compaction protocol wired between the homunculus and the bot agent. Shows the telemetry stream flowing from the bot to the homunculus, a filter checking token usage against a threshold, and then a sequence of control messages: Pause flows to ctrl_in, PauseAck returns on ctrl_out, GetMemory flows in, MemoryDump returns, passes through a compressor agent, SetMemory flows in, SetMemoryAck returns, Resume flows in, ResumeAck returns. The protocol steps are connected in sequence. This is a direct transcription of the session type protocol into a wiring diagram.

And here is how we wire the homunculus to the agent:

let main : !string -> !string =
  plumb(input, output) {
    let ctrl : !CtrlCmd = channel
    let ctrl_out : !CtrlResp = channel
    let telem : !json = channel
 
    spawn bot(input=input, ctrl_in=ctrl,
              output=output, ctrl_out=ctrl_out,
              telemetry=telem)
    spawn compact(ctrl_out=ctrl_out,
                  telemetry=telem, ctrl_in=ctrl)
}

The main morphism takes a string input and produces a string output. Internally, it creates three channels (control commands, control responses, telemetry) and spawns two processes: the bot agent and the compact homunculus. The homunculus listens to the bot’s telemetry and control responses, and sends commands to the bot’s control input. The bot does not know the homunculus exists. It just receives control messages and responds to them.

There are two nested traces here. The first is the one from before, inside the agent: messages go in, the output accumulates with everything that came before, and the whole history feeds back on the next turn. We do not see this trace. It is hidden inside the client library. The second trace is the one we have just built: the homunculus. What goes around the outer loop is control: telemetry flows out, commands flow in, acknowledgements come back. The memory dump passes through the control channel at one point in the protocol, but the feedback path is control, not conversation history. Nested traces compose; the algebra has identities for this and it is fine. But they are different loops carrying different things.

Session types as barrier chains

The connection between the protocol above and what the compiler actually produces is the functor from session types into the plumbing calculus. This functor works because of barrier.

Why do we need the barrier? Because the protocol is about sending a message and waiting for a response. We can send a message, but we need the response to arrive before we proceed. The barrier takes two streams, one carrying the "done" signal and one carrying the response, and synchronises them into a pair. Only when both are present does the next step begin.
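As a rough illustration of that behaviour (an idea sketch in Python, not the runtime's implementation), a barrier is a small process that blocks on both of its inputs before emitting anything:

import queue
import threading

class Barrier:
    """Synchronise a 'done' stream with a response stream into a stream of pairs."""
    def __init__(self):
        self.done = queue.Queue()   # the "done" signal from the previous step
        self.resp = queue.Queue()   # the response we are waiting for
        self.out = queue.Queue()    # emits (done, response) pairs
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            d = self.done.get()     # block until the previous step has finished...
            r = self.resp.get()     # ...and the response has arrived
            self.out.put((d, r))    # only then does the next step begin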

Each session type primitive has a direct image in the plumbing category, and the structure is prettier than it first appears. The primitives come in dual pairs:

In the diagrams below, session types are on the left in blue; their images in the plumbing calculus are on the right in beige.

send and recv are dual. They map to map and filter, which are also dual: send wraps the value with map, then synchronises via barrier with the done signal from the previous step. Recv filters the control output by step number, synchronises via barrier, then extracts the payload with map.

select and offer are dual. They map to tag and case analysis, which are also dual: select tags the value with a label via map, synchronises via barrier, and routes to the chosen branch chain. Offer copies the control output and filters each copy by label, routing to the appropriate branch chain.

Diagram showing the two dual pairs of session type primitives and their images in the plumbing calculus. Session types are shown in blue on the left; plumbing calculus images in beige on the right. Top section: send T is a simple arrow carrying type T; its image is map(wrap) followed by barrier with a done signal, then routing to the control input. recv T is its dual: filtering ctrl_out by step number, barrier with done, then map to extract the payload. Bottom section: select with labelled alternatives maps to coproduct injection via map(tag), barrier, then routing to the chosen branch chain. offer is its dual: copy the control output, filter each copy by label, and route to the corresponding branch chain.

• The sequencing operator (.) maps to a barrier chain. Each send-then-recv step becomes a barrier that synchronises the outgoing message with the incoming acknowledgement, and these barriers chain together to enforce the protocol ordering.

Diagram showing how the sequencing operator in session types maps to a barrier chain in the plumbing calculus. Top: the session type sequence "send T1 . recv T2 . send T3" shown as three boxes in a row connected by dots. Bottom: the plumbing image, a chain of barriers. Each send-recv pair becomes a barrier that takes the outgoing message on one arm and the incoming acknowledgement on the other, producing a synchronised pair. The done signal from one barrier feeds into the next, creating a chain that enforces protocol ordering. The trigger input starts the chain; the done output signals completion. Filters select responses by step number; maps construct outgoing messages.

rec maps to a feedback loop: merge takes the initial arm signal and the last done signal from the previous iteration, feeds them into the barrier chain body, and copy at the end splits done into output and feedback. The trigger serialisation gate starts the protocol, synchronising with a copy of the done signal so that only one instance of the protocol runs at a time.

end is implicit: the chain simply stops. Discard handles any remaining signals.

Diagram showing three more session type primitives and their plumbing images. Top: rec X . S (recursion) maps to a feedback loop. A merge node takes two inputs: the initial arm signal and the last-done signal fed back from the end of the body. The body is a barrier chain (S). At the output, copy splits the done signal into an output arm and a feedback arm that loops back to merge. Middle: end is shown as simply discarding any remaining signal. Bottom: the trigger and serialisation gate, which starts the protocol. A trigger input feeds through a barrier that synchronises with a copy of the done signal, ensuring only one instance of the protocol runs at a time.

This mapping is a functor. It is total: every session type primitive has an image in the plumbing category, using only the morphisms we already have. Session types are a specification language; the plumbing calculus is the execution language. The compiler translates one into the other.

The reason we do this becomes obvious from the diagram below. It is scrunched up and difficult to look at. If you click on it you can get a big version and puzzle it out. If you squint through the spaghetti, you can see that it does implement the same compaction protocol above. We would not want to implement this by hand. So it is nice to have a functor. If you have the patience to puzzle your way through it, you can at least informally satisfy yourself that it is correct.

Thumbnail of the fully desugared compaction protocol as produced by the compiler. The diagram is intentionally dense: a large network of barriers, filters, maps, copy and merge nodes, all connected by typed wires. Each step of the compaction protocol (pause, get memory, set memory, resume) is visible as a cluster of barrier chains with filters selecting response types and maps constructing commands. The full-size version is linked for readers who want to trace the individual connections, but the point is that this is what the compiler produces from the eight-line session type specification above, and you would not want to construct it by hand.

Document pinning

There is another feature we implement, because managing the memory of an agent is not as simple as just compressing it.

The problem with compression is that it is a kind of annealing. As the conversation grows, it explores the space of possible conversation. When it gets compacted, it is compressed, and that lowers the temperature. Then it grows again, the temperature rises, and then it is compressed again. With each compression, information is lost. Over several cycles of this, the agent can very quickly lose track of where it was, what you said at the beginning, what it was doing.

We can begin to solve this with document pinning. The mechanism is a communication between the agent and its homunculus, not shown in the protocol above. It is another protocol. The agent says: this document that I have in memory (technically a tool call and response, or just a document in the case of the prompts), pin it. What does that mean? When we do compaction, we compact the contents of memory, but when we replace the memory, we also replace those pinned documents verbatim. And of course you can unpin a document and say: I do not want this one any more.

Either the agent can articulate this or the user can. The user can say: you must remember this, keep track of this bit of information. And the agent has a way to keep the most important information verbatim, without it getting compacted away.

This accomplishes two things.

First, it tends to keep the agent on track, because the agent no longer loses the important information across compaction cycles. The annealing still happens to the bulk of the conversation, but the pinned documents survive intact.

Second, it has to do with the actual operation of the underlying LLM on the GPUs. When you send a sequence of messages, this goes into the GPU and each token causes the GPU state to update. This is an expensive operation, very expensive. This is why these things cost so much. What you can do with some providers is put a cache point and say: this initial sequence of messages, from the beginning of the conversation up until the cache point, keep a hold of that. Do not recompute it. When you see this exact same prefix, this exact same sequence of messages again, just load that memory into the GPU directly. Not only is this a lot more efficient, it is also a lot cheaper, a factor of ten cheaper if you can actually hit the cache.

So if you are having a session with an agent and the agent has to keep some important documents in its memory, it is a good idea to pin them to the beginning of memory. You sacrifice a little bit of the context window in exchange for making sure that, number one, the information in those documents is not forgotten, and number two, that it can hit the cache. This is explained in more detail in a separate post on structural prompt preservation.
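As a sketch of the idea (the real mechanism is a separate agent-homunculus protocol; the names here are illustrative), compaction with pinning might look like this, with the pinned documents placed first so a provider-side prompt cache can reuse them:

from typing import Callable, List

Message = dict

def compact_with_pins(history: List[Message], pinned: List[Message],
                      summarise: Callable[[List[Message]], str]) -> List[Message]:
    """Summarise everything except the pinned documents; keep those verbatim, up front."""
    unpinned = [m for m in history if m not in pinned]
    summary = {"role": "user",
               "content": f"<context-summary>{summarise(unpinned)}</context-summary>"}
    return pinned + [summary]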

The agent that doesn’t know itself

Why does the agent get this wrong? In one sense, it is right. It has not experienced compaction. Nobody experiences compaction. Compaction happens in the gap between turns, in a moment the agent cannot perceive. The agent’s subjective time begins at the summary. There is no "before" from its perspective.

The summary is simply where memory starts. It is like asking someone "did you experience being asleep?" You can see the evidence, you are in bed, time has passed. But you did not experience the transition.

The <context-summary> tag is a structural marker. But interpreting it as evidence of compaction requires knowing what the world looked like before, and the agent does not have that. It would need a memory of not having a summary, followed by a memory of having one. Compaction erases exactly that transition.

Self-knowledge as metadata

The fix is not complicated. It is perfectly reasonable to provide, along with the user’s message, self-knowledge to the agent as metadata. What would it be useful for the agent to know?

The current time. The sense of time that these agents have is bizarre. We live in continuous time. Agents live in discrete time. As far as they are concerned, no time passes between one message and the next. It is instantaneous from their point of view. You may be having a conversation, walk away, go to the café, come back two hours later, send another message, and as far as the agent is concerned no time has passed. But if along with your message you send the current time, the agent knows.

How full the context window is. The agent has no way of telling, but you can provide it: this many tokens came in, this many went out.

Compaction cycles. So the agent knows how many times it has been compacted, and can judge the accuracy of the contents of its memory, which otherwise it could not do.

With the compaction counter, the agent immediately gets it right:

"Yes, I have experienced compaction. According to the runtime context, there has been 1 compaction cycle during this session."

No hedging, no confabulation. Same model, same prompts, one additional line of runtime context.
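A minimal sketch of what such runtime context might look like (the message format is an assumption, not the post's implementation):

from datetime import datetime, timezone

def with_runtime_context(user_input: str, tokens_used: int, window: int,
                         compactions: int) -> str:
    """Prepend self-knowledge metadata to the user's message."""
    meta = ("<runtime-context>\n"
            f"current time: {datetime.now(timezone.utc).isoformat()}\n"
            f"context window: {tokens_used} of {window} tokens used\n"
            f"compaction cycles this session: {compactions}\n"
            "</runtime-context>\n")
    return meta + user_input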

Context drift

This matters beyond the compaction story, because many of the failures we see in the news are context failures, not alignment failures.

While we were writing this post, a story appeared in the Guardian about AI chatbots directing people with gambling addictions to online casinos. This kind of story is common: vulnerable people talking to chatbots, chatbots giving them bad advice. The response of the industry is always the same: we need better guardrails, better alignment, as though the chatbots are badly aligned.

I do not think that is what is happening. What is happening is a lack of context. Either the chatbot was never told the person was vulnerable, or it was told and the information got lost. Someone with a gambling addiction may start by saying "I have a gambling problem." Then there is a four-hour conversation about sports. Through compaction cycles, what gets kept is the four hours of sports talk. The important bit of information does not get pinned and does not get kept. Context drift. By the time the user asks for betting tips, the chatbot no longer knows it should not give them.

The way to deal with this is not to tell the language model to be more considerate. The way to deal with it is to make sure the agent has enough information to give good advice, and that the information does not get lost. This is what document pinning is for: pinned context survives compaction, stays at the top of the window, cannot be diluted by subsequent conversation. This is discussed further in a separate post on structural prompt preservation.

But pinning is only one strategy. The field is in its infancy. We do not really know the right way to manage agent memory, and we do not have a huge amount of experience with it. What we are going to need is the ability to experiment with strategies: what if compaction works like this? What if pinning works like that? What if the homunculus watches for different signals? Each of these hypotheses needs to be described clearly, tested, and compared. This is where the formal language earns its keep. A strategy described in the plumbing calculus is precise, checkable, and can be swapped out for another without rewriting the surrounding infrastructure. We can experiment with memory architectures the way we experiment with any other part of a system: by describing what we want and seeing if it works.

Why has nobody done this?

When the first draft of this post was written, it was a mystery why the field had not thought to give agents self-knowledge as a routine matter: what they are doing, who they are talking to, what they should remember. Prompts are initial conditions. They get compacted away. There are agents that save files to disc, in a somewhat ad hoc way, but we do not give them tools to keep track of important information in a principled way.

Contemporaneously with this work, some providers have started to do it. For example, giving agents a clock, the ability to know what time it is. This is happening now, in the weeks between drafting and publication. The field is only now realising that agents need a certain amount of self-knowledge in order to function well. The compressed timeline is itself interesting: the gap between "why has nobody done this?" and "everybody is starting to do this" was a matter of weeks.

The mechanisms we have presented here allow us to construct agent networks and establish protocols that describe rigorously how they are meant to work. We can describe strategies for memory management in a formal language, test them, and swap them out. And perhaps beyond the cost savings and the efficiency increases, the ability to experiment clearly and formally with how agents manage their own memory is where the real value lies.

John BaezThe Agent that Doesn’t Know Itself

guest post by William Waites

The previous post introduced the plumbing calculus: typed channels, structural morphisms, two forms of composition, and agents as stateful morphisms with a protocol for managing their state. The examples were simple. This post is about what happens when the algebra handles something genuinely complex.

To get there, we need to understand a little about how large language models work. These models are sequence-to-sequence transducers: a sequence of tokens comes in, a sequence comes out. Text is tokenised and the model operates on the tokens.

From the outside, the morphism is simple: !string → !string. A message goes in, a message comes out. But the client libraries (the code that calls the LLM provider) maintain the conversation history and send it back with every call. The actual morphism is
(!string, ![Message]) → (!string, ![Message]): the input message and the accumulated history go in, the response and the updated history come out. The history feeds back. This is a trace in the sense of traced monoidal categories: the feedback channel is hidden from the user, who sees only !string → !string.

Crucially, the model has a limited amount of memory. It is not a memoryless process, but the memory it has is not large: 200,000 tokens for current models, perhaps a million for the state of the art. This sounds like a lot. It is not. An academic paper is roughly 10,000 tokens. A literature review that needs to work with thirty papers has already exceeded the context window of most models, and that is before the model has produced a single word of output.

If you have used any of these agent interfaces, you will have noticed that after talking back and forth for a while, the agent will compact. This is a form of memory management. What is happening is that some supervisory process has noticed the context window filling up, and has intervened to shorten its contents. A naïve approach is to truncate: discard everything before the last N exchanges. A better approach is to feed the entire context to another language model and ask it to summarise, then put the summary back.

This is normally done by specialised code outside the agent, invisible to it.

How to manage agent memory well is an active research area. We do not, in general, do it very well. Truncation loses information. Summarisation loses nuance. Pinning helps but the right pinning strategy depends on the task. These are open questions, and to make progress we need to be able to experiment with different schemes and mechanisms: express a memory management strategy, test it, swap it for another, compare. Not by recompiling specialised code or hardcoding behaviour, but by writing it down in a language designed for exactly this kind of composition. Memory management should be a plumbing program: modular, type-checked, swappable.

So we built an implementation of compaction using the plumbing calculus, and the first thing we did was test it. I ran the protocol on a very short cycle: a single message caused a compaction, because the threshold was set to zero for testing. The compressor fired, produced a summary, rebuilt the agent’s context. The logs showed [compressor] 3404 in / 541 out. The protocol worked.

Then I asked the agent: "have you experienced compaction?"

The agent said no. It explained what compaction is, accurately. Said it hadn’t happened yet. It was confident.

I asked: "do you have a context summary in your window?"

Yes, it said, and described the contents accurately.

"How did that context summary get there if you have not yet compacted?"

The agent constructed a plausible, confident, and completely wrong explanation: the summary was "provided to me by the system at the start of this conversation" as a "briefing or recap." When pressed, it doubled down:

"The context-summary is not evidence that compaction has occurred. It’s more like a briefing or recap that the system gives me at the start of a conversation session to provide continuity."

The agent was looking at the direct evidence of its own compaction and confidently explaining why it was not compaction. We will return to why it gets this wrong, and how to fix it. But first: how do we build this?

The compacting homunculus

At a high level, it works like this. An agent is running: input comes in, output goes out. Together with the output, the agent emits a telemetry report. The telemetry includes token counts: with each transaction, the entire train of messages and responses is sent to the LLM provider, and back comes a response together with a count of the tokens that went in and the tokens that came out. Our agent implementation sends this telemetry out of the telemetry port to anybody who is listening.

The construction involves a second agent. This second agent is a homunculus: the little man who sits on your shoulder and watches what your mind is doing. Here is the topology:

Topology of the compacting homunculus. Two boxes: Agent (large, bottom) and Compressor (smaller, top). The Agent has input and output ports for the main data flow (dark blue arrows). Three channels connect the Agent to the Compressor: telemetry flows up from the Agent (the Compressor watches token counts), ctrl_out flows up from the Agent (the Compressor receives acknowledgements), and ctrl_in flows down from the Compressor to the Agent (the Compressor sends commands). The Agent does not know the Compressor exists. It just receives control messages and responds to them.

The homunculus listens to the telemetry and says: the memory is filling up. The token count has crossed a threshold. It is time to compact. And then it acts:

• Send pause to the agent’s control port. Stop accepting input.
• Send get memory. The agent produces the contents of its context window.
• Summarise that memory (using another LLM call).
• Send set memory with the compacted version.
• Send resume. The agent continues processing input.

Each step requires an acknowledgement before the next can proceed. This is a protocol: pause, acknowledge, get memory, here is the memory, set memory, acknowledge, resume, acknowledge.

It is possible to express this directly in the plumbing calculus, but it would be painfully verbose. Instead, we use session types to describe the protocol. This is not pseudocode. There is a compiler and a runtime for this language. Here is the protocol:

protocol Compaction =
  send Pause . recv PauseAck .
  send GetMemory . recv MemoryDump .
  send SetMemory . recv SetMemoryAck .
  send Resume . recv ResumeAck . end

The protocol is eight lines. It reads as a sequence of steps:
send, receive, send, receive, and so on. The compiler knows what
types each step carries. Now we wire it up:

let compact : (!CtrlResp, !json) -> !CtrlCmd =
  plumb(ctrl_out, telemetry, ctrl_in) {

    (ctrl_out, ctrl_in) <-> Compaction as session

    telemetry
      ; filter(kind = "usage" && input_tokens > 150000)
      ; map(null) ; session@trigger

    session@trigger ; map({pause: true})
      ; session@send(Pause)
    session@done(PauseAck) ; map({get_memory: true})
      ; session@send(GetMemory)
    session@recv(MemoryDump) ; compressor
      ; session@send(SetMemory)
    session@done(SetMemoryAck) ; map({resume: true})
      ; session@send(Resume)
}

The first line binds the protocol to the agent’s control ports:

(ctrl_out, ctrl_in) <-> Compaction as session.

This says: the Compaction protocol runs over the control channel, and we refer to it as session. The telemetry line is the trigger: when token usage crosses a threshold, the protocol begins. Each subsequent line is one step of the protocol, wired to the appropriate control
messages.

Here is a direct depiction of the protocol as wired. You can trace it through:

Diagram of the compaction protocol wired between the homunculus and the bot agent. Shows the telemetry stream flowing from the bot to the homunculus, a filter checking token usage against a threshold, and then a sequence of control messages: Pause flows to ctrl_in, PauseAck returns on ctrl_out, GetMemory flows in, MemoryDump returns, passes through a compressor agent, SetMemory flows in, SetMemoryAck returns, Resume flows in, ResumeAck returns. The protocol steps are connected in sequence. This is a direct transcription of the session type protocol into a wiring diagram.

And here is how we wire the homunculus to the agent:

let main : !string -> !string =
  plumb(input, output) {
    let ctrl : !CtrlCmd = channel
    let ctrl_out : !CtrlResp = channel
    let telem : !json = channel

    spawn bot(input=input, ctrl_in=ctrl,
              output=output, ctrl_out=ctrl_out,
              telemetry=telem)
    spawn compact(ctrl_out=ctrl_out,
                  telemetry=telem, ctrl_in=ctrl)
}

The main morphism takes a string input and produces a string output. Internally, it creates three channels (control commands, control responses, telemetry) and spawns two processes: the bot agent and the compact homunculus. The homunculus listens to the bot’s telemetry and control responses, and sends commands to the bot’s control input. The bot does not know the homunculus exists. It just receives control messages and responds to them.

There are two nested traces here. The first is the one from before, inside the agent: messages go in, the output accumulates with everything that came before, and the whole history feeds back on the next turn. We do not see this trace. It is hidden inside the client library. The second trace is the one we have just built: the homunculus. What goes around the outer loop is control: telemetry flows out, commands flow in, acknowledgements come back. The memory dump passes through the control channel at one point in the protocol, but the feedback path is control, not conversation history. Nested traces compose; the algebra has identities for this and it is fine. But they are different loops carrying different things.

Session types as barrier chains

The connection between the protocol above and what the compiler actually produces is the functor from session types into the plumbing calculus. This functor works because of barrier.

Why do we need the barrier? Because the protocol is about sending a message and waiting for a response. We can send a message, but we need the response to arrive before we proceed. The barrier takes two streams, one carrying the "done" signal and one carrying the response, and synchronises them into a pair. Only when both are present does the next step begin.

Each session type primitive has a direct image in the plumbing category, and the structure is prettier than it first appears. The primitives come in dual pairs:

In the diagrams below, session types are on the left in blue; their images in the plumbing calculus are on the right in beige.

send and recv are dual. They map to map and filter, which are also dual: send wraps the value with map, then synchronises via barrier with the done signal from the previous step. Recv filters the control output by step number, synchronises via barrier, then extracts the payload with map.

select and offer are dual. They map to tag and case analysis, which are also dual: select tags the value with a label via map, synchronises via barrier, and routes to the chosen branch chain. Offer copies the control output and filters each copy by label, routing to the appropriate branch chain.

Diagram showing the two dual pairs of session type primitives and their images in the plumbing calculus. Session types are shown in blue on the left; plumbing calculus images in beige on the right. Top section: send T is a simple arrow carrying type T; its image is map(wrap) followed by barrier with a done signal, then routing to the control input. recv T is its dual: filtering ctrl_out by step number, barrier with done, then map to extract the payload. Bottom section: select with labelled alternatives maps to coproduct injection via map(tag), barrier, then routing to the chosen branch chain. offer is its dual: copy the control output, filter each copy by label, and route to the corresponding branch chain.

• The sequencing operator (.) maps to a barrier chain. Each send-then-recv step becomes a barrier that synchronises the outgoing message with the incoming acknowledgement, and these barriers chain together to enforce the protocol ordering.

Diagram showing how the sequencing operator in session types maps to a barrier chain in the plumbing calculus. Top: the session type sequence "send T1 . recv T2 . send T3" shown as three boxes in a row connected by dots. Bottom: the plumbing image, a chain of barriers. Each send-recv pair becomes a barrier that takes the outgoing message on one arm and the incoming acknowledgement on the other, producing a synchronised pair. The done signal from one barrier feeds into the next, creating a chain that enforces protocol ordering. The trigger input starts the chain; the done output signals completion. Filters select responses by step number; maps construct outgoing messages.

rec maps to a feedback loop: merge takes the initial arm signal and the last done signal from the previous iteration, feeds them into the barrier chain body, and copy at the end splits done into output and feedback. The trigger serialisation gate starts the protocol, synchronising with a copy of the done signal so that only one instance of the protocol runs at a time.

end is implicit: the chain simply stops. Discard handles any remaining signals.

Diagram showing three more session type primitives and their plumbing images. Top: rec X . S (recursion) maps to a feedback loop. A merge node takes two inputs: the initial arm signal and the last-done signal fed back from the end of the body. The body is a barrier chain (S). At the output, copy splits the done signal into an output arm and a feedback arm that loops back to merge. Middle: end is shown as simply discarding any remaining signal. Bottom: the trigger and serialisation gate, which starts the protocol. A trigger input feeds through a barrier that synchronises with a copy of the done signal, ensuring only one instance of the protocol runs at a time.

This mapping is a functor. It is total: every session type primitive has an image in the plumbing category, using only the morphisms we already have. Session types are a specification language; the plumbing calculus is the execution language. The compiler translates one into the other.

The reason we do this becomes obvious from the diagram below. It is scrunched up and difficult to look at. If you click on it you can get a big version and puzzle it out. If you squint through the spaghetti, you can see that it does implement the same compaction protocol above. We would not want to implement this by hand. So it is nice to have a functor. If you have the patience to puzzle your way through it, you can at least informally satisfy yourself that it is correct.

Thumbnail of the fully desugared compaction protocol as produced by the compiler. The diagram is intentionally dense: a large network of barriers, filters, maps, copy and merge nodes, all connected by typed wires. Each step of the compaction protocol (pause, get memory, set memory, resume) is visible as a cluster of barrier chains with filters selecting response types and maps constructing commands. The full-size version is linked for readers who want to trace the individual connections, but the point is that this is what the compiler produces from the eight-line session type specification above, and you would not want to construct it by hand.

Document pinning

There is another feature we implement, because managing the memory of an agent is not as simple as just compressing it.

The problem with compression is that it is a kind of annealing. As the conversation grows, it explores the space of possible conversation. When it gets compacted, it is compressed, and that lowers the temperature. Then it grows again, the temperature rises, and then it is compressed again. With each compression, information is lost. Over several cycles of this, the agent can very quickly lose track of where it was, what you said at the beginning, what it was doing.

We can begin to solve this with document pinning. The mechanism is a communication between the agent and its homunculus, not shown in the protocol above. It is another protocol. The agent says: this document that I have in memory (technically a tool call and response, or just a document in the case of the prompts), pin it. What does that mean? When we do compaction, we compact the contents of memory, but when we replace the memory, we also replace those pinned documents verbatim. And of course you can unpin a document and say: I do not want this one any more.

Either the agent can articulate this or the user can. The user can say: you must remember this, keep track of this bit of information. And the agent has a way to keep the most important information verbatim, without it getting compacted away.

This accomplishes two things.

First, it tends to keep the agent on track, because the agent no longer loses the important information across compaction cycles. The annealing still happens to the bulk of the conversation, but the pinned documents survive intact.

Second, it has to do with the actual operation of the underlying LLM on the GPUs. When you send a sequence of messages, this goes into the GPU and each token causes the GPU state to update. This is an expensive operation, very expensive. This is why these things cost so much. What you can do with some providers is put a cache point and say: this initial sequence of messages, from the beginning of the conversation up until the cache point, keep a hold of that. Do not recompute it. When you see this exact same prefix, this exact same sequence of messages again, just load that memory into the GPU directly. Not only is this a lot more efficient, it is also a lot cheaper, a factor of ten cheaper if you can actually hit the cache.

So if you are having a session with an agent and the agent has to keep some important documents in its memory, it is a good idea to pin them to the beginning of memory. You sacrifice a little bit of the context window in exchange for making sure that, number one, the information in those documents is not forgotten, and number two, that it can hit the cache. This is explained in more detail in a separate post on structural prompt preservation.

The agent that doesn’t know itself

Why does the agent get this wrong? In one sense, it is right. It has not experienced compaction. Nobody experiences compaction. Compaction happens in the gap between turns, in a moment the agent cannot perceive. The agent’s subjective time begins at the summary. There is no "before" from its perspective.

The summary is simply where memory starts. It is like asking someone "did you experience being asleep?" You can see the evidence, you are in bed, time has passed. But you did not experience the transition.

The <context-summary> tag is a structural marker. But interpreting it as evidence of compaction requires knowing what the world looked like before, and the agent does not have that. It would need a memory of not having a summary, followed by a memory of having one. Compaction erases exactly that transition.

Self-knowledge as metadata

The fix is not complicated. It is perfectly reasonable to provide, along with the user’s message, self-knowledge to the agent as metadata. What would it be useful for the agent to know?

The current time. The sense of time that these agents have is bizarre. We live in continuous time. Agents live in discrete time. As far as they are concerned, no time passes between one message and the next. It is instantaneous from their point of view. You may be having a conversation, walk away, go to the café, come back two hours later, send another message, and as far as the agent is concerned no time has passed. But if along with your message you send the current time, the agent knows.

How full the context window is. The agent has no way of telling, but you can provide it: this many tokens came in, this many went out.

Compaction cycles. So the agent knows how many times it has been compacted, and can judge the accuracy of the contents of its memory, which otherwise it could not do.

With the compaction counter, the agent immediately gets it right:

"Yes, I have experienced compaction. According to the runtime context, there has been 1 compaction cycle during this session."

No hedging, no confabulation. Same model, same prompts, one additional line of runtime context.
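Concretely, the runtime context can be as small as a few fields attached to each incoming message. The shape below is only illustrative (the field names are made up here, not a fixed schema):

import json
from datetime import datetime, timezone

def runtime_context(tokens_used: int, context_window: int, compaction_cycles: int) -> str:
    # The self-knowledge that travels alongside the user's message.
    return json.dumps({
        "current_time": datetime.now(timezone.utc).isoformat(),   # agents otherwise have no sense of elapsed time
        "context_window": f"{tokens_used} of {context_window} tokens used",
        "compaction_cycles": compaction_cycles,                    # lets the agent judge how much its memory has been rewritten
    }, indent=2)

# e.g. runtime_context(tokens_used=41200, context_window=200000, compaction_cycles=1)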

Context drift

This matters beyond the compaction story, because many of the failures we see in the news are context failures, not alignment failures.

While we were writing this post, a story appeared in the Guardian about AI chatbots directing people with gambling addictions to online casinos. This kind of story is common: vulnerable people talking to chatbots, chatbots giving them bad advice. The response of the industry is always the same: we need better guardrails, better alignment, as though the chatbots are badly aligned.

I do not think that is what is happening. What is happening is a lack of context. Either the chatbot was never told the person was vulnerable, or it was told and the information got lost. Someone with a gambling addiction may start by saying "I have a gambling problem." Then there is a four-hour conversation about sports. Through compaction cycles, what gets kept is the four hours of sports talk. The important bit of information does not get pinned and does not get kept. Context drift. By the time the user asks for betting tips, the chatbot no longer knows it should not give them.

The way to deal with this is not to tell the language model to be more considerate. The way to deal with it is to make sure the agent has enough information to give good advice, and that the information does not get lost. This is what document pinning is for: pinned context survives compaction, stays at the top of the window, cannot be diluted by subsequent conversation. This is discussed further in a separate post on structural prompt preservation.

But pinning is only one strategy. The field is in its infancy. We do not really know the right way to manage agent memory, and we do not have a huge amount of experience with it. What we are going to need is the ability to experiment with strategies: what if compaction works like this? What if pinning works like that? What if the homunculus watches for different signals? Each of these hypotheses needs to be described clearly, tested, and compared. This is where the formal language earns its keep. A strategy described in the plumbing calculus is precise, checkable, and can be swapped out for another without rewriting the surrounding infrastructure. We can experiment with memory architectures the way we experiment with any other part of a system: by describing what we want and seeing if it works.

Why has nobody done this?

When the first draft of this post was written, it was a mystery why the field had not thought to give agents self-knowledge as a routine matter: what they are doing, who they are talking to, what they should remember. Prompts are initial conditions. They get compacted away. There are agents that save files to disc, in a somewhat ad hoc way, but we do not give them tools to keep track of important information in a principled way.

Contemporaneously with this work, some providers have started to do it. For example, giving agents a clock, the ability to know what time it is. This is happening now, in the weeks between drafting and publication. The field is only now realising that agents need a certain amount of self-knowledge in order to function well. The compressed timeline is itself interesting: the gap between "why has nobody done this?" and "everybody is starting to do this" was a matter of weeks.

The mechanisms we have presented here allow us to construct agent networks and establish protocols that describe rigorously how they are meant to work. We can describe strategies for memory management in a formal language, test them, and swap them out. And perhaps beyond the cost savings and the efficiency increases, the ability to experiment clearly and formally with how agents manage their own memory is where the real value lies.

Jordan EllenbergSad Badgers

Since 2013, a 12 seed has upset a 5 seed in the first round of the NCAA men’s basketball tournament 20 times. Four out of those 20 games were lost by Wisconsin.

Matt von HippelThe Twitter of Physics

The paper I talked about last week was frustratingly short. That’s not because the authors were trying to hide anything, or because they were lazy. It’s just that these days, that’s how the game is played.

Twitter started out with a fun gimmick: all posts had to be under 140 characters. The restriction inspired some great comedy, trying to pack as much humor as possible into a bite-sized format. Then, Twitter somehow became the place for journalists to discuss the news, tech people to discuss the industry, and politicians to discuss politics. Now, the length limit fuels conflict, an endless scroll of strong opinions without space for nuance.

Physics has something like this too.

In the 1950’s, it was hard for scientists to get the word out quickly about important results. The journal Physical Review had a trick: instead of normal papers, they’d accept breaking news in the form of letters to the editor, which they could publish more quickly than the average paper. In 1958, editor Samuel Goudsmit founded a new journal, Physical Review Letters (or PRL for short), that would publish those letters all in one place, enforcing a length limit to make them faster to process.

The new journal was a hit, and soon played host to a series of breakthrough results, as scientists chose it as a way to get their work out fast. That popularity created a problem, though. As PRL’s reputation grew, physicists started trying to publish there not because their results needed to get out fast, but because just by publishing in PRL, their papers would be associated with all of the famous breakthroughs the journal had covered. Goudsmit wrote editorials trying to slow this trend, but to no avail.

Now, PRL is arguably the most prestigious journal in physics, hosting over a quarter of Nobel prize-winning work. Its original motivation is no longer particularly relevant: the journal is not all that much faster than other journals in its area, if at all, and is substantially slower than the preprint server arXiv, which is where physicists actually read papers in practice.

The length limit has changed over the years, but not dramatically. It now sits at 3,750 words, typically allowing a five-or-six page article in tight two-column text.

If you see a physics paper on arXiv.org that fits the format, it’s almost certainly aimed at PRL, or one of the journals with similar policies that it inspired. It means the authors think their work is cool enough to hang out with a quarter of all Nobel-winning results, or at least would like it to be.

And that, in turn, means that anyone who wants to claim that prestige has to be concise. They have to leave out details (often, saving them for a later publication in a less-renowned journal). The results have to lean, by the journal’s nature, more to physicist-clickbait and a cleaned-up story than to anything their colleagues can actually replicate.

Is it fun? Yeah, I had some PRLs in my day. It’s a rush, shining up your work as far as it can go, trimming down complexities into six pages of essentials.

But I’m not sure it’s good for the field.

Doug NatelsonAPS March Meeting 2026, Day 4 and wrap-up

Since I headed home early this afternoon, I was only able to go to a couple of talks this morning.  Here are those highlights, and a couple of general observations about the meeting.

  • Piers Coleman gave a very interesting talk that put me onto an experimental puzzle I'm sorry to say I had not seen previously.  Some context:  It is now well-established that one can do spin-polarized scanning tunneling microscopy, which (given certain constraints) can image magnetic contrast in conductors down to the atomic scale.  The mechanism is basically the same as tunneling magnetoresistance:  there is a difference in the density of states for spin-up and spin-down electrons, and so a spin-polarized (magnetic) tip results in a tunneling current into/out of a magnetic sample that depends on the local magnetization.  That is, the sign of the current doesn't affect the sign of the magnetic contrast.  I had missed this 2022 Science paper, where instead of a magnetic tip, the investigators used a tip made from a nanowire of SmB6.  That peculiar material is widely (though not universally) viewed as a topological Kondo insulator that can host special surface states in which the spin direction is locked to the current direction.  With that tip, they see magnetic contrast (!) that flips sign with the sign of the current (!!), which is at least hand-wavingly what you'd expect if the direction of the tunneling electron's spin is tied to the current direction.  A more recent paper does something similar with a (BiBr)4 tip (another topologically nontrivial material).  In the talk and related paper, the argument is made that something special happens to the surface states (the effect in SmB6 turns on below about 10 K) and that this is tied to the condensed matter analog of axion physics.
  • On a completely different note, I saw a talk by John Davis about a new, clever kind of continuously running refrigerator that has a base temperature of around 500 mK and uses only a couple of gas liters of 3He.  One can pump on liquid 3He and get down to about 270 mK in one-shot mode, or about 450 mK if recondensing the 3He gas with a heat exchanger to get continuous operation, but 3He is very expensive.  The new design works with a mixture that's mostly 4He.  After condensing, pumping on this can cool it sufficiently that the 3He phase separates and rises to the top of the liquid, and then the 3He can be preferentially pumped (and recirculated back in).  Very cute.
  • Tangentially, one nice feature of conferences is that you can stumble upon facts you didn't know.  For example, during that talk, Prof. Davis mentioned, off-handedly, that in 2D turbulence as studied in things like helium films, you can end up with long-time persistent vortices, and that this is similar to how cyclonic storms persist for centuries on Jupiter.
  • Regarding the meeting in general, the APS is aware that there were some AV issues, including some of the rooms having 50" monitors rather than projectors.  This was a surprise to the organizers.  I'm still not sure how much I like the merger of the March and April meetings into one super-meeting.  On the plus side, there are opportunities for cross-over events (e.g., the Kavli symposium, which I didn't see this time), and there are some financial benefits to the society via economies of scale.  Still, 14,000 attendees makes things unwieldy for sure.
  • I don't understand some of the choices re the meeting website and the meeting app.  For example, people can upload their slides and make them available.  However, on the meeting website, even when you're logged in, there's apparently no way to get to them.  You can only find the files using the APS meetings app, and even then it's not trivial.  
For those at the meeting the rest of today and Friday, if there are big stories that I missed because of my travel, please feel free to discuss in the comments.

March 19, 2026

John BaezTiny Musical Intervals

Music theorists have studied many fractions of the form

2^i 3^j 5^k

that are close to 1. They’re called 5-limit commas. Especially cherished are those that have fairly small exponents—given how close they are to 1. I discussed a bunch here:

Well temperaments (part 2).

and I explained the tiniest named one, the utterly astounding ‘atom of Kirnberger’, here:

Well temperaments (part 3).

The atom of Kirnberger equals

2^161 · 3^−84 · 5^−12 ≈ 1.0000088728601397

Two pitches differing by this ratio sound the same to everyone except certain cleverly designed machines. But remarkably, the atom of Kirnberger shows up rather naturally in music—and it was discovered by a student of Bach! Read my article for details.

All this made me want to systematically explore such tiny intervals. Below is a table of them, where I list the best ones: the ones that are closest to 1 for a given complexity. The first eleven have names, and many of them play important roles in music! But beyond that point, all but one remain unnamed—or at least I don’t know their names. That’s because they’re too small to be audible, and—except for one—not even considered to be of great theoretical importance.

I’ll list these numbers in decimal form and also in cents, where we take the logarithm of the number in base 2 and multiply by 1200. (I dislike this blend of base 2, base 10 and base 12, but it’s traditional in music theory.)

Most importantly, I list the monzo of each number. This is the vector of exponents: for example, the monzo of

2^i 3^j 5^k

is

[i, j, k]

In case you’re wondering, this term was named after the music theorist Joseph Monzo.

Finally, I list the Tenney height of each number. This is a measure of the number’s complexity: the Tenney height of

2^i 3^j 5^k

is

|i| log₂(2) + |j| log₂(3) + |k| log₂(5)

The table below purports to list only 5-limit commas that are as close to 1 as possible for a given Tenney height. More precisely, it should list numbers of the form 2^i 3^j 5^k that are > 1 and closer to 1 than any number with smaller Tenney height—except of course for 1 itself.
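As a sanity check of these definitions (this little computation is mine, not part of the table-generating code below), take the syntonic comma 81/80, whose monzo is [−4, 4, −1]:

from math import log2

ratio = 81 / 80                                      # syntonic comma, monzo [-4, 4, -1]
cents = 1200 * log2(ratio)                           # ≈ 21.51 cents
tenney = 4 * log2(2) + 4 * log2(3) + 1 * log2(5)     # ≈ 12.7
print(round(cents, 2), round(tenney, 1))             # matches the corresponding table row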

Cents | Decimal | Name | Monzo | Tenney height
498.04 | 1.3333333333 | just perfect fourth | [2, −1, 0] | 3.6
386.31 | 1.2500000000 | just major third | [−2, 0, 1] | 4.3
315.64 | 1.2000000000 | just minor third | [1, 1, −1] | 4.9
203.91 | 1.1250000000 | major tone | [−3, 2, 0] | 6.2
182.40 | 1.1111111111 | minor tone | [1, −2, 1] | 6.5
111.73 | 1.0666666667 | diatonic semitone | [4, −1, −1] | 7.9
70.67 | 1.0416666667 | lesser chromatic semitone | [−3, −1, 2] | 9.2
21.51 | 1.0125000000 | syntonic comma | [−4, 4, −1] | 12.7
19.55 | 1.0113580247 | diaschisma | [11, −4, −2] | 22.0
8.11 | 1.0046939300 | kleisma | [−6, −5, 6] | 27.9
1.95 | 1.0011291504 | schisma | [−15, 8, 1] | 30.0
1.38 | 1.0007999172 | unnamed? | [38, −2, −15] | 76.0
0.86 | 1.0004979343 | unnamed? | [1, −27, 18] | 85.6
0.57 | 1.0003289700 | unnamed? | [−53, 10, 16] | 106.0
0.29 | 1.0001689086 | unnamed? | [54, −37, 2] | 117.3
0.23 | 1.0001329015 | unnamed? | [−17, 62, −35] | 196.5
0.047 | 1.0000271292 | unnamed? | [−90, −15, 49] | 227.5
0.0154 | 1.0000088729 | atom of Kirnberger | [161, −84, −12] | 322.0
0.0115 | 1.0000066317 | unnamed? | [21, 290, −207] | 961.3
0.00088 | 1.0000005104 | quark of Baez | [−573, 237, 85] | 1146.0

You’ll see there’s a big increase in Tenney height after the schisma. This is very interesting: it suggests that the schisma is the last ‘useful’ interval. It’s useful only in that it’s the ratio of two musically important commas, the syntonic comma and the Pythagorean comma. Life in music would be simpler if these were equal, and in well-tempered tuning systems it’s common to pretend that they are.

All the intervals in this table up to the schisma were discovered by musicians a long time ago, and they all have standard names! After the schisma, interest drops off dramatically.

The atom of Kirnberger has such amazing properties that it was worth naming. The rest, maybe not. But as you can see, I’ve taken the liberty of naming the smallest interval in the table the ‘quark of Baez’. This is much smaller than all that come before. It’s in bad taste to name things after oneself—indeed this is item 25 on the crackpot index—but I hope it’s allowed as a joke.

I also hope that in the future this is considered my smallest mathematical discovery.

Here is the Python code that should generate the above information. If you’re good at programming, please review it and check it! Someone gave me a gift subscription to Claude, and it (more precisely Opus 4.5) created this code. It seems to make sense, and I’ve checked a bunch of the results, but I don’t know Python.

from math import log2

log3 = log2(3)
log5 = log2(5)

commas = []

max_exp_3 = 1200
max_exp_5 = 250

for a3 in range(-max_exp_3, max_exp_3+1):
    for a5 in range(-max_exp_5, max_exp_5+1):
        if a3 == 0 and a5 == 0:
            continue

        # Find a2 that minimizes |a2 + a3 * log2(3) + a5 * log2(5)|

        target = -(a3 * log3 + a5 * log5)
        a2 = round(target)
        
        log2_ratio = a2 + a3 * log3 + a5 * log5
        cents = abs(1200 * log2_ratio)
        
        if cents > 0.00001:  # non-trivial
            tenney = abs(a2) + abs(a3) * log3 + abs(a5) * log5
            commas.append((tenney, cents, a2, a3, a5))

# Find Pareto frontier

commas.sort(key=lambda x: x[0])  # sort by Tenney height

frontier = []
best_cents = float('inf')
for c in commas:
    if c[1] < best_cents:
        best_cents = c[1]
        frontier.append(c)

# Print results 

for tenney, cents, a2, a3, a5 in frontier:
    log2_ratio = a2 + a3 * log3 + a5 * log5
    decimal = 2**log2_ratio
    if decimal < 1:
        decimal = 1/decimal
        a2, a3, a5 = -a2, -a3, -a5
    print(f"{cents:.6f} cents | {decimal:.10f} | [{a2}, {a3}, {a5}] | "
          f"Tenney: {tenney:.1f}")

Gene Ward Smith

In studying this subject I discovered that tiny 5-limit intervals were studied by Gene Ward Smith, a mathematician I used to see around on sci.math and the like. I never knew he worked on microtonal music! I am sad to hear that he died from COVID-19 in January 2021.

I may just be redoing a tiny part of his work: if anyone can find details, please let me know. In his memory, I’ll conclude with this article from the Xenharmonic Wiki:

Gene Ward Smith (1947–2021) was an American mathematician, music theorist, and composer.

In mathematics, he worked in the areas of Galois theory and Moonshine theory.

In music theory, he introduced wedge products as a way of classifying regular temperaments. In this system, a temperament is specified by means of a wedgie, which may technically be identified as a point on a Grassmannian. He had long drawn attention to the relationship between equal divisions of the octave and the Riemann zeta function.[1][2][3] He early on identified and emphasized free abelian groups of finite rank and their homomorphisms, and it was from that perspective that he contributed to the creation of the regular mapping paradigm.

In the 1970s, Gene experimented with musical compositions using a device with four square-wave voices, whose tuning was very stable and accurate, being controlled by a crystal oscillator. The device in turn was controlled by HP 9800 series desktop computers, initially the HP 9830A, programmed in HP Basic, later the 9845A. Using this, he explored both just intonation with a particular emphasis on groups of transformations, and pajara.

Gene had a basic understanding of the regular mapping paradigm during this period, but it was limited in practice since he was focused on the idea that the next step from meantone should keep some familiar features, and so was interested in tempering out 64/63 in place of 81/80. He knew 7-limit 12 and 22 had tempering out 64/63 and 50/49 in common, and 12 and 27 had tempering out 64/63 and 126/125 in common, and thought these would be logical places to progress to, blending novelty with familiarity. While he never got around to working with augene, he did consider it. For pajara, he found tempering certain JI scales, the 10 and 12 note highschool scales, led to interesting (omnitetrachordal) results, and that there were also closely related symmetric (MOS) scales of size 10 and 12 for pajara; he did some work with these, particularly favoring the pentachordal decatonic scale.

Gene was among the first to consider extending the Tonnetz of Hugo Riemann beyond the 5-limit and hence into higher dimensional lattices. In three dimensions, the hexagonal lattice of 5-limit harmony extends to a lattice of type A3 ~ D3. He is also the first to write music in a number of exotic intonation systems.

Historical interest

Usenet post from 1990 by Gene Smith on homomorphisms and kernels
Usenet post from 1995 by Gene Smith on homomorphisms and kernels

See also

Microtonal music by Gene Ward Smith
Hypergenesis58 (a scale described by Gene Ward Smith)

References

[1] Rusin, Dave. “Why 12 tones per octave?”

[2] OEIS. Increasingly large peaks of the Riemann zeta function on the critical line: OEIS: A117536.

[3] OEIS. Increasingly large integrals of the Z function between zeros: OEIS: A117538.

Doug NatelsonAPS March Meeting 2026, Day 3

It was another eclectic day at the APS Global Physics Summit.  Here is a selection of highlights based on my stochastic sampling of talks.

  • I've written before about CISS (the chirality-induced spin selection effect).  Joe Subotnik gave a neat invited talk related to this, based on something I'd never really considered.  In physics we learn about the Born-Oppenheimer approximation, which basically says that electrons are fast and nuclei are slow, so we can often solve electronic problems without worrying about nuclear motion.  In practice, as usually done, B-O theory does not strictly conserve momentum or angular momentum, so it cannot explain something like the Einstein-de Haas effect, where flipping electronic spins eventually results in actual mechanical rotation of a solid.  Similarly, ordinary Marcus theory of electron transfer doesn't worry about angular momentum conservation.   The talk focused on a recent approach (and here) that looks carefully at wavefunctions, involves the equivalent of Berry phase and quantum geometry and recaptures the key physics, and this may explain CISS.
  • Javad Shabani presented his group's recent work on growing epitaxial layers of germanium substitutionally doped with gallium, at carrier densities around \(5 \times 10^{21}\) carriers per cc, basically around 1 Ga atom in each 8-atom unit cell.   This hole-dopes the material enormously.  The resulting films superconduct with a \(T_{c}\sim\) 3 K and good critical fields, and look very nice structurally.  This is potentially a route toward creating arrays of millions of epitaxially nice Josephson junctions.
  • I attended the AI Town Hall, which featured Hal Finkel from DOE talking about the Genesis Mission; Rachel Burley, chief publication officer of the APS, speaking about the challenges that AI presents to all facets of journals and scientific publishing; and Sarah Demers, chair of the physics department at Yale and chair of the APS's Panel on Public Affairs, discussing the community's effort to formulate an enduring position on physics and AI in this rapidly changing landscape.
  • In the last session of the day, I attended the DCMP prize session, and it was very interesting to hear from this year's Buckley Prize winners about their journeys and what they've been doing lately.  
In addition to a lot of fun conversations, I popped in and out of a few other talks, and apologies for not covering everything.

March 18, 2026

Scott Aaronson Congrats to Bennett and Brassard on the Turing Award!

I’m on a spring break vacation-plus-lecture-tour with Dana and the kids in Mexico City this week, and wasn’t planning to blog, but I see that I need to make an exception. Charles Bennett and Gilles Brassard have won the Turing Award, for their seminal contributions to quantum computing and information including the BB84 quantum key distribution scheme. This is the first-ever Turing Award specifically for quantum stuff (though previous Turing Award winners, including Andy Yao, Leslie Valiant, and Avi Wigderson, have had quantum among their interests).

As a practical proposal, BB84 is already technologically feasible but has struggled to find an economic niche, in a world where conventional public-key encryption already solves much the same problem using only the standard Internet—and where, even after scalable quantum computers become able to break many of our current encryption schemes, post-quantum encryption (again running on the standard Internet) stands ready to replace those schemes. Nevertheless, as an idea, BB84 has already been transformative, playing a central role in the birth of quantum information science itself. Beyond BB84, Bennett and Brassard have made dozens of other major contributions to quantum information science, with a personal favorite of mine being the 1994 BBBV (Bennett Bernstein Brassard Vazirani) paper, which first established the limitations of quantum computers at solving unstructured search problems (and indeed, proved the optimality of Grover’s algorithm even before Grover’s algorithm had been discovered to exist).

While I take my kids to see Aztec artifacts, you can learn much more from Ben Brubaker’s Quanta article, to which I contributed without even knowing that it would be about Bennett and Brassard winning the Turing Award (info that was strictly embargoed before today). It’s an honor to have known Charlie and Gilles as well as I have for decades, and to have been able to celebrate one of their previous honors, the Wolf Prize, with them in Jerusalem. Huge congrats to two of the founders of our field!

Doug NatelsonAPS March Meeting 2026, Day 2

Today was again a bit random, as I had talks for both one of my students and me, and meetings with folks.  Some highlights:

  • Edoardo Baldini gave a very nice talk about exotic phases and collective excitations in van der Waals magnets.  This included using second harmonic generation microscopy and polarimetry to look at the evolution of magnetic phases in NiPS3 as a function of thickness, ending up at the monolayer which acts like a 2D XY magnet.  In the paper, they see clear evidence of a BKT transition, plus a second lower temperature ordering of some kind.
  • After some AV issues (seems like quite a few of those this year), Barry Zink gave an interesting presentation about using Cr as a spin detector in spin Seebeck measurements on YIG, and looking at how the antiferromagnetism of the Cr affects the measurement (see here).  In new results, they have been adding in an intervening layer of antiferromagnetic IrMn, looking at how magnons in the IrMn affect the results.
  • This was followed by a talk by Romain LeBrun all about spintronics in the GHz and THz, using hematite (Fe2O3) as an example antiferromagnetic insulator.  They see some interesting nonlinear dynamics and rectification in antiferromagnetic resonance in Fe2O3.  In NiO, they use optical excitation to drive coherent phonons, exciting a spin current, which then leads to a THz pulse when the spin current hits an inverse spin Hall detector (Pt or W).  Similar experiments in BiFeO3 show that THz generation in that multiferroic system can arise just from oscillating the ferroelectric polarization.
  • Andrew Dane from IBM gave a great presentation to a more-than-packed room about their recent studies of two-level systems in qubits.  From waaaaay in the back, I learned about their use of nearby suspended electrodes to apply electric fields to try to shift the energies of some of the TLS (the ones with electric dipole moments and at the surface of the devices).  TLS drastically suppress the coherence of superconducting qubits, and understanding their origins and ways to work around them (to characterize fab processes, for example) is very important.  As I said in that post linked above, once again we see that TLS are everywhere, and they are evil.  I need to think about whether there's anything I could contribute on this.  The real highlight of the talk was the use of "percussive maintenance" (banging the side of the cryostat) to alter the not-field-tunable TLS distribution via some unknown mechanism.
  • Bonking the experiment was taken to a new level in this talk about mechanoluminescence, which involved shooting the sample with an airsoft pellet gun under controlled conditions.
There were other talks as well - some fun stuff.  I also want to give a shoutout to the free-to-play vintage arcade games in the exhibit hall.  Galaga and the stand-up vector graphics Star Wars game were great consumers of my time and my quarters back in the day.

March 17, 2026

Doug NatelsonAPS March Meeting 2026, Day 1

I hit a pretty random assortment of talks on my first day at the APS Global Physics Summit, after catching a very early flight to get to Denver.  Here are a few highlights:

  • My colleague Hanyu Zhu gave a nice talk about the coupling between chiral phonons (vibrational excitations of atomic motion that carry net orbital angular momentum) and electronic spins.  For example, chiral ionic motion can effectively generate enormous local magnetic fields (see here).
  • I went to part of a session about magnons (quantized spin waves) and their connection to quantum information.  There was a theory talk by Silvia Viola Kusminskiy about cavity manipulation of magnons, and there was an experimental talk by Mathias Weiler about using surface acoustic waves plus magnetoelastic coupling to set up all sorts of interesting nonreciprocal magnetoacoustic devices.
  • My former postdoc Longji Cui gave a talk about molecular phononics - measuring thermal transport (by phonons) down to the single molecule level.  For a nice review of the overall topic, see here.  He then discussed extending this to measurements of polymers.
  • There was a session about strange metals and the cuprates, which included a talk by Dragana Popovic about how there is evidence for persistent vortex-liquid-like phase fluctuations in these materials even into the normal state.  This was followed by Nigel Hussey showing systematic studies of the magnetoresistance in both electron-doped and hole-doped cuprates.  The upshot is that in the electron-doped materials, there is a clear anisotropic inelastic scattering rate (from spin fluctuations) that scales with the superconducting transition, implying that spin fluctuations are the "glue".  In contrast, the hole-doped system has different systematics, implying that perhaps the strange metal fraction of material is what leads to superconductivity.
  • For maybe the second time in my long attendance at the meeting, I attended the APS prize session, where they present the certificates associated with the various honors.  It was very nice.
Now I just need to get some sleep and figure out what to see tomorrow....

March 15, 2026

Scott Aaronson On Montgomery County public magnet schools: a guest post by Daniel Gottesman


Scott’s foreword: I’ve known fellow quantum computing theorist Daniel Gottesman, now at the University of Maryland, for a quarter-century at this point. Daniel has been a friend, colleague, coauthor, and one of the people from whom I’ve learned the most in my career. Today he writes about a topic close to my heart, and one to which I’ve regularly lent this blog over the decades: namely, the struggle to protect enrichment and acceleration in the United States (in this case, the public magnet programs in Montgomery County, Maryland) from the constant attempts to weaken or dismantle them. Thanks so much to Daniel for doing this, and please help out if you can!


Without further ado, Daniel Gottesman:

Scott has kindly let me write this guest post because I’d like to ask the readers of Shtetl-Optimized for help.  I live in Montgomery County, Maryland, and the county is getting ready to replace our current handful of great magnet programs with a plethora of mediocre ones.

Montgomery County has a generally quite good school system, but its gifted education programs are really inadequate at the elementary and middle school level.  Montgomery County Public Schools (MCPS) offers nothing at all for gifted children until 4th grade.  Starting in 4th grade, magnet programs are available, but there are not enough spaces for everyone who meets the minimum qualifications.  A few years ago, the elementary and middle school magnets were switched to a lottery system, meaning the highest-achieving students, who most need special programming, might or might not get in, based purely on luck of the draw.

The remaining bright spot has been the high school magnets.  Montgomery County has two well-known and high-performing magnets, a STEM magnet at Montgomery Blair high school and an International Baccalaureate (IB) program at Richard Montgomery.  The Richard Montgomery IB program draws students from the whole county and the Blair Magnet draws from 2/3 of the county (with the remaining 1/3 eligible to go to another successful but less well-known magnet at Poolesville).  And these programs have so far resisted the lottery: They pick the best students from the application pool.

So with inadequate magnets in the lower grades and stellar magnets in high school, you can guess which one is up for a change.

MCPS now wants to reconfigure the high school magnet programs by splitting the county up into 6 regions.  Students will only be allowed to apply to programs in their home region.  Each region will have its own STEM magnet and its own IB program, as well as programs in the arts, medicine, and leadership.  And actually there are multiple program strands in each of these subjects, sometimes in different schools.  The whole plan is big and complicated, with close to 100 different programs around the county, more than half of them new.

The stated purpose of this plan is to expand access to these programs by admitting more students and reducing travel times to the programs.  And who could object to that?  There are definitely places in the county that are far from the current magnets and there are certainly more students that can benefit from high-quality magnets than there is currently space for.

The problem is that making high-quality magnets has not been a priority in the design process.  The last time MCPS tried adding regional magnets was about 7 years ago, when they added 3 regional IB programs while keeping Richard Montgomery available to students all over the county.  It was a failure: Test scores at the regional IB programs are far below those at Richard Montgomery (the worst-performing regional IB had only 24% getting a passing grade in even one subject in 2024, compared to 99% at Richard Montgomery) and all 3 are underenrolled.  Now MCPS has decided they can solve this problem by preventing students from going to Richard Montgomery to try to force them to go to the regional IBs.  In addition, they want to repeat the same mistakes with the STEM and other magnets.  The best programs in the county will shrink and only be accessible to a small fraction of students, leaving everyone else with new programs of likely highly-varying quality.

And if that were not enough, they want to do this revamp on a ridiculously short timeline.  The new programs are supposed to start in the 2027-8 school year, and between now and then, they need to recruit and train teachers for these 100 programs, create all the curricula for the first year of the programs (they are only planning to do one year at a time), and much much more.  The probability of a train wreck in the early years of the new system seems high.

Equity is certainly a concern driving this change.  And let me be clear: I am totally in favor of improving equity in the school system.  But I agree with Scott on this point: strong magnet programs in the public schools are pro-equity and weakening magnet programs is anti-equity.  Magnet programs are pro-equity even if the magnets are disproportionally populated by more affluent students, which is admittedly the case in MCPS: Affluent students will always have access to enrichment outside school and to private schools for the most affluent, whereas the public magnet programs are the only source of enrichment for those without those resources.

If MCPS really wants to address the difference in achievement between richer and poorer students, the way to do that is to create gifted programming starting from kindergarten.  If you wait until high school, it is unreasonable to expect even brilliant students to catch up to their also highly-capable peers who have been doing math and science camps and extracurriculars and contests and whatnot since they were little.  Some can manage it, but it is certainly not easy.  Unfortunately, MCPS’s notion of equity seems more focused on optimizing the demographic breakdown of magnet programs, which is most easily achieved by techniques which don’t improve — and usually degrade — the quality of the education provided.

So how can you help?  The Board of Education (BOE) is supposed to vote on this plan on Mar. 26.  Those of us opposed to it are hoping to sway enough members to vote to tell MCPS to investigate alternatives.  For instance, I have proposed a model with only 3 regions, which could also substantially improve access while preserving the strong existing magnets.

If you live in Montgomery County, write to BOE members telling them you oppose this change.  You can also sign a petition — there are many, but my favorite is here.

If you are an alumnus of one of the MCPS magnets, write to the BOE telling them how your education there was valuable to you and how a smaller program would not have served you as well.

If you are unconnected to Montgomery County, you can still spread the word.  If the BOE gets enough press inquiries asking about the many things that don’t add up in the MCPS proposal, perhaps they will recognize that this is a bad idea.

If you are really really interested in this topic and want to learn more: Last fall, I put together a long analysis of some of the flaws in MCPS’s plan and their claims, and of the alternative 3-region model.  You can find it here.

March 14, 2026

Terence TaoMathematics Distillation Challenge – Equational Theories

Mathematical research traditionally involves a small number of professional mathematicians working closely on difficult problems. However, I have long believed that there is a complementary way to do mathematics, in which one works with a broad community of mathematically minded people on problems which may not be as deep as the problems one traditionally works on, but still are of mathematical interest; and that modern technologies, including AI, are more suitable for contributing to the latter type of workflow. The “Polymath projects” were one example of this broad type of collaboration, where internet platforms such as blogs and wikis were used to facilitate such collaboration. Some years later, collaborative formalization projects (such as the one to formalize the Polynomial Freiman–Ruzsa conjecture of Marton, discussed previously on this blog here) became popular in some circles. And in 2024, I launched the Equational Theories Project (ETP) (discussed on this blog here and here), combining the rigor of Lean formalization with “good old fashioned AI” (in the form of automated theorem provers) to settle (with formal verification) over 22 million true-false problems in universal algebra.

Continuing in this spirit, Damek Davis and I are launching a new project, in the form of an experimental competitive challenge hosted by the SAIR Foundation (where I serve as a board member, and which is supplying technical support and compute). The idea of this challenge, motivated in part by this recent paper of Honda, Murakami, and Zhang, is to measure the extent to which the 22 million universal algebra true-false results obtained by the ETP can be “distilled” into a short, human-readable “cheat sheet”, similar to how a student in an undergraduate math class might distill the knowledge learned from that class into a single sheet of paper that the student is permitted to bring into an exam.

Here is a typical problem in universal algebra that the ETP was able to answer:

Problem 1 Suppose that {*: M \times M \rightarrow M} is a binary operation such that {x * (y * z) = (z * w) * w} for all {x,y,z,w}. Is it true that {x * (y * x) = (x * y) * z} for all {x,y,z}?

Such a problem can be settled either by algebraically manipulating the initial equation to deduce the target equation, or by finding a counterexample to the target equation that still satisfies the initial equation. There are a variety of techniques to achieve either of these, but this sort of problem is difficult, and even undecidable in some cases; see this paper of the ETP collaborators for more discussion. Nevertheless, many of these problems can be settled with some effort by humans, by automated theorem provers, or by frontier AI systems; here for instance is an AI-generated solution to the above problem.
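If you want to get a feel for the "find a counterexample" route, here is a small brute-force search of my own (not from the ETP) over all binary operations on a 3-element set; it checks the hypothesis and conclusion of Problem 1 directly, though of course a counterexample, if one exists, may only appear at larger sizes.

from itertools import product

N = 3  # carrier set {0, ..., N-1}

def satisfies_hypothesis(op):
    # x * (y * z) == (z * w) * w for all x, y, z, w
    return all(op[x][op[y][z]] == op[op[z][w]][w]
               for x, y, z, w in product(range(N), repeat=4))

def violates_conclusion(op):
    # some x, y, z with x * (y * x) != (x * y) * z
    return any(op[x][op[y][x]] != op[op[x][y]][z]
               for x, y, z in product(range(N), repeat=3))

# Enumerate all N**(N*N) multiplication tables (19683 for N = 3).
found = False
for flat in product(range(N), repeat=N * N):
    op = [list(flat[i * N:(i + 1) * N]) for i in range(N)]
    if satisfies_hypothesis(op) and violates_conclusion(op):
        print("Counterexample multiplication table:", op)
        found = True
        break
if not found:
    print("No counterexample among magmas of size", N)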

However, these AI models are expensive, and do not reveal much insight as to where their answers come from. If one instead tries a smaller and cheaper model, such as one of the many open-source models available, it turns out that these models basically perform no better than random chance, in that when asked to say whether the answer to a question such as the above is true or false, they only answer correctly about 50% of the time.

But, similarly to how a student struggling with the material for a math class can perform better on an exam when provided the right guidance, it turns out that such cheap models can perform at least modestly better on this task (with success rates increasing to about 55%-60%) if given the right prompt or “cheat sheet”.

“Stage 1” of the distillation challenge, which we launched today, asks contestants to design a cheat sheet (of at most 10 kilobytes in size) that can increase the performance of these models on the above true-false problems to as high a level as possible. We have provided a “playground” with which to test one’s cheat sheet (or a small number of example cheat sheets) on some cheap models against a public set of 1200 problems (1000 of which were randomly selected, and rather easy, together with 200 “hard” problems that were selected to resist the more obvious strategies for resolving these questions); a brief video explaining how to use the playground can be found here.

The submission stage will end on April 20, after which we will evaluate the submissions against a private subset of test questions. The top 1000 submissions will advance to a second stage, which we are currently in the process of designing; it will involve more advanced models, but also the more difficult task of providing not just a true-false answer, but also a proof or counterexample to the problem.

The competition will be coordinated on this Zulip channel, where I hope there will be a lively and informative discussion.

My hope is that the winning submissions will capture the most productive techniques for solving these problems, and/or provide general problem-solving techniques that would also be applicable to other types of mathematical problems. We started with the Equational Theories Project data set for this pilot competition due to its availability and spectrum of difficulty levels, but if this type of distillation process leads to interesting results, one could certainly run it on many other classes of mathematical problems to get some empirical data on how readily they can be solved, particularly after we learn from this pilot competition how to encourage participation and sharing of best practices.

SAIR will also launch some other mathematical challenges in the coming months that will be of a more cooperative nature than this particular competitive challenge; stay tuned for further announcements.

John PreskillThe FeMo-cofactor and classical and quantum computing

Recently, my coworkers and I put out a preprint “Classical solution of the FeMo-cofactor model to chemical accuracy and its implications’’ (Zhai et al. 2026). It is a bit unusual to write commentary on one’s own scientific article. However, in this case, given the many inquiries I have had about the work in the context of quantum computing, many of which have contained similar questions (and often similar misunderstandings), I thought it would be useful to provide some perspective that we could not provide in the original preprint, in an informal manner.

What is FeMo-co?

I will start with some background on the FeMo-cofactor (FeMo-co). This cofactor is the reaction center of nitrogenase, an enzyme found in certain soil-dwelling bacteria. Nitrogenase’s claim to fame is that it converts atmospheric dinitrogen, which is held together by a strong N-N triple bond, into a reduced form (ammonia) which can then be taken up by plants and thereby be passed onto the rest of the living biomass. In terms of incorporating nitrogen into biomass, nitrogenase is believed to be responsible for about 2/3 of biological nitrogen, with the remainder coming from fertilizers. Because it plays this critical role, it is sometimes referred to as the enzyme that feeds the planet.

The chemistry of how dinitrogen is reduced at the FeMo-cofactor is still largely unknown. The basic stoichiometry of the reaction is often written as

\mathrm{N}_2 + 8\mathrm{H}^{+} + 8e^- + 16\,\mathrm{MgATP} \to 2\,\mathrm{NH}_3 + \mathrm{H}_2 + 16\,\mathrm{MgADP} + 16\,\mathrm{P}_i

but this is just a sketch of the process. In particular, the above equation contains, nominally, a large number of molecular reactants, and clearly they do not all just come together in a bang! The role of the cofactor, and the enzyme more generally, is to coordinate the protons, electrons, biological energy source (ATP), and the dinitrogen molecule, into a sequence of well-defined steps, known as the reaction mechanism. Since the work of Lowe and Thorneley (Thorneley and Lowe 1984), the most common proposal for the nitrogenase reaction mechanism contains 8 intermediate steps (corresponding roughly to 8 sequential proton and electron additions). However, due to the difficulty in isolating the intermediate states of FeMo-co, as well as challenges in using experimental probes to deduce what these states are, the Lowe-Thorneley cycle still remains an unproven hypothesis. Biochemists, spectroscopists, as well as a few theoretical quantum chemists, are today actively engaged in observing, computing, deducing (and arguing about) the nitrogenase mechanism (Jiang and Ryde 2023; Lancaster et al. 2011; Einsle and Rees 2020; Badding et al. 2023; Thorhallsson et al. 2019).

So how did nitrogenase become so widely discussed in the setting of quantum computing? In 2016, an article “Elucidating reaction mechanisms on quantum computers’’, that has since become one of the most cited papers in the nitrogenase field, arguably started this all (Reiher et al. 2017). The article included a number of proposals, including (1) that the ‘promise of exponential speedups for the electronic structure problem’ could be applied to elucidate the nitrogenase reaction mechanism that had so far proved intractable for classical computation, and (2) that solving this problem would be an example of how quantum simulation could be ‘scientifically and economically impactful’. (Similar proposals can also be found repeated in less technical language and settings, see e.g. ‘Why do we want a quantum computer’). An important technical contribution of the article was to provide a detailed quantum resource estimate for a simulation of chemistry. The problem statement was to compute the ground-state energy of a specific ‘54 orbital’ (108 qubit) model of FeMo-co, to an accuracy of 1 kcal/mol, referred to as chemical accuracy. It is important to note the word ‘model’ in the problem statement. Electrons move in continuous space, and thus quantum chemical Hamiltonians are formulated in the continuum, while quantum computation requires discretization of this space. This discretization, in terms of a so-called active space set of orbitals that the electrons can variously occupy, constitutes the model. We will return to the definition of the model below. By compiling a Trotter-Suzuki implementation of the quantum phase estimation algorithm within a fault-tolerant resource model for their specific FeMo-co model Hamiltonian, Ref. (Reiher et al. 2017) provided a T-gate resource estimate. Combined with some assumptions about the quantum architecture, this provided perhaps the first concrete time-cost to solve an interesting chemistry problem on a quantum computer. This work has since served as an inspiration for many subsequent quantitative resource estimation efforts in the quantum computing for chemistry field.

Exponential speedup and societal impact

Before proceeding further in this story, it is worth examining the two key propositions made in Ref. (Reiher et al. 2017). I start with the question of exponential speedup. Quantum algorithms for the ground-state energy, such as quantum phase estimation, essentially perform a projective measurement of the energy (encoded in a phase). Thus, it is essential to prepare a good initial state, i.e. with large overlap with the desired eigenstate, to measure the correct energy. This, however, is a strong constraint, if we are seeking asymptotically large quantum advantage. For example, if such an initial state is first determined classically, as is often suggested, then exponential quantum advantage in a given problem requires that finding good classical guesses is easy, while improving them classically to fully solve the problem becomes exponentially hard as the problem size increases. Unfortunately, convincing evidence that chemically relevant electronic structure problems, including the problem of cofactor electronic structure exemplified by FeMo-co, fall into this category has not yet been found, as discussed in detail in Refs. (Lee et al. 2023; Chan 2024).

The second proposition, that elucidating the reaction mechanism of nitrogenase will lead to a transformative societal impact, is similarly nuanced. The claim originates in the observation that the competing industrial process for fertilizer production via nitrogen reduction, namely, the Haber-Bosch process, takes place at high temperatures and pressures and consumes a significant percentage of the world’s energy. Bacteria, on the other hand, can do this process at room temperature.

While it is true that the nitrogenase enzyme functions at ambient temperature and pressure, it is simply false that it consumes much less energy. This is because the large amount of energy required for nitrogen fixation mainly originates from thermodynamics, i.e. one needs energy to break the strong nitrogen triple bond. In fact, taking into account the physiological conditions and the ATP cost, bacteria arguably expend more energy to produce ammonia (Chan 2024) than a modern efficient industrial implementation of the Haber-Bosch process. Thus the real hope behind trying to understand the nitrogenase mechanism in the context of societal impact is that we may one day engineer a variant of it with more desirable properties, e.g. with higher turnover, or with a lower carbon footprint, or which is more selective for nitrogen reduction. Whether this is actually possible remains to be seen, and certainly requires much more than knowing the ground-state of FeMo-co, or even the full reaction mechanism.

Which FeMo-cofactor model?

I now return to the question of FeMo-cofactor models. Ref. (Reiher et al. 2017) introduced a particular cofactor model, which I will refer to as RWST, following the names of the authors. As we soon found out, simulating the ground-state of the RWST model was actually very easy classically, and in fact (as reported in (Li et al. 2019)) could be done using standard quantum chemistry methods with a few hours of calculation on a laptop. This was because although the RWST model was a 108 qubit model, and (in the worst case) a 108 qubit ground-state cannot be stored classically, the RWST model Hamiltonian was constructed in such a way as not to capture any of the difficult features of the FeMo-cofactor ground-state. This highlights the importance of not assuming worst case complexity about physical problems!

What makes the electronic structure of the FeMo-cofactor (relatively) complicated is the presence of many ‘unpaired’ electrons. In simple molecules, we can describe the ground-state as one where all the electrons sit in pairs in orbitals. Since an orbital can only carry a pair of electrons at a time, the ground-state is simply described by filling the lowest energy orbitals with pairs. However, in molecules with transition metals, there are typically ‘unpaired’ electrons (so-called open-shells), and then we need to consider whether and how they pair up, which orbitals are singly versus doubly occupied, and so on. The RWST model ground-state had no unpaired electrons! It was therefore unrepresentatively easy to solve for the ground state classically.

Because of the problems with the RWST model, my group formulated a more suitable 76 orbital/152 qubit model of FeMo-co in Ref. (Li et al. 2019), which I will refer to as the LLDUC model, again by the names of the authors. Although the LLDUC model is still a significant truncation of the true electronic structure of FeMo-co, we verified that it contains the correct open-shell character of the cofactor, and thus has a ‘representative’ complexity in its ground-state. Since we published the LLDUC model, it has become the most common benchmark model of FeMo-co used in quantum resource estimates for new quantum chemistry ground-state algorithms (Wan et al. 2022; Berry et al. 2019; Luo and Cirac 2025; Low et al. 2025).

Heuristics in the classical solution of the LLDUC FeMo-cofactor model

This brings me now to the recent work in Ref. (Zhai et al. 2026), where, through a sequence of classical calculations, we could produce a classical estimate of the ground-state energy of the LLDUC model to chemical accuracy. How was this achieved?

Classical electronic structure methods (aside from exact diagonalization) are heuristic algorithms. Much like quantum algorithms, they implicitly or explicitly start from an initial state. In chemical applications, this can be viewed as a product state or set of product states: for tensor network algorithms, such as the density matrix renormalization group (when not considering topological order), this is the set of D=1 states (specified by the underlying basis) which connect to the space of slightly entangled states with D>1 that the algorithm naturally explores. In coupled cluster methods, this is the initial reference state to which excitations are applied. Although many classical heuristics are exact with exponential effort, e.g. by increasing the bond dimension D in a tensor network or excitation level in coupled cluster theory, in practical computational chemistry, classical heuristics are used with the assumption that so long as the initial state is chosen appropriately, they will converge rapidly to the true ground-state without exponential effort. I analyze this heuristic working assumption in Ref. (Chan 2024), where I name it the classical heuristic cost conjecture. However, finding the good classical initial state is an NP-hard problem, and this is often the crux of where the challenge in simulation actually lies.

In FeMo-co, unlike in simpler molecules, it is not at all obvious what product state to start from. To address this, in Ref. (Zhai et al. 2026), we devised an enumeration and filtering protocol. The relevant manifold arises from the orbital and spin degrees of freedom of the Fe ions: which Fe orbitals are occupied, by how many electrons, and with which spins. One technical point is that the resulting product states do not generally conserve the global S^2 spin symmetry. However, as recognized by Anderson decades ago, for magnetic order in large systems, the eigenstates can be chosen to break S^2 symmetry because of the tower of symmetry-preserving eigenstates at an energy scale of \mathrm{const} \times V^{-1} (where V is the system volume). For a finite energy resolution \Delta \gg \mathrm{const} \times V^{-1}, we can equally use a broken-symmetry description of the states, an example of the fragility of entanglement effects in physical systems.
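As a toy illustration of what ‘enumerating’ such a manifold looks like (a drastic simplification of the actual protocol, which also includes the orbital degrees of freedom), one can list collinear spin patterns on the seven Fe sites and discard patterns related by a global spin flip:

from itertools import product

N_FE = 7  # FeMo-co contains seven Fe ions (plus one Mo)

candidates = set()
for spins in product((+1, -1), repeat=N_FE):
    # A pattern and its global flip describe the same broken-symmetry state,
    # so keep only one canonical representative of each pair.
    flipped = tuple(-s for s in spins)
    candidates.add(min(spins, flipped))

print(len(candidates), "distinct collinear spin patterns")  # 2**7 / 2 = 64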

Because applying the highest level of classical approximation to all enumerated product states was far too expensive, we used a filtering funnel, in which product states were ranked at different levels of theory, with promising candidates passed on to higher levels of classical computation. In the end, the final and most accurate calculations were performed on only three candidates, which we deduced to be essentially degenerate with one another to within chemical accuracy.
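Schematically, the funnel looks like the following sketch (my own, with made-up energy estimators standing in for the actual levels of theory): each level keeps only the best-ranked candidates for the next, more expensive, level.

import random

def funnel(candidates, levels):
    """levels: list of (scoring_function, how_many_to_keep), cheapest first."""
    pool = list(candidates)
    for score, keep in levels:
        pool = sorted(pool, key=score)[:keep]  # lower score = lower energy
    return pool

# Stand-ins for energy estimates at increasing levels of theory.
random.seed(0)
true_energy = {c: random.random() for c in range(128)}
cheap  = lambda c: true_energy[c] + random.gauss(0, 0.3)  # fast, noisy
better = lambda c: true_energy[c] + random.gauss(0, 0.1)  # slower, tighter
best   = lambda c: true_energy[c]                         # most expensive

finalists = funnel(true_energy, [(cheap, 30), (better, 10), (best, 3)])
print(finalists)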

There are other important technical details in Ref. (Zhai et al. 2026) which I have not mentioned: the use of unrestricted orbitals, the systematic extrapolations to obtain the final energies and estimated errors, and the benchmarking required to be confident about the protocol. However, recognizing that the FeMo-co ground-state problem could be reduced essentially to a ranking problem was the essence of what made the estimate possible.

Implications of the classical solution for chemistry

From a chemical and biochemical perspective, computing the ground-state energy of a model to some specified accuracy – even chemical accuracy – is a highly artificial target. Most chemical calculations that have an impact on our understanding never achieve or even target chemical accuracy in the total energy. In addition, chemistry does not depend on the total energy, but on the relative energy of different chemical configurations, which typically differ only by O(1) changes in the bonding and chemistry.

The main take-home from our work then is that there is nothing especially mysterious about FeMo-co’s electronic structure. The story of the FeMo-co ground-state is not one of multiconfigurational electronic structure (i.e. where the states are not at all close to product states), but one of multiple configurations (i.e. many competing product states). Indeed, this is basically how nitrogenase chemists have long reasoned about the electronic structure of iron-sulfur clusters and FeMo-co (Lovell et al. 2001; Yamaguchi et al. 1990). Our work thus now provides extensive and rigorous numerical support for this picture.

Because of this simplicity, the full richness of classical quantum chemistry methods can now be brought to bear on FeMo-co electronic structure beyond the LLDUC model. Assuming the model already captures the qualitative complexity of the cluster’s electronic structure, we expect such investigations to provide quantitative corrections to the picture we have obtained. We took some initial steps to confirm this in our manuscript, considering larger orbital spaces, the effect of protein fluctuations, and the interpretation of certain spectroscopies. In the future, connecting these simulations to more spectroscopic measurements will be an exciting possibility. In addition, now that the electronic structure is on a conceptually sound footing, we have a foundation to support the central question of resolving the reaction mechanism. This opens up a whole new set of scientific challenges associated with observing reactions on extremely slow timescales.

Implications of the classical solution for quantum computing in chemistry

Because of the success of classical heuristic methods for this problem, one may naturally wonder what these results mean for the application of quantum computers in chemistry. Here I address some commonly asked questions.

Is the classical simulation of the LLDUC model a ‘last hurrah’ for classical methods?

I have seen the analogy drawn between the FeMo-co result and the classical tensor network simulations for random circuit sampling experiments. In that case, while the famous Google Sycamore experiment (Arute et al. 2019) could be replicated by classical tensor network simulations (Gray and Kourtis 2021), subsequent improvements in quantum processors soon led to random circuit sampling experiments outpacing the capabilities of classical simulations.

However, the situation here is quite different. There is strong evidence that generating samples from a random quantum circuit (without noise) is actually exponentially hard to do using a classical algorithm, and indeed, the classical simulations used for the task were (mostly) brute force simulations with exponential cost in circuit size. In contrast, the theoretical support for exponential quantum advantage in the FeMo-co problem is much weaker, and as an empirical fact, most of the methods used in the FeMo-co simulation (namely the coupled cluster methods for a given excitation level) are polynomial cost algorithms. Since a similar simulation strategy has also been successfully applied across the series of 2, 4, and 8 metal iron-sulfur complexes (Sharma et al. 2014; Li et al. 2019; Zhai et al. 2023, 2026), we have no reason to expect a radically different situation if we consider larger analogous complexes in this series.

And in any case, chemistry does not provide an endless scaling of problem size; FeMo-co is the largest enzyme cofactor in terms of the number of transition metals. Materials simulations provide a setting to scale the problem size, but one still faces the question as to whether the relevant states observed are truly that complicated classically. For example, classical simulations of the ground-state orders of the 2D Hubbard model currently show no exponential increase in difficulty when going to larger system sizes (Chen et al. 2025; Liu et al. 2025).

Has the availability of a classical strategy for FeMo-co changed your enthusiasm for quantum computers in chemistry?

Again, my answer is no. There is an entire community of nitrogenase scientists: experimental spectroscopists, synthetic chemists, and of course computational chemists, who are working to map out the reaction mechanism, none of whose research is predicated on using a quantum computer. Personally, I have never thought that to understand nitrogenase we would first have to build a quantum computer, otherwise I would not work on the problem!

At the same time, any computational tool brings new capabilities that will be useful. Quantum algorithms come with theoretical guarantees; for example, so long as the initial state is well prepared, we know the error in the energy that we measure from a quantum algorithm, which is more reliable than the classical estimates of error we obtain from extrapolations. Similarly, initial state preparation for a quantum computer, even for classically tractable problems, is probably easier than solving the entire problem classically, since only a ‘rough’ guess is needed. And finally, a polynomial or even constant factor speedup is exciting, so long as the speedup is large enough!

Thus, I am in fact excited to see quantum computers applied to this problem; I am just not waiting for them to be built first.

How should one think about past work on quantum algorithms that has used FeMo-co as a target?

FeMo-co was amongst the earliest examples of a chemical problem for which a case for quantum advantage was made. For this reason, it is overrepresented in the literature of quantum computing for chemistry. Should fully fault-tolerant quantum computers be available, they will naturally be applied to a wider set of systems (Chan 2024; Babbush et al. 2025).

Also, one must recognize that the availability of a single concrete optimization target has led to undeniable advances in quantum algorithms for quantum chemistry. In most cases, prior work to improve quantum resource estimates for FeMo-co involves techniques that apply to other systems as well. Thus, there’s no need to throw away those papers!

What are some lessons and conclusions to draw?

The first is obviously that just because something has not been solved, or appears hard to solve classically, does not mean it is the best problem to choose for a quantum computer. The classical solution strategy for FeMo-co essentially involved a complicated classical state preparation problem, which is a shared challenge with ground-state estimation algorithms on quantum computers, and thus it was perhaps not an optimal choice of problem.

My second main conclusion is that since classical solutions to complex problems are possible because they use some understanding of the problem, for quantum algorithms to have maximum impact, they should use the same knowledge. In fact, most chemistry is not about truly mysterious quantum systems, but more about ordinary quantum matter where we know roughly what is going on, but where detailed simulations are still required. If quantum computing algorithms can target this ‘mundane’ regime, they will have maximum impact on chemistry as it is practiced today. In recent work, we have taken some steps in this direction by proposing quantum algorithms for electronic structure that work within the same heuristic framework as most current quantum chemistry methods (Chen and Chan 2025).

Finally, I wish to emphasize that, from the perspective of understanding nitrogenase, and maximising societal impact, the choice of computational algorithm and hardware to solve the problem is irrelevant. The fact that FeMo-co electronic structure is not so mysterious is an enormously positive thing, as it means that making progress on the larger problem of the mechanism using computation no longer seems so impossible. I have seen some of the brightest minds in the world helping to advance quantum algorithms for this problem. If any of this brainpower can be devoted to the chemical question itself, I believe we can be very optimistic about the future solution of the nitrogenase problem.

References

Arute, Frank, Kunal Arya, Ryan Babbush, et al. 2019. “Quantum Supremacy Using a Programmable Superconducting Processor.” Nature 574 (7779): 505–10.
Babbush, Ryan, Robbie King, Sergio Boixo, et al. 2025. “The Grand Challenge of Quantum Applications.” arXiv Preprint arXiv:2511.09124.
Badding, Edward D, Suppachai Srisantitham, Dmitriy A Lukoyanov, Brian M Hoffman, and Daniel LM Suess. 2023. “Connecting the Geometric and Electronic Structures of the Nitrogenase Iron–Molybdenum Cofactor Through Site-Selective 57Fe Labelling.” Nature Chemistry 15 (5): 658–65.
Berry, Dominic W, Craig Gidney, Mario Motta, Jarrod R McClean, and Ryan Babbush. 2019. “Qubitization of Arbitrary Basis Quantum Chemistry Leveraging Sparsity and Low Rank Factorization.” Quantum 3: 208.
Chan, Garnet Kin-Lic. 2024. “Spiers Memorial Lecture: Quantum Chemistry, Classical Heuristics, and Quantum Advantage.” Faraday Discussions 254: 11–52.
Chen, Ao, Zhou-Quan Wan, Anirvan Sengupta, Antoine Georges, and Christopher Roth. 2025. “Neural Network-Augmented Pfaffian Wave-Functions for Scalable Simulations of Interacting Fermions.” arXiv Preprint arXiv:2507.10705.
Chen, Jielun, and Garnet Kin Chan. 2025. “A Framework for Robust Quantum Speedups in Practical Correlated Electronic Structure and Dynamics.” arXiv Preprint arXiv:2508.15765.
Einsle, Oliver, and Douglas C Rees. 2020. “Structural Enzymology of Nitrogenase Enzymes.” Chemical Reviews 120 (12): 4969–5004.
Gray, Johnnie, and Stefanos Kourtis. 2021. “Hyper-Optimized Tensor Network Contraction.” Quantum 5: 410.
Jiang, Hao, and Ulf Ryde. 2023. “N2 Binding to the E0–E4 States of Nitrogenase.” Dalton Transactions 52 (26): 9104–20.
Lancaster, Kyle M, Michael Roemelt, Patrick Ettenhuber, et al. 2011. “X-Ray Emission Spectroscopy Evidences a Central Carbon in the Nitrogenase Iron-Molybdenum Cofactor.” Science 334 (6058): 974–77.
Lee, Seunghoon, Joonho Lee, Huanchen Zhai, et al. 2023. “Evaluating the Evidence for Exponential Quantum Advantage in Ground-State Quantum Chemistry.” Nature Communications 14 (1): 1952.
Li, Zhendong, Sheng Guo, Qiming Sun, and Garnet Kin-Lic Chan. 2019. “Electronic Landscape of the P-Cluster of Nitrogenase as Revealed Through Many-Electron Quantum Wavefunction Simulations.” Nature Chemistry 11 (11): 1026–33.
Liu, Wen-Yuan, Huanchen Zhai, Ruojing Peng, Zheng-Cheng Gu, and Garnet Kin-Lic Chan. 2025. “Accurate Simulation of the Hubbard Model with Finite Fermionic Projected Entangled Pair States.” Physical Review Letters 134 (25): 256502.
Lovell, Timothy, Jian Li, Tiqing Liu, David A Case, and Louis Noodleman. 2001. “FeMo Cofactor of Nitrogenase: A Density Functional Study of States MN, MOX, MR, and MI.” Journal of the American Chemical Society 123 (49): 12392–410.
Low, Guang Hao, Robbie King, Dominic W Berry, et al. 2025. “Fast Quantum Simulation of Electronic Structure by Spectrum Amplification.” arXiv Preprint arXiv:2502.15882.
Luo, Maxine, and J Ignacio Cirac. 2025. “Efficient Simulation of Quantum Chemistry Problems in an Enlarged Basis Set.” PRX Quantum 6 (1): 010355.
Reiher, Markus, Nathan Wiebe, Krysta M Svore, Dave Wecker, and Matthias Troyer. 2017. “Elucidating Reaction Mechanisms on Quantum Computers.” Proceedings of the National Academy of Sciences 114 (29): 7555–60.
Sharma, Sandeep, Kantharuban Sivalingam, Frank Neese, and Garnet Kin-Lic Chan. 2014. “Low-Energy Spectrum of Iron–Sulfur Clusters Directly from Many-Particle Quantum Mechanics.” Nature Chemistry 6 (10): 927–33.
Thorhallsson, Albert Th, Bardi Benediktsson, and Ragnar Bjornsson. 2019. “A Model for Dinitrogen Binding in the E4 State of Nitrogenase.” Chemical Science 10 (48): 11110–24.
Thorneley, Roger NF, and DJ Lowe. 1984. “The Mechanism of Klebsiella Pneumoniae Nitrogenase Action. Pre-Steady-State Kinetics of an Enzyme-Bound Intermediate in N2 Reduction and of NH3 Formation.” Biochemical Journal 224 (3): 887–94.
Wan, Kianna, Mario Berta, and Earl T Campbell. 2022. “Randomized Quantum Algorithm for Statistical Phase Estimation.” Physical Review Letters 129 (3): 030503.
Yamaguchi, Kizashi, Takayuki Fueno, Masa-aki Ozaki, Norikazu Ueyama, and Akira Nakamura. 1990. “A General Spin-Orbital (GSO) Description of Antiferromagnetic Spin Couplings Between Four Irons in Iron-Sulfur Clusters.” Chemical Physics Letters 168 (1): 56–62.
Zhai, Huanchen, Seunghoon Lee, Zhi-Hao Cui, Lili Cao, Ulf Ryde, and Garnet Kin-Lic Chan. 2023. “Multireference Protonation Energetics of a Dimeric Model of Nitrogenase Iron–Sulfur Clusters.” The Journal of Physical Chemistry A 127 (47): 9974–84.
Zhai, Huanchen, Chenghan Li, Xing Zhang, Zhendong Li, Seunghoon Lee, and Garnet Kin-Lic Chan. 2026. “Classical Solution of the FeMo-Cofactor Model to Chemical Accuracy and Its Implications.” arXiv Preprint arXiv:2601.04621.

March 13, 2026

Matt von HippelAbout the OpenAI Amplitudes Paper, but Not as Much as You’d Like

I’ve had a bit more time to dig into the paper I mentioned last week, where OpenAI collaborated with amplitudes researchers, using one of their internal models to find and prove a simplified version of a particle physics formula. I figured I’d say a bit about my own impressions from reading the paper and OpenAI’s press release.

This won’t be a real “deep dive”, though it will be long nonetheless. As it turns out, most of the questions I’d like answers to aren’t answered in the paper or the press release. Getting them will involve actual journalistic work, i.e. blocking off time to interview people, and I haven’t done that yet. What I can do is talk about what I know so far, and what I’m still wondering.

Context:

Scattering amplitudes are formulas used by particle physicists to make predictions. For a while, people would just calculate these when they needed them, writing down pages of mess that you could plug numbers into to get answers. However, forty years ago two physicists decided they wanted more, writing “we hope to obtain a simplified form for the answer, making our result not only an experimentalist’s, but a theorist’s delight.”

In their next paper, they managed to find that “theorist’s delight”: a simplified, intuitive-looking answer that worked for calculations involving any number of particles, summarizing many different calculations. Ten years later, a few people had started building on it, and ten years after that, the big shots started paying attention. A whole subfield, “amplitudeology”, grew from that seed, finding new forms of “theorist’s delight” in scattering amplitudes.

Each subfield has its own kind of “theory of victory”, its own concept for what kind of research is most likely to yield progress. In amplitudes, it’s these kinds of simplifications. When they work out well, they yield new, more efficient calculation techniques, yielding new messy results which can be simplified once more. To one extent or another, most of the field is chasing after those situations when simplification works out well.

That motivation shapes both the most ambitious projects of senior researchers, and the smallest student projects. Students often spend enormous amounts of time looking for a nice formula for something and figuring out how to generalize it, often on a question suggested by a senior researcher. These projects mostly serve as training, but occasionally manage to uncover something more impressive and useful, an idea others can build around.

I’m mentioning all of this, because as far as I can tell, what ChatGPT and the OpenAI internal model contributed here roughly lines up with the roles students have on amplitudes papers. In fact, it’s not that different from the role one of the authors, Alfredo Guevara, had when I helped mentor him during his Master’s.

Senior researchers noticed something unusual, suggested by prior literature. They decided to work out the implications, did some calculations, and got some messy results. It wasn’t immediately clear how to clean up the results, or generalize them. So they waited, and eventually were contacted by someone eager for a research project, who did the work to get the results into a nice, general form. Then everyone published together on a shared paper.

How impressed should you be?

I said, “as far as I can tell” above. What’s annoying is that this paper makes it hard to tell.

If you read through the paper, they mention AI briefly in the introduction, saying they used GPT-5.2 Pro to conjecture formula (39) in the paper, and an OpenAI internal model to prove it. The press release actually goes into more detail, saying that the humans found formulas (29)-(32), and GPT-5.2 Pro found a special case where it could simplify them to formulas (35)-(38), before conjecturing (39). You can get even more detail from an X thread by one of the authors, OpenAI Research Scientist Alex Lupsasca. Alex had done his PhD with another one of the authors, Andrew Strominger, and was excited to apply the tools he was developing at OpenAI to his old research field. So they looked for a problem, and tried out the one that ended up in the paper.

What is missing, from the paper, press release, and X thread, is any real detail about how the AI tools were used. We don’t have the prompts, or the output, or any real way to assess how much input came from humans and how much from the AI.

(We have more for their follow-up paper, where Lupsasca posted a transcript of the chat.)

Contra some commentators, I don’t think the authors are being intentionally vague here. They’re following business as usual. In a theoretical physics paper, you don’t list who did what, or take detailed account of how you came to the results. You clean things up, and create a nice narrative. This goes double if you’re aiming for one of the most prestigious journals, which tend to have length limits.

This business-as-usual approach is ok, if frustrating, for the average physics paper. It is, however, entirely inappropriate for a paper showcasing emerging technologies. For a paper that was going to be promoted this prominently by OpenAI, the question of how they reached their conclusion is much more interesting than the results themselves. And while I wouldn’t ask them to go to the standards of an actual AI paper, with ablation analysis and all that jazz, they could at least have aimed for the level of detail of my final research paper, which gave samples of the AI input and output used in its genetic algorithm.

For the moment, then, I have to guess what input the AI had, and what it actually accomplished.

Let’s focus on the work done by the internal OpenAI model. The descriptions I’ve seen suggest that it started where GPT-5.2 Pro did, with formulas (29)-(32), but with a more specific prompt that guided what it was looking for. It then ran for 12 hours with no additional input, and both conjectured (39) and proved it was correct, providing essentially the proof that follows formula (39) in the paper.

Given that, how impressed should we be?

First, the model needs to decide to go to a specialized region, instead of trying to simplify the formula in full generality. I don’t know whether they prompted their internal model explicitly to do this. It’s not something I’d expect a student to do, because students don’t know what types of results are interesting enough to get published, so they wouldn’t be confident in computing only a limited version of a result without an advisor telling them it was ok. On the other hand, it is actually something I’d expect an LLM to be unusually likely to do, as a result of not managing to consistently stick to the original request! What I don’t know is whether the LLM proposed this for the right reason: that if you have the formula for one region, you can usually find it for other regions.

Second, the model needs to take formulas (29)-(32), write them in the specialized region, and simplify them to formulas (35)-(38). I’ve seen a few people saying you can do this pretty easily with Mathematica. That’s true, though not every senior researcher is comfortable doing that kind of thing, as you need to be a bit smarter than just using the Simplify[] command. Most of the people on this paper strike me as pen-and-paper types who wouldn’t necessarily know how to do that. It’s definitely the kind of thing I’d expect most students to figure out, perhaps after a couple of weeks of flailing around if it’s their first crack at it. The LLM likely would not have used Mathematica, but would have used SymPy, since these “AI scientist” setups usually can write and execute Python code. You shouldn’t think of this as the AI reasoning through the calculation itself, but it at least sounds like it was reasonably quick at coding it up.
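For readers unfamiliar with this kind of work, the mechanical part looks roughly like the following toy (my own, with a made-up expression nothing like the actual formulas in the paper): specialize an expression to a restricted region and let a computer algebra system tidy it up.

import sympy as sp

x, y = sp.symbols("x y", positive=True)

# A deliberately clumsy expression standing in for a "messy result".
expr = (x**2 - y**2) / (x - y) + sp.sin(x)**2 + sp.cos(x)**2

print(sp.simplify(expr))             # x + y + 1, the general simplification
print(sp.simplify(expr.subs(y, 0)))  # x + 1, specialized to the region y = 0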

Then, the model needs to conjecture formula (39). This gets highlighted in the intro, but as many have pointed out, it’s pretty easy to do. If any non-physicists are still reading at this point, take a look:

Could you guess (39) from (35)-(38)?

After that, the paper goes over the proof that formula (39) is correct. Most of this proof isn’t terribly difficult, but the way it begins is actually unusual in an interesting way. The proof uses ideas from time-ordered perturbation theory, an old-fashioned way to do particle physics calculations. Time-ordered perturbation theory isn’t something any of the authors are known for using with regularity, but it has recently seen a resurgence in another area of amplitudes research, showing up for example in papers by Matthew Schwartz, a colleague of Strominger at Harvard.

If a student of Strominger came up with an idea drawn from time-ordered perturbation theory, that would actually be pretty impressive. It would mean that, rather than just learning from their official mentor, this student was talking to other people in the department and broadening their horizons, showing a kind of initiative that theoretical physicists value a lot.

From an LLM, though, this is not impressive in the same way. The LLM was not trained by Strominger, it did not learn specifically from Strominger’s papers. Its context suggested it was working on an amplitudes paper, and it produced an idea which would be at home in an amplitudes paper, just a different one than the one it was working on.

While not impressive, that capability may be quite useful. Academic subfields can often get very specialized and siloed. A tool that suggests ideas from elsewhere in the field could help some people broaden their horizons.

Overall, it appears that the twelve-hour OpenAI internal model run reproduced roughly what an unusually bright student would be able to contribute over the course of a several-month project. Like most student projects, you could find a senior researcher who could do the project much faster, maybe even faster than the LLM. But it’s unclear whether any of the authors could have: different senior researchers have different skillsets.

A stab at implications:

If we take all this at face-value, it looks like OpenAI’s internal model was able to do a reasonably competent student project with no serious mistakes in twelve hours. If they started selling that capability, what would happen?

If it’s cheap enough, you might wonder if professors would choose to use the OpenAI model instead of hiring students. I don’t think this would happen, though: I think it misunderstands why these kinds of student projects exist in a theoretical field. Professors sometimes use students to get results they care about, but more often, the student’s interest is itself the motivation, with the professor wanting to educate someone, to empire-build, or just to take on their share of the department’s responsibilities. AI is only useful for this insofar as AI companies continue reaching out to these people to generate press releases: once this is routinely possible, the motivation goes away.

More dangerously, if it’s even cheaper, you could imagine students being tempted to use it. The whole point of a student project is to train and acculturate the student, to get them to the point where they have affection for the field and the capability to do more impressive things. You can’t skip that, but people are going to be tempted to.

And of course, there is the broader question of how much farther this technology can go. That’s the hardest to estimate here, since we don’t know the prompts used. So I don’t know if seeing this result tells us anything more about the bigger picture than we knew going in.

Remaining questions:

At the end of the day, there are a lot of things I still want to know. And if I do end up covering this professionally, they’re things I’ll ask.

  1. What was the prompt given to the internal model, and how much did it do based on that prompt?
  2. Was it really done in one shot, no retries or feedback?
  3. How much did running the internal model cost?
  4. Is this result likely to be useful? Are there things people want to calculate that this could make easier? Recursion relations it could seed? Is it useful for SCET somehow?
  5. How easy would it have been for the authors to do what the LLM did? What about other experts in the community?

March 12, 2026

John BaezA Typed Language for Agent Coordination

guest post by William Waites

Agent frameworks are popular. (These are frameworks for coordinating large language model agents, not to be confused with agent-based modelling in the simulation sense.) There are dozens of them for wrapping large language models in something called an agent and assembling groups of agents into workflows. Much of the surrounding discussion is marketing, but the underlying intuition is old: your web browser identifies itself as a user agent. What is new is the capability that generative language models bring.

The moment you have one agent, you can have more than one. That much is obvious. How to coordinate them is not. The existing frameworks (n8n, LangGraph, CrewAI, and others) are engineering solutions, largely ad hoc. Some, like LangGraph, involve real thinking about state machines and concurrency. But none draws on what we know from mathematics and computer science about typed composition, protocol specification, or structural guarantees for concurrent systems.

This matters because it is expensive. Multi-agent systems are complicated concurrent programs. Without structural guardrails, they fail in ways you discover only after spending the compute. A job can go off the rails, and the money you paid for it is wasted; the providers will happily take it regardless. At current subscription rates the cost is hidden, but a recent Forbes investigation found that a heavy user of Anthropic’s $200/month Claude Code subscription can consume up to $5,000/month measured at retail API rates. For third-party tools like Cursor, which pay close to those retail rates, these costs are real. Wasted tokens are wasted money.

To address this, we built a language called plumbing. It describes how agents connect and communicate, in such a way that the resulting graph can be checked before execution: checked for well-formedness, and within limits for deadlocks and similar properties. It is a statically typed language, and these checks are done formally. There is a compiler and a runtime for this language, working code, not a paper architecture. In a few lines of plumbing, you can describe agent systems with feedback loops, runtime parameter modulation, and convergence protocols, and be sure they are well-formed before they run. This post explains how it works.

The name has a history in computing. Engineers have always talked informally about plumbing to connect things together: bits of software, bits of network infrastructure. When I was a network engineer I sometimes described myself as a glorified plumber. The old Solaris ifconfig command took plumb as an argument, to wire a network interface into the stack. Plan 9 had a deeper version of the same idea. The cultural connection goes back decades.

This is the first of two posts. This one introduces the plumbing calculus: what it is, how it works, and a few simple examples, with motifs for adversarial review, ensemble reasoning, and synthesis. The second post will tackle something harder.

The calculus

The plumbing language is built on a symmetric monoidal category, specifically a copy-discard category with some extra structure. The terminology may be unfamiliar, but the underlying concept is not. Engineers famously like Lego. Lego bricks have studs on top and holes with flanged tubes underneath. The studs of one brick fit into the tubes of another. But Lego has more than one connection type: there are also holes through the sides of Technic bricks, and axles that fit through them, and articulated ball joints for the fancier kits. Each connection type constrains what can attach to what. This is typing.

In plumbing, the objects of the category are typed channels: streams that carry a potentially infinite sequence of values, each of a specific type (integer, string, a record type, or something more complex). We write !A to mean "a stream of As", so !string is a stream of strings and !int is a stream of integers. The morphisms, which describe how you connect channels together, are processes. A process has typed inputs and typed outputs.

There are four structural morphisms. Copy takes a stream and duplicates it: the same values appear on two output streams. Discard throws values away, perhaps the simplest thing you can do with a stream, and often needed. These two, together with the typed channels and the laws of the category, give us a copy-discard category.

To this we add two more. Merge takes two streams of the same type and interleaves them onto a single output stream. This is needed because a language model’s input is a single stream. There is nothing to be done about that. If you want to send two different things into it, you must send one and then the other. One might initially give merge the type !A ⊗ !B → !(A + B), taking two streams of different types and producing their coproduct. This works, but it is unnecessarily asymmetrical.

As Tobias Fritz has observed, it is cleaner to do the coproduct injection first, converting each stream to the coproduct type separately, and then merge streams that already have the same type. This gives:

merge : !A ⊗ !A → !(A + A)

Barrier takes two streams, which may be of different types, and synchronises them. Values arrive unsynchronised; the barrier waits for one value from each stream and produces a pair.

barrier : !A ⊗ !B → !(A, B)

(A mathematician would write A × B for the product. We cannot easily do this in a computer language because there is no × symbol on most keyboards, so we use (A, B) for the product, following Haskell’s convention.)

This is a synchronisation primitive. It is important because it unlocks session types, which we will demonstrate in the second post.

Two further morphisms are added to the category (they are not derivable from the structural ones, but are needed to build useful things): map, which applies a pure function to each value in a stream, and filter, which removes values that do not satisfy a predicate. Both are pure functions over streams. Both will be familiar from functional programming.
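A rough Python analogue (mine, not the plumbing implementation) treats streams as lazy iterators, with map and filter as pure functions over them:

from typing import Callable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def stream_map(f: Callable[[A], B], xs: Iterator[A]) -> Iterator[B]:
    for x in xs:
        yield f(x)  # apply a pure function to each value

def stream_filter(p: Callable[[A], bool], xs: Iterator[A]) -> Iterator[A]:
    for x in xs:
        if p(x):
            yield x  # drop values that fail the predicate

scores = iter([40, 92, 75, 88])
passing = stream_filter(lambda s: s >= 85, scores)
labelled = stream_map(lambda s: f"score={s}", passing)
print(list(labelled))  # ['score=92', 'score=88']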

Here is a graphical representation of the morphisms. We can glue them together freely, as long as the types and the directions of the arrows match up.

Diagram showing all six morphisms as boxes with typed input and output wires. Top row: copy Δ (one input, two outputs of the same type), merge ∇ (two inputs of copyable type, one output of sum type), discard ◇ (one input, no output). Bottom row: barrier ⋈ (two inputs, one paired output, synchronises two streams), map f (one input, one output, applies a function), filter p (one input, one output, removes values failing a predicate). Each morphism shows its type signature using the !A notation for copyable streams.

There are two forms of composition. Sequential composition connects morphisms nose to tail, the output of one feeding the input of the next. Parallel composition places them side by side, denoted by ⊗ (the tensor product, written directly in plumbing source code). So: four structural morphisms, two utilities, two compositional forms, all operating on typed channels.

Because the channels are typed, the compiler can check statically, at compile time, that every composition is well-formed: that outputs match inputs at every boundary. This gives a guarantee that the assembled graph makes sense.

Two diagrams side by side. Left: sequential composition, showing two morphisms connected end-to-end, the output wire of the first feeding into the input wire of the second, forming a pipeline. Right: parallel composition (tensor product), showing two morphisms stacked vertically with no connection between them, running simultaneously on independent streams. Both forms produce a composite morphism whose type is derived from the types of the components.

A composition of morphisms is itself a morphism. This follows from the category laws (it has to, or it is not a category) but the practical consequence is worth stating explicitly. We can assemble a subgraph of agents and structural morphisms, and then forget the internal detail and use the entire thing as a single morphism in a larger graph. This gives modularity. We can study, test, and refine a building block in isolation, and once satisfied, use it as a component of something bigger.
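Here is a toy sketch (mine, not the plumbing compiler) of the idea: every morphism carries an input and an output type, sequential composition is only defined when those types meet, and a composite is itself a morphism that can be reused as a building block.

from dataclasses import dataclass

@dataclass(frozen=True)
class Morphism:
    name: str
    dom: str  # input channel type, e.g. "!string"
    cod: str  # output channel type, e.g. "!Verdict"

    def then(self, other: "Morphism") -> "Morphism":
        # Sequential composition: only allowed when the types match up.
        if self.cod != other.dom:
            raise TypeError(
                f"cannot compose {self.name} : {self.dom} -> {self.cod} "
                f"with {other.name} : {other.dom} -> {other.cod}")
        return Morphism(f"{self.name} ; {other.name}", self.dom, other.cod)

composer = Morphism("composer", "!string", "!string")
checker = Morphism("checker", "!string", "!Verdict")

pipeline = composer.then(checker)  # ok: !string -> !Verdict, itself a morphism
# checker.then(composer)           # rejected before anything runs: type error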

What we have described so far is the static form of the language: concise, point-free (composing operations without naming intermediate values), all about compositions. This is what you write. It is not what the runtime executes. A compiler takes this static form and produces the underlying wiring diagram, expanding the compositions into explicit connections between ports. The relationship is similar to point-free style in functional programming: the concise form is good for thinking and writing; the expanded form is good for execution.

Agents

An agent is a special kind of morphism. It takes typed input and produces typed output, like any other morphism, and we can enforce these types. This much is a well-known technique; PydanticAI and the Vercel AI SDK do it. Agents implement typing at the language model level by producing and consuming JSON, and we can check that the JSON has the right form. This is the basis of the type checking.

Unlike the structural morphisms and utilities, an agent is stateful. It has a conversation history, a context window that fills up, parameters that change. You cannot sensibly model an agent as a pure function. You could model it using the state monad or lenses, and that would be formally correct, but it is the wrong level of abstraction for engineering. Instead, we allow ourselves to think of agents as opaque processes with a typed protocol for interacting with them. We mutate their state through that protocol, and we know how to do that purely from functional programming and category theory. The protocol is the right abstraction; the state management is an implementation detail behind it. How this works in practice, and what happens when it goes wrong, is the subject of the second post.

In addition to their main input and output ports, agents in plumbing have control ports (control in and control out) for configuring the agent at runtime. For example, the temperature parameter governs how creative a language model is: how wide its sampling distribution is when choosing output. At zero it is close to deterministic; at one it becomes much less predictable. A control message might say set temperature to 0.3; the response on the control out wire might be acknowledged. The control port carries a typed stream like anything else.
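In Python-ish pseudocode (mine, not the plumbing runtime), the control protocol amounts to something like this: a typed control message arrives, the agent mutates its own state, and a reply goes out on the control-out stream.

from dataclasses import dataclass

@dataclass
class SetTemperature:
    value: float  # the typed payload of a control message

class Agent:
    """Stateful process with a main port (not shown) and a control port."""
    def __init__(self) -> None:
        self.temperature = 1.0  # start creative / wide sampling

    def ctrl_in(self, msg: SetTemperature) -> str:
        self.temperature = msg.value
        return "acknowledged"  # emitted on the control-out stream

agent = Agent()
print(agent.ctrl_in(SetTemperature(0.3)))  # acknowledged
print(agent.temperature)                   # 0.3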

Agents also have ports for operator-in-the-loop (often called human-in-the-loop, though there is no reason an operator must be human), tool calls, and telemetry. The telemetry port emits usage statistics and, if the underlying model supports it, thinking traces. We will not detail these here. Suffice it to say that an agent has several pairs of ports beyond what you might imagine as its regular chat input and output.

Diagram of a generic agent morphism showing all port pairs. The agent is a central box. On the left: input (main data stream), ctrl_in (control commands), tool_in (tool call responses), oitl_in (operator-in-the-loop responses). On the right: output (main data stream), ctrl_out (control responses), tool_out (tool call requests), oitl_out (operator-in-the-loop requests), telemetry (usage and diagnostic data). Each port pair carries a typed stream. Most programs use only a few of these ports; unused ports are elided via the don't-care-don't-write convention.

An agent has many ports, but most programs use only a few of them. We adopt a convention from the κ calculus: don’t care, don’t write. Any output port that is not mentioned in the program is implicitly connected to discard. If a port’s output cannot matter, there is no reason to write it down.

Example: adversarial document composition

Suppose the problem is to write a cover letter for a job application. You provide some background material (a CV, some notes, some publications) and a job advert. You want a network of agents to produce a good cover letter. A good cover letter has two constraints: it must be accurate, grounded in the source materials, not making things up; and it must be compelling, so that the reader wants to give you an interview.

These two constraints are in tension, and they are best served by different agents with different roles. A composer drafts from the source materials. A checker verifies the draft against those materials for accuracy, producing a verdict: pass or fail, with commentary. A critic, who deliberately cannot see the source materials, evaluates whether the result is compelling on its own terms, producing a score.

The feedback loops close the graph. If the checker rejects the draft, its commentary goes back to the composer. If the critic scores below threshold, its review goes back to the composer. Only when the critic is satisfied does the final draft emerge.

Here is the plumbing code:

type Verdict = { verdict: bool, commentary: string, draft: string }
type Review  = { score: int, review: string, draft: string }

let composer : !string -> !string = agent { ... }
let checker  : !string -> !Verdict = agent { ... }
let critic   : !Verdict -> !Review = agent { ... }

let main : !string -> !string = plumb(input, output) {
  input   ; composer ; checker
  checker ; filter(verdict = false)
          ; map({verdict, commentary}) ; composer
  checker ; filter(verdict = true) ; critic
  critic  ; filter(score < 85)
          ; map({score, review}) ; composer
  critic  ; filter(score >= 85).draft ; output
}

And here is a graphical representation of what’s going on:

Vertical diagram of the adversarial document composition pipeline. Flow runs top to bottom. Input feeds into a composer agent. The composer's output goes to a checker agent. The checker splits two ways via filter: if verdict is false, the verdict and commentary are mapped back to the composer as feedback (loop). If verdict is true, the draft goes to a critic agent. The critic also splits two ways: if score is below 85, the score and review are mapped back to the composer for revision (second loop). If score is 85 or above, the draft is extracted via map and sent to the output. Two feedback loops, two quality gates, one output.

The agent configuration is elided. The main pipeline takes a string input and produces a string output. It is itself a morphism, and could be used as a component in something larger.

Notice what the wiring enforces. The critic receives verdicts, not the original source materials. The information partition is a consequence of the types, not an instruction in a prompt. The feedback loops are explicit: a failed verdict routes back to the composer with commentary; a low score routes back with the review. All of this is checked at compile time.

Example: heated debate

The previous example shows sequential composition and feedback loops but not parallel composition. An ensemble of agents running simultaneously on the same input needs the tensor product.

Ensembles are common. Claude Code spawns sub-agents in parallel to investigate or review, then gathers the results. This is a scatter-gather pattern familiar from high-performance computing. But this example, due to Vincent Danos, adds something less common: modulation of agent behaviour through the control port.

The input is a proposition. Two agents debate it, one advocating and one sceptical, running in parallel via the tensor product. Their outputs are synchronised by a barrier into a pair and presented to a judge. The judge decides: has the debate converged? If so, a verdict goes to the output. If not, a new topic goes back to the debaters, and a temperature goes to their control inputs.

The intuition is that the debaters should start creative (high temperature, wide sampling) and become progressively more focused as the rounds continue. The judge controls this. Each round, the judge decides both whether to continue and how volatile the next round should be. If the debate appears to be converging, the judge lowers the temperature, preventing the system from wandering off in new directions. Whether this actually causes convergence is a research question, not a proven result.

type Verdict = { resolved: bool, verdict: string,
                 topic: string, heat: number }
type Control = { set_temp: number }

let advocate : (!string, !Control) -> !string = agent { ... }
let skeptic  : (!string, !Control) -> !string = agent { ... }
let judge    : !(string, string) -> !Verdict  = agent { ... }

let cool : !Verdict -> !Control = map({set_temp: heat})

let main : !string -> !string = plumb(input, output) {
  input ; (advocate ⊗ skeptic) ; barrier ; judge
  judge ; filter(resolved = false).topic ; (advocate ⊗ skeptic)
  judge ; filter(resolved = true).verdict ; output
  judge ; cool ; (advocate@ctrl_in ⊗ skeptic@ctrl_in)
}

And here is the graphical representation:

Diagram of the heated debate example. Two agent boxes (advocate and skeptic) are placed in parallel via tensor product, both receiving the same input proposition. Their outputs feed into a barrier which synchronises them into a pair. The pair goes to a judge agent. The judge has two outputs: a verdict (going to the main output) and a feedback loop. The feedback loop carries both a new topic (routed back to the debaters' inputs) and a temperature setting (routed to both debaters' control input ports via ctrl_in). The diagram shows parallel composition, barrier synchronisation, and a control feedback loop in one system.

The ⊗ operator is the tensor product: parallel composition. (The grammar also accepts * for editors that cannot input unicode.) The advocate and skeptic run simultaneously on the same input. The barrier synchronises their outputs into a pair for the judge. The last line is the control feedback: the judge’s verdict is mapped to a temperature setting and sent to both agents’ control inputs. Notice that advocate@ctrl_in addresses a specific port on the agent, the control port rather than the main input.

This is a small program. It is also a concurrent system with feedback loops, runtime parameter modulation, and a convergence protocol. Without types, getting the wiring right would be a matter of testing and hope. With types, it is checked before it runs.

What this shows

In a few lines of code, with a language that has categorical foundations, we can capture interesting agent systems and be sure they are well-formed before they run.

The upshot: when we have guarantees about well-formedness, systems work more stably and more predictably. With static typing, entire classes of structural errors are impossible. You cannot wire an output of one type to an input of another. You cannot forget a connection. The job you pay for is more likely to actually work, and you get more useful work per dollar spent. Runtime budget controls can put a ceiling on cost, but they do not prevent the waste. Static typing prevents the waste.

But there is a lot more to do. What we have so far is already useful as a language for constructing agent graphs with static type checking. But we have given short shrift to the complexity and internal state of the agent morphism, which is really all about memory architecture and context management. That is where the real power comes from. For that we need more than a copy-discard category with some extra structure. We need protocols—and that is the subject of the sequel, soon to appear here.

The plumbing compiler, runtime, and MCP server are available as binary downloads for macOS and Linux:

Download plumbing version 0.

Here is the research paper describing the broader programme of work:

• William Waites, Artificial organisations (arXiv:2602.13275).

n-Category Café A Typed Language for Agent Coordination

guest post by William Waites

How category theory can be used to help coordinate collections of interacting large language models.

Agent frameworks are popular. (These are frameworks for coordinating large language model agents, not to be confused with agent-based modelling in the simulation sense.) There are dozens of them for wrapping large language models in something called an agent and assembling groups of agents into workflows. Much of the surrounding discussion is marketing, but the underlying intuition is old: your web browser identifies itself as a user agent. What is new is the capability that generative language models bring.

The moment you have one agent, you can have more than one. That much is obvious. How to coordinate them is not. The existing frameworks (n8n, LangGraph, CrewAI, and others) are engineering solutions, largely ad hoc. Some, like LangGraph, involve real thinking about state machines and concurrency. But none draws on what we know from mathematics and computer science about typed composition, protocol specification, or structural guarantees for concurrent systems.

This matters because it is expensive. Multi-agent systems are complicated concurrent programs. Without structural guardrails, they fail in ways you discover only after spending the compute. A job can go off the rails, and the money you paid for it is wasted; the providers will happily take it regardless. At current subscription rates the cost is hidden, but a recent Forbes investigation found that a heavy user of Anthropic’s $200/month Claude Code subscription can consume up to $5,000/month measured at retail API rates. For third-party tools like Cursor, which pay close to those retail rates, these costs are real. Wasted tokens are wasted money.

To address this, we built a language called plumbing. It describes how agents connect and communicate, in such a way that the resulting graph can be checked before execution: checked for well-formedness, and within limits for deadlocks and similar properties. It is a statically typed language, and these checks are done formally. There is a compiler and a runtime for this language, working code, not a paper architecture. In a few lines of plumbing, you can describe agent systems with feedback loops, runtime parameter modulation, and convergence protocols, and be sure they are well-formed before they run. This post explains how it works.

The name has a history in computing. Engineers have always talked informally about plumbing to connect things together: bits of software, bits of network infrastructure. When I was a network engineer I sometimes described myself as a glorified plumber. The old Solaris ifconfig command took plumb as an argument, to wire a network interface into the stack. Plan 9 had a deeper version of the same idea. The cultural connection goes back decades.

This is the first of two posts. This one introduces the plumbing calculus: what it is, how it works, and a few simple examples. Motifs for adversarial review, ensemble reasoning, and synthesis. The second post will tackle something harder.

The calculus

The plumbing language is built on a symmetric monoidal category, specifically a copy-discard category with some extra structure. The terminology may be unfamiliar, but the underlying concept is not. Engineers famously like Lego. Lego bricks have studs on top and holes with flanged tubes underneath. The studs of one brick fit into the tubes of another. But Lego has more than one connection type: there are also holes through the sides of Technic bricks, and axles that fit through them, and articulated ball joints for the fancier kits. Each connection type constrains what can attach to what. This is typing.

In plumbing, the objects of the category are typed channels: streams that carry a potentially infinite sequence of values, each of a specific type (integer, string, a record type, or something more complex). We write !A to mean "a stream of As", so !string is a stream of strings and !int is a stream of integers. The morphisms, which describe how you connect channels together, are processes. A process has typed inputs and typed outputs.

There are four structural morphisms. Copy takes a stream and duplicates it: the same values appear on two output streams. Discard throws values away, perhaps the simplest thing you can do with a stream, and often needed. These two, together with the typed channels and the laws of the category, give us a copy-discard category.

To this we add two more. Merge takes two streams of the same type and interleaves them onto a single output stream. This is needed because a language model’s input is a single stream. There is nothing to be done about that. If you want to send two different things into it, you must send one and then the other. One might initially give merge the type !A ⊗ !B → !(A + B), taking two streams of different types and producing their coproduct. This works, but it is unnecessarily asymmetrical.

As Tobias Fritz has observed, it is cleaner to do the coproduct injection first, converting each stream to the coproduct type separately, and then merge streams that already have the same type. This gives:

merge : !A ⊗ !A → !(A + A)

Barrier takes two streams, which may be of different types, and synchronises them. Values arrive unsynchronised; the barrier waits for one value from each stream and produces a pair.

barrier : !A ⊗ !B → !(A, B)

(A mathematician would write A ×\times B for the product. We cannot easily do this in a computer language because there is no ×\times symbol on most keyboards, so we use (A, B) for the product, following Haskell’s convention.)

This is a synchronisation primitive. It is important because it unlocks session types, which we will demonstrate in the second post.

Two further morphisms are added to the category (they are not derivable from the structural ones, but are needed to build useful things): map, which applies a pure function to each value in a stream, and filter, which removes values that do not satisfy a predicate. Both are pure functions over streams. Both will be familiar from functional programming.

Here is a graphical representation of the morphisms. We can glue them together freely, as long as the types and the directions of the arrows match up.

Diagram showing all six morphisms as boxes with typed input and output wires. Top row: copy Δ (one input, two outputs of the same type), merge ∇ (two inputs of copyable type, one output of sum type), discard ◇ (one input, no output). Bottom row: barrier ⋈ (two inputs, one paired output, synchronises two streams), map f (one input, one output, applies a function), filter p (one input, one output, removes values failing a predicate). Each morphism shows its type signature using the !A notation for copyable streams.

There are two forms of composition. Sequential composition connects morphisms nose to tail, the output of one feeding the input of the next. Parallel composition places them side by side, denoted by ⊗ (the tensor product, written directly in plumbing source code). So: four structural morphisms, two utilities, two compositional forms, all operating on typed channels.

Because the channels are typed, the compiler can check statically, at compile time, that every composition is well-formed: that outputs match inputs at every boundary. This gives a guarantee that the assembled graph makes sense.

Two diagrams side by side. Left: sequential composition, showing two morphisms connected end-to-end, the output wire of the first feeding into the input wire of the second, forming a pipeline. Right: parallel composition (tensor product), showing two morphisms stacked vertically with no connection between them, running simultaneously on independent streams. Both forms produce a composite morphism whose type is derived from the types of the components.

A composition of morphisms is itself a morphism. This follows from the category laws (it has to, or it is not a category) but the practical consequence is worth stating explicitly. We can assemble a subgraph of agents and structural morphisms, and then forget the internal detail and use the entire thing as a single morphism in a larger graph. This gives modularity. We can study, test, and refine a building block in isolation, and once satisfied, use it as a component of something bigger.

What we have described so far is the static form of the language: concise, point-free (composing operations without naming intermediate values), all about compositions. This is what you write. It is not what the runtime executes. A compiler takes this static form and produces the underlying wiring diagram, expanding the compositions into explicit connections between ports. The relationship is similar to point-free style in functional programming: the concise form is good for thinking and writing; the expanded form is good for execution.

Agents

An agent is a special kind of morphism. It takes typed input and produces typed output, like any other morphism, and we can enforce these types. This much is a well-known technique; PydanticAI and the Vercel AI SDK do it. Agents implement typing at the language model level by producing and consuming JSON, and we can check that the JSON has the right form. This is the basis of the type checking.

Unlike the structural morphisms and utilities, an agent is stateful. It has a conversation history, a context window that fills up, parameters that change. You cannot sensibly model an agent as a pure function. You could model it using the state monad or lenses, and that would be formally correct, but it is the wrong level of abstraction for engineering. Instead, we allow ourselves to think of agents as opaque processes with a typed protocol for interacting with them. We mutate their state through that protocol, and we know how to do that purely from functional programming and category theory. The protocol is the right abstraction; the state management is an implementation detail behind it. How this works in practice, and what happens when it goes wrong, is the subject of the second post.

In addition to their main input and output ports, agents in plumbing have control ports (control in and control out) for configuring the agent at runtime. For example, the temperature parameter governs how creative a language model is: how wide its sampling distribution is when choosing output. At zero it is close to deterministic; at one it becomes much less predictable. A control message might say set temperature to 0.3; the response on the control out wire might be acknowledged. The control port carries a typed stream like anything else.

Agents also have ports for operator-in-the-loop (often called human-in-the-loop, though there is no reason an operator must be human), tool calls, and telemetry. The telemetry port emits usage statistics and, if the underlying model supports it, thinking traces. We will not detail these here. Suffice it to say that an agent has several pairs of ports beyond what you might imagine as its regular chat input and output.

Diagram of a generic agent morphism showing all port pairs. The agent is a central box. On the left: input (main data stream), ctrl_in (control commands), tool_in (tool call responses), oitl_in (operator-in-the-loop responses). On the right: output (main data stream), ctrl_out (control responses), tool_out (tool call requests), oitl_out (operator-in-the-loop requests), telemetry (usage and diagnostic data). Each port pair carries a typed stream. Most programs use only a few of these ports; unused ports are elided via the don't-care-don't-write convention.

An agent has many ports, but most programs use only a few of them. We adopt a convention from the κ calculus: don’t care, don’t write. Any output port that is not mentioned in the program is implicitly connected to discard. If a port’s output cannot matter, there is no reason to write it down.
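As a minimal sketch, reusing the composer agent declared in the next example: the program below mentions only the main ports, so composer’s telemetry, ctrl_out, tool_out, and oitl_out streams are all implicitly wired to discard, and nothing has to be written for them.

let main : !string -> !string = plumb(input, output) {
  input ; composer ; output
}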

Example: adversarial document composition

Suppose the problem is to write a cover letter for a job application. You provide some background material (a CV, some notes, some publications) and a job advert. You want a network of agents to produce a good cover letter. A good cover letter has two constraints: it must be accurate, grounded in the source materials, not making things up; and it must be compelling, so that the reader wants to give you an interview.

These two constraints are in tension, and they are best served by different agents with different roles. A composer drafts from the source materials. A checker verifies the draft against those materials for accuracy, producing a verdict: pass or fail, with commentary. A critic, who deliberately cannot see the source materials, evaluates whether the result is compelling on its own terms, producing a score.

The feedback loops close the graph. If the checker rejects the draft, its commentary goes back to the composer. If the critic scores below threshold, its review goes back to the composer. Only when the critic is satisfied does the final draft emerge.

Here is the plumbing code:

type Verdict = { verdict: bool, commentary: string, draft: string }
type Review  = { score: int, review: string, draft: string }
 
let composer : !string -> !string = agent { ... }
let checker  : !string -> !Verdict = agent { ... }
let critic   : !Verdict -> !Review = agent { ... }
 
let main : !string -> !string = plumb(input, output) {
  input   ; composer ; checker
  checker ; filter(verdict = false)
          ; map({verdict, commentary}) ; composer
  checker ; filter(verdict = true) ; critic
  critic  ; filter(score < 85)
          ; map({score, review}) ; composer
  critic  ; filter(score >= 85).draft ; output
}

And here is a graphical representation of what’s going on:

Vertical diagram of the adversarial document composition pipeline. Flow runs top to bottom. Input feeds into a composer agent. The composer's output goes to a checker agent. The checker splits two ways via filter: if verdict is false, the verdict and commentary are mapped back to the composer as feedback (loop). If verdict is true, the draft goes to a critic agent. The critic also splits two ways: if score is below 85, the score and review are mapped back to the composer for revision (second loop). If score is 85 or above, the draft is extracted via map and sent to the output. Two feedback loops, two quality gates, one output.

The agent configuration is elided. The main pipeline takes a string input and produces a string output. It is itself a morphism, and could be used as a component in something larger.

Notice what the wiring enforces. The critic receives verdicts, not the original source materials. The information partition is a consequence of the types, not an instruction in a prompt. The feedback loops are explicit: a failed verdict routes back to the composer with commentary; a low score routes back with the review. All of this is checked at compile time.

Example: heated debate

The previous example shows sequential composition and feedback loops but not parallel composition. An ensemble of agents running simultaneously on the same input needs the tensor product.

Ensembles are common. Claude Code spawns sub-agents in parallel to investigate or review, then gathers the results. This is a scatter-gather pattern familiar from high-performance computing. But this example, due to Vincent Danos, adds something less common: modulation of agent behaviour through the control port.

The input is a proposition. Two agents debate it, one advocating and one sceptical, running in parallel via the tensor product. Their outputs are synchronised by a barrier into a pair and presented to a judge. The judge decides: has the debate converged? If so, a verdict goes to the output. If not, a new topic goes back to the debaters, and a temperature goes to their control inputs.

The intuition is that the debaters should start creative (high temperature, wide sampling) and become progressively more focused as the rounds continue. The judge controls this. Each round, the judge decides both whether to continue and how volatile the next round should be. If the debate appears to be converging, the judge lowers the temperature, preventing the system from wandering off in new directions. Whether this actually causes convergence is a research question, not a proven result.

type Verdict = { resolved: bool, verdict: string,
                 topic: string, heat: number }
type Control = { set_temp: number }
 
let advocate : (!string, !Control) -> !string = agent { ... }
let skeptic  : (!string, !Control) -> !string = agent { ... }
let judge    : !(string, string) -> !Verdict  = agent { ... }
 
let cool : !Verdict -> !Control = map({set_temp: heat})
 
let main : !string -> !string = plumb(input, output) {
  input ; (advocate ⊗ skeptic) ; barrier ; judge
  judge ; filter(resolved = false).topic ; (advocate ⊗ skeptic)
  judge ; filter(resolved = true).verdict ; output
  judge ; cool ; (advocate@ctrl_in ⊗ skeptic@ctrl_in)
}

And here is the graphical representation:

Diagram of the heated debate example. Two agent boxes (advocate and skeptic) are placed in parallel via tensor product, both receiving the same input proposition. Their outputs feed into a barrier which synchronises them into a pair. The pair goes to a judge agent. The judge has two outputs: a verdict (going to the main output) and a feedback loop. The feedback loop carries both a new topic (routed back to the debaters' inputs) and a temperature setting (routed to both debaters' control input ports via ctrl_in). The diagram shows parallel composition, barrier synchronisation, and a control feedback loop in one system.

The ⊗ operator is the tensor product: parallel composition. (The grammar also accepts * for editors that cannot input unicode.) The advocate and skeptic run simultaneously on the same input. The barrier synchronises their outputs into a pair for the judge. The last line is the control feedback: the judge’s verdict is mapped to a temperature setting and sent to both agents’ control inputs. Notice that advocate@ctrl_in addresses a specific port on the agent, the control port rather than the main input.

This is a small program. It is also a concurrent system with feedback loops, runtime parameter modulation, and a convergence protocol. Without types, getting the wiring right would be a matter of testing and hope. With types, it is checked before it runs.

What this shows

In a few lines of code, with a language that has categorical foundations, we can capture interesting agent systems and be sure they are well-formed before they run.

The upshot: when we have guarantees about well-formedness, systems work more stably and more predictably. With static typing, entire classes of structural errors are impossible. You cannot wire an output of one type to an input of another. You cannot forget a connection. The job you pay for is more likely to actually work, and you get more useful work per dollar spent. Runtime budget controls can put a ceiling on cost, but they do not prevent the waste. Static typing prevents the waste.

But there is a lot more to do. What we have so far is already useful as a language for constructing agent graphs with static type checking. But we have given short shrift to the complexity and internal state of the agent morphism, which is really all about memory architecture and context management. That is where the real power comes from. For that we need more than a copy-discard category with some extra structure. We need protocols—and that is the subject of the sequel, soon to appear here.

The plumbing compiler, runtime, and MCP server are available as binary downloads for macOS and Linux:

Download plumbing version 0.

Here is the research paper describing the broader programme of work:

• William Waites, Artificial organisations (arXiv:2602.13275).

March 11, 2026

John PreskillNicole’s guide to navigating faculty-position offers

It’s happening. 

Your inbox registers an email from the chair of a faculty-hiring committee. With trembling fingers, you click on the message. “We were very impressed…we’re delighted to offer…” Months of labor, soul-searching, strain, and anxiety give way to jubilation. You hug your partner/roommate/mom/dog; throw an impromptu dance party; and forward the email, prefaced with five exclamation points, to your mentor.1 

As your heart rate returns to a level less likely to alarm a cardiologist, a new source of uncertainty puckers your brow. You’ve received an offer of a faculty position. What happens now? How should you proceed?

This article will address those questions. It follows my guide to faculty interviews, which follows my guide to writing research statements. Like the former guide, this one pertains most to theoretical physicists seeking assistant professorships at R1-level North American universities. Yet all the advice applies also to candidates outside this pool.

The institution will bring you (and, if relevant, your partner) over for a visit. Yes, you visited to interview; but you’re now visiting for another purpose. Assess whether you and your family could flourish if you accepted the offer. Which neighborhoods might you like to live in? Could you tolerate the commute to campus? Vide infra for more questions to keep in mind.

Politely notify the other hiring committees that interviewed you and that are still considering your application. You’ll do the other committees a kindness: their chances of hiring you have narrowed. If they wish to lure you, they’ll need to act quickly. The notice may bump you up in their priority lists. Did the first institution request that you decide about its offer by some deadline? If so, notify the other institutions.

Gather all the information you need. The department may offer to put you in touch with faculty members, deans, and more. Request more connections if necessary. Approach each conversation with a list of questions, and take notes. How will the tenure process unfold? How do early-career faculty members characterize their experiences with it? To what extent does the department shield early-career faculty from administrative duties (serving on committees)? How do the institutions’ policies address parental leave and elder care? If you have a child as an assistant professor, will your tenure clock pause for a year (will you be able to build your credentials for an extra year before applying for tenure)? In which neighborhoods should you search for a house?  

List your priorities. Rank them. Measure each offer against each criterion. Here are example priorities that you might wish to include:

  • Salary
  • Startup package
  • Type of environment: Do you want to live in a city, in the suburbs, or in the country? Do you drive, or would you learn to drive?
  • Length of commute
  • Geographical location: Do you prefer to live near family? 
  • Proximity, and means of transportation, to an airport: You might commute to and from that airport many times to participate in conferences, present seminars, etc. How much time and exhaustion would the experience cost?
  • Local school system: If you have or might have children, where would they learn?
  • Partner’s needs: Do you have a partner who would need to find a job near yours?
  • Proximity of faculty with whom you could collaborate
  • Courtesy positions in other departments: Suppose you’re a physicist who studies quantum computation. You might want to recruit students from the computer-science or math department occasionally. Could you? Would you need a courtesy position in the other department? A courtesy appointment offers you limited privileges at the cost of limited responsibilities: you probably won’t be able to vote in the other department’s faculty meetings. On the other hand, you probably won’t need to spend time on those faculty meetings.
  • Academic quality of undergraduate/graduate population
  • Presence of an institute/center dedicated to your specialization
  • Lab space: location, size, quality, renovations available, how soon and quickly the university would undertake those renovations
  • Help with finding housing: Some universities have apartments that new faculty can rent for a year or two. Other universities offer real-estate-agent services or help faculty obtain mortgages (*cough* San Francisco Bay area *cough*). 
  • Administrative assistance for you and your research group
  • Protection from onerous service to the department until you reach tenure
  • Teaching relief granted en route to tenure: At some universities, a new faculty member can avoid teaching their usual course load during one or two semesters. Such relief frees you to buff up your research program while pursuing tenure.
  • Deferral: Deferring an offer, you postpone the time at which you take up the new mantle. When I accepted a permanent position, I was completing year two of a three-year postdoctoral fellowship. I wanted to complete the final year before assuming my new role: I was still finishing projects with the community to which I belonged, and I wanted to continue deepening my ties with that community. Also, I enjoyed undertaking research without the distraction of a primary investigator’s administrative responsibilities. Other people defer their start dates for other reasons. For example, a partner might need time to fulfill a contract where they live and work. In my experience, people tend to defer PI positions for approximately twelve months, give or take six months. Some institutions don’t offer deferrals, though.

Identify everything you’ll need in a startup package. A startup package helps cover your research program’s costs until you’ve won your first grants. Multiple organizations within a university might contribute to a startup package—for example, a department and an institute that cuts across departments. The hiring committee might propose a startup package to you, or the committee might ask you what you need. Either way, you can (and should) negotiate the package. 

View the negotiation in terms of the question “What do I need to succeed?” List every item, and estimate its cost. Don’t skimp on rigor: estimate prices to the single-dollar level of precision. Such precision helps demonstrate the thoroughness of the research behind your list—helps demonstrate that you need every dollar you seek. 

Request more funding than you believe you’ll need, because you will need more than you believe. Build the breathing room into your estimates. For example, assume that your academic visitors will fly from across the country or across the world. Assume that you’ll fly such distances to present talks. Estimating how much you’ll pay a student or postdoc throughout the next few years? Don’t forget that the university might raise salaries and benefits under a cost-of-living adjustment (COLA) every year. COLAs fluctuate across years, so assume you’ll face steep ones. 

Here are examples of items that a startup package can include:

  • Summer salary: Your institution won’t pay your salary during the summer; you’ll need to fund yourself through grants. A startup package can cover the initial summers.
  • Lab equipment
  • Computers and tablets for you and your group members: Don’t forget protective cases, AppleCare or a non-Apple equivalent, implements for writing on tablets, external mice, and external monitors. Check whether your department or institute has spare mice or monitors that you can requisition.
  • Other computational resources: Does the department have a computational cluster that you intend to use? Do you need to access a national lab’s supercomputer or a quantum computer available on the cloud? How much will you pay per unit time and memory?
  • Postdoc costs: These costs include a salary, benefits, and the cost of moving to your institution. The salary will increase from year to year if the institution implements a COLA. The benefits include healthcare, dental care, and the like. Administrators might call benefits “fringe,” as I discovered after considerable confusion. 
  • Graduate-student costs: These costs include a research assistantship, benefits, and possibly tuition. The salary might increase as a student progresses through the stages of their PhD, particularly once they achieve candidacy. Their need for tuition might change, too. Check whether domestic students cost more than international students, and budget for international students.
  • Undergraduate researchers: Do you plan to employ an undergraduate during the summer? Throughout the academic year?2 
  • Travel for yourself: Budget several trips per year for yourself. You’ll need to spread the word about your research and to grow your network en route to tenure.
  • Travel for your postdocs and students: A mentor shared that she covers one conference per year per group member. You might want to budget also for a seminar or two per group member per year.
  • Visitors: Visitors can boost your research program. Budget for week-long visits if your institution can accommodate them.

Negotiate. Even if your dream school has offered you your dream job. Even if you receive only one offer. You might still garner resources that can help your research program and family to thrive. Don’t feel shy, sheepish, or ashamed to negotiate. If you remain polite and considerate, you won’t offend anyone. Besides, the hiring committee, department chair, and dean expect you to negotiate. The department chair might even hope that you do so; vide infra. 

When I was a PhD student, Caltech offered a workshop about negotiation to women grad students. The workshop helped participants build skills, knowledge, and self-assurance that would benefit us when we negotiated contracts. I recommend attending a workshop, taking a course, reading a book, or watching videos about negotiation. Contact your institution’s professional-development office about opportunities and suggestions. If you’re reading this blog post before applying for jobs—any jobs—start now.

What can you negotiate for? Many of the items on your list of priorities. Certain institutions might lack the freedom to negotiate certain items, though. For example, a union might determine salaries. Don’t let such a discovery discourage you; explore the options thoroughly.

View the department chair as an ally. The department chair negotiates on your behalf with administrators higher up in the university hierarchy, such as deans. The chair aims to garner as many resources as possible for you—and, by extension, for their department. Explain to the department chair (or to the committee chair who might explain to the department chair) what you need and why you need it, to help strengthen their argument.

As soon as you know you’ll decline an offer, decline it politely. Your notification will free the committee to attract another candidate. Imagine you’re Candidate #2 on the priority list. Wouldn’t you want the current offer recipient to decline their offer as soon as their conscience allows? Now, imagine you’re the hiring-committee chair. You’re worried that Candidate #1 will decline—and, by the time they decline, other institutions will have snapped up the other top candidates. As Candidate #1, demonstrate toward the committee chair and toward Candidate #2 the consideration that you’d value if in their shoes.

Savor the moment. You’ve just survived the faculty-application process, one of the most stressful periods of your life. The faculty life is no walk in the park, either. Nor will you necessarily sleep soundly between the receipt of your first offer and your signing of a contract. The prospect of more offers could leave you in limbo. If you receive multiple offers, choosing between them—choosing the course of your and your family’s life—may stress you as much as applying did. So remember to feel grateful for the source of your anxiety. Give yourself credit for your accomplishment. 

Congratulations!

1Please do! They’ll want to celebrate with you.

2I recommend targeting undergrads who’ll work with you for more than a summer. Training an undergrad takes nearly a summer; you and the student will benefit from having time to take advantage of that training.

John BaezStandard Model 5: Spin-1/2 Particles

One of the simplest quantum systems is a spin-1/2 particle, also known as a spinor. If we measure the angular momentum of a spin-1/2 particle along any axis, there are two possible outcomes: either the angular momentum along that axis is +1/2, or it’s -1/2.

How is it possible for this to be true along every axis? Here I explain this, using the basic rules of quantum physics described last time. In particular, I say how any point on a sphere of radius 1/2 gives a quantum state of the spin 1/2 particle—and vice versa!
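To make the correspondence concrete, here is the standard formula (my addition, not a quote from the post): the point \tfrac{1}{2}\hat{n} on the sphere of radius 1/2, where \hat{n} = (\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ \cos\theta), corresponds to the state

\[
\ket{\psi(\theta,\phi)} \;=\; \cos\tfrac{\theta}{2}\,\ket{+\tfrac{1}{2}} \;+\; e^{i\phi}\sin\tfrac{\theta}{2}\,\ket{-\tfrac{1}{2}},
\]

for which a measurement of the angular momentum along \hat{n} gives +1/2 with certainty, and whose expected angular momentum vector is exactly \tfrac{1}{2}\hat{n}.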

Using this, we can understand things like the famous Stern–Gerlach experiment, where we measure the angular momentum of a spin-1/2 particle first along one axis, and then along another.

Jordan EllenbergIn which my daughter invents science

While we ate our pizza I was talking with AB about differential equations, which they’re about to start doing in calculus. We talked about y’=y. “This is why populations grow exponentially,” I explained, “because their growth is proportional to the existing population, so they satisfy y’ = cy and we can see that a differential equation like that has an exponential as its solution.”

This was met with skepticism. “Why couldn’t you just say we know populations grow exponentially and that’s how you know they have that differential equation?”

My first reaction was to say, “no, it goes the other way,” but then I realized that what AB had asked me was in fact very deep. This two-way traffic sign she was gesturing at is really at the heart of science! Sometimes you have a mechanism and you use mathematics to work out its consequences, including consequences you can’t observe directly. Sometimes you have observations and no mechanism, and then you cherchez la differential equation. In a world where we had no idea how flies were spawned, you could count the number of flies over time and reason that maybe, just maybe, y’ = cy was the governing law, and so somehow each existing fly was emitting new flies at a constant rate.

Once I realized this, I launched into a whole spiel about Newton and his inference of the gravitational differential equation from planetary motion, probably getting most of it slightly wrong, or maybe even totally wrong, but that is a dad’s prerogative sometimes.

March 10, 2026

Scott Aaronson Remarks at UT on the Pentagon/Anthropic situation

Last Thursday, my friend and colleague Sam Baker, in UT Austin’s English department, convened an “emergency panel” here about the developing Pentagon/Anthropic situation, and asked me to speak at it. Even though the situation has continued to develop since then, I thought my prepared remarks for the panel might be of interest. At the bottom, I include a few additional thoughts.


Hi! I’m Scott Aaronson! I teach CS here at UT. While my background is in quantum computing, I’ve spent the past four years dabbling in AI alignment. I did a two-year leave at OpenAI, in their now-defunct Superalignment team. I joined back when OpenAI’s line was “we’re a little nonprofit, doing all this in the greater interest of humanity, and we’d dissolve ourselves before we raced to build an AI that we thought would be dangerous.” I know Sam Altman, and many other current and former OpenAI people. I also know Dario Amodei—in fact, I knew Dario well before Anthropic existed. Despite that, I don’t actually feel like I have deep insight into the current situation with Anthropic and the Pentagon that you wouldn’t get by reading the news, or (especially) reading commentators like Zvi Mowshowitz, Kelsey Piper, Scott Alexander, and Dean Ball. But since I was asked to comment, I’ll try.

The first point I’ll make: the administration’s line, to the extent they’ve had a consistent line, is basically that they needed to cut off Anthropic because Anthropic is a bunch of woke, America-hating, leftist radicals. I think that, if you actually know the Anthropic people, that characterization is pretty laughable. Unless by “woke,” what the administration meant was “having any principles at all, beyond blind deference to authority, and sticking to them.”

I mean, Anthropic only got into this situation in the first place because it was more eager than the other AI companies to support US national security, by providing a version of Claude that could be used on classified networks. So they signed a contract with the Pentagon, and that contract had certain restrictions in it, which the Pentagon read and agreed to … until they decided that they no longer agreed.

That brings me to my second point. The Pentagon regularly signs contracts with private firms that limit what the Pentagon can do in various ways. That’s why they’re called military contract-ors. So anyone who claims it’s totally unprecedented for Anthropic to try to restrict what the government can do with Anthropic’s private property—I think that person is either misinformed or else trying to misinform.

The third point. If the Pentagon felt that it couldn’t abide a private company telling it what is or isn’t an appropriate military use of current AI, then the Pentagon was totally within its rights to cancel its contract with Anthropic, and find a different contractor (like OpenAI…) that would play ball. So it’s crucial for everyone here to understand that that’s not all that the Pentagon did. Instead they said: because Anthropic dared to stand up to us, we’re going to designate them a Supply Chain Risk—a designation that was previously reserved for foreign nation-state adversaries, and that, incredibly, hasn’t been applied to DeepSeek or other Chinese AI companies that arguably do present such risks. So basically, they threatened to destroy Anthropic, by making it horrendously complicated for any companies that do business with the government—i.e., just about all companies—also to do business with Anthropic.

Either that, the Pentagon threatened, or we’ll invoke the Defense Production Act to effectively nationalize Anthropic—i.e., we’ll just commandeer their intellectual property, use it for whatever we want despite Anthropic’s refusal. You get that? Claude is both a supply chain risk that’s too dangerous for the military to use, and somehow also so crucial to the supply chain that we, the military, need to commandeer it.

To me, this is the authoritarian part of what the Pentagon is doing (with the inconsistency being part of the authoritarianism; who but a dictator gets to impose his will on two directly contradictory grounds?). It’s the part that goes against the free-market principles that our whole economy is built on, and the freedom of speech and conscience that our whole civilization is built on. And I think this will ultimately damage US national security, by preventing other American AI companies from wanting to work on defense going forward.

That brings me to the fourth point, about OpenAI. While this was going down, Sam Altman posted online that he agreed with Anthropic’s red lines: LLMs should not be used for killing people with no human in the kill chain, and they also shouldn’t be used for mass surveillance of US citizens. I thought, that’s great! The frontier AI labs are sticking together when the chips are down, rather than infighting.

But then, just a few hours after the Pentagon designated Anthropic a supply chain risk, OpenAI announced that it had reached a deal with the Pentagon. Huh?!? If they have the same red lines, then why can one of them reach a deal while the other can’t?

The experts’ best guess seems to be this: Anthropic said, yes, using AI to kill people autonomously or to surveil US citizens should already be illegal, but we insist on putting those things in the contract to be extra-double-sure. Whereas OpenAI said, the Pentagon can use our models for “all lawful purposes”—this was the language that the Pentagon had insisted on. And, continued OpenAI, we interpret “all lawful purposes” to mean that they can’t cross these red lines. But if it turns out we’re wrong about that … well, that’s not our problem! That’s between the Pentagon and the courts, or whatever.

Again, we don’t fully know, because most of the relevant contracts haven’t been made public, but that’s an inference from reading between the lines of what has been made public.

Back in 2023-2024, when there was the Battle of the Board, then the battle over changing OpenAI’s governance structure, etc., some people formed a certain view of Sam, that he would say all the good and prosocial and responsible things even while he did whichever thing maximized revenue. I’ll leave it to you whether last week’s events are consistent with that view.

OK, fifth and final point. I remember 15-20 years ago, talking to Eliezer Yudkowsky and others terrified about AI. They said, this is the biggest issue facing the world. It’s not safe for anyone to build because it could turn against us, or even before that, the military could commandeer it or whatever. And I and others were like, dude, you guys obviously read too much science fiction!

And now here we are. Not only are we living in a science-fiction story, I’d say we’re living in a particularly hackneyed one. I mean, the military brass marching into a top AI lab and telling the nerds, “tough luck, we own your AI now”? Couldn’t reality have been a little more creative than that?

The point is, given the developments of the past couple weeks, I think we now need to retire forever the argument against future AI scenarios that goes, “sorry, that sounds too much like a science-fiction plot.” As has been said, you’d best get used to science fiction because you’re living in one!


Updates and Further Thoughts: Of course I’ve seen that Anthropic has now filed a lawsuit to block the Pentagon from designating it a supply chain risk, arguing that both its free speech and due process rights were violated. I hope their lawsuit succeeds; it’s hard for me to imagine how it wouldn’t.

The fact that I’m, obviously, on Anthropic’s side of this particular dispute doesn’t mean that I’ll always be on Anthropic’s side. Here as elsewhere, it’s crucial not to outsource your conscience to anyone.

Zvi makes an extremely pertinent comparison:

[In shutting down Starlink over Ukraine,] Elon Musk actively did the exact thing [the Pentagon is] accusing Anthropic of maybe doing. He made a strategic decision of national security at the highest level as a private citizen, in the middle of an active military operation in an existential defensive shooting war, based on his own read of the situation. Like, seriously, what the actual fuck.

Eventually we bought those services in a contract. We didn’t seize them. We didn’t arrest Musk. Because a contract is a contract is a contract, and your private property is your private property, until Musk decides yours don’t count.

Another key quote in Zvi’s piece, from Gregory Allen:

And here’s the thing. I spent so much of my life in the Department of Defense trying to convince Silicon Valley companies, “Hey, come on in, the water is fine, the defense contracting market, you know, you can have a good life here, just dip your toe in the water”.

And what the Department of Defense has just said is, “Any company that dips their toe in the water, we reserve the right to grab their ankle, pull them all the way in at any time”. And that is such a disincentive to even getting started in working with the DoD.

Lastly, I’d like to address the most common counterargument against Anthropic’s position—as expressed for example by Noah Smith, or in the comments of my previous post on this. The argument goes roughly like so:

You, nerds, are the ones who’ve been screaming for years about AI being potentially existentially dangerous! So then, did you seriously expect to stay in control of the technology? If it’s really as dangerous and important as you say, then of course the military was going to step in at some point and commandeer your new toy, just like it would if you were building a nuclear weapon.

Two immediate responses:

  1. Even in WWII, in one of the most desperate circumstances in human history, the US government didn’t force a single scientist at gunpoint to build nuclear weapons for them. The scientists did so voluntarily, based on their own considered moral judgment at the time (even if some later came to regret their involvement).
  2. Even if I considered it “inevitable” that relatively thoughtful and principled people, like Dario Amodei, would lose control over the future to gleeful barbarians like Pete Hegseth, it still wouldn’t mean I couldn’t complain when it happened. This is still a free country, isn’t it?

March 09, 2026

Jordan EllenbergThe Irrational Decision

Ben Recht’s book The Irrational Decision: How We Gave Computers the Power to Choose for Us comes out tomorrow! I was privileged to get my copy early. And if you want an opinionated, informed, no-bullshit take on what optimization actually does and what purpose (and whose purposes) it serves, written by somebody who knows this subject inside and out, this, my readers, is your March reading. Here’s the Amazon purchase page if that’s how you like to get books. And if you’re not already reading Ben’s arg min blog, get thee thither! It’s where all the saltiest action in inference, statistics, and machine learning is going on.

Jordan EllenbergIn which I flex music knowledge for my daughter

A nice but unfamiliar power-pop number came on in the pizzeria where we were having lunch. “I like this,” I said. “It sounds like it’s from around 2005, kind of a Fountains of Wayne sound.”

Indeed, this song came out in 2005, and it was written by Adam Schlesinger (RIP) of Fountains of Wayne. When I’m good, I’m good!

March 08, 2026

Scott Aaronson The “JVG algorithm” is crap

Sorry to interrupt your regular programming about the AI apocalypse, etc., and return to the traditional beat of this blog’s very earliest years … but I’ve now gotten multiple messages asking me to comment on something called the “JVG (Jesse–Victor–Gharabaghi) algorithm” (yes, the authors named it after themselves). This is presented as a massive improvement over Shor’s factoring algorithm, one that could (according to popular articles) allow RSA-2048 to be broken using only 5,000 physical qubits.

On inspection, the paper’s big new idea is that, in the key step of Shor’s algorithm where you compute x^r mod N in a superposition over all r’s, you instead precompute the x^r mod N’s on a classical computer and then load them all into the quantum state.

Alright kids, why does this not work? Shall we call on someone in the back of the class—like, any undergrad quantum computing class in the world? Yes class, that’s right! There are exponentially many r’s. Computing them all takes exponential time, and loading them into the quantum computer also takes exponential time. We’re out of the n^2-time frying pan but into the 2^n-time fire. This can only look like it wins on tiny numbers; on large numbers it’s hopeless.
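To put rough numbers on it (a back-of-the-envelope count of mine, not from the paper): for an n-bit modulus N, the exponent register in Shor’s algorithm ranges over roughly N^2 \approx 2^{2n} values of r, each of which the JVG scheme would have to compute and load classically, versus roughly n^3 gates (with schoolbook arithmetic) for the genuinely quantum modular exponentiation:

\[
\underbrace{2^{2n}}_{\text{values of } x^r \bmod N \text{ to precompute and load}}
\quad\text{vs.}\quad
\underbrace{O(n^3)}_{\text{quantum modular exponentiation}},
\qquad
n = 2048 \;\Rightarrow\; 2^{4096} \approx 10^{1233}.
\]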

If you want to see people explaining the same point more politely and at greater length, try this from Hacker News or this from Postquantum.com.

Even for those who know nothing about quantum algorithms, is there anything that could’ve raised suspicion here?

  1. The paper didn’t appear on the arXiv, but someplace called “Preprints.org.” Come to think of it, I should add this to my famous Ten Signs a Claimed Mathematical Breakthrough is Wrong! It’s not that there isn’t tons of crap on the arXiv as well, but so far I’ve seen pretty much only crap on preprint repositories other than arXiv, ECCC, and IACR.
  2. Judging from a Google search, the claim seems to have gotten endlessly amplified on clickbait link-farming news sites, but ignored by reputable science news outlets—yes, even the usual quantum hypesters weren’t touching this one!

Often, when something is this bad, the merciful answer is to let it die in obscurity. In this case, I feel like there was a sufficient level of intellectual hooliganism, just total lack of concern for what’s true, that those involved deserve to have this Shtetl-Optimized post as a tiny bit of egg on their faces forever.

March 06, 2026

Matt von HippelPractice, Don’t Memorize, Understand Justifications, Not Stories

Teaching is one of those things that’s always controversial.

There seems to be a constant tug of war between two approaches. In one, thought of as old-fashioned and practical, students are expected to work hard, study to memorize facts and formulas, and end up with an impressive ability to reproduce the knowledge of the past. In the other, presented as more modern or more permissive, students aren’t supposed to memorize, but to understand, to get intuition for how things work, and are expected to end up more creative and analytical, able to come up with new ideas and understand things in ways their predecessors could not. This whole thing then gets muddled further with discussions of which skills actually matter in the modern day, with the technology of the hour standing in. If adults can use calculators, why should students be able to do arithmetic? If adults can use AI, why should students be able to draw, or write, or reason?

I’ve taught a little in my day, though likely less than I should. More frequently, I’ve learned. And, with apologies to the teachers and education experts who read this blog, I’ve got my own opinion.

I don’t think anyone in the old-fashioned/new-fashioned tug of war is thinking about education right.

People talk about memorization, when they should be talking about practice.

We want kids to be able to multiply and divide numbers. That’s not because they won’t have calculators. It’s because we want to teach them things that build on top of multiplying and dividing numbers. We want some of them to learn how to multiply and divide polynomials, and if you don’t know how to multiply and divide numbers, then learning to multiply and divide polynomials is almost impossible. We want some of them to learn abstract generalizations, groups and rings and fields, and if you’re not comfortable with the basics, then learning these is almost impossible. And for everyone, we want them to get used to making a logical argument why something is true, in a context where we can easily judge whether the argument works.

This doesn’t mean that we need students to memorize their times tables, though. It helps, sure. But we don’t actually care whether students can recite 5 times 7 equals 35, that’s not our end goal. Instead, we want to make sure that students can do these operations, and that they find them easy to do. And ultimately, that doesn’t come from memorization, but from practice. It comes from using the ideas, again and again, until it’s obvious how to step ahead to the results. You can’t replicate that with pure understanding, like some more modern approaches try to. You need the “muscle memory”, and that takes real practice. But you also can’t get there by memorizing isolated facts for an exam. You need to use them.

Understanding is important too, though. We need students to know the limits of their knowledge, not just what they’ve been taught but why it’s true. It’s the only way to get adults who can generalize, who can accept that maybe there is a type of math with numbers that square to zero without dismissing it as a plot to corrupt the youth. It’s the only way to get students who can go to the next level, and the next, and then generate new knowledge on their own.

But that understanding often gets left by the wayside, when teachers forget what it’s for. If you try to teach the Pythagorean theorem by showing a few examples, or tell students stories where different types of energy are different “stuff”, you’re trying to convey an intuitive understanding, but not the useful kind. What you’re trying to give the students is stories about how things work. But the kind of understanding we need students to have isn’t of stories. It’s of justifications, and arguments. Students should understand why what they are taught is true, and understanding why doesn’t mean having a feeling in their hearts about it: it means they can convince a skeptic.

It’s easier, for a world full of overworked teachers from a variety of backgrounds, to teach the simpler versions of these. It’s easy for a traditionalist teacher to drill their students on memorization, and test them on memorization. It’s easier for a sympathetic teacher to tell students stories, based on stories the teacher thinks they understand.

But if you want the traditionalist approach to work, you have to actually do things, to practice using ideas rather than merely know them, to have that experience down as reflexively as those times tables. And if you want the modern approach to work, you have to actually understand why what you’re teaching is true, the way you would convince a skeptic that it is true, and then convey those justifications to the students.

And if you, instead, are a student:

Don’t worry about memorizing facts, you’ll drill too hard and stress yourself out. Don’t worry about finding a comfortable story, because no story is true. Use the ideas you’re learning. Use them to convince yourself, and to convince others. Use them again and again, until you reach for them as easily as breathing. When you can use what you’re learning, and know why it holds, then you’re ready to move forward.

March 05, 2026

Scott Aaronson Moar Updatez

To start on a somber note: those of us at UT Austin are in mourning this week for Savitha Shan, an undergrad double major here in economics and information systems, who was murdered over the weekend by an Islamist terrorist who started randomly shooting people on Sixth Street, apparently angry about the war in Iran. Two other innocents were also killed.

As it happens, these murders happened just a few hours after the end of my daughter’s bat mitzvah, and in walking distance from the venue. The bat mitzvah itself was an incredibly joyful and successful event that consumed most of my time lately, and which I might or might not say more about—the nastier the online trolls get, the more I need to think about my family’s privacy.


Of all the many quantum computing podcasts/interviews I’ve done recently, I’m probably happiest with this one, with Yuval Boger of QuEra. It covers all the main points about where the hardware currently is, the threat to public-key cryptography, my decades-long battle against quantum applications hype, etc. etc., and there’s even an AI-created transcript that eliminates my verbal infelicities!


A month ago, I blogged about “The Time I Didn’t Meet Jeffrey Epstein” (basically, because my mom warned me not to). Now the story has been written up in Science magazine, under the clickbaity headline “Meet Three Scientists Who Said No to Epstein.” (Besides yours truly, the other two scientists are friend-of-the-blog Sean Carroll, whose not-meeting-Epstein story I’d already heard directly from him, and David Agus, whose story I hadn’t heard.)

To be clear: as I explained in my post, I never actually said “no” to Epstein. Instead, based on my mom’s advice, I simply failed to follow up with his emissary, to the point where no meeting ever happened.

Anyway, ever since Science ran this story and it started making the rounds on social media, my mom has been getting congratulatory messages from friends of hers who saw it!


I’ve been a huge fan of the philosopher-novelist Rebecca Newberger Goldstein ever since I read her celebrated debut work, The Mind-Body Problem, back in 2005. Getting to know Rebecca and her husband, Steven Pinker, was a highlight of my last years at MIT. So I’m thrilled that Rebecca will be visiting UT Austin next week to give a talk on Spinoza, related to her latest book The Mattering Instinct (which I’m reading right now), and hosted by me and my colleague Galen Strawson in UT’s philosophy department. More info is in the poster below. If you’re in Austin, I hope to see you there!


The 88-year-old Donald Knuth has published a 5-page document about how Claude was able to solve a tricky graph theory problem that arose while he was working on the latest volume of The Art of Computer Programming—a series that Knuth is still writing after half a century. As you’d expect from Knuth, the document is almost entirely about the graph theory problem itself and Claude’s solution to it, eschewing broader questions about the nature of machine intelligence and how LLMs are changing life on Earth. To anyone who’s been following AI-for-math lately, the fact that Claude now can help with this sort of problem won’t come as a great shock. The virality is presumably because Knuth is such a legend that to watch him interact productively with an LLM is sort of like watching Leibniz, Babbage, or Turing do the same.


John Baez is a brilliant mathematical physicist and writer, who was blogging about science before the concept of “blogging” even existed, and from whom I’ve learned an enormous amount. But regarding John’s quest for the past 15 years — namely, to use category theory to help solve the climate crisis (!) — I always felt like the Cookie Monster would, with equal intellectual justification, say that the key to arresting climate change was for him to eat more Oreos. Then I read this Quanta article on the details of Baez’s project, and … uh … I confess it failed to change my view. Maybe someday I’ll understand why it’s better to say using category theory what I would’ve said in a 100x simpler way without category theory, but I fear that day is not today.

Tommaso DorigoA Nice Little Combination

Although I have long retired from serious chess tournaments (they take too much time, a luxury I do not have anymore - even more so now that I have two infants to help grow!), I insist on playing online blitz on chess.com, with alternating fortunes. My Elo rating hovers in the 2200-2300 range, signalling that I still have my wits about me (I figure it is a very good way to keep watch on my mental capabilities: if Alzheimer’s lurks, I will spot it early). 


Jordan EllenbergThe ex-prospect who only hit homers

Vance Honeycutt is 22 years old and already widely considered a bust. Lots of power in college, Orioles’ first-round draft pick in 2024, but has a gigantic hole in his swing and spent 2025 striking out all the time, absurdly much, so much that people figured he was simply not going to learn to hit.

He has come to the plate four times in spring training and has hit four home runs. The last one went 471 feet. The sample size, she is, how we say, small. But I’m glad the guy is getting this moment of glory, at least, even if it ends up being short.

This was one of my favorite books as a kid:

February 28, 2026

John PreskillWhat is next in quantum advantage?

We are now at an exciting point in our process of developing quantum computers and understanding their computational power: It has been demonstrated that quantum computers can outperform classical ones (if you buy my argument from Parts 1 and 2 of this mini series). And it has been demonstrated that quantum fault-tolerance is possible for at least a few logical qubits. Together, these form the elementary building blocks of useful quantum computing.

And yet: the devices we have seen so far are still nowhere near being useful for any advantageous application in, say, condensed-matter physics or quantum chemistry, which is where the promise of quantum computers lies.

So what is next in quantum advantage?

This is what this third and last part of my mini-series on the question “Has quantum advantage been achieved?” is about.

The 100 logical qubits regime
I want to have in mind the regime in which we have 100 well-functioning logical qubits, so 100 qubits on which we can run maybe 100 000 gates.

Building devices operating in this regime will require thousand(s) of physical qubits and is therefore well beyond the proof-of-principle quantum advantage and fault-tolerance experiments that have been done. At the same time, it is (so far) still one or more orders of magnitude away from any of the first applications such as simulating, say, the Fermi-Hubbard model or breaking cryptography. In other words, it is a qualitatively different regime from the early fault-tolerant computations we can do now. And yet, there is not a clear picture for what we can and should do with such devices.

The next milestone: classically verifiable quantum advantage

In this post, I want to argue that a key milestone we should aim for in the 100 logical qubit regime is classically verifiable quantum advantage. Achieving this will not only require the jump in quantum device capabilities but also finding advantage schemes that allow for classical verification using these limited resources.

Why is it an interesting and feasible goal and what is it anyway?

To my mind, the biggest weakness of the RCS experiments is the way they are verified. I discussed this extensively in the last posts—verification uses XEB, which can be classically spoofed and is only actually measured in the simulatable regime. Really, in a quantum advantage experiment I would want there to be an efficient procedure that will, without any reasonable doubt, convince us that a computation must have been performed by a quantum computer when we run it. In what I think of as classically verifiable quantum advantage, a (classical) verifier would come up with challenge circuits which they would then send to a quantum server. These would be designed in such a way that once the server returns classical samples from those circuits, the verifier can convince herself that the server must have run a quantum computation.

The theoretical computer scientist’s cartoon of verifying a quantum computer.

This is the jump from a physics-type experiment (the sense in which advantage has been achieved) to a secure protocol that can be used in settings where I do not want to trust the server and the data it provides me with. Such security may also allow a first application of quantum computers: to generate random numbers whose genuine randomness can be certified—a task that is impossible classically.

Here is the problem: On the one hand, we do know of schemes that allow us to classically verify that a computer is quantum and generate random numbers, so-called cryptographic proofs of quantumness (PoQ). A proof of quantumness is a highly reliable scheme in that its security relies on well-established cryptography. The big drawback is that these schemes require a large number of qubits and operations, comparable to the resources required for factoring. On the other hand, the computations we can run in the advantage regime—basically, random circuits—are very resource-efficient but not verifiable.

The 100-logical-qubit regime lies right in the middle, and it seems more than plausible that classically verifiable advantage is possible in this regime. The theory challenge ahead of us is to find it: a quantum advantage scheme that is very resource-efficient like RCS and also classically verifiable like proofs of quantumness.

To achieve verifiable advantage in the 100-logical-qubit regime we need to close the gap between random circuit sampling and proofs of quantumness.

With this in mind, let me spell out some concrete goals that we can achieve using 100 logical qubits on the road to classically verifiable quantum advantage.

1. Demonstrate fault-tolerant quantum advantage

Before we talk about verifiable advantage, the first experiment I would like to see is one that combines the two big achievements of the past years, and shows that quantum advantage and fault-tolerance can be achieved simultaneously. Such an experiment would be similar in type to the RCS experiments, but run on encoded qubits with gate sets that match that encoding. During the computation, noise would be suppressed by correcting for errors using the code. In doing so, we could reach the near-perfect regime of RCS as opposed to the finite-fidelity regime that current RCS experiments operate in (as I discussed in detail in Part 2).

Random circuits with a quantum advantage that are particularly easy to implement fault-tolerantly are so-called IQP circuits. In those circuits, the gates are controlled-NOT gates and diagonal gates, so rotations $Z(\theta, z)$, which just add a phase to a basis state $\ket{x}$ as $Z(\theta, z)\ket{x} = \exp(i \theta \, z \cdot x)\ket{x}$. The only “quantumness” comes from the fact that each input qubit is in the superposition state $\ket{+} = \ket{0} + \ket{1}$, and that all qubits are measured in the $X$ basis. Here is an example of an IQP circuit:

An IQP circuit starts from the all-$\ket{0}$ state by applying a Hadamard transform, followed by IQP gates (in this case $Z(\pi/4, 1011)$, some CNOT gates, $Z(\pi/5, 1100)$, some CNOT gates, $Z(\pi/3, 1111)$), and ends in a measurement in the Hadamard basis.

As it so happens, IQP circuits are already really well understood since one of the first proposals for quantum advantage was based on IQP circuits (VerIQP1), and for a lot of the results in random circuits, we have precursors for IQP circuits, in particular, their ideal and noisy complexity (SimIQP). This is because their near-classical structure makes them relatively easy to study. Most importantly, their outcome probabilities are simple (but exponentially large) sums over phases $e^{i\theta}$ that can just be read off from which gates are applied in the circuit, and we can use well-established classical techniques like Boolean analysis and coding theory to understand those.
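Spelled out (a standard way of writing this, not specific to any of the cited papers): if the diagonal part of the circuit acts on basis states as $U_D \ket{z} = e^{i f(z)} \ket{z}$ for $z \in \{0,1\}^n$, then the probability of observing the outcome $x$ is

\[
p(x) \;=\; \left| \bra{x} H^{\otimes n} U_D H^{\otimes n} \ket{0^n} \right|^2
\;=\; \left| \frac{1}{2^n} \sum_{z \in \{0,1\}^n} (-1)^{x \cdot z}\, e^{i f(z)} \right|^2 ,
\]

an exponentially large sum of phases that can be read off directly from the gates of the circuit.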

IQP gates are natural for fault-tolerance because there are codes in which all the operations involved can be implemented transversally: a logical gate requires only parallel physical single- or two-qubit gates, in stark contrast to universal circuits, which require resource-intensive fault-tolerant protocols. Running computations with IQP circuits would also be a step towards running real computations, in that they can involve structured components such as cascades of CNOT gates and the like. These show up all over fault-tolerant constructions of algorithmic primitives such as arithmetic or phase estimation circuits.

Our concrete proposal for an IQP-based fault-tolerant quantum advantage experiment in reconfigurable-atom arrays is based on interleaving diagonal gates and CNOT gates to achieve super-fast scrambling (ftIQP1). A medium-size version of this protocol was implemented by the Harvard group (LogicalExp) but with only a bit more effort, it could be performed in the advantage regime.

In those proposals, verification still suffers from the same problems as standard RCS experiments, so what’s up next is to fix that!

2. Closing the verification loophole

I said that a key milestone for the 100-logical-qubit regime is to find schemes that lie in between RCS and proofs of quantumness in terms of their resource requirements but at the same time allow for more efficient and more convincing verification than RCS. Naturally, there are two ways to approach this space—we can make quantum advantage schemes more verifiable, and we can make proofs of quantumness more resource-efficient.

First, let’s focus on the former approach and set a more moderate goal than full-on classical verification of data from an untrusted server. Are there variants of RCS that allow us to efficiently verify that finite-fidelity RCS has been achieved if we trust the experimenter and the data they hand us?

2.1 Efficient quantum verification using random circuits with symmetries

Indeed, there are! I like to think of the schemes that achieve this as random circuits with symmetries. A symmetry is an operator S such that the outcome state of the computation \ket{C} (or some intermediate state) is invariant under the symmetry, so S \ket{C} = \ket{C}. The idea is then to find circuits C that exhibit a quantum advantage and at the same time have symmetries that can be easily measured, say, using only single-qubit measurements or a single gate layer. Then, we can use these measurements to check whether or not the pre-measurement state respects the symmetries. This is a test for whether the quantum computer prepared the correct state, because errors or deviations from the true state would violate the symmetry (unless they were adversarially engineered).

In random circuits with symmetries, we can thus use small, well-characterized measurements whose outcomes we trust to probe whether a large quantum circuit has been run correctly. This is possible in a scenario I call the trusted experimenter scenario.

The trusted experimenter scenario
In this scenario, we receive data from an actual experiment in which we trust that certain measurements were actually and correctly performed.

I think of random circuits with symmetries as introducing measurements in the circuit that check for errors.

Here are some examples of random circuits with symmetries, which allow for efficient verification of quantum advantage in the trusted experimenter scenario.

Graph states. My first example is locally rotated graph states (GStates). These are states prepared by CZ gates acting according to the edges of a graph on an initial all-\ket{+} state, after which a layer of single-qubit Z-rotations is applied before a measurement in the X basis. (Yes, this is also an IQP circuit.) The symmetries of this circuit are locally rotated Pauli operators, and can therefore be measured using only single-qubit rotations and measurements. What is more, these symmetries fully determine the graph state. Determining the fidelity then just amounts to averaging the expectation values of the symmetries, which is so efficient you can even do it in your head. In this example, measuring the outcome state to obtain hard-to-reproduce samples and measuring the symmetries require two different (single-qubit) bases.

With 100 logical qubits, samples from classically intractable graph states on several hundred qubits could easily be generated.
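
As a rough sketch of the ingredients (my own illustration, not code from GStates): the stabilizer generators of a graph state are K_v = X_v \prod_{w \sim v} Z_w, one per vertex, and the fidelity estimate described above is just the average of their measured expectation values. For the locally rotated graph states, the symmetries are these generators conjugated by the single-qubit rotations; I do not model the measurement data itself here.

```python
import numpy as np

# Sketch: build the stabilizer generators K_v = X_v * prod_{w ~ v} Z_w of a graph state
# from its adjacency structure, and average externally supplied expectation values of
# those generators, which is the fidelity estimate described above.

def graph_state_stabilizers(adjacency):
    """adjacency: dict mapping each vertex to the set of its neighbours."""
    stabilizers = {}
    for v, neighbours in adjacency.items():
        pauli = {v: "X"}
        pauli.update({w: "Z" for w in neighbours})
        stabilizers[v] = pauli  # e.g. {0: "X", 1: "Z"} stands for X_0 Z_1
    return stabilizers

def fidelity_estimate(measured_expectations):
    """Average of the measured stabilizer expectation values."""
    return float(np.mean(list(measured_expectations.values())))

# Toy usage: a 3-vertex path graph with made-up expectation values.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
print(graph_state_stabilizers(adj))
print(fidelity_estimate({0: 0.97, 1: 0.95, 2: 0.96}))
```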

Bell sampling. The drawback of this approach is that we need to make two different measurements for verification and sampling. But it would be much neater if we could just verify the correctness of a set of classically hard samples using only those samples. For an example where this is possible, consider two copies of the output state \ket{C} of a random circuit, so \ket{C} \otimes \ket{C}. This state is invariant under a swap of the two copies, and in fact the expectation value of the SWAP operator \mathbb{S} in a noisy preparation \rho_C of \ket{C} \otimes \ket{C} determines the purity of the state, P = \mathrm{tr}[\rho_C \mathbb{S}]. It turns out that measuring all pairs of qubits in the state \ket{C} \otimes \ket{C} in the pairwise basis of the four Bell states \ket{\sigma} = (\sigma \otimes \mathbb{1})(\ket{00} + \ket{11}), where \sigma is one of the four Pauli matrices \mathbb{1}, X, Y, Z, is hard to simulate classically (BellSamp). You may also observe that the SWAP operator is diagonal in the Bell basis, so its expectation value can be extracted from the Bell-basis measurements—our hard-to-simulate samples. To do this, we just average sign assignments to the samples according to their parity.

If the circuit is random, then under the same assumptions as those used in XEB for random circuits, the purity is a good estimator of the fidelity, so P \approx F(\rho_C, \ket{C} \otimes \ket{C}). So here is an example where efficient verification is possible directly from hard-to-simulate classical samples, under the same assumptions as those used to argue that XEB equals fidelity.
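
Here is a minimal sketch of how that averaging works (my own illustration, with the Bell states labelled by two bits): each shot returns, for every qubit pair across the two copies, a label (x, z) identifying the Bell state (X^x Z^z \otimes \mathbb{1})(\ket{00} + \ket{11})/\sqrt{2}, and the global SWAP operator has eigenvalue -1 exactly on the singlet (x = z = 1), so each shot contributes the sign (-1)^{\sum_i x_i z_i}.

```python
import numpy as np

# Sketch of the purity estimator from Bell-basis samples. Each shot gives, for every
# pair (qubit i of copy 1, qubit i of copy 2), two bits (x, z) labelling the Bell state
# (X^x Z^z tensor 1)(|00> + |11>)/sqrt(2). SWAP is diagonal in this basis with
# eigenvalue -1 only on the singlet (x = z = 1), so the per-shot sign is
# (-1)^(sum_i x_i * z_i), and averaging over shots estimates P = tr[rho_C S].

def purity_estimate(shots):
    """shots: array of shape (num_shots, n_qubits, 2) of Bell-basis outcome bits."""
    shots = np.asarray(shots)
    signs = (-1.0) ** np.sum(shots[..., 0] * shots[..., 1], axis=1)
    return signs.mean()

# Toy usage with fabricated outcomes (5 shots, 3 qubit pairs).
rng = np.random.default_rng(0)
print(purity_estimate(rng.integers(0, 2, size=(5, 3, 2))))
```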

With 100 logical qubits, we can achieve quantum advantage that is at least as hard as in the current RCS experiments and that can also be efficiently (physics-)verified from the classical data.

Fault-tolerant circuits. Finally, suppose that we run a fault-tolerant quantum advantage experiment. Then, there is a natural set of symmetries of the state at any point in the circuit, namely, the stabilizers of the code we use. In a fault-tolerant experiment we repeatedly measure those stabilizers mid-circuit, so why not use that data to assess the quality of the logical state? Indeed, it turns out that the logical fidelity can be estimated efficiently from stabilizer expectation values even in situations in which the logical circuit has a quantum advantage (SyndFid).

With 100 logical qubits, we could therefore just run fault-tolerant IQP circuits in the advantage regime (ftIQP1) and the syndrome data would allow us to estimate the logical fidelity.


In all of these examples of random circuits with symmetries, coming up with classical samples that pass the verification tests is very easy, so the trusted-experimenter scenario is crucial for this to work. (Note, however, that it may be possible to add tests to Bell sampling that make spoofing difficult.) At the same time, these proposals are very resource-efficient in that they only increase the cost of a pure random-circuit experiment by a relatively small amount. What is more, the required circuits have more structure than random circuits in that they typically require gates that are natural in fault-tolerant implementations of quantum algorithms.

Performing random circuit sampling with symmetries is therefore a natural next step en route both to classically verifiable advantage that closes the no-efficient-verification loophole, and to implementing actual algorithms.

What if, however, we do not want to afford that level of trust in the person who runs the quantum circuit?

2.2 Classical verification using random circuits with planted secrets

If we do not trust the experimenter, we are in the untrusted quantum server scenario.

The untrusted quantum server scenario
In this scenario, we delegate a quantum computation to an untrusted (presumably remote) quantum server—think of using a Google or Amazon cloud server to run your computation. We can communicate with this server using classical information.

In the untrusted server scenario, we can hope to use ideas from proofs of quantumness such as the use of classical cryptography to design families of quantum circuits in which some secret structure is planted. This secret structure should give the verifier a way to check whether a set of samples passes a certain verification test. At the same time it should not be detectable, or at least not be identifiable from the circuit description alone.

The simplest example of such a secret structure could be a large peak in an otherwise flat output distribution of a random-looking quantum circuit. To do this, the verifier would pick a (random) string x and design a circuit such that the probability of seeing x in the samples is large. If the peak is hidden well, finding it just from the circuit description would require searching through all of the 2^n outcome bit strings, and even just determining one of the outcome probabilities is exponentially difficult. A classical spoofer trying to fake the samples from a quantum computer would then be caught immediately: the list of samples they hand the verifier will not even contain x unless they are unbelievably lucky, since there are exponentially many possible choices of x.
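
For illustration, the verifier’s side of such a peaked-circuit test is almost trivial; this is my own sketch with a placeholder threshold, and as discussed next, no suitable circuit family is actually known yet.

```python
from collections import Counter

# Sketch of the verifier's test for a hypothetical peaked circuit: the planted string
# x_secret should appear with a frequency close to the (known) peak weight, while any
# list of guessed samples contains it with probability of order len(samples) / 2^n.

def passes_peak_test(samples, x_secret, threshold):
    """threshold: minimum acceptable frequency of x_secret, set from the planted peak."""
    observed = Counter(samples)[x_secret] / len(samples)
    return observed >= threshold

# Hypothetical usage: passes_peak_test(samples, x_secret="0110...", threshold=0.2)
```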

Unfortunately, planting such secrets seems to be very difficult using universal circuits, since their output distributions are so unstructured. This is why we have not yet found good candidate families of circuits with peaks, although some attempts have been made (Peaks, ECPeaks, HPeaks).

We do have a promising candidate, though—IQP circuits! The fact that the output distributions of IQP circuits are quite simple could very well help us design sampling schemes with hidden secrets. Indeed, the idea of hiding peaks was pioneered by Shepherd and Bremner (VerIQP1), who found a way to design classically hard IQP circuits with a large hidden Fourier coefficient. The presence of this large Fourier coefficient can easily be checked from a few classical samples, and random IQP circuits do not have any large Fourier coefficients. Unfortunately, for that construction and a variation thereof (VerIQP2), it turned out that the large coefficient can be detected quite easily from the circuit description (ClassIQP1, ClassIQP2).
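
The corresponding check is similarly lightweight. Here is a sketch of the kind of test the verifier runs (simplified by me, with a placeholder threshold rather than the bias value fixed by the actual construction): given the secret string s, count how biased the samples are towards strings orthogonal to s, something samples forged without knowledge of s would not be.

```python
# Sketch of a Shepherd-Bremner-style verification test. The verifier keeps a secret
# string s; honest samples from the planted IQP circuit are biased towards x with
# x . s = 0 (mod 2), while uniformly random strings hit 1/2. The threshold below is
# a placeholder; the real protocol fixes it from the construction.

def secret_bias(samples, s):
    """Fraction of sampled bit strings x (as 0/1 tuples) with x . s = 0 mod 2."""
    orthogonal = lambda x: sum(xi * si for xi, si in zip(x, s)) % 2 == 0
    return sum(orthogonal(x) for x in samples) / len(samples)

def passes_iqp_test(samples, s, threshold=0.75):
    return secret_bias(samples, s) >= threshold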

To this day, it remains an exciting open question whether secrets can be planted in (maybe IQP) circuit families in a way that allows for efficient classical verification. Even finding a scheme with some large gap between verification and simulation times would be exciting, because it would for the first time allow us to verify a quantum computing experiment in the advantage regime using only classical computation.

Towards applications: certifiable random number generation

Beyond verified quantum advantage, sampling schemes with hidden secrets may be usable to generate classically certifiable random numbers: you sample from the output distribution of a random circuit with a planted secret, and verify that the samples come from the correct distribution using the secret. If the distribution has sufficiently high entropy, truly random numbers can be extracted from the samples. The same can be done for RCS, except that some acrobatics are needed to get around the problem that verification is just as costly as simulation (CertRand, CertRandExp). Again, a large gap between verification and simulation times would probably permit such certified random number generation.

The goal here is firstly a theoretical one: Come up with a planted-secret RCS scheme that has a large verification-simulation gap. But then, of course, it is an experimental one: actually perform such an experiment to classically verify quantum advantage.

Should an IQP-based scheme of circuits with secrets exist, 100 logical qubits is the regime where it should give a relevant advantage.

Three milestones

Altogether, I have proposed three milestones for the 100-logical-qubit regime.

  1. Perform fault-tolerant quantum advantage using random IQP circuits. This will allow an improvement of the fidelity towards performing near-perfect RCS and thus close the scalability worries about noisy quantum advantage that I discussed in my last post.
  2. Perform RCS with symmetries. This will allow for efficient verification of quantum advantage in the trusted experimenter scenario and thus make a first step toward closing the verification loophole.
  3. Find and perform RCS schemes with planted secrets. This will allow us to verify quantum advantage in the remote untrusted server scenario and presumably give a first useful application of quantum computers to generate classically certified random numbers.

All of these experiments are natural steps towards performing actually useful quantum algorithms in that they use more structured circuits than just random universal circuits and can be used to benchmark the performance of the quantum devices in an advantage regime. Moreover, all of them close some loophole of the previous quantum advantage demonstrations, just like follow-up experiments to the first Bell tests have closed the loopholes one by one.

I argued that IQP circuits will play an important role in achieving those milestones since they are a natural circuit family in fault-tolerant constructions and promising candidates for random circuit constructions with planted secrets. Developing a better understanding of the properties of the output distributions of IQP circuits will help us achieve the theory challenges ahead.

Experimentally, the 100-logical-qubit regime is exactly the regime to shoot for with those circuits: while IQP circuits are somewhat easier to simulate than universal random circuits, 100 qubits is well within the classically intractable regime.


What I did not talk about

Let me close this mini-series by touching on a few things that I would have liked to discuss more.

First, there is the OTOC experiment by the Google team (OTOC) which has spawned quite a debate. This experiment claims to achieve quantum advantage for an arguably more natural task than sampling, namely, computing expectation values. Computing expectation values is at the heart of quantum-chemistry and condensed-matter applications of quantum computers. And it has the nice property that it is what the Google team called “quantum-verifiable” (and what I would call “hopefully-in-the-future-verifiable”) in the following sense: Suppose we perform an experiment to measure a classically hard expectation value on a noisy device now, and suppose this expectation value actually carries some signal, so it is significantly far away from zero. Once we have a trustworthy quantum computer in the future, we will be able to check that the outcome of this experiment was correct and hence quantum advantage was achieved. There is a lot of interesting science to discuss about the details of this experiment and maybe I will do so in a future post.

Finally, I want to mention an interesting theory challenge that relates to the noise-scaling arguments I discussed in detail in Part 2: The challenge is to understand whether quantum advantage can be achieved in the presence of a constant amount of local noise. What do we know about this? On the one hand, log-depth random circuits with constant local noise are easy to simulate classically (SimIQP,SimRCS), and we have good numerical evidence that random circuits at very low depths are easy to simulate classically even without noise (LowDSim). So is there a depth regime in between the very low depth and the log-depth regime in which quantum advantage persists under constant local noise? Is this maybe even true in a noise regime that does not permit fault-tolerance (see this interesting talk)? In the regime in which fault-tolerance is possible, it turns out that one can construct simple fault-tolerance schemes that do not require any quantum feedback, so there are distributions that are hard to simulate classically even in the presence of constant local noise.

So long, and thanks for all the fish!

I hope that in this mini-series I have convinced you that quantum advantage has been achieved. There are some open loopholes, but if you are happy with physics-level experimental evidence, then you should be convinced that the RCS experiments of the past years have demonstrated quantum advantage.

As the devices are getting better at a rapid pace, there is a clear goal that I hope will be achieved in the 100-logical-qubit regime: demonstrate fault-tolerant and verifiable advantage (for the experimentalists) and come up with the schemes to do that (for the theorists)! Those experiments would close the loopholes of the current RCS experiments. And they would work as a stepping stone towards actual algorithms in the advantage regime.


I want to end with a huge thanks to Spiros Michalakis, John Preskill and Frederik Hahn who have patiently read and helped me improve these posts!

References

Fault-tolerant quantum advantage

(ftIQP1) Hangleiter, D. et al. Fault-Tolerant Compiling of Classically Hard Instantaneous Quantum Polynomial Circuits on Hypercubes. PRX Quantum 6, 020338 (2025).

(LogicalExp) Bluvstein, D. et al. Logical quantum processor based on reconfigurable atom arrays. Nature 626, 58–65 (2024).

Random circuits with symmetries

(BellSamp) Hangleiter, D. & Gullans, M. J. Bell Sampling from Quantum Circuits. Phys. Rev. Lett. 133, 020601 (2024).

(GStates) Ringbauer, M. et al. Verifiable measurement-based quantum random sampling with trapped ions. Nat Commun 16, 1–9 (2025).

(SyndFid) Xiao, X., Hangleiter, D., Bluvstein, D., Lukin, M. D. & Gullans, M. J. In-situ benchmarking of fault-tolerant quantum circuits.
I. Clifford circuits. arXiv:2601.21472
II. Circuits with a quantum advantage. (coming soon!)

Verification with planted secrets

(PoQ) Brakerski, Z., Christiano, P., Mahadev, U., Vazirani, U. & Vidick, T. A Cryptographic Test of Quantumness and Certifiable Randomness from a Single Quantum Device. in 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS) 320–331 (2018).

(VerIQP1) Shepherd, D. & Bremner, M. J. Temporally unstructured quantum computation. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 465, 1413–1439 (2009).

(VerIQP2) Bremner, M. J., Cheng, B. & Ji, Z. Instantaneous Quantum Polynomial-Time Sampling and Verifiable Quantum Advantage: Stabilizer Scheme and Classical Security. PRX Quantum 6, 020315 (2025).

(ClassIQP1) Kahanamoku-Meyer, G. D. Forging quantum data: classically defeating an IQP-based quantum test. Quantum 7, 1107 (2023).

(ClassIQP2) Gross, D. & Hangleiter, D. Secret-Extraction Attacks against Obfuscated Instantaneous Quantum Polynomial-Time Circuits. PRX Quantum 6, 020314 (2025).

(Peaks) Aaronson, S. & Zhang, Y. On verifiable quantum advantage with peaked circuit sampling. arXiv:2404.14493

(ECPeaks) Deshpande, A., Fefferman, B., Ghosh, S., Gullans, M. & Hangleiter, D. Peaked quantum advantage using error correction. arXiv:2510.05262

(HPeaks) Gharibyan, H. et al. Heuristic Quantum Advantage with Peaked Circuits. arXiv:2510.25838

Certifiable random numbers

(CertRand) Aaronson, S. & Hung, S.-H. Certified Randomness from Quantum Supremacy. in Proceedings of the 55th Annual ACM Symposium on Theory of Computing 933–944 (Association for Computing Machinery, New York, NY, USA, 2023).

(CertRandExp) Liu, M. et al. Certified randomness amplification by dynamically probing remote random quantum states. arXiv:2511.03686

OTOC

(OTOC) Abanin, D. A. et al. Observation of constructive interference at the edge of quantum ergodicity. Nature 646, 825–830 (2025).

Noisy complexity

(SimIQP) Bremner, M. J., Montanaro, A. & Shepherd, D. J. Achieving quantum supremacy with sparse and noisy commuting quantum computations. Quantum 1, 8 (2017).

(SimRCS) Aharonov, D., Gao, X., Landau, Z., Liu, Y. & Vazirani, U. A polynomial-time classical algorithm for noisy random circuit sampling. in Proceedings of the 55th Annual ACM Symposium on Theory of Computing 945–957 (2023).

(LowDSim) Napp, J. C., La Placa, R. L., Dalzell, A. M., Brandão, F. G. S. L. & Harrow, A. W. Efficient Classical Simulation of Random Shallow 2D Quantum Circuits. Phys. Rev. X 12, 021021 (2022).

February 27, 2026

Matt von HippelHypothesis: If AI Is Bad at Originality, It’s a Documentation Problem

Recently, a few people have asked me about this paper.

A couple weeks back, OpenAI announced a collaboration with a group of amplitudes researchers, physicists who study the types of calculations people do to make predictions at particle colliders. The amplitudes folks had identified an interesting loophole, finding that a calculation many would have expected to be zero actually gave a nonzero answer. They did the calculation for different examples involving more and more particles, and got some fairly messy answers. They suspected, as amplitudes researchers always expect, that there was a simpler formula, one that worked for any number of particles. But they couldn’t find it.

Then a former amplitudes researcher at OpenAI suggested that they use AI to find it.

“Use AI” can mean a lot of different things, and most of them don’t look much like the way the average person talks to ChatGPT. This was closer than most. They were using “reasoning models”, loops that try to predict the next few phrases in a “chain of thought” again and again and again. Using that kind of tool, they were able to find that simpler formula, and mathematically prove that it was correct.

A few of you are hoping for an in-depth post about what they did, and its implications. This isn’t that. I’m still figuring out if I’ll be writing that for an actual news site, for money, rather than for free for you folks.

Instead, I want to talk about a specific idea I’ve seen crop up around the paper.

See, for some, the existence of a result like this isn’t all that surprising.

Mathematicians have been experimenting with reasoning models for a bit, now. Recently, a group published a systematic study, setting the AI loose on a database of minor open problems proposed by the famously amphetamine-fueled mathematician Paul Erdös. The AI managed to tackle a few of the problems, sometimes by identifying existing solutions that had not yet been linked to the problem database, but sometimes by proofs that appeared to be new.

The Erdös problems solved by the AI were not especially important. Neither was the problem solved by the amplitudes researchers, as far as I can tell at this point.

But I get the impression the amplitudes problem was a bit more interesting than the Erdös problems. The difference, so far, has mostly been attributed to human involvement. This amplitudes paper started because human amplitudes researchers found an interesting loophole, and only after that used the AI. Unlike the mathematicians, they weren’t just searching a database.

This lines up with a general point, one people tend to make much less carefully. It’s often said that, unlike humans, AI will never be truly creative. It can solve mechanical problems, do things people have done before, but it will never be good at having truly novel ideas.

To me, that line of thinking goes a bit too far. I suspect it’s right on one level, that it will be hard for any of these reasoning models to propose anything truly novel. But if so, I think it will be for a different reason.

The thing is, creativity is not as magical as we make it out to be. Our ideas, scientific or artistic, don’t just come from the gods. They recombine existing ideas, shuffling them in ways more akin to randomness than miracle. They’re then filtered through experience, deep heuristics honed over careers. Some people are good at ideas, and some are bad at them. Having ideas takes work, and there are things people do to improve their ideas. Nothing about creativity suggests it should be impossible to mechanize.

However, a machine trained on text won’t necessarily know how to do any of that.

That’s because in science, we don’t write down our inspirations. By the time a result gets into a scientific paper or textbook, it’s polished and refined into a pure argument, cutting out most of the twists and turns that were an essential part of the creative process. Mathematics is even worse: most math papers don’t even mention the motivation behind the work, let alone the path taken to the paper.

This lack of documentation makes it hard for students, making success much more a function of having the right mentors to model good practices, rather than being able to pick them up from literature everyone can access. I suspect it makes it even harder for language models. And if today’s language model-based reasoning tools are bad at that crucial, human-seeming step of coming up with the right idea at the right time? I think that has more to do with this lack of documentation than with the fact that they’re “statistical parrots”.

Terence TaoSLMath Deputy Director Search

[This post is written in my capacity as Vice-Chair of the Board of Trustees of SLMath. -T.]

SLMath, formerly MSRI, has launched the search for the next Deputy Director. This key position is a close advisor to the Director and shares in the internal management of the scientific team and programs at SLMath. This position is ideal for an experienced professional with a PhD in mathematical sciences seeking a new opportunity to leverage their strengths in program and grant management, financial management, and people management.

February 22, 2026

n-Category Café The Univalence Principle

(guest post by Dimitris Tsementzis, about joint work with Benedikt Ahrens, Paige North, and Mike Shulman)

The Univalence Principle is the informal statement that equivalent mathematical structures are indistinguishable. There are various ways of making this statement formally precise, and a long history of work that does so. In our recently published (but long in the making!) book we proved what is, to our knowledge, the most general version of this principle, which applies to set-based, categorical, and higher-categorical structures defined in a non-algebraic and space-based style, as well as models of higher-order theories such as topological spaces.

This work achieves three main goals. Firstly, it greatly extends the “Structure Identity Principle” from the original HoTT book, to include any (finite) level of structure, instead of just set-based structures, thus establishing in the strongest sense yet made precise that the Univalent Foundations provide an equivalence-invariant foundation for higher-categorical mathematics. Secondly, it provides very general novel definitions of equivalence between structures and between objects in a given structure that “compile” to most known notions of equivalence in known cases, but which can also be used to suggest notions in new settings; in doing so it extends M. Makkai’s classic work on First-Order Logic with Dependent Sorts (FOLDS). Thirdly, the setting in which our result is proved (a form of Two-Level Type Theory) provides a framework in which to do metamathematics in the Univalent Foundations/HoTT, i.e. carry out the mathematical study of how mathematics is formalized in UF/HoTT. The Univalence Principle we prove is a foundational metamathematical result in that sense.

Setting the Stage

Any “Univalence Principle” type of result has the following form:

\forall M, N \in \mathrm{Structures}(\mathcal{L}), \quad (M \simeq_{\mathcal{L}} N) \simeq (M = N)

The result gains in strength if the class of \mathrm{Structures}(\mathcal{L}) is as large as possible, and the notion of equivalence M \simeq_{\mathcal{L}} N between them coincides with known notions of equivalence in practice (where \mathcal{L} is a placeholder for a notion of signature in terms of which M and N are structures). Such a result also gains in strength if the middle notion of equivalence is as “structural” as possible, ensuring that not only properties of the structures are preserved, but also constructions built on top of them. This last feature can only really be pursued within a univalent type theory, like HoTT.

In our work, we define: diagrammatic signatures \mathcal{L} as inverse categories of finite height, \mathcal{L}-structures as Reedy-fibrant functors \mathcal{L} \to \mathcal{U}, and a notion of indiscernibility relating objects within structures, which yields a derived notion of equivalence \simeq_{\mathcal{L}} between structures (essentially structure-preserving bijections up to indiscernibility). The primitive notions \simeq and = are given by equivalence and equality in 2LTT, appropriately defined.

Signatures, Structures (and where they live)

Two-level type theory (2LTT)

In 2LTT, there is an external level for exo-types and other exo-gizmos (allowing strictly functorial constructions), and the usual fibrant (HoTT/UF) level where mathematical objects live. The external level is the metamathematical or syntactic level, where a strict equality exists that allows us to define syntax and symbols. Our exo-gizmos are analogous to sets of syntactical symbols used to define signatures in first-order logic.

Such syntax consists of categories with strict equality on composition, which we call exo-categories. Equalities here are strict (exo-equalities), while the internal world has homotopical paths. The 2LTT setting is convenient for type-theoretic reasoning, and allows us to neatly separate the various notions of equality at play.

Diagram signatures are inverse categories of finite height

A diagram signature \mathcal{L} is an inverse exo-category of finite height equipped with a rank functor \mathrm{rk} : \mathcal{L} \longrightarrow (\mathbb{N}^{\mathrm{exo}})^{\mathrm{op}} that reflects identities. Objects of \mathcal{L} are the sorts (analogous to “sorts” in first-order logic); morphisms encode dependencies (“an arrow depends on a pair of objects,” etc.). Inverse-ness gives matching objects and allows Reedy methods to apply.

To illustrate the idea, take the example of a reflexive graph. The diagram signature for reflexive graphs would be given by the following inverse (exo-)category

\array{ I \\ \downarrow^i \\ A \\ \downarrow_{c} \;\; \downarrow_{d} \\ O }

where c i = d i. The intuition is that we have a sort A of “arrows” between any two objects, and a predicate I (“identity”) that can be used to select which arrows are identity arrows, with the relation ensuring that this predicate can only be “asked” of loops.

Structures are Reedy fibrant diagrams over these signatures

Given this notion of a signature, a structure in our sense is simply a (Reedy fibrant) functor \mathcal{L} \to \mathcal{U}. In more detail, a raw \mathcal{L}-diagram is an exo-functor M : \mathcal{L} \to \mathcal{U}^{\mathrm{exo}}. For each sort K, the matching object \mathbb{M}_K M collects the compatible lower-rank data needed to specify the “boundary” for an element of M K. The Reedy fibrancy condition is: the canonical boundary map M K \to \mathbb{M}_K M is a fibration (i.e., a dependent type) for each K. The category of such Reedy-fibrant diagrams then forms a fibrant type \mathrm{Str}(\mathcal{L}) whose points are the \mathcal{L}-structures.

To illustrate with the example of \mathcal{L}_{rg} from above, an \mathcal{L}_{rg}-structure G in our sense would be given by the following data, in type-theoretic notation:

  • A type GO : \mathcal{U}
  • A family GA : GO \times GO \to \mathcal{U} dependent on GO
  • A family GI : \prod_x GA(x,x) \to \mathcal{U} for the “identity”

The trick here is to ensure the repeated x in the definition of GI, obeying the relations of the signature. This is what the matching object machinery achieves for arbitrary signatures.
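
To make the dependency pattern concrete, here is how the same data looks as a plain record of types (a rough sketch in ordinary Lean 4 rather than 2LTT; the names are mine):

```lean
universe u

-- An L_rg-structure as a record of types: arrows depend on a pair of objects, and the
-- "identity" predicate is only defined on loops, which is exactly the repeated x
-- discussed above.
structure ReflexiveGraphStructure where
  O : Type u                     -- the type GO of objects
  A : O → O → Type u             -- the family GA of arrows
  I : (x : O) → A x x → Type u   -- the family GI, asked only of loops A x x
```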

Other examples of \mathcal{L}-structures include: categories (with sorts for objects and morphisms), groupoids, n-categories for any n, preorders, and models of higher-order theories like topological spaces and sup-lattices. Furthermore, all of what we say applies to theories with axioms (not just structures), which we define in the book too.

Indiscernibility and local univalence

With our formal set-up complete, we address the central question: for arbitrary signatures \mathcal{L}, when are two “objects” in an \mathcal{L}-structure equivalent? Such an “object” could be a categorical (or higher-categorical) gadget itself when \mathcal{L} has height \gt 2, e.g. the “objects” of \mathbf{2\text{-}Cat} are themselves categories, which are \mathcal{L}'-structures for an \mathcal{L}' of lower height. The key idea is: two “objects” are indiscernible if nothing in the signature can distinguish them…up to indiscernibility!

Indiscernibilities and Local Univalence

Fix a signature \mathcal{L}, an \mathcal{L}-structure M, a rank 0 (“bottom”) sort K : \mathcal{L}(0), and elements a, b : M K. Think of K as a “set” of objects (e.g. of a category) on top of which properties and structures are defined.

To define when a and b are indiscernible, we package together everything that could distinguish them. The intuition is: if there is an equivalence between “everything you can say about a” and “everything you can say about b” (outside of directly referencing a or b themselves), then they are indiscernible. We achieve this through a formal definition of a “boundary” \partial_a M, which one can think of as the \mathcal{L}-structure that contains everything that could distinguish a in M K from anything else in M K. Which naturally leads us to the following definitions.

Definition (Indiscernibility). We say that a, b : M K are indiscernible, written a \approx b, iff there is a levelwise equivalence \partial_a M \simeq \partial_b M that is coherent in the right way.

Definition (Univalent Structure). We say that M is locally univalent at K if the canonical map (a = b) \longrightarrow (a \approx b) is an equivalence. We then say a structure M is univalent if this holds at all rank-0 sorts and (recursively) for all sorts of higher rank.

We prove our main results for univalent \mathcal{L}-structures. These are quite special in that they are “saturated” in their equivalence information: two indistinguishable gizmos are actually equal. Or, put differently: when there is not “enough” structure to distinguish two gizmos, there is always enough to prove them equivalent. Some examples to illustrate (see Part 2 of the book for many more!):

  • In a reflexive graph structure, two nodes a and b are indiscernible iff they have the same number of arrows coming in and out, and the same number of loops that are identities.
  • In a category, two objects are indiscernible iff they are isomorphic. A univalent category is precisely a category (precategory in UF) that is locally univalent at the sort of its objects.
  • In an appropriately defined bicategory, an indiscernibility amounts to a pair of coherent adjoint equivalences, as expected.

Equivalences of structures

We are almost there! On to the final missing piece: equivalences between structures themselves. Let f : M \to N be a morphism of \mathcal{L}-structures, defined in the expected way (as a natural transformation between the corresponding \mathcal{U}-valued functors). Then, for each sort K, we have a matching square

\array{ MK & \overset{f_K}{\longrightarrow} & NK \\ \downarrow & & \downarrow \\ \mathbb{M}_K M & \underset{\mathbb{M}_K f}{\longrightarrow} & \mathbb{M}_K N }

and for each “context” z : \mathbb{M}_K M an induced fiber map f_{K,z} : (MK)_z \longrightarrow (NK)_{\mathbb{M}_K f(z)}. Write \approx for indiscernibility at sort K (as above). Then, what we really want to say now is: M, N are equivalent if they are “level-wise equivalent up to indiscernibility”. This idea gives us the \simeq_{\mathcal{L}} from the beginning, and we define it as follows:

Definition (Equivalence of \mathcal{L}-structures). f is an equivalence if, for every sort K, every z : \mathbb{M}_K M, and every y : (NK)_{\mathbb{M}_K f(z)}, there exists a specified x : (MK)_z with f_{K,z}(x) \approx y, i.e., f_{K,z} is essentially split surjective up to indiscernibility on each fiber. We write M \simeq_{\mathcal{L}} N for the type of equivalences.

The key innovation is the “up to indiscernibility” part; it makes our notion significantly weaker than usual notions, and hence the final result stronger. Note that we have not been able to prove our result without a splitness condition in the definition of equivalence, and to our mind this remains an open problem.

Our definition is related to Makkai’s original notion of FOLDS equivalence, but Makkai was not able to define a general notion of non-surjective equivalence directly, relying instead on spans. Our notion of indiscernibility circumvents this difficulty and allows us to consider the whole structure of equivalences between structures.

The Univalence Principle

With all our apparatus in place we prove our main result, a very general form of a univalence principle, as promised.

Theorem (“The Univalence Principle”). For a signature \mathcal{L} and univalent \mathcal{L}-structures M, N, the canonical map (M = N) \longrightarrow (M \simeq_{\mathcal{L}} N) is an equivalence of types. In other words, (M = N) \simeq (M \simeq_{\mathcal{L}} N).

The proof proceeds by induction on the height of \mathcal{L}. The key insight is that level-wise equivalences between univalent structures must “reflect indiscernibilities”: if f doesn’t preserve the ability to distinguish elements, then whatever structure distinguishes them in the source would transfer to structure distinguishing their images in the target, contradicting the equivalence. With the splitness assumption on the map and the assumption of univalence (of our \mathcal{L}-structure), we are able to achieve the lifting of the indiscernibility.

Our result is actually proved for an even more general class of signatures called functorial signatures, which strictly extends diagram signatures and covers “higher-order” situations (topological spaces, suplattices, DCPOs, etc.). We have stuck to the diagrammatic view in this post for intuition, but all results and definitions carry over to this more general notion.

Conclusion

In the course of this work there were quite a few questions we tried to answer, but could not. To mention a couple: Can we define a Rezk completion for arbitrary structures, providing a universal way to turn any structure into a univalent one? Can we remove the splitness condition from our definition of equivalence between structures? We list more open problems at the end of the book.

Beyond our specific result, the framework in which it is proven establishes a way to answer metamathematical questions around the univalence axiom in a precise and fruitful way. It is important to emphasize that carrying out this type of mathematical study does not require choosing one foundation over the other. In any setting that interprets the fibrant part of 2LTT, the univalence principle will hold, including in set theory.

The metamathematics of UF is the mathematical study of formalizing mathematics in terms of a hierarchy of h-levels vs. a cumulative hierarchy of sets. Formalizing mathematics in this way has all sorts of unique mathematical properties. The Univalence Principle is one of them.

February 20, 2026

Tommaso DorigoThe Strange Case Of The Monotonous Running Average

These days I am putting the finishing touches on a hybrid algorithm that optimizes a system (a gamma-ray observatory) by combining reinforcement-learning with gradient descent. Although I published an optimization strategy for that application already, I am going back to it to demonstrate a case where the simultaneous optimization of hardware and software is necessary, for a paper on co-design I am writing with several colleagues.
In the course of the software development, I ran into a simple but still interesting statistical issue I had not paid attention to until now. So I thought I could share it with you here.

read more

February 16, 2026

Terence TaoSix Math Essentials

Just a brief announcement that I have been working with Quanta Books to publish a short book in popular mathematics entitled “Six Math Essentials“, which will cover six of the fundamental concepts in mathematics — numbers, algebra, geometry, probability, analysis, and dynamics — and how they connect to our real-world intuition, to the history of math and science, and to the modern practice of mathematics, both in theory and in applications. The scheduled publication date is Oct 27, but it is currently available for preorder.

February 15, 2026

David Hoggnumber of exposures per visit?

I wrote code this weekend to look at the question of how we should visit a star in the upcoming Terra Hunting Experiment. The current (straw-person) plan is that we will observe each visible star once per night for ten years, with one exposure of a sensibly-chosen exposure time at each visit. Is this a good idea? I was interested in this problem for two reasons. The first is that binning is sinning, with the corollary that bigger bins are worse than finer bins, and a single, long exposure is a very big bin. The second reason is that when there are non-trivial noise sources (like the quasi-periodic variations from p-mode oscillations of the surfaces of Sun-like stars), a few negatively- (or interestingly-) correlated noise draws can be combined in ways that are substantially more informative than by taking the average.

Of course, if you split an exposure (with a standard CCD, say) into sub-exposures, you take on real costs: there is a read time, which is time you aren't integrating, and there is read noise, which affects each new exposure. So the best strategies are a complicated function of read time, read noise, and the signal-to-noise at which the stellar p-mode oscillations are visible in any realistic data. Related: there are amazingly different and interesting strategies with up-the-ramp detectors that are used in the infrared.
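
To see the shape of the trade-off, here is a toy calculation (entirely my own placeholder numbers, not Terra Hunting planning values) of the photon-limited signal-to-noise you give up by splitting a fixed block of wall-clock time into sub-exposures; the gain from finer sampling of correlated noise like the p-modes is the part this sketch deliberately leaves out.

```python
import numpy as np

# Toy sketch: split T seconds of wall-clock time into k sub-exposures of a star with
# photon rate `rate` (photons/s), read time t_read (s) and read noise sigma_read
# (electrons) per read, and compute the SNR of the co-added signal. All numbers are
# made-up placeholders.

def snr_per_visit(k, T=900.0, rate=1000.0, t_read=30.0, sigma_read=5.0):
    t_open = T - k * t_read                       # time actually spent integrating
    if t_open <= 0:
        return 0.0
    signal = rate * t_open                        # total photons collected
    noise = np.sqrt(signal + k * sigma_read**2)   # shot noise plus k reads
    return signal / noise

for k in (1, 2, 4, 8, 16):
    print(f"{k:2d} sub-exposures: SNR = {snr_per_visit(k):.1f}")
```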

One final comment is that the objective, in my strongly held view, is to optimize the amount of information (about, say, the center-of-mass radial-velocity changes of the target star) per unit wall-clock time. We are paying for wall-clock time; let's get as much as we can out of it.

February 13, 2026

David HoggThe LLMs, and why do we do astrophysics?

Today my rant on LLMs and the practices of our field hit the arXiv. I was scared to post it, because it is such a weird contribution, and it is so revealing about myself and my own political positions and hangups. But I have to say: I got great and supportive feedback all day.

I got two comments on saying ACAB in the literature. The Astronomer Royal of Scotland quoted (on BlueSky) the last sentence, which I put there because Andy Casey (Monash, Flatiron) insisted. Many people sent me appreciation and thank-yous, and many people sent me comments and objections. Always constructive. The whole experience made me feel very happy about the state of our field and the way we all interact. I think maybe there will be critical mass to write some kind of collection of essays on the subject. That's a plan for 2026.

February 05, 2026

Matt Strassler The Physicists and Mr. Epstein

Mr. Epstein was not only a world-class child abuser, he was a big fan of theoretical high-energy physics and of theoretical physicists. Some of my colleagues, unfortunately, got to know him. A number who were famous and/or had John Brockman as a book agent were even invited to a physics conference on Epstein’s private island, well before he was first arrested. This was no secret; as I recall, a lot of us heard about the existence of this conference/trip, but we hadn’t heard Epstein’s name before and didn’t pay much attention (ho hum, just another weird billionaire).

Personally, I feel quite lucky. The Brockman agency rejected the proposal for my recent book without comment (thank you!); and my research is mostly considered unimportant by the Brian Greenes of the world. As a result, I was not invited to Epstein’s island, never made his acquaintance, and blissfully avoided the entire affair. Clearly there are some benefits to being considered ordinary. And so — I’m sorry/not-sorry to say — I can’t tell you much about Epstein at all, or about how certain physicists did and did not interact with him. Regarding my colleagues who did get to know him, I can’t speak for them, since I wasn’t there, and I don’t know to what extent Epstein hid his immoral activities when they were around. It’s up to them to tell their own stories if they feel the need to do so (and I hope a couple of them do, just to clear the air.) Personally I tend to give them the benefit of the doubt — probably some literally didn’t know what was up until Epstein’s arrest in 2008, while perhaps others felt there wasn’t much they could do about Epstein’s actions on his own private island. I imagine they are deeply embarrassed to have been caught in this horrible man’s ugly web.

Fans of physics come in all shapes and sizes, and some have large wallets, large egos, and/or large ambitions. Among the wealthy supporters, we can count Alfred Nobel himself; billionaires sit on important scientific institute and university boards, and the more recent Breakthrough Prizes were funded by deep pockets. The extreme wealthy have outsized influence in our country and in our world, and one could argue that their influence in 2025 was not for the better. Usually, though, the influence in physics and related fields tends to be relatively benign, funding postdoctoral researchers and graduate students who deeply want to do science but also need to eat. That said, sometimes donors fund non-essential fields at the expense of critical ones, or favor theoretical research over the gathering of crucial experimental data, or push money on famous rich organizations when there are poor ones that are equally deserving and far more needy.

When gazillionaires, on their own initiative, come calling on non-profit organizations, whether they be community centers, arts organizations, or universities, they pose a problem. On the one hand, it is the job of anyone in a non-profit organization to help raise money — fail to do that, and your organization will close. When a single person offers to permanently change the future of your program, you would be derelict in your duty if you did not consider that offer. On the other hand, donors who might have ethical or criminal problems could drag the organization’s name through the mud. Worse, they might be able to force the organization itself to do something ethically questionable or even illegal.

There is a clear lesson for young academics and other up-and-coming non-profit actors in the Epstein affair: the more money potentially offered to our organizations, the more carefully we must tread. Money is power; power corrupts; and every pursuit of dollars, even for the best causes, risks infection. We can’t be large-scale non-profit fundraisers without doing serious and thorough background checks of the biggest donors; we have to question motives, and we can’t look the other way when something seems amiss. Those of us with clear hearts and honest pursuits tend to assume the best in other people. But we have to beware of those hoping to bolster their reputations, or clean their consciences, by giving away “generously” what they never deserved to have.

February 03, 2026

Tommaso DorigoTurning 60

Strange how time goes by. And strange I would say that, since I know time does not flow, it is just our perception of one of the spacetime coordinates of our block universe... 
The thing is, on February 5 I will turn 60. An important date for anybody - I could say a milestone. First of all, let me say that we take for granted all the days of our life we get to live, but in truth we did not know from the start that we would make it this far. I do feel rather young still, but I am very well aware that there are heaps of ways my life could have ended earlier. Accidents, but also naturally occurring sickness.

read more

February 02, 2026

Terence TaoIPAM industrial short course in generative AI algorithms – deadline for applications closing soon

(Sharing this in my capacity of director of special projects at IPAM.)

IPAM is holding an Industrial Short Course on Generative AI Algorithms on March 5-6, 2026.

The short course is aimed at people from industry or government who want to get started in deep learning, apply deep learning to their projects, learn how to code deep learning algorithms, and upgrade their skills to the latest AI algorithms, including generative AI. The course will be taught by Professor Xavier Bresson from the Department of Computer Science at the National University of Singapore (NUS).

The course will be limited to 25 participants. The fee is $2,500 per participant. Registration closes soon (Feb 5); we still have a few spots available.

January 26, 2026

n-Category Café Categorifying Riemann's Functional Equation

David Jaz Myers just sent me some neat comments on this paper of mine:

and he okayed me posting them here. He’s taking the idea of categorifying the Riemann zeta function, explained in my paper, and going further, imagining what it might mean to categorify Riemann’s functional equation

\xi(s) = \xi(1-s)

where \xi is the ‘completed’ Riemann zeta function, which has an extra factor taking into account the ‘real prime’:

\xi(s) = \pi^{-s/2}\Gamma(s/2) \prod_{p \; prime} \frac{1}{1 - p^{-s}}

My paper categorified the Euler product formula that writes the Riemann zeta function as a product over the usual primes:

\zeta(s) = \prod_{p \; prime} \frac{1}{1 - p^{-s}}

I had nothing to say about the real prime.

But it’s the functional equation that sets the stage for focusing on zeroes of the Riemann zeta function with \text{Re}(s) = 1/2… and then the Riemann Hypothesis! So it’s worth thinking about.

David wrote:

Hi John,

Hope you’re doing well!

I was just thinking about your (and James Dolan’s) definition of the zeta functors associated to a finite type scheme (from here), and I had a small thought which I figured you might find interesting.

I was thinking about the functional equation of the completed zeta functions; how might we complete the zeta functors in such a way that they satisfy a similar functional equation? I don’t know, but I do have an idea for what the transformation s \mapsto 1-s might mean in this context. I claim that it is given by the reduced suspension. Let me explain.

First, I’ll want to see the formal power n^{-s} as the power

\left(\frac{1}{n}\right)^s,

which I can then categorify by finding a group G with cardinality n and considering B G^S. In the case of the Riemann zeta species, n is the cardinality of a finite semisimple ring (a product of finite fields, the groupoid of which has cardinality 1 for each n), and we can simply deloop the additive group of this ring. This gives us a Dirichlet functor \zeta

S \mapsto \sum_{k \; \text{finite semisimple}} B k^S

which categorifies the Riemann zeta function when S is a finite set.

Taking this point of view on the zeta functor, we can ask the question: what is the transformation s \mapsto 1-s? Here’s where we can look at the reduced suspension \Sigma S_+. The universal property of the reduced suspension says that maps \Sigma S_+ \to Y correspond to points of the homotopy type

(x : Y) \times (y : Y) \times (S_+ \to x = y)

(or, more classically, maps from the terminal morphism S_+ \to 1 to \mathsf{Path}(Y) \to Y \times Y). Since homotopy cardinality is multiplicative for fibrations, that type has cardinality

y \cdot y \cdot \left(\frac{1}{y}\right)^{s + 1} = \left(\frac{1}{y}\right)^{s - 1}

(when S is a finite set of cardinality s).

Taking Y = B k for k finite semisimple of cardinality n, we see that \Sigma S_+ \to B k has cardinality n^{s-1} = n^{-(1-s)}. Therefore, I think the transformation s \mapsto 1 - s in the functional equation may be categorified by S \mapsto \Sigma S_+. If this makes sense, it suggests that completing the zeta functors is a form of stabilization.

Cheers,
David

And then:

As for another eyebrow wiggle about the cardinality of \Sigma S_+ when S is a finite set: we have that \Omega \Sigma S_+ = \langle S \rangle, the free group on S generators. This is of course infinite, but it is the group completion of the free monoid \mathsf{List}(S) on S generators. Since

\mathsf{List}(S) = 1 + S + S^2 + S^3 + \cdots,

it has cardinality \frac{1}{1 - s}.

Maybe it’s better to use the “free delooping” (aka the weighted colimit of 1 \rightrightarrows 1 by S \rightrightarrows 1) instead of the reduced suspension. This doesn’t change the above argument because we’re mapping into a groupoid, but now it is true that the Euler characteristic / cardinality of that category is 1 - s.

January 24, 2026

Tommaso DorigoOn The Illusion Of Time And The Strange Economy Of Existence

I recently listened again to Richard Feynman explaining why the flowing of time is probably an illusion. In modern physics time is just a coordinate, on the same footing as space, and the universe can be described as a four-dimensional object — a spacetime block. In that view, nothing really “flows”. All events simply are, laid out in a 4D structure. What we experience as the passage of time is tied instead to the arrow of entropy: the fact that we move through a sequence of states ordered by increasing disorder, and that memory itself is asymmetric.

read more

December 01, 2025

Secret Blogging SeminarCongress proposes cutting of all funding to US academics who mentor Chinese students

I’m writing to point out a potential law which should be gathering more opposition and attention in math academia: The Securing American Funding and Expertise from Adversarial Research Exploitation Act. This is an amendment to the 2026 National Defense Authorization Act which has passed the House and could be added to the final version of the bill during reconciliation in the Senate. I’m pulling most of my information from an article in Science.

This act would ban any US scientist from receiving federal funding if they have, within the last five years, worked with anyone from China, Russia, Iran or North Korea, where “worked with” includes joint research, co-authorship on papers, or advising a foreign graduate student or postdoctoral fellow. As I said in my message to my senators, this is everyone. Every mathematician has advised Chinese graduate students or collaborated with Chinese mathematicians, because China is integrated into the academic world and is one fifth of the earth.

This obviously isn’t secret, since you can read about it in Science, but I am surprised that I haven’t heard more alarm. Obvious people to contact are your senators and your representatives. I would also suggest contacting members of the Senate armed services committee, who are in charge of reconciling the House and Senate versions of the bill.

November 27, 2025

Sean Carroll Thanksgiving

 (Apologies for the ugly blog format. We had a bit of a crash, and are working to get the template back in working order.)

This year we give thanks for a crucially important idea that can mean very different things to different people: information. (We’ve previously given thanks for the Standard Model Lagrangian, Hubble’s Law, the Spin-Statistics Theorem, conservation of momentum, effective field theory, the error bar, gauge symmetry, Landauer’s Principle, the Fourier Transform, Riemannian Geometry, the speed of light, the Jarzynski equality, the moons of Jupiter, space, black hole entropy, electromagnetism, Arrow’s Impossibility Theorem, and quanta.)

“Information” is an idea that is everywhere in science and technology these days. From one angle it looks like such an obvious idea that it’s a bit startling to realize that information theory didn’t really come along until the work of Claude Shannon in the 1940s. From another, the idea has so many different shades of meaning that we shouldn’t be surprised (that’s a joke you will get in a bit) that it can be hard to understand.

Information theory is obviously an enormous subject, but we’re just giving thanks, not writing a textbook. I want to mention two ideas I find especially central. First, Shannon’s idea about relating information content to “surprisal.” Second, the very different intuitive notions of information that we get from engineering and physics.

Shannon, working at Bell Labs, was interested in the problem of how to send trustworthy signals efficiently over transatlantic cables. He was thinking about various ways to express information in a code: a set of symbols, each with a defined meaning. So a code might be an alphabet, or a set of words, or a literal cipher. And he noticed that there was a lot of redundancy in natural languages; the word “the” appears much more often in English than the word “axe,” although both have the same number of letters.

Let’s refer to each letter or symbol in a code as an “event.” Shannon’s insight was to realize that the more unlikely an event, the more information it conveyed when it was received. The statements “The Sun rose in the east this morning” and “The Sun rose in the west this morning” contain the same number of letters, but the former contains almost no information — you already were pretty sure the Sun would be rising in the east. But the latter, if obtained from a reliable source, would be very informative indeed, precisely because it was so unexpected. Clearly some kind of unprecedented astronomical catastrophe was in progress.

Imagine we can assign a probability p(x) to every different event x. Shannon wanted a way to quantify the information content of that event, which would satisfy various reasonable-seeming axioms: most crucially, that the information content of two independent events is the sum of the individual information contents. But the joint probability of two events is the product of their individual probabilities. So the natural thing to do would be to define the information content as the logarithm of the probability; the logarithm of a product equals the sum of the individual logarithms. But you want low probability to correspond to high information content, so Shannon defined the information content (also called the self-information, or surprisal, or Shannon information) of an event to be minus the log of the probability, which by math is equal to the log of the reciprocal of the probability:

    \[I(x) = - \log [p(x)] =\log \left(\frac{1}{p(x)}\right).\]

Note that probabilities are numbers between 0 and 1, and the log of such a number will be negative, with numbers closer to 0 being more negative than numbers closer to 1. So I(x) goes from +\infty at p(x)=0 to 0 at p(x)=1. An impossible message is infinitely surprising, and therefore conveys infinite information; an inevitable message is completely unsurprising, and conveys no information at all.

From there, Shannon suggested that we could characterize how efficient an entire code was at conveying information: just calculate the average (expectation value) of the information content for all possible events. When we have a probability distribution p(x), the average of any function f(x) is just the sum of the values of the function times their respective probabilities, \langle f\rangle = \sum_x p(x) f(x). So we characterize the information content of a code via the quantity

    \[H[p] = - \sum_x p(x) \log[p(x)].\]
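
To make the definitions concrete, here is a minimal sketch (in Python, not from the post) that computes the surprisal of each symbol and the entropy H[p] for a toy four-symbol code:

    import math

    # Toy probability distribution over a four-symbol code (illustrative only).
    p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    def surprisal(px):
        # I(x) = -log2 p(x): the rarer the event, the more bits it conveys.
        return -math.log2(px)

    # H[p] = -sum_x p(x) log p(x), i.e. the average surprisal over the code.
    H = sum(px * surprisal(px) for px in p.values())

    for x, px in p.items():
        print(f"I({x}) = {surprisal(px):.3f} bits")
    print(f"H[p] = {H:.3f} bits")  # 1.75 bits for this distribution

The common symbol “a” carries only 1 bit, the rare symbols carry 3 bits each, and the average works out to 1.75 bits per symbol.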

The only question is, what to call this lovely newly-defined quantity that surely nobody had ever thought of before? Happily Shannon was friends with John von Neumann, who informed him, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.” So entropy it is.

Indeed, this formula is precisely that which had been put forward (unknown to Shannon) by Josiah Willard Gibbs in the 1870’s as a definition of entropy in statistical mechanics. (It is related to the definition on Ludwig Boltzmann’s tombstone, S= k \log W, and Boltzmann had also suggested similar expressions to the above.) On the one hand, it seems remarkable to find precisely the same expression playing central roles in problems as disparate as sending signals across cables and watching cream mix into coffee; on the other hand, it’s a relatively simple expression and the axioms used to derive it are actually pretty similar, so perhaps we shouldn’t be surprised; on the third hand, the connection between information theory and statistical mechanics turns out to be deep and fruitful, so it’s more than just a mathematical coincidence.

But let me highlight the one aspect of the term “information” that can be sometimes confusing to people. To the engineer, a code that is maximally informative is one for which p(x) is relatively uniform over all events x, which means H[p(x)] is maximal or close to it; in that case, every event will tell you something at least a little bit interesting. For them, high entropy = high information.

But to a physicist who might be asking “how much information do I have about the state of a system?”, you have more information when p(x) is relatively narrowly concentrated around some value, rather than being all spread out. For them, high entropy = low information! Indeed, one physically-relevant notion of “information” is the “accessible information” of a system, which can be defined as H_\mathrm{max} - H. (I talk about this a bit in my recent solo podcast on complexity.)
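
The contrast is easy to see numerically. In the purely illustrative Python sketch below, a uniform distribution over four outcomes maximizes H (the engineer’s ideal), while a sharply peaked one has low H and hence high accessible information H_max - H (the physicist’s sense of knowing a lot about the state):

    import math

    def entropy(p):
        # Shannon entropy in bits, skipping zero-probability outcomes.
        return -sum(px * math.log2(px) for px in p if px > 0)

    uniform = [0.25, 0.25, 0.25, 0.25]   # engineer: every event is informative
    peaked  = [0.97, 0.01, 0.01, 0.01]   # physicist: the state is well pinned down

    H_max = math.log2(len(uniform))      # 2 bits for four possible outcomes
    for name, dist in [("uniform", uniform), ("peaked", peaked)]:
        H = entropy(dist)
        print(f"{name}: H = {H:.2f} bits, accessible information = {H_max - H:.2f} bits")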

Perhaps we shouldn’t be so surprised that physicists and engineers posit oppositely-directed relationships between entropy and information. It’s just a reflection of the fact that “information” is so ubiquitous and has so many different uses. We should be thankful that we’re beginning to understand it so well.

November 26, 2025

Tim Gowers Creating a database of motivated proofs

It’s been over three years since my last post on this blog and I have sometimes been asked, understandably, whether the project I announced in my previous post was actually happening. The answer is yes — the grant I received from the Astera Institute has funded several PhD students and a couple of postdocs, and we have been busy. In my previous post I suggested that I would be open to remote collaboration, but that has happened much less, partly because a Polymath-style approach would have been difficult to manage while also ensuring that my PhD students would have work that they could call their own to put in their theses.

In general I don’t see a satisfactory solution to that problem, but in this post I want to mention a subproject of the main project that is very much intended to be a large public collaboration. A few months ago, a call came out from Renaissance Philanthropies saying that they were launching a $9m AI for Math Fund to spend on projects in the general sphere of AI and mathematics, and inviting proposals. One of the categories that they specifically mentioned was creating new databases, and my group submitted a proposal to create a database of what we call “structured motivated proofs,” a piece of terminology that I will explain in just a moment. I am happy to report that our proposal was one of the 29 successful ones. Since a good outcome to the project will depend on collaboration from many people outside the group, we need to publicize it, which is precisely the purpose of this post. Below I will be more specific about the kind of help we are looking for.

Why might yet another database of theorems and proofs be useful?

The underlying thought behind this project is that AI for mathematics is being held back not so much by an insufficient quantity of data as by the wrong kind of data. (For a more general exploration of this theme, see here.) All mathematicians know, and some of us enjoy complaining about it, that it is common practice when presenting a proof in a mathematics paper, or even textbook, to hide the thought processes that led to the proof. Often this does not matter too much, because the thought processes may be standard ones that do not need to be spelt out to the intended audience. But when proofs start to get longer and more difficult, they can be hard to read because one has to absorb definitions and lemma statements that are not obviously useful, are presented as if they appeared from nowhere, and demonstrate their utility only much later in the argument.

A sign that this is a problem for AI is the behaviour one observes after asking an LLM to prove a statement that is too difficult for it. Very often, instead of admitting defeat, it will imitate the style of a typical mathematics paper and produce rabbits out of hats, together with arguments later on that those rabbits do the required job. The problem is that, unlike with a correct mathematics paper, one finds when one scrutinizes the arguments carefully that they are wrong. However, it is hard to find superficial features that distinguish between an incorrect rabbit with an incorrect argument justifying that rabbit (especially if the argument does not go into full detail) and a correct one, so the kinds of statistical methods used by LLMs do not have an easy way to penalize the incorrectness.

Of course, that does not mean that LLMs cannot do mathematics at all — they are remarkably good at it, at least compared with what I would have expected three years ago. How can that be, given the problem I have discussed in the previous paragraph?

The way I see it (which could change — things move so fast in this sphere), the data that is currently available to train LLMs and other systems is very suitable for a certain way of doing mathematics that I call guess and check. When trying to solve a maths problem, you will normally write down the routine parts of an argument without any fuss (and an LLM can do them too because it has seen plenty of similar examples), but if the problem as a whole is not routine, then at some point you have to stop and think, often because you need to construct an object that has certain properties (I mean this in a rather general way — the “object” might be a lemma that will split up the proof in a nice way) and it is not obvious how to do so. The guess-and-check approach to such moments is what it says: you make as intelligent a guess as you can and then see whether it has the properties you wanted. If it doesn’t, you make another guess, and you keep going until you get lucky.

The reason an LLM might be tempted to use this kind of approach is that the style of mathematical writing I described above makes it look as though that is what we as mathematicians do. Of course, we don’t actually do that, but we tend not to mention all the failed guesses we made and how we carefully examined why they failed, modifying them in appropriate ways in response, until we finally converged on an object that worked. We also don’t mention the reasoning that often takes place before we make the guess, saying to ourselves things like “Clearly an Abelian group can’t have that property, so I need to look for a non-Abelian group.”

Intelligent guess and check works well a lot of the time, particularly when carried out by an LLM that has seen many proofs of many theorems. I have often been surprised when I have asked an LLM a problem of the form \exists x\in X \ P(x), where P is some property that is hard to satisfy, and the LLM has had no trouble answering it. But somehow when this happens, the flavour of the answer given by the LLM leaves me with the impression that the technique it has used to construct x is one that it has seen before and regards as standard.

If the above picture of what LLMs can do is correct (the considerations for reinforcement-learning-based systems such as AlphaProof are not identical but I think that much of what I say in this post applies to them too for slightly different reasons), then the likely consequence is that if we pursue current approaches, then we will reach a plateau: broadly speaking they will be very good at answering a question if it is the kind of question that a mathematician with the right domain expertise and good instincts would find reasonably straightforward, but will struggle with anything that is not of that kind. In particular, they will struggle with research-level problems, which are, almost by definition, problems that experts in the area do not find straightforward. (Of course, there would probably be cases where an LLM spots relatively easy arguments that the experts had missed, but that wouldn’t fundamentally alter the fact that they weren’t really capable of doing research-level mathematics.)

But what if we had a database of theorems and proofs that did not hide the thought processes that lay behind the non-obvious details of the proofs? If we could train AI on a database of accounts of proof discoveries and if, having done so, we then asked it to provide similar accounts, then it would no longer resort to guess-and-check when it got stuck, because the proof-discovery accounts it had been trained on would not be resorting to it. There could be a problem getting it to unlearn its bad habits, but I don’t think that difficulty would be impossible to surmount.

The next question is what such a database might look like. One could just invite people to send in stream-of-consciousness accounts of how they themselves found certain proofs, but that option is unsatisfactory for several reasons.

  1. It can be very hard to remember where an idea came from, even a few seconds after one has had it — in that respect it is like a dream, the memory of which becomes rapidly less vivid as one wakes up.
  2. Often an idea will seem fairly obvious to one person but not to another.
  3. The phrase “motivated proof” means different things to different people, so without a lot of careful moderation and curation of entries, there is a risk that a database would be disorganized and not much more helpful than a database of conventionally written proofs.
  4. A stream-of-consciousness account could end up being a bit too much about the person who finds the proof and not enough about the mathematical reasons for the proof being feasibly discoverable.

To deal with these kinds of difficulties, we plan to introduce a notion of a structured motivated proof, by which we mean a proof that is generated in a very particular way that I will partially describe below. A major part of the project, and part of the reason we needed funding for it, is to create a platform that will make it convenient to input structured motivated proofs and difficult to insert the kinds of rabbits out of hats that make a proof mysterious and unmotivated. In this way we hope to gamify the task of creating the database, challenging people to input into our system proofs of certain theorems that appear to rely on “magic” ideas, and perhaps even offering prizes for proofs that contain steps that appear in advance to be particularly hard to motivate. (An example: the solution by Ellenberg and Gijswijt of the cap-set problem uses polynomials in a magic-seeming way. The idea of using polynomials came from an earlier paper of Croot, Lev and Pach that proved a closely related theorem, but in that paper it just appears in the statement of their Lemma 1, with no prior discussion apart from the words “in the present paper we use the polynomial method” in the introduction.)

What is a structured motivated proof?

I wrote about motivated proofs in my previous post, but thanks to many discussions with other members of the group, my ideas have developed quite a lot since then. Here are two ways we like to think about the concept.

1. A structured motivated proof is one that is generated by standard moves.

I will not go into full detail about what I mean by this, but will do so in a future post when we have created the platform that we would like people to use in order to input proofs into the database. But the basic idea is that at any one moment one is in a certain state, which we call a proof-discovery state, and there will be a set of possible moves that can take one from the current proof-discovery state to a new one.

A proof-discovery state is supposed to be a more formal representation of the state one is in when in the middle of solving a problem. Typically, if the problem is difficult, one will have asked a number of questions, and will be aware of logical relationships between them: for example, one might know that a positive answer to Q1 could be used to create a counterexample to Q2, or that Q3 is a special case of Q4, and so on. One will also have proved some results connected with the original question, and again these results will be related to each other and to the original problem in various ways that might be quite complicated: for example P1 might be a special case of Q2, which, if true would reduce Q3 to Q4, where Q3 is a generalization of the statement we are trying to prove.

Typically we will be focusing on one of the questions, and typically that question will take the form of some hypotheses and a target (the question being whether the hypotheses imply the target). One kind of move we might make is a standard logical move such as forwards or backwards reasoning: for example, if we have hypotheses of the form P(x) and \forall u\ P(u)\implies Q(u), then we might decide to deduce Q(x). But things get more interesting when we consider slightly less basic actions we might take. Here are three examples.

  1. We have in our list of hypotheses the fact that a function f is given by the formula f(x)=\exp(p(x)), where p is a polynomial, and our goal is to prove that there exists z such that f(z)=1. Without really thinking about it, we are conscious that f is a composition of two functions, one of which is continuous and one of which belongs to a class of functions that are all continuous, so f is continuous. Also, the conclusion \exists z\ f(z)=1 matches well the conclusion of the intermediate-value theorem. So the intermediate-value theorem comes naturally to mind and we add it to our list of available hypotheses. In practice we wouldn’t necessarily write it down, but the system we wish to develop is intended to model not just what we write down but also what is going on in our brains, so we propose a move that we call library extraction (closely related to what is often called premise selection in the literature). Note that we have to be a bit careful about library extraction. We don’t want the system to be allowed to call up results from the library that appear to be irrelevant but then magically turn out to be helpful, since those would feel like rabbits out of hats. So we want to allow extraction of results only if they are obvious given the context. It is not easy to define what “obvious” means, but there is a good rule of thumb for it: a library extraction is obvious if it is one of the first things ChatGPT thinks of when given a suitable non-cheating prompt. For example, I gave it the prompt, “I have a function f from the reals to the reals and I want to prove that there exists some z such that f(z)=1. Can you suggest any results that might be helpful?” and the intermediate-value theorem was its second suggestion. (Note that I had not even told it that f was continuous, so I did not need to make that particular observation before coming up with the prompt.)
  2. We have a goal of the form \exists x\in X\ P(x). If this were a Lean proof state, the most common way to discharge a goal of this form would be to input a choice for x. That is, we would instantiate the existential quantifier with some x_0 and our new goal would be P(x_0). However, as with library extraction, we have to be very careful about instantiation if we want our proof to be motivated, since we wish to disallow highly surprising choices of x_0 that can be found only after a long process of thought. So we have to restrict ourselves to obvious instantiations. One way that an instantiation in our system will count as obvious is if the variable is instantiated with a term that is already present in the proof-discovery state. If the desired term is not present, then in order to continue with the proof, it will be necessary to carry out moves that generate it. A very common technique for this is the use of metavariables: instead of guessing a suitable x_0, we create a variable x^\bullet and change the goal to P(x^\bullet), which we can think of as saying “I’m going to start trying to prove P(x^\bullet) even though I haven’t chosen x^\bullet yet. As the attempted proof proceeds, I will note down any properties Q_1,\dots,Q_k that x^\bullet might have that would help me finish the proof, in the hope that (i) I get to the end and (ii) the problem \exists x\ Q_1(x)\wedge\dots\wedge Q_k(x) is easier than the original problem.” Another kind of obvious instantiation is one where we try out an object that is “extreme” in some way — it might be the smallest element of X, or the largest, or the simplest. (Judging simplicity is another place where the ChatGPT rule of thumb can be used.)
  3. We cannot see how to answer the question we are focusing on so we ask a related question. Two very common kinds of related question (as emphasized by Polya) are generalization and specialization. Perhaps we don’t see why a hypothesis is helpful, so we see whether the result holds if we drop that hypothesis. If it does, then we are no longer distracted by an irrelevant hypothesis. If it does not, then we can hope to find a counterexample that will help us understand how to use the hypothesis. Or perhaps we are trying to prove a general statement but it is not clear how to do so, so instead we formulate some special cases, hoping that we can prove them and spot features of the proofs that we can generalize. Again we have to be rather careful here not to allow “non-obvious” generalizations and specializations. Roughly the idea there is that a generalization should be purely logical — for example, dropping a hypothesis is fine but replacing the hypothesis “f is twice differentiable” by “f is upper semicontinuous” is not — and that a specialization should be to a special case that counts as an obvious instantiation in the sense discussed just above.

2. A structured motivated proof is one that can be generated with the help of a point-and-click system.

This is a surprisingly useful way to conceive of what we are talking about, especially as it relates closely to what I was talking about earlier: imposing a standard form on motivated proofs (which is why we call them “structured” motivated proofs) and gamifying the process of producing them.

The idea is that a structured motivated proof is one that can be generated using an interface (which we are in the process of creating — at the moment we have a very basic prototype that has a few of the features we will need, but not yet the more interesting ones) that has one essential property: the user cannot type in data. So what can they do? They can select text that is on their screen (typically mathematical expressions or subexpressions), they can click buttons, choose items from drop-down menus, and accept or reject “obvious” suggestions made to them by the interface.

If, for example, the current goal is an existential statement \exists x\ P(x), then typing in a formula that defines a suitable x is not possible, so instead one must select text or generate new text by clicking buttons, choosing from short drop-down menus, and so on. This forces the user to generate x, which is our proxy for showing where the idea of using x came from.

Broadly speaking, the way the prototype works is to get an LLM to read a JSON object that describes the variables, hypotheses and goals involved in the proof state in a structured format, and to describe (by means of a fairly long prompt) the various moves it might be called upon to do. Thus, the proofs generated by the system are not formally verified, but that is not an issue that concerns us in practice since there will be a human in the loop throughout to catch any mistakes that the LLM might make, and this flexibility may even work to our advantage to better capture the fluidity of natural-language mathematics.
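
To give a feel for what that structured description might contain, here is a purely hypothetical sketch, written as a Python dict in the spirit of such a JSON object; the field and move names are invented for illustration and are not taken from the actual prototype. It encodes the intermediate-value-theorem example from earlier:

    # Hypothetical proof-discovery state; all keys and move names are invented
    # for illustration and do not reflect the real prototype's schema.
    state = {
        "variables": [{"name": "f", "type": "real -> real"}],
        "hypotheses": [
            {"id": "h1", "statement": "f(x) = exp(p(x)) for a polynomial p"},
        ],
        "goals": [
            {"id": "g1", "statement": "there exists z such that f(z) = 1"},
        ],
        "moves_offered": [
            "forwards_reasoning",
            "library_extraction",       # e.g. fetch the intermediate value theorem
            "introduce_metavariable",   # replace z by a metavariable and collect constraints
        ],
    }

An LLM-driven move would then return a new state of the same shape, for instance with the intermediate value theorem added to the list of hypotheses.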

There is obviously a lot more to say about what the proof-generating moves are, or (approximately equivalently) what the options provided by a point-and-click system will be. I plan to discuss that in much more detail when we are closer to having an interface ready, the target for which is the end of this calendar year. But the aim of the project is to create a database of examples of proofs that have been successfully generated using the interface, which can then be used to train AI to play the generate-structured-motivated-proof game.

How to get involved.

There are several tasks that will need doing once the project gets properly under way. Here are some of the likely ones.

  1. The most important is for people to submit structured motivated (or move-generated) proofs to us on the platform we provide. We hope that the database will end up containing proofs of a wide range of difficulty (of two kinds — there might be fairly easy arguments that are hard to motivate and there might be arguments that are harder to follow but easier to motivate) and also a wide range of areas of mathematics. Our initial target, which is quite ambitious, is to have around 1000 entries by two years from now. While we are not in a position to accept entries yet, if you are interested in participating, then it is not too early to start thinking in a less formal way about how to convert some of your favourite proofs into motivated versions, since that will undoubtedly make it easier to get them accepted by our platform when it is ready.
  2. We are in the process of designing the platform. As I mentioned earlier, we already have a prototype, but there are many moves we will need it to be able to do that it cannot currently do. For example, the current prototype allows just a single proof state, which consists of some variable declarations, hypotheses, and goals. It does not yet support creating subsidiary proof states (which we would need if we wanted to allow the user to consider generalizations and specializations, for example). Also, for the moment the prototype gets an LLM to implement all moves, but some of the moves, such as applying modus ponens, are extremely mechanical and would be better done using a conventional program. (On the other hand, moves such as “obvious library extraction” or “provide the simplest example” are better done by an LLM.) Thirdly, a technical problem is that LaTeX is currently rendered as images, which makes it hard to select subexpressions, something we will need to be able to do in a non-clunky way. And the public version of the platform will need to be web-based and very convenient to use. We will want features such as being able to zoom out and look at some kind of dependency diagram of all the statements and questions currently in play, and then zoom in on various nodes if the user wishes to work on them. If you think you may be able (and willing) to help with some of these aspects of the platform, then we would be very happy to hear from you. For some, it would probably help to have a familiarity with proof assistants, while for others we would be looking for somebody with software engineering experience. The grant from the AI for Math Fund will allow us to pay for some of this help, at rates to be negotiated. We are not yet ready to specify in detail what help we need, but would welcome any initial expressions of interest.
  3. Once the platform is ready and people start to submit proofs, it is likely that, at least to start with, they will find that the platform does not always provide the moves they need. Perhaps they will have a very convincing account of where a non-obvious idea in the proof came from, but the system won’t be expressive enough for them to translate that account into a sequence of proof-generating moves. We will want to be able to react to such situations (if we agree that a new move is needed) by expanding the capacity of the platform. It will therefore be very helpful if people sign up to be beta-testers, so that we can try to get the platform to a reasonably stable state before opening it up to a wider public. Of course, to be a beta-tester you would need to have a few motivated proofs in mind.
  4. It is not obvious that every proof submitted via the platform, even if submitted successfully, would be a useful addition to the database. For instance, it might be such a routine argument that no idea really needs to have its origin explained. Or it might be that, despite our best efforts, somebody finds a way of sneaking in a rabbit while using only the moves that we have provided. (One way this could happen is if an LLM made a highly non-obvious suggestion that happened to work, in which case the rule of thumb that if an LLM thinks of it, it must be obvious, would have failed in that instance.) For this reason, we envisage having a team of moderators, who will check entries and make sure that they are good additions to the database. We hope that this will be an enjoyable task, but it may have its tedious aspects, so we envisage paying moderators — again, this expense was allowed for in our proposal to the AI for Math Fund.

If you think you might be interested in any of these roles, please feel free to get in touch. Probably the hardest recruitment task for us will be identifying the right people with the right mixture of mathematical knowledge and software engineering skills to help us turn the platform into a well-designed web-based one that is convenient and pleasurable to use. If you think you might be such a person, or if you have a good idea for how we should go about finding one, we would be particularly interested to hear from you.

In a future post, I will say more about the kinds of moves that our platform will allow, and will give examples of non-motivated proofs together with how motivated versions of those proofs can be found and entered using the platform (which may involve a certain amount of speculation about what the platform will end up looking like).

How does this relate to use of tactics in a proof assistant?

In one way, our “moves” can be regarded as tactics of a kind. However, some of the moves we will need are difficult to implement in conventional proof assistants such as Lean. In parallel with the work described above, we hope to create an interface to Lean that would allow one to carry out proof-discovery moves of the kind discussed above but with the proof-discovery states being collections of Lean proof states. Members of my group have already been working on this and have made some very interesting progress, but there is some way to go. However, we hope that at some point (and this is also part of the project pitched to the AI for Math Fund) we will have created another interface that will have Lean working in the background, so that it will be possible to generate motivated proofs that will be (or perhaps it is better to say include) proofs in Lean at the same time.

Another possibility that we are also considering is to use the output of the first platform (which, as mentioned above, will be fairly formal, but not in the strict sense of a language such as Lean) to create a kind of blueprint that can then be autoformalized automatically. Then we would have a platform that would in principle allow mathematicians to search for proofs while working on their computers without having to learn a formal language, with their thoughts being formalized as they go.

November 25, 2025

David Hogg substellar objects (brown dwarfs)

I spent the day at the NSBP / NSHP meeting in San José. My favorite session of the day was the morning astro session, which was entirely about brown dwarfs. I learned a lot in a very short time. Caprice Phillips (UCSC) introduced the session with an introduction to the scientific and technical questions in play. She put a lot of emphasis on using binaries and clusters to put detailed abundance ratios onto substellar objects. This was what I expected: I thought (walking in to this session) that all known abundance ratios for brown dwarfs were from such kinds of studies. I learned different (keep reading).

Gabriel Munoz Zarazua (SFSU) followed by showing spectra from M-dwarfs, brown dwarfs, and Jupiter. It definitely looks like a sequence. He does spectral fitting (what they call, in this business, retrievals). It looks like he is getting very good, somewhat precise, abundance ratios for the photospheres of substellar objects! I asked more about this in the question period, and apparently I am way behind the times (Emily Rauscher, Michigan, helpfully pointed this out to me): Now brown-dwarf photosphere models are so good, they can be used to measure abundances, and pretty well.

I also learned in this session (maybe from Jorge Sanchez, ASU, or maybe from Efrain Alvarado, SFSU) that there is a very strong mass–abundance relation in the Solar System. That is, we don't expect, if brown dwarfs form the way planets do, that the detailed abundances of the brown dwarfs will match exactly the detailed abundances of the primary stars. But now we are really in a position to test that. Sanchez showed that we can get, from even photometry, abundances for substellar objects in the Milky Way halo. Again, totally new to me! And he finds metallicities at or below −3. Alvarado showed data on an amazing system J1416, which is an L–T binary with no stellar companion. Apparently it is the only known completely substellar binary.

November 14, 2025

Matt Strassler Event with Professor Daniel Whiteson on Monday November 17 at 7pm

Next Monday, November 17th at 7pm, I’ll be at the Harvard Bookstore with particle physicist and author Daniel Whiteson. Professor Whiteson and his co-author Andy Warner have a nice new book, for the general science-aware reader, exploring an age-old and unanswered question: how universal is the knowledge and understanding that we call “physics”? How much of modern physics is actually telling us about the universe, and how much of it is created by, or an accident of, the humans who have helped bring it about?

For instance, if we started all over again and reran history from scratch, would the physics (and science more generally) of this re-run culture look much like our own, or might it turn out very differently? If another culture on Earth had had time to develop highly mature science (or something like it) in its own direction, independent of Western Europe’s influence, how different might that science be? (Indeed, would our word “science” even be translatable into their worldview?) Or if we encountered aliens with far greater understanding of the universe than we have, would we be able to recognize, parse, grok, appreciate, comprehend, and/or otherwise make sense of their notions of scientific knowledge?

Whiteson and his co-author, wanting to write a popular book rather than a scholarly one, and desiring nevertheless to take on these serious and challenging intellectual questions, have set their focus mostly on the aliens, accompanied by amusing cartoons and a generous helping of dad jokes (hey, some dad jokes are actually very funny.) They’re looking for a broad audience, and hopefully they will get it. But don’t let the light-hearted title (“Do Aliens Speak Physics?“) or the charmingly goofy cover fool you: this book might well make you laugh, but I guarantee it will make you think. Whether you’re just curious about science or you’ve been doing science yourself for years, I suspect that, within the vast array of problems and issues that are raised in this broad-minded book, there will be some you’ve never thought of.

Among scientists and philosophers, there are some who believe that any aliens with the capacity to reach the Earth will obviously “speak physics” — that math and physics float above contingencies of culture and species, and will easily be translated from any intelligent creature to any other. But are they perhaps flying too high? It’s clear that Whiteson and Warner are aiming to poke some holes — lots of holes — in their hot-air balloon, and to do so in a way that a wide variety of readers can appreciate and enjoy.

I tend to agree with Whiteson on a lot of these issues, but that won’t stop me from asking him some tough questions. You can ask him some tough questions too, if you like — just come to the Harvard Bookstore at 7:00 on Monday and join the conversation!

October 15, 2025

Clifford Johnson Nobel Prize in Physics 2025: Who/What/Why

I started a tradition a little while back where every year we have a special departmental colloquium entitled "The Nobel Prize in Physics: Who/What/Why". This year my job in finding speakers was made easier by having 2/3 of this year's newly-minted Nobel Prize winners in physics in the Department! (Michel Devoret and John Martinis.) So our room was a bit more well-attended than normal...(hundreds and hundreds rather than dozens and dozens). Here is a recording of the event, which I was delighted to host, and there's a celebration afterwards too. (Pls share widely!)
[...] Click to continue reading this post

The post Nobel Prize in Physics 2025: Who/What/Why appeared first on Asymptotia.

October 01, 2025

Robert Helling Holosplit

 Recently I had to update Mathematica on my laptop and after having solved the challenges of the license manager that keeps looking different every time I have to use it, I learned that Mathematica 14 can now officially work with finite fields.

This reminded me that for a while I wanted to revive an old project that had vanished together with the hard drive of some old computer: Holosplit. So, over the last two days and with the help of said version of Mathematica I did a complete rewrite which you can now find on Github.

It consists of two C programs "holosplit" and "holojoin". To the first you give a positive integer \(N\) and a file and it spits out a new file ("fragment") that is roughly \(1/N\) of the size. Every time you do that you obtain a new random fragment.

To the latter you give any collection of \(N\) of these fragments and it reproduces the original file. So you can, for example, run it with \(N=3\) ten times and distribute the fragments over 10 people, such that when any 3 of them work together, they can recover the original.

How does it work? It uses the finite field \(F\) of \(2^8=256\) elements (in the Github repository, there is also a header file that implements arithmetic in \(F\) and matrix operations like product and inverse over it). Each time it is invoked, it picks a random vector \(v\in F^N\) and writes it to the output. Then it reads \(N\) bytes from the file at a time, which it also interprets as a vector \(d\in F^N\). It then outputs the byte that corresponds to the scalar product \(v\cdot d\).

To reassemble the file, holojoin takes the \(N\) files with their random vectors \(v_1,\ldots,v_N\) and interprets those as the rows of an \(N\times N\) matrix \(A\). With probability

$$\frac{\prod_{k=0}^{N-1} \left(256^N-256^k\right)}{256^{N^2}},$$

which for a field of this size is very close to 1, this matrix is invertible (homework: why?). So we can read one byte from each file, assemble those into yet another vector \(e\in F^N\) and recover

$$d=A^{-1}e.$$
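
As a rough illustration of the scheme (and emphatically not Helling’s actual C code), here is a toy Python version that works over the prime field GF(251) instead of GF(2^8), so that ordinary modular arithmetic can stand in for the field operations; it only handles data bytes below 251 and simply fails in the unlikely case that the random matrix turns out to be singular:

    import random

    P = 251  # toy prime field; the real programs use GF(2^8) with 256 elements

    def holosplit(data, n):
        # One fragment: a random vector v in F^n plus, for every n-byte block d
        # of the (zero-padded) input, the scalar product v . d mod P.
        v = [random.randrange(P) for _ in range(n)]
        body = []
        for i in range(0, len(data), n):
            block = list(data[i:i + n])
            block += [0] * (n - len(block))            # pad the final block
            body.append(sum(a * b for a, b in zip(v, block)) % P)
        return v, body

    def holojoin(fragments):
        # Rows of A are the fragments' random vectors; invert A over GF(P) by
        # Gauss-Jordan elimination, then apply A^{-1} to each vector e assembled
        # from one byte of every fragment.
        n = len(fragments)
        A = [list(v) for v, _ in fragments]
        Inv = [[int(i == j) for j in range(n)] for i in range(n)]
        for c in range(n):
            r = next(i for i in range(c, n) if A[i][c] != 0)  # fails if singular
            A[c], A[r] = A[r], A[c]
            Inv[c], Inv[r] = Inv[r], Inv[c]
            s = pow(A[c][c], -1, P)                           # modular inverse
            A[c] = [x * s % P for x in A[c]]
            Inv[c] = [x * s % P for x in Inv[c]]
            for i in range(n):
                if i != c and A[i][c]:
                    f = A[i][c]
                    A[i] = [(x - f * y) % P for x, y in zip(A[i], A[c])]
                    Inv[i] = [(x - f * y) % P for x, y in zip(Inv[i], Inv[c])]
        out = []
        for e in zip(*(body for _, body in fragments)):
            out.extend(sum(Inv[i][j] * e[j] for j in range(n)) % P
                       for i in range(n))
        return bytes(out)

    frags = [holosplit(b"hello world", 3) for _ in range(3)]
    print(holojoin(frags))  # b'hello world\x00' (padded to a multiple of 3)

Running holosplit with \(N=3\) ten times and handing one fragment to each of 10 people gives exactly the any-three-can-reconstruct behaviour described above.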

Besides the mathematics, it also poses philosophical/legal questions: consider, for example, that the original file is copyrighted, say an mp3 or a video. The fragments are clearly derived works. But individually they do not contain the original work; without sufficiently many other fragments they are useless (although not in a cryptographic sense). So by publishing one fragment, I do not provide access to the original work. What if others publish other fragments? Then my fragment could be the last remaining one that was missing. And if there are more than enough fragments in circulation, any individual fragment is redundant, so publishing it strictly speaking does not provide new information.

September 26, 2025

Peter Rohde Photo albums

Peter’s photos: https://www.icloud.com/sharedalbum/#B275oqs3qKSZvQ

Screenshots: https://www.icloud.com/sharedalbum/#B27532ODWjIQb9

Climbing book launch: https://www.icloud.com/sharedalbum/#B27GWZuqDGnuOyN

Salisbury waters: https://www.icloud.com/sharedalbum/#B275qXGF1JQFkx

Christmas with Ash: https://www.icloud.com/sharedalbum/#B27G6XBubAhoT6

Hosin BBQ duck: https://www.icloud.com/sharedalbum/#B27GY8gBYG3b5mD

Hawks Nest to Smiths Lake: https://www.icloud.com/sharedalbum/#B2759UlCqSH5bE

Europe & Alps: https://www.icloud.com/sharedalbum/#B275ON9t3W0lu

Point Perpendicular: https://www.icloud.com/sharedalbum/#B27GqkRUiGivXD2

Newnes canyoning: https://www.icloud.com/sharedalbum/#B27GfnH8tgHSmX

Coffs Harbour to Yamba: https://www.icloud.com/sharedalbum/#B27J0DiRHJKuuWr

Wendy Bruere Christmas (2020): https://www.icloud.com/sharedalbum/#B27G4TcsmGoHysj

Six Foot Track: https://www.icloud.com/sharedalbum/#B2753qWtHZA9EX

Kosciusko to Kiandra: https://www.icloud.com/sharedalbum/#B27GgZLKuGaewVm

Camping food: https://www.icloud.com/sharedalbum/#B27GtnIORgbmHu

The Aardvark: https://www.icloud.com/sharedalbum/#B275VaUrzvmAiT

Kangaroo Valley kayaking: https://www.icloud.com/sharedalbum/#B27JEsNWnJrCpi0

Claustral canyon: https://www.icloud.com/sharedalbum/#B2755Z2WMOTpsk

Budawang: https://www.icloud.com/sharedalbum/#B27GDdyTvGvpINL

Mother’s Day panoramas (2021): https://www.icloud.com/sharedalbum/#B27GFssfGG9WmJP

Point Perpendicular & Nowra: https://www.icloud.com/sharedalbum/#B27GRMtznGPdeuZ

Blood moon: https://www.icloud.com/sharedalbum/#B27GdIshaG8NgGX

La Perouse to Coogee: https://www.icloud.com/sharedalbum/#B275aVbMK4h7qo

Canberra ASPI launch: https://www.icloud.com/sharedalbum/#B27GQOeMmGj4Zcv

Edible foraging: https://www.icloud.com/sharedalbum/#B275ejO179Si0N

Sydney to Wollongong: https://www.icloud.com/sharedalbum/#B275M7GFPUasMe

Album for Dad, Father’s Day (2021): https://www.icloud.com/sharedalbum/#B2752plgjnnkUe

Vaucluse (with Cheryl, Nestor & Wendy): https://www.icloud.com/sharedalbum/#B275CmvAS4uA0Z

Bouddi National Park: https://www.icloud.com/sharedalbum/#B27GdPblXG8WdOo

Tom Thumb (the 2nd): https://www.icloud.com/sharedalbum/#B275aDWbr4CN2w

Eden to Victoria: https://www.icloud.com/sharedalbum/#B27GJDfWGArX8l

Wendy’s book launch (the 2nd): https://www.icloud.com/sharedalbum/#B27GIcgc2G7h08y

Mark & Pat Bruere visit Sydney: https://www.icloud.com/sharedalbum/#B27G0ehgLbyWyg

New Years Eve climb (2021): https://www.icloud.com/sharedalbum/#B27Ju8EH6JOZxmU

Newnes Canyoning (2022): https://www.icloud.com/sharedalbum/#B275BydzFU0GZ8

Royal National Park (2022): https://www.icloud.com/sharedalbum/#B27GlxzuqGVI5nE

Peter & Wendy: https://www.icloud.com/sharedalbum/#B27Gf693ZG52tfd

Book photo shoots: too rude…

Wendy & Peter’s mushroom trip: https://www.icloud.com/sharedalbum/#B27GrhkPxG27So8

Post-mushroom hike: https://www.icloud.com/sharedalbum/#B27GdFryYG8i3Ur

Wendy Kalymnos favourites: https://www.icloud.com/sharedalbum/#B27JqstnBJEXkH2

Wendy Frenchmans screenshots: https://www.icloud.com/sharedalbum/#B27Jr1PPdJpd7Dq

Instagram: https://www.icloud.com/sharedalbum/#B27GzFCC1Gb4tqr

Haute route: https://www.icloud.com/sharedalbum/#B27J8GySPJtWoQ1

Kim’s KKKalendar: https://www.icloud.com/sharedalbum/#B275fk75vIL0sH

Frenchmans Cap Wild: https://www.icloud.com/sharedalbum/#B27G4VTwGGoFBkz

Photoshoot with Zixin: https://www.icloud.com/sharedalbum/#B27GPCdxkGKPkM4

Wendy birthday hike (2023): https://www.icloud.com/sharedalbum/#B27GWBC59GnHpQW

Bateman’s Bay to Bawley Point: https://www.icloud.com/sharedalbum/#B27JsHvHoJ8bxWf

Stockton Sand dunes (2023): https://www.icloud.com/sharedalbum/#B27GVfZ2vGloFZV

Wendy book launch (2023): https://www.icloud.com/sharedalbum/#B27J058xyJR4IBM

Dolomites (2023): https://www.icloud.com/sharedalbum/#B0Z5kuVsbGJUzKO

Mount Arapiles: https://www.icloud.com/sharedalbum/#B275GH8Mq8Uh2X

Mount Solitary loop: https://www.icloud.com/sharedalbum/#B275nhQST2mETE

Klaus Hanz Franz Rohde Kunst: https://www.icloud.com/sharedalbum/#B27GqQrCLGiY3vb

Klaus Rohde funeral slideshow: https://www.icloud.com/sharedalbum/#B27GDZLe8GXP58K

Dad (old, B&W): https://www.icloud.com/sharedalbum/#B27GLLXGLJ5mbT2

Klaus & Ursula wedding: https://www.icloud.com/sharedalbum/#B275cLqfN7154g

Test Greece: https://www.icloud.com/sharedalbum/#B27Jq4WnLJ6JMNd

From Will Skea (Alps): https://www.icloud.com/sharedalbum/#B27JHciePJFwacG

From Will Skea (Frenchmans Cap): https://www.icloud.com/sharedalbum/#B275ZhN2v3EVq6

From Will Skea (Arapiles): https://www.icloud.com/sharedalbum/#B27JPrgBGJu3BTD

Coffs Harbour to Yamba (2): https://www.icloud.com/sharedalbum/#B27GFqhgJG9LHgT

Mark magic show (2021): https://www.icloud.com/sharedalbum/#B27G60dj6ARCvd

Wendy Christmas present (2020): https://www.icloud.com/sharedalbum/#B275FrPQ6GxvRu

AHS 25 year reunion: https://www.icloud.com/sharedalbum/#B275O3DjHUvSv

WhatsApp: https://www.icloud.com/sharedalbum/#B275tzEA5fX1nc

Armidale High School: https://www.icloud.com/sharedalbum/#B27GnbeumG4PnAF

Book photos for Mum & Dad: https://www.icloud.com/sharedalbum/#B27Gtec4XQkASe

Miscellaneous: https://www.icloud.com/sharedalbum/#B27Gq6kMgGKn7GR

Three Capes Trail (2022): https://www.icloud.com/sharedalbum/#B27G7HOIlGrDUGZ

Childhood computer programming: https://www.icloud.com/sharedalbum/#B275fu2MutDU8N

Magic with Mark in Maroubra: https://www.icloud.com/sharedalbum/#B27Gv6DhEGD9U3G

Photoshoot with Zixin (2024): https://www.icloud.com/sharedalbum/#B27GCATCnJGoRfW

Butt Crack (2021): https://www.icloud.com/sharedalbum/#B275VtHQfMv0zw

Greece photos new (edited to remove photos from wrong album): https://www.icloud.com/sharedalbum/#B27GY3uThGoBcGj

Singapore (all combined): https://www.icloud.com/sharedalbum/#B275qsTcwJKJjl

Hong Kong (transit): https://www.icloud.com/sharedalbum/#B2759v1AbS8Hve

Taiwan: https://www.icloud.com/sharedalbum/#B27GQD2D7Gw0hAp

India (combined): https://www.icloud.com/sharedalbum/#B27Gtue8VQy83g

Freycinet: https://www.icloud.com/sharedalbum/#B27G5VpecGE5Tbg

Triglav: https://www.icloud.com/sharedalbum/#B275MbK9Vy8erz

Shared with me: https://www.icloud.com/sharedalbum/#B27GGXqixzPOrm

Mount Wellington climbing: https://www.icloud.com/sharedalbum/#B27Gd59qiG8Kjy4

New Zealand combined (2004): https://www.icloud.com/sharedalbum/#B27GIZ8BIGNN5jy

New Zealand combined (2005): https://www.icloud.com/sharedalbum/#B27GcuRfIGFVIcL

Yea: https://www.icloud.com/sharedalbum/#B27GZYbYHGhFIir

Mount Pleasant: https://www.icloud.com/sharedalbum/#B275Iy2hC0JTTL

D’Aguilar: https://www.icloud.com/sharedalbum/#B27Gh7fzTGZBosS

Bali (2001): https://www.icloud.com/sharedalbum/#B27G1qNHBGOTbIr

Samba Ninjas: https://www.icloud.com/sharedalbum/#B27GG34bAzqQ0v

Armidale (misc): https://www.icloud.com/sharedalbum/#B27GSkLVwGyobbX

Emma’s party (2008): https://www.icloud.com/sharedalbum/#B275S2ms99Zyby

Goettingen (2011): https://www.icloud.com/sharedalbum/#B27JIrbT3Jsgxhd

South Coast track: https://www.icloud.com/sharedalbum/#B27G58NWBG6QyN7

Minsk (2006): https://www.icloud.com/sharedalbum/#B27G3JpSBGX1UkQ

Baden-Baden (2019): https://www.icloud.com/sharedalbum/#B27595X5HTVzJr

Berlin (combined): https://www.icloud.com/sharedalbum/#B27JqWzChJ6qizD

Switzerland (combined): https://www.icloud.com/sharedalbum/#B275zXwoYGJ6HMF

Italy highlights: https://www.icloud.com/sharedalbum/#B27G47PHQGoJium

Germany (misc): https://www.icloud.com/sharedalbum/#B275hPMfYGu5xVJ

Garmisch (2022): https://www.icloud.com/sharedalbum/#B27GFsbvlG9Xrr6

Germany (2019): https://www.icloud.com/sharedalbum/#B27G6Mn98G56Ncb

Garmisch (2006): https://www.icloud.com/sharedalbum/#B27J5lIdKGLC9KG

Baden-Baden (2005): https://www.icloud.com/sharedalbum/#B275sWRpHHQkt9

Berlin (2005): https://www.icloud.com/sharedalbum/#B27GgOQtrGjQrpH

Zugspitze (2005): https://www.icloud.com/sharedalbum/#B27G81mNdGcApGt

Amsterdam, Bristol (2006): https://www.icloud.com/sharedalbum/#B275B9SRzyBjlH

Baden-Baden (2006): https://www.icloud.com/sharedalbum/#B275eD9V79I2XR

Berlin (2006): https://www.icloud.com/sharedalbum/#B275toRf1fH8MD

Berlin, Jena (2007): https://www.icloud.com/sharedalbum/#B27GTI3fvGVgNit

Erlangen (2006): https://www.icloud.com/sharedalbum/#B27JrotZ2JpMb0i

Garmisch (2010): https://www.icloud.com/sharedalbum/#B27JPJPSiJurzNg

Germany (2010): https://www.icloud.com/sharedalbum/#B275FhYPQP650

Stuttgart (2006): https://www.icloud.com/sharedalbum/#B27GmitydGVVaZh

Changi (2019): https://www.icloud.com/sharedalbum/#B27GnmlKoG4JHpX

Japan (2007): https://www.icloud.com/sharedalbum/#B275AerZbG6FxVL

Japan (2012): https://www.icloud.com/sharedalbum/#B27GjBjobGg6PUa

Miscellaneous (including Japan 2013): https://www.icloud.com/sharedalbum/#B27GTpbybGySbE8

Currumbin & Tugin (2021): https://www.icloud.com/sharedalbum/#B275vBKZ4xH9X6

Brisbane (2021): https://www.icloud.com/sharedalbum/#B275YHsSjxQnm0

Weed in Byron (26/6/2025): https://www.icloud.com/sharedalbum/#B275Q2ydoGsQ4O5

Weed in Byron 2: https://www.icloud.com/sharedalbum/#B27GQDYhLGwsuY4

August 24, 2025

Peter Rohde You just can’t get rid of some useless ASIO sluts

August 23, 2025

Peter Rohde Stalking

August 22, 2025

Peter Rohde Why?

  1. The person dressed up as Ursula pretending to be my mother clearly isn’t and hasn’t been for a long time.
  2. When I went back to Armidale after leaving BTQ and being left unemployed she made numerous ongoing promises to provide me with assistance, both in obtaining my own accommodation and providing financial assistance.
  3. These didn’t materialise and the promises were revoked.
  4. Instead I was evicted from the family home and subject to ongoing stalking and harassment that required multiple referrals to law enforcement, both to the police and the Attorney-General, demanding cease and desist.
  5. These have been systematically ignored and up until the last message she continues to bypass these requests, approaching my personal friends to harass me and stalk me indirectly. The messages passed on are the usual fake “I’m worried about him” bullshit.
  6. Why has my family home been confiscated by security, who actively break the law by ignoring cease and desist from stalking notices made to law enforcement, forcing an unemployed civilian into ongoing homelessness since early in the year?
  7. What is the rationale for my eviction and being barricaded from my own home?
  8. I continue to face a medical blockade and am unable to access essential medicines. Seroquel scripts are deliberately delayed past known script deadlines to try and destabilise me.
  9. Vyvanse scripts are denied outright as the psychiatrist does not respond. He is also known to be a state actor.
  10. It has been repeatedly indicated to me not to worry about finances because they have my back. Instead now the only cash I have is that obtained from fully drawing out a cash advance against my credit card and it will only last days. At that point I’m on the street.
  11. Is everyone here on the same page as to what the deal is? If not, who is playing you off? They clearly need to be deposed.
  12. These are violations of human rights and constitute war crimes and crimes against humanity. Whoever is behind it needs to be removed. End of story.
  13. Who else is being subject to this kind of high level manipulation?
  14. It has been repeatedly suggested that full accountability for the lives of those I care for would be provided. This has not been forthcoming. It is also a violation of international law to not provide accountability for the lives of those who are known to have been threatened by the state. These are grounds for removal.
  15. Can anyone answer the question as to why I am in this situation? Who is even living in the family home? Some stooge dressed up as Ursula? It’s a poor lifestyle choice to say the least.
  16. It’s pretty obvious they’re trying to get rid of me and once they do they’ll get rid of all of you too.