## February 13, 2007

### Category Theoretic Probability Theory II

#### Posted by David Corfield Encouraged by the comment of a statistician as eminent as Phil Dawid, I shall continue with what category theory has to say about probability theory. Recall that we were considering a monad on the category of measurable spaces, $Meas$. For a number of reasons, researchers prefer to work with $Pol$, the category of Polish spaces, i.e., separable metric spaces for which a complete metric exists. We have an endofunctor $P$, which sends a space, $X$, to the space of probability measures on the Borel sets of $X$. $P(X)$ is equipped with the weakest topology which makes $\tau \mapsto \int_{X}f d\tau$ continuous for any $f$, a bounded, continuous, real function on $X$. We also need a map $\mu_{X}: P(P(X)) \to P(X)$:

$\mu_X (M)(A) := \int_{P(X)} \tau(A) M(d\tau)$

Now, there are interesting developments of this idea, such as Abramsky et al.’s Nuclear and Trace Ideals in Tensored ∗-Categories, which looks to work Giry’s monad into a context even more closely resembling the category of relations, but let’s look about to see if we can make contact with some probability theory. Well, a natural kind of thing to do would be to work out the algebras for the monad. And this is just what Ernst-Erich Doberkat does in Characterizing the Eilenberg-Moore Algebras for a Monad of Stochastic Relations. What you’re looking for are measurable maps between $P(X)$ and $X$, such that the ‘fibres’ are convex and closed, and such that $\delta_{x}$, the delta distribution on $x$, is in the fibre over $x$. And there’s another condition which requires a compact subset of $P(X)$ to be sent to a compact subset of $X$.

Now, as ever, $P(X)$ will support an algebra, $\mu_{X}: P(P(X)) \to P(X)$. This is the analogue of a free group being an algebra of the group monad. But just as there are many interesting groups which are not free, we should want to find algebras of Giry’s monad which are not of the $\mu_{X}$ form. Doberkat shows that for such an algebra $X$ must be connected, and suggests this example

$h: P([0, 1]) \ni \tau \mapsto \int_{0}^{1} t \tau (dt) \in [0, 1]$.

I haven’t worked out the details, but I have the suspicion that this might be $\mu_{\{0, 1\}}$. After all, probability measures on $\{0, 1\}$ are just binomial, parameterised by $p \in [0, 1]$.

The other example he gives has $X$ a bounded, closed and convex subset of $\mathbb{R}^n$, and probability measures being sent to their barycentre. It wouldn’t surprise me if this were the paradigmatic example of an algebra for the Giry monad. It subsumes the example above.

Finding centres of gravity also occurs in Bayesian information geometry. As shown on page 4 of Hichem Snoussi and Ali Mohammad-Djafari’s Information geometry and prior selection, given a prior distribution over probability distributions, a best estimator for the generating distribution after data has been received is expressible as the barycentre of the posterior distribution. But as we’re working in $P(X)$, this barycentre only makes sense with respect to a geometry on it. Now, the ‘sensible’ geometries form a 1-dimensional family, although only for one member of this family, corresponding to the Kullback-Liebler divergence, is the geometry of $P(X)$ convex, which I guess is what we need for an algebra. Perhaps this could be fixed, otherwise it’s yet another point for the KL-divergence.

Update: Doberkat has a longer article on Eilenberg-Moore algebras of the Giry monad as item 5 here. (Unfortunately, the monograph ‘Stochastic Relations: Foundations for Markov Transition Systems’ doesn’t appear to be available.) There are two monads being treated here, one which sends a Polish space to the space of all probability measures, the other to the space of all subprobability measures. The extra structure relating to these monads, is that of a (positive) convex structure. In the case of a convex structure, this intuitively captures the idea that a weighted sum of points in the space has barycentre within the space.

Doberkat’s work relates to the category of Polish spaces with continuous maps. He notes that it would be interesting to develop the theory for the general case of Borel measurable maps.

Posted at February 13, 2007 6:37 PM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1163

### Re: Category Theoretic Probability Theory II

Nice.. I wish I would have read this before.. I basically created my own formulation of these types of ‘stochastic categories’ (they are really determinstic!) as I was creating ways to analytically solved infinite-dimensional problems ‘exactly’ and without numerical methods except at the final stage of evaluating.. had to use loads of measure theory, capacities, lifting maps, harmonic and complex analysis, the Weil topology, analytical number theory and some other stuff.. I absolutely hate reading this stuff abstractly if no concrete examples are given.. give me some concrete generating functions and something I can *compute*! But, now that I have actual concrete realizations of things I can go back and find out what abstract concept they implement, sort of a backwards approach but it works.

Posted by: Stephen Crowley on February 13, 2007 9:28 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Something intriguing about this monad perspective is that it fits neatly with a Bayesian outlook. Bayesians like to think in terms of distibutions over distributions and form averages. For example, ask me about a coin you’re about to toss, and, so long as I take all exchangeable sequences (permutations) to be equally likely, I can represent my belief in a distribution over $p$, the binomial parameter. Ask me for my probability that heads will be tossed and I’ll calculate the expectation of that distribution.

I said

Now, the ‘sensible’ geometries form a 1-dimensional family, although only for one member of this family, corresponding to the Kullback-Liebler divergence, is the geometry of $P(X)$ convex, which I guess is what we need for an algebra. Perhaps this could be fixed, otherwise it’s yet another point for the KL-divergence.

One way out is to observe that for $\delta \in [0,1]$, where $\delta$ parameterises the family of geometries, the space of subprobability measures is $\delta$-convex. Hence the map taking a distribution over subprobability measures to its barycentre with respect to one of these geometries constitutes an Eilenberg-Moore algebra for the subprobability monad.

Posted by: David Corfield on February 14, 2007 12:56 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

David wrote:

For example, ask me about a coin you’re about to toss, and, so long as I take all exchangeable sequences (permutations) to be equally likely, I can represent my belief in a distribution over p, the binomial parameter. Ask me for my probability that heads will be tossed and I’ll calculate the expectation of that distribution.

Indeed, if you do not begin with the apriori assumption that the coin is unbiased but only that your state of ignorance (since you have yet to flip any coins) implies that the conditional probability is 1/2, but Laplace’s rule of succession implies that if your first flip turns up heads, then you are less likely to receive another heads in the next flip, but you still can and usually do, but if you do, then you are even LESS likely to receive another, but it is still possible, in fact there is a non-zero chance that you could flip a coin and get heads each time but Laplace’s rule ensures that that usually never happens and is a law in some sense. Frequentist probabability is totally bogus based on such a revelation because it ignores the “time-dynamics of probability”. Good gamblers know this well. Now, extending these concepts to continuous-time is quite a challenge because the problem immediately becomes infinite dimensional, as in, how do you separate ‘instants’ and project expectations for the future? It can be done but it is very unintuitive. Naely, analytical number theory and analytic continuation comes in handy by using the fact that the gamma function is the real/complex generalization of factorials(combinatorics)! Now, build a sigma-algebra or delta-ring out of all this..

Posted by: Stephen Crowley on February 14, 2007 5:59 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Laplace’s rule of succession implies that if your first flip turns up heads, then you are less likely to receive another heads in the next flip

Isn’t this backwards? If your first flip is heads, then this is evidence that the coin is biased. If you believe that the coin flips are independent (the coin has no memory), then the probability of subsequent heads increases. In an extreme case, after 1000 successive coin flips, I have extremely high confidence that the next flip will be heads (it’s a trick coin).

Certainly this is how Laplace applied his rule of succession to the Sunrise Problem. (After 2 million successive sunrises, the probability of no sunrise tomorrow has shrunk to less than 1 in 2 million.)

Posted by: Toby Bartels on February 14, 2007 7:20 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Toby wrote:

Isn’t this backwards? If your first flip is heads, then this is evidence that the coin is biased. If you believe that the coin flips are independent (the coin has no memory), then the probability of subsequent heads increases. In an extreme case, after 1000 successive coin flips, I have extremely high confidence that the next flip will be heads (it’s a trick coin).

I might have it exactly backwards. I’m thinking of the conditional probability of successive coinflips given apriori no knowledge of the biasedness of the coin and starting off with a ‘prior’ of 1/2. It also depends on what you call ‘memory’, as I know of no system, even in theory, that does not have memory. I realize it is popular to call Markov processes ‘memoryless’ but in fact they have memory, otherwise an infinite set of flips would not have an “average” outcome of 1/2..and in fact they don’t, as this mathematical coin only has 2 states, 1 and 0 (or whatever symbols you want to use in place of 1 and 0.. -1 and +1 are natural choices). So, what good is a theory that gives a single probability for “the outcome” of a coinflip to be 1/2 when in fact that cannot be an outcome.

The real answer is that before event 1(1st event) say that E(1)=1/2 and E(1)=1/2.

event 1:
If you get observe 1, then the conditional probability becomes E(0)=1/3 and E(1)=2/3.

event 2:
If you then observe another 1, then the conditional probability becomes E(0)=1/4 and E(1)=3/4, but if you were to have observed 0 then the probability come back to E(0)=2/4=1/2 and E(1)=2/4=1/2

Obviously, this is a simplified model and applies only to coins in discrete time but it can be generalized to complex numbers and dependent coin-flips and continuous-time, some sort of self-avoiding “quantum” random walk.

Certainly this is how Laplace applied his rule of succession to the Sunrise Problem. (After 2 million successive sunrises, the probability of no sunrise tomorrow has shrunk to less than 1 in 2 million.)

Right, but it should track the probability as closely as possible should there suddenly be deviations from that sequence. It has to observe and ‘reason’ in some sense.

The point is, in the continuous dependent coin-flipping thought-experiment the biasedness is time-varying and must be estimated ‘optimally’ and its infinite-dimensional, think of an infinitely expanding tree structure where you confirm or falsify paths down the tree as observations arrive, this is also even “infinitely infinitely dimensional” if there is error in the observation and you cannot even be completely certain of things you ‘observed’.

In the observation with error (think about it, what other useful observations are there besides ones with error), you cannot completely exclude paths that did not occur because they all occured to some extent and are simply weighted rather than removed.

Posted by: Stephen Crowley on February 14, 2007 7:59 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Actually, those probabilties are inverse, but swap “E(0)’ and “E(1)” where it is wrong, then it is right. The point is that the ratio of heads to tails is a mean-reverting process around a mean value of 1/2 for an unbiased perfect coin. This number, 1/2 is somehow “built into” the law of chance and this is what I mean by saying that coinflips have memory, because it somehow remembers that it should be “hovering around” 1/2 in a mean-reverting fashion, and its impossible to actually attain a probability of 0 or 100%, but it can come vanishingly close to either in theory,

Posted by: Stephen Crowley on February 14, 2007 11:07 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

If you get observe 1, then the conditional probability becomes E(0)=1/3 and E(1)=2/3.

Correct! I mean that this is what Laplace’s Rule of Succession gives.

(Of course, if you have some a priori reason to believe the coin fair —or biased for that matter—, then you’ll have different expectations.)

Posted by: Toby Bartels on February 14, 2007 11:43 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Assume that you are going to put all your faith in exchangeability, e.g., that whatever you see will never cause you to think the flipping mechanism has changed, even if you see 1000 heads then 1000 tails. In this case de Finetti showed that your degree of belief is representable as a probability distribution over Binomial distributions. Think of a function, $f(p)$, on $[0, 1]$ which integrates to 1, representing how you think the coin may be biased. E.g., if I’m dealing with a crook, I might have a spread out $f$. If I’ve examined the coin and flipping mechanism, I might have $f$ peaked around 0.5.

Then as data comes in my $f$ is updated, via Bayes’ theorem, eventually leading to a highly peaked distribution about the relative frequency of heads.

When asked for my odds for the next toss I calculate $\int_{0}^{1} p f(p) d p$, with my latest $f(p)$.

Now I certainly don’t want to get into a philosophy of probability debate. What I’m interested in is that one large difference in statistical learning theory is that one side (frequentists) tends to find modes and the other side (Bayesians) means. It’s curious that the approach via monads has a mechanism for taking the mean of a distribution of distributions, and thus appears to favour the Bayesian side.

Posted by: David Corfield on February 14, 2007 11:09 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Assume that you are going to put all your faith in exchangeability, e.g., that whatever you see will never cause you to think the flipping mechanism has changed, even if you see 1000 heads then 1000 tails. In this case de Finetti showed that your degree of belief is representable as a probability distribution over Binomial distributions. Think of a function, f(p), on [0,1] which integrates to 1, representing how you think the coin may be biased. E.g., if I’m dealing with a crook, I might have a spread out f. If I’ve examined the coin and flipping mechanism, I might have f peaked around 0.5.

Yes, it’s all based on exchangeable sequences. For that to be valid, you have to have some reason for assuming dependence between observations. The independence assumption seems only valid for “low frequency” or “static” applications, which I find boring. de Finetti’s classical theorem is nice for sure, but unless I’m missing something it only applies to real numbers and hence is not algebraicaly complete. The Quantum de Finetti theorem which is used in quantum measurement theory makes a lot more sense to me because of the nice properties of complex numbers.

Dealing with only real numbers leads to all sorts of weirdness if you try to take time into account, and the weirdness goes away in the complex/time domain.

Now I certainly don’t want to get into a philosophy of probability debate. What I’m interested in is that one large difference in statistical learning theory is that one side (frequentists) tends to find modes and the other side (Bayesians) means. It’s curious that the approach via monads has a mechanism for taking the mean of a distribution of distributions, and thus appears to favour the Bayesian side.

I guess I live in a camp of my own then, because I rather deal with an infinite set of complex modes, time-evolution operators on these modes, and rules for updating modes based on likelihoods and indeed the system boils down to a Hamiltonian due to the conservation of probability. In practice and out of necessity only a finite number of these modes can be calculated, but I can take to any degree of accuracy that I desire.

Posted by: Stephen Crowley on February 15, 2007 12:28 AM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

This monadic perspective is exactly how some people model probability distributions in the programming language Haskell. There’s at least one Haskell library that does this. Of course in Haskell we’re dealing with finite spaces and so the definition of μ there involves only a finite summation. But in essence it’s the same thing.

Posted by: Dan Piponi on February 17, 2007 1:35 AM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

If you’ll permit a question from a humble cognitive scientist…

In the article “The converse of a stochastic relation” Doberkat seems to be reinventing Bayesian inference in the category theoretic setting. Indeed, the first equation in that paper is (i think) just Bayes’ rule. Yet he never mentions Bayesian learning… so am I missing something here?

If not, to what extent is the Giry monad the “right” structure to describe the logic of Bayesian induction? Should we Bayesians be describing our models in the internal language of one of the categories associated with the Giry monad? If so, which and what does that language it look like?

Posted by: Noah Goodman on February 18, 2007 2:03 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Yes you’re right that the first equation is Bayes’ theorem. (Paper 15 here.) However, Bayes’ theorem is not intrinsically Bayesian. It’s just a truth about probabilities.

On the other hand, the Theorem is used often by Bayesians as a means of updating their probability assignments (or ‘degrees of belief’) in situations where it makes no sense to a frequentist. For example, one uses $P(h|e) = P(e|h).P(h)/P(e)$ to find the support given to hypothesis $h$ by evidence $e$. For many hypotheses $P(h)$ will make no sense for the frequentist. Further, but there are provisos, if $e$ is all you learn, the new $P(h)$ will be the old $P(h|e)$.

Doberkat doesn’t discuss what kinds of element are to be found in his Polish spaces so may be a frequentist for all I know. My point about the monad picture fitting nicely with Bayesianism is that the map $\mu_{X}: P P(X) \to P(X)$ is something the Bayesian can understand directly, while for the frequentist I can’t see that it could be more than a tool for composing conditional probabilities.

Posted by: David Corfield on February 18, 2007 6:46 PM | Permalink | Reply to this
Read the post Category Theory in Machine Learning
Weblog: The n-Category Café
Excerpt: Does category theory have a future in machine learning?
Tracked: September 5, 2007 4:03 PM
Weblog: The n-Category Café
Excerpt: My colleague here in Canterbury Jon Williamson is part of an international research group, progicnet, whose aim is to find a good integration of probability theory and first-order logic. For one reason or another, some technical projects get counted...
Tracked: September 18, 2007 11:38 AM

### Re: Category Theoretic Probability Theory II

Roland Friedrich and I were thinking on Giry-Lawvere monad and the variants thereof in few meetings in last two years. We never thought enough through to finish a research but we realized that there are many interesting twins of the Giry monad, for example when one considers projection valued measures instead of scalar valued and then one sees some connections to the subject of coherent states, the subject I was earlier working on in connection to quantum groups. We checked few new analytic details which are more nontrivial when the measure is valued in projective operators rather than scalars, but I do not have any complete account of this. We discussed many other measure like concepts which could benefit from this approach.

For newcomers all that business of the Giry monad is an elaborate version of thinking via characteristic functions like in the primitive case of power set monad, and also delta functions, thus the idea has to do with classifying objects on one side and with measure theoretic concepts on analysis side, and also of reproducing kernels in coherent states and evolution story business. Giry was a student of Lawvere and she wrote an elaboration of an earlier preprint of Lawvere and published it as

Giry, Michèle
A categorical approach to probability theory. Categorical aspects of topology and analysis (Ottawa, Ont., 1980),
pp. 68–85,
Lecture Notes in Math., 915, Springer.
MR83k:60007

There are some analytic incorrectnesses there I was told. Prof. Voevodsky gave some talk at IHES where he had approached some biology-motivated stochastics via a refined Giry approach, I was told by Roland who got then interested in. A priori I think the approach is not giving something deep, but it is an interesting source of analogies. There are many papers after Giry, mainly in the computer science literature. Roland drew my attention to Doberkat’s paper as one of very few which are actually good among the mentioned papers. Though good part of the paper is only a survey.

Posted by: Zoran Skoda on September 19, 2007 6:19 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Does anyone know more about Voevodsky’s approach? It was announced here:

…a categorical study of probability theory where “categorical” is understood in the sense of category theory. Originally, I developed this approach to probability to get a better understanding of the constructions which I had to deal with in population genetics. Later it evolved into something which seems to be also interesting from a purely mathematical point of view. On the elementary level it gives a category which is useful for the work with probabilistic constructions involving complicated combinations of stochastic processes of different types. On a more advanced level, applying in this context the old idea of a functor as a generalized object one gets a better view of the relationship between probability and the theory of (pre-)ordered topological vector spaces.

Posted by: David Corfield on March 25, 2009 11:05 AM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Those reading this topic may be interested in: N. N. Cencov. Statistical Decisions Rules and Optimal Inference. Translations of Mathematical Monographs. Volume 53. American Mathematical Society. 1982.

Cencov’s “category of statistical decisions” coincides with the category of Giry and Lawvere. I believe that Cencov discovered it independently although many years after Lawvere.

Posted by: Ralph Wojtowicz on December 23, 2009 4:13 AM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Does anyone know more about the connexion between Bayes learning rule and the multiplication of the Giry monad? I would be curious to know.

Posted by: Vincent Danos on September 9, 2010 6:33 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

One thing that’s interesting about conditionalizing is that it is an example of projection from the initial distribution onto a subspace of distributions satisfying a constraint. I discuss a paper by Harremoes on this in this post. Projection like this happens very frequently in inference such as when matching a moment of an empirical sample. Inference takes the form of passing down a geodesic in the space of distributions.

I wish I had a better sense of how category theory and the information geometry picture could be made to work together. An arrow between two objects $A$ and $B$ in the Kleisi category of the Giry monad corresponds to a conditional probability $P(b_i|a_j)$. Could one see the category as enriched over the category of some differential geometric spaces? Perhaps Cencov’s book might help.

Anyhow, conditionalizing on $A = a$, is a little like you’ve taken the Kleisli category of Giry’s monad, and then turned $A$ into a single element set, such that an old distribution on $B$, $f: 1 \to B$, becomes $f': 1 \to B$, $P(b_i|a)$.

Posted by: David Corfield on September 10, 2010 6:11 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Hi David

Thanks for your answer. I had a look a the paper. Generally I am interested in how the Bayesian language seems to be a naturally higher-order probability language, and in understanding all these confusing statements about “outliers”, “confidence intervals”, “moderately strong evidence for a significant level of something” …

Following on your earlier post I thought the Giry monad could be a nice way to clarify the grammar of beliefs and learning. And then, for instance, to connect this with the various notions of approximations of Markov chains Panangaden and others (including myself) have developed over the years.

Anyway, here is my rephrasing of Bayes learning.

Say we have the following ingredients:
- X a finite measurable space with discrete sigma-algebra (say)
- p ∈ GX a hidden probability on X
- P = ⊕ f_i p_i ∈ GGX a belief represented as a probability on GX
- s an observation on multisets over X

We can equally write P(p_i)=f_i (or more rigorously P({p_i})=f_i).

By multiplication, we have μP(A) = \sum_i f_i p_i(A) - a sort of majority vote where f_i is the weight accorded to p_i in the prediction. The idea for learning is that we sample repeatedly from the hidden p, which gives us the observation s above, and we modify the weights in the majority vote of P in order to get closer to the real p.

Specifically, one modifies f_i (note that the support of P ′ remains the same as that of P) as follows:

f_i′/f_i = p_i(s)/μP(s) (1)

I am sligthly abusing notation here since p and μP are not really defined on multisets, but we can extend them using the “independence map” GX → G(multiset(X)); defined as p(s)=\prod x\in X p(x)^(s(x)) where s(x) is the number of occurrences of x in s.

P is called the prior, P’ the posterior. If we rewrite the above as a sort of Bayes formula

P(p_i | s) · μP(s) = P(s | p_i) · P(pi | ∅) (1’)

we see that up to the normalising factor μP(s), we are saying that P(p_i) has to be “nudged” by how likely s -which is what we get- is, assuming p_i.

Write s · P for P’. I like that notation since it reminds me that the observable acts on GGX, ie gives me a deterministic update instruction.

This defines a Markov chain (MC) on GGX defined as

K(P, s · P ) = p(s) (2)

that is to say we are ‘walking’ stochastically on GGX, so the kernel K ∈ [GGX;GGGX] might have a steady state in GGGX - but in fact the interesting limit is a “point-mass” in GGX. That is to say s · P → δp as |s| → ∞, where δ: GX → GGX is the unit, p the real probability (which we suppose here is in the support of P), |s| the size of s. To prove this, one can see readily that:

log(s · f_i/s · f_j) ∼ |s| × KL(p, p_j) ≥ 0 (3)

assuming p=p_i is the real probability, where KL is the relative entropy of p and p_j (aka the Kullback-Leibler divergence).

I imagine that the above justifies the update rule (1) as a good one, as it does eventually find the solution; it also shows that KL is an natural tool to assess convergence of this update rule.

So my question is, is that kind of view elaborated in the compsci/maths literature?

Posted by: Vincent Danos on September 15, 2010 1:13 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Even thinking through the set up of your $\sum f_i p_i$ is not so obvious. If that sum is over $i = 1,...,n$, I guess we could have the composition of the arrows

$1 \stackrel{f}{\to} [n] \stackrel{p_{(-)}}{\to} X,$

giving a single distribution over $X$.

If the $p_i$ were parameterised in another way, say continuously, we would need that parameter space instead of $[n]$.

Posted by: David Corfield on September 16, 2010 10:58 AM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

I’ve long been fond of the thought that one should be in a position to know how one will respond to data even before one knows what the data will be. In an idealised sense, one may be surprised by the world, but one shouldn’t be surprised by one’s response to the world.

Let me give a relevant extract from a paper of mine, ‘Varieties of Justification in Machine Learning’, Minds and Machines 20(2), 291-301.

Imagine my task is to find the best estimator taking values in a class of probability distributions over a space $Z$, based on a sample from $Z$. Best is meant here in the sense of a least ‘distance’ from the true distribution to my estimate, the distance often being the Kullback-Leibler divergence. Now, there’s no reason I shouldn’t be able to describe my estimator before I see any data, in other words my state of belief should tell me how I’ll act whatever data I meet with. So let $\tau$ be my estimator, taking any finite set of data to a probability distribution. Then define the generalisation error for this estimator as follows:

$E(\tau) = \int_p P(p) \int_{\mathbf{z}} P(\mathbf{z}|p)D(p,\tau(\mathbf{z})),$

where $D(p, .)$ is the cost of misspecifying the true distribution $p$. As you can see, this quantity is what you expect your loss to be, based on your belief about the true distribution $p$. What we seek is an estimator which minimises this expected error. Snoussi and Mohammead-Djafari explain (p. 312) how, in the case where $D$ is the Kullback-Leibler divergence, the best estimator is the function which sends $\mathbf{z}$ to $\int p P(p|\mathbf{z})$, the mean of the posterior distribution. This shows the coherence of the Bayesian approach. The idea then is, if you do not behave in a Bayesian way here, either not all relevant beliefs were encoded in your prior, or you received extraneous information, or you are behaving inconsistently with respect to the cost of misspecification.

Snoussi and Mohammad-Djafari (2002). ‘Information geometry and Prior Selection.’ In C. J. Williams (Ed.) Proceedings of 22nd International workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt’02), pp. 307-327, Moscow, Idaho, USA, August 2002.

In the language of your comment, I have a set of probabilities, $p_i$, among which I’m looking to find the true one. I give them a prior weighting $f_i$. Now I want a rule, $\tau$, so that when fed a sample from the set $X$, I will return an updated best estimate of $p$. The best rule is the one for which

$\sum_i f_i \sum_s P(s|p_i)KL(p_i, \tau(s))$

is minimised. This best rule is

$\tau(s) = \sum_i p_i f^'_i,$

where $f^'_i$ is the posterior weighting you described.

Posted by: David Corfield on September 16, 2010 2:56 PM | Permalink | Reply to this

### Rutherford’s tissue; Re: Category Theoretic Probability Theory II

David Corfield wrote: “one should be in a position to know how one will respond to data even before one knows what the data will be. In an idealised sense, one may be surprised by the world, but one shouldn’t be surprised by one’s response to the world…”

Since I have merely average common sense, and am a theoretician, I asked my wife, who has lots of common sense and is a good experimental Physicist.

To paraphrase her comment, of course one can be surprised by the world. She mentions Rutherford’s famous statement was that the result of the experiment of shooting alpha particles at gold foil, was “like firing a cannon ball at a piece of tissue paper and having it come back and hit you.”

The question is, was he surprised that he was surprised? He would hardly have thought beforehand “the gold foil will let most alpha particles through, because the gold atoms aren’t too close together, but some might be deflected by a small angle, and that will tell me something. And what if that small angle is roughly 180 degrees?”

This metaphysical stance of your neglects, my wife and I suspect, what Sir Arthur C. Clarke called “failure of imagination.”

She makes the toy problem: “I am going to measure a voltage. If it is roughly 1 volt, I will think and feel A. If it is roughly 2 volts, I will think and feel B. If it is roughly 3 volts, I will think and feel C.” But then she does the measurement, and it is 1,234,567 volts. Isn’t it rational for her to be surprised when her reaction is neither A, B, nor C?

Similarly, Sir Arthur C. Clarke describes “failure of nerve.” You are surprised by nature. You come up with an hypothesis outside the paradigm. Then you can’t believe your own idea, and reject it.

In short, I don’t think this is usefully cast in terms of probability distributions. I think it’s an epistemic situation.

Posted by: Jonathan Vos Post on September 18, 2010 4:21 AM | Permalink | Reply to this

### Re: Rutherford’s tissue; Re: Category Theoretic Probability Theory II

I wasn’t talking about surprise about one’s reaction of surprise. Of course from one’s own perspective it’s surprising that one’s surprised. Pr(surprise) = Pr(surprising event). But given that the surprising event occurred, the surprise is inevitable - Pr(surprise|surprising event) = 1.

I would be very surprised if a meteorite crashed through my roof, but I wouldn’t be surprised, given that such an event occurred, to find myself running away down the street.

Posted by: David Corfield on October 7, 2010 2:45 PM | Permalink | Reply to this

### Another Look at the Jackknife; Re: Rutherford’s tissue; Re: Category Theoretic Probability Theory II

I need to read and ponder your initial discussion again. My wife and my son don’t quite agree with me, and none of us seem to agree with you, yet you are an expert from whom I have learned a great deal.

I had a related conversation with Professor Emeritus Gary Lorden at a Caltech Math Department tea yesterday afternoon, but that’s slightly off topic. He’s also a noted expert in Statistics. We were also talking about Bootstrapping with a T.A. for one of Caltech’s Stat courses. Bradley Efron was one of this department’s brightest lights since 1970.

Posted by: jonathan vos post on October 7, 2010 7:17 PM | Permalink | Reply to this

### Burning Down the House; Re: Another Look at the Jackknife; Re: Rutherford’s tissue; Re: Category Theoretic Probability Theory II

I think that one of my son’s comments (neither he nor I are home right now, so this is paraphrase):

When civilians who’ve just rescued a stranger from a burning house, or soldiers who survived throwing themselves on a live grenade, are interviewed, they are almost always claiming to be surprised by what they did.

“If you’d asked me beforehand, I would never have thought of myself as a hero. I still can’t explain WHY I rushed through those flames.”

“Of course I hope that anyone in my platoon would have done the same for me, but I honestly wasn’t thinking when I did it.”

In a sense, your analysis is good math, and backed by a consistent metaphysical stance, but in an epistemic sense perhaps conflates what you know or believe with what you know that you know, know that you believe, or believe what you know.

This also came up in the Blue-Eyes Islanders puzzle.

Posted by: jonathan vos post on October 7, 2010 7:24 PM | Permalink | Reply to this

### Re: Another Look at the Jackknife; Re: Rutherford’s tissue; Re: Category Theoretic Probability Theory II

I agree my account is highly idealised. It’s not just that with our cognitive limitations we don’t have the power to work out in advance our responses to all eventualities. I don’t even think our responses on the fly are worked out in any kind of well-articulated way, as in your grenade case.

I was largely thinking of situations where the range of possible eventualities is restricted sharply – e.g., sampling from a population. Then it’s reasonable to have a policy of response in place. That the idealisation acts as a regulatory ideal more widely, however, may been seen in our preference for a government which has a broad range of contingency plans in place (to cope with terrorist acts, oil rig explosions, etc.).

Posted by: David Corfield on October 8, 2010 10:16 AM | Permalink | Reply to this

### Keep Watching the Skies; Re: Another Look at the Jackknife; Re: Rutherford’s tissue; Re: Category Theoretic Probability Theory II

I agree. Much of my two decades in the Space Program and Aerospace industry was precisely on Contingency Planning. I would search deep in probability space of very complex systems (Space Shuttle, Voyager Spacecraft, Galileo spacecraft) or low probability but high value anomalies that might occur. Then build an interdisciplinary team to come up with ways to detect these anomalies quickly, and retrofit the system with detection and execution ability to autonomously have the response, not merely available, but operational before Ground Systems even knew what happened. As to how well this maps into Public Policy space can be debated, but I personally and professionally think it is more than mere analogy. For example:
Using a Complex Systems Approach to Study Educational Policy.

S. Maroulis, R. Guimera, H. Petry, M. J. Stringer, L. M. Gomez, L. A. N. Amaral, U. Wilensky. Complex Systems View of Educational Policy Research. Science, 2010; 330 (6000): 38 DOI: 10.1126/science.1195153

Posted by: Jonathan Vos Post on October 11, 2010 2:24 AM | Permalink | Reply to this
Read the post What Is This Category Enriched In?
Weblog: The n-Category Café
Excerpt: Searching for enrichment in a stochastic category
Tracked: September 14, 2010 11:29 AM

### Re: Category Theoretic Probability Theory II

Does anyone have an electronic copy of F. W. Lawvere’s “The Category of Probabilistic Mappings,” 1962? I would appreciate getting a copy of it.
Thank you,
Joe Peri

Posted by: Joe Peri on May 3, 2012 6:13 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

You could try to request it from here.

Posted by: David Roberts on May 4, 2012 1:56 AM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Jeremy Gibbons announced on the categories list that he found a copy.

Posted by: Bas Spitters on May 6, 2012 3:03 PM | Permalink | Reply to this

### Re: Category Theoretic Probability Theory II

Thanks to David Roberts, I see there is a new article out – A Categorical Foundation for Bayesian Probability by Jared Culbertson and Kirk Sturtz.

We covered quite a lot of ground over the years (here, here and here, the Progic series I, II, III, IV, V, convex spaces and some machine learning posts).

From a brief glance, something new in the paper is the treatment of Bayesian updating as an inference map in section 4.

Posted by: David Corfield on May 9, 2012 1:33 PM | Permalink | Reply to this

Post a New Comment