
November 27, 2006

NIPS 2006

Posted by David Corfield

In a week’s time I shall be in Vancouver attending the NIPS 2006 conference. NIPS stands for Neural Information Processing Systems. I’m looking forward to meeting some of the people whose work I’ve been reading over the past twenty months. Later in the week I shall be speaking up in Whistler at a workshop called ‘Learning when test and training inputs have different distributions’, and hopefully fitting in some skiing.

In a way you could say all of our use of experience to make predictions encounters the problem addressed by the workshop. If we include time as one of the input variables, then our experience or ‘training sample’ has been gathered in the past, and we hope to apply it to situations in the future. Or, from experience gathered here, we expect certain things to happen there. How is it, though, that sometimes you know that time, space, or some other variable doesn’t matter, whereas at other times you know they do?
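As a toy illustration of that mismatch (my own sketch, not anything from the workshop): if training and test inputs follow different known distributions, importance weights $p_{test}(x)/p_{train}(x)$ correct an average computed from training data. All the numbers below are invented for illustration.

```python
# Illustrative covariate-shift sketch (all numbers invented): estimate the
# test-time expectation of f(x) = x^2 when training inputs come from N(0,1)
# but test inputs come from N(1,1), by importance-weighting training samples.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x_train = rng.normal(0.0, 1.0, 100_000)    # training inputs
f = lambda x: x ** 2                       # quantity whose mean we want

naive = f(x_train).mean()                  # ignores the distribution shift
weights = gaussian_pdf(x_train, 1.0, 1.0) / gaussian_pdf(x_train, 0.0, 1.0)
reweighted = (weights * f(x_train)).mean() # importance-weighted estimate

# Under the test distribution N(1,1), E[x^2] = mu^2 + sigma^2 = 2.
print(naive, reweighted)
```

Of course this presumes we know both input densities, which is exactly what the hard cases lack.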

As concerns machine learning more generally, I’m at last beginning to get a picture of what I think needs to be done. We have a beautiful picture when performing inference with finite-dimensional parametric models. Take a space $X$ equipped with a measure. Now consider the Banach space, $B$, of all densities on $X$, and consider a linear map $A$ from $B$ to $\mathbb{R}^n$ which measures features of a density, perhaps its moments. We also need a map from $B$ to $\mathbb{R}$ which measures the Kullback-Leibler distance to a fixed reference density, and finally a map from $\mathbb{R}^n$ to $\mathbb{R}$ which is the indicator function of a point in $\mathbb{R}^n$, taking the value 0 at that point and infinity elsewhere. We then have a function from $B$ to $\mathbb{R}$ formed by summing the KL term and the composite of the indicator function with $A$.
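To make this concrete, here is a toy discrete version of the primal problem, with the space, reference density and target moment all invented by me for illustration:

```python
# Toy discrete version of the primal problem (all numbers invented): on a
# three-point space X, minimise KL(q || r) to a reference density r subject
# to the feature map A (here a single moment, the mean) hitting a target,
# by brute-force search over a fine grid on the probability simplex.
import numpy as np

x = np.array([0.0, 1.0, 2.0])     # the space X
r = np.array([0.5, 0.3, 0.2])     # fixed reference density
target_mean = 1.2                 # the moment the density must match

step = 0.001
p1, p2 = np.meshgrid(np.arange(step, 1.0, step), np.arange(step, 1.0, step))
p3 = 1.0 - p1 - p2
feasible = (p3 > step) & \
    (np.abs(p1 * x[0] + p2 * x[1] + p3 * x[2] - target_mean) < 5e-4)
q_grid = np.stack([p1[feasible], p2[feasible], p3[feasible]], axis=1)

kls = np.sum(q_grid * np.log(q_grid / r), axis=1)   # KL(q || r) per candidate
q_star = q_grid[np.argmin(kls)]
print(q_star, kls.min())
```

The indicator-function constraint appears here as the hard feasibility mask; everything off the moment-matching surface is effectively assigned infinite cost.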

Now construct the Legendre-Fenchel dual situation, adding the dual maps $(\mathbb{R}^n)^* \to B^* \to \mathbb{R}$ and $(\mathbb{R}^n)^* \to \mathbb{R}$. The optimised values of these dual functions are equal, given certain conditions. When the smoke has cleared you see that the density which matches the moments of the empirical distribution and which is ‘closest’ to the reference density corresponds to the member of an exponential family which matches the empirical moments. Projecting from the reference density onto the manifold of distributions with the same moments as the empirical distribution is equivalent to projecting from the empirical distribution onto an exponential family, these being subspaces of the space of all distributions which are everywhere perpendicular to the subspaces of fixed moments. Many famous families of distributions - Gaussian, exponential, binomial, Poisson - are exponential families.
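One can check the duality numerically on a toy three-point space (numbers mine): the KL-closest density with a prescribed mean is an exponential tilt of the reference, and the optimal primal and dual values coincide.

```python
# Toy numerical check of the duality (numbers invented): on a three-point
# space, the KL-closest density to a reference r with prescribed mean m is
# the exponentially tilted density q_lam ∝ r * exp(lam * x), and the dual
# value lam*m - psi(lam), with psi the log-partition function, equals the
# minimal KL divergence.
import numpy as np

x = np.array([0.0, 1.0, 2.0])
r = np.array([0.5, 0.3, 0.2])
m = 1.2

def psi(lam):                   # log-partition function of the tilted family
    return np.log(np.sum(r * np.exp(lam * x)))

def tilt(lam):                  # exponential-family member with parameter lam
    q = r * np.exp(lam * x)
    return q / q.sum()

# The mean of tilt(lam) increases in lam, so solve mean = m by bisection.
lo, hi = -20.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if tilt(mid) @ x < m:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
q_lam = tilt(lam)

primal = float(np.sum(q_lam * np.log(q_lam / r)))  # minimal KL(q || r)
dual = float(lam * m - psi(lam))                   # Legendre-Fenchel dual value
print(lam, primal, dual)
```

The equality `primal == dual` here is no accident: for a tilted density the KL divergence to the reference is exactly $\lambda m - \psi(\lambda)$.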

Last time I mentioned attempts to extend this analysis to infinite-dimensional exponential families. This is, in general, ill-posed: a finite amount of data must select a point in an infinite-dimensional space. But one can regularize the situation by requiring only approximate moment matching - the map from $\mathbb{R}^n$ to $\mathbb{R}$ picking out a region around a designated value rather than a single point. The Legendre-Fenchel dual of this is to put a prior on the exponential family, the solution then being the maximum a posteriori distribution.
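A toy version of the regularised problem (numbers again mine): on the dual side, relaxing exact moment matching amounts to adding a quadratic penalty on the natural parameter, i.e. a Gaussian prior, and the MAP density only approximately matches the target moment.

```python
# Toy regularised dual (numbers invented): instead of exact moment matching,
# maximise lam*m - psi(lam) - lam^2/(2*sigma2). The quadratic penalty acts
# like a Gaussian prior on the natural parameter lam, and the resulting MAP
# density matches the target mean only approximately, shrunk toward the
# reference density's own mean.
import numpy as np

x = np.array([0.0, 1.0, 2.0])
r = np.array([0.5, 0.3, 0.2])     # reference density; its mean is 0.7
m = 1.2                           # target mean
sigma2 = 0.5                      # prior variance on lam

lams = np.linspace(-5.0, 5.0, 20001)
psi_vals = np.log(np.exp(np.outer(lams, x)) @ r)   # log-partition at each lam
objective = lams * m - psi_vals - lams ** 2 / (2.0 * sigma2)
lam_map = lams[np.argmax(objective)]

q_map = r * np.exp(lam_map * x)
q_map /= q_map.sum()
print(lam_map, q_map @ x)         # mean lies strictly between 0.7 and 1.2
```

Shrinking the prior variance pulls the solution further toward the reference density, which is precisely the regularising effect needed in the infinite-dimensional case.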

To capture much of the work going on in machine learning we need to do all this in the space of conditional densities. If we’re looking to make predictions of output given input, there’s no need to model the input distribution. We just need a discriminative model to capture the functional relationship between inputs and outputs. And even in the situation to be discussed at the workshop, if we have the true distribution in our family, our solution will converge to it as data increases.

Now it looks like all of these ingredients - conditional densities, Legendre-Fenchel duality, infinite-dimensional exponential families - can be put together such that Gaussian process classification and Gaussian process regression are specific cases. However, Gaussian process people, as true Bayesians, aren’t happy with this, preferring the mean of the posterior to its mode. But all is not lost. Something I didn’t mention last time was Bayesian information geometry. What is the best estimator, i.e., before I see any data, what is the best I can do? Well, let $\tau$ be our estimator, taking any finite set of data to a probability distribution. Then define the generalization error:

$E(\tau) = \int_p P(p) \int_z P(z|p)\, D(p, \tau(z)),$

taking $D(p, \cdot)$ to be the cost of misspecifying the true distribution $p$. What we want is an estimator which minimises this error. Snoussi explains (p. 6) how this best estimator sends $z$ to

$\int p \, P(p|z).$

Even if we need to work with a limited space of models for computational ease, we merely project this posterior distribution onto that space.
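A toy numerical check of this claim (candidates and weights invented by me): with the cost $D$ taken as KL divergence, the density minimising the posterior expected cost is exactly the posterior mean.

```python
# Toy check (all numbers invented) that the Bayes-optimal estimate under KL
# cost is the posterior mean: given posterior weights w_k over candidate
# densities p_k on three outcomes, the q minimising sum_k w_k * KL(p_k || q)
# found by grid search coincides with sum_k w_k * p_k.
import numpy as np

rng = np.random.default_rng(1)
candidates = rng.dirichlet(np.ones(3), size=5)   # possible true densities p_k
weights = rng.dirichlet(np.ones(5))              # posterior weights P(p_k | z)
posterior_mean = weights @ candidates

# Grid over the simplex of candidate estimates q.
step = 0.002
a, b = np.meshgrid(np.arange(step, 1.0, step), np.arange(step, 1.0, step))
c = 1.0 - a - b
inside = c > step
qs = np.stack([a[inside], b[inside], c[inside]], axis=1)

# KL(p_k || q) = sum_i p_ki log p_ki - sum_i p_ki log q_i, for every (q, k).
neg_entropies = np.sum(candidates * np.log(candidates), axis=1)
kl_table = neg_entropies[None, :] - np.log(qs) @ candidates.T
expected_cost = kl_table @ weights
best = qs[np.argmin(expected_cost)]
print(best, posterior_mean)
```

The reason is visible in the algebra: the $q$-dependent part of the expected cost is a cross-entropy against the posterior mean, which that mean itself minimises.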

Now, as he goes on to explain, there is plenty of scope for variation: using different distances between densities as costs, and then further using a weighted sum of the generalization error of the reference prior and the divergence of the prior from the Jeffreys prior (p. 10).

I can’t see anything standing in the way of transporting all of this over to conditional spaces. But whether it could tell us what to do when we know our models misspecify the true distribution, and when the training data follows a different distribution to the test data, is a different matter.

Posted at November 27, 2006 4:06 PM UTC

10 Comments & 1 Trackback

Re: NIPS 2006

Learning seems to be hard!

As a curious blogger I am reminded of the difficulties neuroscientists have had building neural network models which mimic our ability to learn symbols, as explained on Developing Intelligence. But by incorporating complexity from biologically-plausible simulations of neural computation, a system with generalized and symbol-like processing was recently built.

It doesn’t seem to become overtrained, and deals well with novel data sets:

“Rougier et al. note that one critical feature is shared by all these tasks: they all require that the network attend to only a single feature dimension at a time.

After this training, the prefrontal layer had developed peculiar sensitivities to the output. In particular, it had developed abstract representations of feature dimensions, such that each unit in the PFC seemed to code for an entire set of stimulus dimensions, such as “shape,” or “color.” This is the first time (to my knowledge) that such abstract, symbol-like representations have been observed to self-organize within a neural network.

Furthermore, this network also showed powerful generalization ability. If the network was provided with novel stimuli after training - i.e., stimuli that had particular conjunctions of features that had not been part of the training set - it could nonetheless deal with them correctly.

This demonstrates clearly that the network had learned sufficiently abstract rule-like things about the tasks to behave appropriately in a novel situation. Further explorations involving parts of the total network confirmed that the “whole enchilada” was necessary for this performance.”



A no-brainer hint that it can be done. :-)

Posted by: Torbjörn Larsson on November 28, 2006 12:26 AM | Permalink | Reply to this

Re: NIPS 2006

Very interesting, thanks. I feel myself pulled between the two poles of a pretty mathematical theory of learning, which seems more relevant to the algorithms performable on a digital computer (at least, the algorithms are practical approximations), and a more neuroscientific learning theory, which is not so easily mathematisable, but which recognises that we have extremely good examples of learning happening all around us in nature.

Posted by: David Corfield on November 28, 2006 12:13 PM | Permalink | Reply to this

Re: NIPS 2006

Hm. I botched that attempt at a joke. I guess it is really a question of whether it was a no-brainer or a brainer.

Anyway, thanks for fixing the link formats. Hopefully this comment will display a more readable name too.

Posted by: Torbjorn Larsson on November 28, 2006 5:13 AM | Permalink | Reply to this

Re: NIPS 2006

The 17 May 2006 PNAS article ‘Prefrontal cortex and flexible cognitive control: rules without symbols’ by NP Rougier et al may be most consistent with mathematical game theory through decision analysis of multiple sources of and pathways for information processing.

Perhaps biological evolution is an example of mathematical experimentation?

The neurosystem appears to be both a structural and functional tree pathway that should be graphable. The presence of saddle points is consistent with game theory.

Posted by: Doug on November 28, 2006 12:56 PM | Permalink | Reply to this

Re: NIPS 2006

Still, my sympathies lie with Gelfand.

Posted by: David Corfield on November 29, 2006 9:57 PM | Permalink | Reply to this

Re: NIPS 2006

Thank you for the reference to Gelfand.
Our perspectives may not be that different.

From my perspective, “molecular algorithms” are likely mathematical games based upon previous mathematical games [fundamental particle and force algorithms?] as in a manner of continuous transformations.

“… the behavior of physical systems is governed by minimality / maximality principles, and the optimal points …” sounds very much like saddle points which may serve as Nash Equilibria [NE].

The quest for GUT / TOE in physics may first require a mathematical unification and graph theory appears to link VOA with game theory [particularly with respect to energy economics].

Steven M LaValle [U-IL] has written ‘Planning Algorithms’ with this section:
13.5.2 Differential [Dynamic] Game Theory
1 There may be any number of players
2 The game may be zero-sum or nonzero-sum
3 The state may or may not be known
4 Nature can interfere with the game
5 Different equilibrium concepts such as NE can be defined
following a discussion of Newtonian, Lagrangian and Hamiltonian mechanics.

Posted by: Doug on December 1, 2006 1:59 AM | Permalink | Reply to this

Re: NIPS 2006

For how to harness the power of human cognition, try the videos discussed here. From the first video you learn that in 2003, 9 billion man-hours were spent playing solitaire, whereas only 20 million man-hours were needed to construct the Panama Canal. The speaker, Luis von Ahn, uses people’s enjoyment of games to get them to do something more useful, such as labelling images.

Posted by: David Corfield on December 1, 2006 11:05 AM | Permalink | Reply to this

Re: NIPS 2006

David wrote:

From the first video you learn that in 2003, 9 billion man-hours were spent playing solitaire, whereas only 20 million man-hours were needed to construct the Panama Canal.

So, we just need to convince people that digging a ditch in a malaria-ridden jungle is more fun than solitaire. Then we can build 450 new Panama Canals a year - for free!

Sorry. The idea of harnessing the human desire to play games is indeed intriguing. It’s also fascinating how the quest for more realistic video games is one of the main forces pushing personal computers, and chips in general, towards higher computing power. The computational power needed for text processing, playing music, even playing videos pales in comparison to that needed for realistic real-time simulation of complex scenes!

Speaking of which: did you know that over a million people live part of their lives in a 3d virtual reality environment called Second Life? This is run on a computer network called the Grid. They’ve got their own stores in there, their own art galleries, their own blogs, their own steam-powered skates, and everything!

In October 2006, the Grid was overrun by self-replicating grey goo. So, the worst nightmare scenario in nanotechnology has now come true — but in virtual reality, thank goodness.

You can tell we’re heading for a technological singularity, when stuff like this happens and lots of people don’t even hear about it!

Posted by: John Baez on December 1, 2006 11:29 PM | Permalink | Reply to this

Re: NIPS 2006

Perhaps converting the task of digging a canal into something pleasurable is beyond us, but I’m sure there is huge scope to convert boring tasks into more enjoyable ones. Alexandre Borovik remarks:

from a mathematical point of view, solving an (elementary-level) Sudoku puzzle is nothing more than solving a triangular system of Boolean equations by back substitution, something very similar to what we do after a Gauss-Jordan elimination in a system of simultaneous linear equations. But has anyone ever seen people on a train solving systems of linear equations from a newspaper?

So this concerns tasks at which computers excel, which can be configured to yield more or less pleasure for humans. Von Ahn reconfigures tasks which computers cannot do such as labelling images, by linking up pairs of people, unknown to each other, and getting them to try to score points by getting their labels to coincide. A boring task has become one which leads to people feeling they are making a deep connection with others. Apparently, some participants requested the e-mail addresses of their partners, although all they’d ever seen were strings of labels. This brings us to the virtual worlds you mentioned, and the need to find channels for emotional connection beyond the usual ones.

Hmm, no-one’s going to bring that grey goo into the Café, are they?

Posted by: David Corfield on December 3, 2006 4:01 PM | Permalink | Reply to this

Re: NIPS 2006

This is a fascinating subject. I work in the finance industry, where there is a rapid flight in the opposite direction, from human computation (trading floors) to computer computation (electronic exchanges).

By von Ahn’s prescription, perhaps the trading floors which have already become extinct are those which *didn’t* require human computational power to solve the problems of trade execution in their particular markets?

The few which continue to thrive (the CME/CBOT for example) perhaps have an element of human-reliance built in, intentionally or otherwise. But if someone could encode whatever that essence was into a von Ahn-style game, then they might be in trouble too!

Posted by: Allan Erskine on December 4, 2006 6:34 AM | Permalink | Reply to this
Read the post Back from NIPS 2006
Weblog: The n-Category Café
Excerpt: Background knowledge in machine learning
Tracked: December 13, 2006 10:18 PM
