Back from NIPS 2006
Posted by David Corfield
Aside from thinking up the second n-Café millennium prize, there wasn’t much time for higher-dimensional algebra last week while I attended the NIPS 2006 machine learning conference/workshop in Vancouver/Whistler. To find a role even for ordinary category theory in machine learning presents an interesting challenge. Most of what I’ve seen relates to neural networks, e.g., here.
Now, it looked as though neural networks had had their day, superseded by kernel methods, both support vector machines (SVMs) and Gaussian processes (GPs), as used by most of the people I work with in Tübingen. But there’s been a recent revival of interest in neural networks with the discovery of a new way to train deep networks, as Yoshua Bengio described.
Up in Whistler, days were like sandwiches, workshops providing the bread to the skiing filler. On Friday, I attended the Causality and Feature Selection workshop. Bayesian networks, which I happen to know quite a lot about through my friend Jon Williamson, featured prominently here.
On Saturday, I attended the related Learning when test and training inputs have different distributions workshop. The topic of this workshop is one where background knowledge is indispensable. Where the test and training distributions coincide, given enough training data I may expect my algorithm to converge to its optimal performance. I can't rely on this if my algorithm has to learn from data unrepresentative of what it will eventually be used for. And yet such inference is very much a part of human reasoning. I may never have set foot in Africa, nor ever been to a rainforest, nor ever seen a creature with ears of that shape and fur of that colour, but if I see it suckling its young I can feel confident about a huge number of facts relating to its anatomy and physiology.
So how to encode sufficient background knowledge to allow the transfer of inference to a broader domain? To do so in too domain-specific a way risks losing the autonomy prized in machine learning, which plays to a computer's strengths: rapid calculation and string-searching. An illustration I like to give here: when searching for the word 'kangaroo' in a long text, a human is very interested to learn that one of the chapter titles is 'Australian Marsupials', sensing that this will reduce the search space, whereas a computer can happily ignore this and just go string-searching. So researchers are looking to feed in a less specific form of background knowledge, perhaps ideally just a representational scheme which would allow this background to be learned.
We were encouraged to give talks which would provoke discussion rather than just turn the workshop into a mini-conference. My questions to the audience were whether they felt satisfied with the ways found so far to encode background knowledge, and further whether they felt that the frequentists (statistical learning theorists, Vapnikians, SVM users) had been more imaginative to date than the Bayesians (GP users) in doing so. Frequentists present naturally answered ‘yes’ to the latter question. If true, this might be due to the hampering effect of forcing background knowledge to be encoded as a probability distribution, or it might just be through lack of effort. Here are some examples of background knowledge:
Clustering: we can expect that the input variables of data points bearing the same label will lie close to one another. Imagine a data set consisting of two interspersed crescent shapes. One point in crescent 1 is labelled red, one in crescent 2 is labelled blue. Intuitively, you want to colour each crescent the same as its labelled member. But if you relied only on the labelled data, commonly used algorithms would simply place the classifying boundary along the perpendicular bisector of the segment joining the two labelled points. To use the unlabelled data you must feed in the assumption that points with the same label will tend to cluster. Vapnik's transductive inference managed to achieve this. Lawrence and Jordan recently added the notion of the null category to the GP set-up to the same end for the Bayesians.
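For concreteness, here is a minimal sketch of the two-crescents intuition in Python. It uses scikit-learn's two-moons generator and its graph-based LabelSpreading algorithm as a stand-in for the cluster assumption; this is neither Vapnik's transductive SVM nor the null category noise model, and the parameter choices are illustrative only.

```python
# Two interspersed crescents, one labelled point in each; a graph-based
# semi-supervised method recovers the crescents from the unlabelled data.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# Hide all labels except one point from each crescent (-1 marks "unlabelled").
y = np.full(len(y_true), -1)
y[np.where(y_true == 0)[0][0]] = 0   # one 'red' point in crescent 1
y[np.where(y_true == 1)[0][0]] = 1   # one 'blue' point in crescent 2

model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y)
print("Agreement with true crescents:", (model.transduction_ == y_true).mean())
```

A plain SVM trained on the two labelled points alone would give the perpendicular-bisector boundary; the graph built over the unlabelled points is what encodes the cluster assumption.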
Invariance under small changes: we can expect small translations, rotations, and thinning or thickening of lines of images of numbers to represent the same number. Frequentists got here first by finding support vectors (most informative inputs), then forming virtual support vectors by making slight modifications to the originals. This was copied in the GP setting.
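The virtual support vector trick can be sketched roughly as follows; this assumes scikit-learn's 8×8 digits data set, uses small pixel shifts as the invariant transformations, and picks hyperparameters arbitrarily rather than from the original papers.

```python
# Train an SVM, take its support vectors, add slightly translated copies
# of them ('virtual support vectors'), and retrain.
import numpy as np
from scipy.ndimage import shift
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

svm = SVC(kernel='rbf', gamma=0.001, C=10).fit(X, y)

# Create virtual support vectors by shifting each support vector one pixel.
sv_images = svm.support_vectors_.reshape(-1, 8, 8)
sv_labels = y[svm.support_]
virtual = [shift(img, (dx, dy), mode='constant').ravel()
           for img in sv_images
           for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
virtual_labels = np.repeat(sv_labels, 4)

# Retrain on the original data plus the virtual support vectors.
svm_virtual = SVC(kernel='rbf', gamma=0.001, C=10).fit(
    np.vstack([X, virtual]), np.concatenate([y, virtual_labels]))
```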
We heard of further methods devised by frequentists at the workshop:
Robustness under adversarial deletion: we may expect that informative inputs are robust enough that even if an adversary does their best to put us off with a given number of feature deletions, we’ll still be able to classify. It would be a bad symbol system if the removal of a single pixel from the image of a letter made it look like another. On the other hand, not many auditory features of ‘nineteen’ need be removed before it sounds like ‘ninety’, and letters are notoriously hard to hear over the telephone, such as ‘F’ and ‘S’.
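As a toy illustration of the threat model only (not the minimax training method presented at the workshop), one can train an ordinary linear classifier and then let a simulated adversary delete, for each input, the k features that most support the correct label; the data set and the number of deletions here are arbitrary.

```python
# How badly does a standard classifier degrade under adversarial deletion
# of its most helpful features?
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)
w = clf.coef_[0]

def adversarial_delete(x, label, k):
    # Contribution of each feature towards the true label's side of the boundary.
    contrib = w * x if label == 1 else -w * x
    x_del = x.copy()
    x_del[np.argsort(contrib)[-k:]] = 0.0   # delete the k most helpful features
    return x_del

k = 5
X_attacked = np.array([adversarial_delete(x, lab, k) for x, lab in zip(X, y)])
print("Accuracy before deletion:", clf.score(X, y))
print("Accuracy after deleting", k, "features per point:", clf.score(X_attacked, y))
```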
Structural correspondence: sentences taken from a medical journal may be very different from sentences taken from the Wall Street Journal, but still we may expect that words close to certain 'pivots' will be the same parts of speech. So a word appearing after 'a' and before 'required by' will most likely be a noun in both financial and medical contexts. If we have tagged a large corpus of financial text, then an algorithm which has learned to assign these tags well on the basis of the selected pivots should continue to work well on untagged medical texts. Shai Ben-David talked about generalization guarantees relating to this Structural Correspondence Learning. Keen as ever to get category theory in on the scene, I note that this work resonated with the discussion we had on analogies as spans.
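Here is a bare-bones sketch of the pivot idea. The tiny corpora, pivot list and feature template are invented for illustration, and the full Structural Correspondence Learning method additionally learns a shared low-dimensional representation by SVD of pivot predictors, which is omitted here.

```python
# Represent each word by indicators of domain-independent 'pivot' words
# appearing immediately to its left or right, train a tagger on the
# financial domain, and apply it to a medical sentence.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

PIVOTS = {'a', 'the', 'required', 'by', 'of', 'is'}

def pivot_features(tokens, i):
    left = tokens[i - 1] if i > 0 else '<s>'
    right = tokens[i + 1] if i + 1 < len(tokens) else '</s>'
    return {f'left={left}': 1, f'right={right}': 1,
            'left_is_pivot': int(left in PIVOTS),
            'right_is_pivot': int(right in PIVOTS)}

# Tagged financial-domain sentence (toy data).
financial = [('a', 'DET'), ('payment', 'NOUN'), ('required', 'VERB'),
             ('by', 'ADP'), ('the', 'DET'), ('bank', 'NOUN')]
tokens = [w for w, _ in financial]
X_dicts = [pivot_features(tokens, i) for i in range(len(tokens))]
y = [t for _, t in financial]

vec = DictVectorizer()
tagger = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_dicts), y)

# Untagged medical-domain sentence: 'dosage' sits in the same pivot context
# ('a' ... 'required') as 'payment' did, so it should also come out NOUN.
medical = ['a', 'dosage', 'required', 'by', 'the', 'patient']
X_med = vec.transform([pivot_features(medical, i) for i in range(len(medical))])
print(list(zip(medical, tagger.predict(X_med))))
```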
Re: Back from NIPS 2006
Interesting. In the abstract behind that link it says, among other things:
Is there a quick way to explain the main idea behind this, without having to read the entire document?