
Wednesday, January 13, 2010

The Open Source Process and Research

(Cross posted on mloss.org)

I think there is more to be learned from the open source software development process than just publishing the code from your papers. So far, we’ve mostly focused on making the software side more similar to publishing scientific papers, for example, through creating a special open source software track at JMLR.

Here are two aspects of that process which I think deserve a closer look:

  • “Release early, release often” Open source software is not only about making your software available for others to reuse; it is also about getting in touch with potential users as early and as closely as possible.

Contrast this with the typical publication process in science, where months pass between your first idea, the submission of the paper, its publication, and the reactions through follow-up and response papers.

  • Self-organized collaboration One nice thing about open source software is that you can often find an already sufficiently good solution for some part of your problem. This allows you to focus on the part which is really new. If existing solutions look sufficiently mature and their projects healthy, you might even end up relying on others for part of your project, which is remarkable given that you don’t know these people and have never talked to them. But if the project is healthy, there is a good chance that they will do their best to help you out, because they want their own project to have users.

Again, contrast this with how you usually work in science, where it’s much more common to collaborate only with people from your own group or within the same project. Even if someone were working on something which would be immensely useful for you, you wouldn’t know until months later, when their work is finally published. The effect is that there is lots of duplicate work, research results from different groups don’t interact easily, and much potential for collaboration and synergy is wasted.

While there are certainly reasons why these two areas are different, I think there are ways to make research more interactive and open. And while probably most people aren’t willing to switch to open notebook science, there are a few things which you can try out now:

  • Communicate with people through your blog, or via Twitter or Facebook, and let them know what you’re working on, even before you have polished and published it. And if you don’t feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the machine learning twibe, or my (entirely non-authoritative) list of machine learning twitterers, or lists of machine learning people others have compiled, or another list of machine learning related blogs.

  • Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many ways to go about this, either using a site like github, sourceforge, kenai, launchpad, or savannah, or by setting up a private repository, for example using trac, or just a bare subversion repository. It doesn’t have to be that complicated, though. You can even just put a git repository on your static homepage and have people pull from there. And of course, register your project with mloss, so that others can find it and stay up to date on releases.

  • Turn your research project into a software project to create something others can readily reuse. This means making your software usable for others, interfacing it with existing software, and also starting to reuse existing software yourself. It doesn’t have to be large as long as it’s useful. Have a look at mloss for a huge list of existing machine learning related software projects.

Thursday, November 12, 2009

Some Benchmark Data Sets

Sören Sonnenburg recently brought my attention to a few possibly lesser known benchmark data sets. Of course, benchmark data sets are always a double-edged sword: on the one hand, they are a great way to test and compare your learning algorithms, but on the other hand, you’re usually not solving any real problems anymore.

So you probably already know the UCI repository, or the DELVE repository. Here are a few links to probably lesser known benchmark data sets:

Generic Benchmarking

Multiple Kernel Learning

Bioinformatics

Image Processing

Monday, October 19, 2009

Machine Learning Feed Updates II

I’ve added a few more research blogs.

Taken from Mark Reid’s kith:

More additions:

I should probably collect all the entries somewhere… .

Again, in case you missed it, the feeds can be found in this post.

Wednesday, October 14, 2009

Machine Learning Feed (Update)

I’ve set up two more (sub-)feeds now, one for blogs by ML researchers, and one for papers. You can also have the full feed if you want. The reason for this split is that the preprint and journal feeds have a much higher volume and would drown out the actual blog posts.

So here are all your Machine Learning Feed options:

Feed         Google Reader          Feedburner
Full         Atom Feed / Web View   http://feeds.feedburner.com/mlfeed
Only blogs   Atom Feed / Web View   http://feeds.feedburner.com/mlfeed-blogs
Only papers  Atom Feed / Web View   http://feeds.feedburner.com/mlfeed-papers

Again, if you have any suggestions/comments, please let me know. I think it would be great if we could collect all the interesting people from our community.

In that spirit, I’ve added mlsec.org, a blog on machine learning and computer security by my colleague Konrad Rieck.

Tuesday, October 13, 2009

Machine Learning Feeds and Twitterers

I’ve collected some information about machine learning people on twitter and machine learning blogs.

Twitter

Blogs

A few days ago, I thought that it would be very nice to have some kind of blog aggregator like planet debian for machine learning. After some poking around I found that the easiest way to aggregate and publish RSS feeds is actually through Google reader. Here is the resulting machine learning feed. Feel free to subscribe to it!

Here is what is currently in there:

People/Websites/Companies:

Journals/Paper feeds:

If you have a suggestion of feeds to add, or want your feed removed for some reason, please let me know!

Update: I've set up two different feeds. You can either have the original Google reader feed, or the feedburner feed. The latter has more compact summaries, while the former might have a nicer web view.

More updates: I've put up different feeds for only blogs and only papers.

Monday, August 17, 2009

Twimpact Work In Progress

Twimpact has been running smoothly in its small niche of the internet, and we're currently trying to improve the way retweets are crawled and analyzed. The problem is that people often add a comment to the end of the original message, or even edit the original message, such that it's not straightforward to tell whether you're looking at a retweet or a new tweet.

There are also some more bugs, which will be fixed soon. For example, we apparently weren't handling underscores in user names correctly, such that "RT @nfl_games" became a retweet of the user "nfl" with the message "games", which has been retweeted more than 1800 times.
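To illustrate the kind of bug this is (a hypothetical reconstruction in Python, not our actual crawler code): if the pattern used to extract the retweeted user name omits the underscore from its character class, you get exactly the behavior described above.

```python
import re

tweet = "RT @nfl_games Week 1 kickoff times announced"

# Hypothetical buggy pattern: the character class omits the underscore,
# so matching stops at "_" and the rest is treated as the message.
buggy = re.compile(r"RT @([A-Za-z0-9]+)[_\s]*(.*)")

# Twitter user names may contain letters, digits, and underscores.
fixed = re.compile(r"RT @([A-Za-z0-9_]+)\s*(.*)")

print(buggy.match(tweet).groups())  # ('nfl', 'games Week 1 kickoff times announced')
print(fixed.match(tweet).groups())  # ('nfl_games', 'Week 1 kickoff times announced')
```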

We currently also don't filter out users who retweet a tweet repeatedly or who retweet their own tweets, leading to all kinds of retweet bots and retweet-spam networks being high up in our retweet trends. While that may not be so informative, it is still interesting to see what kind of business ideas people come up with around the twitter platform.

For example, dannywhitehouse apparently has a service called twitter-bomb which I guess does all kinds of nasty things certainly not covered by twitter's Terms of Service, but which has still managed to amass more than thirteen thousand followers.

In any case, we'll be rolling out the improvements soon, maybe this week, so stay tuned! The only problem we'll run into is that we have to reprocess all the tweets already in our database 8-O

Monday, August 03, 2009

Twimpact!


For the last one and a half months, Matthias Jugel and I have been working on a site which computes impact scores for twitter users based on how often their tweets become retweeted.

The project has been lots of fun so far. We first got the thing up and running around the time of the Iranian elections, and suddenly seeing all those tweets in real-time gave us the feeling of directly tapping into the twitterverse.

The winner, twimpact-wise, is clearly mashable, with a twimpact score of 89 right now and over ten thousand retweeted messages. Other top users include news sites like breakingnews, cnnbrk, and smashingmag, or celebrities with many, many followers like aplusk (Ashton Kutcher) or iamdiddy (P. Diddy).

On the entry page you can see a live view of what has been retweeted most in the last hour. It's quite interesting to see what pops up there. For example, surprisingly, there are many competitions of the form "retweet this and win a laptop", like this one, which has been retweeted over 1300 times. Another kind of retweet is the inspirational message from users like deepak_chopra, which people like to pass on. But apart from that you have, of course, current news, interesting links, and so on. These are mostly technology and web related, which reflects the user base of twitter quite well, I think.

So go ahead and compute your twimpact score, or just sit back and look at what people are currently retweeting.

Thursday, April 30, 2009

Machine Learning Twibe

Twibes is some new twitter-related website which manages topic-related groups of people and collects tweets based on up to three tags.

Since apparently nobody had done so yet, I set up a twibe on machine learning. Follow the link and click on "Join" on the right hand side.

Currently, the group is picking up tweets with either "machine learning", "#machlearn", or "#machine-learning" in them. Anyone got an idea how to improve the tags?

Friday, January 16, 2009

Science and the Market Metaphor

John Langford had an interesting post on what he calls the "adversarial viewpoint" on academia. Basically, the argument is that under this viewpoint, you assume that scientists in academia compete over a fixed set of resources (research money, positions, students) and that therefore all other scientists are your adversaries. He suspects that this might be one of the reasons behind the decline in reviewing quality he has observed in recent years at conferences such as NIPS.

John argues that the adversarial viewpoint might make sense, but that it is actually bad for science, because scientists become more focused on rejecting other papers, projects, or ideas than on being open to new developments.

I'm not sure whether the variable quality of NIPS reviews is really due to the enormous popularity of the NIPS conference and the load this puts on the program committee and the reviewers, or whether it is because people are actively destructive about their peers' work.

But I think this leads to an interesting question about the environment in which academia exists, why it is like it is and how it could be changed to lead to different viewpoints. Because if it really is a zero sum game, then it is not surprising that those who want to play the game successfully adopt an adversarial viewpoint.

A Simplified History of Public Funding for Scientific Research

I'm really not an expert in the history of science, but I guess it's not completely wrong to say that the way science is embedded in and supported by society changed dramatically in the 20th century. Beginning with industrialization, and in particular during World War II, it became apparent that having a productive scientific community is absolutely vital, both for the overall growth of the economy and for national security. For example, the National Science Foundation was created after World War II specifically with that goal in mind.

Naturally, if science was that important, the government had to take control and set up ways to maximize scientific productivity (because it had to make sure that the tax payer's money was used well). While in earlier times scientists were selected by personal preference and paid by rulers to work at their courts, managing science and setting up the rules of the game increasingly became a responsibility of politicians.

Applying the Market Metaphor to Science

The problem was, of course, that science had never been organized on this level, in particular not by non-scientists. The question basically was: how can we maximize the scientific output from a fixed amount of resources? I think this is a very important question, and I'm certain the optimal answer hasn't been found yet. Looking at how science is organized today, it seems that the answer was to transplant a well-known metaphor, that of the free market, where scientific ideas and research plans compete over grant money, slots in publications, and positions.

You can find this idea in many different aspects. For example, a grant call is like a customer expressing a certain need; companies (that is, scientists) then compete for that money, and the one who offers the best research for the money gets it. I'm not saying that this is a bad way to select whom to fund, but grant calls are used as a device to control the direction in which science progresses, and the question is whether this ensures the overall progress of science.

Another example is the way in which scientific output is measured by citation counts. A scientist (=company) produces a scientific publication (=goods). Such publications are then put on show in journals (=stores) where other scientists can cite them (=buy them). The productivity of a scientist is then measured by the amount of goods sold, where the quality of the store factors into the price paid by other scientists.

Science is not Economy

I'm not saying that this system does not work at all, but science and scientific research in particular have properties which conflict with the economic setup.

For one, as in art, there is a notion of quality for scientific work which is somewhat independent of whether the work competes well in the market. For example, it might be a brilliant piece of work, but with very little intersection with what other people are working on right now. Or it is not what the funding agencies are focusing on right now. I think every scientist has at least once experienced the conflict between what he considers good scientific work and what he has to do to get grant money. Put differently: if everyone just played the game (publish papers, get grant money, basically secure one's position in the field), would that alone ensure scientific progress?

Moreover, what gets published in journals is mainly managed through the peer review process. Translated to the economy, this means that your competitors have a lot of say in whether your products actually see the market. Imagine that before Apple could sell its new laptops, the store first asked Microsoft, Dell, and HP what they thought of them. It is clear that it's hard to do differently in science, because you need a lot of expertise to judge whether a paper is worth publishing, which cannot be done by the store owner alone; but still, this setup introduces a lot of interaction not present in a truly free market.

In science, significant progress often comes from sidetracks. While most people are working on extending and applying a certain scheme, new approaches are often found elsewhere and take some time before they can enter the mainstream. However, a mass market (and given the number of scientists today, it certainly is a mass market) tends to produce products for the masses, and it is unclear whether a remote idea can really get enough support to work.

Science as a whole is progressing, I guess, but I believe it partly is because people manage to play the game and do the research which matters to them at the same time.

A Way Out?

I have to disappoint you right away, because I do not know the solution. But I think actually seeing the difference between a free market and science is important, and I hope it will make you think.

Others have been braver in this respect. For example, people have thought about how to allocate money in a way which keeps us from just feeding the "mass market" and also allows small independent research projects. Lee Smolin suggests an alternative way to distribute grant money in his book "The Trouble with Physics". Siegfried Bär, in his book Forschen auf Deutsch (Research in German), also suggests how to improve the way research money is distributed in Germany. I won't go into detail here, but both authors think that the whole proposal writing business just takes up too much time, and that the process should become much more flexible, such that more researchers have time to actually do research, and on the topics they are interested in. Part of the money should even be spent on ideas which really don't seem that relevant (but on scientists who have otherwise proven not to be crackpots).

If you agree in principle that citation count is a good measure of scientific progress, and you believe in the market, then the problem remains that the scientific publication culture differs from a real market in that your competitors can veto whether the customers see your product at all. The question boils down to how to improve the reviewing process. Marcus Hutter has archived an email discussion from 2001 on his homepage on what alternatives there are to the existing review process. John Langford suggests also using preprint servers like the arxiv to get a time-stamp for your work, since you cannot be sure when you will manage to get it published.

I think people have naturally been thinking about improving the review process because in the age of the Internet, this is actually something we as scientists can actively control (as opposed to controlling funding policies). The whole system already depends on unpaid volunteers, so we should have enough manpower to run any other system as well if it gets enough support.

I'd like to repeat an idea of Geoffrey Hinton's from the above email discussion. He proposed a system where people put endorsements for papers on their homepages, together with a trust network you define for yourself: you register other scientists whose opinions you trust when it comes to which papers are worth reading. In 2001, the proposed setup was personal websites and a tool, but nowadays you would certainly turn this into some Web 2.0 application. citeulike seems to go in that direction, although its focus is currently more on organizing the papers you have read.

In essence, the goal is to make the path from company to customer much shorter, and in particular, to lessen the impact of your competition on whether your customer can buy your products, that is, cite your papers, or not.

Conclusion

So in summary, I think the framework within which academia lives is not altogether bad, but there is always room for improvement. Currently, the market metaphor is often applied blindly, without taking into account the peculiarities of scientific research, or the scientific community as a whole. The perception that academia is basically a zero-sum game, as voiced by John Langford, is directly based on the idea that science is a competition over fixed resources. As I have pointed out, the main difference is that science is also a bit like art, to the extent that it has its own internal notion of quality and soundness which cannot easily be grasped or measured in terms of economic concepts. If we manage to integrate these different aspects of science, we might eventually be able to find better ways to run academia.

Update (Jan 30, 2009): I found an interesting blog post by Michael Nielsen. His basic argument is that we not only need new ways of exchanging, archiving, and searching existing knowledge, but also a radical change of culture, potentially backed by new online tools. For example, he argues that it would be highly advantageous if scientists could easily post problems they are stuck on and quickly find other scientists who are experts on those problems. However, people might only be willing to do this if such contributions would be tracked the same way peer-reviewed publications are.

Interestingly, I see some parallels between these ideas and the way we have been setting up mloss.org and the Open Source Software Track at JMLR. We have provided both the tool, and a means to make publishing open source software accountable under the old metrics - peer-reviewed publications.

Monday, December 22, 2008

On NIPS 2008


Although I came home from last year's NIPS conference more than three weeks ago, I haven't yet found time to summarize my impressions. It's always like this: first there is the jet-lag, then Christmas and New Year.

But maybe it's not just the closeness to the holiday season, I think it's also that NIPS is so content-rich that you really need some time to digest all that information.

Banquet Talk and Mini-Symposia

This year in particular, as the organizers managed to cram even more sessions into the program. In previous years there used to be an invited high-level talk during the banquet, but this year the organizers chose to put two technical talks and virtually 20 poster spotlights into the banquet. Actually, I'm not sure whether this decision was wise, as I and most of my colleagues felt that dinner and technical talks don't go well together. Maybe it was also my jet-lag, as I arrived on Monday afternoon, not on Sunday like some people.

The second addition were the mini-symposia on Thursday afternoon, conflicting with the earlier buses to Whistler. I attended the computational photography mini-symposium and found it very entertaining. The organizers managed to put together a nice mix of introductory and overview talks. For example, Steve Seitz from the University of Washington gave a nice talk on reconstructing tourist sites in Rome from pictures collected from flickr. Based on these 3d reconstructions, you could go on a virtual tour of famous monuments, or compute shortest paths based on where pictures were taken.

So if I had anything to say, I'd suggest keeping the mini-symposia, but replacing the technical talks during the banquet with an invited talk as in previous years.

Presentations

With over 250 presentations, it's really hard to pick out the most interesting ones, and as the pre-proceedings aren't out yet (or at least not that I'm aware of), it's also hard to collect some pointers here.

There was an interesting invited talk by Pascal Van Hentenryck on Online Stochastic Combinatorial Optimization. The quintessence of his approach to optimization in stochastic environments was that often, the reaction of the environment does not depend on the action you take, so you can build a pretty reliable model of the environment and then optimize against that.

Yves Grandvalet had a paper "Support Vector Machines with a Reject Option", which proposes a formulation of a support vector machine which can also opt to say "I don't know which class this is".

John Langford had a paper, already available as a preprint at arxiv.org, on sparse online learning, which basically has the option to truncate certain weights if they become too small.
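To get a feeling for the general idea, here is a minimal sketch in Python of the overall scheme (my own simplified version, not the exact algorithm from the paper; all names and parameters are made up):

```python
import numpy as np

def truncated_sgd(stream, dim, lr=0.01, theta=0.01, period=10):
    """Online least-squares regression which zeroes out small weights
    every few steps, in the spirit of truncated-gradient methods."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        w -= lr * (w @ x - y) * x        # plain stochastic gradient step
        if t % period == 0:              # every `period` steps ...
            w[np.abs(w) < theta] = 0.0   # ... truncate near-zero weights
    return w

# Usage sketch: a stream of (x, y) pairs where only 3 of 100 features matter.
rng = np.random.default_rng(0)
w_true = np.zeros(100)
w_true[:3] = 1.0
stream = ((x, w_true @ x) for x in rng.standard_normal((5000, 100)))
w = truncated_sgd(stream, dim=100)
print((np.abs(w) > 0).sum())  # most weights end up exactly zero
```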

Every now and then there was an interesting poster with nobody attending it. For example, Patrice Bertail had a poster on "Bootstrapping the ROC Curve" which looked interesting and highly technical, but nobody was there to present it. At some point I started to discuss the poster with a colleague, but we had to move away from it after people started to cluster around us as if one of us were actually Patrice.

Michalis Titsias presented an extension of Gaussian Processes in his paper "Efficient Sampling for Gaussian Process Inference using Control Variables" to the case where the model is not just additive random noise, but actually depends non-linearly on the function on which the Gaussian process is placed. It looked pretty complicated, but it might be good to know that such a thing exists.

There were many more interesting papers, of course, but let me just list one more: "Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models" by Tong Zhang seemed like a simple method which combines feature addition with removal steps and comes with a proof (of course). I guess similar schemes exist by the dozen, but this one seemed quite interesting to try out.

The question I always try to answer is whether there are some big new developments. A few years ago, everybody suddenly seemed to be doing Dirichlet processes and some variant of the Chinese restaurant process. Last year (as I have been told), everybody was into deep networks. But often it's hard to tell, and this year was one of those. Definitely no deep learning, maybe some multiple kernel learning. There were a few papers which try to include some sort of feature extraction or construction into the learning process in a principled manner, but such approaches are (necessarily?) often quite application specific.

I also began to wonder whether a multi-track setup wouldn't be better for NIPS. This question has been discussed every now and then, always in favor of keeping the conference single-track. I think one should keep in mind that what unites machine learning as a community are new methods, because the applications are quite diverse, and often very specific. For a bioinformatics person, a talk on computer vision might not be very interesting, unless there is some generic method which is application-agnostic to a certain degree.

It seems that currently, most of the generic methods are sufficiently well researched, and people now start to think about how to incorporate automatic learning of features and preprocessing into their methods. As I said above, such methods are often a bit ad-hoc and application specific. I'm not saying that this is bad. I think one first has to try out some simple things before you can find more abstract principles which might be more widely applicable.

So maybe having a multi-track NIPS would mean that you can listen more selectively to talks which are relevant to your area of research and the list of talks wouldn't appear to be somewhat arbitrary. On the other hand, you might become even more unaware of what other people are doing. Of course, I don't know the answer, but my feeling was that NIPS is slowly approaching a size and density of presentations that something has to change to optimize the information flow between presenters and attendees.

Workshops

I've heard that some people come to NIPS only for the workshops, and I have to admit that I really like them a lot, too. Sessions are more focused topic-wise, and the smaller audience invites some real interaction. Whereas I sometimes get the impression that the main conference is mostly for big-shots to meet over coffee breaks and during poster sessions, it's in the workshops that they really participate in the discussion.

We had our own workshop on machine learning and open source software which I have summarized elsewhere.

I attended the multiple kernel learning workshop, which was very interesting, because most of the speakers concluded that in most cases, multiple kernel learning does not work significantly better than a uniform average of kernels. For example, William Stafford Noble reported that he had published a paper using multiple kernel learning in the Bioinformatics journal, and only afterwards decided to check whether unoptimized weights would have worked as well. He was quite surprised when the differences turned out to be statistically insignificant, and concluded that he wouldn't have written the paper in that way had he known the results before.

Francis Bach also gave a very entertaining talk where he presented Olivier Chapelle's work, as Olivier couldn't attend. He did a very good job, including comments like "So on the y-axis we have the relative duality gap - I have no idea what that is", and raising his hand after his talk to ask the first question.

All in all, I think this workshop was quite interesting and exciting, and also important for the whole field of multiple kernel learning: basically, to see that it doesn't just work, and to try to understand better when it doesn't give the improvements hoped for, and why.

Finally, many workshops were taped by videolectures.net. I've collected the links here:

Saturday, December 06, 2008

On my way to NIPS 2008

Next week is the annual NIPS conference in Vancouver. In case you don't know it, it's one of the most important annual conferences in the area of machine learning. It's actually its 22nd installment, and while the full title (Neural Information Processing Systems) hints at the field's beginnings in artificial neural networks, such methods cover only a small percentage of the presented papers nowadays.

Together with Soeren Sonnenburg and Cheng Soon Ong, I have organized a workshop on machine learning open source software. I'm pretty excited about our program; I only fear that our decision to drop the coffee breaks in favor of a few additional talks will backfire severely. But we'll see.

I plan to use my twitter account to cover the workshop this year, so if you're interested, make sure to have a look; among other things, you'll be able to see how far we are behind schedule. ;)

By the way, the picture above shows the floating gas station. It mostly services seaplanes, and you have a very nice view of it from the top floor of the hotel where the NIPS conference takes place.

Monday, November 17, 2008

On Relevant Dimensions in Kernel Feature Spaces

I finally found some time to write a short overview article about our latest JMLR paper. It discusses some interesting insights into the dimensionality of your data in a kernel feature space when you only consider the information relevant to your supervised learning problem. The paper also presents a method for estimating this dimensionality for a given kernel and data set.

The overview also contains a few lines of matlab code with which you can have a look at the discussed effect yourself. It all boils down to pictures like these:



What you see here is the contribution of individual kernel PCA components to the Y samples, split into the smooth part (red) and the noise (blue), on a toy data set, of course. Kernel PCA components are sorted by decreasing variance. What you can see is that the smooth part is contained in the leading kernel PCA components, while the later components only contain information relevant to the noise. This means that even in infinite-dimensional feature spaces, the actual information is contained in a low-dimensional subspace.
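If you want to reproduce the qualitative effect yourself, here is a minimal numpy sketch on a toy data set of my own construction (not the matlab code from the overview article): the target is a smooth function plus noise, and we look at how much of each part the individual kernel PCA components carry.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)
y_smooth = np.sinc(x)                     # smooth part of the target
y_noise = 0.2 * rng.standard_normal(n)    # observation noise
y = y_smooth + y_noise

# Gaussian kernel matrix on the inputs
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

# Kernel PCA components, sorted by decreasing variance (eigenvalue)
evals, evecs = np.linalg.eigh(K)
evecs = evecs[:, np.argsort(evals)[::-1]]

# Contribution of each component to the smooth part (the red curve) and
# to the noise (the blue curve); plot both against the component index.
contrib_smooth = np.abs(evecs.T @ y_smooth)
contrib_noise = np.abs(evecs.T @ y_noise)
```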

If you are looking for more information, have a look at the overview article, the paper, or a video lecture I gave together with Klaus-Robert Müller in Eindhoven last year. There is also some software available.

Monday, October 20, 2008

What is Machine Learning? Revisited

I always hated questions like "what is an organism" in school. What makes this kind of question hard is that while you typically have quite an elaborate idea of what an organism is, summarizing all of this information in one sentence can be very hard.

For me, the question "what is machine learning" is very similar. I've already tried to answer that question, or at least to characterize the role data sets play in machine learning, but I never found those definitions sufficient.

But is it really that important to be able to define what machine learning is? After all, we all know what it is, right? Well, I found that answering what machine learning is to you can actually be very helpful when it comes to deciding which new ideas to follow, or planning what to do in the long run. And it also helps a lot in answering the question of what I do for a living. Being able to describe what you're doing is at least better than laughing out loud and then saying "well... ." (that really happens...)

Okay, so the usual definition you often hear is that machine learning is about writing programs which can learn from examples. This is about as vaguely correct and non-committal as it gets. Put differently, machine learning is about programs which can learn from examples, and solve complex problems which are hard to formalize.

So far, so good, but when you take this characterization seriously, you see that machine learning is mostly a certain approach, or conceptual framework, for solving computational problems. In particular, machine learning is not directly linked to an application area (like, I don't know, building fast and reliable databases or computer networks), but is more of a meta-discipline. In this respect, it is maybe more similar to theoretical computer science, or even mathematics, where you develop theory and neglect the question of practical implications, because you are confident that eventually somebody will find a way to use it. (After all, who would have thought that number theory would one day be so important for cryptography?)

This also means that we often find ourselves in the position of an intruder in other areas, telling people that a "learning approach" works better than the tools they have been building for the last couple of years. Those of you who work on projects with a strong application focus like bioinformatics, computational chemistry, or computer security certainly know what I am talking about... .

So on the one hand, machine learning deals with problems which are hard to formalize, and concrete data sets become a substitute for formal problem definitions. On the other hand, machine learning can and is often studied in quite an abstract fashion, for example, by developing general purpose learning methods.

But is it really possible to study something abstractly when we often need concrete data sets to represent certain questions and problems? Of course, elaborate theoretical frameworks for modeling learning exist, for example, for supervised learning. But these formalizations are ultimately too abstract to capture what really characterizes the problems we wish to solve in the real world.

I personally think the answer is no, meaning that it's not possible to do machine learning abstractly, because we need concrete applications and data sets to provide real challenges to our methods.

What this also means is that you need to compete not only with other machine learning people, but also with people who approach the problems in the classical way, by trying to write programs which solve the problem. While it is always nice (and also very impressive) if you can solve a hard task with a generic machine learning method, you need to compete with "classical" approaches as well to make sure that the "learning approach" really has an advantage.

Monday, September 08, 2008

NIPS outcome and MLOSS Workshop


Well, the NIPS results are out. If you don't know it, NIPS is one of the largest (maybe the largest) conferences in machine learning, held each year in early December, and on Saturday they sent around which papers are accepted and which are not.

Unfortunately, none of my papers made it, although one got quite close. On the other hand, I'm very glad to announce that our workshop on machine learning open source software has been accepted. This will be the second (actually third) installment: in 2005, the workshop was not included in the program, but many people found the issue important enough to come to Vancouver a day earlier and take part in a "Satellite Workshop".

In 2006 we were accepted and actually had a very nice day in Whistler. When I noticed that I was personally enjoying the workshop, I knew that we had managed to put together a nice program. Maybe the highlight was the final discussion session, with Fernando Pereira stating that there is little incentive for researchers to work on software because there is no measurable merit in doing so. Eventually, this discussion led to a position paper and finally to a special track on machine learning software at the Journal of Machine Learning Research.

I'm looking forward to this year's workshop, and hope that it will be equally interesting and productive!

Thursday, September 04, 2008

New Paper Out

I'm very happy to announce that my latest paper just came out at the Journal of Machine Learning Research. Actually, it is the second half of work which developed out of my Ph.D. thesis. The other paper came out almost two years ago, which tells you a bit about how long reviews can take if you're unlucky.

Anyway, from a theoretical point of view, the first paper studies eigenvalues of the kernel matrix while the second one studies eigenvectors, and both derive approximation bounds between the finite-sample quantities and those you'd get as the number of training points tends to infinity. These bounds have the important property that they scale with the eigenvalue under consideration: you don't have one bound for all eigenvalues, but instead a bound which becomes smaller as the eigenvalue becomes smaller. This means that you won't have the same bound on eigenvalues of the order of 10^3 and 10^-6; the error will be much smaller on the smaller eigenvalue. If you run numerical simulations, you will immediately see this, but actually proving it was a bit harder.
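Here is a quick simulation in that spirit (a sketch under my own toy assumptions: Gaussian kernel, standard normal inputs, not the setup from the paper). The eigenvalues of the kernel matrix, scaled by 1/n, are computed for two sample sizes; comparing the two columns, the discrepancy shrinks along with the magnitude of the eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel_eigenvalues(n):
    """Eigenvalues of a Gaussian kernel matrix on n standard normal
    inputs, scaled by 1/n to approximate the asymptotic spectrum."""
    x = rng.standard_normal(n)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
    return np.sort(np.linalg.eigvalsh(K / n))[::-1]

eigs_200, eigs_2000 = kernel_eigenvalues(200), kernel_eigenvalues(2000)
for i in (0, 4, 9, 19):
    print(f"eigenvalue {i:2d}: n=200 -> {eigs_200[i]:.3e}, "
          f"n=2000 -> {eigs_2000[i]:.3e}")
```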

What this practically means is a different story, and I personally think it is actually quite interesting (of course :)): if you have some (supervised) learning problem, and you take a kernel and start to train your support vector machine, then these results tell you that even though the kernel feature space might be very high-dimensional, the important information is contained in a low-dimensional subspace, which also happens to be spanned by the leading kernel PCA components.

In other words, when you choose the right kernel, you can do a PCA in feature space and just consider the space spanned by the directions having the largest variance. In essence, if you choose the correct kernel, you are dealing with a learning problem in a finite-dimensional space, which also explains why these methods work so well.
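For concreteness, here is a bare-bones kernel PCA projection in numpy (the standard textbook construction, not code from the paper): center the kernel matrix in feature space, then project the training points onto the leading components.

```python
import numpy as np

def kernel_pca_projection(K, k):
    """Project the training points onto the leading k kernel PCA
    components, given an (uncentered) n x n kernel matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = H @ K @ H                           # center in feature space
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:k]        # leading k eigenvalues
    # Expansion coefficients of the unit-norm principal directions
    alphas = evecs[:, idx] / np.sqrt(np.maximum(evals[idx], 1e-12))
    return Kc @ alphas                       # n x k matrix of projections
```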

The "old" story was that you're using hypothesis classes which have finite VC dimension, and then everything's fine. This is still true, of course, but these new results also show why you can expect low-complexity hypothesis classes to work at all: Because a good kernel transforms the data such that the important information is contained in a low-complexity subspace of the feature space.

So I hope I could make you a bit interested in the paper. I'm trying to put together a bit more of an overview on my home page, so make sure to check that out as well in a week from now.

Wednesday, July 09, 2008

Data Mangling Basics

When you're new to machine learning, there usually comes a point where you first have to face real data. At that point you suddenly realize that all your training was, well, academic. Real data doesn't come in matrices, real data has missing values, and so on. Real data is usually just not fit for being directly digested by your favorite machine learning method (well, if yours is, consider yourself very lucky).

So you spend some time massaging the data formats until you have something which can be used in, say, matlab, but the results you get just aren't that good. If you're lucky, there is some senior guy in your lab you can ask for help, and he will usually apply some preprocessing steps to the data you've never heard of in class.

Actually, I think this should be taught in class, even when there is no systematic methodology behind it and no fancy theorems to prove. So here is my by no means exhaustive set of useful preprocessing steps; a short code sketch of several of them follows the list.

  • Take subsets You might find this pretty obvious, but I've seen students debugging their methods directly on the 10,000-instance data set often enough to doubt that. So when you have a new data set and you want to quickly try out several methods, take a random subset of the data until your method can handle it in seconds, not minutes or hours. The complexity of most data sets is also such that you usually get somewhat close to the achievable accuracy with a few hundred examples.

  • Plot histograms, look at scatter plots Even when you have high-dimensional data, it might be very informative to look at histograms of individual coordinates, or scatter plots of pairs of coordinates. This already tells you a lot about the data: What is its range? Is it discrete or continuous? Which directions are highly correlated, and which are not? And so on. Again, this might seem pretty obvious, but often students just run the specified method without looking at the data first.

  • Center and normalize While most method papers make you think that the method "just works", in reality you often have to do some preprocessing to make the methods work well. One such step is to center and normalize your data to have unit variance in each direction. Don't forget to save the offset and normalization factors: you will need them to correctly process the features for prediction!

  • Take Logarithms Sometimes you have to take the logarithm of your data, in particular when its range is extremely large and the density of your data points decreases as the values become larger. Interestingly, many of our own senses work on a logarithmic scale, for example hearing and vision: loudness is measured in decibels, which is a logarithmic scale.

  • Remove Irrelevant Features Some kernels are particularly sensitive to irrelevant features. For example, if you take a Gaussian kernel, each feature which is irrelevant for prediction increases the number of data points you need to predict well, because for practically each realization of the irrelevant feature, you need additional data points in order to learn the relevant structure.

  • Plot Principal Values Often, many of your features are highly correlated. For example, the height and weight of a person are usually quite correlated. A large number of correlated features means that the effective dimensionality of your data set is much smaller than the number of features. If you plot the principal values (the eigenvalues of the sample covariance matrix C'C for the centered data matrix C), you will usually notice that there are only a few large values, meaning that a low-dimensional subspace already contains most of the variance of your data. Also note that the projection onto those dimensions is the best low-dimensional approximation with respect to the squared error, so using this information, you can transform your problem into one with fewer features.

  • Plot Scalar Products of the output variable with Eigenvectors of the Kernel Matrix While principal values only tell you something about the variance in the input features, if you plot the scalar products between the eigenvectors of the kernel matrix and the output variable in a supervised learning setting, you can see how many (kernel) principal components you need to capture the relevant information about the learning problem. The general shape of these coefficients can also tell you whether your kernel makes sense or not. Watch out for an upcoming JMLR paper for more details.

  • Compare with k-nearest neighbors Finally, before you apply your favorite method to the data set, try something really simple like k-nearest neighbors. This gives you a good idea of what kinds of accuracies you can expect.
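As promised above, here is a short numpy sketch of several of these steps on synthetic stand-in data (all names and numbers here are made up for illustration; this is not a recipe to copy verbatim):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a real data matrix X (n samples, d features) and target y.
n, d = 500, 20
X = rng.lognormal(size=(n, d))        # positive, heavy-tailed features
y = rng.standard_normal(n)

# Take subsets: a random subset for quick experiments.
subset = rng.choice(n, size=100, replace=False)
X_small, y_small = X[subset], y[subset]

# Take logarithms of heavy-tailed positive features.
X = np.log(X)

# Center and normalize; keep the offset and scale for prediction time!
mu, sigma = X.mean(axis=0), X.std(axis=0)
X = (X - mu) / sigma

# Principal values: eigenvalues of the sample covariance of centered data.
principal_values = np.sort(np.linalg.eigvalsh(X.T @ X / (n - 1)))[::-1]

# Scalar products of y with the eigenvectors of a Gaussian kernel matrix,
# sorted by decreasing eigenvalue.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2.0 * d))
evals, evecs = np.linalg.eigh(K)
coefficients = np.abs(evecs[:, ::-1].T @ y)
```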


So in summary, there are a lot of things to do with your data before you plug it into your learning method. If you do it correctly, you will learn quite a lot about the nature of the data, and you will have transformed it so that you can learn more robustly.

I'd be interested to know what initial procedures you apply to your data, so feel free to add some comments.

Friday, May 30, 2008

NIPS, mloss.org and jruby-1.1.2

Currently it's one week till this year's NIPS deadline, and there is still a lot of work to do, as always. I just wanted to quickly point you to a post I've written on our homepage mloss.org, where I try to make the point that you could actually do useful things using a modern scripting language like python or ruby for building machine learning toolboxes. At the very least, you'll have more powerful modeling possibilities than in matlab.

And then, jruby-1.1.2 has been released. Among other things, they have managed to reduce the startup time from about 1.2s (on my machine) to 0.4s, which is pretty great. I'm planning to write an article about how useful the jruby/java integration is, in particular for machine learning, but that will be after NIPS, I guess... .

Sunday, February 24, 2008

Machine Learning and Data Sets

I've been busy taking care of my 11-month-old daughter lately, which leaves almost no time to do anything as remotely useful as posting on my blog - not that I was doing it more often when I was still working full time. At the same time, you get a lot of ideas and potentially interesting insights now that your brain has time to idle now and then, for example while picking up toys thrown to the ground again and again.

Anyway, I lately came to think that machine learners as a whole should devote much more time to working with actual data, in particular machine learners who think of themselves as "method guys" (yes, this also includes me). It usually works like this: you have some technique you really like, and you use it to extend an already existing method until you come up with something you think is really neat. It may have some interesting properties other algorithms don't have, and you really would like to write a paper on it.

But then the problem starts, because in order to prove that your extension is actually useful, you have to show that it makes a difference in practice. So you go around your group asking colleagues whether they have or know of some interesting data set. We call this the "have method, need data" phenomenon.

Of course, if you had started with a concrete application in mind, you would never have to ask yourself "oh, this is great, but what is it good for?"

Also, the formally defined problems we have in machine learning are very abstract (like minimizing the expected risk from i.i.d. drawn data points), and many of the actual challenges are only "defined" by specific data sets.

Anyway, the data wrangling blog has recently posted a huge list of links to data sets on the web, which is certainly an interesting starting point.

And yes, if you already have your method, you might also find some interesting "real world application" there.

Sunday, August 26, 2007

What is Machine Learning

I'm often asked what it actually is that I do for a living. I'm not sure if you can understand this, but "we Germans" have a special relationship with anything which involves mathematics. There is a whole family of jokes about a mathematician who has a date and is asked what his job is. They all end with the disclosure "I'm a mathematician", followed by "Oh... . I was never good at math in school". And that is the end of it.

The next thing is that "Machine Learning" is somewhat close to "Artificial Intelligence", and, come on, who would be able to hear that somebody is working on building "intelligent machines" and keep a straight face? Have you ever had to call a company and realized that they have replaced their already annoying menu-by-number scheme with an incredibly more annoying "speech recognition" software? "If you are calling to ask about your contract, say 'contract', if you want to ask about your invoice, say 'invoice'..." - "CON-TRACT!!!" - "Sorry, I could not understand you. If you are calling to ask..." and repeat ad infinitum.

Where was I? Ah, so the question is what Machine Learning is about. In my Ph.D. thesis, I state that "machine learning is concerned with constructing algorithms which are able to learn from data". Well, this is certainly accurate, but it does not answer a few important questions: why would you want to learn from data? And what?

I have lately come to realize that machine learning is nothing else than an extension of how we write programs to solve complex problems - in fact, problems so complex that you don't manage to come up with a formal specification of what the program should accomplish. Classically, programs have been written to address problems which could be formalized well: basic arithmetic (but make it really fast), how to compute the shortest path in a graph, how to optimize network flow, and so on. But there were always those problems whose solutions were elusive: making computers see, understanding natural language, controlling robots which can interact with the real world. Most of these problems are "easily" solved by humans, but maybe only because evolution has outfitted us with the right hardware, er, wetware.

For most of these problems, it is easy to find partial solutions. For example, for object recognition in images, it is clear that the size or position of an object in an image usually does not make a difference (well, maybe unless the relative position of objects in an image does. Or since when can elephants fly?). But things quickly become less clear, and the "old way" of solving such problems, by first understanding the problem fully and then devising a list of basic operations which always results in the right solution, plainly does not work.

Enter Machine Learning! ML algorithms learn a mapping from input-output pair examples, and state-of-the-art ML algorithms can deal with sets which contain up to several million examples. But wait, this does not mean that the "old way" is obsolete. As it turns out, just taking a few million images and throwing some vanilla ML algorithm at the data won't work all that well.

As almost every practitioner of ML knows, it is all in the preprocessing. In principle, the methods might be able to eventually work, but they work so much better (read: require less data) if you perform some form of preprocessing. For object recognition, it might help if the object is nicely centered, for example. ML people are often a bit annoyed that finding the right preprocessing is so important; they want to build machines which can do everything by themselves.

But if you look at it from a different angle, you see that the preprocessing is actually the place where the "old way" and the "ML way" meet nicely. The preprocessing roughly amounts to solving the problem partially, taking all available information into account. The remaining part of the problem is then handed over to the machine, which solves it the way it works best: by brute force, squeezing the required information out of several thousand to several million examples.

So, the next time somebody asks me what I do for a living, I'll tell them nothing about intelligence or statistics, but that I solve problems which are so hard that nobody has found a solution yet, using the sheer computational power of modern processors. Let's see if I manage to circumvent the inevitable "I was bad at math" reaction. Or at least delay it for 5 minutes.