Showing posts with label /Projects/mloss. Show all posts

Wednesday, January 13, 2010

The Open Source Process and Research

(Cross posted on mloss.org)

I think there is more to be learned from the open source software development process than just publishing the code from your papers. So far, we’ve mostly focused on making the software side more similar to publishing scientific papers, for example, through creating a special open source software track at JMLR.

However, there is more to be learned from the open source software development process:

  • “Release early, release often.” Open source software is not only about making your software available for others to reuse, but also about getting in touch with potential users as early and as closely as possible.

Contrast this with the typical publication process in science, where months lie between your first idea, the submission of the paper, its publication, and the reactions through follow-up and response papers.

  • Self-organized collaboration. One nice thing about open source software is that you can often find an already sufficiently good solution for some part of your problem. This allows you to focus on the part which is really new. If existing solutions look sufficiently mature and their projects healthy, you might even end up relying on others for part of your project, which is really interesting given that you don’t know these people and have never talked to them. But if the project is healthy, there is a good chance that they will do their best to help you out, because they want to have users for their own project.

Again, contrast this with how you usually work in science, where it’s much more common to collaborate with people from your group or people within the same project only. Even if there were someone working on something which would be immensely useful for you, you wouldn’t know till months later when their work is finally published. The effect is that there is lots of duplicate work, research results from different groups don’t usually interact easily, and much potential for collaboration and synergy is wasted.

While there are certainly reasons why these two areas are different, I think there are ways to make research more interactive and open. And while most people probably aren’t willing to switch to open notebook science, I think there are a few things which you can try out now:

  • Communicate with people through your blog, or by Twitter or Facebook, and let them know what you’re working on, even before you have polished and published it. And if you don’t feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the machine learning twibe, or my (entirely non-authoritative) list of machine learning twitterers, or lists of machine learning people others have compiled, or another list of machine learning related blogs.

  • Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many options to go about this, either using some site like github, sourceforge, kenai, launchpad, savannah, or by setting up a private repository, for example using trac, or just a bare subversion repository. It doesn’t have to be that complicated, though. You can even just put a git repository on your static homepage and have people pull from there. And of course, register your project with mloss, such that others can find it and stay up to date on releases.

  • Turn your research project into a software project to create something others can readily reuse. This means making your software usable for others, interfacing it with existing software, and starting to reuse existing software yourself. It doesn’t have to be large if it’s useful. Have a look at mloss for a huge list of already existing machine learning related software projects.

Monday, December 22, 2008

On NIPS 2008


Although I came home from last year's NIPS conference more than three weeks ago, I haven't yet found time to summarize my impressions. I've found that it's always like this: first there is the jet-lag, then there is Christmas, and then New Year.

But maybe it's not just the closeness to the holiday season. I think it's also that NIPS is so content-rich that you really need some time to digest all that information.

Banquet Talk and Mini-Symposia

This was particularly true this year, because the organizers managed to cram even more sessions into the program. In previous years there used to be an invited high-level talk during the banquet, but this year the organizers chose to put two technical talks and virtually 20 poster spotlights there instead. Actually, I'm not sure whether this decision was wise, as I and most of my colleagues felt that dinner and technical talks don't go well together. Maybe it was also my jet-lag, as I arrived on Monday afternoon, not on Sunday like some people.

The second addition was the mini-symposia on Thursday afternoon, which conflicted with the earlier buses to Whistler. I attended the computational photography mini-symposium and found it very entertaining. The organizers managed to put together a nice mix of introductory and overview talks. For example, Steve Seitz from the University of Washington gave a nice talk on how to reconstruct tourist sites in Rome from pictures collected from flickr. Based on these 3D reconstructions you could go on a virtual tour of famous monuments, or compute closest paths based on where pictures were taken.

So if I had anything to say, I'd suggest keeping the mini-symposia, but replacing the technical talks during the banquet with an invited talk as in previous years.

Presentations

With over 250 presentations, it's really hard to pick out the most interesting ones, and as the pre-proceedings aren't out yet (or at least not that I'm aware of), it's also hard to collect some pointers here.

There was an interesting invited talk by Pascal Van Hentenryck on Online Stochastic Combinatorial Optimization. The quintessence of his approach to optimization in stochastic environments was that often the reaction of the environment does not depend on the action you take, so you can build a pretty reliable model of the environment and then optimize against that.

Yves Grandvalet had a paper "Support Vector Machines with a Reject Option", which proposes a formulation of a support vector machine which can also opt to say "I don't know which class this is".
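To make the idea of a reject option concrete, here is a minimal Python sketch. It is not the formulation from the paper: it simply thresholds the score of an already trained linear classifier, and the names `linear_score` and `classify_with_reject`, the `threshold` parameter, and the weights are all made up for illustration.

```python
def linear_score(weights, bias, x):
    """Signed score of a linear classifier (assumed to be already trained)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify_with_reject(score, threshold=0.5):
    """Return +1 or -1, or None ("I don't know") if the score is too close to zero."""
    if abs(score) < threshold:
        return None  # reject: not confident enough to pick a class
    return 1 if score > 0 else -1


# Hypothetical weights, for illustration only.
weights, bias = [1.0, -1.0], 0.0
for x in ([2.0, 0.0], [0.0, 2.0], [1.0, 0.9]):
    print(x, classify_with_reject(linear_score(weights, bias, x)))
```

The first two points lie far from the decision boundary and get a label, while the third one is close to it and is rejected. How to choose the threshold (and how to train with it in mind) is exactly the part such a sketch leaves out.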

John Langford had a paper on sparse online learning (already available as a preprint on arxiv.org), which basically has the option to truncate certain weights if they become too small.
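The truncation idea can be sketched in a few lines of Python. This is only a rough illustration of the general principle, not necessarily the exact update from the paper: after an ordinary stochastic gradient step, every weight whose magnitude is at most `theta` is pulled toward zero by `shrink` and clipped at zero. The function names, the squared loss, and all parameter values are assumptions.

```python
def truncate(weight, shrink, theta):
    """Pull a small weight toward zero; leave weights larger than theta alone."""
    if 0.0 <= weight <= theta:
        return max(0.0, weight - shrink)
    if -theta <= weight < 0.0:
        return min(0.0, weight + shrink)
    return weight


def truncated_sgd(examples, dim, lr=0.1, shrink=0.01, theta=0.1):
    """Online least-squares regression with a truncation step after each update."""
    weights = [0.0] * dim
    for x, y in examples:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        # Plain stochastic gradient step for the squared loss.
        weights = [w - lr * error * xi for w, xi in zip(weights, x)]
        # Truncation: small weights are shrunk and may become exactly zero.
        weights = [truncate(w, shrink, theta) for w in weights]
    return weights


# Toy data (made up): the target depends only on the first feature.
examples = [([1.0, 0.0], 2.0)] * 50
print(truncated_sgd(examples, dim=2))
```

The point of clipping at zero rather than just shrinking is that unimportant weights end up exactly zero, so the learned model is sparse while large weights are left untouched.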

Every now and then there was an interesting poster with nobody attending it. For example, Patrice Bertail had a poster on "Bootstrapping the ROC Curve" which looked interesting and highly technical, but we could find nobody. At some point I started to discuss the poster with a colleague, but we had to move away from the poster after people started to cluster around us as if one of us were actually Patrice.

Michalis Titsias had an extension of Gaussian Processes in his paper "Efficient Sampling for Gaussian Process Inference using Control Variables" to the case where the model is not just additive random noise but actually depends non-linearly on the function the Gaussian process is placed on. It looked pretty complicated, but it might be good to know that such a thing exists.

There were many more interesting papers, of course, but let me just list one more: "Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models" by Tong Zhang seemed like a simple method which combines feature addition with removal steps and comes with a proof (of course). I guess dozens of similar schemes exist, but this one seemed quite interesting to try out.

The question I always try to answer is whether there are some big new developments. A few years ago, everybody suddenly seemed to do Dirichlet processes and some variant of eating place. Last year (as I have been told), everybody was into deep networks. But often I find this very hard to tell, and this year was one of those cases. Definitely no deep learning, maybe some multiple kernel learning. There were a few papers which try to include some sort of feature extraction or construction into the learning process in a principled manner, but such approaches are (necessarily?) often quite application specific.

I also began to wonder whether a multi-track setup wouldn't be better for NIPS. This question has been discussed every now and then, always in favor of keeping the conference single-track. I think one should keep in mind that what unites machine learning as a community are new methods, because the applications are quite diverse, and often very specific. For a bioinformatics guy, a talk on computer vision might not be very interesting, unless there is some generic method which is application-agnostic to a certain degree.

It seems that currently, most of the generic methods are sufficiently well researched, and people now start to think about how to incorporate automatic learning of features and preprocessing into their methods. As I said above, such methods are often a bit ad-hoc and application specific. I'm not saying that this is bad. I think one first has to try out some simple things before one can find more abstract principles which might be more widely applicable.

So maybe having a multi-track NIPS would mean that you can listen more selectively to talks which are relevant to your area of research, and the list of talks wouldn't appear to be somewhat arbitrary. On the other hand, you might become even more unaware of what other people are doing. Of course, I don't know the answer, but my feeling was that NIPS is slowly approaching a size and density of presentations at which something has to change to optimize the information flow between presenters and attendees.

Workshops

I've heard that some people come to NIPS only for the workshops, and I have to admit that I really like them a lot, too. Sessions are more focused topic-wise, and the smaller size of the audience invites some real interaction. Whereas I sometimes get the impression that the main conference is mostly for big-shots to meet over coffee-breaks and during poster sessions, it's in the workshops where they participate in the discussion.

We had our own workshop on machine learning and open source software which I have summarized elsewhere.

I attended the multiple kernel learning workshop, which really was very interesting, because most of the speakers concluded that in most cases, multiple kernel learning does not work significantly better than a uniform average of kernels. For example, William Stafford Noble reported that he had a paper with multiple kernel learning for the Bioinformatics journal, and only afterwards decided to check whether unoptimized weights would have worked as well. He was quite surprised when the differences were statistically insignificant and concluded that he wouldn't have written the paper in that way had he known the results before.
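For reference, the baseline in question is easy to write down: a combined kernel matrix is a weighted sum of the individual kernel matrices, and the uniform average just uses equal weights. The sketch below uses plain Python lists. `combine_kernels`, the two toy kernels, the data points, and the non-uniform weights are illustrative assumptions, not taken from any of the talks.

```python
import math


def gram(kernel, points):
    """Kernel (Gram) matrix of a kernel function over a list of points."""
    return [[kernel(a, b) for b in points] for a in points]


def linear(a, b):
    """Linear kernel: the ordinary dot product."""
    return sum(x * y for x, y in zip(a, b))


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel with width parameter gamma."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def combine_kernels(matrices, weights=None):
    """Weighted sum of kernel matrices; equal weights (uniform average) by default."""
    if weights is None:
        weights = [1.0 / len(matrices)] * len(matrices)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


points = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # toy data
kernels = [gram(linear, points), gram(rbf, points)]
uniform = combine_kernels(kernels)               # the unoptimized baseline
weighted = combine_kernels(kernels, [0.9, 0.1])  # made-up non-uniform weights
```

Either matrix can then be handed to any kernel method that accepts a precomputed kernel. Multiple kernel learning tries to learn the weights, and the uniform version is the cheap baseline to check it against.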

Francis Bach also gave a very entertaining talk where he presented Olivier Chapelle's work, who couldn't attend. He did a very good job, including comments like "So on the y-axis we have the relative duality gap - I have no idea what that is", and raising his hand after his talk to have the first question.

All in all, I think this workshop was quite interesting and exciting, and also important for the whole field of multiple kernel learning: basically, to see that it doesn't just work, and to try to understand better when and why it doesn't give the improvements hoped for.

Finally, many workshops were taped by videolectures.net. I've collected the links here:

Monday, September 08, 2008

NIPS outcome and MLOSS Workshop


Well, the NIPS results are out. If you don't know NIPS, it is one of the largest (maybe the largest) conferences in machine learning, held each year in early December, and on Saturday the organizers sent around which papers were accepted and which were not.

Unfortunately, none of my papers made it, although one got quite close. On the other hand, I'm very glad to announce that our workshop on machine learning open source software has been accepted. This will be the second (actually third) installment: In 2005, the workshop was not included in the program, but many people found the issue important enough to come to Vancouver a day earlier and take part in a "Satellite Workshop".

In 2006 we were accepted and actually had a very nice day in Whistler. When I noticed that I was personally enjoying the workshop, I knew that we had managed to put together a nice program. Maybe the highlight was the final discussion session with Fernando Pereira stating that there is little incentive for researchers to work on software because there is no measurable merit in doing so. Eventually, this discussion led to a position paper and finally to a special track on machine learning software at the Journal of Machine Learning Research.

I'm looking forward to this year's workshop, and hope that it will be equally interesting and productive!