Monday, December 22, 2008

On NIPS 2008


Although I came home from last year's NIPS conference more than three weeks ago, I haven't yet found time to summarize my impressions. I've found that it's always like this: first there is the jet-lag, then Christmas and New Year's.

But maybe it's not just the proximity to the holiday season; I think it's also that NIPS is so content-rich that you really need some time to digest all that information.

Banquet Talk and Mini-Symposia

This year in particular, because they managed to cram even more sessions into the program. In previous years there used to be an invited high-level talk during the banquet, but this year the organizers chose to schedule two technical talks and nearly 20 poster spotlights during the banquet. Actually, I'm not that sure this decision was wise, as I and most of my colleagues felt that dinner and technical talks don't go well together. Maybe it was also my jet-lag, as I arrived on Monday afternoon, not on Sunday like some people.

The second addition was the mini-symposia on Thursday afternoon, which conflicted with the earlier buses to Whistler. I attended the computational photography mini-symposium and found it very entertaining. The organizers managed to put together a nice mix of introductory and overview talks. For example, Steve Seitz from the University of Washington gave a nice talk on reconstructing tourist sites in Rome from pictures collected from flickr. Based on these 3d reconstructions, you could go on a virtual tour of famous monuments, or compute the closest paths based on where pictures were taken.

So if I had anything to say, I'd suggest keeping the mini-symposia but replacing the technical talks during the banquet with an invited talk, as in previous years.

Presentations

With over 250 presentations, it's really hard to pick out the most interesting ones, and as the pre-proceedings aren't out yet (or at least not that I'm aware of), it's also hard to collect some pointers here.

There was an interesting invited talk by Pascal Van Hentenryck on Online Stochastic Combinatorial Optimization. The quintessence of his approach to optimization in stochastic environments was that often, the reaction of the environment does not depend on the action you take, so you can build a pretty reliable model of the environment and then optimize against that.

Yves Grandvalet had a paper "Support Vector Machines with a Reject Option", which proposes a formulation of a support vector machine that can also opt to say "I don't know which class this is".

John Langford had a paper on sparse online learning, already available as a preprint at arxiv.org, which basically has the option to truncate certain weights if they become too small.

Every now and then there was an interesting poster with nobody attending it. For example, Patrice Bertail had a poster on "Bootstrapping the ROC Curve" which looked interesting and highly technical, but the author was nowhere to be found. At some point I started to discuss the poster with a colleague, but we had to move away from it after people started to cluster around us as if one of us were actually Patrice.

Michalis Titsias presented an extension of Gaussian Processes in his paper "Efficient Sampling for Gaussian Process Inference using Control Variables" to the case where the model is not just additive random noise but actually depends non-linearly on the function on which the Gaussian process is placed. It looked pretty complicated, but it might be good to know that such a thing exists.

There were many more interesting papers, of course, but let me just list one more: "Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models" by Tong Zhang seemed like a simple method which combines feature addition with removal steps and comes with a proof (of course). I guess dozens of similar schemes exist, but this one seemed quite interesting to try out.

The question I always try to answer is whether there are some big new developments. A few years ago, everybody suddenly seemed to be doing Dirichlet processes and some variant of eating place (Chinese restaurants, Indian buffets). Last year (as I have been told), everybody was into deep networks. But often I find it very hard to tell, and this year was one of those cases. Definitely no deep learning, maybe some multiple kernel learning. There were a few papers which try to include some sort of feature extraction or construction into the learning process in a principled manner, but such approaches are (necessarily?) often quite application-specific.

I also began to wonder whether a multi-track setup wouldn't be better for NIPS. This question has been discussed every now and then, always in favor of keeping the conference single-track. I think one should keep in mind that what unites machine learning as a community are new methods, because the applications are quite diverse, and often very specific. For a bioinformatics guy, a talk on computer vision might not be very interesting, unless there is some generic method which is application-agnostic to a certain degree.

It seems that currently, most of the generic methods are sufficiently well researched, and people now start to think about how to incorporate automatic learning of features and preprocessing into their methods. As I said above, such methods are often a bit ad-hoc and application-specific. I'm not saying that this is bad; you first have to try out some simple things before you can find more abstract principles which might be more widely applicable.

So maybe having a multi-track NIPS would mean that you can listen more selectively to talks which are relevant to your area of research and the list of talks wouldn't appear to be somewhat arbitrary. On the other hand, you might become even more unaware of what other people are doing. Of course, I don't know the answer, but my feeling was that NIPS is slowly approaching a size and density of presentations that something has to change to optimize the information flow between presenters and attendees.

Workshops

I've heard that some people come to NIPS only for the workshops, and I have to admit that I really like them a lot, too. Sessions are more focused topic-wise, and the smaller size of the audience invites some real interaction. Whereas I sometimes get the impression that the main conference is mostly for big shots to meet over coffee breaks and during poster sessions, it's in the workshops that they actually participate in the discussions.

We had our own workshop on machine learning and open source software which I have summarized elsewhere.

I attended the multiple kernel learning workshop, which really was very interesting, because most of the speakers concluded that in most cases, multiple kernel learning does not work significantly better than a uniform average of kernels. For example, William Stafford Noble reported that he had a paper with multiple kernel learning in the Bioinformatics journal, and only afterwards decided to check whether unoptimized weights would have worked as well. He was quite surprised when the differences were statistically insignificant, and concluded that he wouldn't have written the paper in that way had he known the results beforehand.

Francis Bach also gave a very entertaining talk presenting the work of Olivier Chapelle, who couldn't attend. He did a very good job, including comments like "So on the y-axis we have the relative duality gap - I have no idea what that is", and raising his hand after his talk to ask the first question.

All in all, I think this workshop was quite interesting and exciting, and also important for the whole field of multiple kernel learning: basically, to see that it doesn't just work, and to try to understand better when it doesn't give the improvements hoped for, and why.

Finally, many workshops were taped by videolectures.net. I've collected the links here:

Saturday, December 06, 2008

On my way to NIPS 2008

Next week is the annual NIPS conference in Vancouver. In case you don't know it, it's one of the most important annual conferences in the area of machine learning. It's actually its 22nd installment, and while the full title (Neural Information Processing Systems) hints at the field's beginnings in artificial neural networks, such methods cover only a small percentage of the presented papers nowadays.

Together with Soeren Sonnenburg and Cheng Soon Ong, I have organized a workshop on machine learning open source software. I'm pretty excited about our program; I only fear that our decision to drop the coffee breaks in favor of a few additional talks will backfire severely. But we'll see.

I plan to use my twitter account this year to cover the workshop, so if you're interested, make sure to have a look. Most importantly, I will cover our workshop so that you can see how far we are behind schedule. ;)

By the way, the picture above shows the floating gas station. It mostly services seaplanes, and you have a very nice view of it from the top floor of the hotel where the NIPS conference takes place.

Tuesday, November 25, 2008

Java Integration in JRuby

Update: See also the updated entry in the JRuby wiki here.

JRuby has some very nice features to make accessing Java easy, but unfortunately, documentation about these features is a bit hard to come by. Existing documentation is sometimes a bit outdated, or doesn't cover all features. The ultimate documentation is of course the source code itself, in particular the rspec code to be found in spec/java_integration in the jruby-trunk, but reading source code is maybe not the first choice for everybody when trying to learn a new feature.

Fortunately, Java integration in JRuby is actually pretty good and useful. JRuby does quite a few tricks to make Java classes feel more Ruby-like, for example, by translating between javaNamingSchemes and ruby_naming_schemes, or by automagically converting blocks to implementations of the appropriate Java interfaces, to the effect that you can pass a block to a Java method whenever it expects a Runnable, for example.

I personally think there is great potential in this kind of bilingual development approach which uses Java for all the computationally expensive details and Ruby for the dynamic high-level plumbing. Compared with the Ruby/C alternative, probably based on tools like Swig for the plumbing, JRuby/Java is much easier and cleaner. For example, you can easily deal with Java exceptions on the Ruby side, and even if you don't you get a call stack with Ruby and Java classes happily mixed when an error occurs.

So I hope that the following is both helpful and correct enough to be of use for others.

The Basics

The first thing to do is to require 'java'. I'm not sure if this is strictly necessary as many things also work without that, but I guess it's safe to do this for the future.

The first question is of course how to access Java classes.

There seem to be several different alternatives for naming a Java class. The most "ruby-esque" way looks like this: java.lang.System becomes Java::JavaLang::System. The prefix is always Java::, followed by the package names with dots turned into CamelCase notation, and finally the class itself.
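
For example, here is a minimal sketch of this naming scheme in action, using only core classes:
require 'java'

# the fully qualified, "ruby-esque" constant path for java.lang.System
system_class = Java::JavaLang::System
puts system_class.current_time_millis  # method name conversion is explained below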

If the top-level package is java, javax, com, or org, you can also access it just like in Java, for example, by typing java.lang.System. If you want to have this functionality for your own packages as well, because they start (for example) with edu, you can use the following code taken from path-to-jruby-installation/site_ruby/1.8/builtin/javasupport/core_ext/kernel.rb:
def edu
JavaUtilities.get_package_module_dot_format('edu')
end
Or, even simpler:
def edu
Java::Edu
end
Java's primitive types can be found directly in the Java module: For example, long is Java::long (but not java.long). Actually, I cannot think of too much you could do with these objects. But you need them when you want to create primitive arrays. For example, Java::long[10].new creates a new array of long integer values.

Loading jar-files

If you want to access classes beyond Java's own runtime, you have to load the jar file in JRuby. One way to do this is to explicitly require the jar-file. You can supply a full path, or you can just pass the jar file which is then searched according to the RUBYLIB environment variable.

Alternatively, you can put the jar-file into your CLASSPATH. Then, you can directly access your classes without an explicit require.
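
As a minimal sketch (the jar and class names here are made up purely for illustration):
# explicit path, or just the file name if it can be found via RUBYLIB
require 'mylib.jar'

# with mylib.jar on the CLASSPATH, no require is needed at all
widget = Java::ComExampleMylib::Widget.new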

Calling Java Methods

You can then access the methods just as in Java, for example
java.lang.System.currentTimeMillis
but JRuby also provides transformed method names to make them more similar to Ruby naming conventions. The typical CamelCase is converted to underscores. The call above can therefore also be written as
java.lang.System.current_time_millis
Moreover, JRuby maps the bean conventions as follows
obj.getSomething()        becomes   obj.something
obj.setSomething(x) becomes obj.something = x
obj.isSomething() becomes obj.something?

When accessing methods or constants, you have to follow the usual Ruby conventions. That is, constants have to be accessed as Class::Constant (for example, JFrame::EXIT_ON_CLOSE), and it seems you cannot access member variables of objects outside of the object's methods, even if they are declared public in Java.

For each Java object there are a number of methods defined, but most of them aren't documented anywhere. I think the more useful ones are java_class (because the plain class only returns some JRuby-side proxy object) and java_kind_of?, which I assume is like instanceof for Java objects.

Conversion of Parameters

When calling a Java method, JRuby goes through some pain to map JRuby types to Java types. For example, if necessary, a Fixnum is converted to a double, and so on. If this fails and JRuby does not find a corresponding method, it prints an error message along the lines of "no foo with arguments matching [class org.jruby.RubyObject] on object #<Java::...>". Actually, I think JRuby could be a bit more helpful here and tell you what class the ruby object was. In any case, when this happens it means that the types you wanted to pass couldn't be converted.

JRuby does not automatically try to convert arrays to Java arrays. You have to use the to_java method for that. By default, this constructs Object[] arrays. If you want to construct arrays of a given primitive type, you pass an argument to to_java, which can either be a symbol like :long or :double, or a class. For example,
[1,2,3].to_java              becomes an Object[]
[1,2,3].to_java :long becomes long[]
[1,2,3].to_java(Java::long) same thing


Importing Classes and Including Packages

Typing the full name quickly gets old, so you can import classes into modules, and at the top-level.

You might run into slight difficulties with how to name a class. These functions actually expect a string, but you can also pass the Java-style fully qualified class name (that is, with package) if the package starts with java, javax, or one of the other default package names. However, currently (at least as of 1.1.5) it seems that you cannot use the Java::... notation if the top-level package isn't one of those for which you could have used the Java-style name anyway. So I guess it's safe to go with strings in those cases.

This means that you can do
import java.lang.String
but not
import Java::EduSomeuniversity::SomeClass
Instead, you should type
import 'edu.someuniversity.SomeClass'
or do the "def edu; Java::Edu; end" hack and then "import edu.someunversity...".

While you can import classes into modules and at top-level, you can include whole packages into classes or modules. For example,
module E
include_package javax.swing
end
makes all of Swing available in E and also as, for example, E::JFrame. At least it used to be like this; currently (in 1.1.5), E::JFrame gives an odd "wrong number of arguments (1 for 0)" error... .

Currently, there is no way to import whole packages at the top-level (or at least none I'm aware of). Taking the short-cut through a module and then including the module at the top-level currently also doesn't work. You have to access the class first through the module before you can use it at the top-level.

Implementing Java Interfaces and Adding Syntactic Sugar to Java Classes in Ruby

Classes are always open in Ruby. This means that you can add methods later just by re-opening the class with a class Name ... end statement. This is a nice way to add, for example, syntactic sugar to your Java classes like operators, or other methods like each, to make your classes behave more Ruby-like.

Of course, these additions are only on the Ruby side, and not visible from the Java side. You have to remember that Java is a statically typed language and these run-time additions aren't visible. Still I found this feature to be tremendously useful for making Java classes more usable on the JRuby side.
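
For example, here is a minimal sketch of re-opening a proxy class (the added method is my own invention, purely for illustration):
require 'java'

# re-open the Ruby proxy class for java.util.ArrayList and add some sugar
class Java::JavaUtil::ArrayList
  def second
    get(1)
  end
end

list = java.util.ArrayList.new
list.add("a")
list.add("b")
puts list.second    # => "b"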

What is possible, though, is to implement a Java interface with a JRuby class and pass that new class back to Java. The preferred way to do this is by including the interface in the class definition:
class MyLittleThread
include java.lang.Runnable

def run
10.times { puts "hello!"; sleep 1 }
end
end
Then you can start this thread with
java.lang.Thread.new(MyLittleThread.new).start


If the interface has only a single method, you can even pass a block which then gets automatically converted:
java.lang.Thread.new { 10.times { puts "hello!"; sleep 1} }.start


I think this was not possible in earlier versions, but you can also subclass a Java class and pass it back to Java code expecting the super class. This is actually pretty cool, as it means that you can freely intertwine Ruby and Java code.

Collections, Regexs, and so on.

JRuby already adds some syntactic sugar on the JRuby side for Java collections, regular expressions and so on, to make them more Ruby-like. For example, Enumerable is mixed in into java.util.Collection. In addition, JRuby also defines some additional methods like <<, or +. For all the details, have a look at the files in JRUBY_DIR/lib/ruby/site_ruby/1.8/builtin/java/.
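
For example, a small sketch of what this buys you:
require 'java'

list = java.util.ArrayList.new
list << 1                     # << is one of the methods JRuby adds
list << 2
list << 3
list.each { |x| puts x }      # works because Enumerable is mixed in
puts list.map { |x| x * 2 }.inspect   # => [2, 4, 6]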

Exceptions

There isn't much to say here, only that it's possible to catch Java exceptions using Ruby's begin ... rescue ... end construction.
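
For example, a minimal sketch (in some JRuby versions you may have to rescue NativeException instead of the Java exception class directly):
require 'java'

begin
  java.lang.Integer.parse_int("not a number")
rescue java.lang.NumberFormatException => e
  puts "caught: #{e.message}"
end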

Summary

We've seen that there are some very nice features in the JRuby/Java integration, namely coercing some basic types, automatic conversion between Java and JRuby naming schemes, automatic translation of Bean patterns. You can even implement Java interfaces, providing callbacks to Java code in JRuby code.

Some things are maybe missing (or I haven't figured them out yet) like a way to import whole packages at the top-level. Sometimes I wished that the conversion between Ruby and Java objects would also be more configurable. But given the enormous speed at which JRuby is progressing, I'm sure that these oddities will be smoothed out soon.

Monday, November 17, 2008

On Relevant Dimensions in Kernel Feature Spaces

I finally found some time to write a short overview article about our latest JMLR paper. It discusses some interesting insights into the dimensionality of your data in a kernel feature space when you only consider the information relevant to your supervised learning problem. The paper also presents a method for estimating this dimensionality for a given kernel and data set.

The overview also contains a few lines of matlab code with which you can have a look at the discussed effect yourself. It all boils down to pictures like these:



What you see here is the contribution of individual kernel PCA components to the Y samples, split into the smooth part (red) and the noise (blue), on a toy data set, of course. Kernel PCA components are sorted by decreasing variance. What you can see is that the smooth part is contained in the leading kernel PCA components, while the later components only contain information relevant for the noise. This means that even in infinite-dimensional feature spaces, the actual information is contained in a low-dimensional subspace.

If you are looking for more information, have a look at the overview article, the paper, or a video lecture I gave together with Klaus-Robert Müller in Eindhoven last year. There is also some software available.

Thursday, October 23, 2008

Matrices, JNI, DirectBuffers, and Number Crunching in Java

One thing Java lacks is a fast matrix library. There are things like JAMA and COLT (not using its LAPACK bindings), but compared to state-of-the-art implementations like ATLAS, they are orders of magnitude slower.

Therefore, I set out some time ago to write my own matrix classes based on ATLAS. It turns out that while this is quite straightforward most of the time, there are a few issues which become complicated very quickly, to the point where I wonder whether Java is really the right platform for this kind of application at all.

Basically, there are two areas where the Java platform can give you quite a headache:

  • Passing large amounts of data to native code.
  • Lack of control over how well the JIT compiles code.

Passing Data to Native Code

For a number of reasons, producing highly tuned linear algebra code still requires a lot of low-level fiddling with machine code. If you want really fast code, you have to rely on libraries like ATLAS. The question is then how you pass the data from Java to ATLAS.

Maybe the easiest choice is to store the data in primitive type arrays like double[] and then access them via JNI. Only that the documentation is quite explicit about the fact that this will not work well for large arrays as you typically will get a copy of the array, and afterwards, the result has to be copied back.

The solution proposed by Sun is to use direct buffers, which permit allocating a piece of memory outside of the VM that can then be passed directly to native code.

This was also the solution adopted by our jblas library. However, it quickly turned out that the memory management for DirectBuffers is anything but perfect. As has already been pointed out by others, the main problem is that Java does not take the amount of memory held by DirectBuffers into account when computing its memory footprint. What this means is that if you allocate a large number of DirectBuffers, the JVM will happily fill all of your memory, and potentially the swap space, without triggering a single garbage collection.

After this happened a few times to me, I began wondering whether DirectBuffers are garbage collected at all, but it seems that everything is implemented reasonably well: the memory allocated with direct buffers is managed via PhantomReferences in order to be notified when the corresponding object is garbage collected, and if you call System.gc() by hand often enough, your DirectBuffers eventually get garbage collected. That is, most of the time, at least.

The problem is that there is hardly a way around this. Maybe it's my fault because direct buffers were never intended to be allocated in large numbers. I guess the typical use would be to allocate a direct buffer of some size and then use that one fixed direct buffer to transfer data between the native code and the JVM.

The only problem is that this is basically the copying scenario which direct buffers should have helped to avoid... .

I've written a little program which compares the time it takes to copy an array using the GetDoubleArrayElements() function and using the bulk put and get functions of DoubleBuffer. The results look like this (on an Intel Core2 with 2 GHz):



What is a bit funny (and also brings me directly to the second point) is that GetDoubleArrayElements takes longer than DoubleBuffer.put() and get() after a size of about 100kb. Both routines basically have to deal with the same task of copying data from a Java primitive array into a piece of raw memory. But for some reason, the DoubleBuffer variant seems to be able to uphold a certain throughput for much higher amounts of memory.

Code Optimization

The second point basically means that it's more difficult than in a traditionally compiled language to control how fast your code becomes after it has been JIT-compiled. In principle, there is no reason why this should be true, but I've observed a few funny effects which have caused me to doubt that the HotSpot technology is optimized for number crunching.

For example, as I've discussed elsewhere, copying data between arrays, or accessing DirectBuffers leads to drastically different performances on the same virtual machine depending on the byte-code compiler. For some reason, the eclipse compiler seems to produce code which leads to much faster machine code than Sun's own compiler on Sun's HotSpot JVM.

What this means is that it is very hard to optimize code in Java, because you cannot be sure whether the JIT compiler of the virtual machine will turn it into efficient machine code. With compiled languages the situation is in principle similar, but at least you can control which compiler you use and optimize against that.

Summary

So I'm not quite sure what this means. Maybe some JVM guru can enlighten me on what is happening here. I don't even see what is wrong with the way direct buffers are implemented, but the bottom line is that memory management for direct buffers is not very robust, meaning that your program will typically consume much more memory than it should.

Monday, October 20, 2008

What is Machine Learning? Revisited

I always hated questions like "what is an organism" in school. What makes this kind of question hard is that while you typically have quite an elaborate idea about what an organism is, summarizing all of this information in one sentence can be very hard.

For me, the question "what is machine learning" is very similar. I've already tried to answer that question, or to better characterize the role data sets play in machine learning, but I never found those definitions sufficient.

But is it really that important to be able to define what machine learning is? After all, we all know what it is, right? Well, I found that answering what machine learning is to you can actually be very helpful when it comes to deciding which new ideas to follow or planning what to do in the long run. And it also helps a lot in answering the question of what I do for a living. Being able to describe what you're doing is at least better than laughing out loud and then saying "well... ." (that really happens...)

Okay, so the usual definition you often hear is that machine learning is about writing programs which can learn from examples. This is about as vaguely correct and non-committal as it gets. Put differently, machine learning is about programs which can learn from examples, and solve complex problems which are hard to formalize.

So far, so good, but when you take this characterization seriously, you see that machine learning is mostly a certain approach, or conceptual framework, for solving computational problems. In particular, machine learning is not directly linked to an application area (like, I don't know, building fast and reliable databases or computer networks), but it is more of a meta-discipline. In this respect, it is maybe more similar to theoretical computer science, or even mathematics, where you develop theory and neglect the question of what the practical implications are, because you are confident that eventually, somebody will find a way to use it (After all, who would have thought that number theory would one day be so important for cryptography?)

This also means that we often find ourselves in the position of an intruder in other areas, telling people that a "learning approach" works better than the tools they have been building for the last couple of years. Those of you who work on projects with a strong application focus like bioinformatics, computational chemistry, or computer security certainly know what I am talking about... .

So on the one hand, machine learning deals with problems which are hard to formalize, and concrete data sets become a substitute for formal problem definitions. On the other hand, machine learning can be, and often is, studied in quite an abstract fashion, for example by developing general-purpose learning methods.

But is it really possible to study something abstractly where we often need concrete data sets to represent certain questions and problems? Of course, elaborate theoretical frameworks for modeling learning exist, for example, for supervised learning. But these formalizations are ultimately too abstract to capture what really characterizes the problems we wish to solve in the real world.

I personally think the answer is no, meaning that it's not possible to do machine learning abstractly, because we need concrete applications and data sets to provide real challenges to our methods.

What this also means is that you need to compete not only with other machine learning people, but also with people who approach the problems in the classical way, by trying to write programs which solve the problem. While it is always nice (and also very impressive) if you can solve a hard task with a generic machine learning method, you need to compete with "classical" approaches as well to make sure that the "learning approach" really has an advantage.

Tuesday, September 09, 2008

Why are people following me on twitter?


A few months ago I created an account on twitter. Part of me just wanted to try out the newest Web 2.0 thing everybody's crazy about, but I also convinced myself that I had a certain need for it: I was going to a wedding without my family and thought that this way I could keep them up to date on my whereabouts.

So basically, I posted a few tweets for about one weekend in German, and that was more or less it.

The funny thing is that after that weekend I got 3 followers, most of whom didn't even speak German. By now, I have 10 followers, and I really don't think I deserve them. So why are people subscribing to my feed?

Well, one person I know personally, and a few seem to follow me because I stated in my profile that I'm working on machine learning, but that still leaves about 5 people, and frankly, I don't even know how they found my feed.

Anyway, I also haven't really yet understood what twitter could do for me. I'm not saying that it doesn't make sense at all. For example, I'm following Charles Nutter, one of the main guys working on jruby, and I found his tweets to be a nice way to track what he is doing and what he is working on.

In my case, however, it doesn't really work. I'm involved in so many things that people would get seriously confused if I wrote down every little bit (writing a proposal/discussing with students/thinking about world domination (muahahah)/reviewing a paper/fixing cron jobs). I could tweet about my research, but I'm not even sure it would be wise to tell everybody what I'm working on, because either it doesn't work out, and then it could be kind of embarrassing, or it actually works, and then I'm just giving other people ideas about what to look into.

Lately, I've had kind of an insight: the penalty for subscribing to a low-volume twitter account is quite small (apart from losing track of what the heck you're subscribed to). Some people have like ten thousand subscriptions. But if most of them don't post anything useful, everything's fine. And if you subscribe to somebody who posts a lot but you lose interest, you can get rid of him easily. So maybe everything's making sense after all.

Well, I'll be attending this year's NIPS conference in December. An excellent opportunity to try twitter again ;)

Monday, September 08, 2008

NIPS outcome and MLOSS Workshop


Well, the NIPS results are out. If you don't know it, it is one of the largest (maybe the largest) conferences in machine learning, held each year in early December, and on Saturday they sent around which papers were accepted and which were not.

Unfortunately, none of my papers made it, although one got quite close. On the other hand, I'm very glad to announce that our workshop on machine learning open source software has been accepted. This will be the second (actually third) installment: In 2005, the workshop was not included in the program, but many people found the issue important enough to come to Vancouver a day earlier and take part in a "Satellite Workshop".

In 2006 we were accepted and actually had a very nice day in Whistler. When I noticed that I was personally enjoying the workshop, I knew that we had managed to put together a nice program. Maybe the highlight was the final discussion session, with Fernando Pereira stating that there is little incentive for researchers to work on software because there is no measurable merit in doing so. Eventually, this discussion led to a position paper and finally to a special track on machine learning software at the Journal of Machine Learning Research.

I'm looking forward to this years workshop, and hope that it will be equally interesting and productive!

Thursday, September 04, 2008

New Paper Out

I'm very happy to announce that my latest paper just came out in the Journal of Machine Learning Research. Actually, it is the second half of work which developed out of my Ph.D. thesis. The other paper came out almost two years ago, which tells you a bit about how long reviews can take if you're unlucky.

Anyway, from a theoretical point of view, the first paper studies eigenvalues of the kernel matrix while the second one studies eigenvectors, and both derive approximation bounds between the finite-sample eigenvalues and the asymptotic eigenvalues you'd get as the number of training points tends to infinity. These bounds have the important property that they scale with the eigenvalue under consideration: you don't have one bound for all eigenvalues, but instead a bound which becomes smaller as the eigenvalue becomes smaller. This means that you won't have the same bound on eigenvalues of the order of 10^3 and 10^-6; the error will actually be much smaller on the smaller eigenvalue. If you run numerical simulations, you will immediately see this, but actually proving it was a bit harder.

What this practically means is a different story, and I personally think it is actually quite interesting (of course :)): If you have some supervised learning problem, and you take a kernel and start to train your support vector machine, then these results tell you that even though the kernel feature space might be very high-dimensional, the important information is contained in a low-dimensional subspace, which also happens to be spanned by the leading kernel PCA components.

In other words, when you choose the right kernel, you can do a PCA in feature space, and just consider a space spanned by the directions having largest variance. In essence, if you choose the correct kernel, you are dealing with a learning problem in a finite-dimensional space, which also explains why these methods work so well.

The "old" story was that you're using hypothesis classes which have finite VC dimension, and then everything's fine. This is still true, of course, but these new results also show why you can expect low-complexity hypothesis classes to work at all: Because a good kernel transforms the data such that the important information is contained in a low-complexity subspace of the feature space.

So I hope I could make you a bit interested in the paper. I'm trying to put together a bit more of an overview on my home page, so make sure to check that out as well in a week from now.

Wednesday, September 03, 2008

Google Chrome

Well, I decided not to blog about Google Chrome. We can't all be non-conformists, right?

I mean, It's Only A Browser™, after all.

Then again, I wonder how fast that V8 Javascript machine really is... .

If you're still interested in a not entirely glorifying post, you might want to check out the post by JRuby's Charles Nutter.

And now excuse me while I go shopping.

Monday, August 25, 2008

Benchmarking javac vs. ecj on Array Access

One of the projects I'm currently working on is a fast linear algebra matrix library for Java. I know that there already exist some alternatives, but I usually found that they were either no longer actively maintained or their performance was much slower than what, for example, matlab would give you.

So in principle, all I wanted was a matrix library which directly interfaces with some fast BLAS or LAPACK library, for example ATLAS, but hides all of the Fortran greatness behind a nice object-oriented wrapper. It turns out that something like that does not seem to exist, so we started to work on our own little library called jBLAS, which is going to be released soon.

Anyway, while working on that library, I got a bit worried about the performance of the stuff I'm doing in Java. These are mainly element-wise operations like multiplying all elements of a matrix with the corresponding elements of another matrix.

Since we're interfacing with C (actually Fortran) code a lot, all of the matrix data is stored in direct buffers, in particular java.nio.DoubleBuffer, and I wondered how their put and get access methods compared to array access, and of course to C. So I wrote a little program in eclipse which did just that and compared the outcome with what I got in C, and to my surprise, Java was very much on par, and the performance difference between accessing arrays and direct buffers was very small.

But when I tried to repeat the results at home, I got very different numbers, both for arrays and for direct buffers, which suddenly took five times longer than arrays. At first I suspected that the different processor maker was the issue, but after some more tweaking I found out that, again to my great surprise, the issue was the compiler.

It turns out that the eclipse compiler ecj is much better at compiling my benchmark than Sun's own javac. To see this for yourself, here is my benchmark:
import java.nio.*;

class TicToc {
    private long savedTime;

    public void tic(String msg) {
        System.out.print(msg);
        System.out.flush();

        saveTime();
    }

    public void tic(String fmt, Object... args) {
        System.out.printf(fmt, args);

        saveTime();
    }

    private void saveTime() {
        savedTime = System.currentTimeMillis();
    }

    public void toc() {
        long elapsedTime = System.currentTimeMillis() - savedTime;

        System.out.printf(" (%.3fs)\n", elapsedTime / 1000.0);
    }
}


public class Main {
    private static void copyArray(int size, int iters) {
        double[] source = new double[size];
        double[] target = new double[size];
        for (int i = 0; i < iters; i++)
            for (int j = 0; j < size; j++)
                target[j] = source[j];
    }

    private static DoubleBuffer createBuffer(int size, ByteOrder bo) {
        return ByteBuffer.allocateDirect(Double.SIZE / Byte.SIZE * size)
                .order(bo).asDoubleBuffer();
    }

    private static void copyDoubleBuffer(int size, int iters, ByteOrder bo) {
        DoubleBuffer source = createBuffer(size, bo);
        DoubleBuffer target = createBuffer(size, bo);
        for (int i = 0; i < iters; i++)
            for (int j = 0; j < size; j++)
                target.put(j, source.get(j));
    }

    public static void main(String[] args) {
        TicToc t = new TicToc();

        final int SIZE = 1000;
        final int ITERS = 1000000;

        t.tic("copying array of size %d %d times...", SIZE, ITERS);
        copyArray(SIZE, ITERS);
        t.toc();

        t.tic("copying DoubleBuffer of size %d %d times... (LITTLE_ENDIAN)", SIZE, ITERS);
        copyDoubleBuffer(SIZE, ITERS, ByteOrder.LITTLE_ENDIAN);
        t.toc();
    }
}

So let's compare the run-times of these programs using the latest Java JDK 6 update 7 and the stand-alone eclipse compiler 3.4.

Then, I compile the source with
javac Main.java
and run it with java -cp . Main:
copying array of size 1000 1000000 times... (2.978s)
copying DoubleBuffer of size 1000 1000000 times... (LITTLE_ENDIAN) (4.291s)
Now the same with the eclipse compiler: I compile it with
java -jar ..path_to_jar../ecj-3.4.jar -1.6 Main.java
and get the following results:
copying array of size 1000 1000000 times... (0.547s)
copying DoubleBuffer of size 1000 1000000 times... (LITTLE_ENDIAN) (0.633s)
Note that the two versions differ only in their bytecode, yet the eclipse compiler manages to produce code which runs about 6 times faster, and for which the direct buffer is only slightly slower than the plain array.

I'll have to look into this quite some more, but if this is true, this is pretty amazing!

Thursday, August 21, 2008

My keychain Tool

I wrote a little script in ruby which manages passwords. It is a strictly command line tool and allows you to store keys, retrieve them, and even generate more or less pronounceable passwords of arbitrary length using a Markov chain. All the data is stored in a blowfish-encrypted file and should therefore be somewhat safe.

Since the script depends on two other ruby gems (namely crypt and highline), I've uploaded it to rubyforge. You can now simply install the script with
  gem install keychain

The interface is quite easy: you can either pass the command as the first argument, or invoke keychain, which then enters an interactive mode. Simply try
   keychain help
to see a list of commands.

Needless to say, I cannot guarantee that this tool will not one day forget all your valuable passwords, but at least I can assure you that I personally trust my code. :)

Wednesday, August 20, 2008

Nokia E61i

I have owned a Nokia E61i since last December, and all in all I must say that it has been quite an agreeable experience. It is solidly built, has a small but usable keyboard, real multitasking, a large screen, and a really nice browser which even supports JavaScript. Recently I found out that it supports the SIP standard for Voice over IP, which integrates very nicely with many "open" commercial offerings. Software called JoikuSpot even turns your cell phone into a WiFi hotspot, giving you almost config-free access to the Internet wherever you are (okay, only for http in the free version).

Maybe the weakest component (and incidentally one of the most important) is the messenger, and I'd like to elaborate a bit on the small weaknesses and nuisances of that piece of software. Nokia, if you're reading this, I politely ask you to consider fixing these things in the next update! Just do it. Pleeease!

Robustness

Generally, the phone is quite robust software-wise. It almost never locks up, and even if it does, you can usually still kill the offending program easily (How? Push the "Swirl" button for a few seconds, scroll down to the program, and press the backspace key). However, if there is one program which hangs more often than others, it's the messenger.

Now, you would expect that unstable connections are really part of the life of a cell phone. After all, it's a mobile device, which means that its position is constantly changing and reception may be quite unstable. Still, whenever the phone tries to connect but can't, it decides that it's safer to turn off automatic retrieval. I wonder if just waiting for a few minutes wouldn't be a more reasonable alternative.

And if you can connect but something goes wrong during the connection, the mail software also really likes to die in a number of less obvious ways. Sometimes it claims that it is still connected to the mailbox, but actually it isn't. If you start to read emails, they vanish. Manually closing the connection sometimes helps, but often it doesn't.

All in all, I know that IMAP is a complex protocol, but pleeease couldn't you try to make the error handling a bit more reasonable?

Certificates

The other big nuisance is that automatic retrieval only works when you have the certificate of the server. Now if you're working for a big company (which might be using Exchange anyway) that has paid enough money for decently trusted certificates, everything's okay, but if you're working at a university, or trying to sync emails from your own server, you're out of luck.

The situation might not be entirely bad if there were some way to install missing certificates easily, for example in the dialog for untrusted certificates. I wonder why they chose not to allow that... .

So how do you get the certificate onto the phone? First of all, you have to obtain the certificate itself. If you're lucky, there is some web page where you can download it. Then you only have to pray that the web server sets the correct mime type and everything's good.

If not, you have to extract the certificate yourself. After some trial and error, I found that Internet Explorer (but not Firefox or Thunderbird) allows you to save a certificate to disk. So again, if you're lucky, the same certificate is used for secure web connections and you can save it from there.

Well, so how do you get the certificate onto the phone? Putting it somewhere on the phone using the file explorer won't work, because the phone doesn't recognize certificates in its own file browser. So either you have to mail it to yourself (confirming one more time that yes, I want to connect to that server), again hoping that the mime types are set correctly, or you have to put it somewhere on the web from where you can then download it.

To cut a long story short: Nokia, please let us install certificates whenever we encounter an untrusted one.

Hyperlinks

Finally, one last nuisance is that (at least on the E61i, other versions of Symbian might have only one browser for Internet and WAP) if you click on a link in a message, the WAP browser is opened, not the Internet browser. Which really doesn't make any sense.

Nokia, will you help me out?

Update: If you're running Linux, you might try the following "one-liner" to get the certificate from a server
echo "quit" | openssl s_client -connect server-address:server-port |\
ruby -e 'flag=false; readlines.each {|x| flag ||= (x =~
/BEGIN CERT/); if flag then puts x end; flag &&= !(x =~ /END CERT/) }' |\
openssl x509 -outform DER >cert.der
Replacing "server-address:server-port" with your server information, of course.

Update: Even if you install the server certificate on the phone, it may still be that the signing certificate is unknown. Using Internet Explorer, you can save all those other certificates as well, and you have to install them all to make the phone accept the server certificate. I haven't yet figured out how to do this with openssl, though... .

Wednesday, August 06, 2008

Static MVC

Update Feb 4, 2009: I recently learned about Jekyll, which puts these ideas into a solid framework.

I know that web applications and content management systems are the hot thing right now (and maybe will be for some time), but sometimes you wonder whether a few static html pages wouldn't be sufficient anyway. The advantages are pretty clear: No database back-end, no tweaking of apache configurations, and of course, nothing beats static html pages in terms of speed and memory requirements.

On the other hand, editing html pages by hand seems so "old-school", and there are some real maintenance issues. Appearance can be nicely controlled by CSS, but if you actually want to move bits of information around, you have to do it yourself. And let's just hope that you haven't gotten into adding some nice Javascript effects to your pages... .

This approximately describes the situation I was in when I decided to restructure my own homepage.

It turns out that it is actually possible to write a page generator along the principles used in web frameworks such as rails and end up with something quite elegant. And it's fun, too!

Some of the power of a framework like rails comes from a clear separation between data, business logic, and presentation. This kind of separation is also called the MVC-Pattern (Model-View-Controller). All the data is typically stored in a database like MySQL, and the presentation is encoded in template files.

So how can we duplicate this kind of separation with a static page-generator? Luckily, ruby comes with all the tools you need. Let us walk through a little example, the inevitable blog (without commenting functionality, of course).

Let's start with the database. We will just store all the information into YAML files. YAML is a file format which puts special emphasis on being easy to read and maintain by a human. The format is pretty self-explanatory:
- date: August 4, 2008
  title: Same old, same old
  text: |
    Sometimes, you just wake up with that feeling that something
    great is happening today. I couldn't really lay my finger on
    it, but I was very sure that...

- date: August 6, 2008
  title: You wouldn't believe
  text: |
    You know when I said that something great would happen two
    days ago? And you know what...

This file contains two entries, encoded as hashes. The "|" indicates indented multi-line data.

Using the YAML library, you can easily load this data with
require 'yaml'

blog = YAML::load_file('blog.yaml')

And if you inspect blog, you will find that it is an array of two hashes.

Next, we need to render the result. We'll use the erb library, which you might know from rails. Here is a possible template file
<html>
  <head>
    <title>My little static blog</title>
  </head>
  <body>
    <% for b in blog %>
      <h1><%= b['title'] %></h1>

      <p><em><%= b['date'] %></em></p>

      <p><%= b['text'] %></p>
    <% end %>
  </body>
</html>
(Almost looks like we're coding in rails, right? :))

Let's finally put all things together. We store the YAML file in blog.yaml and the template file in blog.rhtml. Then, the following script will generate blog.html:
require 'yaml'
require 'erb'

blog = YAML::load_file('blog.yaml')

blog_template = open('blog.rhtml').read
html = ERB.new(blog_template).result

open('blog.html', 'w') {|o| o.write html}

And here is the result:


This way, you can easily add new posts by editing just the YAML file, and fiddle with the presentation in the template file. You're not even restricted to a fixed number of pages: the script could, for example, also generate a separate (static) page for each individual post, as sketched below.
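
Here is a minimal sketch of that idea (the per-post template post.rhtml is hypothetical; it would refer to the local variable b):
require 'yaml'
require 'erb'

blog = YAML::load_file('blog.yaml')
post_template = ERB.new(open('post.rhtml').read)

blog.each_with_index do |b, i|
  # the binding makes 'b' and 'i' visible inside the template
  open("post_#{i}.html", 'w') { |o| o.write post_template.result(binding) }
end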

And it doesn't stop there: adding a lightweight markup engine like BlueCloth, you can even start using wiki-style markup in your texts.
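
For example, a quick sketch of what that looks like with BlueCloth:
require 'bluecloth'

text = "This is **bold**, and this is a [link](http://example.com)."
html = BlueCloth.new(text).to_html
# => "<p>This is <strong>bold</strong>, and this is a <a href="http://example.com">link</a>.</p>"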

So, it doesn't always have to be a full-fledged web application framework: you can have similar flexibility (minus the interactivity, of course) using a few readily available tools, and very few lines of code as well.

Monday, August 04, 2008

Quo vadis?

Actually, I've been thinking a bit about what to do with this blog. So far, I've even kept it sort of semi-private by not linking from my work homepage to it, maybe because I didn't want to have to worry if it is official enough, or if it reflects my scientific interest well enough. Consequently (and maybe because I don't have anything interesting to say), almost nobody is reading my blog and even friends are sometimes surprised that I have a blog.

I've thought about what I could do to make my blog more relevant and also more interesting, you know, concentrate on a few topics to give people a better idea of what to expect, and maybe also to give them a reason to come back and actually follow the blog.

Maybe the most interesting revelation was that the topics I cover in my blog are quite different from my scientific work, which is often quite heavy on the theory, and less concerned with the things I apparently like to blog about: programming languages, nice tools, and the occasional insight into some technology twist.

So I guess I'll try to accept these different aspects of my interests, and refrain from attempts to streamline my web presence.

This may be somewhat unrelated, but I also chose to rename the blog from "Trunk of Heroes" (whatever that was supposed to mean) to "Marginally Interesting" which is such a nice long phrase which also conveys only little information. At least, now I can say funny things like "My blog is Marginally Interesting" :)

Anyhow, the semester is over, which means that I'll have more time for doing some research and - of course - learning some exciting new piece of technology.

JRuby 1.1.3 and Jython 2.5

Just for the record, jruby 1.1.3 has been released. Startup time is again down a bit, but not overly much so. All in all, I think they are doing a terrific job. On a related note, the jython project has also released an alpha version which is going to be compatible with python 2.5. The last bigger release of jython is already a bit old and was only compatible with python 2.2.

On the other hand, I find it harder and harder to choose between the two languages. Somehow, they seem to fill almost the same spot, and it is only a question of community whether you're more into ruby or python. In any case, if you're looking for some nice integration with java, you're going to have both alternatives soon, which is a good thing, I guess.

Wednesday, July 09, 2008

Data Mangling Basics

When you're new to machine learning, there usually comes the point where you first have to face real data. At that point you suddenly realize that all your training was, well, academic. Real data doesn't come in matrices, real data has missing values, and so on. Real data is usually just not fit to be directly digested by your favorite machine learning method (and if it is, consider yourself very lucky).

So you spend some time massaging the data formats until you have something which can be used in, say, matlab, but the results you get just aren't that good. If you're lucky, there is some senior guy in your lab you can ask for help, and he will usually apply some preprocessing steps to the data that you've never heard of in class.

Actually, I think this should be taught in class, even when there is no systematic methodology behind it and no fancy theorems to prove. So here is my by no means exhaustive set of useful preprocessing steps.

  • Take subsets You might find this pretty obvious, but I've seen students debugging their methods directly on the 10000 instances data set often enough to doubt that. So when you have a new data set and you want to quickly try out several methods, take a random subset of the data until your method can handle the data set in seconds, not minutes or hours. The complexity of most data sets is also such that you usually get somewhat close to the achievable accuracy with a few hundred examples.

  • Plot histograms, look at scatter plots Even when you have high-dimensional data, it might be very informative to look at histograms of individual coordinates, or scatter plots of pairs of coordinates. This already tells you a lot about the data: What is its range? Is it discrete or continuous? Which directions are highly correlated, and which are not? And so on. Again, this might seem pretty obvious, but often students just run the specified method without looking at the data first.

  • Center and normalize While most method papers make you think that the method "just works", in reality, you often have to do some preprocessing to make the methods work well. One such step is to center and normalize your data to have unit variance in each direction. Don't forget to save the offset and normalization factors: you will need them to correctly process the
    features for prediction!

  • Take Logarithms Sometimes, you have to take the logarithms of your data, in particular when the range of the data is extremely large and the density of your data points decreases as the values become larger. Interestingly, many of our own senses work on a logarithmic scale, as for example, hearing and vision. Loudness is for example measured in decibel, which is a logarithmic scale.

  • Remove Irrelevant Features Some kernels are particularly sensitive to irrelevant features. For example, if you take a Gaussian kernel, each feature which is irrelevant for prediction increases the number of data points you need to predict well, because for practically each realization of the irrelevant feature, you need to have additional data points in order to learn well.

  • Plot Principal Values Often, many of your features are highly correlated. For example, the height and weight of a person is usually quite correlated. A large number of correlated features means that the effective dimensionality of your data set is much smaller than the number of features. If you plot the principal values (eigenvalues of the sample covariance matrix C'*C, you usually will notice that there are only a few large values, meaning that a subset of the space will already contain most of the variance of your data. Also note that a projection to those dimensions is the best low-dimensional approximation with respect to the squared error, so using this information, you can transform your problem in one with fewer features.

  • Plot Scalar Products of the output variable with Eigenvectors of the Kernel Matrix Finally, while principal values only tell you something about the variance in the input features, if you plot the scalar products between the eigenvectors of the kernel matrix and the output variable in a supervised learning setting, you can see how many (kernel) principal components you need to capture the relevant information about the learning problem. The general shape of these coefficients can also tell you if your kernel makes sense or not. Watch out for an upcoming JMLR paper for more details.

  • Compare with k-nearest neighbors Finally, before you apply your favorite method to the data set, try something really simple like k-nearest neighbors (last sketch below). This gives you a good idea of what kinds of accuracies you can expect.
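
To make the first two points concrete, here is a minimal Python/NumPy sketch of the "look at your data" phase. Everything in it is made up for illustration: the file name data.csv, the subset size of 500, and the choice of coordinates to plot.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical data matrix: one example per row, one feature per column.
    X = np.loadtxt("data.csv", delimiter=",")

    # Work on a random subset first, so experiments run in seconds.
    rng = np.random.RandomState(0)
    Xs = X[rng.permutation(X.shape[0])[:500]]

    # Histogram of one coordinate, scatter plot of a pair of coordinates.
    plt.figure()
    plt.hist(Xs[:, 0], bins=30)
    plt.title("distribution of feature 0")

    plt.figure()
    plt.scatter(Xs[:, 0], Xs[:, 1])
    plt.xlabel("feature 0")
    plt.ylabel("feature 1")
    plt.show()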
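
Centering, normalizing, and taking logarithms could then look like the following sketch. The synthetic lognormal data only stands in for your own, and log1p is used so that zeros don't cause trouble; the important part is that the offset and the normalization factors are computed on the training data and reused at prediction time.

    import numpy as np

    def fit_scaler(X):
        # Per-feature offset and scale, estimated on the training data.
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0.0] = 1.0  # guard against constant features
        return mean, std

    def apply_scaler(X, mean, std):
        return (X - mean) / std

    # Synthetic stand-in for heavy-tailed, nonnegative data.
    rng = np.random.RandomState(0)
    Xtrain = rng.lognormal(size=(200, 5))
    Xtest = rng.lognormal(size=(50, 5))

    # Take logarithms first (log1p is safe at zero), then center and scale.
    mean, std = fit_scaler(np.log1p(Xtrain))
    Xtrain_s = apply_scaler(np.log1p(Xtrain), mean, std)

    # Reuse the saved offset and scale for the test/prediction data!
    Xtest_s = apply_scaler(np.log1p(Xtest), mean, std)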
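
Plotting the principal values is only a few lines once you have the sample covariance matrix. Again, the synthetic low-rank data is just there to make the sketch self-contained; with real data you would load your own matrix X.

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic stand-in: 200 points that live mostly in a 3-dimensional
    # subspace of a 10-dimensional feature space.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 3).dot(rng.randn(3, 10)) + 0.1 * rng.randn(200, 10)

    Xc = X - X.mean(axis=0)               # center first
    C = Xc.T.dot(Xc) / (Xc.shape[0] - 1)  # sample covariance matrix
    evals = np.linalg.eigvalsh(C)[::-1]   # eigenvalues, largest first

    # Only a few large values means low effective dimensionality.
    plt.plot(evals, "o-")
    plt.ylabel("principal values")
    plt.show()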
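
The same goes for the scalar products with the eigenvectors of the kernel matrix. Here is a sketch using a Gaussian kernel; the kernel width of 1.0 and the toy regression problem are arbitrary choices for illustration, not a recommendation.

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy regression problem: y depends smoothly on the first feature only.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

    # Gaussian kernel matrix with width 1.0.
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sqdist / 2.0)

    evals, evecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    evecs = evecs[:, ::-1]            # reorder: largest eigenvalue first

    # If only the first few coefficients are large, a few kernel
    # principal components already capture the learning problem.
    coeffs = evecs.T.dot(y)
    plt.plot(np.abs(coeffs), "o-")
    plt.ylabel("|<eigenvector, y>|")
    plt.show()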
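
Finally, a plain k-nearest-neighbor baseline takes only a few lines, and the same sketch also demonstrates the point about irrelevant features: appending pure noise dimensions to a toy classification problem hurts the accuracy. If you run it, you should see the second printed accuracy drop noticeably below the first.

    import numpy as np

    def knn_accuracy(Xtr, ytr, Xte, yte, k=5):
        # Plain k-nearest-neighbor classification by majority vote.
        correct = 0
        for x, label in zip(Xte, yte):
            dist = ((Xtr - x) ** 2).sum(axis=1)
            nearest = ytr[np.argsort(dist)[:k]]
            correct += (np.bincount(nearest).argmax() == label)
        return correct / float(len(yte))

    rng = np.random.RandomState(0)
    # Toy two-class problem: the label depends on the first two features.
    X = rng.randn(400, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    print(knn_accuracy(X[:300], y[:300], X[300:], y[300:]))

    # Append irrelevant noise features and watch the accuracy drop.
    Xn = np.hstack([X, rng.randn(400, 20)])
    print(knn_accuracy(Xn[:300], y[:300], Xn[300:], y[300:]))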


So in summary, there are a lot of things you can do with your data before you plug it into your learning method. If you do it correctly, you will learn quite a lot about the nature of the data, and you will have transformed it so that you can learn more robustly.

I'd be interested to know what initial procedures you apply to your data, so feel free to add some comments.

Saturday, July 05, 2008

Time Management for PostDocs

As a student, you always have a lot of stuff to do. But on the other hand, you are in the convenient position that others have decided what is important, and you basically just need to make sure that you allot enough time to each single item.

Once you start working as a Ph.D. student or as a PostDoc, things gradually stop being that simple. You acquire a few responsibilities, and before you know it, you not only have to think about doing some original research, but you also have to supervise a bunch of students, give lectures, grade papers, organize a seminar, do reviews, do a bit of system administration, take care of the office furniture for the new building you're moving into, keep the group's website up-to-date, and so on.

So while you might be just as busy as you were as a student, the big difference is that now all these different duties actually compete for your time, and you have to start actively managing it. Otherwise, only the squeaky wheel gets the grease. And just as it happens, most of the stuff which has "URGENT" written all over it in large, red letters is not what is most important in the long run.

The most important step might be to realize that it's not only about sorting and ranking the tasks in your inbox, but that you must go a step further and also control what gets in your inbox and what priorities you assign. Being fully aware of one's priorities certainly leads to better load balancing than just trying to satisfy all constraints of the items on your to-do-list.

I guess the language I've been slipping into in the last paragraph already points to the fact that I'm a computer scientist by training. So for me (at least if you believe the stereotypes), everything is just a process, and the goal is to maximize overall productivity with respect to some sensible criteria, given certain constraints like the fact that each day only has 24 hours.

Consequently, one could expect that the time management approaches which appeal to hackers are a bit different from those aimed at the mainstream, which focus more on "soft topics" like how to get into the right frame of mind, or which explain their principles with allegories involving panda bears (just kidding).

There is one book out there, Time Management for System Administrators published by O'Reilly, which exactly fits the above description. For the author, Tom Limoncelli, time management is about learning useful habits, turning useful behavior into routines which become so automatic that you eventually forget why you're doing them in the first place. One example he gives is that he always takes his car to the gas station on Sundays. At one point he wondered why he was actually doing this, and then he remembered that he often used to miss appointments on Monday mornings before he had this habit. He then changed his habits and eventually forgot why, leading to effortless efficiency, the best of all conditions. The other big topic in his book is a way to manage your to-do-lists together with your appointments such that you do not have this one big to-do-list of doom where you know you will never ever finish it, but instead realistic small lists for each day. Oh, and the book also sports a number of Dilbert comics, underlining who its main readers are expected to be.

All in all, I found the book very fun to read and it contained a lot of nice ideas. On the other hand, I couldn't really implement its process of managing my to-do-lists. I tried, but the main problem was that my days are just too unpredictable: although I selected a small number of things I wanted to do on a specific day, I seldom managed to do all of them, leading exactly to the kind of never-ending backlog one should avoid. But your mileage may vary.

The other time management framework which seems to be fairly popular among hackers is "Getting Things Done" by David Allen. Actually, the whole framework is described in a single book of the same title. The appeal for hackers very clearly comes from the fact that the book mainly describes a single process for managing all your stuff. The main idea is that you should have some means of keeping all the information you need outside of your brain, freeing it from a lot of burden and helping you to become less worried and more productive (of course). Every single item you have to take care of, and every piece of information, goes through a process where it is analyzed and the next necessary action is decided upon; this information is then stored in archives or on several lists. One of these lists, the "next action" list, contains all the next actions you could take, sorted by the context in which you could perform them. The idea is that when you find some time to make phone calls or to read, you just look at your list and take the appropriate actions.

Personally, what I liked about this scheme is that there is so little planning. It focuses more on what you could do instead of what you should do. Instead of planning when you could make that phone call, or even making a short list of things you want to do on a certain day, you just compile lists of things you could do and do them whenever you have the time. And let's face it, there are only so many things you can do in a day, and this approach naturally leads to a situation where you have made the best use of your time (provided that you choose your actions wisely from the list).

Looking at other professions, this approach reminds me of the way physicians are organized. A doctor has to deal with many cases at once, but cannot possibly keep all of the information about his patients in his head. Therefore, he uses files, one for each patient. When a patient comes in for an appointment, the file provides all the necessary information. The doctor decides on the next action to be taken and puts that information into the file. The patient then makes another appointment so the doctor can keep track of the development. But the doctor can go home at night and forget almost everything he had to deal with over the day.

There is a very interesting interview with David Allen in Wired, if you are interested, together with a few links to other resources.

Of course, I have to admit that I'm not following this process exactly either (although David Allen specifically warns against that), partly because sometimes things just get out of hand, and then again, everybody adds his own ideas to the mix. In the end, I think it is not that important whether you follow some process to the letter, but that you gain some awareness of what you do and why, why some things never get done, and why those are sometimes exactly the things you really should be doing more of.

And, of course, there is also always the danger with hackers that they get so fascinated with a new toy that they miss the point where they maybe should stop optimizing the process and start putting it to work. I'm certainly not making fun of others, maybe I should invest more time into honing my time management skills, but if you're really into "life hacking" and finding the best physical implementation of the GTD scheme, have a look at 43folders or these pictures of people creating some amazing organizers from plain sheets of paper.

If you're more into software, you might find chandler or beeswax interesting.

Friday, June 06, 2008

Most Incredible Emacs Feature. Ever.

We were all working hard towards the NIPS deadline when I suddenly noticed a menu in my emacs editor called "Preview". Being mostly a key-command and Meta-x kind of emacs user, I seldom look up at the menu, but this got me interested.

So I go up there and click on "Generate previews... for section" (you know, always be on the safe side, no idea what that'll do to the whole document), and, whoa!

It seems that somebody actually had enough time on his hands to write some code which takes all the figures and math formulas, extracts them to a separate file, runs latex on them, and then re-includes the rendered previews into the editor. In LISP!!!

Man, that has got to be the most incredible emacs feature, ever.

A few minutes later, I disabled the previews. I think it is like in the movie The Matrix: once you have learned to read the code, just seeing the rendered version doesn't make you happy anymore. :)

Friday, May 30, 2008

NIPS, mloss.org and jruby-1.1.2

Currently it's one week till this year's NIPS deadline and there is still a lot of work to do, as always. I just wanted to quickly point you to a post I've written on our homepage mloss.org, where I try to make the point that you can actually do useful things with a modern scripting language like python or ruby when building machine learning toolboxes. At the very least, you'll have more powerful modeling possibilities than in matlab.

And then, jruby-1.1.2 has been released. Among other things, they have managed to reduce the startup time from about 1.2s (on my machine) to 0.4s, which is pretty great. I'm planning to write an article about how useful the jruby/java integration is, in particular for machine learning, but that will be after NIPS, I guess...

Saturday, April 12, 2008

JRuby 1.1 released

Recently, jruby V1.1 has been released. It is a full-blown implementation of the latest version of ruby in java. Actually, there are two things which make jruby special:

  • jruby is actually sponsored by sun, which pays two of the main developers full time. I guess sun's motive for this is to forge closer bonds between the Ruby on Rails web framework and java. For jruby, this means that people are working on it full time, and also that they get to talk to the java virtual machine guys at sun to address performance issues within jruby.
  • Compared to ruby 1.8.6, jruby is often faster (if you follow the link, just don't look at the memory usage :)). The reason is that jruby is actually able to compile ruby code to java bytecode. There is even a standalone tool called jrubyc which allows you to compile ruby code ahead of time.

In addition, you get all the usual nice interfacing with java, just like other JVM language implementations such as rhino or groovy, which basically gives you the power to easily access the vast array of existing java libraries, but also to easily "push down" performance-critical code to java.

For me, this last feature is actually the most interesting. As we all know, 90% of the time is spent in 10% of the code, so a clever combination of a scripting language and a compiled language should give you the best of both worlds. But if the compiled language is something like C or C++, you either have to do the interfacing by hand or use a tool like swig. Now swig is a mature piece of code and effortlessly handles a lot of languages, but you will encounter the inevitable segmentation fault every now and again. If, on the other hand, the compiled language is java, you automatically get all the benefits: automatic memory management, exceptions which actually tell you where the error occurred, and so on. Of course, this comes with a performance penalty, but the time you save on debugging will be a big improvement with respect to productivity.

You might wonder, "what about python?" For a number of reasons, ruby and python somehow seem to occupy the same spot on the map of languages, to the point where they have to resort to marketing techniques (ever heard the terms "pythonic" and "the ruby way"?) to distinguish themselves from each other. Scientific-computing-wise, python definitely seems to be ahead of ruby with the scipy project.

However, java-wise, the leading python implementation, jython, is somewhat behind: it is still implementing python 2.2, while we're currently at python 2.5. I think if you really want to use python in a managed environment, you will have to switch to .NET and Microsoft's IronPython.

Update: On April 22, jruby-1.1.1 has been released which contains further bug-fixes and performance improvements.

Saturday, March 22, 2008

Tool of the week (web edition): gotapi.com

This week's tool of the week (well, not every week actually gets a tool of the week, but anyway) is not actually a tool but a website: gotapi.com. It basically provides searchable indices for a large amount of API documentation hosted on other websites, including C/C++, Java, Python, Ruby, HTML, Javascript, CSS, and many others. The interface has some nice AJAX in there and is very usable and fast. Go and check it out!

Monday, February 25, 2008

Tool of the Week (Ruby Edition): fastri

If you have ever considered programming in ruby, you should definitely have a look at fastri. It is a replacement for the standard ri documentation tool which is much faster and has some nice additional features. If you're fed up with waiting for seconds just to learn that your query was ambiguous, this is the way to go. It comes as a server and a client, but there is also a standalone version qri which is only slightly slower.

Sunday, February 24, 2008

Machine Learning and Data Sets

I've been busy taking care of my 11-month-old daughter lately, which leaves almost no time to do anything even remotely as useful as posting on my blog - not that I was doing it much more often when I was still working full time. At the same time, you get a lot of ideas and potentially interesting insights now that your brain has time to idle every now and then, for example while picking up toys thrown to the ground again and again.

Anyway, I lately came to think that machine learners as a whole should devote much more time to working with actual data, in particular machine learners who think of themselves as "method guys" (yes, this also includes me). It usually works like this: you have some technique you really like, and you use it to extend an already existing method until you come up with something you think is really neat. It may have some interesting properties other algorithms don't have, and you really would like to write a paper on it.

But then the problem starts, because in order to show that your extension is actually useful, you have to prove that it makes a difference in practice. So you go around your group asking colleagues whether they have, or know of, some interesting data set. We call this the "have method, need data" phenomenon.

Of course, if you had started with a concrete application in mind, you would never have to ask yourself "oh, this is great, but what is it good for?"

Also, the formally defined problems we have in machine learning are very abstract (like minimizing the expected risk on i.i.d. drawn data points), and many of the actual challenges are really only "defined" by specific data sets.

Anyway, the Data Wrangling blog has recently posted a huge list of links to data sets on the web, which is certainly an interesting starting point.

And yes, if you already have your method, you might also find some interesting "real world application" there.