Thursday, October 23, 2008

Matrices, JNI, DirectBuffers, and Number Crunching in Java

One thing Java lacks is a fast matrix library. There things like JAMA and COLT (not using its LAPACK bindings), but compared to state-of-the-art implementations like ATLAS, they are orders of magnitude slower.

Therefore I've set out some time ago to write my own matrix classes based on ATLAS. It turns out that while this is quite straightforward for most of the time, there are few issues which become complicated very quickly, to the point where I wonder if Java is really the right platform for this kind of application at all.

Basically, there are two areas where the Java platform can give you quite a headache:

  • Pass large amounts of data to native code.
  • Lack of control about how good the JIT compiles code.

Passing Data to Native Code

For a number of reasons, producing highly tuned linear algebra code still requires a lot of low-level fiddling with machine code. If you want really fast code, you have to rely on libraries like ATLAS. The question is then how you pass the data from Java to ATLAS.

Maybe the easiest choice is to store the data in primitive type arrays like double[] and then access them via JNI. Only that the documentation is quite explicit about the fact that this will not work well for large arrays as you typically will get a copy of the array, and afterwards, the result has to be copied back.

The solutions proposed by Sun is to use direct buffers which permits to allocate a piece of memory outside of the VM which is then passed directly to native code.

This was also the solution adopted by our jblas library. However, it quickly turned out that the memory management for DirectBuffers is anything but perfect. As has already been pointed out by others, the main problem is that Java does not take the amount of DirectBuffers into account to compute its memory footprint. What this means is that if you allocate a large number of DirectBuffers, the JVM will happy fill all of your memory, and potentially the swap space without triggering a single garbage collection.

After this happened a few times to me I began wondering whether DirectBuffers are garbage collected at all, but it seems that everything is implement reasonably well: the memory allocated with direct buffers is managed by PhantomReferences in order to be notified when the corresponding object is garbage collected, and if you call System.gc() by hand often enough your DirectBuffers eventually get garbage collected. That is, most of the time, at least.

The problem is that there is hardly a way around this. Maybe it's my fault because direct buffers were never intended to be allocated in large numbers. I guess the typical use would be to allocate a direct buffer of some size and then use that one fixed direct buffer to transfer data between the native code and the JVM.

The only problem is that this is basically the copying scenario which direct buffers should have helped to avoid... .

I've written a little program which compares the time it takes to copy an array using the GetDoubleArrayElements() function and using the bulk put and get functions of DoubleBuffer. The results look like this (on an Intel Core2 with 2 GHz):

What is a bit funny (and also brings me directly to the second point) is that GetDoubleArrayElements takes longer than DoubleBuffer.put() and get() after a size of about 100kb. Both routines basically have to deal with the same task of copying data from a Java primitive array into a piece of raw memory. But for some reason, the DoubleBuffer variant seems to be able to uphold a certain throughput for much higher amounts of memory.

Code Optimization

The second point basically means that it's more difficult than in a traditional compile language to control how fast your code becomes after it has been compiled JIT. In principle, there is no reason why this should be true, but I've observed a few funny effects which have caused me to doubt that the HotSpot technology is optimized for number crunching.

For example, as I've discussed elsewhere, copying data between arrays, or accessing DirectBuffers leads to drastically different performances on the same virtual machine depending on the byte-code compiler. For some reason, the eclipse compiler seems to produce code which leads to much faster machine code than Sun's own compiler on Sun's HotSpot JVM.

What this means that it is very hard to optimize code in Java, because you cannot be sure whether it will lead to efficient machine code by the JIT compiler of the virtual machine. With compile languages the situation is in principle similar, but at least you can control which compiler you use and optimize against that.


So I'm not quite sure what this means. Maybe some JVM guru can enlight me on what is happening here. And I don't even see what is wrong with the way direct buffers are implemented, but on the bottom line, memory management for direct buffers is not very robust, meaning that your program will typically consume much more memory than it should.

Monday, October 20, 2008

What is Machine Learning? Revisited

I always hated questions like "what is an organism" in school. What makes this kind of question hard is while you typically have quite an elaborate idea about what an organism is, summarizing all of this information in on sentence can be very hard.

For me, the question "what is machine learning" is very similar. I've already tried to answer that question, or to better characterize what the role of data sets are in machine learning, but I never found those definitions sufficient.

But is it really that important to be able to define what machine learning is? After all, we all know what it is, right? Well, I found that answering what machine learning is to you can actually be very helpful when it comes to deciding which new ideas to follow or planning what to do in the long run. And it also helps a lot answering the question what I do for a living. Being able to describe what you're doing at least is better than laughing out load and then saying "well... ." (that really happens...)

Okay, so the usual definition you often hear is that machine learning is about writing programs which can learn from examples. This is about as vaguely correct and non-committal as it gets. Put differently, machine learning is about programs which can learn from examples, and solve complex problems which are hard to formalize.

So far, so good, but when you take this characterization seriously, you see that machine learning is mostly a certain approach, or conceptual framework, for solving computational problems. In particular, machine learning is not directly linked to an application area (like, I don't know, building fast and reliable data bases or computer networks), but it is more of a meta-discipline. In this respect, it is maybe more similar to theoretical computer science, or even mathematics, where you develop theory and neglect the question what the practical implications are because you are confident that eventually, somebody will find a way to use it (After all, who would have thought that number theory would one day be so important for cryptography?)

This also means that we often find ourselves in the position of an intruder in other areas, telling people that a "learning approach" works better than the tools they have been building for the last couple of years. Those of you who on projects with a strong application focus like bioinformatics, computational chemistry, or computer security certainly know what I am talking about... .

So on the one hand, machine learning deals with problems which are hard to formalize, and concrete data sets become a substitute for formal problem definitions. On the other hand, machine learning can and is often studied in quite an abstract fashion, for example, by developing general purpose learning methods.

But is it really possible to study something abstractly where we often need concrete data sets to represent certain questions and problems? Of course, elaborate theoretical frameworks for modeling learning exist, for example, for supervised learning. But these formalizations are ultimately too abstract to capture what really characterizes the problems we wish to solve in the real world.

I personally think the answer is no, meaning that it's not possible to do machine learning abstractly, because we need concrete applications and data sets to provide real challenges to our methods.

What this also means is that you need to compete not only with other machine learning people, but also with people who approach the problems in the classical way, by trying to write programs which solve the problem. While it is always nice (and also very impressive) if you can solve a hard task with a generic machine learning method, you need to compete with "classical" approaches as well to make sure that the "learning approach" really has an advantage.