Thursday, October 23, 2008

Matrices, JNI, DirectBuffers, and Number Crunching in Java

One thing Java lacks is a fast matrix library. There are libraries like JAMA and COLT (not using its LAPACK bindings), but compared to state-of-the-art implementations like ATLAS, they are orders of magnitude slower.

Some time ago I therefore set out to write my own matrix classes based on ATLAS. It turns out that while this is quite straightforward most of the time, there are a few issues which become complicated very quickly, to the point where I wonder whether Java is really the right platform for this kind of application at all.

Basically, there are two areas where the Java platform can give you quite a headache:

  • Passing large amounts of data to native code.
  • Lack of control over how well the JIT compiles your code.

Passing Data to Native Code

For a number of reasons, producing highly tuned linear algebra code still requires a lot of low-level fiddling with machine code. If you want really fast code, you have to rely on libraries like ATLAS. The question is then how you pass the data from Java to ATLAS.

Maybe the easiest choice is to store the data in primitive arrays like double[] and then access them via JNI. However, the documentation is quite explicit about the fact that this does not work well for large arrays: you typically get a copy of the array, and afterwards the result has to be copied back.

The solution proposed by Sun is to use direct buffers, which allow you to allocate a piece of memory outside of the VM heap that can then be passed directly to native code.
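To make this concrete, here is a minimal sketch (not from the original post) of how such a buffer is set up on the Java side; the daxpy declaration in the comment is a hypothetical native method, and the C side would obtain the raw pointer with JNI's GetDirectBufferAddress():

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.DoubleBuffer;

    public class DirectBufferDemo {
        public static void main(String[] args) {
            int n = 1000 * 1000;

            // Allocate n doubles outside the Java heap; native byte order is
            // needed if native code is going to interpret the raw memory.
            DoubleBuffer data = ByteBuffer.allocateDirect(n * 8)
                                          .order(ByteOrder.nativeOrder())
                                          .asDoubleBuffer();

            data.put(0, 3.14);
            System.out.println(data.get(0));

            // A native routine would typically be declared like this
            // (hypothetical) and would fetch the raw pointer on the C side
            // via GetDirectBufferAddress():
            // public static native void daxpy(int n, double alpha,
            //                                 DoubleBuffer x, DoubleBuffer y);
        }
    }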

This was also the solution adopted by our jblas library. However, it quickly turned out that the memory management for DirectBuffers is anything but perfect. As has already been pointed out by others, the main problem is that Java does not take the memory held by DirectBuffers into account when computing its memory footprint. This means that if you allocate a large number of DirectBuffers, the JVM will happily fill all of your memory, and potentially your swap space, without triggering a single garbage collection.
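A rough way to see the effect, assuming default settings and an older JVM (the exact behaviour depends on the JVM version and on -XX:MaxDirectMemorySize), is a loop like the following: each buffer becomes unreachable right away, but since it occupies almost nothing on the Java heap, heap pressure never forces a collection.

    import java.nio.ByteBuffer;

    public class DirectBufferPressure {
        public static void main(String[] args) {
            final int chunk = 10 * 1024 * 1024;        // 10 MB per buffer
            for (int i = 1; i <= 1000; i++) {          // up to ~10 GB of native memory
                // Immediately garbage, but invisible to the heap accounting.
                ByteBuffer.allocateDirect(chunk);
                long heapUsed = Runtime.getRuntime().totalMemory()
                              - Runtime.getRuntime().freeMemory();
                System.out.println(i * 10 + " MB of direct memory requested, "
                                 + "heap in use: " + heapUsed / 1024 + " KB");
            }
        }
    }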

After this happened to me a few times I began to wonder whether DirectBuffers are garbage collected at all, but it seems that everything is implemented reasonably well: the memory allocated by direct buffers is tracked through PhantomReferences so that the VM is notified when the corresponding object is garbage collected, and if you call System.gc() by hand often enough, your DirectBuffers eventually do get collected. Most of the time, at least.

The problem is that there is hardly a way around this. Maybe it's my fault, because direct buffers were never intended to be allocated in large numbers. I guess the typical use case is to allocate one direct buffer of a certain size and then use that single buffer to transfer data between the native code and the JVM.

The only problem is that this is basically the copying scenario which direct buffers were supposed to help avoid...

I've written a little program which compares the time it takes to copy an array using the GetDoubleArrayElements() function and using the bulk put() and get() methods of DoubleBuffer. The results look like this (on a 2 GHz Intel Core2):

[Figure: copy times for GetDoubleArrayElements() versus DoubleBuffer bulk put()/get() as a function of array size.]
What is a bit funny (and brings me directly to the second point) is that GetDoubleArrayElements() starts to take longer than DoubleBuffer.put() and get() beyond a size of about 100 KB. Both routines basically have to perform the same task, namely copying data between a Java primitive array and a piece of raw memory. But for some reason, the DoubleBuffer variant manages to sustain its throughput for much larger amounts of memory.
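For reference, the Java-visible half of such a comparison can be sketched like this (the GetDoubleArrayElements() half lives in native code and is not reproduced here; this is an illustration, not the exact benchmark):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.DoubleBuffer;

    public class CopyBenchmark {
        public static void main(String[] args) {
            int n = 1 << 20;                       // number of doubles, adjust as needed
            double[] a = new double[n];
            DoubleBuffer buf = ByteBuffer.allocateDirect(n * 8)
                                         .order(ByteOrder.nativeOrder())
                                         .asDoubleBuffer();

            long best = Long.MAX_VALUE;
            for (int rep = 0; rep < 20; rep++) {   // repeat and keep the best run
                long t0 = System.nanoTime();
                buf.clear();
                buf.put(a);                        // Java array -> native memory
                buf.flip();
                buf.get(a);                        // native memory -> Java array
                best = Math.min(best, System.nanoTime() - t0);
            }
            System.out.printf("best round trip for %d doubles: %.3f ms%n",
                              n, best / 1e6);
        }
    }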

Code Optimization

The second point is that it is more difficult than in a traditionally compiled language to control how fast your code ends up being once the JIT has compiled it. In principle, there is no reason why this should be so, but I've observed a few odd effects which make me doubt that the HotSpot technology is optimized for number crunching.

For example, as I've discussed elsewhere, copying data between arrays, or accessing DirectBuffers, leads to drastically different performance on the same virtual machine depending on the bytecode compiler. For some reason, the Eclipse compiler seems to produce bytecode which the HotSpot JIT turns into much faster machine code than the bytecode produced by Sun's own compiler, on Sun's own JVM.
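The kind of inner loop in question is nothing exotic; a plain element-wise copy like the one below, compiled once with javac and once with the Eclipse compiler and then run on the same JVM, is enough to see such differences (this is an illustration, not the exact benchmark from the post mentioned above):

    public class CopyLoop {
        static void copy(double[] src, double[] dst) {
            for (int i = 0; i < src.length; i++) {
                dst[i] = src[i];
            }
        }

        public static void main(String[] args) {
            double[] a = new double[1 << 20];
            double[] b = new double[1 << 20];
            // Warm up so that HotSpot actually compiles the loop.
            for (int rep = 0; rep < 100; rep++) copy(a, b);

            long t0 = System.nanoTime();
            for (int rep = 0; rep < 100; rep++) copy(a, b);
            System.out.printf("%.3f ms per copy%n",
                              (System.nanoTime() - t0) / 100 / 1e6);
        }
    }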

What this means is that it is very hard to optimize code in Java, because you cannot be sure whether the JIT compiler of the virtual machine will turn it into efficient machine code. With compiled languages the situation is in principle similar, but at least you can control which compiler you use and optimize against that.

Summary

So I'm not quite sure what all this means. Maybe some JVM guru can enlighten me on what is happening here. I don't even see anything fundamentally wrong with the way direct buffers are implemented, but the bottom line is that memory management for direct buffers is not very robust, meaning that your program will typically consume much more memory than it should.

7 comments:

Anonymous said...

Nothing should prevent you from doing your own memory management on buffers. Instead of having the JVM allocate the direct buffer, with all the uncertainty of its lifespan, allocate it yourself from native code, and then use the JNI function NewDirectByteBuffer() to wrap the memory in a ByteBuffer for use on the Java side. When you want to free the backing buffer, use JNI and the native free (or whatever memory management you're using).
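Roughly, the Java side of what I have in mind could look like the sketch below; the native method names and the library name are made up, and the C implementations would just malloc(), wrap the pointer with NewDirectByteBuffer(), and later free() the address obtained via GetDirectBufferAddress():

    import java.nio.ByteBuffer;

    // The two native methods are hypothetical; their C implementations
    // would roughly be:
    //
    //   jobject Java_NativeMemory_allocate(JNIEnv *env, jclass cls, jlong bytes) {
    //       void *p = malloc(bytes);
    //       return (*env)->NewDirectByteBuffer(env, p, bytes);
    //   }
    //
    //   void Java_NativeMemory_release(JNIEnv *env, jclass cls, jobject buf) {
    //       free((*env)->GetDirectBufferAddress(env, buf));
    //   }
    public class NativeMemory {
        static { System.loadLibrary("nativememory"); }   // hypothetical library name

        public static native ByteBuffer allocate(long bytes);
        public static native void release(ByteBuffer buffer);

        public static void main(String[] args) {
            ByteBuffer buf = allocate(1024 * 1024);
            try {
                buf.putDouble(0, 42.0);        // use it like any other direct buffer
            } finally {
                release(buf);                  // deterministic release, no GC involved
                buf = null;                    // drop the Java reference as well
            }
        }
    }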

Mikio Braun said...

Hi Jesse,

thanks for pointing me to the NewDirectByteBuffer() function. But I think that the basic problem remains that the JVM is oblivious to the memory allocated "off the JVM heap" and it won't be as aggressive about reclaiming those buffers if they aren't accessible anymore.

By the way, I just couldn't stand the instability anymore (having your machine freeze a few times a day on larger computations due to memory leaks is not nice) and have recently converted all my matrix code to using normal double arrays. It actually only took a few hours, and now even huge computations run very smoothly.

I had to make sure that I now do some things in Java, in particular when the complexity is linear in the data. For example, calling out to dcopy just to copy data doesn't make any sense. On the other hand, for the really computationally expensive operations like matrix-matrix operations, the copying costs are negligible. And then there is the stuff like computing eigenvalues or solving linear equations which I don't even want to know too much about.
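To give a (made-up) example of what I mean by keeping the linear-time operations on the Java side instead of sending them through JNI:

    // For O(n) operations the JNI round trip dominates, so plain Java is used
    // instead of BLAS level-1 routines like dcopy. A copy, for instance, is just:
    public final class Level1 {
        public static double[] copy(double[] x) {
            double[] y = new double[x.length];
            System.arraycopy(x, 0, y, 0, x.length);   // in practice about as fast as a native memcpy
            return y;
        }

        public static void main(String[] args) {
            double[] x = {1.0, 2.0, 3.0};
            double[] y = copy(x);
            System.out.println(y[2]);   // 3.0
        }
    }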

So, stability-wise, it looks very nice now. I'll tweak the performance a bit more until I'm happy with it, and then I think I'll release my matrix library.

Anonymous said...

I am not sure if you have seen this, MTJ: http://ressim.berlios.de/. They have native and ATLAS based implementations of linear algebra stuff. Their benchmarks are interesting.

Mikio Braun said...

Hi Kiran,

thanks for pointing me to MTJ. I think I had already taken a look at it, but forgot about it in the meantime. I couldn't really look through all of their source code, but it seems that they have taken the route of primitive arrays as well.

Their performance numbers are in fact quite interesting and also underline the need for highly optimized native matrix routines. There are just so many things you have to get right, starting with memory locality down to how to order the floating point operations to fill your pipelines optimally. Something which is maybe hard to control in Java.

It seems that MTJ takes a similar approach in relying on BLAS and LAPACK, and it also looks pretty complete feature-wise. In our project jblas we have taken a slightly different approach with the interface. For example, there are only 2d matrices. It turns out that the separation between vectors and matrices which happen to have only one row or one column is a bit artificial and often requires you to recast vectors as matrices and vice versa. We have also used a lot of overloading to make using the classes as comfortable as possible.

Looks like it's time to release JBLAS soon!

Steve Lianoglou said...

Hi Mikio,

Thanks for the writeup ... I'm always curious about numerical stuff in Java, so I found this very interesting.

Two comments:

(1) I'm not quite sure what you mean by this from your post on Nov 6:

"""... have recently converted all my matrix code to using normals double arrays."""

Are you not calling out to JNI anymore? Do you just mean that your n-dim matrix is now stored as a 1d double array? (I think this is actually how COLT stores its n-dim matrices, also.)

(2) I think you might get some good feedback if you pop into the JVM Languages group[1] and ask them if they have any input into your situation.

While your problem isn't really related to new languages hosted on the JVM, the folks there are quite friendly and really know the JVM internals inside and out.

Anyway, I'd be interested in any follow up posts you have about related work (subscribing to your RSS feed now) as I'd like to actually use Scala for similar work down the road.

-steve

[1] JVM Languages Group: http://groups.google.com/group/jvm-languages

Mikio Braun said...

Hi Steve,

thanks for your comments.

Regarding your question: I'm still calling into JNI, I'm just storing the matrices as plain double arrays now. That works very nicely, although one now has to do some things in Java, especially those which take linear time, like copying or even vector addition. Luckily, Java appears to be about as quick as the native code there.
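Concretely, the storage now looks roughly like this (a simplified sketch, not the actual library code; column-major so that it matches what the BLAS routines expect):

    public class ColumnMajorMatrix {
        final int rows, cols;
        final double[] data;                    // length rows * cols, column-major

        ColumnMajorMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.data = new double[rows * cols];
        }

        double get(int i, int j)           { return data[i + rows * j]; }
        void   put(int i, int j, double v) { data[i + rows * j] = v; }

        public static void main(String[] args) {
            ColumnMajorMatrix m = new ColumnMajorMatrix(2, 3);
            m.put(1, 2, 5.0);
            System.out.println(m.get(1, 2));    // 5.0
            // m.data is what gets handed to the native side via JNI, e.g. with
            // GetDoubleArrayElements() or GetPrimitiveArrayCritical().
        }
    }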

Thanks also for the link to the JVM languages group. I think it would already help if the JVM took the memory allocated by direct buffers into account and triggered garbage collections accordingly.

Concerning your plans with Scala, I'm setting up a m@tlab lookalike in JRuby myself, and it works pretty well, in particular because you can add all kinds of syntactic sugar on the Ruby side without subclassing. No idea how that would work in Scala...

We're just now brushing up the documentation a bit, so look out for our first release of jblas.

Jesse Pavel (jpavel@alum.mit.edu) said...

"""But I think that the basic problem remains that the JVM is oblivious to the memory allocated "off the JVM heap" and it won't be as aggressive about reclaiming those buffers if they aren't accessible anymore."""

I suppose this reply is a bit after the fact now, but what I meant was to allocate your own block of memory using, say, the C library's malloc(), wrap that pointer using NewDirectByteBuffer(), and use this block as an NIO buffer for however long you need it. Then, when you're done, use the C library's free() to release the block of native memory and set your Java reference to null. Since you reclaim the memory manually, the timing of Java's GC isn't important anymore.