
Tuesday, December 22, 2009

jblas release 1.0

Update: I released jblas 1.0.1, which also works on processors that only support SSE2 (e.g. older Athlon64 processors). Go to jblas.org and check it out!

I’ve just released jblas 1.0. For those of you who don’t know it yet: jblas is a matrix library for Java based on the BLAS and LAPACK routines, currently using ATLAS and lapack-lite, for maximum performance.

I’m using jblas myself for most of my research, together with a JRuby wrapper which I plan to release pretty soon as well.

In any case, here are the most important new features of release 1.0:

  • The default jar file now contains all the libraries for Windows, Linux, and Mac OS X, both 32 and 64 bit (although the 64 bit version for Windows is still missing; I’ll take care of that after the holidays). The libraries are built for the Core2 platform, which in my experience gives the best combined performance on Intel and AMD processors (the Hammer or AMDk10h settings give slightly better performance on AMD, but inferior performance on Intel).

  • New LAPACK functions GEEV (eigenvalues of general matrices), GETRF (LU factorization), and POTRF (Cholesky factorization) in NativeBlas.

  • Decompose class providing high-level access to LU and Cholesky factorization.

  • Improved support for the standard Java framework: matrix classes are serializable, and they provide read-only AbstractList views of their elements, rows, and columns.

  • Permutation class for random permutations and subsets.

The new jar file also comes with a little command-line tool for checking the installation and running a small benchmark. Running java -server -jar jblas-1.0.jar on my machine (Linux, 32 bit, Core 2 Duo @ 2 GHz) gives:


Simple benchmark for jblas

Running sanity benchmarks.

checking vector addition... ok
-- org.jblas CONFIG BLAS native library not found in path. Copying native library from the archive. Consider installing the library somewhere in the path (for Windows: PATH, for Linux: LD_LIBRARY_PATH).
-- org.jblas CONFIG Loading libjblas.so from /lib/static/Linux/i386/libjblas.so.
checking matrix multiplication... ok
checking existence of dsyev...... ok
checking XERBLA... ok
Sanity checks passed.

Each benchmark will take about 5 seconds...

Running benchmark "Java matrix multiplication, double precision".
n = 10 : 424.4 MFLOPS (1061118 iterations in 5.0 seconds)
n = 100 : 1272.6 MFLOPS (3182 iterations in 5.0 seconds)
n = 1000 : 928.5 MFLOPS (3 iterations in 6.5 seconds)

Running benchmark "Java matrix multiplication, single precision".
n = 10 : 445.0 MFLOPS (1112397 iterations in 5.0 seconds)
n = 100 : 1273.0 MFLOPS (3183 iterations in 5.0 seconds)
n = 1000 : 1330.9 MFLOPS (4 iterations in 6.0 seconds)

Running benchmark "ATLAS matrix multiplication, double precision".
n = 10 : 428.2 MFLOPS (1070428 iterations in 5.0 seconds)
n = 100 : 3293.9 MFLOPS (8235 iterations in 5.0 seconds)
n = 1000 : 5383.2 MFLOPS (14 iterations in 5.2 seconds)

Running benchmark "ATLAS matrix multiplication, single precision".
n = 10 : 465.2 MFLOPS (1162905 iterations in 5.0 seconds)
n = 100 : 5997.3 MFLOPS (14994 iterations in 5.0 seconds)
n = 1000 : 9186.6 MFLOPS (23 iterations in 5.0 seconds)
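The MFLOPS figures above are derived from the standard 2n³ operation count for multiplying two n×n matrices. A quick sketch of the arithmetic in plain Java (class and method names are mine, and no jblas is needed):

```java
public class MflopsCalc {
    // Theoretical flop count for multiplying two n x n matrices:
    // n^3 multiplications plus n^3 additions, i.e. 2 * n^3 operations.
    static double mflops(int n, long iterations, double seconds) {
        double flopsPerMultiply = 2.0 * n * n * n;
        return flopsPerMultiply * iterations / seconds / 1e6;
    }

    public static void main(String[] args) {
        // Checks the first line of the double-precision Java benchmark:
        // n = 10, 1061118 iterations in 5.0 seconds -> ~424.4 MFLOPS
        System.out.printf("%.1f MFLOPS%n", mflops(10, 1061118, 5.0));
    }
}
```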


Monday, April 06, 2009

Some Benchmark Numbers for jblas

The last few days I've been putting together a benchmarking tool for jblas, which I hope to release to the public as soon as I've figured out how best to package that beast.

It compares my own matrix library, jblas, against Matrix Toolkits Java (MTJ), Colt, simple implementations in C and Java, and, of course, ATLAS.

All experiments were run on my laptop, which sports a 2 GHz Intel Core 2 CPU (T7200). Maybe the most remarkable feature of this CPU (and many other Intel CPUs as well) is its rather large second-level cache of 4 MB, which will come in handy as we'll see below.

Some more information on the plots: C+ATLAS amounts to taking C for vector addition and ATLAS for matrix-vector and matrix-matrix multiplication. Shown is the number of theoretical floating point operations per second. For example, adding two vectors with n elements requires n additions, and multiplying a matrix with n² elements by a vector requires 2n² additions and multiplications. Now I know that there exist matrix multiplication algorithms which require less than a cubic number of operations, but I've stuck to the "naive" operation counts, as that is what ATLAS implements, as far as I know.

Vector Addition



In vector addition, the task is simply to add all elements of one vector to the corresponding elements of the other vector (in-place). What is interesting about vector addition is that there is really little you can do to optimize the flow of data: you just have to go through both vectors once. Basically, all methods are on par, with the exception of Colt and ATLAS (!!). No idea what they did wrong, though.
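The operation being measured is nothing more than the following loop (a minimal plain-Java sketch; the class and method names are mine):

```java
public class VectorAdd {
    // In-place vector addition: x[i] += y[i]. Each element is touched
    // exactly once, so the loop is bound by memory bandwidth rather than
    // arithmetic -- which is why all reasonable implementations perform
    // about the same, and why the cache-size steps show up in the plot.
    static void addInPlace(double[] x, double[] y) {
        for (int i = 0; i < x.length; i++) {
            x[i] += y[i];
        }
    }
}
```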

You can also very nicely see the L1 and L2 cache edges, which result in the two steps. On CPUs with a smaller L2 cache (like, for example, most AMD CPUs), the shoulder is much less pronounced.

Matrix-Vector Multiplication



Matrix-vector multiplication can be thought of as adding up the columns of a matrix, weighted by the entries of the vector. Similar to vector addition, you basically have to go through the whole data once; there is little you can do about that. However, if you are clever, you can at least make sure that the input and output vectors stay in cache, which leads to better throughput.
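In column-major storage (which jblas uses, matching the Fortran BLAS convention), the "weighted column sum" view leads directly to a cache-friendly loop. A naive sketch (names are mine, not jblas API):

```java
public class MatVec {
    // Computes y = A * x, with the m x n matrix A stored column-major in a
    // flat array of length m*n. Traversing A one column at a time walks
    // memory sequentially, and the output vector y stays in cache across
    // all n columns.
    static double[] multiply(double[] a, int m, int n, double[] x) {
        double[] y = new double[m];
        for (int j = 0; j < n; j++) {
            double w = x[j];              // weight for column j
            for (int i = 0; i < m; i++) {
                y[i] += w * a[j * m + i]; // add weighted column j to y
            }
        }
        return y;
    }
}
```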

Here, ATLAS is faster than the naive implementations by roughly 50%. jblas also uses a naive implementation here. The reason is that Java needs to copy an array when you pass it to native code, and since you basically go through the matrix only once, you lose more time copying the matrix than you gain over simply doing the computation in Java itself.

Matrix-Matrix Multiplication



In matrix-matrix multiplication, the ratio of memory movement to arithmetic is so good that you can practically reach the theoretical maximum throughput on modern memory architectures - if you arrange your operations accordingly. ATLAS really outshines the other implementations here, reaching almost two (double precision!) floating point operations per clock cycle.
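The "arrange your operations accordingly" part is what ATLAS automates. The basic idea can be illustrated with a simple blocked multiplication (a sketch only, nowhere near ATLAS's hand-tuned kernels; names and block size are mine):

```java
public class BlockedMatMul {
    // Block size: chosen so a few BS x BS tiles fit comfortably in cache.
    static final int BS = 64;

    // C += A * B for n x n matrices in column-major order. Processing the
    // matrices in tiles means each tile of A, B, and C is reused many
    // times before being evicted from cache, which is what lets matmul
    // approach the CPU's peak arithmetic throughput.
    static void multiply(double[] a, double[] b, double[] c, int n) {
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int j = jj; j < Math.min(jj + BS, n); j++)
                    for (int k = kk; k < Math.min(kk + BS, n); k++) {
                        double w = b[j * n + k];        // B(k, j)
                        for (int i = 0; i < n; i++)
                            c[j * n + i] += w * a[k * n + i]; // C(i,j) += A(i,k) * B(k,j)
                    }
    }
}
```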

Luckily, copying the n² floats can be asymptotically neglected compared to the n³ operations, which means it pays off to call ATLAS from Java. The performance of jblas is very close to ATLAS, and gets even closer for large matrices.

Summary

The bottom line is that I'm glad jblas is pretty fast: certainly faster than Colt, and on par with MTJ except for matrix-matrix multiplication. I should add that I used the Java-only versions of Colt and MTJ. It should be possible to optionally link them against ATLAS, although Colt does not integrate those methods as well as jblas does, and I'm not sure whether MTJ supports that at all.

Also, in my opinion jblas is easier to use because it has only one matrix type, not distinct types for vectors and matrices. I tried separate types at first, too, but often ended up in situations where a computation produced a row or column vector that I then had to artificially cast to a vector type before I could go on.

Of course, MTJ and Colt are currently more feature-rich, supporting, for example, sparse matrices or banded storage schemes, but if you want something simple that performs well right now, why not take jblas ;)

Thursday, October 23, 2008

Matrices, JNI, DirectBuffers, and Number Crunching in Java

One thing Java lacks is a fast matrix library. There are things like JAMA and Colt (when not using its LAPACK bindings), but compared to state-of-the-art implementations like ATLAS, they are orders of magnitude slower.

Therefore I set out some time ago to write my own matrix classes based on ATLAS. It turns out that while this is quite straightforward most of the time, there are a few issues which become complicated very quickly, to the point where I wonder whether Java is really the right platform for this kind of application at all.

Basically, there are two areas where the Java platform can give you quite a headache:

  • Passing large amounts of data to native code.
  • Lack of control over how well the JIT compiles your code.

Passing Data to Native Code

For a number of reasons, producing highly tuned linear algebra code still requires a lot of low-level fiddling with machine code. If you want really fast code, you have to rely on libraries like ATLAS. The question is then how you pass the data from Java to ATLAS.

Maybe the easiest choice is to store the data in primitive arrays like double[] and access them via JNI. However, the documentation is quite explicit about the fact that this will not work well for large arrays: you will typically get a copy of the array, and afterwards the result has to be copied back.

The solution proposed by Sun is to use direct buffers, which let you allocate a piece of memory outside of the VM that can then be passed directly to native code.
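For reference, allocating such a buffer only takes standard java.nio (the helper class and method names here are mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class DirectBufferDemo {
    // Allocates n doubles outside the Java heap. Native code handed this
    // buffer (via JNI's GetDirectBufferAddress) sees the very same memory,
    // so in principle no copying happens on the way in or out.
    static DoubleBuffer allocateDirectDoubles(int n) {
        return ByteBuffer.allocateDirect(n * Double.BYTES)
                .order(ByteOrder.nativeOrder()) // match the platform's endianness
                .asDoubleBuffer();
    }

    public static void main(String[] args) {
        DoubleBuffer buf = allocateDirectDoubles(1000);
        buf.put(0, 3.14);
        System.out.println(buf.get(0));
    }
}
```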

This was also the solution adopted by our jblas library. However, it quickly turned out that the memory management for direct buffers is anything but perfect. As others have already pointed out, the main problem is that Java does not take the memory held by direct buffers into account when computing its memory footprint. This means that if you allocate a large number of direct buffers, the JVM will happily fill all of your memory, and potentially the swap space, without triggering a single garbage collection.

After this happened to me a few times, I began wondering whether direct buffers are garbage collected at all, but it seems that everything is implemented reasonably well: the memory allocated with direct buffers is tracked via PhantomReferences so that the VM is notified when the corresponding object is garbage collected, and if you call System.gc() by hand often enough, your direct buffers eventually do get collected. Most of the time, at least.

The problem is that there is hardly a way around this. Maybe it's my fault, because direct buffers were never intended to be allocated in large numbers. I guess the typical use is to allocate one direct buffer of some size and then use that single fixed buffer to transfer data between native code and the JVM.

The only problem is that this is basically the copying scenario which direct buffers were supposed to help avoid…

I've written a little program which compares the time it takes to copy an array using the JNI function GetDoubleArrayElements() and using the bulk put() and get() functions of DoubleBuffer. The results look like this (on a 2 GHz Intel Core 2):



What is a bit funny (and brings me directly to the second point) is that GetDoubleArrayElements() takes longer than DoubleBuffer.put() and get() beyond a size of about 100 KB. Both routines basically have the same task: copying data from a Java primitive array into a piece of raw memory. But for some reason, the DoubleBuffer variant seems able to sustain its throughput for much larger amounts of memory.
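The GetDoubleArrayElements() side of that comparison requires native code, but the DoubleBuffer side can be sketched in pure Java (a sketch of the kind of measurement, not my exact program; names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class CopyTiming {
    // Times the bulk put()/get() round trip that moves a double[] into a
    // direct buffer and back -- the same copy a JNI call would otherwise
    // perform via GetDoubleArrayElements / ReleaseDoubleArrayElements.
    static double roundTripSeconds(int n, int reps) {
        double[] src = new double[n];
        DoubleBuffer buf = ByteBuffer.allocateDirect(n * Double.BYTES)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            buf.clear();
            buf.put(src);   // Java array -> native memory
            buf.flip();
            buf.get(src);   // native memory -> Java array
        }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("100000 doubles, 100 round trips: %.3f s%n",
                roundTripSeconds(100_000, 100));
    }
}
```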

Code Optimization

The second point basically means that it is more difficult than in a traditionally compiled language to control how fast your code runs after it has been JIT-compiled. In principle, there is no reason why this should be so, but I've observed a few funny effects which have caused me to doubt that the HotSpot technology is optimized for number crunching.

For example, as I've discussed elsewhere, copying data between arrays, or accessing direct buffers, leads to drastically different performance on the same virtual machine depending on the byte-code compiler. For some reason, the Eclipse compiler seems to produce byte code which leads to much faster machine code on Sun's HotSpot JVM than Sun's own compiler does.

What this means is that it is very hard to optimize code in Java, because you cannot be sure whether the JIT compiler of a given virtual machine will turn it into efficient machine code. With compiled languages the situation is in principle similar, but at least you can control which compiler you use and optimize against that.

Summary

So I'm not quite sure what all this means. Maybe some JVM guru can enlighten me about what is happening here. I don't even see anything fundamentally wrong with the way direct buffers are implemented, but the bottom line is that memory management for direct buffers is not very robust, meaning that your program will typically consume much more memory than it should.