Monday, April 06, 2009

Some Benchmark Numbers for jblas

The last few days I've been putting together a benchmarking tool for jblas which I'll hope to release to the public as soon as I've figured out how to best package that beast.

It compares my own matrix library, jblas against the Matrix Toolbox for Java, Colt, as well as simple implementations in C and Java, and, of course, ATLAS.

All experiments were run on my Laptop which sports an Intel Core 2 CPU (T7200) with 2GHz. Maybe the most remarkable feature of this CPU (and many other Intel CPUs as well) is the rather large 2nd level cache of 4MB which will come in handy as we'll see below.

Some more information on the plots: C+ATLAS amounts to taking C for vector addition and ATLAS for matrix vector and matrix matrix multiplication. Shown is the number of theoretical floating point operations. For example, adding two vectors with n elements requires n additions, multiplying a matrix with n squared elements requires 2n^2 additions and multiplications. Now I know that there exist matrix multiplication algorithm which require less than cubic number of operations, but I've just sticked to the "naive" number of operations as that is what ATLAS implements, as far as I know.

Vector Addition



In vector addition the task is simply to add all elements of one vector to all elements of the other vector (in-place). What is interesting about vector addition is that there is really little you can do to optimize the flow of information: You just have to go through the whole vector once. Basically, all methods are on par, with the exception of Colt and ATLAS (!!). No idea what they did wrong, though.

You can also very nicely see the L1 and L2 cache edges resulting in the two steps. On CPUs with smaller L2 cache (like, for example, most AMD CPUs), the shoulder is much less pronounced.

Matrix-Vector Multiplication



Matrix-vector multiplication can be thought of as adding up the columns of a matrix given the weights of the vector. Similar to vector addition, you basically have to go through the whole data once, there is little you can do about it. However, if you are clever, you can make at least sure that the input and output vectors stay in cache which leads to a better throughput.

Here, ATLAS is faster than the naive implementations by roughly 50%. jblas also uses a naive implementation. The reason is that Java needs to copy an array when you pass it to native code, and since you basically have to go through the matrix once, you loose more time copying the matrix than simply doing the computation in Java itself.

Matrix-Matrix Multiplication



In matrix-matrix multiplication the ratio of memory movements to operations is so good that you can practically reach the theoretical maximum throughput on modern memory architectures - if you arrange your operations accordingly. ATLAS really outshines the other implementations here, getting to almost two (double precision!) floating point operations per clock cycle.

Luckily, copying the n squared many floats can be asymptotically neglected compared to the n cube many operations meaning that it pays of to use ATLAS from Java. The performance of jblas is very close to ATLAS, and becomes even closer for large matrices.

Summary

On the buttom line, I'm glad that jblas is pretty fast, certainly faster than Colt and on par with MTJ except for matrix-matrix multiplication. I should add that I used the Java-only versions of Colt and MTJ. It should be possible to optionally link against the ATLAS versions, although Colt does not integrate these methods as well as jblas, and I'm not sure whether MTJ does that.

Also, in my opinion jblas is easier to use because it only uses one matrix type, not distinct types for vectors and matrices. I've tried that at first, too, but then often ended up in situations where a computation results in a row or column vector which I then had to artificially cast to a vector so that I could go on.

Of course, MTJ and Colt are currently more feature rich, supporting, for example, sparse matrices or banded storage schemes, but if you want to have something simple which performs well right now, why not take jblas ;)

14 comments:

Anonymous said...

hi, does jblas computes eigen values decomposition?

juanmf said...

Hi, this link is broken:
http://jblas.org/
I wanted to contact you because I cant get jblas to work. it gives me a shared library error libfrotran.so.3 not found. Could you give me some help here? where can i get libfortran?
and, does this dependency limits the portability of mi program?
If there is a forum for this I would appreciate if you lead me there...

Mikio Braun said...

Hello!

I've moved jblas to github: http://mikiobraun.github.com/jblas

Yes, jblas also does eigendecompositions, look at org.jblas.Eigen. Many stuff is missing, but it is easy to add as long as it is in jblas.

On github you'll also find statically linked multiplatform jars. They should not depend on other software and be deployable as is.

I don't have a forum yet, but I'll take care of that soon. Just watch the github site for announcements.

-Mikio

Mikio Braun said...

Err, I meant it is easy to add as long as it is in LAPACK.

juanmf said...

Hi again, I had already downloaded the jblas-0.2-multiplatform.jar library and now i get the same error with jblas-0.2-Linux-i386-static.jar.
error is:
BLAS native library not found in path. Copying native library
from the archive. Consider installing the library somewhere
in the path (for Windows: PATH, for Linux: LD_LIBRARY_PATH).
Trying to copy from /lib/Linux/i386/libjblas.so.
tempfile.getPath() = /tmp/jblas5007379128373119817libjblas.so
Copying took 0.056 seconds.
Couldn't load copied link file: java.lang.UnsatisfiedLinkError: /tmp/jblas5007379128373119817libjblas.so: libgfortran.so.3: no se puede abrír el archivo de objeto compartido: No existe el fichero ó directorio.

Exception in thread "AWT-EventQueue-0" java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgeev(CCI[DII[DI[DI[DII[DII[DII)I
at org.jblas.NativeBlas.dgeev(Native Method)

juanmf said...

I solved it by installing libgfortran here I leave my resoults
Fin Jampack, tardo: 20435 milliseconds
78.24751578734987 # 78.39003669397961 # 78.28833075820475 # 78.4040511260857 # 78.21750110839729 # 78.39003669397961 # 78.24751578734987 # 78.39003669397961 # 78.24751578734987 # 78.37485184802782 # 78.24751578734987 # 78.22753453655322 # 78.22753453655322 # 78.34575760409119 # 78.2605940285022 # 78.37485184802782 # 78.4040511260857 # 78.37485184802782 # 78.31685767121941 # 78.34575760409119 #
Probando MTJ
Fin MTJ tardo: 7160 milliseconds
78.680851506945 # 78.66935555547676 # 78.65966943674954 # 78.67026017708233 # 78.66316486020321 # 78.65018618195344 # 78.63939849844684 # 78.62878292813639 # 78.61755999202018 # 78.60745904552391 # 78.60484334804448 # 78.59494131065944 # 78.58572313235467 # 78.56682402100608 # 78.55835703174338 # 78.48834963390274 # 78.5191143859455 # 78.45593785730816 # 78.63079835313931 # 78.34575760409112 #
Probando Jblas
Fin Jblas tardo: 2593 milliseconds
78.24751578734983 # 78.39003669397958 # 78.28833075820484 # 78.40405112608579 # 78.21750110839719 # 78.39003669397958 # 78.24751578734983 # 78.39003669397958 # 78.24751578734983 # 78.37485184802806 # 78.24751578734983 # 78.22753453655302 # 78.22753453655302 # 78.34575760409112 # 78.26059402850215 # 78.37485184802806 # 78.40405112608579 # 78.37485184802806 # 78.31685767121942 # 78.34575760409112 #
this are the results of calculating the max eigen values of 3000 matrices o order n.
for each, the first 20 max eigen values are print.
When the order of the matrix is under 10 Jampack is faster than MTJ, but jblas ROCKS!
just need help solving the libgfortran.so.3 dependency problem because I need to deliver this system to someone who surely doesnt have libgfortran installed

Mikio Braun said...

Cool to hear that jblas performs well!

I thought I had take care of the fortran libs. I currently cannot access my computer, but I'll look into it next week.

-M

juanmf said...

Sorry, I see I wrote of order n when i ment of order 30. the test was made with 3000 matrices of order 30.

Mikio Braun said...

Hi!

Hopefully fixed the dependency on gfortran. I put new static and multiplatform jars in the download area.

I also set up a google-group for jblas.

juanmf, let me know whether it works now for you!

-M

juanmf said...

ok, I`ll let you know.

juanmf said...

Second time I try to post, the last one didn`t work.
I did't uninstall glibFortran from linux, but the binaries you uploaded works fine in windows, without further installations.
Thanks, great work!

juanmf said...

hi, third try to post here :'( the binaries works fine in windows, did't uninstall glibfortran in linux.
thanks, Great work!

juanmf said...

I forgot, here is a link to my project: AHP (Analytic Hierarchy Process), Complete consistency matrix with optimal values. It`s an implementation of a Genetic algorithm. Your libs makes it perform much better.

Mikio Braun said...

Hey juanmf,

sorry, I only saw your last comments right now... . Guess I should tweak the commenting options of Blogger better... .

Anyway, I'm glad that your program could benefit from jblas!

-M