Wednesday, July 09, 2008

Data Mangling Basics

When you're new to machine learning, there usually comes a point where you first have to face real data. At that point you suddenly realize that all your training was, well, academic. Real data doesn't come in matrices, real data has missing values, and so on. Real data is usually just not fit to be directly digested by your favorite machine learning method (and if it is, consider yourself very lucky).

So you spend some time massaging the data formats until you have something which can be used in, say, Matlab, but the results you get just aren't that good. If you're lucky, there is some senior guy in your lab you can ask for help, and he will usually apply a few preprocessing steps to the data that you've never heard of in class.

Actually, I think this should be taught in class, even if there is no systematic methodology behind it and no fancy theorems to prove. So here is my (by no means exhaustive) set of useful preprocessing steps.

  • Take subsets. You might find this pretty obvious, but I've seen students debugging their methods directly on the full 10,000-instance data set often enough to doubt that. So when you have a new data set and you want to quickly try out several methods, take a random subset of the data until your method handles it in seconds, not minutes or hours (see the first sketch after this list). The complexity of most data sets is also such that you usually get somewhat close to the achievable accuracy with a few hundred examples.

  • Plot histograms, look at scatter plots. Even when you have high-dimensional data, it can be very informative to look at histograms of individual coordinates, or at scatter plots of pairs of coordinates (see the plotting sketch below). This already tells you a lot about the data: What is its range? Is it discrete or continuous? Which directions are highly correlated, and which are not? And so on. Again, this might seem pretty obvious, but often students just run the specified method without looking at the data first.

  • Center and normalize. While most method papers make you think that the method "just works", in reality you often have to do some preprocessing to make the methods work well. One such step is to center your data and normalize it to unit variance in each direction. Don't forget to save the offset and the normalization factors: you will need them to process the features in exactly the same way at prediction time (see the sketch below)!

  • Take Logarithms. Sometimes you have to take the logarithm of your data, in particular when its range is extremely large and the density of your data points decreases as the values become larger (see the sketch below). Interestingly, many of our own senses work on a logarithmic scale, for example hearing and vision: loudness is measured in decibels, which is a logarithmic scale.

  • Remove Irrelevant Features. Some kernels are particularly sensitive to irrelevant features. With a Gaussian kernel, for example, each feature which is irrelevant for prediction increases the number of data points you need to predict well, because for practically every value of the irrelevant feature you need additional data points in order to learn well (a toy demonstration follows below).

  • Plot Principal Values. Often, many of your features are highly correlated. For example, the height and weight of a person are usually quite correlated. A large number of correlated features means that the effective dimensionality of your data set is much smaller than the number of features. If you plot the principal values (the eigenvalues of C'*C for the centered data matrix C, which is proportional to the sample covariance matrix), you will usually notice that there are only a few large ones, meaning that a low-dimensional subspace already contains most of the variance of your data (see the sketch below). Also note that the projection onto those directions is the best low-dimensional approximation with respect to the squared error, so using this information you can transform your problem into one with fewer features.

  • Plot Scalar Products of the output variable with Eigenvectors of the Kernel Matrix. While principal values only tell you something about the variance in the input features, in a supervised learning setting you can also plot the scalar products between the eigenvectors of the kernel matrix and the output variable. This shows how many (kernel) principal components you need to capture the information relevant for the learning problem, and the general shape of these coefficients can also tell you whether your kernel makes sense or not (see the sketch below). Watch out for an upcoming JMLR paper for more details.

  • Compare with k-nearest neighbors. Finally, before you apply your favorite method to the data set, try something really simple like k-nearest neighbors (see the last sketch below). This gives you a good idea of what kind of accuracy you can expect.
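
Here are a few minimal sketches of the steps above, in Python with NumPy, matplotlib, and scikit-learn; all array names (X, y, X_train, and so on) are hypothetical placeholders for your own data, not part of any particular library or data set. First, taking a random subset:

```python
import numpy as np

def random_subset(X, y, n, seed=0):
    """Draw a random subset of n examples for quick experiments."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
    return X[idx], y[idx]

# Work with a few hundred examples until your method runs in seconds:
# X_small, y_small = random_subset(X, y, 500)
```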
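Looking at a histogram of a single coordinate and a scatter plot of a pair of coordinates, assuming X is an (n, d) array:

```python
import matplotlib.pyplot as plt

def quick_look(X, i=0, j=1):
    """Histogram of coordinate i and a scatter plot of coordinates (i, j)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(X[:, i], bins=30)
    ax1.set_title(f"histogram of feature {i}")
    ax2.scatter(X[:, i], X[:, j], s=5)
    ax2.set_title(f"feature {i} vs. feature {j}")
    plt.tight_layout()
    plt.show()
```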
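Centering and normalizing to unit variance, keeping the offset and scale around for prediction time (X_train and X_test are hypothetical train/test splits):

```python
import numpy as np

# Estimate the offset and scale on the training data only ...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
sigma[sigma == 0] = 1.0            # guard against constant features

X_train_std = (X_train - mu) / sigma
# ... and reuse exactly the same offset and scale on new data.
X_test_std = (X_test - mu) / sigma
```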
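Taking logarithms when the range of the data is huge; log1p keeps zero values finite and is only sensible for non-negative data:

```python
import numpy as np

# log(1 + x), applied elementwise to non-negative data.
X_log = np.log1p(X)

# Compare the spread before and after:
# print(X.min(), X.max(), X_log.min(), X_log.max())
```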
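A toy demonstration of how irrelevant features hurt a Gaussian-kernel classifier; the data is synthetic and the exact numbers will vary, but the cross-validated accuracy typically degrades as noise dimensions are added:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))                   # two informative features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for k in (0, 10, 50):                         # number of irrelevant features
    X_aug = np.hstack([X, rng.normal(size=(n, k))])
    acc = cross_val_score(SVC(kernel="rbf"), X_aug, y, cv=5).mean()
    print(f"{k:2d} irrelevant features: accuracy {acc:.2f}")
```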
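Plotting the principal values, that is, the eigenvalues of C'*C for the centered data matrix C:

```python
import numpy as np
import matplotlib.pyplot as plt

C = X - X.mean(axis=0)                                  # centered data matrix
principal_values = np.linalg.eigvalsh(C.T @ C)[::-1]    # eigenvalues, descending
plt.plot(principal_values, "o-")
plt.xlabel("component")
plt.ylabel("principal value")
plt.show()
```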
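Plotting the scalar products of the output variable with the eigenvectors of the kernel matrix; a Gaussian kernel from scikit-learn is used here purely as an example, and the bandwidth is a hypothetical choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import rbf_kernel

K = rbf_kernel(X, gamma=1.0)            # kernel matrix on the inputs
eigvals, eigvecs = np.linalg.eigh(K)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort components by decreasing eigenvalue
coeffs = eigvecs[:, order].T @ y        # <v_i, y> for each eigenvector v_i (numeric y)

plt.bar(range(len(coeffs)), np.abs(coeffs))
plt.xlabel("kernel PCA component")
plt.ylabel("|scalar product with y|")
plt.show()
```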
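And finally, a k-nearest-neighbor baseline before trying anything fancier (again with hypothetical X and y):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in (1, 5, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}: cross-validated accuracy {acc:.2f}")
```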


So in summary, there are a lot of things to do with your data before you plug it into your learning method. If you do it correctly, you will learn quite a lot about the nature of the data, and you will have transformed it so that you can learn more robustly.

I'd be interested to know what initial procedures you apply to your data, so feel free to add some comments.

Saturday, July 05, 2008

Time Management for PostDocs

As a student, you always have a lot of stuff to do. But on the other hand, you are in the convenient position that others have decided what is important, and you basically just need to make sure that you allot enough time to each item.

Once you start working as a Ph.D. student or as a PostDoc, things gradually stop being that simple. You acquire a few responsibilities, and before you know it, you not only have to think about doing some original research, but you also have to supervise a bunch of students, give lectures, grade papers, organize a seminar, do reviews, do a bit of system administration, take care of the office furniture for the new building you're moving into, keep the group's website up-to-date, and so on.

So while you might be just as busy as you were as a student, the big difference is that now all these different duties actually compete for your time, and you have to start actively managing it. Otherwise, the squeaky wheel gets all the grease. And as it happens, most of the stuff which has "URGENT" written all over it in large, red letters is not what is most important in the long run.

The most important step might be to realize that it's not only about sorting and ranking the tasks in your inbox, but that you must go a step further and also control what gets into your inbox and what priorities you assign. Being fully aware of one's priorities certainly leads to better load balancing than just trying to satisfy all the constraints imposed by the items on your to-do list.

I guess the language I've been slipping into in the last paragraph already points to the fact that I'm a computer scientist by training. So for me (at least if you believe the stereotypes), everything is just a process, and the goal is to maximize overall productivity with respect to some sensible criterion, given certain constraints like the fact that each day only has 24 hours.

Consequently, one could expect that time management approaches appealing to hackers are a bit different from those which are interesting to the mainstream and which focus more on "soft topics" like how to get into the right frame of mind, or which explain their principles with allegories involving panda bears (just kidding).

There is one book out there, Time Management for System Administrators (O'Reilly), which fits the above description exactly. For the author, Tom Limoncelli, time management is about learning useful habits: turning useful behavior into processes which become so automatic that you eventually forget why you're doing them in the first place. One example he gives is that he always takes his car to the gas station on Sundays. At some point he wondered why he was actually doing this, and remembered that before he had this habit he often missed appointments on Monday mornings. Then he changed his habits and actually forgot why, which is effortless efficiency, the best of all conditions. The other big topic in his book is a way of managing your to-do lists together with your appointments so that you don't end up with one big to-do list of doom which you know you will never ever finish, but instead with realistic small lists for each day. Oh, and the book also sports a number of Dilbert comics, hinting at who its main readers are expected to be.

The bottom line is that I found this book a lot of fun to read and it contained a lot of nice ideas. On the other hand, I couldn't really make the process of managing my to-do lists stick. I tried, but the main problem was that my days are just too unpredictable: even though I selected a small number of things I wanted to do on a specific day, I seldom managed to do all of them, leading exactly to the kind of never-ending backlog one is supposed to avoid. But your mileage may vary.

The other time management framework which seems to be fairly popular among hackers is "Getting Things Done" by David Allen. The whole framework is described in one book of the same title. The appeal for hackers very clearly comes from the fact that the book mainly describes a single process for managing all your stuff. The main idea is that you should have some means of keeping all the information you need outside of your brain, freeing it from a lot of burden and helping you to become less worried and more productive (of course). Every single item you have to take care of, and every piece of information, goes through a process where it is analyzed, the next necessary action is decided upon, and the result is stored in archives or on one of several lists. One of these lists, the "next action" list, contains all the next actions you could take, sorted by the context in which you could perform them. The idea is that when you find some time to make phone calls or to read, you just look at the list and take the appropriate actions.

Personally, what I liked about this scheme is that there is so little planning. It focuses more on what you could do instead of what you should do. Instead of planning when to make that phone call, or even making a short list of things you want to do on a certain day, you just compile lists of things you could do and do them whenever you have the time. And let's face it, there are only so many things you can do in a day, and this approach naturally leads to a situation where you have made the best use of your time (provided that you chose your actions wisely from the list).

Looking at other professions, this approach reminds me of the way physicians organize their work. A doctor has to deal with many cases at once, but cannot possibly keep all of the information about his patients in his head. Therefore, he uses files, one for each patient. When a patient comes in for an appointment, the doctor has all the necessary information at hand. He decides on the next action to be taken and puts that information in the file. The patient then makes another appointment so that the development can be tracked. But the doctor can go home at night and forget almost everything he had to deal with during the day.

There is a very interesting interview with David Allen in Wired, if you are interested, together with a few links to other resources.

Of course, I have to admit that I'm not following this process exactly either (although David Allen specifically warns against that), partly because sometimes things just get out of hand, and then again, everybody adds his own ideas to the mix. In the end, I think it is not that important whether you follow some process to the letter, but that you gain some awareness of what you do and why, why some things never get done, and why those are sometimes exactly the things you really should be doing more of.

And, of course, there is always the danger with hackers that they get so fascinated with a new toy that they miss the point where they should stop optimizing the process and start putting it to work. I'm certainly not making fun of others here; maybe I should invest more time into honing my own time management skills. But if you're really into "life hacking" and finding the best physical implementation of the GTD scheme, have a look at 43folders or at these pictures of people creating some amazing organizers from plain sheets of paper.

If you're more into software, you might find chandler or beeswax interesting.