Friday, July 13, 2012

#Mahout 0.7 now has a more usable PCA workflow

Now that Mahout 0.7 has been released, it is probably appropriate to say a word about my contribution, MAHOUT-817, which (afaik, for the first time) gives Mahout a PCA workflow. (Well, in all fairness, I wasn't the only contributor there. Many thanks to Raphael Cendrillon for contributing the mean computation map-reduce step.)

It is based on the SSVD (stochastic SVD) method, which computes a mathematical SVD in a distributed way. Generally, there's no problem with the brute-force PCA approach, which involves subtracting the mean and then running SVD on the result.

The only time a wrench gets thrown into that engine is when the original PCA input matrix is sufficiently sparse. Mean subtraction turns the input into a dense matrix, potentially increasing the number of non-zero elements (and hence the flops needed to complete matrix operations during the SVD computation) many times over.
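To make the densification concrete, here is a minimal sketch in plain Python. The toy matrix and the helper names (`column_means`, `center_dense`) are made up for illustration and have nothing to do with the Mahout code; the sketch only counts non-zeros before and after column-mean subtraction.

```python
# Hypothetical toy input: a 4 x 4 sparse matrix, one dict {column: value} per row.
rows = [
    {0: 1.0},
    {3: 1.0},
    {1: 4.0},
    {0: 6.0, 2: 3.0},
]
n_cols = 4


def column_means(rows, n_cols):
    """Column means, counting the implicit zeros as well."""
    sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return [s / len(rows) for s in sums]


def center_dense(rows, n_cols):
    """Brute-force mean subtraction: every cell gets touched, stored or not."""
    means = column_means(rows, n_cols)
    return [[row.get(j, 0.0) - means[j] for j in range(n_cols)] for row in rows]


nnz_before = sum(1 for row in rows for v in row.values() if v != 0.0)
centered = center_dense(rows, n_cols)
nnz_after = sum(1 for row in centered for v in row if v != 0.0)

print(nnz_before, nnz_after)
```

In this toy case every cell becomes non-zero after centering, because each former zero turns into minus the column mean. On a real input with millions of rows and columns, that is the blow-up the brute-force approach runs into.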

As is true for many Mahout methods, SSVD takes a big view of the world with all its details and produces a more condensed analytic view of it. So the natural line of thinking is: ok, can we just run the SVD on the sparse input first and fix up the smaller, condensed view later, at its smallest incarnations, so that the result is equivalent to the summary we would get with mean subtraction?
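One way to see why such a late fix can work at all is the identity (A - 1 xi^T) Omega = A Omega - 1 (xi^T Omega), where xi is the vector of column means and Omega is any dense matrix: the product with the centered input equals the product with the sparse input minus one small correction row repeated for every row. The sketch below checks that identity numerically in plain Python. It illustrates the general idea only and is not the Mahout implementation; the toy data, the width `k`, and all function names are assumptions made up for this example.

```python
import random

# Hypothetical toy input: a 4 x 4 sparse matrix, one dict {column: value} per row.
rows = [{0: 1.0}, {3: 1.0}, {1: 4.0}, {0: 6.0, 2: 3.0}]
n_cols = 4
k = 2  # width of the dense right-hand matrix (assumed for the example)

rng = random.Random(0)
omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_cols)]


def column_means(rows, n_cols):
    """Column means xi, counting the implicit zeros as well."""
    sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return [s / len(rows) for s in sums]


def sparse_product(rows, omega):
    """A * omega, touching only the stored non-zeros of A."""
    width = len(omega[0])
    out = []
    for row in rows:
        acc = [0.0] * width
        for j, v in row.items():
            for c in range(width):
                acc[c] += v * omega[j][c]
        out.append(acc)
    return out


def centered_product_sparse(rows, n_cols, omega):
    """(A - 1 xi^T) * omega computed as A * omega - 1 * (xi^T * omega)."""
    xi = column_means(rows, n_cols)
    width = len(omega[0])
    # One small dense row, computed once and subtracted from every output row.
    correction = [sum(xi[j] * omega[j][c] for j in range(n_cols)) for c in range(width)]
    product = sparse_product(rows, omega)
    return [[y - correction[c] for c, y in enumerate(y_row)] for y_row in product]


def centered_product_dense(rows, n_cols, omega):
    """Brute force: densify A by subtracting column means, then multiply."""
    xi = column_means(rows, n_cols)
    width = len(omega[0])
    dense = [[row.get(j, 0.0) - xi[j] for j in range(n_cols)] for row in rows]
    return [[sum(d[j] * omega[j][c] for j in range(n_cols)) for c in range(width)] for d in dense]


fast = centered_product_sparse(rows, n_cols, omega)
slow = centered_product_dense(rows, n_cols, omega)
max_diff = max(abs(f - s) for fr, sr in zip(fast, slow) for f, s in zip(fr, sr))
print(max_diff)
```

The two results should agree up to floating-point rounding, while the sparse variant never materializes the dense centered matrix. The correction only ever touches the small condensed product, which is the spirit of fixing the summary rather than the input.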

Here is the quick write-up that shows how the math for that neat trick has worked out in Mahout.



Also, a high-level overview and practical usage of this approach are given here: https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition.