Weathering Thru Tech Days: February 2011

Saturday, February 19, 2011

Streaming QR decomposition for MapReduce part I: Induction, Divide-and-conquer and other aging but still neat goodies

One of most prominent challenges in my stochastic SVD contribution to Mahout was low-memory parallelizable implementation of a QR decomposition. Although current implementation of QR decomposition is rather tightly coupled with and rather specific for the SSVD method, it can be re-worked to be a massive scale QR method standing on its own.

Briefly, my solution is equivalent to one giant row-wise Givens solver. The trick is to legally reorder Givens operations and distribute their computations to parallel running pipelines so that most of the work doesn't require bulk data recombination.

There are several design patterns and principles used.

Streaming preprocessors

One principle I used was nested streaming preprocessors. Imagine having a bunch of lexical preprocessors working on a program source. First preprocessor strips out comments, second preprocessor parses grammar alphabet, etc. This pattern allows to cram a lot of logic in one sequential processing pass without having to load 'the whole thing' into memory.

So I tried to follow the same principle in matrix processing pipeline. First step is to produce matrix Y. We consume matrix A but we never even form so much as a complete row of A in memory. This is one of the enhancements, being able to consume A in element-by-element fashion (only non-zero elements for sparse data). We can accumulate row of Y as (k+p) long dot-product accumulator and that's all the memory this particular preprocessor needs. Once we finished with all alements in row of A, the dot-product accumulator contains final row of Y which is then passed on onto first step of QR pipeline.

We start off QR pipeline by dividing matrix Y into z horizontal blocks.

$\mathbf{Y}=\left(\begin{matrix}\mathbf{Y}_{1}\\ \mathbf{Y}_{2}\\ \vdots\\ \mathbf{Y}_{z}\end{matrix}\right)$

Again, Y blocking is only conceptual, as actual blocks are never formed in one place. To put things in perspective, each map task works with one or more Y blocks.

QR pipeline works on Y rows, row by row, by applying bottom-up ordered Givens operations until it transforms Y block into the following tall matrix:

\[\mathbf{Y}_{i}\rightarrow\mathrm{GivensQR}\rightarrow\left(\begin{matrix}\times & \times & \cdots & \times & \times\\0 & \times & \cdots & \times & \times\\\vdots & \vdots & \ddots & \vdots & \vdots\\0 & 0 & \cdots & \times & \times\\0 & 0 & \cdots & 0 & \times\\\\0 & 0 & 0 & 0 & 0\\\vdots & \vdots & \vdots & \vdots & \vdots\\0 & 0 & 0 & 0 & 0\end{matrix}\right)=\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{Z}\end{matrix}\right)\]

This tall matrix is represented with an upper-triangular R matrix, which is (k+p)×(k+p) sitting on the top of tall zero matrix Z. The R matrix is the one from thin QR decomposition, and we can get rid of Z matrix as having no information.

In fact, our QR preprocessor only keeps R-sized (k+p)×(k+p) buffer which "slides" up until it ends up holding R. Each additional iteration "builds" 1 row of Y on top of buffer, turning it into something resembling upper Hessenberg form, and then second part of iteration eliminates "subdiagonal" by applying Givens iterations and turning it back to upper-triangular. Last row is then completely zeroed and thrown away (or rather, reused as buffer for Y row so we don't incur java GC thrashing too much).

Divide and Conquer

The next step is to merge blockwise QR results into single final QR result. in order to do that, we can stack up individual blockwise R matrices one on top of another and apply same strategy, namely, selectively reordered Givens set until we end up with R/Z result again:

\[\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\\\cdots\\\mathbf{R}_{n}\end{matrix}\right)\rightarrow\mathrm{GivensQR}\rightarrow\left(\begin{matrix}\mathbf{R}\\\mathbf{Z}\end{matrix}\right)\]

The result of this is final QR decomposition (or at least R part of it).

The next notion is that we can apply those two steps recursively in bottom-up divide-and-conquer fashion to merge as many intermediate QR blocks as we want.

This is basically all that happens in MapReduce QR: mappers are running one or more QR blocks, optionally merging them; and then another pass goes up one level on divide-and-conquer hierarchy tree, and then the whole routine repeats until we are left with just one R.

Induction for collecting Q

Producing (and merging) Q is somewhat more complicated to describe than producing R. But the whole approach of bottom-up ~1000-indegree divide-and-conquer (or perhaps merge-and-conquer :) is the same as described in the previous section even as we evolve Q blocks thru the merge hierarchy. The algorithm for individual step of Divide-and-Conquer bottom-up iteration there builds by induction and is described in my notes. The challenge is to stream individual Q blocks through series of Givens rotations that eventually it would have to go through in order to become a block of the final Q matrix. The algorithm builds for a trivial case of induction for a single bottom-up divide-and-conquer step of indegree 2 and 3 and then proceeds with building general case of indegree n. Memory requirements are to hold one Q block in addition to R sequence during Q merging. If one merge-up is one map-only MapReduce jobs, then just two map-only MR jobs can merge up 1,000,000 blocks or more than a billion rows (assuming Y blocks can be at least 1000 rows high -- but they can be much more in practice).

Turns out the number of Givens operations that we need to apply at each evolution of Q-block can be significantly smaller than the number elements of Q block we need to apply them to because it doesn't depend on the height of the Q-block but only proportional to indegree of the bottom-up divide-and-conquer pass and also ~(k+p)/2. That basically means that we could perform each pass of divide-and-conquer as a map-only pass with Givens operations being a side file information. (actually Givens operations are produced from that stack of Rs shown above, but they are different for each Q block, so that stack of Rs is the side information and Givens are produced on-the-fly). There are also some optimization techniques to transform (reduce) R-stack as we progress thru Q blocks to avoid unnecessary computations, so at the end we end up with only one R in the stack which also happens to be final R for the local step of divide-and-conquer bottom-up process. We take the final R and use that in subsequent divide-and-conquer merge-ups.

In practice in Stochastic SVD we piggyback on the mappers of the second job to finish QR merging -- and then pass it on onto whatever second job is really supposed to do (computing matrix B). That practically would allow to achieve QR per my previous estimate of 1 billion or more rows in mappers with 1Gb RAM with what seems like practically a single map-only pass over input data. (A patch enabling preprocessing input vectors is not part of Mahout patch -- not yet at least, so Mahout version may have significantly higher RAM requirements and longer running time due to GC thrashing than my github branches if the input contains rather wide and dense vector data.)

So, to recap, the major design principles in MR-based QR were: streaming preprocessors; recursive bottom-up divide-and-conquer; design by induction. Big thanks to Udi Manber and his books on algorithm design basics that got pretty well imprinted in the inner side of my skull.

My perception of LSI

Saturday, February 12, 2011

Weathering Thru Tech Days...

One Bridge We'll Never See Again Quite The Same

... or what is the true cost and benefit of technology

I am generally not very comfortable with making public statements on issues not directly relevant to technology and science. But I guess I need to clarify where the name of the blog comes from. Besides, it is still relevant to science and technology, albeit on a purely philosophical level.

Quite often I find myself wondering if the value/cost ratio of that flat tv on my wall is really that good as I thought it was when I struck that "submit order" button. Over the years I increasingly came to conclusion that it is not. And not in the sense that I had problems balancing my checkbook a month after when I had to put up with that credit card bill.

There's a lot of hidden costs both in future and present associated with "techno revolution" we are immersed in.

Indeed, we pay everywhere, every minute without realizing it, just for the privilege of having that Droid X in our pocket. We spend a lot of time developing something; and very often we don't enjoy the time spent. We spend time talking to people while we'd really be spending time with our kids or talking to somebody else instead. We don't get to create stuff we really like instead. Gosh, just mere thought of how much of my productive life I have spent on creating things I personally could quite easily live without, makes me sick.

And by spending all that time, we only create things that will pollute, spend energy (=pollute again, =create higher health costs), cause climate change, in other words will create even more compound future costs for our children to pay. We find it is so easy to borrow from our children, don't we? What kind of economy is that where we pay only fraction of the real cost and expect "future administrations" to pick up the deficit? What kind of human robs his kids?

And that luxury car navigation option, is it really worth not averting those couple food riots our children would have to live through?

And all that war tech?.. Better not even get me started on this...

In my opinion, very few tech products and benefits of tech revolution are actually worth their cost (their real cost, not just those virtual numbers assigned to our accounts we give away to have them).

Well, my Nikon gear may be one happy exception as sometimes I feel as if their craftsmanship helps me to work on my soul development. A little stretch, but...

A Wait By The Glass Ocean

Though perhaps there's one redemption for having all that technological progress. Which is most (and perhaps the only) worthy product, and that's the technology itself.

Chances are our technology is just not yet "smart" enough to produce anything really useful. But it still may become that.

The hope is that at some point we could achieve an 'ubertech' which would provide benefits finally far exceeding the real cost. Like colonizing another worlds. Or keeping our planet in balance. Or transforming ourselves into completely different state of thought or biology. Reaching out to other cultures out there, after all. Who knows what else, something we can't even imagine today.

Those are uberproducts, uberbenefits, coming with a real value which is hard to underestimate. Those benefits are so uber that it doesn't really matter much how much it costs, the only that matters is that at the end of the day we still have enough to pay for them. Like a heart transplant. That hope redeems the tech and all its today's overhead. If we still will be able to afford it...

And the reason why we may not be able to afford it might be because we will have overspent on nurturing the pre-tech (such as the one we have today).

Fort Ross

So it's a balancing act: it seems that ubertech predicates itself on pre-tech, along with its pre-benefits, pre-nonsense and pre-waste. But the more of the wasteful, nonsensical benefits we have, the more we borrow from the future capacity and jeopardize the path of reaching that cost/benefit ratio tipping point where everything caused by pre-waste could eventually be rectified and attended to.

The bottom line is, we humans, not just present, but also some past and all future generations collectively, are doing nothing else but weathering the technology as we have it today, the early tech days. Whether we will be able to bridge the gap between all the costs and future uber-benefits, is anybody's guess at this point. The danger that we over-borrow is very real. It's a blizzard and it yet remains to be seen if we can weather through it without freezing to death. I am just trying to remember that we need the pre-tech but not necessarily the product.

And whenever I work on a technology these days, I try to at least ask myself a question whether it has even a slightest chance to be a pre-tech for the 'ubertech', whether some part of it could be used in another part that could be part of another part of that eventual ubertech that delivers us.

Or it is just yet another one of those stupid dead-end buzzword technologies people use just to add a markup on the price tag for their skills.

iPad? I don't care for no stinking iPad!..

I do care, however, about the fact that somebody out there is still seeking for the next technological leap, and that gigantic evolutionary algorithm with voluntary fitness model force assigning evolutionary fitness scores expressed in dollars and pennies, pence, shillings and cents based on millions of factors.. For in the end it is all about evolution. And evolution, as we know, is all about survival. So we better not forget what it is really all about: a search for the future. No less, no more.

Saturday, February 5, 2011

MapReduce and LSI: Stochastic projection, or SSVD - part II

Mahout project would be logical place for stochastic method for MapReduce platform to appear. However, as of last year, the work on this method has appeared to be stalled somewhat. The problem was appealing in engineering and mathematical sense to me, so I decided to get on it. The implementation I arrived at can be obtained via my github repository and also being integrated into Mahout thru MAHOUT-376 and MAHOUT-593.

So, to the gist of it.

Modified SSVD method

First thing I did, I modified original algorithm just a tiny bit to make it more practical for MapReduce front end computations.

(There's great linear algebra visualizations here that might help visualize some of the steps, although terminology is sometimes odd compared to what found in most matrix computation textbooks),

We find our basic discussion of stochastic projection in [Halko,et al]. The following is modified stochastic SVD algorithm:

Given an m × n matrix A, a target rank k, and an oversampling parameter p, this procedure computes an m × n SVD

$\mathbf{A=U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$ ,

U is m × k, Σ is diagonal matrix containing k singular values; V is n × k.

1. Create seed for random n×(k + p) matrix Ω . The seed defines matrix Ω using Gaussian unit vectors per one of suggestions in [Halko, et al].

Creating stochastic projection basis :
2. Y=AΩ, Y is m × (k+p).

Now Y spans our random subspace. I guess, intuitively, my understanding is that Y is quite likely to be closely aligned with most most of the vectors in A, so that we are not loosing major factors in A (SVD problem is practically the same as factorization problem).

Next, we orthonormalize the subspace spanned by Y using 'thin' QR:

3. Column-orthonormalize Y → Q by computing thin decomposition Y = QR.

Now align original data with our reduced space basis :

4. B=Q^ΤA. B is ( k + p ) × n.

Original algorithm proceeds with computing SVD of B. But matrix B is as wide as original matrix A and it is also dense, which means that for k+p being something like 1000 and A width (n) in the area of 10e+8, double precision arithmetics, we need 8e+11 bytes of RAM, or 800Gb. Not very practical, for scale, huh. Besides, since SVD of B is going to happen in the frontend, it is not a MapReduce process anymore and any benefit from parallel cloud computations is gone here. It is going to take quite a lot of time to compute this. So I changed the following steps to reduce problem size even further and proceed with an eigensolution instead of SVD:

5. Compute eigensolution of a small symmetric matrix BB^Τ

$\mathbf{BB}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}. \mathbf{BB}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}.$

And just like that we just reduced the size of the problem to (k+p) × ( k+p ), i.e. taking example above, from 800 Gb to just meager 8 Mb. We can solve 8 Mb -large eigenproblem in the frontend with any stock high-precision solver in a matter of seconds (if not fractions of a second).

Computing BB^Τwith help of a MapReduce process is also very quick (by mapReduce standards, since it takes about 20 seconds just to set up all the mappers). But for the problems of significant size and good sized cluster we could really push the boundaries of the problem size with this approach in cases where Lanczos or RAM-only solver may not be practical (and MPI framework is too expensive).

There is a danger of potentially loosing some precision here, but in practice i did not see any. 8Mb is probably enough bits to represent millions and millions of soft clusters. Besides, in all my experiments with a reasonable sized problem that i could still fit in RAM to verify with a stock solver, stochastic projection errors were significantly higher than any errors coming from rounding errors in double arithmetic.

Restoring right and left eigenvector matrices is then easy.

6. Singular values Σ=Λ^0.5.
7. If needed, compute

$\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}$

8. If needed, compute

$\mathbf{V}=\mathbf{B}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}.$

First practical solution takes 5 MR passes which may sound like a lot; but, 3 of them are map-only and 2 of them optional and running in parallel, i.e. for all practical purposes it is quite acceptable.

My more detailed working notes are here and command line howto is here.

Tuesday, February 1, 2011

MapReduce and LSI: Stochastic projection, or SSVD - part I

Over past half a year or so my company had a need for a good and scalable method for Latent Semantic Analysis and related techniques. Since we are essentially a Hadoop shop, everything MapReduce comes naturally to us. So the hope was to find an open source LSA/LSI implementation based on series of Hadoop jobs.

The need wasn't the most pressing though, so I took time to research around. Actually, I started looking for a good scalable SVD algorithm even before I joined my current company. Mahout seemed to be most promising in this direction. The only problem was that Mahout hadn't had (and chances are, at the time you are reading this, still doesn't have) an end-to-end LSA pipeline. It did have most of what one would need: bigram/trigram or even n-gram selection method based on log-likelihood; pluggable lemma analysis; vectorization framework over sequence files that one could run directly in MapReduce. So the only missing parts are fast parallelizable SVD method and more or less helpful vector space index builder so we could turn LSA into LSI.

Over the years LSA was a desirable and easy-to-understand target for researches and NLP engineers alike. The problem with LSA has always been though that SVD algorithms that belie the LSA method were inherently slow and inherently RAM or supercomputer methods. Numerical Analysis community wasn't paying enough attention to cloud environment (and perhaps isn't paying enough even now). So SVD method for MapReduce or similar cloud processing environments was largely remaining an unattainable target.

Truth to be spoken, Mahout did ingest distributed Lanczos method, which is now available in Mahout-0.4. Compared to any stochastic method, it is certainly providing outstanding precision of computation. In discussion with Mahout community, however, stochastic method emerged as more suitable for bigger document corpus. Primarily, it is not clear that Lanczos method can be called "fast" enough to be practical for big corpus -- and truly parallel in a cloud sense: from what I read , it is still iterative even in its MapReduce incarnation and requires a separate MapReduce run for every singular value computed. (Things may have changed since I read about this and I apologize in advance if this does not accurately depict current reality). It is also presumably computationally heavier (including front-end) than using a stochastic projection preprocessing.

Stochastic projection methods were really well-outlined recently; there is a really excellent study published [Halko, et al]). To get an idea how fast this method could be, check out these performance numbers published for redsvd project. Those are really amazing numbers. I don't think Lanczos and ScaLaPack methods could come close to those in terms of speed, core per core. (I suspect though java-based code would never be able to get exactly same performance natively as I think java in its pure form doesn't support SIMD).

The idea of stochastic SVD is to reduce problem dimensions by using random projection while keeping major driving factors of the dataset more or less intact. As a result, the problem may be reduced in size quite a bit. In terms of MapReduce implementation, we'd run problem projection & preprocessing using MapReduce (bulk number crunching) and then solve small problem in a jiffy inside the front-end process (so called "driver" in MapReduce lingo). The trade-off is a rather heavy loss in precision compared to other methods. However, problems like LSI are quite approximate: they are based on term collocation frequency analysis, which is circumstantial for a given corpus (or a given person's lifetime corpus, if we really try to compare method's output to human assessments). Bottom line, as such, Stochastic SVD method is unlikely suitable to crunch numbers for a rocket booster design but quite likely is what bulk computational linguistics needs.