Briefly, my solution is equivalent to one giant row-wise Givens solver. The trick is to legally reorder Givens operations and distribute their computations to parallel running pipelines so that most of the work doesn't require bulk data recombination.

There are several design patterns and principles used.

Streaming preprocessors

One principle I used was

*nested streaming preprocessors*. Imagine having a bunch of lexical preprocessors working on a program source. First preprocessor strips out comments, second preprocessor parses grammar alphabet, etc. This pattern allows to cram a lot of logic in one sequential processing pass without having to load 'the whole thing' into memory.

So I tried to follow the same principle in matrix processing pipeline. First step is to produce matrix Y. We consume matrix A but we never even form so much as a complete row of A in memory. This is one of the enhancements, being able to consume A in element-by-element fashion (only non-zero elements for sparse data). We can accumulate row of Y as (

*k+p*) long dot-product accumulator and that's all the memory this particular preprocessor needs. Once we finished with all alements in row of A, the dot-product accumulator contains final row of Y which is then passed on onto first step of QR pipeline.

We start off QR pipeline by dividing matrix Y into

*z*horizontal blocks.

Again, Y blocking is only conceptual, as actual blocks are never formed in one place. To put things in perspective, each map task works with one or more Y blocks.

QR pipeline works on Y rows, row by row, by applying bottom-up ordered Givens operations until it transforms Y block into the following tall matrix:

\[\mathbf{Y}_{i}\rightarrow\mathrm{GivensQR}\rightarrow\left(\begin{matrix}\times & \times & \cdots & \times & \times\\0 & \times & \cdots & \times & \times\\\vdots & \vdots & \ddots & \vdots & \vdots\\0 & 0 & \cdots & \times & \times\\0 & 0 & \cdots & 0 & \times\\\\0 & 0 & 0 & 0 & 0\\\vdots & \vdots & \vdots & \vdots & \vdots\\0 & 0 & 0 & 0 & 0\end{matrix}\right)=\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{Z}\end{matrix}\right)\]

This tall matrix is represented with an upper-triangular

**R**matrix, which is (*k+p*)×(*k+p*) sitting on the top of tall zero matrix**Z.**The**R**matrix is the one from thin QR decomposition, and we can get rid of**Z**matrix as having no information.
In fact, our QR preprocessor only keeps R-sized (

*k+p*)×(*k+p*) buffer which "slides" up until it ends up holding R. Each additional iteration "builds" 1 row of Y on top of buffer, turning it into something resembling upper Hessenberg form, and then second part of iteration eliminates "subdiagonal" by applying Givens iterations and turning it back to upper-triangular. Last row is then completely zeroed and thrown away (or rather, reused as buffer for Y row so we don't incur java GC thrashing too much).
Divide and Conquer

The next step is to merge blockwise QR results into single final QR result. in order to do that, we can stack up individual blockwise R matrices one on top of another and apply same strategy, namely, selectively reordered Givens set until we end up with R/Z result again:

\[\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\\\cdots\\\mathbf{R}_{n}\end{matrix}\right)\rightarrow\mathrm{GivensQR}\rightarrow\left(\begin{matrix}\mathbf{R}\\\mathbf{Z}\end{matrix}\right)\]

The result of this is final QR decomposition (or at least R part of it).

The next notion is that we can apply those two steps recursively in bottom-up divide-and-conquer fashion to merge as many intermediate QR blocks as we want.

Induction for collecting Q

Producing (and merging) Q is somewhat more complicated to describe than producing R. But the whole approach of bottom-up ~1000-indegree divide-and-conquer (or perhaps merge-and-conquer :) is the same as described in the previous section even as we evolve Q blocks thru the merge hierarchy. The algorithm for individual step of Divide-and-Conquer bottom-up iteration there builds by induction and is described in my notes. The challenge is to stream individual Q blocks through series of Givens rotations that eventually it would have to go through in order to become a block of the final Q matrix. The algorithm builds for a trivial case of induction for a single bottom-up divide-and-conquer step of indegree 2 and 3 and then proceeds with building general case of indegree n. Memory requirements are to hold one Q block in addition to R sequence during Q merging. If one merge-up is one map-only MapReduce jobs, then just two map-only MR jobs can merge up 1,000,000 blocks or more than a billion rows (assuming Y blocks can be at least 1000 rows high -- but they can be much more in practice).

Turns out the number of Givens operations that we need to apply at each evolution of Q-block can be significantly smaller than the number elements of Q block we need to apply them to because it doesn't depend on the height of the Q-block but only proportional to indegree of the bottom-up divide-and-conquer pass and also ~(

*k+p*)/2. That basically means that we could perform each pass of divide-and-conquer as a map-only pass with Givens operations being a side file information. (actually Givens operations are produced from that stack of Rs shown above, but they are different for each Q block, so that stack of Rs is the side information and Givens are produced on-the-fly). There are also some optimization techniques to transform (reduce) R-stack as we progress thru Q blocks to avoid unnecessary computations, so at the end we end up with only one R in the stack which also happens to be final R for the local step of divide-and-conquer bottom-up process. We take the final R and use that in subsequent divide-and-conquer merge-ups.

In practice in Stochastic SVD we piggyback on the mappers of the second job to finish QR merging -- and then pass it on onto whatever second job is really supposed to do (computing matrix B). That practically would allow to achieve QR per my previous estimate of 1 billion or more rows in mappers with 1Gb RAM with what seems like practically a single map-only pass over input data. (A patch enabling preprocessing input vectors is not part of Mahout patch -- not yet at least, so Mahout version may have significantly higher RAM requirements and longer running time due to GC thrashing than my github branches if the input contains rather wide and dense vector data.)

So, to recap, the major design principles in MR-based QR were: streaming preprocessors; recursive bottom-up divide-and-conquer; design by induction. Big thanks to Udi Manber and his books on algorithm design basics that got pretty well imprinted in the inner side of my skull.

My perception of LSI |