Sunday, March 27, 2011

Streaming QR decomposition for MapReduce part III: Collecting Q by Induction

In Part II, as part of the bottom-up divide-and-conquer parallelization strategy, one problem emerged that we need to solve at every node of the bottom-up tree:

$$\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{1}\\\mathbf{Q}_{2}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{2}\\\vdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}_{z}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)=\left(\begin{matrix}\mathbf{\hat{Q}}\\\mathbf{\hat{Q}}\\\vdots\\\mathbf{\hat{Q}}\end{matrix}\right)\mathbf{\hat{R}}$$        (1)       

Let's simplify the problem for a moment (as it turns out, vertical blocking of Q is trivial, since Q blocks are always produced by applying column-wise Givens operations, which we can 'replay' on any Q block).

$$\left(\begin{matrix}\mathbf{Q}_{1}\mathbf{R}_{1}\\\mathbf{Q}_{2}\mathbf{R}_{2}\\\vdots\\\mathbf{Q}_{z}\mathbf{R}_{z}\end{matrix}\right)\Rightarrow\left(\begin{matrix}\mathbf{\hat{Q}}_{1}\\\mathbf{\hat{Q}}_{2}\\\vdots\\\mathbf{\hat{Q}}_{z}\end{matrix}\right)\hat{\mathbf{R}}$$     (2)

Also, let's decompose the standard Givens QR operation into two steps:

1) producing the Givens transformation sequence on some $m\times n$ input, $m>n$, as $$\left(\mathbf{G}_{1},\mathbf{G}_{2}...\mathbf{G}_{k}\right)=\mathrm{givens\_qr}\left(\mathbf{A}\right)$$. I will denote the product of all Givens operations obtained this way as $$\prod_{i}\mathbf{G}_{i}\equiv\prod\mathrm{givens\_qr}\left(\mathbf{A}\right)$$.

2) Applying the product of Givens operations to the input produces the already familiar result
\[\left(\prod_{i}\mathbf{G}_{i}\right)^{\top}\mathbf{A}=\left(\begin{matrix}\mathbf{R}\\\mathbf{Z}\end{matrix}\right)\]

from which it follows that the thin QR's R can be written as
\[\mathbf{R}=\left[\left(\prod_{i}\mathbf{G}_{i}\right)^{\top}\mathbf{A}\right]\left(1:n,:\right)=\left[\left(\prod\mathrm{givens\_qr}\left(\mathbf{A}\right)\right)^{\top}\mathbf{A}\right]\left(1:n,:\right)\]
using Golub/Van Loan's block notation.

(Golub/Van Loan's block notation of the form A(a1:a2, b1:b2) means "crop of matrix A: rows a1 to a2 and columns b1 to b2". For half- and fully-open intervals the bounds are simply omitted.)
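For readers who prefer code, this notation maps directly onto array slicing; a quick NumPy illustration (0-based, hence the off-by-one shift against the 1-based notation above):

    import numpy as np

    A = np.arange(30.0).reshape(6, 5)
    r_like = A[:5, :]     # A(1:n, :) with n = 5: first n rows, all columns
    crop = A[1:4, 2:5]    # A(2:4, 3:5): rows 2..4, columns 3..5 (1-based)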

'Thick' Q is produced as $$\mathbf{I}\left(\prod_{i}\mathbf{G}_{i}\right)$$ and hence 'thin' Q is
\[\mathbf{Q}=\left[\mathbf{I}\left(\prod_{i}\mathbf{G}_{i}\right)\right]\left(:,1:n\right)=\left[\mathbf{I}\left(\prod_{i}\mathrm{givens\_qr}\left(\mathbf{A}\right)\right)\right]\left(:,1:n\right),\mathbf{\,\, I}\in\mathbb{R}^{m\times m}.\]

Now that we have laid out all the notation, let's get to the gist. As I mentioned before, I continue building the algorithm by induction. Let's consider case (2) for $$z=2$$. Then it can be demonstrated that the following is the solution, as it is equivalent to full Givens QR with a rearranged but legitimate order of Givens operations:

$$\mathbf{\hat{R}}=\left\{\left[\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]^{\top}\cdot\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right\} \left(1:n,:\right)$$,

$$\mathbf{\hat{Q}}=\left[\left(\begin{matrix}\mathbf{Q}_{1} & \mathbf{Z}\\\mathbf{Z} & \mathbf{Q}_{2}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]\left(:,1:n\right)$$.

Let's denote the 2-block computation of $$\mathbf{\hat{R}}$$ as the function

$$\mathrm{rhat}\left(\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left\{\left[\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{R}_{j}\end{matrix}\right)\right]^{\top}\cdot\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{R}_{j}\end{matrix}\right)\right\} \left(1:n,:\right)$$.

Still working on the trivial case of the induction: now note that since the Givens operations for the calculation of $$\mathbf{\hat{Q}}$$ are applied pairwise to the columns of the accumulator matrix, i.e. independently for each row, we can split the computation of $$\mathbf{\hat{Q}}$$ and apply it to any combination of vertical blocks of $$\left(\begin{matrix}\mathbf{Q}_{1} & \mathbf{Z}\\\mathbf{Z} & \mathbf{Q}_{2}\end{matrix}\right)$$. Let's say we decide to split it into 2 blocks with as many rows as Q1 and Q2, respectively, and get back to the block-wise formulas sought as the solution of (2):

$$\mathbf{\hat{Q}}_{1}=\left[\left(\begin{matrix}\mathbf{Q}_{1} & \mathbf{Z}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]\left(:,1:n\right),$$

$$\mathbf{\hat{Q}}_{2}=\left[\left(\begin{matrix}\mathbf{Z} & \mathbf{Q}_{2}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]\left(:,1:n\right).$$      (3)


Let's denote those transformations in a more general form as

$$\mathrm{qhat\_down}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left[\left(\begin{matrix}\mathbf{Q}_{i} & \mathbf{Z}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{R}_{j}\end{matrix}\right)\right]\left(:,1:n\right)$$

and

$$\mathrm{qhat\_up}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left[\left(\begin{matrix}\mathbf{Z} & \mathbf{Q}_{i}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{j}\\\mathbf{R}_{i}\end{matrix}\right)\right]\left(:,1:n\right)$$

then we can rewrite (3) as

$$\mathbf{\hat{Q}}=\left(\begin{matrix}\mathbf{\hat{Q}}_{1}\\\mathbf{\hat{Q}}_{2}\end{matrix}\right)=\left(\begin{matrix}\mathrm{qhat\_down}\left(\mathbf{Q}_{1},\mathbf{R}_{1},\mathbf{R}_{2}\right)\\\mathrm{qhat\_up}\left(\mathbf{Q}_{2},\mathbf{R}_{2},\mathbf{R}_{1}\right)\end{matrix}\right)$$,

$$\mathbf{\hat{R}}=\mathrm{rhat}\left(\mathbf{R}_{1},\mathbf{R}_{2}\right)$$

This is our solution for (2) in the trivial case (z=2).
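To make this concrete, here is a minimal NumPy sketch of rhat(), qhat_down() and qhat_up(). One hedge: np.linalg.qr uses Householder reflections rather than Givens rotations, so its full orthogonal factor merely stands in for $\prod\mathrm{givens\_qr}$ (both yield a valid QR of the stacked pair, which is all the identities above require); this is a sketch of the math, not of the actual Mahout solver:

    import numpy as np

    def merge_qr(r_top, r_bot):
        # Full QR of two stacked n x n upper-triangular blocks. Returns
        # (g, r_hat): g is the 2n x 2n orthogonal factor standing in for
        # the product of Givens operations, r_hat the merged n x n R.
        n = r_top.shape[1]
        g, r_full = np.linalg.qr(np.vstack([r_top, r_bot]), mode='complete')
        return g, r_full[:n, :]

    def rhat(r_i, r_j):
        return merge_qr(r_i, r_j)[1]

    def qhat_down(q_i, r_i, r_j):
        # [(Q_i  Z) * prod givens_qr(R_i over R_j)](:, 1:n)
        n = r_i.shape[1]
        return q_i @ merge_qr(r_i, r_j)[0][:n, :n]

    def qhat_up(q_i, r_i, r_j):
        # [(Z  Q_i) * prod givens_qr(R_j over R_i)](:, 1:n)
        n = r_i.shape[1]
        return q_i @ merge_qr(r_j, r_i)[0][n:, :n]

    # z = 2 check: (Q1 R1 ; Q2 R2) == (Qhat1 ; Qhat2) Rhat
    rng = np.random.default_rng(0)
    q1, r1 = np.linalg.qr(rng.standard_normal((8, 3)))
    q2, r2 = np.linalg.qr(rng.standard_normal((6, 3)))
    q_hat = np.vstack([qhat_down(q1, r1, r2), qhat_up(q2, r2, r1)])
    assert np.allclose(q_hat @ rhat(r1, r2),
                       np.vstack([q1 @ r1, q2 @ r2]))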

I will show the solution for z=3 and then give the final solution without proof.

Case z=3:

Note that the functions qhat_up() and qhat_down() also implicitly produce intermediate rhat() products during their computation, so we want to 'enhance' them to capture that rhat() result as well. We will denote the 'evolving' intermediate results Q and R as a pair $$\left(\mathbf{\tilde{Q}},\mathbf{\tilde{R}}\right)$$:

$$\mathrm{qrhat\_down}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left(\mathbf{\tilde{Q}},\mathbf{\tilde{R}}\right)=\left(\mathrm{qhat\_down}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right),\mathrm{rhat}\left(\mathbf{R}_{i},\mathbf{R}_{j}\right)\right)$$,

$$\mathrm{qrhat\_up}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left(\mathbf{\tilde{Q}},\mathbf{\tilde{R}}\right)=\left(\mathrm{qhat\_up}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right),\mathrm{rhat}\left(\mathbf{R}_{j},\mathbf{R}_{i}\right)\right)$$.

Then, using those notations, we can write the solution of (2) for z=3 as

$$\mathbf{\hat{Q}}=\left(\begin{matrix}\mathbf{\hat{Q}}_{1}\\\mathbf{\hat{Q}}_{2}\\\mathbf{\hat{Q}}_{3}\end{matrix}\right)=\left(\begin{matrix}\mathrm{qrhat\_down}\left(\mathrm{qrhat\_down}\left(\mathbf{Q}_{1},\mathbf{R}_{1},\mathbf{R}_{2}\right),\mathbf{R}_{3}\right).\mathbf{\tilde{Q}}\\\mathrm{qrhat\_down}\left(\mathrm{qrhat\_up}\left(\mathbf{Q}_{2},\mathbf{R}_{2},\mathbf{R}_{1}\right),\mathbf{R}_{3}\right).\mathbf{\tilde{Q}}\\\mathrm{qrhat\_up}\left(\mathbf{Q}_{3},\mathbf{R}_{3},\mathrm{rhat}\left(\mathbf{R}_{1},\mathbf{R}_{2}\right)\right).\mathbf{\tilde{Q}}\end{matrix}\right)$$.

Note that the algorithm for computing $$\mathbf{\hat{Q}}_i$$ requires as input only Qi and iterator[(R1,R2...Rz)].

The general solution of (2) for any z is built by induction as the following algorithm: fold the R blocks above block i into a single upper-triangular via successive rhat() products, merge 'up' once with that result, and then keep merging 'down' with each remaining R block below, exactly the pattern visible in the z=3 case.
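In NumPy terms (reusing merge_qr(), rhat(), qhat_down() and qhat_up() from the z=2 sketch above), one way to write that induction is the following sketch; note it recomputes the same rhat() folds for every block, which is exactly the inefficiency discussed below:

    def qhat_block(i, q_i, r_seq):
        # Qhat_i of (2) from Q_i and the sequence (R_1 ... R_z); i is 0-based.
        if i == 0:
            q, r = q_i, r_seq[0]
        else:
            r_above = r_seq[0]
            for r_next in r_seq[1:i]:          # rhat-fold of R_1 .. R_{i-1}
                r_above = rhat(r_above, r_next)
            q = qhat_up(q_i, r_seq[i], r_above)
            r = rhat(r_above, r_seq[i])        # the R~ that qrhat_up carries
        for r_next in r_seq[i + 1:]:           # qrhat_down over the rest
            q, r = qhat_down(q, r, r_next), rhat(r, r_next)
        return q

    # z = 4 check: stacked Qhat blocks times the final Rhat reproduce the input
    rng = np.random.default_rng(1)
    blocks = [rng.standard_normal((7, 3)) for _ in range(4)]
    qrs = [np.linalg.qr(b) for b in blocks]
    r_seq = [r for _, r in qrs]
    q_hat = np.vstack([qhat_block(i, q, r_seq) for i, (q, _) in enumerate(qrs)])
    r_hat = r_seq[0]
    for r in r_seq[1:]:
        r_hat = rhat(r_hat, r)
    assert np.allclose(q_hat @ r_hat, np.vstack(blocks))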


This algorithm is still sub-optimal: for the entire matrix (2) it computes more rhat() operations than needed, and it can be optimized to reduce those operations when considering the final solution for (1), but that's probably too many details for this post for now. The actual details are found in the code and my working notes.

Finally, a word about how to transform the solution for (2) into a solution for (1).

We can rewrite the indices in (1) as

$$\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{1}\\\mathbf{Q}_{2}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{2}\\\vdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}_{z}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)\equiv\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{11}\\\mathbf{Q}_{12}\\\vdots\\\mathbf{Q}_{1k}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}_{21}\\\mathbf{Q}_{22}\\\vdots\\\mathbf{Q}_{2k}\end{matrix}\right)\mathbf{R}_{2}\\\vdots\\\left(\begin{matrix}\mathbf{Q}_{n1}\\\mathbf{Q}_{n2}\\\vdots\\\mathbf{Q}_{nk}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)$$

And then regroup that into k independent tasks solving

$$\left(\begin{matrix}\mathbf{Q}_{1i}\mathbf{R}_{1}\\\mathbf{Q}_{2i}\mathbf{R}_{2}\\\vdots\\\mathbf{Q}_{ni}\mathbf{R}_{n}\end{matrix}\right)\Rightarrow\left(\begin{matrix}\mathbf{\hat{Q}}_{1i}\\\mathbf{\hat{Q}}_{2i}\\\vdots\\\mathbf{\hat{Q}}_{ni}\end{matrix}\right)\hat{\mathbf{R}}$$,
$$i\in1..k,$$

each of which can be solved via the solution for (2).

That may also serve as one of the parallelization strategies.
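A sketch of that regrouping, reusing qhat_block() from above (the outer list comprehension marks the hypothetical task boundary; k is arbitrary here, since rows can be regrouped freely, as noted at the end of Part II):

    def parallel_qhat(q_blocks, r_seq, k):
        # Split every Q block row-wise into k subblocks; task i owns the
        # i-th subblock of each Q and needs only the shared R sequence.
        subs = [np.array_split(q, k) for q in q_blocks]
        return [[qhat_block(j, subs[j][i], r_seq)
                 for j in range(len(q_blocks))]
                for i in range(k)]             # one inner list per task

For each block j, stacking its k sub-results back together row-wise reproduces $\mathbf{\hat{Q}}_j$, since every subblock is multiplied by the same n×n factors.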

That wasn't so hard, was it? It's possible I am performing a subpar re-tracing of some existing research or method here. But I don't think there's much work on how to fit QR onto parallel batch machinery.

I hope some horizons have perhaps become a little clearer.

SSVD Command Line usage

Here's the doc, also attached to Mahout-593. At some point a wiki update is due, once we know what it is all good for.



Also, from my email regarding the -s parameter:

There are 2 cases where you might want to adjust -s:

1 -- if you are running really huge input that produces more than 1000 or so map tasks and/or is causing OOM in some tasks in some situations. It looks like your input is far from that now.

2 -- if you have quite wide input -- realistically more than 30k non-zero elements in a row. The way the current algorithm works, it tries to do blocking QR of stochastically projected rows in the mappers, which means it needs to read at least k+p rows in each split (map task). This can be fixed, and I have a branch that should eventually address this. In your case, if there happen to be splits that contain fewer than 110 rows of input, that would be the case where you might want to start setting -s greater than the DFS block size (64 MB), but it has no effect if it's less than that (which is why Hadoop calls it _minimum_ split size). I don't remember Hadoop's definition of this parameter; I think it is in bytes, so that means you probably need to specify something like 100,000,000 to start seeing a decrease in the number of map tasks. But honestly I never tried this yet, since I never had input wide enough to require it.
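For concreteness, a hypothetical invocation (the only parameter under discussion is -s, in bytes; -k 100 -p 10 is an assumption matching the 110-row example above, and exact option names may differ between Mahout versions):

    bin/mahout ssvd -i /path/to/input -o /path/to/output -k 100 -p 10 -s 100000000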

Saturday, March 26, 2011

Streaming QR decomposition for MapReduce part II: Bottom-up divide-and-conquer overview

In Part I of "Streaming QR decomposition for MapReduce" I touched a little bit on how the bottom-up collection of Q is done. I would like to add a couple of figures in an attempt to give a few more details and clarify how it is done.

Collecting Q: Divide-And-Conquer: a tad more details


As I mentioned before, the outer QR step is essentially a bottom-up, n-indegree divide-and-conquer algorithm.

\[\mathbf{Y}=\left(\begin{matrix}\cdots\\\mathbf{Y}_{i}\\\mathbf{Y}_{i+1}\\\mathbf{Y}_{i+2}\\\cdots\end{matrix}\right)=\begin{matrix}\cdots\\\left.\begin{matrix}\mathbf{Q} & \mathbf{R}\\\mathbf{Q} & \mathbf{R}\\\cdots & \cdots\\\mathbf{Q} & \mathbf{R}\end{matrix}\right]\\\\\left.\begin{matrix}\mathbf{Q} & \mathbf{R}\\\mathbf{Q} & \mathbf{R}\\\cdots & \cdots\\\mathbf{Q} & \mathbf{R}\end{matrix}\right]\\\\\left.\begin{matrix}\mathbf{Q} & \mathbf{R}\\\mathbf{Q} & \mathbf{R}\\\cdots & \cdots\\\mathbf{Q} & \mathbf{R}\end{matrix}\right]\\\cdots\end{matrix}\Rightarrow\begin{matrix}\cdots\\\left.\begin{matrix}\cdots & \mathbf{\cdots}\\\left.\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right] & \mathbf{R}\\\\\left.\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right] & \mathbf{R}\\\\\left.\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right] & \mathbf{R}\\\cdots & \cdots\end{matrix}\right]\\\cdots\end{matrix}\Rightarrow\cdots\Rightarrow\begin{matrix}\left.\begin{matrix}\mathbf{\cdots}\\\mathbf{\hat{Q}}\\\mathbf{\hat{Q}}\\\cdots\\\mathbf{\hat{Q}}\\\\\mathbf{\hat{Q}}\\\mathbf{\hat{Q}}\\\cdots\\\mathbf{\hat{Q}}\\\\\mathbf{\hat{Q}}\\\mathbf{\hat{Q}}\\\cdots\\\mathbf{\hat{Q}}\\\cdots\end{matrix}\right] & \hat{\mathbf{R}}\end{matrix}\]
The way generalized Givens thin QR works, it applies a number of Givens transformations to a Q-accumulator matrix, given some initial tall matrix (in this case, Y blocks), until the tall matrix is reduced to the form $\left(\begin{matrix}\mathbf{R}\\\mathbf{Z}\end{matrix}\right)$, where Z has all zero elements and R is upper-triangular. The initial value for the Q-accumulator matrix is chosen as I (the square identity matrix). To transform the result into a 'thin' QR result, only the first n columns of the accumulator are taken (which becomes Q of the decomposition), and the R part is taken as the second matrix of the thin decomposition.

In our case, in order to enable recursive use of Givens, we generalize the standard Givens algorithm by making it accept two parameters: a pre-existing 'thick' Q accumulator instead of I (in reality, only shifting a 'thin' Q accumulator is sufficient) and the tall input matrix as the second input.
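A minimal NumPy sketch of that generalized Givens QR (didactic only: it keeps a full m×m accumulator, whereas the actual solver streams rows and only ever shifts a thin accumulator):

    import numpy as np

    def givens(a, b):
        # rotation (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]
        r = np.hypot(a, b)
        return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

    def givens_qr(y, q_accum=None):
        # Thin QR of a tall m x n block (m > n); q_accum is the optional
        # pre-existing Q accumulator that replaces I.
        m, n = y.shape
        r = y.astype(float).copy()
        q = np.eye(m) if q_accum is None else q_accum.astype(float).copy()
        for j in range(n):
            for i in range(m - 1, j, -1):        # zero y[i, j] below diagonal
                c, s = givens(r[j, j], r[i, j])
                g = np.array([[c, s], [-s, c]])
                r[[j, i], :] = g @ r[[j, i], :]      # R <- G^T R
                q[:, [j, i]] = q[:, [j, i]] @ g.T    # Q <- Q G (accumulate)
        return q[:, :n], np.triu(r[:n, :])       # thin Q, upper-triangular R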

The bottom-up divide-and-conquer algorithm can hence be described as follows:
  • First, we split all input (Y) into vertical blocks (see the illustration above). For each block, we compute a slightly modified version of standard Givens QR which is optimized for sequential (streaming) access to the tall input block's elements without running into big memory requirements. The output of this step is a bunch of (Q,R) pairs (each pair corresponding to an original block). For the benefit of the subsequent step, we will consider this block data as ({Q}, R), where the sequence of Q blocks is denoted as {Q} and so far contains only one matrix. (Actually, math texts normally use parentheses () to distinguish sequences from sets, but I feel the {} notation is much more expressive in this case.)

  • Second, we group the input of the form ({Q}, R) into new groups, each group thus denoted as {{Q},R}, such that the number of R in each group doesn't exceed a certain maximum bottom-up indegree limit N (usually 1000 for 1G RAM solvers).

  • Third, for each group formed in step 2 above, we run a row-wise Givens solver, which produces ("merges") a valid Givens solution for each group, {{Q},R} → {Q},R. Essentially this solves the "QR reduction" at the nodes of the divide-and-conquer tree,
    $$\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{1}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\vdots\\\left(\begin{matrix}\mathbf{Q}\\\vdots\\\mathbf{Q}_{z}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)\Rightarrow\left(\begin{matrix}\mathbf{\hat{Q}}\\\vdots\\\mathbf{\hat{Q}}\end{matrix}\right)\mathbf{\hat{R}},$$
    which is essentially equivalent to collapsing a number of individual block-wise QR decompositions over one or more initial blocks of Y into a single one.
    This algorithm is called 'compute QHatSequence2' (I think) in my detailed notes and is the one that is built by induction. The hat sign denotes final QR blocks. I plan to discuss the details of that algorithm in "part III" of this blog.

  • Fourth, we consider the results of step 3 to form a sequence again, restart from step 2 and repeat from there until the number of groups is 1, which produces our final Q blocks and a single R as the result. (A compact sketch of this loop follows right after this list.)
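Reusing merge_qr() from the Part III sketches above, the whole loop (steps 1-4) can be sketched as follows; again, this illustrates the math, not the actual MapReduce implementation:

    def merge_group(group):
        # Merge a group [({Q}, R), ...] into a single ({Q}, R) pair.
        qs, r = group[0]
        qs = list(qs)
        for qs_next, r_next in group[1:]:
            n = r.shape[1]
            g, r = merge_qr(r, r_next)
            qs = [q @ g[:n, :n] for q in qs] + \
                 [q @ g[n:, :n] for q in qs_next]
        return qs, r

    def collect_q(y_blocks, N):
        # Bottom-up divide-and-conquer with indegree limit N.
        nodes = []
        for y in y_blocks:                    # step 1: block-wise QRs
            q, r = np.linalg.qr(y)
            nodes.append(([q], r))
        while len(nodes) > 1:                 # steps 2-4: group and merge
            nodes = [merge_group(nodes[i:i + N])
                     for i in range(0, len(nodes), N)]
        return nodes[0]                       # ({Qhat blocks}, Rhat)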
Parallelization Strategy
  1. We can consider the iterations above from the point of view of an individual block Yi as a set of isolated parallel steps. We can view the entire computation as a series of independent computations over Qi, {R1...Rn}. It also turns out we always consume the sequence of R in the same order and never have to load more than one upper-triangular matrix into memory, so the parameters of such an algorithm can actually be [Qi, iterator{R1...Rn}]. The algorithm produces a new version of the Q block, and the last solver in a group produces a new R item for the next step's R-sequence (as can be demonstrated further on). Then the solver is reloaded with the next R sequence and runs again (until we are left with just one R). That would be the next map-only job run, but each run reduces the total number of Rs thousands of times, so 3 map-only runs can handle 1 billion blocks. (Besides, the last run is really combined with the next step, the computation of $\mathbf{B}^{\top}$, so we save at least one iteration setup overhead here.)
  2. Another small enhancement is that it turns out each Q computation doesn't need the entire {R1...Rn}. Let's rewrite the Q block indexes as a 2-variable index to reflect their group and subgroup indices:\[\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{1}\\\mathbf{Q}_{2}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{2}\\\vdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}_{z}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)\equiv\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{11}\\\mathbf{Q}_{12}\\\vdots\\\mathbf{Q}_{1k}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}_{21}\\\mathbf{Q}_{22}\\\vdots\\\mathbf{Q}_{2k}\end{matrix}\right)\mathbf{R}_{2}\\\vdots\\\left(\begin{matrix}\mathbf{Q}_{n1}\\\mathbf{Q}_{n2}\\\vdots\\\mathbf{Q}_{nk}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right).\] It turns out that the computation over Q1i and Q2i requires the entire {R1...Rn}, but the computation over Q3i requires only $\left\{ \mathrm{GivensQR}\left[\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right].\mathbf{R},\mathbf{R}_{3}...\mathbf{R}_{n}\right\}$, i.e. only n-1 upper-triangulars. Going on, the computation over Q4i requires the sequence $\left\{ \mathrm{GivensQR}\left[\left(\begin{matrix}\mathrm{GivensQR}\left[\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right].\mathbf{R}\\\mathbf{R}_{3}\end{matrix}\right)\right].\mathbf{R},\mathbf{R}_{4}...\mathbf{R}_{n}\right\}$. And so on, with Qni requiring only two upper-triangular arguments. That means we quite legally can split the computation into at most k independent jobs, the i-th parallel job computing blocks Q1i, Q2i ... Qni in sequence while also reducing the R sequence per the above. To aid parallelization even more, we can actually choose k to be whatever we want: inside the algorithm, the matrix $\left(\begin{matrix}\mathbf{Q}_{i1}\\\mathbf{Q}_{i2}\\\vdots\\\mathbf{Q}_{ik}\end{matrix}\right)$ is only ever transformed by applying column-wise Givens operations and horizontal shifts. Hence, it doesn't matter how it is split into blocks; we can regroup its rows in any way to come up with a degree of parallelism k we are comfortable with.
Looks pretty complicated, huh? I guess it's not the most complicated part yet, though. Besides, I haven't implemented the bottom-up approach 100% yet; I only implemented a 2-step hierarchy (i.e. we can have n×n blocks initially only). I think it is kind of enough for what my company does, but it can be extended to insert more steps in between to enable as many blocks as we want.

We've just started on the path uphill.