## Sunday, March 27, 2011

### Streaming QR decomposition for MapReduce part III: Collecting Q by Induction

In Part II, as a part of bottom-up divide-and-conquer parallelization strategy, one problem emerged that we need to solve at every node of bottom-up tree:

$\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{1}\\\mathbf{Q}_{2}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{2}\\\vdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\vdots\\\mathbf{Q}_{z}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)=\left(\begin{matrix}\mathbf{\hat{Q}}\\\mathbf{\hat{Q}}\\\vdots\\\mathbf{\hat{Q}}\end{matrix}\right)\mathbf{\hat{R}}$        (1)

Let's simplify the problem for a moment (as it turns out, vertical blocking of Q is trivial as they are always produced by application of column-wise Givens operations which we can 'replay' on any Q block).

$\left(\begin{matrix}\mathbf{Q}_{1}\mathbf{R}_{1}\\\mathbf{Q}_{2}\mathbf{R}_{2}\\\cdots\\\mathbf{Q}_{z}\mathbf{R}_{z}\end{matrix}\right)\Rightarrow\left(\begin{matrix}\mathbf{\hat{Q}}_{1}\\\mathbf{\hat{Q}}_{2}\\\cdots\\\mathbf{\hat{Q}}_{z}\end{matrix}\right)\hat{\mathbf{R}}$     (2)

Also let's decompose standard Givens QR operation in to two:

1) producing Givens transformation sequence on some input $m\times n$, $m>n$  as $\left(\mathbf{G}_{1},\mathbf{G}_{2}...\mathbf{G}_{k}\right)=\mathrm{givens\_qr}\left(\mathbf{A}\right)$.  I will denote product of all Givens operations obtained this way as $\prod_{i}\mathbf{G}_{i}\equiv\prod\mathrm{givens\_qr}\left(\mathbf{A}\right)$.

2) Applying product of Givens operations on input produces already familiar result
$\left(\prod_{i}\mathbf{G}_{i}\right)^{\top}\mathbf{A}=\left(\begin{matrix}\mathbf{R}\\\mathbf{Z}\end{matrix}\right)$

from which it follows that thin QR's R can be written as
$\mathbf{R}=\left[\left(\prod_{i}\mathbf{G}_{i}\right)^{\top}\mathbf{A}\right]\left(1:n,:\right)=\left[\left(\prod\mathrm{givens\_qr}\left(\mathbf{A}\right)\right)^{\top}\mathbf{A}\right]\left(1:n,:\right)$
using Golub/Van Loan's block notation.

(Golub/Van Loan's block notation in form of A(a1:a2, b1:b2) means "crop of a matrix A, rows a1 to a2 and columns b1 to b2". For half- and full-open intervals the bounds are just omitted).

'Thick' Q is produced by $\mathbf{I}\left(\prod_{i}\mathbf{G}_{i}\right)$ and hence 'thin Q' is
$\mathbf{Q}=\left[\mathbf{I}\left(\prod_{i}\mathbf{G}_{i}\right)\right]\left(:,1:n\right)=\left[\mathbf{I}\left(\prod_{i}\mathrm{givens\_qr}\left(\mathbf{A}\right)\right)\right]\left(:,1:n\right),\mathbf{\,\, I}\in\mathbb{R}^{m\times m}.$

Now that we laid out all notations, let's get to the gist. As I mentioned before, I continue building algorithm by induction. Let's consider case (2) for $z=2$. Then it can be demonstrated that the following is the solution as it is equivalent to full Givens QR with a rearranged but legitimate order of Givens operations:

$\mathbf{\hat{R}}=\left\{\left[\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]^{\top}\cdot\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right\} \left(1:n,:\right)$,

$\mathbf{\hat{Q}=}\left[\left(\begin{matrix}\mathbf{Q}_{1} & \mathbf{Z}\\\mathbf{Z} & \mathbf{Q}_{2}\end{matrix}\right)\cdot\prod_{i}\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]\left(:,1:n\right)$.

Let's denote function of 2-block computation of $\mathbf{\hat{R}}$ as

$\mathrm{rhat}\left(\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left\{\left[\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{R}_{j}\end{matrix}\right)\right]^{\top}\cdot\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{R}_{j}\end{matrix}\right)\right\} \left(1:n,:\right)$.

Still working on the trivial case of induction: Now note that since Givens operations for calculation of $\mathbf{\hat{Q}}$ are applied pairwise to the columns of the accumulator matrix, i.e. independently for each row, we can split computation of $\mathbf{\hat{Q}}$ and apply it to any combination of vertical blocks of $\left(\begin{matrix}\mathbf{Q}_{1} & \mathbf{Z}\\\mathbf{Z} & \mathbf{Q}_{2}\end{matrix}\right)$. Let's say we decide to split it into 2 blocks with as many rows as in Q1 and Q2 and get back to block-wise formulas sought for solution of (2):

$\mathbf{\hat{Q}_{1}=}\left[\left(\begin{matrix}\mathbf{Q}_{1} & \mathbf{Z}\end{matrix}\right)\cdot\prod_{i}\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]\left(:,1:n\right),$

$\mathbf{\hat{Q}_{2}=}\left[\left(\begin{matrix}\mathbf{Z} & \mathbf{Q}_{2}\end{matrix}\right)\cdot\prod_{i}\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right]\left(:,1:n\right).$      (3)

Let's denote those transformations in more general form as

$\mathrm{qhat\_down}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left[\left(\begin{matrix}\mathbf{Q}_{i} & \mathbf{Z}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{i}\\\mathbf{R}_{j}\end{matrix}\right)\right]\left(:,1:n\right)$

and

$\mathrm{qhat\_up}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left[\left(\begin{matrix}\mathbf{Z} & \mathbf{Q}_{i}\end{matrix}\right)\cdot\prod\mathrm{givens\_qr}\left(\begin{matrix}\mathbf{R}_{j}\\\mathbf{R}_{i}\end{matrix}\right)\right]\left(:,1:n\right)$

then we can rewrite (3) as

$\mathbf{\hat{Q}}=\left(\begin{matrix}\mathbf{\hat{Q}}_{1}\\\mathbf{\hat{Q}}_{2}\end{matrix}\right)=\left(\begin{matrix}\mathrm{qhat\_down}\left(\mathbf{Q}_{1},\mathbf{R}_{1},\mathbf{R}_{2}\right)\\\mathrm{qhat\_up}\left(\mathbf{Q}_{2},\mathbf{R}_{2},\mathbf{R}_{1}\right)\end{matrix}\right)$,

$\mathbf{\hat{R}}=\mathrm{rhat}\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)$

This is our solution for (2) of trivial case (z=2).

I will show solution for z=3 and then just will give the final solution without a proof.

Case z=3:

Note that functions qhat_up() and qhat_down() also implicitly produce intermediate rhat() products during their computation, so we want to 'enhance' them to capture that rhat() result as well. We will denote 'evolving' intermediate results Q and R as a sequence $\left(\mathbf{\tilde{Q}},\mathbf{\tilde{R}}\right)$:

$\mathrm{qrhat\_down}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left(\mathbf{\tilde{Q}},\mathbf{\tilde{R}}\right)=\left(\mathrm{qhat\_down}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right),\mathrm{rhat}\left(\mathbf{R}_{i},\mathbf{R}_{j}\right)\right)$,

$\mathrm{qrhat\_up}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right)=\left(\mathbf{\tilde{Q}},\mathbf{\tilde{R}}\right)=\left(\mathrm{qhat\_up}\left(\mathbf{Q}_{i},\mathbf{R}_{i},\mathbf{R}_{j}\right),\mathrm{rhat}\left(\mathbf{R}_{j},\mathbf{R}_{i}\right)\right)$.

Then, using those notations, we can write solution for (2) z=3 as

$\mathbf{\hat{Q}}=\left(\begin{matrix}\mathbf{\hat{Q}}_{1}\\\mathbf{\hat{Q}}_{2}\\\mathbf{\hat{Q}}_{3}\end{matrix}\right)=\left(\begin{matrix}\mathrm{qrhat\_down}\left(\mathrm{qrhat\_down}\left(\mathbf{Q}_{1},\mathbf{R}_{1},\mathbf{R}_{2}\right),\mathbf{R}_{3}\right).\mathbf{\tilde{Q}}\\\mathrm{qrhat\_down}\left(\mathrm{qrhat\_up}\left(\mathbf{Q}_{2},\mathbf{R}_{2},\mathbf{R}_{1}\right),\mathbf{R}_{3}\right).\mathbf{\tilde{Q}}\\\mathrm{qrhat\_up}\left(\mathbf{Q}_{3},\mathbf{R}_{3},\mathrm{rhat}\left(\mathbf{R}_{1},\mathbf{R}_{2}\right)\right).\mathbf{\tilde{Q}}\end{matrix}\right)$.

Note that algorithm for computing $\mathbf{\hat{Q}}_i$ requires input of Qi, iterator[(R1,R2...Rz)].

General solution for (2) for any z is built by induction as the following algorithm:

This algorithm is still sub-efficient as for the entire matrix (2) it computes more rhat() operations than needed, and can be optimized to reduce those operations when considering final solution for (1),  but that's probably too many details for this post for now. Actual details are found in code and my working notes.

Finally, a word about how to transform solution for (2) into a solution for (1).

We can rewrite indexes in (1) as

$\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{1}\\\mathbf{Q}_{2}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{2}\\\cdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}_{z}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)\equiv\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{11}\\\mathbf{Q}_{12}\\\cdots\\\mathbf{Q}_{1k}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}_{21}\\\mathbf{Q}_{22}\\\cdots\\\mathbf{Q}_{2k}\end{matrix}\right)\mathbf{R}_{2}\\\cdots\\\left(\begin{matrix}\mathbf{Q}_{n1}\\\mathbf{Q}_{n2}\\\cdots\\\mathbf{Q}_{nk}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)$

And then regroup that into k independent tasks solving

$\left(\begin{matrix}\mathbf{Q}_{1i}\mathbf{R}_{1}\\\mathbf{Q}_{2i}\mathbf{R}_{2}\\\cdots\\\mathbf{Q}_{zi}\mathbf{R}_{z}\end{matrix}\right)\Rightarrow\left(\begin{matrix}\mathbf{\hat{Q}}_{1i}\\\mathbf{\hat{Q}}_{2i}\\\cdots\\\mathbf{\hat{Q}}_{zi}\end{matrix}\right)\hat{\mathbf{R}}$,
$i\in1..k,$

which can be solved via solution for (2).

That may also be one of  parallelization strategies.

That wasn't so hard, was it? It's possible I am performing a subpar re-tracing of some existing research or method here. But I don't think there's much work about how to fit  QR onto parallel batch machinery.

I hope that some horizons perhaps became a little clearer.