Collecting Q: Divide-And-Conquer: a tad more details

As I mentioned before, the outer QR step is essentially a bottom-up *n*-indegree divide-and-conquer algorithm.

\[\mathbf{Y}=\left(\begin{matrix}\cdots\\\mathbf{Y}_{i}\\\mathbf{Y}_{i+1}\\\mathbf{Y}_{i+2}\\\cdots\end{matrix}\right)\Rightarrow\left(\begin{matrix}\cdots\\\mathbf{Q}_{i}\mathbf{R}_{i}\\\mathbf{Q}_{i+1}\mathbf{R}_{i+1}\\\mathbf{Q}_{i+2}\mathbf{R}_{i+2}\\\cdots\end{matrix}\right)\Rightarrow\left(\begin{matrix}\cdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}\\\cdots\end{matrix}\right)\Rightarrow\cdots\Rightarrow\left(\begin{matrix}\cdots\\\hat{\mathbf{Q}}\\\hat{\mathbf{Q}}\\\cdots\\\hat{\mathbf{Q}}\\\cdots\end{matrix}\right)\hat{\mathbf{R}}\]
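For what it's worth, the merge identity behind this picture can be checked numerically in a few lines of NumPy. This is just my own illustration: `np.linalg.qr` stands in for the streaming Givens solver, and all variable names here are mine, not the solver's.

```python
# Check that merging two block-wise thin QRs reproduces a thin QR of the
# whole tall matrix Y = [Y1; Y2].  (Illustration only, not the solver code.)
import numpy as np

rng = np.random.default_rng(0)
n = 4                                            # thin (column) dimension
Y1 = rng.standard_normal((50, n))
Y2 = rng.standard_normal((50, n))
Y = np.vstack([Y1, Y2])

# Leaf step: thin QR of each vertical block independently.
Q1, R1 = np.linalg.qr(Y1)
Q2, R2 = np.linalg.qr(Y2)

# Merge step: QR of the stacked upper-triangular factors.
Qm, R = np.linalg.qr(np.vstack([R1, R2]))

# Each merged Q block is the leaf Q times the matching n-row slab of Qm.
Qhat = np.vstack([Q1 @ Qm[:n], Q2 @ Qm[n:]])

assert np.allclose(Qhat @ R, Y)                  # valid decomposition
assert np.allclose(Qhat.T @ Qhat, np.eye(n))     # orthonormal columns
```

Note that the merge never touches the original tall blocks again: only the small *n×n* triangular factors and the per-block **Q**s are needed.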

The way generalized Givens thin QR works, it applies a number of Givens rotations to a Q-accumulator matrix and to some initial tall matrix (in this case, the **Y** blocks) until the tall matrix is reduced to the form $\left(\begin{matrix}\mathbf{R}\\\mathbf{Z}\end{matrix}\right)$, where **Z** has all zero elements and **R** is upper-triangular. The initial value of the Q-accumulator matrix is chosen as **I** (the square identity matrix). To transform the result into a 'thin' QR result, only the first *n* columns of the accumulator are taken (these become the **Q** of the decomposition), and the **R** part is taken as the second matrix of the thin decomposition.

In our case, in order to enable recursive use of Givens, we generalize the standard Givens algorithm by making it accept two parameters: a pre-existing 'thick QR' **Q** accumulator instead of **I** (in reality, only shifting the 'thin' **Q** accumulator is sufficient), and the tall input matrix as the second input.
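The two-parameter generalization can be sketched like this (again my own dense illustration, not the solver code): the rotations that reduce the tall input are simultaneously applied, from the right, to whatever accumulator is passed in, rather than to a fresh identity.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def givens_qr_acc(Q_acc, B):
    """Generalized Givens thin QR: reduce tall B (m x n) to upper-triangular
    R while applying the same rotations to a pre-existing accumulator
    Q_acc (p x m) instead of the identity.  Returns the first n columns of
    the updated accumulator, and R."""
    B = B.astype(float).copy()
    Q = Q_acc.astype(float).copy()
    m, n = B.shape
    for j in range(n):
        for i in range(m - 1, j, -1):
            c, s = givens(B[i - 1, j], B[i, j])
            G = np.array([[c, s], [-s, c]])
            B[[i - 1, i], :] = G @ B[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q[:, :n], np.triu(B[:n])
```

Passing a block-diagonal stack of previously computed **Q**s as `Q_acc` and the stacked **R**s as `B` is exactly the merge step used below.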

The bottom-up divide-and-conquer algorithm can hence be described as follows:

- First, we split the entire input (**Y**) into vertical blocks (see the illustration above). For each block, we compute a slightly modified version of the standard Givens **QR** that is optimized for sequential (streaming) access to the tall input block's elements without running into big memory requirements. The output of this step is a set of (**Q**, **R**) pairs, each pair corresponding to an original block. For the benefit of the subsequent step, we will write this block data as ({**Q**}, **R**), where the sequence of **Q** blocks is denoted {**Q**} and initially contains only one matrix. (Mathematical convention normally uses parentheses () to distinguish sequences from sets, but I feel the {} notation is much more expressive in this case.)

- Second, we group the inputs of the form {**Q**}, **R** into new groups, each denoted {{**Q**}, **R**}, such that the number of **R** matrices in each group doesn't exceed a certain maximum bottom-up indegree limit N (usually 1000 for 1G-RAM solvers).

- Third, for each group formed in step 2 above, we run a row-wise Givens solver, which produces ("merges") a valid Givens solution for each group, {{**Q**}, **R**} → {**Q**}, **R**. Essentially this solves the "QR reduction" at the nodes of the divide-and-conquer tree: \[\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\cdots\\\left(\begin{matrix}\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)\Rightarrow\left(\begin{matrix}\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R},\] which is essentially equivalent to collapsing a number of individual block-wise **QR** decompositions over one or more initial blocks of **Y** into a single one.

- Fourth, we consider the results of step 3 to form a sequence again, restart from step 2 and repeat from there until the number of groups is 1, which produces our final **Q** blocks and the single **R** as the result.

- We can consider the iterations above from the point of view of an individual block **Y**_{i} as a set of isolated parallel steps. We can view the entire computation as a series of independent computations over **Q**_{i}, {**R**_{1}...**R**_{n}}. It also turns out we always consume the sequence of **R** in the same order and never have to load more than one upper-triangular matrix into memory, so the parameters of such an algorithm can actually be [**Q**_{i}, iterator{**R**_{1}...**R**_{n}}]. The algorithm produces a new version of the **Q** block, and the last solver in a group produces a new **R** item for the next step's **R**-sequence (as will be demonstrated further on). Then the solver is reloaded with the next **R** sequence and runs again (until we are left with just one **R**). That would be the next map-only job run, but each run reduces the total number of **R**s thousands of times, so 3 map-only runs can handle 1 billion blocks (besides, the last run is really combined with the next step, the computation of $\mathbf{B}^{\top}$, so we save at least one iteration-setup overhead here).

- Another small enhancement: it turns out each **Q** computation doesn't need the entire {**R**_{1}...**R**_{n}}. Let's rewrite the **Q** block indices as a 2-variable index to reflect their group and subgroup indices:\[\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{2}\\\cdots\\\left(\begin{matrix}\mathbf{Q}\\\mathbf{Q}\\\cdots\\\mathbf{Q}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right)\equiv\left(\begin{matrix}\left(\begin{matrix}\mathbf{Q}_{11}\\\mathbf{Q}_{12}\\\cdots\\\mathbf{Q}_{1k}\end{matrix}\right)\mathbf{R}_{1}\\\left(\begin{matrix}\mathbf{Q}_{21}\\\mathbf{Q}_{22}\\\cdots\\\mathbf{Q}_{2k}\end{matrix}\right)\mathbf{R}_{2}\\\cdots\\\left(\begin{matrix}\mathbf{Q}_{n1}\\\mathbf{Q}_{n2}\\\cdots\\\mathbf{Q}_{nk}\end{matrix}\right)\mathbf{R}_{n}\end{matrix}\right).\] It turns out that computation over **Q**_{1i} and **Q**_{2i} requires the entire {**R**_{1}...**R**_{n}}, but computation over **Q**_{3i} requires only $\left\{ \mathrm{GivensQR}\left[\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right].\mathbf{R},\mathbf{R}_{3}...\mathbf{R}_{n}\right\}$, i.e. only *n*−1 upper-triangulars. Going on, computation over **Q**_{4i} requires the sequence $\left\{ \mathrm{GivensQR}\left[\left(\begin{matrix}\mathrm{GivensQR}\left[\left(\begin{matrix}\mathbf{R}_{1}\\\mathbf{R}_{2}\end{matrix}\right)\right].\mathbf{R}\\\mathbf{R}_{3}\end{matrix}\right)\right].\mathbf{R},\mathbf{R}_{4}...\mathbf{R}_{n}\right\}$. And so on, with **Q**_{ni} requiring only two upper-triangular arguments. That means we can quite legitimately split the computation into at most *k* independent jobs, the *i*-th parallel job computing the blocks **Q**_{1i}, **Q**_{2i}...**Q**_{ni} in sequence while also reducing the **R** sequence per the above.
To aid parallelization even more, we can actually choose *k* to be whatever we want: inside the algorithm, the matrix $\left(\begin{matrix}\mathbf{Q}_{i1}\\\mathbf{Q}_{i2}\\\cdots\\\mathbf{Q}_{ik}\end{matrix}\right)$ is only ever transformed by applying column-wise Givens operations and horizontal shifts. Hence it doesn't matter how it is split into blocks; we can regroup the rows in any way to arrive at a degree of parallelism *k* we are comfortable with.
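The four steps above can be sketched end-to-end as follows. This is a dense NumPy toy of mine, not the actual map-reduce implementation: `np.linalg.qr` stands in for the row-wise Givens solver, and `merge_group`/`bottom_up_qr` are hypothetical names.

```python
import numpy as np

def merge_group(pairs):
    """Step 3: merge a group of ({Q}, R) pairs into one ({Q}, R) by
    QR-ing the stacked R factors and applying the resulting orthogonal
    transform to the corresponding Q blocks."""
    n = pairs[0][1].shape[0]
    Qm, R = np.linalg.qr(np.vstack([R for _, R in pairs]))
    q_blocks = []
    for g, (qs, _) in enumerate(pairs):
        slab = Qm[g * n:(g + 1) * n]        # n rows of Qm matching this pair's R
        q_blocks.extend(q @ slab for q in qs)
    return q_blocks, R

def bottom_up_qr(y_blocks, indegree=2):
    """Thin QR of vertically stacked y_blocks, bottom-up with a given
    maximum indegree.  Returns (list of Q blocks, final R)."""
    # Step 1: leaf QR of every vertical block of Y.
    pairs = [([q], r) for q, r in (np.linalg.qr(y) for y in y_blocks)]
    # Steps 2-4: group and merge until a single ({Q}, R) remains.
    while len(pairs) > 1:
        pairs = [merge_group(pairs[i:i + indegree])
                 for i in range(0, len(pairs), indegree)]
    return pairs[0]
```

With `indegree=2` and 5 blocks this takes three merge rounds (5 → 3 → 2 → 1), mirroring how a real indegree of ~1000 collapses a billion blocks in about three map-only passes.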

Looks pretty complicated, huh? I guess it's not the most complicated part yet, though. Besides, I haven't implemented the bottom-up approach 100% yet; I have only implemented a 2-step hierarchy (i.e. we can initially have only *n×n* **Y** blocks). I think that is kind of enough for what my company does, but it can be extended with more intermediate steps to enable as many blocks as we want. We've just started on the path uphill.
