Weathering Thru Tech Days: March 2016

Friday, March 11, 2016

AM: BMR Book Errata

3/9/2016

Section 4.11.1

The `rowSums` formula should read
\[
\mathbf{A1}=\sum_{i}^{n}\mathbf{A}_{*i},
\]
and the `colSums` formula should read
\[
\mathbf{A}^{\top}\mathbf{1}=\sum_{i}^{m}\mathbf{A}_{i*}
\]
Note: it is a little bit confusing. Following R semantics, the rowSums() method means "row-wise sum" which is the same as "sum of columns"; and vice versa, the colSums() method means "column-wise sum", which can be computed as the sum of matrix rows.

Section 6.1

The dimensions of matrix $$\mathbf{V}$$ are $$\mathbf{V}\in\mathbb{R}^{n\times k}$$.
Formula (6.1) should read
\[
\begin{equation}
\boldsymbol{a}_{pca}=\mathbf{V}^{\top}\left(\boldsymbol{a}-\boldsymbol{\mu}\right).\label{eq:to-pca}
\end{equation}
\]
Formula (6.2) - (6.3) should read
\[
\begin{eqnarray}
\boldsymbol{a} & = & \left(\mathbf{V}^{\top}\right)^{\text{-}1}\boldsymbol{a}_{pca}+\boldsymbol{\mu} \\
& = & \mathbf{V}\boldsymbol{a}_{pca}+\boldsymbol{\mu}.\label{eq:from-pca}
\end{eqnarray}
\]

3/12/2016

Section 8.3

In Step (1): Setup working directories and acquire data, the URL of the Wikipedia XML dump has since changed. The third command of Step (1) should read:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p002336425p003046511.bz2 -o $WORK_DIR/wikixml/enwiki-latest-pages-articles.xml.bz2

We keep the Kindle edition updated with the Errata. If you have ever bought the print version from Amazon, the Kindle version is free via Amazon MatchBook. If you already have a Kindle version, you should be able just to reload it with the updated one.

Saturday, March 5, 2016

Mahout "Samsara" book prerequisites

Q: "... Any tips on prerequisite reading?"

A: To quote the preface of the book:

"Some material assumes an undergraduate level understanding of calculus and on occasion multivariate calculus. The examples are given in Scala and assume familiarity with basic Scala terminology and a minimal notion of functional programming."

Let's elaborate.

There are two sides: the technology side and the math side.

For the technology side, the reader would benefit from some Scala fluency and functional programming. Public Scala material presented on http://scala-lang.org/ gets one a long way, plus of course there's always a longer book by Martin Odersky, et. al., "Programming in Scala".

For the math side, note that we do not explain math. We do not ask ourselves the question "Can I understand it [why it works]?", but rather we ask the question "Can I try it?". A lot of practical research is working exactly just like that -- by asking ourselves the question "What if I try this or that?". So it is not so much about math but rather about math notations, which are for our purposes is pseudocode that we turn into code. Math notations are sufficiently explained in the section A.2; if you have a concrete question beyond what is there, please ask -- we will try to answer.

Beyond notations, it would help to be familiar with the undergraduate level of linear algebra and calculus (after all, we are trying to deal with applied machine learning).

For linear algebra, a good reference is the textbook by Gilbert Strang, "Introduction to Linear Algebra". Algorithms we illustrate at times rely on working knowledge of singular value decomposition (SVD), eigendecomposition, Cholesky decomposition and QR decomposition, as well as Four Fundamental subspaces.

It has been some time since I studied multivariate calculus, so I am not quite sure what the best text is on it these days. I have the "Multivariate Calculus" book by Larson Edwards, which is pretty thorough in my opinion. We do not need all of it though; our book examples touch on very few notions of multivariate calculus -- partial derivatives, gradient, Hessian matrix. As a refresher, perhaps even reading Wikipedia articles on these issues is enough.

Of course when one works on a particular algorithm, she or he needs to read the publication containing the algorithm formulation. That is one of the things that we are trying to demonstrate: how to read and implement algorithm formulations using Mahout "Samsara". We give references to the algorithm publications as appropriate, throughout the book.

-DL