Weathering Thru Tech Days: 2016

Friday, March 11, 2016

AM: BMR Book Errata

3/9/2016

Section 4.11.1

The `rowSums` formula should read
\[
\mathbf{A1}=\sum_{i}^{n}\mathbf{A}_{*i},
\]
and the `colSums` formula should read
\[
\mathbf{A}^{\top}\mathbf{1}=\sum_{i}^{m}\mathbf{A}_{i*}
\]
Note: it is a little bit confusing. Following R semantics, the rowSums() method means "row-wise sum" which is the same as "sum of columns"; and vice versa, the colSums() method means "column-wise sum", which can be computed as the sum of matrix rows.

Section 6.1

The dimensions of matrix $$\mathbf{V}$$ are $$\mathbf{V}\in\mathbb{R}^{n\times k}$$.
Formula (6.1) should read
\[
\begin{equation}
\boldsymbol{a}_{pca}=\mathbf{V}^{\top}\left(\boldsymbol{a}-\boldsymbol{\mu}\right).\label{eq:to-pca}
\end{equation}
\]
Formula (6.2) - (6.3) should read
\[
\begin{eqnarray}
\boldsymbol{a} & = & \left(\mathbf{V}^{\top}\right)^{\text{-}1}\boldsymbol{a}_{pca}+\boldsymbol{\mu} \\
& = & \mathbf{V}\boldsymbol{a}_{pca}+\boldsymbol{\mu}.\label{eq:from-pca}
\end{eqnarray}
\]

3/12/2016

Section 8.3

In Step (1): Setup working directories and acquire data, the URL of the Wikipedia XML dump has since changed. The third command of Step (1) should read:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p002336425p003046511.bz2 -o $WORK_DIR/wikixml/enwiki-latest-pages-articles.xml.bz2

We keep the Kindle edition updated with the Errata. If you have ever bought the print version from Amazon, the Kindle version is free via Amazon MatchBook. If you already have a Kindle version, you should be able just to reload it with the updated one.

Saturday, March 5, 2016

Mahout "Samsara" book prerequisites

Q: "... Any tips on prerequisite reading?"

A: To quote the preface of the book:

"Some material assumes an undergraduate level understanding of calculus and on occasion multivariate calculus. The examples are given in Scala and assume familiarity with basic Scala terminology and a minimal notion of functional programming."

Let's elaborate.

There are two sides: the technology side and the math side.

For the technology side, the reader would benefit from some Scala fluency and functional programming. Public Scala material presented on http://scala-lang.org/ gets one a long way, plus of course there's always a longer book by Martin Odersky, et. al., "Programming in Scala".

For the math side, note that we do not explain math. We do not ask ourselves the question "Can I understand it [why it works]?", but rather we ask the question "Can I try it?". A lot of practical research is working exactly just like that -- by asking ourselves the question "What if I try this or that?". So it is not so much about math but rather about math notations, which are for our purposes is pseudocode that we turn into code. Math notations are sufficiently explained in the section A.2; if you have a concrete question beyond what is there, please ask -- we will try to answer.

Beyond notations, it would help to be familiar with the undergraduate level of linear algebra and calculus (after all, we are trying to deal with applied machine learning).

For linear algebra, a good reference is the textbook by Gilbert Strang, "Introduction to Linear Algebra". Algorithms we illustrate at times rely on working knowledge of singular value decomposition (SVD), eigendecomposition, Cholesky decomposition and QR decomposition, as well as Four Fundamental subspaces.

It has been some time since I studied multivariate calculus, so I am not quite sure what the best text is on it these days. I have the "Multivariate Calculus" book by Larson Edwards, which is pretty thorough in my opinion. We do not need all of it though; our book examples touch on very few notions of multivariate calculus -- partial derivatives, gradient, Hessian matrix. As a refresher, perhaps even reading Wikipedia articles on these issues is enough.

Of course when one works on a particular algorithm, she or he needs to read the publication containing the algorithm formulation. That is one of the things that we are trying to demonstrate: how to read and implement algorithm formulations using Mahout "Samsara". We give references to the algorithm publications as appropriate, throughout the book.

-DL

Tuesday, February 23, 2016

Mahout "Samsara" Book Is Out

After many months of work, our book is finally out.

Here is our formal announcement:

We are happy to announce that Mahout Samsara is finally documented in print. The book title is “Apache Mahout: Beyond MapReduce.”

Similar to other books on computer science or languages such as R, Python, or C++, the authors are hoping to put the reader on the path of designing and creating his or her own algorithms. Also included are tutorials and practical usage information about newer Mahout algorithms.

The emphasis is on Machine Learning algorithm design aspects in the context of Apache Mahout “Samsara” releases (0.10, 0.11). If there were another suitable name for this book, it might be “Beyond the Black Box”: the discussions go beyond the scope of just the Mahout environment or Mahout algorithms and touch on more general concepts for devising algorithms in the context of massive inputs and computations, using Mahout Samsara as a semantical implementation medium for the solutions.

Mahout “Samsara” currently targets H2O, Apache Flink and Apache Spark as backend execution options. As work on the Apache Flink backend is still in progress, all examples, while being execution engine agnostic (with one intended exception), are set up to run with the Apache Spark backend.

This work has been greatly helped by valuable reviews and ideas from other Mahout committers, contributors, and industry professionals, as indicated in the “Acknowledgments” section of the preface (Thank you!).

This book does not discuss legacy MapReduce-based algorithms.

Thank you for using Mahout!

Dmitriy Lyubimov (@dlieuOfTwit)
Andrew Palumbo (@andy_palumbo)

Technical info:

There are two editions of the book: a black and white paperback and a full color Kindle textbook.

The Kindle textbook completely preserves the layout of the paperback edition. It is enrolled in the Amazon “Matchbook” program (free with the purchase of a print copy when ordered via Amazon.com).

The paperback edition has been optimized specifically for a black-and-white print. The format is 7x10in. (a common textbook size).

Code examples are available on GitHub.

Where:

Amazon.

Update: Createspace now has closed their online book store it seems, and just redirects to Amazon, so we cannot seem to set discounts any longer thru them.
~~In the US a paperback-only copy can also be purchased here with a 25% off code QLZ8DLPL.~~

Antother update: the electronic copy can now be obtained here at no cost. Of course Amazon is still the way to have the paper version or on Kindle, but they are no longer the exclusive distributor. Please note that the copyright applies (no redistribution beyond this blog; no reproduction beyond fair use). If you want to link to the book, we will appreciate if you link this blog post (https://weatheringthrutechdays.blogspot.com/2016/02/mahout-samsara-book-is-out.html), rather than the direct download. Thank you.

Post Scriptum

Thank you for reading!

See also: Book prerequisites, Errata updates