Thursday, July 25, 2013

Scala DSL for Mahout in-core linear algebra (proposal)

I've been working on a Scala DSL for Mahout in-core linear algebra.

the code is here:
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala/math-scala

 Here's what i got so far

Current In-Core operations (examples)


Inline initialization


dense vectors 

val denseVec1: Vector = (1.0, 1.1, 1.2)
val denseVec2 = dvec(1, 0, 1.1, 1.2)

sparse vectors 

val sparseVec = svec((5 -> 1) :: (10 -> 2.0) :: Nil)
val sparseVec2: Vector = (5 -> 1.0) :: (10 -> 2.0) :: Nil


matrix inline initialization, either dense or sparse, is always row-wise:

dense matrices  : 

val a = dense((1, 2, 3), (3, 4, 5))


sparse matrices 

   val a = sparse(
      (1, 3) :: Nil,
      (0, 2) :: (1, 2.5) :: Nil
    )

diagonal matrix with constant diagonal elements

diag(10, 3.5) 

diagonal matrix with main diagonal backed by a vector 

diagv((1, 2, 3, 4, 5))

Identity matrix 

eye(10)

element wise operations 

geting vector element 

val d = vec(5)

setting vector element 

vec(5) = 3.0

getting matrix element 

val d = m(3,5)

setting matrix element (setQuick() behind the scenes) 

m(3,5) = 3.0

Getting matrix row or column

val rowVec = m(3, ::) 
val colVec = m(::, 3) 

Setting matrix row or column 

m(3, ::) = (1, 2, 3)
m(::, 3) = (1, 2, 3)

thru vector assignment also works 

m(3, ::) := (1, 2, 3)
m(::, 3) := (1, 2, 3)

subslices of row or vector work too 

a(0, 0 to 1) = (3, 5)

or with vector assignment

a(0, 0 to 1) := (3, 5)


matrix contiguous block as matrix, with assignment 

// block
val b = a(2 to 3, 3 to 4) 

// asignment to a block
a(0 to 1, 1 to 2) = dense((3, 2), (2, 3))

or thru the matrix assignment operator 

a(0 to 1, 1 to 2) := dense((3, 2), (2, 3))

Assignment operator by copying between vectors or matrix 

vec1 := vec2 
m1 := m2 

also works for matrix subindexing notations as per above

Assignment thru a function literal (matrix)

m := ((row, col, x) => if (row == col) 1 else 0)

for a vector the same

vec := ((index, x) => sqrt(x))

BLAS-like ops


plus/minus, either vector or matrix or numeric, with assignment or not

a + b
a - b

a + 5.0
a - 5.0

Hadamard (elementwise) product, either vector or matrix or numeric operands

a * b
a * 5


same things with assignment, matrix, vector or numeric operands 

a += b
a -= b

a += 5.0
a -= 5.0 

a *= b
a *= 5

One nuance here is associativity rules in scala. E.g. 1/x in R (where x is vector or matrix) is elementwise inverse operation and in scala realm would be expressed as

val xInv = 1 /: x

and R's 5.0 - x would be

val x1 = 5.0 -: x

Even trickier and really probably not so obvious stuff :

a -=: b

assigns a - b to b (in-place) and returns b. Similarly for a /=: b or 1 /=: v.

(all assignment operations, including :=,  return the assignee argument just like in C++)

dot product (vector operands) 

a dot b 

matrix /vector equivalency (or non-equivalency). Dangerous, exact equivalence is rarely useful, better use norm comparisons with admission of small errors 

a === b
a !== b

Matrix multiplication (matrix operands)

a %*% b

for matrices that explicitly support optimized right and left muliply (currently, diagonal matrices)

right-multiply (for symmetry, in fact same as %*%) 

diag(5,5)  :%*% b 

optimized left multiply with a diagonal matrix:

a %*%: diag(5,5) # i.e. same as (diag(5,5) :%*% a.t) t

Second norm, vector or matrix argument:

a.norm

Transpose 

val mt = m.t

Decompositions(matrix arguments)


Cholesky decompositon (as an object of a CholeskyDecomposition class with all its operations)

val ch = chol(m)

SVD

val (u, v, s) = svd(m)

EigenDecomposition

val (v, d) = eigen(m)

QR

val (q, r) = qr(m)

Check for rank deficiency (runs rank-revealing QR)

m.isFullRank

Misc

vector cardinality 

a.length

matrix cardinality 

m.nrow

m.ncol

a copy-by-value (vector or matrix ) 

val b = a cloned 


Pitfalls 

This one the people who are accustomed to writing R linear algebra will probably find quite easy to fall into. R has a nice property, a copy-on-write, so all variables actually appear to act as no-side-effects scalar-like values and all assignment appear to be by value. Since scala always assigns by reference (except for AnyVal types which are assigned by value), it is easy to fall prey to the following side effect modifications

val m1 = m 

m1 += 5.0 // modifies m as well

A fix is as follows: 

val m1 = m cloned 
m1 += 5.0 // now m is intact

Putting it all together: In-core SSVD with power iterations. 

This is a port of my R-based SSVD prototype. To give you an idea, here is how it is going to look in Scala DSL for Mahout (not quite tuned for performance yet, but enough to give you an idea):


 /**
   * In-core SSVD algorithm.
   *
   * @param a input matrix A
   * @param k request SSVD rank
   * @param p oversampling parameter
   * @param q number of power iterations
   * @return (U,V,s)
   */
  def ssvd(a: Matrix, k: Int, p: Int = 15, q: Int = 0) = {
    val m = a.nrow
    val n = a.ncol
    if (k > min(m, n))
      throw new IllegalArgumentException(
        "k cannot be greater than smaller of m,n")
    val pfxed = min(p, min(m, n) - k)

    // actual decomposition rank
    val r = k + pfxed

    // we actually fill the random matrix here
    // just like in our R prototype, although technically
    // that would not be necessary if we implemented specific random
    // matrix view. But ok, this should do for now.
    // it is actually the distributed version we are after -- we
    // certainly would try to be efficient there.

    val rnd = new Random()
    val omega = new DenseMatrix(n, r) := ((r, c, v) => rnd.nextGaussian)

    var y = a %*% omega
    var yty = y.t %*% y
    val at = a.t
    var ch = chol(yty)
    var bt = ch.solveRight(at %*% y)


    // power iterations
    for (i <- 0 until q) {
      y = a %*% bt
      yty = y.t %*% y
      ch = chol(yty)
      bt = ch.solveRight(at %*% y)
    }

    val bbt = bt.t %*% bt
    val (uhat, d) = eigen(bbt)

    val s = d.sqrt
    val u = ch.solveRight(y) %*% uhat
    val v = bt %*% (uhat %*%: diagv(1 /: s))

    (u(::, 0 until k), v(::, 0 until k), s(0 until k))

  }

And a very-very naive test for it:

  test("SSVD") {

    val a = dense((1, 2, 3), (3, 4, 5))

    val (u, v, s) = ssvd(a, 2, q=1)

    printf("U:\n%s\n", u)
    printf("V:\n%s\n", v)
    printf("Sigma:\n%s\n", s)

    val aBar = u %*% diagv(s) %*% v.t

    val amab = a - aBar

    printf("A-USV'=\n%s\n", amab)

    assert(amab.norm < 1e-10)

  }


=============

A follow-up

Many people have come and asked, "why not Matlab syntax? in particular, why not reserve single-symbol '*' for matrix-matrix multiply?". There are some thoughts:

  • There's no preference to use either syntax. The only preference is to use some of the existing grammars to make things look familiar. R or Matlab-like syntax makes no difference to me in this case, and I think the adoption-wise they are about the same as well. So we could actually create support for both, that much is easy. We'd have to basically call additionally "import RLikeOps._" or "import MatlabLikeOps._" to enable one or the other. But, see further. 
  • Element-wise *,+,-,/ are defined on the same geometry and follow the same pattern. They also all define in-place operators which are very popular since they don't involve creation and cloning of a new matrix as a result ("op"= stuff).  Matrix-matrix multiplication seems to be a special case. In that sense, R approach looks more consistent to me. 
  • Matlab elementwise operation syntax ".*" seems to run into problems with Scala. Scala does not suport '.' symbol as part of operators. Indeed, that would have created ambiguities in Scala.
  • In-place elementwise operations would look really more complicated then. I.e. they would have to be '.*=', '.+=' etc. But there would be no such simple looking thing as *=  or /= (since matrix-matrix in-place multiplication is not possible due to cardinality mismatch in the result). 
  • The rabbit hole goes even further when we consider right-associative operations such as elementwise in-place inversion 1/x (R style), which is pretty popular with me too. That thing would have to look like '1 ./=: x ' in this case which is even more cryptic. '1 /= x' or its in-place version '1 /=: x' looks much more bearable to common sense and scala styling.
  • %*%, like many things in R, looks quirky to me, true. But in many ways, matrix-matrix multiplication is also "quirky" and, most notably, computationally heavy. It is asymptotically significantly more expensive and always requires a new matrix to be created (no in-place opportunities here). So, subjectively, asking people to type this pretty inconvenient %*% thing also reflects that heavy nature of the operation, and IMO puts a subconscious warning -- "are you sure this is the best route?  This is expensive!". 













Friday, July 13, 2012

#Mahout 0.7 has more usable PCA workflow now

Now that Mahout 0.7 has been released, it is probably appropriate to say a word with respect to my contribution, MAHOUT-817, that enables (afaik, for the first time) for Mahout to have a PCA workflow. (Well, in all fairness, I wasn't alone contributor there. Many thanks to Raphael Cendrillon for contributing the mean computation map-reduce step).

It is based on SSVD method that computes mathematical SVD in a distributed way. Generally, there's no problem using brute-force PCA approach that involves mean subtraction and then running SVD on the result.

The only time when the wrench is thrown into that engine is when original PCA input matrix is sufficiently sparse. Making mean subtraction is going to transform input matrix into a dense matrix, potentially increasing number of non-zero elements (and hence, flops needed to complete matrix operations during SVD computation) many times.

As it is true for many Mahout methods, SSVD actually takes a big view of the world with all its details and produces a more condensed analytic view of the world. So the natural thinking to deal with that is, ok, can we just run an SVD first on the sparse input and fix the smaller condensed view at its smallest incarnations later so that creates summary equivalent to that with mean subtraction?

Here is the quick write-up that shows how the math for that neat trick has worked out in Mahout.



Also, a high-level overview and practical usage of this approach is given : https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition.

Thursday, September 8, 2011

#MAHOUT -796 power iterations doc here

Here it goes, so i don't lose it. Also attached to the issue.


Still, more work to come in MAHOUT-797, although not tremendously a lot...

Thursday, September 1, 2011

#Mahout #MapReduce #SSVD version gets power iterations, fixes

I have been working on MAHOUT-796 this week (current work is in https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-796).


Nathan Halko pointed to me that power iterations in practice improve precision and reduce noise.


After a little discussion, that's the route we went :


 Original formula 

\[\mathbf{Q}=\mbox{qr}\left[\left(\mathbf{AA^{\top}}\right)^q\mathbf{A}\boldsymbol{\Omega}\right].\mathbf{Q},\]

\[\mathbf{B}=\mathbf{Q}^{\top}\mathbf{A}.\]

Modified version 

\[\mathbf{Y}=\mathbf{A}\boldsymbol{\Omega},\]

\[\mathbf{B}_{0}=\left[\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\right]^{\top}\mathbf{A},\]

\[\mathbf{B}_{i}=\left[\mbox{qr}\left(\mathbf{AB}_{i-1}^{\top}\right).\mathbf{Q}\right]^{\top}\mathbf{A},\, i\in\left[1..q\right].\]

Notation \(\mbox{qr}\left(\cdot\right).\mathbf{Q}\) means "compute QR decomposition of the argument and retain Q as a result". 

Current combination of QJob and BtJob is essentially producing \(\mathbf{B}_{0}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\boldsymbol{\Omega}\right).\mathbf{Q}\). Intermediate QJob results are QR blocks, not a final Q, so QJob is not terribly meaningful without BtJob.

The task boils down to figuring out alternative pipeline modifications necessary to produce \(\mathbf{B}_{i}^{\top}\). After that, algorithm proceeds as before with assumption of \[\mathbf{B}\equiv\mathbf{B}_{q}.\] 

The existing processing will be equivalent to \({q}=0\). 

\(\mathbf{B}_{i}\) pipeline (some new code)

\(\mathbf{B}_{i}\) pipeline produces \[\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}^{\top}\right).\mathbf{Q}.\]

this is very similar to \(\mathbf{B}_{0}\) pipeline with specfics being full multiplication of \(\mathbf{A}\mathbf{B}^{\top}\) in the first job and first pass qr pushdown to the reducer of the first job:

  • map1 : \(\mathbf{A}\mathbf{B}^{\top}\) outer products are produced 
  • combiner1 : presumming outer products of \(\mathbf{A}\mathbf{B}^{\top}\).
  • reducer1 finalizes summing up outer products of \(\mathbf{A}\mathbf{B}^{\top}\) and starts \(\mbox{qrFirstPass}\left(\mathbf{A}\mathbf{B}^{\top}\right)\to\mbox{qr blocks}\). (partitioning must be done by A block # it seems).
  • mapper2, combiner2, reducer2 proceed exactly as mapper2, combiner2, reducer2 in B pipeline and output final \(\mathbf{B}_{i}^{\top}\) with blocks corresponding to initial splits of A input.

Thus, this pipeline is 2 MR jobs with 2 sorts (map1 + combiner1 + shuffle and sort + reducer1 + map2 + combiner2 + shuffle and sort + reducer2).

Integration of Cholesky trick route for computing power iterations 

Ted Dunning pointed out that the whole B pipeline could avoid QR step.

When power iterations are concened, note that whatever route is chosen to calculate \(\mathbf{B}_{i}=g\left(\mathbf{Y}_{i}\right)\), mathematically it should still be valid \(\forall i\in\left[0..q\right]\) for as long as we assume 

\[\mathbf{Y}_{0}=\mathbf{A}\boldsymbol{\Omega}, i=0;\]

\[\mathbf{Y}_{i}=\mathbf{A}\mathbf{B}_{i-1}^{\top}, i>{0}.\] 

So, if Cholesky trick allows to produce \(\mathbf{B}_{0}\) efficiently, I see no reason why it could not be applied to producing the \(\mathbf{B}_{i}\).

I will be pursuing MR option of running Cholesky decompositon of \(\mathbf{Y}^{\top}\mathbf{Y}\) in MAHOUT-797.
If that holds, then power iterations with Cholesky decomposition will likely be similar. 
\(\mathbf{LL}^{\top}=\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{R}^{\top}\mathbf{R}\Rightarrow\) we know how to compute \(\mathbf{R}\);
\[\mathbf{B}_{i}=\left(\mathbf{R}^{-1}\right)^{\top}\mathbf{Y_{i}}^{\top}\mathbf{A};\]

\[\mathbf{Y}_{0}=\mathbf{A}\boldsymbol{\Omega}, i=0;\]

\[\mathbf{Y}_{i}=\mathbf{A}\mathbf{B}_{i-1}^{\top}, i>{0}.\]


Tests
Tests with local MR QR-based solver seem to be encouraging. There's clear improvement on precision. 
With surrogate random input with predefined singular values 10,4,1,(0.1...), toy input 2000x1000 and k=3,p=10 I seem to be getting improvement after just one additional power iteration:



q=0: 

--SSVD solver singular values:
svs: 9.998472 3.993542 0.990456 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000
q=1: (+2 more MR sequential steps):
--SSVD solver singular values:
svs: 10.000000 4.000000 0.999999 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000
After some tests and cleanup, I guess it is going to become a commit really soon.
Bug fixes
In addition to all this, MAHOUT-796 branch gets a lot of bug fixes (in particular, better handling sparse inputs which seemed to have still been broken in current version).

Saturday, June 4, 2011

John #Canny's function #summarizer formulae #Pig #MapReduce

Credit for the idea to Ted Dunning, but I wanted to log the final formulae set i used for MapReduce and Pig implementation of these kind of summarizers for my future reference. Here it goes.

Adaptation

\[s\left(t\right)=ke^{-t}-\left(k-1\right)e^{-\frac{k}{k-1}t},\, k>1. \]

This $s\left(t\right)$  (which Ted Dunning calls 'Canny's filter') has desirable properties of $s'\left(0\right)=0$  and $s\left(0\right)=1 $.

In our case, it takes form of 

\[y\left(x\right)=ke^{-x/\alpha}-\left(k-1\right)e^{-kx/\alpha\left(k-1\right)} \]

Given margin $m$  and decay time $\Delta t$ , so that 

\[y\left(\Delta t\right)=m,\] 

solving for $\alpha$, we get estimate for $\alpha$  assuming margin $m\ll1$  as 
\[\alpha\approx-\frac{\Delta t}{\ln\left(m/k\right)}.\] 

I take $\Delta t$ as constructor input with good default values of $k=4$ and $m=0.01$.

Iterative update $t_{n+1}\geq t_{n}$ 

\[\begin{cases}\begin{cases}\pi_{0}=1,\,\nu_{0}=1,\\\pi=e^{-\left(t_{n+1}-t_{n}\right)/\alpha};\\\nu=e^{-k\left(t_{n+1}-t_{n}\right)/\alpha\left(k-1\right)};\end{cases}\\w_{n+1}=1+\pi w_{n};\\v_{n+1}=1+\nu v_{n};\\s_{n+1}=x_{n+1}+\pi s_{n};\\u_{n+1}=x_{n+1}+\nu u_{n};\\t_{n+1}=t_{n+1}.\end{cases} \]

Then average \[\mu=\frac{ks-\left(k-1\right)u}{kw-\left(k-1\right)v}. \]

Persisted summarizer state thus consists of parameters $\left\{ w,v,s,u,t,k,\alpha\right\}$ .  Parameters $k$  and $\alpha$  are set in constructor and thus don't evolve.


 Iterative update in-the-past $t_{n+1}<t_{n}$

\[\begin{cases}
\begin{cases}
\pi_{0}=1,\,\nu_{0}=1;\\
\pi=e^{-\left(t_{n}-t_{n+1}\right)/\alpha},\\
\nu=e^{-k\left(t_{n}-t_{n+1}\right)/\alpha\left(k-1\right)};
\end{cases}\\
w_{n+1}=w_{n}+\pi,\\
v_{n+1}=v_{n}+\nu,\\
s_{n+1}=s_{n}+\pi x_{n+1},\\
u_{n+1}=u_{n}+\nu x_{n+1},\\
t_{n+1}=t_{n}.
\end{cases} \]

Combining Canny summarizers

Combining two summarizer states $S_{1},\, S_{2}$  having observed two disjoint sets of original observation superset (for algebraic pig UDF, MapReduce use):

Case $t_{2}\geq t_{1}$ :

\[\begin{cases}
t=t_{2};\\
\pi=e^{-\left(t_{2}-t_{1}\right)/\alpha};\\
\nu=e^{-k\left(t_{2}-t_{1}\right)/\alpha\left(k-1\right)};\\
s=s_{2}+s_{1}\pi;\\
u=u_{2}+u_{1}\nu;\\
w=w_{2}+w_{1}\pi;\\
v=v_{2}+v_{1}\nu.
\end{cases} \]

Merging in positive samples only for binomial cases

Case $t^{\left(1\right)}\geq t^{\left(0\right)}$ :

\[\begin{cases}
t=t^{\left(1\right)};\\
\pi=e^{-\left(t^{\left(1\right)}-t^{\left(0\right)}\right)/\alpha};\\
\nu=e^{-k\left(t^{\left(1\right)}-t^{\left(0\right)}\right)/\alpha\left(k-1\right)};\\
s=s^{\left(1\right)}+s^{\left(0\right)}\pi;\\
u=u^{\left(1\right)}+u^{\left(0\right)};\\
w=w^{\left(0\right)}\pi;\\
v=v^{\left(0\right)}\nu.
\end{cases} \]

Case $t^{\left(1\right)}<t^{\left(0\right)}$ :

\[\begin{cases}
t=t^{\left(0\right)}; & \left(no\, change\right)\\
\pi=e^{-\left(t^{\left(0\right)}-t^{\left(1\right)}\right)/\alpha};\\
\nu=e^{-k\left(t^{\left(0\right)}-t^{\left(1\right)}\right)/\alpha\left(k-1\right)};\\
s=s^{\left(0\right)}+s^{\left(1\right)}\pi;\\
u=u^{\left(0\right)}+u^{\left(1\right)}\nu;\\
w=w^{\left(0\right)}; & \left(no\, change\right)\\
v=v^{\left(0\right)} & (no\, change).
\end{cases}  \]

Pecularities of Canny rate summarizer implementation

Assuming $\pi_{0}=1,\,\nu_{0}=1$ , but otherwise

\[\pi=\exp\left(-\frac{\left|t_{n+1}-t_{n}\right|}{\alpha}\right),\]\[\nu=\exp\left(-\frac{k\cdot\left|t_{n+1}-t_{n}\right|}{\alpha\left(k-1\right)}\right),\]

for the case of $t_{n+1}\geq t_{n}$ :

\[\begin{cases}
w_{n+1}=\pi w_{n}+\alpha\left(1-\pi\right);\\
v_{n+1}=\nu v_{n}+\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k};
\end{cases}\]

 and for the case of $t_{n+1}<t_{n}$ :

\[\begin{cases}
w_{n+1}=\max\left(w_{n},\alpha\left(1-\pi\right)\right);\\
v_{n+1}=\max\left(v_{n},\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k}\right).
\end{cases} \]

(of course in the latter case if $w_{n}\geq\alpha\left(1-\pi\right)\equiv v_{n}\geq\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k}$ , so we can optimize a little based on that).

The logic of advancing $s_{n+1}$  and $u_{n+1} $ is the same as in Canny average summarizer.

 Pecularities of combining Canny rate summarizers

Assuming both summarizers have nonzero history and $\alpha^{\left(1\right)}=\alpha^{\left(2\right)}=\alpha$,  and $k^{\left(1\right)}=k^{\left(2\right)}=k$  as a precondition, combining routine looks like

\[\pi=\exp\left(-\frac{\left|t^{\left(2\right)}-t^{(1)}\right|}{\alpha}\right), \]\[\nu=\exp\left(-\frac{k\cdot\left|t^{\left(2\right)}-t_{n}^{\left(1\right)}\right|}{\alpha\left(k-1\right)}\right), \]\[\begin{cases}
\begin{cases}
t=t^{\left(1\right)}\\
s=s^{\left(1\right)}+\pi s^{\left(2\right)}\\
u=u^{\left(1\right)}+\nu u^{\left(2\right)}
\end{cases}, & t^{\left(1\right)}\ge t^{\left(2\right)};\\
\begin{cases}
t=t^{\left(2\right)}\\
s=s^{(2)}+\pi s^{\left(1\right)}\\
u=u^{\left(2\right)}+\nu u^{\left(1\right)}
\end{cases}, & t^{\left(1\right)}<t^{\left(2\right)};\\
w=\max\left(w^{\left(1\right)},w^{\left(2\right)}\right), & \mathrm{all\, cases;}\\
v=\max\left(v^{\left(1\right)},v^{\left(2\right)}\right), & \mathrm{all\, cases.}
\end{cases} \]

In simple words, here we take longest history of two (expressed by denominator terms $w$  and $v$ ), and merge numerators based on which history contained the latest observation. Again, we can assume $w^{\left(1\right)}\ge w^{\left(2\right)}\equiv v^{\left(1\right)}\ge v^{\left(2\right)}$  to save on comparisons a little bit.

Wednesday, June 1, 2011

On MapReduce application for online exponential rates summarizer.

as a follow-up to Ted Dunning's post exponentially weighted averaging:

Here's the bottom line result from that post:

\[\pi=e^{-\left(t_{n}-t_{n-1}\right)/\alpha}\]\[s_{n}=\pi s_{n-1}+x \]\[w_{n}=\pi w_{n-1}+\left(t-t_{n}\right)\]\[t_{n}= t \]

Some additions are due:

Combiner routine:
In most general case, when both summarizers have non-zero history, and $t_{1}>t_{2}$ ,

\[\pi=e^{-\left(t_{1}-t_{2}\right)/\alpha}, \]\[w=\max\left(w_{1},\pi w_{2}\right), \]\[s=s_{1}+\pi s_{2}.\]

Case $t_{1}<t_{2} $ is symmetrical w.r.t. indices $\left(\cdot\right)_{1},\,\left(\cdot\right)_{2}$ .

This combiner routine however is not consistent with original formula as time spans are not exponentially compressed into time scale. Therefore, unit test for such combiner produces significant errors in various history combinations.

A combine-consistent routine:
A better formulation that works fine with the combiner is as follows:

Case $t_{n+1}>t_{n}$ :

\[\begin{cases}
\Delta t=t_{n+1}-t_{n},\\
\begin{cases}
\pi_{0}=1,\\
\pi_{n+1}=e^{-\Delta t/\alpha};
\end{cases}\\
w_{n+1}=\pi_{n+1}w_{n}+\alpha\cdot(1-\pi_{n+1});\\
s_{n+1}=x_{n+1}+\pi_{n+1}s_{n};\\
t_{n+1}=t_{n+1}.
\end{cases} \]

The term $\alpha\left(1-\pi_{n+1}\right)$  really corresponds to the function average of the exponent function $f\left(x\right)=e^{x}$ multiplied by $\Delta t$  as in
\[\Delta t\cdot\bar{f}\left(x|\left[-\frac{\Delta t}{\alpha},0\right]\right)=\Delta t\cdot\frac{\alpha}{\Delta t}\cdot\int_{-\Delta t/\alpha}^{0}e^{x}dx=\]\[ =\alpha\cdot\left(e^{0}-e^{-\Delta t/\alpha}\right)=\]\[ =\alpha\left(1-\pi\right).\]

Finally, update-in-the-past also looks different for consistency and combining sake:


Case $t_{n+1}<t_{n}$:

\[\begin{cases}
\Delta t=t_{n}-t_{n+1},\\
\begin{cases}
\pi_{0}=1,\\
\pi=e^{-\Delta t/\alpha},
\end{cases}\\
w_{n+1}=\max\left[w_{n},\alpha\cdot\left(1-\pi_{n+1}\right)\right],\\
s_{n+1}=s_{n}+\pi_{n+1}x_{n+1},\\
t_{n+1}=t_{n}. & \left(\mathrm{no\, change}\right)
\end{cases}

\]

Monday, May 30, 2011

Generalized Linear Model + Stochastic Gradient Descent with squared loss

Many authors (Koren, Bottou etc.)  popularized use of SGD algorithms.

Indeed, what's not to like. It is fast. It's simple. It's intuitive.

The principle is extremely simple. If you have a statistical model based on bunch of parameters $${\beta}_0$$, $${\beta}_1$$ ... and you have a loss function $$\mathcal{L}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)$$, and you want to minimize loss w.r.t. $$\boldsymbol{\beta}$$ parameters of the model, then you go thru your input and at each step you try to adjust parameters using simple iterative notion of

\[\boldsymbol{\beta}_{i+1}=\boldsymbol{\beta}_{i}-\gamma\cdot\nabla_{\beta}\mathcal{L},\] 

where $$\gamma$$ is called learning rate and can be found adaptively based on results of crossvalidation of the trained model. (Here, bold symbols denote vectors).

So, can we generalize formulas when applied to Generalized Linear Models?

Indeed, GLM formulation is 

\[G\left(y\right)=\eta\left(\boldsymbol{x}\right),\]

where 

\[\eta\left(\boldsymbol{x}\right)=\boldsymbol{\beta}^{\top}\boldsymbol{x}.\]

We take the perhaps the simplest, squared loss approach, where loss funciton is 

\[\mathcal{L}=\sum_{i}\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)^{2}+\lambda\rho\left(\boldsymbol{\beta}\right),\]

where $$\lambda$$  is regularization rate, and $$\rho\left(\cdot\right)$$  is usually choosen based on prior distribution for regression parameters (which commonly is L1, L2 or t-prior). 

Loss function gradient over $$\boldsymbol{\beta}$$  in case of GLM+SGD (individual SGD step)

\[\nabla_{\beta}\mathcal{L}_{i}=\nabla_{\beta}\left[\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)^{2}+\lambda\rho\left(\boldsymbol{\beta}\right)\right]=\]
\[=2\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)\cdot\nabla_{\beta}\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)+\lambda\nabla_{\beta}\rho\left(\boldsymbol{\beta}\right)=\]
\[=-2r_{i}\frac{dG^{-1}}{d\eta}\boldsymbol{x}_{i}+\lambda\nabla_{\beta}\rho\left(\boldsymbol{\beta}\right).\]

Here, $$r_{i}=y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)$$  is a residual at the i -th step of SGD. Which leads us to a commonly used form of SGD with squared loss

\[\boldsymbol{\beta}_{i+1}=\boldsymbol{\beta}_{i}-\gamma\cdot\nabla_{\beta}\mathcal{L},\]

or, removing some constant multipliers for convenience,

\[\boldsymbol{\beta}_{i+1}=\boldsymbol{\beta}_{i}+\gamma\cdot\left(r_{i+1}\frac{dG^{-1}}{d\eta}\boldsymbol{x}_{i+1}-\lambda\nabla_{\beta}\rho\left(\boldsymbol{\beta}\right)\right).\]

Since $$G^{-1}$$  is always a monotonically increasing function, many implementations replace its derivative term $$\frac{dG^{-1}}{d\eta}$$  with 1 (in particular, logistic regression where calculating derivative requires an exponent computation).

L2 regularization would look like 

\[\rho\left(\boldsymbol{\beta}\right)=\left\Vert \boldsymbol{\beta}\right\Vert {}_{2}^{2}\Rightarrow\nabla\rho\left(\boldsymbol{\beta}\right)=\boldsymbol{\beta}.\]

L1 (less common) regularization would look like 

\[\rho\left(\boldsymbol{\beta}\right)=\left|\left\Vert \boldsymbol{\beta}\right\Vert _{1}\right|\Rightarrow\nabla\rho\left(\boldsymbol{\beta}\right)=\left(\begin{matrix}1\\\vdots\\1\end{matrix}\right).\]

So it's clear that GLM object model now can be considered as consisting of general SGD core algorithm and two strategy patterns: link function and regularization. 

Link function strategy needs to provide inverse link and inverse link prime computations. 

Regularization abstraction needs to provide partial derivative computation $$\frac{\partial\rho}{\partial\beta_{i}}$$ given (set) of regression coefficients (and perhaps an input $$x$$, too, although neither L1 nor L2 really care of the $$x$$).

Identity link function with no regularization would mean Linear regression, Identity link function + L1 gives us LASSO regression, Identity link function + L2 gives us Ridge regression, Logit link function produces logistic regression, and, finally, exponential link function produces Poisson regression, or a.k.a log-linear models (If i am not mistaken). 

Finding hyperparameters $$\gamma$$ and $$\lambda$$ is a separate rather interesting search optimization task.