Monday, May 30, 2011

Generalized Linear Model + Stochastic Gradient Descent with squared loss

Many authors (Koren, Bottou, etc.) have popularized the use of SGD algorithms.

Indeed, what's not to like? It is fast. It's simple. It's intuitive.

The principle is extremely simple. If you have a statistical model based on a bunch of parameters $${\beta}_0$$, $${\beta}_1$$ ..., and you have a loss function $$\mathcal{L}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)$$ that you want to minimize w.r.t. the $$\boldsymbol{\beta}$$ parameters of the model, then you go through your input and at each step adjust the parameters using the simple iterative update

\[\boldsymbol{\beta}_{i+1}=\boldsymbol{\beta}_{i}-\gamma\cdot\nabla_{\beta}\mathcal{L},\] 

where $$\gamma$$ is called the learning rate and can be found adaptively based on the results of cross-validation of the trained model. (Here, bold symbols denote vectors.)
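To make this concrete, here is a minimal sketch of a single SGD step in Python (the function names and the plain-linear-model gradient below are illustrative assumptions, not any particular library's API):

```python
import numpy as np

def sgd_step(beta, grad_loss, x_i, y_i, gamma):
    """One SGD step: beta <- beta - gamma * grad_beta L(x_i | beta)."""
    return beta - gamma * grad_loss(beta, x_i, y_i)

# Hypothetical example: squared-loss gradient for a plain linear model y ~ beta^T x
def grad_squared_loss(beta, x_i, y_i):
    residual = y_i - beta @ x_i
    return -2.0 * residual * x_i

beta = np.zeros(3)
x_i, y_i = np.array([1.0, 2.0, 3.0]), 4.0
beta = sgd_step(beta, grad_squared_loss, x_i, y_i, gamma=0.01)
```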

So, can we generalize these formulas to Generalized Linear Models?

Indeed, the GLM formulation is 

\[G\left(y\right)=\eta\left(\boldsymbol{x}\right),\]

where 

\[\eta\left(\boldsymbol{x}\right)=\boldsymbol{\beta}^{\top}\boldsymbol{x}.\]

We take perhaps the simplest approach, squared loss, where the loss function is 

\[\mathcal{L}=\sum_{i}\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)^{2}+\lambda\rho\left(\boldsymbol{\beta}\right),\]

where $$\lambda$$ is the regularization rate, and $$\rho\left(\cdot\right)$$ is usually chosen based on the prior distribution for the regression parameters (commonly an L1, L2, or t-prior). 
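As a sketch of that loss in code (assuming an L2 penalty for $$\rho$$ and an identity link for the prediction; all names here are made up for illustration):

```python
import numpy as np

def regularized_squared_loss(beta, X, y, predict, lam, penalty):
    """Sum of squared residuals plus lambda * rho(beta)."""
    residuals = y - np.array([predict(x_i, beta) for x_i in X])
    return np.sum(residuals ** 2) + lam * penalty(beta)

# Assumed pluggable pieces:
identity_predict = lambda x, beta: beta @ x      # y_hat = G^{-1}(eta) with G = identity
l2_penalty = lambda beta: np.sum(beta ** 2)      # rho(beta) = ||beta||_2^2
```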

The loss function gradient over $$\boldsymbol{\beta}$$ in the case of GLM+SGD (an individual SGD step) is

\[\nabla_{\beta}\mathcal{L}_{i}=\nabla_{\beta}\left[\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)^{2}+\lambda\rho\left(\boldsymbol{\beta}\right)\right]=\]
\[=2\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)\cdot\nabla_{\beta}\left(y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)\right)+\lambda\nabla_{\beta}\rho\left(\boldsymbol{\beta}\right)=\]
\[=-2r_{i}\frac{dG^{-1}}{d\eta}\boldsymbol{x}_{i}+\lambda\nabla_{\beta}\rho\left(\boldsymbol{\beta}\right).\]

Here, $$r_{i}=y_{i}-\hat{y}\left(\boldsymbol{x}_{i}|\boldsymbol{\beta}\right)$$ is the residual at the $$i$$-th step of SGD. This leads us to a commonly used form of SGD with squared loss

\[\boldsymbol{\beta}_{i+1}=\boldsymbol{\beta}_{i}-\gamma\cdot\nabla_{\beta}\mathcal{L},\]

or, dropping the constant factor of 2 (absorbing it into the hyperparameters) for convenience,

\[\boldsymbol{\beta}_{i+1}=\boldsymbol{\beta}_{i}+\gamma\cdot\left(r_{i+1}\frac{dG^{-1}}{d\eta}\boldsymbol{x}_{i+1}-\lambda\nabla_{\beta}\rho\left(\boldsymbol{\beta}\right)\right).\]

Since $$G^{-1}$$ is always a monotonically increasing function, many implementations replace its derivative term $$\frac{dG^{-1}}{d\eta}$$ with 1 (in particular, for logistic regression, where calculating the derivative requires an exponent computation).
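Putting the update formula into code, a single GLM+SGD step might look like the sketch below. The inverse link, its derivative, and the penalty gradient are passed in as the pluggable pieces, and `inv_link_prime` can simply return 1 if you want the simplification just mentioned (all names are assumptions, not an existing API):

```python
import numpy as np

def glm_sgd_update(beta, x, y, gamma, lam, inv_link, inv_link_prime, penalty_grad):
    """beta <- beta + gamma * (r * dG^{-1}/d_eta * x - lambda * grad rho(beta))."""
    eta = beta @ x
    r = y - inv_link(eta)                       # residual r_i = y_i - y_hat
    return beta + gamma * (r * inv_link_prime(eta) * x - lam * penalty_grad(beta))

# Assumed example strategies: identity link + L2 penalty.
identity_inv_link = lambda eta: eta
identity_inv_link_prime = lambda eta: 1.0       # dG^{-1}/d_eta, here trivially 1
l2_grad = lambda beta: 2.0 * beta
```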

L2 regularization would look like 

\[\rho\left(\boldsymbol{\beta}\right)=\left\Vert \boldsymbol{\beta}\right\Vert _{2}^{2}\Rightarrow\nabla\rho\left(\boldsymbol{\beta}\right)=2\boldsymbol{\beta}\]

(the factor of 2 can again be absorbed into $$\lambda$$).

L1 regularization (less common) would look like 

\[\rho\left(\boldsymbol{\beta}\right)=\left\Vert \boldsymbol{\beta}\right\Vert _{1}\Rightarrow\nabla\rho\left(\boldsymbol{\beta}\right)=\operatorname{sign}\left(\boldsymbol{\beta}\right),\]

where the sign is taken elementwise (strictly a subgradient, since the L1 norm is not differentiable at zero).
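In code, the two penalty gradients could be sketched like this (using the elementwise sign for L1, as just noted):

```python
import numpy as np

def l2_penalty_grad(beta):
    """Gradient of rho(beta) = ||beta||_2^2."""
    return 2.0 * beta

def l1_penalty_grad(beta):
    """(Sub)gradient of rho(beta) = ||beta||_1: sign(beta) elementwise;
    np.sign returns 0 at beta_j = 0, which is one valid subgradient choice."""
    return np.sign(beta)
```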

So it's clear that the GLM object model can now be considered as consisting of a general SGD core algorithm and two strategy patterns: the link function and the regularization. 

The link function strategy needs to provide the inverse link and inverse link prime (derivative) computations. 

The regularization abstraction needs to provide the partial derivative computation $$\frac{\partial\rho}{\partial\beta_{i}}$$ given the (set of) regression coefficients (and perhaps an input $$x$$ too, although neither L1 nor L2 really cares about $$x$$).
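Under those assumptions, the two strategies might be sketched as small classes like these (names invented for illustration; the logit link is shown since it comes up next):

```python
import numpy as np

class LogitLink:
    """Link function strategy: inverse link G^{-1} and its derivative."""
    def inv(self, eta):
        return 1.0 / (1.0 + np.exp(-eta))       # sigmoid, the inverse of logit
    def inv_prime(self, eta):
        p = self.inv(eta)
        return p * (1.0 - p)                    # derivative of the sigmoid w.r.t. eta

class L2Regularizer:
    """Regularization strategy: gradient of rho(beta) = ||beta||_2^2."""
    def grad(self, beta):
        return 2.0 * beta
```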

The identity link function with no regularization means linear regression; identity link + L1 gives us LASSO regression; identity link + L2 gives us ridge regression; the logit link function produces logistic regression; and, finally, the log link function (exponential inverse link) produces Poisson regression, a.k.a. log-linear models (if I am not mistaken). 
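For instance, wiring the hypothetical pieces above together yields a logistic-regression-style learner (purely a sketch reusing `glm_sgd_update`, `LogitLink`, and `L2Regularizer` from earlier):

```python
import numpy as np

link, reg = LogitLink(), L2Regularizer()
beta = np.zeros(3)
X = np.array([[1.0, 0.5, -1.0], [1.0, -0.2, 0.3]])   # toy data, bias as first column
y = np.array([1.0, 0.0])

for x_i, y_i in zip(X, y):                            # one SGD pass over the data
    beta = glm_sgd_update(beta, x_i, y_i, gamma=0.1, lam=0.01,
                          inv_link=link.inv,
                          inv_link_prime=link.inv_prime,
                          penalty_grad=reg.grad)
```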

Finding the hyperparameters $$\gamma$$ and $$\lambda$$ is a separate, rather interesting, search/optimization task.
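One simple (and assumed, not the only) way to do that is a grid search over $$\gamma$$ and $$\lambda$$ scored on a held-out validation set, reusing the sketches above:

```python
import numpy as np
from itertools import product

def validation_mse(beta, X_val, y_val, inv_link):
    """Held-out mean squared error of the fitted model."""
    return np.mean((y_val - inv_link(X_val @ beta)) ** 2)

def grid_search(X_tr, y_tr, X_val, y_val, link, reg,
                gammas=(0.001, 0.01, 0.1), lams=(0.0, 0.01, 0.1)):
    """Return (error, gamma, lambda) for the best hyperparameter pair found."""
    best = None
    for gamma, lam in product(gammas, lams):
        beta = np.zeros(X_tr.shape[1])
        for x_i, y_i in zip(X_tr, y_tr):              # one SGD pass per candidate pair
            beta = glm_sgd_update(beta, x_i, y_i, gamma, lam,
                                  link.inv, link.inv_prime, reg.grad)
        err = validation_mse(beta, X_val, y_val, link.inv)
        if best is None or err < best[0]:
            best = (err, gamma, lam)
    return best
```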