## Saturday, June 4, 2011

### John #Canny's function #summarizer formulae #Pig #MapReduce

Credit for the idea to Ted Dunning, but I wanted to log the final formulae set i used for MapReduce and Pig implementation of these kind of summarizers for my future reference. Here it goes.

$s\left(t\right)=ke^{-t}-\left(k-1\right)e^{-\frac{k}{k-1}t},\, k>1.$

This $s\left(t\right)$  (which Ted Dunning calls 'Canny's filter') has desirable properties of $s'\left(0\right)=0$  and $s\left(0\right)=1$.

In our case, it takes form of

$y\left(x\right)=ke^{-x/\alpha}-\left(k-1\right)e^{-kx/\alpha\left(k-1\right)}$

Given margin $m$  and decay time $\Delta t$ , so that

$y\left(\Delta t\right)=m,$

solving for $\alpha$, we get estimate for $\alpha$  assuming margin $m\ll1$  as
$\alpha\approx-\frac{\Delta t}{\ln\left(m/k\right)}.$

I take $\Delta t$ as constructor input with good default values of $k=4$ and $m=0.01$.

Iterative update $t_{n+1}\geq t_{n}$

$\begin{cases}\begin{cases}\pi_{0}=1,\,\nu_{0}=1,\\\pi=e^{-\left(t_{n+1}-t_{n}\right)/\alpha};\\\nu=e^{-k\left(t_{n+1}-t_{n}\right)/\alpha\left(k-1\right)};\end{cases}\\w_{n+1}=1+\pi w_{n};\\v_{n+1}=1+\nu v_{n};\\s_{n+1}=x_{n+1}+\pi s_{n};\\u_{n+1}=x_{n+1}+\nu u_{n};\\t_{n+1}=t_{n+1}.\end{cases}$

Then average $\mu=\frac{ks-\left(k-1\right)u}{kw-\left(k-1\right)v}.$

Persisted summarizer state thus consists of parameters $\left\{ w,v,s,u,t,k,\alpha\right\}$ .  Parameters $k$  and $\alpha$  are set in constructor and thus don't evolve.

Iterative update in-the-past $t_{n+1}<t_{n}$

$\begin{cases} \begin{cases} \pi_{0}=1,\,\nu_{0}=1;\\ \pi=e^{-\left(t_{n}-t_{n+1}\right)/\alpha},\\ \nu=e^{-k\left(t_{n}-t_{n+1}\right)/\alpha\left(k-1\right)}; \end{cases}\\ w_{n+1}=w_{n}+\pi,\\ v_{n+1}=v_{n}+\nu,\\ s_{n+1}=s_{n}+\pi x_{n+1},\\ u_{n+1}=u_{n}+\nu x_{n+1},\\ t_{n+1}=t_{n}. \end{cases}$

Combining Canny summarizers

Combining two summarizer states $S_{1},\, S_{2}$  having observed two disjoint sets of original observation superset (for algebraic pig UDF, MapReduce use):

Case $t_{2}\geq t_{1}$ :

$\begin{cases} t=t_{2};\\ \pi=e^{-\left(t_{2}-t_{1}\right)/\alpha};\\ \nu=e^{-k\left(t_{2}-t_{1}\right)/\alpha\left(k-1\right)};\\ s=s_{2}+s_{1}\pi;\\ u=u_{2}+u_{1}\nu;\\ w=w_{2}+w_{1}\pi;\\ v=v_{2}+v_{1}\nu. \end{cases}$

Merging in positive samples only for binomial cases

Case $t^{\left(1\right)}\geq t^{\left(0\right)}$ :

$\begin{cases} t=t^{\left(1\right)};\\ \pi=e^{-\left(t^{\left(1\right)}-t^{\left(0\right)}\right)/\alpha};\\ \nu=e^{-k\left(t^{\left(1\right)}-t^{\left(0\right)}\right)/\alpha\left(k-1\right)};\\ s=s^{\left(1\right)}+s^{\left(0\right)}\pi;\\ u=u^{\left(1\right)}+u^{\left(0\right)};\\ w=w^{\left(0\right)}\pi;\\ v=v^{\left(0\right)}\nu. \end{cases}$

Case $t^{\left(1\right)}<t^{\left(0\right)}$ :

$\begin{cases} t=t^{\left(0\right)}; & \left(no\, change\right)\\ \pi=e^{-\left(t^{\left(0\right)}-t^{\left(1\right)}\right)/\alpha};\\ \nu=e^{-k\left(t^{\left(0\right)}-t^{\left(1\right)}\right)/\alpha\left(k-1\right)};\\ s=s^{\left(0\right)}+s^{\left(1\right)}\pi;\\ u=u^{\left(0\right)}+u^{\left(1\right)}\nu;\\ w=w^{\left(0\right)}; & \left(no\, change\right)\\ v=v^{\left(0\right)} & (no\, change). \end{cases}$

Pecularities of Canny rate summarizer implementation

Assuming $\pi_{0}=1,\,\nu_{0}=1$ , but otherwise

$\pi=\exp\left(-\frac{\left|t_{n+1}-t_{n}\right|}{\alpha}\right),$$\nu=\exp\left(-\frac{k\cdot\left|t_{n+1}-t_{n}\right|}{\alpha\left(k-1\right)}\right),$

for the case of $t_{n+1}\geq t_{n}$ :

$\begin{cases} w_{n+1}=\pi w_{n}+\alpha\left(1-\pi\right);\\ v_{n+1}=\nu v_{n}+\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k}; \end{cases}$

and for the case of $t_{n+1}<t_{n}$ :

$\begin{cases} w_{n+1}=\max\left(w_{n},\alpha\left(1-\pi\right)\right);\\ v_{n+1}=\max\left(v_{n},\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k}\right). \end{cases}$

(of course in the latter case if $w_{n}\geq\alpha\left(1-\pi\right)\equiv v_{n}\geq\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k}$ , so we can optimize a little based on that).

The logic of advancing $s_{n+1}$  and $u_{n+1}$ is the same as in Canny average summarizer.

Pecularities of combining Canny rate summarizers

Assuming both summarizers have nonzero history and $\alpha^{\left(1\right)}=\alpha^{\left(2\right)}=\alpha$,  and $k^{\left(1\right)}=k^{\left(2\right)}=k$  as a precondition, combining routine looks like

$\pi=\exp\left(-\frac{\left|t^{\left(2\right)}-t^{(1)}\right|}{\alpha}\right),$$\nu=\exp\left(-\frac{k\cdot\left|t^{\left(2\right)}-t_{n}^{\left(1\right)}\right|}{\alpha\left(k-1\right)}\right),$$\begin{cases} \begin{cases} t=t^{\left(1\right)}\\ s=s^{\left(1\right)}+\pi s^{\left(2\right)}\\ u=u^{\left(1\right)}+\nu u^{\left(2\right)} \end{cases}, & t^{\left(1\right)}\ge t^{\left(2\right)};\\ \begin{cases} t=t^{\left(2\right)}\\ s=s^{(2)}+\pi s^{\left(1\right)}\\ u=u^{\left(2\right)}+\nu u^{\left(1\right)} \end{cases}, & t^{\left(1\right)}<t^{\left(2\right)};\\ w=\max\left(w^{\left(1\right)},w^{\left(2\right)}\right), & \mathrm{all\, cases;}\\ v=\max\left(v^{\left(1\right)},v^{\left(2\right)}\right), & \mathrm{all\, cases.} \end{cases}$

In simple words, here we take longest history of two (expressed by denominator terms $w$  and $v$ ), and merge numerators based on which history contained the latest observation. Again, we can assume $w^{\left(1\right)}\ge w^{\left(2\right)}\equiv v^{\left(1\right)}\ge v^{\left(2\right)}$  to save on comparisons a little bit.