Processing math: 52%

Saturday, June 4, 2011

John #Canny's function #summarizer formulae #Pig #MapReduce

Credit for the idea to Ted Dunning, but I wanted to log the final formulae set i used for MapReduce and Pig implementation of these kind of summarizers for my future reference. Here it goes.

Adaptation

s(t)=ket(k1)ekk1t,k>1.

This s(t)  (which Ted Dunning calls 'Canny's filter') has desirable properties of s(0)=0  and s(0)=1.

In our case, it takes form of 

y(x)=kex/α(k1)ekx/α(k1)

Given margin m  and decay time Δt , so that 

y(Δt)=m, 

solving for α, we get estimate for α  assuming margin m1  as 
αΔtln(m/k). 

I take Δt as constructor input with good default values of k=4 and m=0.01.

Iterative update tn+1tn 

{{π0=1,ν0=1,π=e(tn+1tn)/α;ν=ek(tn+1tn)/α(k1);wn+1=1+πwn;vn+1=1+νvn;sn+1=xn+1+πsn;un+1=xn+1+νun;tn+1=tn+1.

Then average μ=ks(k1)ukw(k1)v.

Persisted summarizer state thus consists of parameters {w,v,s,u,t,k,α} .  Parameters k  and α  are set in constructor and thus don't evolve.


 Iterative update in-the-past tn+1<tn

{{π0=1,ν0=1;π=e(tntn+1)/α,ν=ek(tntn+1)/α(k1);wn+1=wn+π,vn+1=vn+ν,sn+1=sn+πxn+1,un+1=un+νxn+1,tn+1=tn.

Combining Canny summarizers

Combining two summarizer states S1,S2  having observed two disjoint sets of original observation superset (for algebraic pig UDF, MapReduce use):

Case t2t1 :

{t=t2;π=e(t2t1)/α;ν=ek(t2t1)/α(k1);s=s2+s1π;u=u2+u1ν;w=w2+w1π;v=v2+v1ν.

Merging in positive samples only for binomial cases

Case t(1)t(0) :

{t=t(1);π=e(t(1)t(0))/α;ν=ek(t(1)t(0))/α(k1);s=s(1)+s(0)π;u=u(1)+u(0);w=w(0)π;v=v(0)ν.

Case t(1)<t(0) :

{t=t(0);(nochange)π=e(t(0)t(1))/α;ν=ek(t(0)t(1))/α(k1);s=s(0)+s(1)π;u=u(0)+u(1)ν;w=w(0);(nochange)v=v(0)(nochange).

Pecularities of Canny rate summarizer implementation

Assuming π0=1,ν0=1 , but otherwise

π=exp(|tn+1tn|α),ν=exp(k|tn+1tn|α(k1)),

for the case of tn+1tn :

{wn+1=πwn+α(1π);vn+1=νvn+α(k1)(1ν)k;

 and for the case of tn+1<tn :

{wn+1=max

(of course in the latter case if w_{n}\geq\alpha\left(1-\pi\right)\equiv v_{n}\geq\alpha\frac{\left(k-1\right)\left(1-\nu\right)}{k} , so we can optimize a little based on that).

The logic of advancing s_{n+1}  and u_{n+1} is the same as in Canny average summarizer.

 Pecularities of combining Canny rate summarizers

Assuming both summarizers have nonzero history and \alpha^{\left(1\right)}=\alpha^{\left(2\right)}=\alpha,  and k^{\left(1\right)}=k^{\left(2\right)}=k  as a precondition, combining routine looks like

\pi=\exp\left(-\frac{\left|t^{\left(2\right)}-t^{(1)}\right|}{\alpha}\right), \nu=\exp\left(-\frac{k\cdot\left|t^{\left(2\right)}-t_{n}^{\left(1\right)}\right|}{\alpha\left(k-1\right)}\right), \begin{cases} \begin{cases} t=t^{\left(1\right)}\\ s=s^{\left(1\right)}+\pi s^{\left(2\right)}\\ u=u^{\left(1\right)}+\nu u^{\left(2\right)} \end{cases}, & t^{\left(1\right)}\ge t^{\left(2\right)};\\ \begin{cases} t=t^{\left(2\right)}\\ s=s^{(2)}+\pi s^{\left(1\right)}\\ u=u^{\left(2\right)}+\nu u^{\left(1\right)} \end{cases}, & t^{\left(1\right)}<t^{\left(2\right)};\\ w=\max\left(w^{\left(1\right)},w^{\left(2\right)}\right), & \mathrm{all\, cases;}\\ v=\max\left(v^{\left(1\right)},v^{\left(2\right)}\right), & \mathrm{all\, cases.} \end{cases}

In simple words, here we take longest history of two (expressed by denominator terms w  and v ), and merge numerators based on which history contained the latest observation. Again, we can assume w^{\left(1\right)}\ge w^{\left(2\right)}\equiv v^{\left(1\right)}\ge v^{\left(2\right)}  to save on comparisons a little bit.

Wednesday, June 1, 2011

On MapReduce application for online exponential rates summarizer.

as a follow-up to Ted Dunning's post exponentially weighted averaging:

Here's the bottom line result from that post:

\pi=e^{-\left(t_{n}-t_{n-1}\right)/\alpha}s_{n}=\pi s_{n-1}+x w_{n}=\pi w_{n-1}+\left(t-t_{n}\right)t_{n}= t

Some additions are due:

Combiner routine:
In most general case, when both summarizers have non-zero history, and t_{1}>t_{2} ,

\pi=e^{-\left(t_{1}-t_{2}\right)/\alpha}, w=\max\left(w_{1},\pi w_{2}\right), s=s_{1}+\pi s_{2}.

Case t_{1}<t_{2} is symmetrical w.r.t. indices \left(\cdot\right)_{1},\,\left(\cdot\right)_{2} .

This combiner routine however is not consistent with original formula as time spans are not exponentially compressed into time scale. Therefore, unit test for such combiner produces significant errors in various history combinations.

A combine-consistent routine:
A better formulation that works fine with the combiner is as follows:

Case t_{n+1}>t_{n} :

\begin{cases} \Delta t=t_{n+1}-t_{n},\\ \begin{cases} \pi_{0}=1,\\ \pi_{n+1}=e^{-\Delta t/\alpha}; \end{cases}\\ w_{n+1}=\pi_{n+1}w_{n}+\alpha\cdot(1-\pi_{n+1});\\ s_{n+1}=x_{n+1}+\pi_{n+1}s_{n};\\ t_{n+1}=t_{n+1}. \end{cases}

The term \alpha\left(1-\pi_{n+1}\right)  really corresponds to the function average of the exponent function f\left(x\right)=e^{x} multiplied by \Delta t  as in
\Delta t\cdot\bar{f}\left(x|\left[-\frac{\Delta t}{\alpha},0\right]\right)=\Delta t\cdot\frac{\alpha}{\Delta t}\cdot\int_{-\Delta t/\alpha}^{0}e^{x}dx= =\alpha\cdot\left(e^{0}-e^{-\Delta t/\alpha}\right)= =\alpha\left(1-\pi\right).

Finally, update-in-the-past also looks different for consistency and combining sake:


Case t_{n+1}<t_{n}:

\begin{cases} \Delta t=t_{n}-t_{n+1},\\ \begin{cases} \pi_{0}=1,\\ \pi=e^{-\Delta t/\alpha}, \end{cases}\\ w_{n+1}=\max\left[w_{n},\alpha\cdot\left(1-\pi_{n+1}\right)\right],\\ s_{n+1}=s_{n}+\pi_{n+1}x_{n+1},\\ t_{n+1}=t_{n}. & \left(\mathrm{no\, change}\right) \end{cases}