## Tuesday, April 12, 2011

### Online mean summarizer for binomial distribution with irregular sampling and arbitrary bias

Suppose we have a task to provide biased estimate on probability sampled by $\left(x_{1},...x_{n}\right),\, x_{i}\in\left\{ 0,1\right\}$. Traditional solution for a biased estimate is mean of a conjugate prior. In this case it would be mean of the beta-distribution $P_{n}=\mathbb{E}\left(\mu_{B}\right)=\frac{\alpha}{\alpha+\beta},$ where $\alpha$ equals to number of positive outcomes +1 and $\beta$ is taken as number of negative outcomes +1.

The obvious intention of biased estimate is to converge on certain bias $P_{0}$ in absence of a good sample data: $P_{0}=\lim_{n\rightarrow0}P_{n}.$

In case of standard beta-distribution $P_{0}={1\over{2}}$, which of course is not terribly useful in practice. In practice we may have much better 'guesses' for our initial estimate. So in order to allow arbitrary parameterization for $P_{0}$, let's denote $\alpha=n_{+}+b_{+}$ and $\beta=n_{-}+b_{-}$, $n_{-}$ being the number of negative observations and $n_{+}$ being the number of positive observations. Values $b_{+}$ and $b_{-}$ thus express our initial bias towards positive and negative outcome of final estimate, respectively.

Then our online summarizer could be presented as $P_{n}=\frac{b_{+}+\sum_{i}^{n}x_{i}}{n+b_{-}+b_{+}}$ and it's not hard to see that we can come up with heuristics allowing arbitrary bias $P_{0}=\lim_{n\rightarrow0}P_{n}=\frac{b_{+}}{b_{+}+b_{-}}$ while keeping $b_{+}+b_{-}=2$ per standard beta-distribution.

That's the biased summarizer I have been using until I saw this post.

If our sampling $\left(x_{1},...x_{n}\right),\, x_{i}\in\left\{ 0,1\right\}$ is also taken at times $t_{1},...t_{n}$ then we can factor in exponential phase-out for more distant history while taking more recent history into account. Also, instead of converging onto our bias $P_{0}$ when we just have a lack of history, we can also converge on it if we have lack of recent history: $\lim_{n\rightarrow0}P=\frac{b_{+}}{b_{+}+b_{-}}=P_{0}$ and $\lim_{\left(t_{n}-t_{n-1}\right)\rightarrow+\infty}P=\frac{b_{+}}{b_{+}+b_{-}}=P_{0}.$

Cool. How do we exactly do that in a convenient way?

We can modify result from Ted Dunning's post I referenced above in the following way:$P=\frac{b_{+}+\sum_{i=1}^{n}x_{i}e^{-\left(t-t_{i}\right)/\alpha}}{b_{-}+b_{+}+\sum_{i=1}^{n}e^{-\left(t-t_{i}\right)/\alpha}}.$

It's not hard to see that our goals described by limits above would hold with this solution.

The iterative solution for that would be $\pi_{n+1}=e^{-\left(t_{n+1}-t_{n}\right)/\alpha},$ $w_{n+1}=1+\pi_{n+1}w_{n},$ $s_{n+1}=x_{n+1}+\pi_{n+1}s_{n},$ $P_{n+1}=\frac{b_{+}+s_{n+1}}{b_{+}+b_{-}+w_{n+1}}.$

We also have to go by selecting our $b_{+}$ and $b_{-}$ values more carefully, since value of samples is exponentially decreasing. So it stands to reason we want to decrease effect of our bias based on the amount of history, exponentially weighted, as well. Suppose we have a metric $\varepsilon$ that denotes amount of non-significant history for the purposes of biasing. Then we want to modify condition $b_{+}+b_{-}=2$ based on non-weighted history by using exponential function average $b_{+}+b_{-}=2\cdot\frac{\int_{0}^{-\ln\varepsilon}e^{-t}dt}{-\ln\varepsilon}=2\cdot\frac{\varepsilon-1}{\ln\varepsilon}.$ This gives us solution for bias parameters as $\begin{cases}b_{+}=2P_{0}\frac{\varepsilon-1}{\ln\varepsilon};\\b_{-}=2\left(1-P_{0}\right)\frac{\varepsilon-1}{\ln\varepsilon}.\end{cases}$

Also, having to specify $\alpha$ is weird. Most people like to look at it as span of useful history $\Delta{t}$ and margin $m$, in exponential scale, when the rest of history is not deemed very useful. Good default value for $m$ is perhaps 0.01 (1%). Then we can compute $\alpha$ as $\alpha=\frac{-\Delta t}{\ln m}.$

So, building the summarizer, we have iterative state represented by $\left\{ w_{n},s_{n},t_{n}\right\}$ and non-default constructor that accepts $\Delta{t},m,P_{0}$ and $\varepsilon$ and transforms them into parameters of the summarizer : $b_{-},b_{+},\alpha$ per above.

The summarizer needs to implement 2 methods: $\mathrm{pnow}(t)$ and $\mathrm{update}(t_{n},x_{n})$. The implementation of the update() method pretty much follows from all the above. The implementation of pnow() just needs to compute probability for current moment (we don't have an observation and may be more biased if we did not see recent observations). Method pnow() just needs to do the same estimate as update() assuming $t_{n}$ as 'now' and $s_{n+1}=\pi_{n+1}s_{n}$ and $w_{n+1}=\pi_{n+1}w_{n}$ and  without actually updating the state $\left\{ w_{n},s_{n},t_{n}\right\}$.

Did I say that some version of constructor could assume default values for $\varepsilon$ and $m$? I use $\varepsilon=0.5$ and $m=0.01$ as default. So one of good versions of constructor is the one that accepts bias aka 'null hypothesis' $P_{0}$, and the span of useful history $\Delta t$ to phase out on.

So a bit of coding, and we start converging on a pretty picture in no time.

(And yes, I do take all my pictures myself as well as doing the coding part.)