Thursday, April 21, 2011

Follow-up for the "Mean Summarizer..." post

This is a small follow-up for the previous post.

In practice, in MapReduce world, such as Pig UDF functions (which I btw already put in place), when we run over observation data, we often encounter 2 types of problems:

1-- Unordered observations, i.e. cases such as $$t_{n+1}<t_{n}$$.
2-- Parallel processing of disjoint subsets of common superset of observations and combining into one ("combiner"-hadoop, "Algebraic UDF"-pig).

Hence, we need to add those specific cases to our formulas.

1. Updating state with events in the past.
Updated formula will look like:

Case $$t_{n+1}\geq t_{n} $$:

Case $$t_{n+1}<t_{n}$$ (updating in-the-past):

2.  Combining two summarizers having observed two disjoint sets as subsets of original observation set.
Combining two summarizer states $$S_{1},\, S_{2}$$ having observed two disjoint sets of original observation superset:
Case $$t_{2}\geq t_{1}$$:

Case $$t_{2}<t_{1}$$ is symmetrical w.r.t. indices $$\left(\cdot\right)_{1},\left(\cdot\right)_{2}$$. Also, the prerequisite for combining two summarizers is $$\alpha_{1}=\alpha_{2}=\alpha$$ (history decay is the same).

But enough of midnight oil burning.

No comments:

Post a Comment