Tuesday, April 26, 2011

Supercomputing Is a Religion But Clouds Are Mundane

A supercomputer is a specialized device. It can do a lot of math, at a scale and precision no other platform can match, but that is about all it has an edge in. Have you ever used one as a web server? Or at least as some sort of middle-tier real-time application server? There's no practical benefit in doing it that way; the value-to-cost ratio is just not good. There's no reason to use a demigod's power for your backyard gardening.

By supercomputers I of course mean only hardware-based NUMA architectures; commodity-hardware simulations with LAN links (such as Beowulf) are, IMO, really too slow to be seriously considered NUMA-type supercomputers.

Clouds, on the other hand, are as mundane as a toothbrush. Clouds are as versatile as a minivan. And you can scale them, at a fraction of the cost, to have as many CPUs as supercomputers do. In that sense, paraphrasing the old saying "today's supercomputer is tomorrow's computer", today's supercomputer is today's cloud. The only problem is that there's virtually no shared memory space for those CPUs, so it takes forever to move data from one CPU to another (compared to NUMA). You can only harness the power of those CPUs if you parallelize work into independent splits big enough that the cloud stays loaded long enough before the need to share split computations arises again. But wouldn't it be cool if that toothbrush could solve problems as big as its supercomputer counterpart with the same number of cores might? Wouldn't it be nice, for a moment, to feel a demigod's power rushing through one's veins while doing that annoying weeding in your backyard?

So, what can we do about highly interconnected problems that rely a lot on shared memory to parallelize a solution?

Fortunately, many interconnected problems are, in their traditional solutions, connected much more than necessary, simply because software abstractions such as MPI make shared memory access look cheap and an iterative approach is used more than it needs to be. Or at least a less iterative approach exists; it's just that sometimes you need a demigod's wisdom to see it. But as they say in the East, need is the forefather of all inventions.

Thursday, April 21, 2011

Follow-up for the "Mean Summarizer..." post

This is a small follow-up for the previous post.

In practice, in the MapReduce world (e.g. in Pig UDF functions, which I have btw already put in place), when we run over observation data, we often encounter two types of problems:


1-- Unordered observations, i.e. cases such as $$t_{n+1}<t_{n}$$.
2-- Parallel processing of disjoint subsets of a common superset of observations, followed by combining the results into one (a "combiner" in Hadoop, an "Algebraic" UDF in Pig).

Hence, we need to add those specific cases to our formulas.

1. Updating state with events in the past.
The updated formula will look like this:

Case $$t_{n+1}\geq t_{n} $$:
\[\begin{cases}\begin{cases}\pi_{0}=1,\\\pi_{n+1}=e^{-\left(t_{n+1}-t_{n}\right)/\alpha};\end{cases}\\w_{n+1}=1+\pi_{n+1}w_{n};\\s_{n+1}=x_{n+1}+\pi_{n+1}s_{n};\\t_{n+1}=t_{n+1}.\end{cases}\]

Case $$t_{n+1}<t_{n}$$ (updating in-the-past):
\[\begin{cases}\begin{cases}\pi_{0}=1,\\\pi_{n+1}=e^{-\left(t_{n}-t_{n+1}\right)/\alpha};\end{cases}\\w_{n+1}=w_{n}+\pi_{n+1};\\s_{n+1}=s_{n}+\pi_{n+1}x_{n+1};\\t_{n+1}=t_{n}.\end{cases}\]
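For illustration, here is a minimal Java sketch of an update rule handling case 1. The class and member names are mine and purely illustrative (this is not the actual Pig UDF code); the state $$\left\{ w_{n},s_{n},t_{n}\right\}$$ and decay constant $$\alpha$$ are as defined in the previous post.

// Sketch: exponentially decayed summarizer state with out-of-order updates.
// Illustrative only; the bias handling from the previous post is omitted here.
public class DecayedSummarizerState {
    private double w;               // decayed observation count, w_n
    private double s;               // decayed observation sum, s_n
    private double t;               // time of the most recent observation, t_n
    private boolean empty = true;
    private final double alpha;     // history decay constant

    public DecayedSummarizerState(double alpha) { this.alpha = alpha; }

    public void update(double tNew, double x) {
        if (empty) {                              // pi_0 = 1
            w = 1; s = x; t = tNew; empty = false;
            return;
        }
        if (tNew >= t) {                          // in-order observation
            double pi = Math.exp(-(tNew - t) / alpha);
            w = 1 + pi * w;
            s = x + pi * s;
            t = tNew;
        } else {                                  // observation in the past
            double pi = Math.exp(-(t - tNew) / alpha);
            w += pi;
            s += pi * x;
            // t stays at t_n
        }
    }
}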

2. Combining two summarizers.
Combining two summarizer states $$S_{1},\, S_{2}$$ that have observed two disjoint subsets of the original observation superset:
Case $$t_{2}\geq t_{1}$$:
\[\begin{cases}t=t_{2};\\s=s_{2}+s_{1}e^{-\left(t_{2}-t_{1}\right)/\alpha};\\w=w_{2}+w_{1}e^{-\left(t_{2}-t_{1}\right)/\alpha}.\end{cases}\]

Case $$t_{2}<t_{1}$$ is symmetrical w.r.t. indices $$\left(\cdot\right)_{1},\left(\cdot\right)_{2}$$. Also, the prerequisite for combining two summarizers is $$\alpha_{1}=\alpha_{2}=\alpha$$ (history decay is the same).
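Correspondingly, a sketch of the combine step for case 2 could be added to the same illustrative class as above; the precondition $$\alpha_{1}=\alpha_{2}$$ is assumed rather than checked here.

// Sketch: combine two states built over disjoint subsets of the observations.
// Precondition (assumed, not checked): both states use the same alpha.
public void merge(DecayedSummarizerState other) {
    if (other.empty) return;
    if (empty) { w = other.w; s = other.s; t = other.t; empty = false; return; }
    double pi = Math.exp(-Math.abs(t - other.t) / alpha);
    if (other.t >= t) {          // case t_2 >= t_1
        s = other.s + pi * s;
        w = other.w + pi * w;
        t = other.t;
    } else {                     // symmetric case t_2 < t_1
        s += pi * other.s;
        w += pi * other.w;
    }
}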

But enough of midnight oil burning.

Tuesday, April 12, 2011

Online mean summarizer for binomial distribution with irregular sampling and arbitrary bias

Suppose we have a task to provide a biased estimate of a probability sampled by $$\left(x_{1},...x_{n}\right),\, x_{i}\in\left\{ 0,1\right\}$$. The traditional solution for a biased estimate is the mean of a conjugate prior. In this case it would be the mean of the beta distribution \[P_{n}=\mathbb{E}\left(\mu_{B}\right)=\frac{\alpha}{\alpha+\beta},\] where $$\alpha$$ equals the number of positive outcomes plus 1 and $$\beta$$ is taken as the number of negative outcomes plus 1.

The obvious intention of a biased estimate is to converge on a certain bias $$P_{0}$$ in the absence of good sample data: \[P_{0}=\lim_{n\rightarrow0}P_{n}.\]

In the case of the standard beta distribution $$P_{0}={1\over{2}}$$, which of course is not terribly useful in practice, where we may have much better 'guesses' for our initial estimate. So, in order to allow arbitrary parameterization of $$P_{0}$$, let's denote $$\alpha=n_{+}+b_{+}$$ and $$\beta=n_{-}+b_{-}$$, with $$n_{+}$$ being the number of positive observations and $$n_{-}$$ the number of negative observations. The values $$b_{+}$$ and $$b_{-}$$ thus express our initial bias towards a positive or negative outcome of the final estimate, respectively.

Then our online summarizer can be presented as \[P_{n}=\frac{b_{+}+\sum_{i}^{n}x_{i}}{n+b_{-}+b_{+}}\] and it's not hard to see that we can come up with heuristics allowing an arbitrary bias \[P_{0}=\lim_{n\rightarrow0}P_{n}=\frac{b_{+}}{b_{+}+b_{-}}\] while keeping $$b_{+}+b_{-}=2$$ per the standard beta distribution.
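For example, with $$P_{0}=0.7$$ this gives $$b_{+}=1.4$$ and $$b_{-}=0.6$$: with no observations the estimate is $$1.4/\left(0.6+1.4\right)=0.7$$, and as $$n$$ grows the bias terms are dominated by the observation counts, so $$P_{n}$$ approaches the plain sample mean.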

That's the biased summarizer I had been using until I saw this post.

If our sampling $$\left(x_{1},...x_{n}\right),\, x_{i}\in\left\{ 0,1\right\}$$ is also taken at times $$t_{1},...t_{n}$$, then we can factor in an exponential phase-out of more distant history while taking more recent history into account. Also, instead of converging onto our bias $$P_{0}$$ only when we lack history altogether, we can also converge on it when we lack recent history: \[\lim_{n\rightarrow0}P=\frac{b_{+}}{b_{+}+b_{-}}=P_{0}\] and \[\lim_{\left(t_{n}-t_{n-1}\right)\rightarrow+\infty}P=\frac{b_{+}}{b_{+}+b_{-}}=P_{0}.\]

Cool. How exactly do we do that in a convenient way?

We can modify the result from Ted Dunning's post referenced above in the following way:\[P=\frac{b_{+}+\sum_{i=1}^{n}x_{i}e^{-\left(t-t_{i}\right)/\alpha}}{b_{-}+b_{+}+\sum_{i=1}^{n}e^{-\left(t-t_{i}\right)/\alpha}}.\]

It's not hard to see that our goals described by limits above would hold with this solution.

The iterative solution for that would be \[\pi_{n+1}=e^{-\left(t_{n+1}-t_{n}\right)/\alpha},\] \[w_{n+1}=1+\pi_{n+1}w_{n},\] \[s_{n+1}=x_{n+1}+\pi_{n+1}s_{n},\] \[P_{n+1}=\frac{b_{+}+s_{n+1}}{b_{+}+b_{-}+w_{n+1}}.\]

We also have to select our $$b_{+}$$ and $$b_{-}$$ values more carefully, since the weight of the samples decreases exponentially. So it stands to reason that we want to decrease the effect of our bias based on the exponentially weighted amount of history as well. Suppose we have a metric $$\varepsilon$$ that denotes the fraction of history considered non-significant for the purposes of biasing. Then we want to modify the condition $$b_{+}+b_{-}=2$$, which was based on non-weighted history, by using the average of the exponential weight function: \[b_{+}+b_{-}=2\cdot\frac{\int_{0}^{-\ln\varepsilon}e^{-t}dt}{-\ln\varepsilon}=2\cdot\frac{\varepsilon-1}{\ln\varepsilon}.\] This gives us the solution for the bias parameters: \[\begin{cases}b_{+}=2P_{0}\frac{\varepsilon-1}{\ln\varepsilon};\\b_{-}=2\left(1-P_{0}\right)\frac{\varepsilon-1}{\ln\varepsilon}.\end{cases}\]
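(For reference, the integral evaluates as $$\int_{0}^{-\ln\varepsilon}e^{-t}dt=1-e^{\ln\varepsilon}=1-\varepsilon$$, which is where the right-hand side comes from; with the default $$\varepsilon=0.5$$ used below, this works out to $$b_{+}+b_{-}\approx1.44$$ instead of 2.)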

Also, having to specify $$\alpha$$ directly is awkward. Most people prefer to look at it as a span of useful history $$\Delta{t}$$ and a margin $$m$$, on the exponential scale, beyond which the rest of history is not deemed very useful. A good default value for $$m$$ is perhaps 0.01 (1%). Then we can compute $$\alpha$$ as \[\alpha=\frac{-\Delta t}{\ln m}.\]

So, building the summarizer, we have the iterative state represented by $$\left\{ w_{n},s_{n},t_{n}\right\}$$ and a non-default constructor that accepts $$\Delta{t},m,P_{0}$$ and $$\varepsilon$$ and transforms them into the parameters of the summarizer, $$b_{-},b_{+},\alpha$$, per the above.

The summarizer needs to implement two methods: $$\mathrm{pnow}(t)$$ and $$\mathrm{update}(t_{n},x_{n})$$. The implementation of update() pretty much follows from all of the above. The implementation of pnow() just needs to compute the probability for the current moment (we don't have a new observation, and the estimate may lean more towards the bias if we have not seen recent observations). That is, pnow() does the same estimate as update(), taking $$t_{n}$$ as 'now' with $$s_{n+1}=\pi_{n+1}s_{n}$$ and $$w_{n+1}=\pi_{n+1}w_{n}$$, but without actually updating the state $$\left\{ w_{n},s_{n},t_{n}\right\}$$.

Did I mention that some version of the constructor could assume default values for $$\varepsilon$$ and $$m$$? I use $$\varepsilon=0.5$$ and $$m=0.01$$ as defaults. So one good version of the constructor is one that accepts just the bias, a.k.a. the 'null hypothesis', $$P_{0}$$, and the span of useful history $$\Delta t$$ to phase out on.
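To make this concrete, here is a minimal Java sketch of such a summarizer under the definitions above. The class and member names are mine and purely illustrative (not the actual Pig UDF code), and this version assumes observations arrive in time order.

// Minimal sketch of the online biased mean summarizer with exponential decay.
// Illustrative names; assumes in-order observations.
public class BiasedBinomialMeanSummarizer {
    // summarizer parameters
    private final double bPlus, bMinus, alpha;
    // iterative state {w_n, s_n, t_n}
    private double w, s, t;
    private boolean empty = true;

    // Convenience constructor with defaults epsilon = 0.5, m = 0.01.
    public BiasedBinomialMeanSummarizer(double p0, double dt) {
        this(p0, dt, 0.01, 0.5);
    }

    public BiasedBinomialMeanSummarizer(double p0, double dt, double m, double epsilon) {
        alpha = -dt / Math.log(m);                            // alpha = -dt / ln(m)
        double bSum = 2 * (epsilon - 1) / Math.log(epsilon);  // b+ + b- = 2(eps-1)/ln(eps)
        bPlus = p0 * bSum;
        bMinus = (1 - p0) * bSum;
    }

    // Fold in observation x in {0,1} taken at time tNew (tNew >= t_n assumed).
    public void update(double tNew, double x) {
        double pi = empty ? 1 : Math.exp(-(tNew - t) / alpha);   // pi_0 = 1
        w = 1 + pi * w;
        s = x + pi * s;
        t = tNew;
        empty = false;
    }

    // Biased estimate 'as of now', without mutating the state.
    public double pnow(double tNow) {
        double pi = empty ? 1 : Math.exp(-(tNow - t) / alpha);
        return (bPlus + pi * s) / (bPlus + bMinus + pi * w);
    }
}

Usage would then just be constructing it with the bias and the span of useful history, feeding it update(t, x) calls, and querying pnow(now) whenever an estimate is needed.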

So a bit of coding, and we start converging on a pretty picture in no time.

(And yes, I do take all my pictures myself as well as doing the coding part.)

Monday, April 4, 2011

Git, Github and committing to ASF svn

There has been a discussion around the ASF about what the best practices for git/GitHub/ASF svn might be. I am ready to offer my own variation.

It's been some time since the ASF (Apache Software Foundation) enabled git mirrors of its SVN repository, and GitHub has been mirroring those in turn.

E.g., the Mahout project has the Apache git URL git://git.apache.org/mahout.git, which in turn is mirrored on GitHub as git@github.apache.org:apache/mahout.git. Being a Mahout committer myself, I will use it as the example below.

Consequently, it is possible to use GitHub's collaboration framework to collaborate on individual project issues, accept additions from contributors, do even more merging/branching, etc. In other words, it enables an individual JIRA issue to have its own commit history without really exposing this history to the final project. Another benefit is having all the power of the 3-way (or whatever-way) rebases and merges git provides, instead of a single patch that often gets out of sync with the trunk.

And, of course, there is git's ability to be a 'distributed' system: your own copy is always your own playground as well, with the full power of git at your disposal, and so on and so forth.

So.. how do we sync apache svn with git branch?

We use git's paradigm of an upstream branch, of course, with the svn trunk being the upstream. Sort of.

Of course, svn doesn't support upstream tracking, not directly anyway.

First, let's clone the project and set up the svn 'upstream':
git clone git://git.apache.org/mahout.git mahout-svn
git svn init --prefix=origin/ --tags=tags --trunk=trunk --branches=branches https://svn.apache.org/repos/asf/mahout
git svn rebase

I used to run 'git svn clone ...' here, but as this wonderful document points out, you can save a lot of time by cloning the Apache git mirror instead of the svn repository.

Also don't forget to install the authors file from http://git.apache.org/authors.txt by running

git config svn.authorsfile ".git/authors.txt"

assuming that's the path where you put it. Otherwise, your local git commit metadata will be different from that created on git.apache.org and, as a result, that commit's hash will also be different. 'git svn rebase' would reconcile it though, it seems.

We could also use the 'fork' capability in GitHub to create our repository clone, I think. It would be even faster. But now that I already have a cloned repository, I cannot clone one more, since GitHub doesn't seem to allow more than one fork of another repository. Oh well...

So, we are now on our 'holy cow', the trunk branch. We will not work here, but only use it for commits. For committers, it is also recommended to configure git-svn with the svn 'authors' file, per the info in the Apache wiki mentioned above.

Now, say we want to create a branch in our GitHub repository, git@github.com:dlyubimov/mahout-commits, which we pre-created using the GitHub tools.

git remote add github git@github.com:dlyubimov/mahout-commits
git checkout trunk
git checkout -b MAHOUT-622  
git push -u github MAHOUT-622

1-- creates a remote named 'github', corresponding to the GitHub repository, in the local repository.
2-- makes sure we are on the local branch 'trunk'
3-- creates a local branch MAHOUT-622 based on the current (trunk) branch and switches to it
4-- pushes (in this case, also creates) the current local MAHOUT-622 branch to github/MAHOUT-622 and also sets up upstream tracking for it. This would take significantly less time if we used GitHub's 'fork' as mentioned above.

Now we can create some commit history for github/MAHOUT-622 using GitHub collaboration (pull requests, etc.). Once we are reasonably satisfied with the branch and want to commit it to trunk, here is what we do:
git checkout MAHOUT-622
git pull
git checkout trunk
git svn rebase
git merge MAHOUT-622 --squash
git commit -m "MAHOUT-622: cleaning up dependencies"
git log -p -1 
git svn dcommit

1-- switch to MAHOUT-622
2-- fast-forward to the latest changes on the remote github branch
3-- switch to trunk
4-- pull the latest changes from svn (just in case)
5-- merge all changes from MAHOUT-622 onto the trunk tree without committing. At this point, if any merge conflicts are reported, resolve them (perhaps using 'git mergetool');
6-- format the future single svn commit for the MAHOUT-622 issue as one commit
7-- optionally review the patch of what we just changed before pushing it to svn. Also, since we just merged it here, i.e. potentially added some more changes to the original patch, do all the due diligence here: check that the build and tests pass, etc., whatever else committers do.
8-- push the commit to svn. European users are advised to use the --no-rebase option while doing this and then explicitly rebase several seconds later.

Alternatively, in step 5 we can merge directly from the remote github branch:
git checkout trunk
git svn rebase
git fetch github
git merge github/MAHOUT-622 --squash
git commit -m "MAHOUT-622: cleaning up dependencies"
git log -p -1 
git svn dcommit

Note that in this case, instead of pulling, we need to run 'git fetch github' in order to update the remote-tracking commit tree in the local repository.

If we wanted to publish a patch to JIRA so that others could review it, we could do so by running 'git diff trunk MAHOUT-622', or just 'git diff trunk' if we are on MAHOUT-622 already.

Yet another use case is when we want to merge upstream svn changes into the issue we are working on:

git checkout trunk
git svn rebase
git checkout MAHOUT-622
git pull 
git merge trunk
git push

1-- switch to the 'holy cow', trunk
2-- pull the latest from the 'upstream' ASF svn
3-- switch to MAHOUT-622
4-- pull the latest remote collaborations into local MAHOUT-622
5-- merge trunk onto the issue branch. If conflicts are reported, resolve them (perhaps using 'git mergetool');
6-- push updates to the remote github issue branch. Since we set the issue branch up to track github's branch, we can just use the default form of 'git push'.