Weathering Thru Tech Days: SSVD Command Line usage

Sunday, March 27, 2011

SSVD Command Line usage

Here's the doc, also attached to MAHOUT-593. A wiki update is due at some point, once we know what it is all good for.

Also, from my email regarding the -s parameter:

There are 2 cases where you might want to adjust -s :

1 -- if you are running a really huge input that produces more than 1000 or
so map tasks and/or that causes OOM in some of the tasks. It looks like
your input is far from that now.

2 -- if you have quite wide input -- realistically more than 30k
non-zero elements in a row. The way the current algorithm works, it tries
to do a blocking QR of stochastically projected rows in the mappers,
which means it needs to read at least k+p rows in each split (map
task). This can be fixed, and I have a branch that should eventually
address this. In your case, if there happen to be splits that contain
fewer than 110 rows of input, that would be the case where you might
want to start setting -s greater than the DFS block size (64 MB); it has
no effect if it's less than that (which is why Hadoop calls it
_minimum_ split size). I don't remember Hadoop's definition of this
parameter, but I think it is in bytes, so you probably need to
specify something like 100,000,000 to start seeing a decrease in the
number of map tasks. But honestly I have never tried this, since I never
had input wide enough to require it.
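
To make the arithmetic above concrete, here is a rough sketch in Python. It assumes the split size follows the usual Hadoop FileInputFormat rule, max(minSize, min(maxSize, blockSize)), and that -s is in bytes (which the email itself is unsure about). The function names and the average row size are hypothetical, made up for illustration; none of this is Mahout code.

```python
import math

# 64 MB DFS block size, as mentioned in the email above.
DFS_BLOCK_SIZE = 64 * 1024 * 1024


def effective_split_size(min_split_size, block_size=DFS_BLOCK_SIZE,
                         max_split_size=2**63 - 1):
    """Assumed Hadoop rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_split_size, min(max_split_size, block_size))


def approx_map_tasks(input_bytes, min_split_size, block_size=DFS_BLOCK_SIZE):
    """Rough estimate only: ignores file boundaries and Hadoop's split slop."""
    split = effective_split_size(min_split_size, block_size)
    return math.ceil(input_bytes / split)


def min_split_for_rows(rows_needed, avg_row_bytes):
    """Bytes a split must span to hold rows_needed rows (k+p, e.g. 110)."""
    return rows_needed * avg_row_bytes


if __name__ == "__main__":
    ten_gb = 10 * 1024**3
    # Below the block size, -s changes nothing.
    print(approx_map_tasks(ten_gb, 1_000_000))
    # Above the block size, the number of map tasks starts to drop.
    print(approx_map_tasks(ten_gb, 100_000_000))
    # Hypothetical wide rows of about 1 MB each, k+p = 110.
    print(min_split_for_rows(110, 1_000_000))
```

With a hypothetical 10 GB input, any -s below 64 MB leaves the estimate unchanged, while 100,000,000 starts to reduce it, which matches the reasoning in the email.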

2 comments:

nathan, August 24, 2011 at 10:39 AM
Hi Dmitriy,

I have been studying your implementation of ssvd in Mahout. I think it's great that you contributed it to this project. At the time of writing the paper, MapReduce was not on our 'academic' radar, and only recently have I begun to understand how cloud computing and MR will soon subsume the expensive, impossible-to-program 'supercomputers' of today.

I would like to get involved in this project. There are some modifications that greatly increase the accuracy and reliability of the algorithm at the expense of only a few more passes through the data. This is important since much of the data in machine learning is noisy, and a one-pass random projection is likely to listen to that noise a bit too much. Also, I saw it mentioned that the algorithm has not had much benchmarking or testing. One of the intents of my thesis work was to implement ssvd in MR (or some large architecture) and analyze it against the Lanczos method to define the benefits and limitations of the method. I would be more than happy to provide this analysis, and thank you for doing the hard part of implementing it!

I have many things to say and am probably reaching the limit of a blog comment. Please contact me and we can discuss ssvd further.

Nathan Halko nathan@spotinfluence.com
Dmitriy Lyubimov, August 24, 2011 at 12:34 PM
Nathan, that's great! You probably mean power iterations.

I will be happy to be of help. I sent you a reply by email. Thanks.