(Co)-clustering with Map-Reduce
Under construction
More details to come when I find some time.
Source code
For now, you can browse the source or download a tarball snapshot (highly experimental).
Also, this is the simple hand-coded benchmark described in this blog post. It's not very useful without the dataset, but if you see anything glaringly stupid in the code that may affect performance, I'd love to know. FYI, each line in the dataset is a couple of hundred characters long.
Todo
Due to lack of time, for now this is more a statement of desires than a plan:
- Clean up codebase
- Package into EC2 AMI, with accompanying demo data S3 bucket
- Generalize to a broader class of co-clustering cost functions (Bregman divergences?)
- Consider integration with Apache Mahout
If you would like to volunteer for any of these, please let me know!
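As background for the Bregman divergence item: for a strictly convex, differentiable function phi, the Bregman divergence is D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>. Choosing phi(x) = ||x||^2 gives the squared Euclidean distance, and phi(x) = sum_i x_i log x_i gives the generalized I-divergence. The sketch below shows the definition and these two instances. The names `bregman`, `squared_euclidean` and `i_divergence` are illustrative only, and which divergences this project would actually support is still open.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Generic Bregman divergence: D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>.
// phi maps a Vec to a double; grad_phi maps a Vec to the gradient of phi at that point.
template <typename Phi, typename GradPhi>
double bregman(const Vec& x, const Vec& y, Phi phi, GradPhi grad_phi) {
    assert(x.size() == y.size());
    const Vec g = grad_phi(y);
    double inner = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) inner += g[i] * (x[i] - y[i]);
    return phi(x) - phi(y) - inner;
}

// phi(x) = ||x||^2 yields the squared Euclidean distance.
double squared_euclidean(const Vec& x, const Vec& y) {
    auto phi = [](const Vec& v) {
        double s = 0.0;
        for (double a : v) s += a * a;
        return s;
    };
    auto grad = [](const Vec& v) {
        Vec g(v);
        for (double& a : g) a *= 2.0;
        return g;
    };
    return bregman(x, y, phi, grad);
}

// phi(x) = sum_i x_i log x_i yields the generalized I-divergence,
// sum_i x_i log(x_i / y_i) - x_i + y_i. Requires strictly positive entries.
double i_divergence(const Vec& x, const Vec& y) {
    auto phi = [](const Vec& v) {
        double s = 0.0;
        for (double a : v) s += a * std::log(a);
        return s;
    };
    auto grad = [](const Vec& v) {
        Vec g(v);
        for (double& a : g) a = std::log(a) + 1.0;
        return g;
    };
    return bregman(x, y, phi, grad);
}
```

Keeping the cost function behind a (phi, grad phi) pair like this is one way a clustering loop could stay unchanged while the divergence is swapped.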
Attachments
- histogram.cc (0.8 kB) - added by spapadim 2 years ago. Simple hand-coded histogram benchmark
