bitquill - Spiros Papadimitriou
(Co)-clustering with Map-Reduce

Under construction

More details to come, when I find some time.

Source code

For now, you can browse the source or download a tarball snapshot (highly experimental).

Also, this is the simple hand-coded benchmark described in this blog post. It's not very useful without the dataset, but if you see anything glaringly stupid in the code that may affect performance, I'd love to know. FYI, each line in the dataset is a couple of hundred characters long.

Todo

Due to lack of time, for now this is more a statement of desires than of plans:

  • Clean up codebase
  • Package into EC2 AMI, with accompanying demo data S3 bucket
  • Generalize to arbitrary co-clustering cost functions (Bregman divergences?)
  • Consider integration with Apache Mahout

If you would like to volunteer for any of these, please let me know!
