Implementing a simple web-log based recommender system

I’ve now implemented a web-log based recommender system as an extension to Catalyst, the open source Perl web framework. The system isn’t yet ready for general distribution, but I’d like to share my approach.

First, I’ve gathered ten years of web access logs from WormBase, a generic model organism database where I work as the project manager.

Next, I correlated IP addresses with requests and tried to trace browsing patterns from one object to the next. This isn’t an exact science since we haven’t historically tried to uniquely identify users.
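A minimal sketch of this kind of tracing, written in Python for brevity rather than the Perl of the actual system. It assumes the logs have already been parsed into (ip, timestamp, object) tuples. The 30-minute cutoff and the function name are my own illustrative choices, not part of the real implementation.

```python
from collections import Counter, defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed cutoff; not taken from the real system


def count_transitions(records, gap=SESSION_GAP_SECONDS):
    """Count object -> object transitions made from the same IP address.

    records: iterable of (ip, unix_timestamp, object_id) tuples.
    Returns a Counter keyed by (from_object, to_object).
    """
    by_ip = defaultdict(list)
    for ip, ts, obj in records:
        by_ip[ip].append((ts, obj))

    transitions = Counter()
    for visits in by_ip.values():
        visits.sort()  # chronological order within one IP
        for (prev_ts, prev_obj), (ts, obj) in zip(visits, visits[1:]):
            # Skip reloads of the same object, and hops separated by a long
            # pause, which probably belong to a different visit or person.
            if obj != prev_obj and ts - prev_ts <= gap:
                transitions[(prev_obj, obj)] += 1
    return transitions
```

Since an IP address is only a rough proxy for a user, the time cutoff is one cheap way to avoid linking requests that merely share a proxy or a lab network.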

Data is loaded into a simple MySQL schema with object and object2related tables. Expediently simple.
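To make the two-table idea concrete, here is a hedged sketch. Only the table names object and object2related come from the real schema. The column names, the hits counter, and the helper functions are assumptions, and Python's built-in sqlite3 stands in for MySQL so the example is self-contained.

```python
import sqlite3

# Column names are illustrative guesses; only the table names are real.
SCHEMA = """
CREATE TABLE object (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL UNIQUE
);
CREATE TABLE object2related (
    object_id   INTEGER NOT NULL REFERENCES object(id),
    related_id  INTEGER NOT NULL REFERENCES object(id),
    hits        INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (object_id, related_id)
);
"""


def build_db(transitions):
    """Load {(from_name, to_name): hits} into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for (src, dst), hits in transitions.items():
        for name in (src, dst):
            conn.execute("INSERT OR IGNORE INTO object (name) VALUES (?)", (name,))
        conn.execute(
            """INSERT INTO object2related (object_id, related_id, hits)
               SELECT a.id, b.id, ? FROM object a, object b
               WHERE a.name = ? AND b.name = ?""",
            (hits, src, dst),
        )
    conn.commit()
    return conn


def related(conn, name, limit=5):
    """Return names of objects most often browsed after `name`."""
    rows = conn.execute(
        """SELECT r.name FROM object2related o2r
           JOIN object o ON o.id = o2r.object_id
           JOIN object r ON r.id = o2r.related_id
           WHERE o.name = ?
           ORDER BY o2r.hits DESC, r.name
           LIMIT ?""",
        (name, limit),
    )
    return [row[0] for row in rows]
```

With a layout like this, the list of related objects for any page is a single indexed lookup, which is probably all the "expediently simple" approach needs.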

Recommender systems for biological databases

Recommender systems [Wikipedia] seek to provide users with information related to what they are currently browsing. These are now ubiquitous in e-commerce sites such as Amazon, where each page contains a list of items viewed or purchased by other users.

I’ve long felt that a recommender system could revolutionize the browsing and mining of biological data. The idea would be to provide users with a list of related objects based on the browsing history of other users. See this post for some preliminary implementation notes.

I am worried that a recommender system won’t be received with open arms. Given that my current implementation is based on web log analysis, it presents serious privacy issues. It’s conceivable that its use could inadvertently reveal the identity of uncloned and unpublished loci.

New genetic map almost complete

I’m putting the finishing touches on a new Genetic Map display. Intended as a drop-in replacement for the Acedb GMap, the new Genetic Map builds on the GBrowse infrastructure, so it offers the same familiar user interface used for browsing the genome.

I still need to fix a few small things (what in the hell is a centi-centi-Morgan?) 🙂 RD also asked if it might be possible to display mapping data. I think some small modifications to the code I use for calculating confidence intervals should be sufficient for generating spans for both 2- and 3-factor crosses. The display might get a little complicated if it shows separate spans for each experiment; it might be better to aggregate like experiments together.

released to SourceForge

I’ve made a huge number of changes under the hood to make the update process simpler and more stable. A number of things precipitated these changes, including the rapidly expanding size of the database, the increasing number of denormalized support databases, the need to run the site in a load-balanced configuration and, most importantly, the looming addition of 3 new species sometime this year.

I’ve refactored the old Bio::GMOD modules into their own namespace, Bio::WormBase, in part to avoid possible future namespace conflicts and in part to allow a degree of flexibility not present in the old modules.

– Resumable downloads (no longer limited to the Net::FTP/Perl 2GB limit of some architectures)
– Purging of old releases
– Versioning of all MySQL databases, etc.
– Consolidation and versioning of blast, blat, and epcr databases

To do:
– Service monitoring and restarting as necessary (underway)

You can check out these new modules from SourceForge. I would encourage all of you to become SourceForge developers if you aren’t already so that we can work together to make the mirroring process as streamlined as possible. Please contact me so that I can add you to the list of developers. The project name is bio-wormbase, and the first module is I plan on adding a bunch of other things, such as data mining examples, administrative scripts, etc.

Thanks for all your help!


The following documentation is copy-pasted from our Wiki, where it is a little easier to read.

Fetching the new modules:
Anonymous access:
cvs login
cvs -z3 co -P

Developer access:
export CVS_RSH=ssh
cvs -z3 co -P