You really should be reading acgt.me. Stop already and go read it.

If you don’t already, you really should be reading Keith Bradnam’s ACGT.

Keith is the creator of the hilarious Just Another Bogus Bioinformatics Acronym (JABBA) Awards. If that isn’t reason enough to name your next bioinformatics framework after a “secret beach break in Guanacaste” (not that one exists) instead of using an asinine acronym, well then I’ll throw this in too: ACGT is frontline informative and discusses things you need to know before they’ve hit the rough streets.

He also runs the “101 questions with a bioinformatician” series, which provides a more personal look into the field of bioinformatics. It’s sort of a conference meet-and-greet, CSHL wine-and-cheese, or Sanger Bits & Nibbles (I still chuckle when I hear that; it’s so super kawaii). Keith has already lined up a great list of interesting interviews. I am, admittedly, a recent participant, but the previous interviews are from far more esteemed peeps and worth a read. It’s a great idea and you should really get over there and check it out.

And do follow kbradnam on Twitter, too, yep? You’d be making a mistake otherwise.

It’s time to reboot bioinformatics education

Nearly 15 years after completion of the human genome, undergraduate and graduate programs still aren’t adequately training future scientists with the basic bioinformatics skills needed to be successful in the “big data in biology” era. Why?

As a project manager and developer of a long-running model organism database (and a former bench scientist myself), I interact with biologists on a daily basis. Frankly, I’m alarmed by what I see. Here are some examples of the types of questions I routinely field:

I have a list of genes and I’d like to know the function of each.

I need all the [unspliced|spliced|upstream|downstream|translated] sequence for a group of genes.

I need my data in one very specific file format to support a legacy platform.

I need to do <this generic task> over and over again. It’s killing me and is a waste of my time. Help!

Many junior scientists percolating through the ranks lack the basic skills to address such questions. (I’ll talk about old dogs and new tricks in a subsequent post.) More troubling, they often lack the initiative to tackle rudimentary informatics problems. These include common tasks like collecting and collating data from diverse sources, searching a wiki, reading a mailing list archive, or hacking a pre-existing script to suit a new purpose.

Bioinformatics is here to stay. Get used to it.

Ten or fifteen years ago, many research institutions displayed significant resistance to (and significant ignorance about) the field of bioinformatics. Was it really science? Was it sufficiently hypothesis-driven? How did it fit into the mission of a research institute or primarily undergraduate teaching environment? Happily, that resistance has been overcome at most institutions.

Learning bioinformatics isn’t the same as learning a transient and fleeting laboratory skill. Becoming proficient at running Southern blots or learning a protein purification process might help a student address the discrete questions of their thesis. But in the long term, these are disposable skills learned at great cost.

Not so with bioinformatics. Bioinformatics is a way of thinking. It’s a critical process of organizing information that spills over into many aspects of modern research life. It’s also very easy to develop a useful skill set with a very small time investment.

Frustratingly, many students still have a mental block about programming. They’ve learned (through assimilation and not experience) that programming is difficult. Or they’ve been trained to expect a convenient web interface for everything they need to do. In an ideal world, there would be a web interface for everything. This isn’t an ideal world.

Why has bioinformatics education failed?

I believe that current efforts in bioinformatics education have failed for three reasons.

First, and most fundamentally, bioinformatics training still isn’t universally available. Because of the initial resistance to the field, many institutions still lack qualified personnel capable of teaching entry- and intermediate-level bioinformatics courses.

Second, when bioinformatics training is offered, it’s often as an elective and not considered part of the core curriculum.

Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, especially if that course is the only exposure to the field the student will have.

Can we fix bioinformatics education?

Yes. Look, it’s easy. Students need primer courses on basic skills first. And those courses need to be MANDATORY. Maybe drop the radiation safety course if there isn’t time. Who uses radioactivity anymore anyways? Here are the three core areas that I think all students in cellular & molecular biology, genetics, and related subfields need to master to succeed.

Core Area 1: Data Discovery

Data discovery refers to a related set of knowledge and skills. What data is available and where can it be found? How can it be retrieved? What if there isn’t a web interface or the data needs to be fetched on a routine basis? Being able to answer such questions forms the basis for programmatically accessing and managing data.

Students should learn how to access common data repository structures like FTP sites, web-based data mining interfaces, wikis, and APIs. They should learn skills for programmatically mining data repositories by learning how to write basic web spiders.
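To make the spider idea concrete, here is a minimal sketch of its first step: pulling the links out of a page so they can be fetched one by one. The helper name `extract_links` and the sample HTML are made up for illustration, and the pattern only handles double-quoted `href` attributes. A real spider would fetch pages with something like curl or wget and use a proper HTML parser.

```shell
#!/bin/sh
# extract_links: read HTML on stdin and print the value of every
# href="..." attribute, one per line.
# Hypothetical helper for illustration only; it handles double-quoted
# attributes and is no substitute for a real HTML parser.
extract_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Example: list the files linked from a (made-up) directory index page.
sample_page='<a href="genes.gff3">genes</a> <a href="proteins.fa">proteins</a>'
printf '%s\n' "$sample_page" | extract_links
```

Each line of output is a candidate URL to download or to feed back into the same function, which is all a basic spider really does.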

Core Area 2: Data Management

Naming files and datasets consistently and unambiguously is rarely discussed. Nor is data organization and management. These skills are critical for effective analysis, for communication and publication, and for reproducibility.

Boring? Perhaps. But it is absolutely shocking what file naming and management schemes scientifically minded people have created.

Effective data management is not always intuitive. But there are conventions and strategies that can be immensely helpful for transparency, data sharing, and interoperability. Being able to programmatically manage data files is also incredibly useful and a great time saver: rearranging directories, renaming files, archiving files, basic I/O redirection. This is not just for bioinformatics per se, but applies to many areas of biology such as managing confocal images, for example.
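As a rough sketch of what programmatic file management can look like, the function below renames every file in a directory to lowercase with underscores instead of spaces. The name `normalize_names` and the naming convention itself are assumptions for the sake of the example, not a recommended standard.

```shell
#!/bin/sh
# normalize_names DIR: rename each regular file in DIR so that its name is
# lowercase and uses underscores instead of spaces.
# Hypothetical helper; the naming scheme is only an example convention.
normalize_names() {
    dir=$1
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        old=$(basename "$path")
        new=$(printf '%s' "$old" | tr 'A-Z' 'a-z' | tr ' ' '_')
        # Skip names that are already normalized and never overwrite a file.
        if [ "$old" != "$new" ] && [ ! -e "$dir/$new" ]; then
            mv "$path" "$dir/$new"
        fi
    done
}

# Example on a scratch directory with a made-up confocal image name.
workdir=$(mktemp -d)
touch "$workdir/Confocal Image 01.TIF" "$workdir/notes.txt"
normalize_names "$workdir"
ls "$workdir"
rm -rf "$workdir"
```

The same loop structure works for archiving or rearranging directories; only the command inside the `if` changes.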

Core Area 3: Data Transmogrification

Finally, up-and-coming scientists should be able to easily convert files from one format into another.

Again, boring. But useful? You bet. Cast off your Excel shackles.
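For instance, here is a minimal sketch of one such conversion: turning a two-column, tab-delimited file (identifier, sequence) into FASTA. The function name `tsv_to_fasta` and the gene names are hypothetical, and the sketch assumes one sequence per line with no header row.

```shell
#!/bin/sh
# tsv_to_fasta: convert tab-delimited "identifier<TAB>sequence" lines on
# stdin into FASTA records on stdout. Lines with fewer than two fields
# (including blank lines) are skipped.
# Hypothetical helper; assumes no header row.
tsv_to_fasta() {
    awk -F '\t' 'NF >= 2 { printf ">%s\n%s\n", $1, $2 }'
}

# Example with two made-up genes.
printf 'geneA\tATGGCC\ngeneB\tATGTTT\n' | tsv_to_fasta
```

A one-line awk program like this is often all it takes to replace a round of copying and pasting in a spreadsheet.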

A quick note to current graduate level students

Are you a graduate student in cell biology, molecular biology, biochemistry, or genetics (or related subfields)?

You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.

Community annotation — by any name — still isn’t a part of the research process. It should be.

In order for community annotation efforts to succeed, they need to become part of the established research process: mine annotations, generate hypotheses, do experiments, write manuscripts, submit annotations. Rinse and repeat.

A few weeks ago, I posted the following tweet:

Bioinformatics people like to orate about "community annotation". I've never heard a biologist use the phrase. Therein lies the problem.

A few retweeters responded that in their particular realm of bioinformatics, community annotation was called “community curation” or a “jamboree” and they’ve had various degrees of success. Points taken and effort applauded.

The real essence of my tweet was that community annotation — regardless of what it is called — largely fails or is undertaken on a very small scale because it simply isn’t a priority for biologists.

For someone working at the bench, community annotation doesn’t even make the long list of things to do: conducting experiments, writing manuscripts and grants, mentoring, sitting on committees, teaching. Contributing to community annotation efforts simply does not make the cut.

How might we fix this?

1. Top-down emphasis on the importance of community annotation.

Community annotation isn’t required by publishers or funding agencies except to the most minimal degree (e.g., submission of sequences). This needs to change. By making community annotation part of the process of doing research, the research itself will become more reproducible, more accessible to a broader audience, and more stable over time. It should be complementary to writing a manuscript.

Publishers benefit because extracted entities become markup targets to enhance their online product. Funding agencies benefit since having primary authors and domain experts submit annotation suits the mission of transparency and reproducibility and has a presumed efficiency over third-party curation.

2. Better tools.

The tools for community annotation are embryonic and do not match the user experience people have come to expect in the Facebook / Pinterest / Instagram / Google Docs era. Bioinformatics teams need to begin employing user interface, user experience, and graphic design professionals to build friendlier, more efficient, and more beautiful tools to encourage participation.

3. Recognition.

Again, in an effort to encourage participation, we need to recognize the efforts of people who do contribute. This system must have professional currency to it, akin to writing a review paper, and contributions should be citable for two reasons. First, it adds legitimacy to the contribution. It’s now part of the scientific record that can be extended by other researchers. Second, the primary contributor can now list the contribution on CVs and in the tenure or job performance review process.

Nanopublications and microattribution represent the most promising avenues for providing suitable recognition with scientific legitimacy that maps to the current academic and professional status quo.

Running the Generic Genome Browser under PSGI/Plack

Here’s a simple approach for installing and running a local instance of GBrowse, leveraging the PSGI/Plack webserver <-> web application stack. You don’t need root access, you don’t need Apache, and you don’t need to request any firewall exceptions (for now).

Background

Both the current implementation and installer of GBrowse are loosely tied to Apache. By loosely, I mean that the installer generates suitable configuration and assumes installation paths as if the instance will be run under Apache. The implementation is tightly tied to the CGI specification; it’s a suite of CGI scripts. Although GBrowse will run under any webserver that implements the CGI specification (are there any that DON’T?), this approach increases the administrative effort required for running a local instance, increases the complexity of configuration, makes it more difficult to run GBrowse under other environments, and makes it impossible to leverage powerful advances in Perl web application development.

Enter PSGI (the Perl Web Server Gateway Interface), a specification for glueing Perl applications to webservers. Plack is a reference implementation of this specification. PSGI as implemented by Plack makes it simple to run Perl-based applications (even CGI-based ones like GBrowse) in a variety of environments.

In other words, PSGI abstracts the request/response cycle so that you can focus on your application. Running your application under CGI, FastCGI, or mod_perl is just a matter of changing the application handler. The core Plack distribution provides a number of handlers out of the box (CGI, FCGI, mod_perl, for example) and even includes a lightweight webserver (HTTP::Server::PSGI) which is perfect for development. Other webservers also implement the PSGI specification, including the high-performance preforking server Starman.

You can also do cool things via middleware handlers, like mapping multiple applications to different URLs with ease (how about running the last 10 versions of GBrowse, all without touching Apache config or dealing with library conflicts?), serving static files, mangling requests and responses, etc.

What this isn’t (yet)

This isn’t a rewrite of GBrowse using PSGI. It’s just some modifications to the current GBrowse to make it possible to wrap the CGI components so that they can be used via servers that implement the PSGI specification. There is a project to rewrite GBrowse as a pure PSGI app. Stay tuned for details.

Conventions

  1. Installation root.
     Our working installation root is configured via the environment variable GBROWSE_ROOT.

  2. No root privileges required.
     You do not need to be root. Ever. In fact, one of the great advantages of this approach is the ease with which you can install a local instance.

  3. Self-contained, versioned installation paths.
     This tutorial installs everything under a single directory for simplified management and configuration. This path corresponds to the version of GBrowse being installed.

     The current version of GBrowse is specified by an environment variable (GBROWSE_VERSION). If you want to use the same installation path from release to release, you can also create and adjust symlinks as necessary (~/gbrowse/current -> ~/gbrowse/gbrowse-2.40, for example, and set GBROWSE_VERSION=current). This isn’t required, but it means that you won’t need to set GBROWSE_VERSION every time you update to a new version of GBrowse. At any rate, maintaining installations by version is a Good Practice and makes it easy to revert to older versions should the need arise.

  4. Each installation has its own set of local libraries.
     In keeping with the self-contained non-privileged design gestalt, we’ll install all required libraries to a local path tied to the installed version of GBrowse ($GBROWSE_ROOT/$GBROWSE_VERSION/extlib). This makes it dead simple to run many possibly conflicting variants of GBrowse all with their own dedicated suite of libraries. Awesome.
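The symlink convention described above can be sketched as a small helper. The function name `point_current` is made up, and the `gbrowse-VERSION` directory naming simply follows the example path in the text; adjust both to match your own layout.

```shell
#!/bin/sh
# point_current ROOT VERSION: make ROOT/current a symlink to
# ROOT/gbrowse-VERSION.
# Hypothetical helper; the gbrowse-VERSION naming follows the example
# in the text and is an assumption about your directory layout.
point_current() {
    root=$1
    version=$2
    target="gbrowse-$version"
    [ -d "$root/$target" ] || { echo "missing $root/$target" >&2; return 1; }
    # -s symbolic, -f replace an existing link,
    # -n treat an existing "current" link as a file rather than following it
    ln -sfn "$target" "$root/current"
}

# Example on a scratch directory rather than a real installation.
demo_root=$(mktemp -d)
mkdir "$demo_root/gbrowse-2.40"
point_current "$demo_root" 2.40
ls -l "$demo_root/current"
rm -rf "$demo_root"
```

With a real installation you would run something like `point_current ~/gbrowse 2.40` and then `export GBROWSE_VERSION=current`. Reverting to an older release is just a matter of pointing the link back.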

Installation

  1. Set up your environment.

      # Set environment variables for your installation root and the version of GBrowse you are installing.
      > export GBROWSE_ROOT=~/gbrowse
      > export GBROWSE_VERSION=2.40

  2. Prepare your library directory.

      # You may need to install the local::lib library first (prefix with sudo for a system-wide install).
      > perl -MCPAN -e 'install local::lib'
      > mkdir -p ${GBROWSE_ROOT}/${GBROWSE_VERSION}/extlib
      > cd ${GBROWSE_ROOT}/${GBROWSE_VERSION}/extlib
      # Print the environment settings local::lib will use, then apply them to the current shell.
      > perl -Mlocal::lib=./
      > eval "$(perl -Mlocal::lib=./)"

  3. Check out the GBrowse fork with modifications for running under PSGI/Plack.

      > cd ${GBROWSE_ROOT}
      > mkdir src ; cd src
      > git clone git@github.com:tharris/GBrowse-PSGI.git
      > cd GBrowse-PSGI
      # Here, the wwwuser is YOU, not the Apache user.
      > perl Build.PL --conf         ${GBROWSE_ROOT}/${GBROWSE_VERSION}/conf \
                      --htdocs       ${GBROWSE_ROOT}/${GBROWSE_VERSION}/html \
                      --cgibin       ${GBROWSE_ROOT}/${GBROWSE_VERSION}/cgi \
                      --wwwuser      $LOGNAME \
                      --tmp          ${GBROWSE_ROOT}/${GBROWSE_VERSION}/tmp \
                      --persistent   ${GBROWSE_ROOT}/${GBROWSE_VERSION}/tmp/persistent \
                      --databases    ${GBROWSE_ROOT}/${GBROWSE_VERSION}/databases \
                      --installconf  n \
                      --installetc   n
      > ./Build installdeps   # Be sure to install all components of the Plack stack:

          Plack
          Plack::App::CGIBin
          Plack::App::WrapCGI
          Plack::Builder
          Plack::Middleware::ReverseProxy
          Plack::Middleware::Debug
          CGI::Emulate::PSGI
          CGI::Compile

      # Should you need to adjust any values, run
      > ./Build reconfig
      > ./Build install

     Note: the current installer script SHOULD NOT require a root password if using local paths like this example. When it asks if you want to restart Apache, select NO. It’s not relevant for us.

  4. Fire up a Plack server using plackup.

     The Build script will have installed a suitable .psgi file at conf/GBrowse.psgi. Launch a simple Plack HTTP server via:

      > plackup -p 9001 ${GBROWSE_ROOT}/${GBROWSE_VERSION}/conf/GBrowse.psgi
      # Open http://localhost:9001/

     Note: By default, plackup will use HTTP::Server::PSGI.

Where To From Here

PSGI/Plack is really powerful. Here are some examples that take advantage of configuration already in the conf/GBrowse.psgi file.

Enable the Plack debugging middleware:

   > export GBROWSE_DEVELOPMENT=true
   > plackup -p 9001 ${GBROWSE_ROOT}/${GBROWSE_VERSION}/conf/GBrowse.psgi
   # Visit http://localhost:9001/ and see all the handy debugging information.

Run GBrowse under the preforking, lightweight HTTP server Starman:

   > perl -MCPAN -e 'install Starman'
   > starman -p 9001 ${GBROWSE_ROOT}/${GBROWSE_VERSION}/conf/GBrowse.psgi
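If you launch Starman often, a small wrapper can save some typing and catch a misconfigured environment early. What follows is only a sketch: the function name `start_gbrowse` and the defaults (port 9001, 4 workers) are my assumptions, not part of GBrowse or Starman. It relies on Starman’s `--workers` and `-p` options and on the GBROWSE_ROOT and GBROWSE_VERSION variables set during installation.

```shell
#!/bin/sh
# start_gbrowse [PORT] [WORKERS]: launch GBrowse under Starman after checking
# that the .psgi file from the install step is where we expect it.
# Hypothetical wrapper; the defaults of port 9001 and 4 workers are assumptions.
start_gbrowse() {
    port=${1:-9001}
    workers=${2:-4}
    psgi="${GBROWSE_ROOT:-}/${GBROWSE_VERSION:-}/conf/GBrowse.psgi"
    if [ ! -f "$psgi" ]; then
        echo "Cannot find $psgi; check GBROWSE_ROOT and GBROWSE_VERSION" >&2
        return 1
    fi
    starman --workers "$workers" -p "$port" "$psgi"
}

# Usage (not run here, since it starts a server):
#   start_gbrowse 9001 8
```

The check up front means a typo in GBROWSE_VERSION produces a clear message instead of a Starman startup error.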