Nearly 15 years after completion of the human genome, undergraduate and graduate programs still aren’t adequately training future scientists with the basic bioinformatics skills needed to be successful in the “big data in biology” era. Why?
As a project manager and developer of a long running model organism database (and a former bench scientist myself), I interact with biologists on a daily basis. Franky, I’m alarmed by what I see. Here are some examples of the types of questions I field on a daily basis:
I have a list of genes and I’d like to know the function of each.
I need all the [unspliced|spliced|upstream|downstream|translated] sequence for a group of genes.
I need my data in one very specific file format to support a legacy platform.
I need to do <this generic task> over and over again. It’s killing me and is a waste of my time. Help!
Many junior scientists percolating through the ranks lack the basic skills to address such questions. (I’ll talk about old dogs and new tricks in a subsequent post). More troubling, they often lack the core skills and initiative to tackle rudimentary informatics problems. These include common tasks like collecting and collating data from diverse sources, searching a wiki, reading a mailing list archive, or hacking a pre-existing script to suit a new purpose.
Bioinformatics is here to stay. Get used to it.
Ten or fifteen years ago, many research institutions displayed significant resistance to (and significant ignorance about) the field of bioinformatics. Was it really science? Was it sufficiently hypothesis driven? How did it fit into the mission of a research institute or primarily undergraduate teaching environment? Happily, that resistance has been overcome at most institutions.
Bioinformatics isn’t the same as learning a transient and fleeting laboratory skill. Becoming proficient at running Southern blots or learning a protein purification process might help a student address the discrete questions of their thesis. But in the long term, these are disposable skills learned at great cost.
Not so with bioinformatics. Bioinformatics is a way of thinking. It’s a critical process of organizing information that spills over into many aspects of modern research life. It’s also very easy to develop a useful skill set with a very small time investment.
Frustratingly, many students still have a mental block about programming. They’ve learned (through assimilation and not experience) that programming is difficult. Or they’ve been trained to expect a convenient web interface for everything they need to do. In an ideal world, there would be a web interface for everything. This isn’t an ideal world.
Why has bioinformatics education failed?
I believe that current efforts in bioinformatics education have failed for three reasons.
First, and most fundamentally, bioinformatics training still isn’t universally available. Because of the initial resistance to the field many institutions still lack qualified personnel capable of teaching entry and intermediate level bioinformatics courses.
Second, when bioinformatics training is offered, it’s often as an elective and not considered part of the core curricula.
Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, more so if that is the only exposure to the field the student will have.
Can we fix bioinformatics education?
Yes. Look, it’s easy. Students need primer courses on basic skills first. And it needs to be MANDATORY. Maybe drop the radiation safety course if there isn’t time. Who uses radioactivity anymore anyways? Here are the three core areas that I think all students in cellular & molecular biology, genetics, and related subfields need to succeed.
Core Area 1: Data Discovery
Data discovery refers to a related set of knowledge and skills. What data is available and where can it be found? How can it be retrieved? What if there isn’t a web interface or the data needs to be fetched on a routine basis? Being able to answer such questions forms the basis for programmatically accessing and managing data.
Students should learn how to access common data repository structures like FTP sites, web-based data mining interfaces, wikis, and APIs. They should learn skills for programmatically mining data repositories by learning how to write basic web spiders.
Core Area 2: Data Management
Naming files and datasets consistently and unambiguously is rarely discussed. Nor is data organization and management. These skills are critical for effective analysis, for communication and publication, and for reproducibility.
Boring? Perhaps. But it is absolutely shocking what file naming and management schemes scientifically minded people have created.
Effective data management is not always intuitive. But there are conventions and strategies that can be immensely helpful for transparency, data sharing, and interoperability. Being able to programmatically manage data files is also incredibly useful and a great time saver: rearranging directories, renaming files, archiving files, basic I/O redirection. This is not just for bioinformatics per se, but applies to many areas of biology such as managing confocal images, for example.
Core Area 3: Data transmogrification
Finally, up-and-coming scientists should be able to easily convert files from one format into another.
Again, boring. But useful? You bet. Cast off your Excel shackles.
A quick note to current graduate level students
Are you a graduate student in cell biology, molecular biology, biochemistry, or genetics (or related subfields)?
You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.