Todd Harris, PhD

Facilitating scientific discovery at the intersection of genetics, genomics, bioinformatics, big data, cloud computing, and open science.


How do we assess the value of biological data repositories?

May 18, 2015 By Todd Harris

In an era of constrained funding and shifting focus, how do we effectively measure the value of online biological data repositories? Which should receive sustained funding? Which should be folded into other resources? What efficiencies can be gained over how these repositories are currently operated?

Every day, it seems, a new biological data repository is launched. Flip through the annual NAR Database issue if you doubt it.

Quick aside: I want to differentiate repositories from resources. Repositories typically come in two varieties. The first collects primary data in support of a single lab or a small consortium of labs. The second culls data from the scientific literature or collects it via author submissions before, during, or after the publication process. Resources may share some characteristics of repositories but are usually more tool-oriented, aiming to process data that users bring to them.

Encouraging the creation of a large number of repositories has been an important development in the rapid advancement of bioinformatics. These repositories, in turn, have played a critical role in the genomic- and post-genomic research world.

The current community of biological repositories allows for experimentation and innovation in data modeling and user interfaces. They provide opportunities for education. And they let small labs participate in the grand process of Scientific Curation: the documentation of scientific progress outside of the traditional prose-based publication narrative. We should continue to carefully create new repositories when warranted for the innovation and educational opportunities that they present.

On the flip side, these repositories are often brittle: graduate students and postdocs move on, creating knowledge vacuums. Elements of the repository break under software upgrades and security patches. Data becomes unreliable (e.g., as genomic coordinates are refined). Interest declines, yet the cost of maintaining the resource remains. And having many repositories carrying slightly different versions of the same data introduces confusion into downstream analyses and hinders reproducibility.

Clearly, we need an effective way of measuring the reach and value of biological data repositories. When a repository falls below a certain threshold, its funding should be decreased or removed. The remaining funds should be used (or reallocated if necessary) to port the repository to a parent resource for long-term maintenance.

How can we determine the value of a biological repository?

1. Page views.

Simple metrics like page views, taken in isolation, are wholly insufficient for assessing the value of a repository. Each repository, for example, may have different tolerances for robots, different definitions of what constitutes a robot, or different architectures that diminish the importance of page views. Page views are only one element to take into account. I personally believe they are one of the least effective ways of defining the value of a repository, and I worry that too much emphasis might be placed on them.
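To see how much the robot definition alone matters, here is a toy sketch. The log entries and bot patterns are entirely hypothetical, invented for illustration; the point is that two repositories applying different filters to identical traffic report different "page views":

```python
import re

# Hypothetical access-log entries: (path, user_agent) pairs.
LOG = [
    ("/gene/unc-22", "Mozilla/5.0 (X11; Linux x86_64)"),
    ("/gene/unc-22", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
    ("/blast",       "python-requests/2.31"),
    ("/gene/lin-4",  "Mozilla/5.0 (Macintosh; Intel Mac OS X)"),
]

# Two repositories might reasonably adopt two different bot definitions.
STRICT_BOTS  = re.compile(r"bot|crawler|spider|requests|curl", re.I)
LENIENT_BOTS = re.compile(r"bot|crawler|spider", re.I)

def page_views(log, bot_pattern):
    """Count requests whose user agent does not match the bot pattern."""
    return sum(1 for _, ua in log if not bot_pattern.search(ua))

print(page_views(LOG, STRICT_BOTS))   # 2: scripted clients counted as robots
print(page_views(LOG, LENIENT_BOTS))  # 3: scripted clients counted as users
```

Same traffic, a 50% difference in the headline number, before architecture or caching even enters the picture.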

2. Size of the user community.

How big is the user community? This should include registered users (if the repository has such an implementation), users determined via analytics, and the rate of citation.

3. Collective funding of the core user community.

How much money has the core user community of the repository been entrusted with? How much of that money would be lost or wasted if the repository were placed in maintenance mode or defunded altogether? There is no sense in throwing good money after bad, and sometimes tough choices must be made, but if the core research community — and all of the money vested in it — would be affected, funding choices for the repository should be weighed very, very carefully.

Don’t get me wrong: repositories that serve relatively small communities (with a relatively small amount of funding) have value. But the net value of such a repository cannot compare to one that serves a user community 10x the size with 100x the funding.

4. Frequency of use.

How frequently do members of the core user community access the repository? Is it essential for day-to-day life at the bench? Or is it used in a more referential manner, on a periodic basis?

5. Difficulty of assimilation.

How difficult and time consuming would it be to fold an existing repository into another? Repositories containing very specialized data, data models, or analysis tools that still support moderately sized communities with substantive funding could actually be MORE expensive to fold into another repository than to maintain independently.

In sum, page views are insufficient. We need to define the size of the user community, the extent of its funding, and how critical the repository is to continued progress of that community. Finally, we need to carefully weigh sustained funding vs. maintenance funding vs. assimilation costs.

Without knowing these five parameters, making consistent decisions about the value of biological repositories will be challenging at best.
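To make the idea concrete, the five parameters could in principle be combined into a single comparative score. What follows is purely a hypothetical sketch: the weights, the 0..1 normalization, and the example repositories are all invented for illustration, not a proposed policy instrument:

```python
from dataclasses import dataclass

@dataclass
class RepositoryMetrics:
    # All values assumed normalized to 0..1 by some agreed-upon scheme.
    page_views: float               # 1. traffic, after robot filtering
    community_size: float           # 2. registered users, analytics, citation rate
    community_funding: float        # 3. collective funding of the core user community
    usage_frequency: float          # 4. daily bench use vs. occasional reference
    assimilation_difficulty: float  # 5. cost of folding into a parent resource

# Illustrative weights only; real weights would need community consensus.
# Page views is deliberately the weakest signal; high assimilation
# difficulty counts toward sustained (rather than merged) funding.
WEIGHTS = {
    "page_views": 0.10,
    "community_size": 0.25,
    "community_funding": 0.25,
    "usage_frequency": 0.25,
    "assimilation_difficulty": 0.15,
}

def value_score(m: RepositoryMetrics) -> float:
    """Weighted sum of the five normalized metrics."""
    return sum(w * getattr(m, name) for name, w in WEIGHTS.items())

small = RepositoryMetrics(0.2, 0.1, 0.1, 0.3, 0.2)
large = RepositoryMetrics(0.8, 0.9, 0.9, 0.8, 0.6)
print(round(value_score(small), 3))  # 0.175
print(round(value_score(large), 3))  # 0.82
```

Even this toy version makes the hard part obvious: the score is only as defensible as the weights and the normalization behind it, which is exactly where community consensus would be needed.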

And even with these metrics in hand, the real question — and the one that is much more difficult to address — is:

What are the thresholds for continued support of a repository?
I will address my thoughts on this in an upcoming post.

What are your thoughts? What other metrics should we be using to determine the value of biological data repositories? Leave a comment or ping me on Twitter.


Filed Under: bioinformatics, careers, funding, science policy Tagged With: big data, bioinformatics, data, funding

23andme’s new Migraine Survey

October 16, 2009 By Todd Harris

23andme has unleashed a new survey that aims to assess the genetic basis of migraine.

What’s more, 23andme hopes to raise public awareness of migraine — and, hopefully, research funding for it. On The Spittoon, MikeM writes:

Two prominent migraine researchers have suggested that the blame for the slow progress in understanding migraine lies with a systemic lack of public funding for migraine research. They argue that the relatively recent, and incomplete, acceptance of migraine by the medical and research communities as a genuine medical problem, as opposed to mere melodrama, has led migraine’s funding to lag well behind that for diseases of similar impact. For example, they estimate that while $13.80 is spent for each sufferer of asthma, just 36 cents of federal research funds are spent per migraine sufferer.

The genetics of migraine are also only partially understood. That’s where our new survey comes in. Our community-based research program 23andWe seeks to empower the public to engage in genetic research from the ground up. We know our efforts cannot substitute for proper federal support of migraine research, but evidence of great public interest, plus a new finding or two, would add to our understanding of the disease and potentially send a message to Washington.

I have experienced the “migraine as melodrama” bias myself. See, I’ve suffered from migraines since I was in sixth grade. I still remember describing the visual symptoms and the ensuing headache at a time when migraine was even more poorly understood than it is today. My list of symptoms is now easily recognized as migraine: aura, sensitivity to light and sound, nausea, duration.

And although there are phenomenal treatments now available, I don’t take a thing. I’ve come to see migraines as a way of flushing out the pipes, a neurological reset that helps me think more clearly and with greater creativity.

You don’t need to be a 23andme subscriber to take the survey.


Filed Under: personal genomics Tagged With: 23andme, funding, migraine, survey

Welcome!
My name is Todd Harris. A geneticist by training, I now work at the intersection of biology and computer science developing tools and systems to organize, visualize, and query large-scale genomic data across a variety of organisms.

I'm driven by the desire to accelerate the pace of scientific discovery and to improve the transparency and reproducibility of the scientific process.
