The Top Five 2017 AWS Re:Invent Announcements Impacting Bioinformatics

The sixth annual Amazon Web Services (AWS) Re:Invent conference was held last week in Las Vegas. As in years past, the 2017 Re:Invent was a dizzying week of new announcements, enhancements to existing products, and interesting prognostications on the future of cloud computing.

The first Re:Invent conference was held in 2012. 6000 attendees gorged themselves on keynotes, sessions, hackathons, bootcamps, cloud evangelism, and video games. Not much has changed, except for the number of attendees. 2017 boasted 43,000 attendees with a conference campus spread across multiple venues on the Las Vegas strip. It was not difficult to get your 10,000 steps in this year.

I’ve been fortunate to attend Re:Invent every year. In 2012, I’d already been using AWS for nearly 5 years and was thoroughly convinced of its utility and value. In those early days, there was still a lot of reluctance and skepticism about the cloud, particularly in academic settings. To some extent, these biases still exist, which partially explains the slower uptake of the cloud for academic projects. By now, I think it’s overwhelmingly clear that academic compute and basic research projects should be leveraging the many benefits of building and deploying on the cloud.

Without further ado, here are my top five announcements at the 2017 AWS Re:Invent impacting bioinformatics.

1. AWS Sagemaker
AWS Sagemaker is a fully managed service for building, training, and deploying machine learning workflows in the AWS cloud. Machine learning has always played an important role in bioinformatics. Simplifying the training and deployment of ML workflows will have a profound impact on bioinformatics and big data. For one, Sagemaker offers the opportunity to introduce ML approaches to a broader audience, and to a broader range of research topics. Of the hundreds of announcements at Re:Invent, Sagemaker is the one I’m most excited to put to use.
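
To make that a little more concrete, here’s a minimal sketch of what a Sagemaker train-and-deploy round trip might look like with the SageMaker Python SDK. The container image, S3 paths, IAM role, and instance types are placeholders, and argument names have shifted between SDK releases, so treat this as illustrative rather than definitive.

```python
# Minimal, illustrative sketch of a Sagemaker train-and-deploy cycle.
# The image, S3 paths, role ARN, and instance types are placeholders, and
# argument names differ slightly across versions of the SageMaker Python SDK.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Point an Estimator at a training container and an S3 location for model output.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/variant-classifier:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bioinformatics-bucket/models/",
    sagemaker_session=session,
)

# Launch a managed training job against data already staged in S3...
estimator.fit({"train": "s3://my-bioinformatics-bucket/training-data/"})

# ...then stand up a real-time HTTPS endpoint for predictions.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```

The appeal is that the heavy lifting (provisioning training hardware, tracking model artifacts, hosting an endpoint) is handled by the service rather than by your own scripts.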

2. Amazon Neptune
Bioinformatics is all about highly connected data. These relationships are often a bear to model in relational database management systems. Graph databases are a perfect fit for biological data. Amazon Neptune is the latest entry into the crowded graph database space. Many of the commercial options currently available force difficult decisions and raise significant issues of cost and lock-in. Neptune is still in a preview phase and I haven’t had any direct interaction with it, so I can’t address how it will perform against these challenges. However, given its integration with AWS, I expect its rate of adoption to increase rapidly. As a highly available and scalable managed database supporting graph APIs and designed for the cloud, Neptune could be an amazing tool for bioinformatics projects.
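
For a sense of what working with it might look like, here’s a small sketch using the gremlin_python client (Neptune speaks both Gremlin and SPARQL). The endpoint, vertex labels, and property names here are invented for illustration; I haven’t run this against Neptune itself.

```python
# Illustrative sketch: query a hypothetical gene-interaction graph in Neptune
# via Gremlin. The endpoint, labels, and property names are invented.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection(
    "wss://my-neptune-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)
g = traversal().withRemote(conn)

# Find the symbols of genes that interact with daf-16.
partners = (
    g.V().has("gene", "symbol", "daf-16")
     .out("interacts_with")
     .values("symbol")
     .toList()
)
print(partners)

conn.close()
```

Traversals like this are exactly the kind of “hop across relationships” query that gets painful as a pile of SQL joins.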

3. AWS Fargate
AWS Fargate promises to bring the serverless revolution to containers. Containers already have a strong presence in bioinformatics and have greatly simplified the maintenance and deployment of applications that may be, ahem, short on documentation. Still, they’ve required managing the underlying infrastructure. Fargate is a launch type for Amazon ECS that removes that burden: you don’t have to define an instance type or family, or manage scaling or clusters. Just define CPU and memory, IAM, and networking, and let Fargate handle the infrastructure (a rough sketch follows below). While we’re on the subject of containers, AWS also introduced ECS for Kubernetes (EKS). Although it doesn’t rank in my top five, it does bear mention here.
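
Here’s roughly what that “just define CPU, memory, IAM, and networking” workflow might look like with boto3, launching a containerized aligner on Fargate. The task family, image, role, subnet, and security group IDs are all placeholders.

```python
# Rough sketch: run a containerized tool on Fargate with boto3.
# Task family, image, role ARN, subnet, and security group IDs are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Register a task definition. CPU, memory, and the container image are all
# Fargate needs to know; there are no instance types or EC2 hosts to manage.
ecs.register_task_definition(
    family="bwa-alignment",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",     # 1 vCPU
    memory="4096",  # 4 GB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "bwa",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/bwa:latest",
            "essential": True,
        }
    ],
)

# Launch it with the FARGATE launch type; networking is the only other input.
ecs.run_task(
    cluster="default",
    launchType="FARGATE",
    taskDefinition="bwa-alignment",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```
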

4. AWS Comprehend
Did I say that I was most excited about Sagemaker? Well, I’m also pretty psyched about the introduction of AWS Comprehend. Comprehend is a natural language processing (NLP) managed service that relies on machine learning to process text. At the end of the day, much of the most interesting information in bioinformatics is buried in text. Comprehend offers a really cool way to get at that information. It can extract key phrases, known vocabularies, and custom lexica. It also does expected things like weighting occurrences and displaying them in context. Of course, it has an API and integrates with other AWS services, too.
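
As a quick sketch, here’s what pulling key phrases and entities out of a snippet of abstract text might look like with boto3. The abstract is just a stand-in; the service returns generic entity types rather than biology-specific ones out of the box.

```python
# Sketch: extract key phrases and entities from a snippet of abstract text
# with Amazon Comprehend via boto3. The abstract text is a placeholder.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

abstract = (
    "Loss of daf-2 signaling extends lifespan in C. elegans "
    "in a daf-16-dependent manner."
)

key_phrases = comprehend.detect_key_phrases(Text=abstract, LanguageCode="en")
entities = comprehend.detect_entities(Text=abstract, LanguageCode="en")

# Each result carries the matched text plus a confidence score.
for phrase in key_phrases["KeyPhrases"]:
    print("key phrase:", phrase["Text"], round(phrase["Score"], 3))

for entity in entities["Entities"]:
    print("entity:", entity["Text"], entity["Type"], round(entity["Score"], 3))
```
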

5. AWS Glacier Select
Last but not least is AWS Glacier Select. Really, you ask? A storage enhancement made my top five list? Yes. Here’s why. Biology (and bioinformatics) is about data. Data is expensive to generate and expensive to keep around. You either pay a lot for storage, throw your data away and commit to regenerating it later, or place it in essentially inaccessible archival storage.
That’s where Glacier Select comes in. Glacier is an AWS archival service for data that you don’t need immediately accessible. But Glacier Select actually lets you execute an SQL query against a Glacier archive. Since it’s archival storage, you also specify when you would like your results returned. Standard queries take 3–5 hours, and results can be deposited in an S3 bucket. Of course, there’s an API that you can build into existing applications. I’m super psyched about cheap archival storage that can still be queried and think this has many applications in bioinformatics.
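
Here’s roughly how that might look with boto3. The vault name, archive ID, and bucket are placeholders, and the exact field names inside the select parameters are worth double-checking against the boto3 documentation before relying on this.

```python
# Rough sketch: run an SQL query against a CSV archive in Glacier with boto3.
# Vault name, archive ID, and bucket are placeholders; double-check the
# SelectParameters field names and expression syntax against the boto3 docs.
import boto3

glacier = boto3.client("glacier", region_name="us-east-1")

glacier.initiate_job(
    accountId="-",  # "-" means the account that owns the credentials
    vaultName="variant-call-archive",
    jobParameters={
        "Type": "select",
        "ArchiveId": "EXAMPLE-ARCHIVE-ID",
        "Tier": "Standard",  # results arrive in roughly 3-5 hours
        "SelectParameters": {
            "ExpressionType": "SQL",
            "Expression": "SELECT s._1, s._2 FROM archive s WHERE s._3 = 'PASS'",
            "InputSerialization": {"csv": {"FileHeaderInfo": "NONE"}},
            "OutputSerialization": {"csv": {}},
        },
        # Query results are written to S3 when the job completes.
        "OutputLocation": {
            "S3": {
                "BucketName": "my-bioinformatics-bucket",
                "Prefix": "glacier-select-results/",
            }
        },
    },
)
```
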

There were many, many other announcements that have direct applications in bioinformatics. I’d highly encourage you to watch the keynotes from Andy Jassy and Werner Vogels to dive in a little deeper.

The Worm Reader’s Gazette: audio transcripts of the scientific literature

Digital media has had a profound effect on scientific publishing. But by and large, consuming that literature is still a tedious process of reading that does not lend itself well to today’s active lifestyles. Do you have a long commute or need your hands and eyes free for other tasks like pipetting or microinjecting? Then we think you’ll love the Worm Reader’s Gazette.

The Worm Breeder’s Gazette is pleased to announce the Worm Reader’s Gazette: audio recordings of scientific literature read aloud by lead authors, now in private, invite-only alpha testing.

At the Worm Reader’s Gazette, lead authors can create and submit audio recordings of their publications. We’re starting with seminal papers in the C. elegans field and adding new papers as they are published.

Making audio recordings a standard part of the manuscript submission process.

To improve the coverage of available articles, we are working with publishers to make audio recordings a required part of the manuscript submission materials.

Following acceptance and approval of the final galleys, web-based authoring tools will make it simple for authors to create and edit their audio recordings from their mobile phone.

Borrowing from the popular photo sharing service Instagram, special audio filters will add humor, drama, or august ambience as required by the tone and conclusions of your manuscript. These include the mysterious “Irreproducible Result”, the lo-fi “Graduate Seminar”, the visceral “Thesis Defense”, the noble “Nobel Address”, and many, many others.

We’ve included high quality audio samples that authors can drag and drop into their recording, too. Add dog barks after particularly important points, bomb explosions during and after paradigm shifting pronouncements, and audience applause whenever warranted.

Our natural language processing filters will automatically markup your recording with special auditory hyperlinks to primary databases. Users can “click” on these hyperlinks during playback using voice activation via Siri, Cortana, or Alexa: “Hey Siri, order the antibody mentioned in the fifth paragraph and have it sent priority overnight mail to my bench”. Yes, it’s that easy.

We hope you enjoy this new service. And we look forward to seeing — and hearing! — your next publication.

Announcing: The Bioresource Project Manager Directory

Are you a Project Manager or Team Leader of a biological resource, database, or repository?

Add yourself to the Bioresource Project Manager Directory.

Let’s get to know each other. There’s much to discuss. Start by adding yourself to the directory. Don’t worry, your information will be kept private.

Initially, I plan to create a mailing list (Google Groups, perhaps?).

And stay tuned for an upcoming announcement, one you’ll hopefully find exciting!

How do we assess the value of biological data repositories?

In an era of constrained funding and shifting focus, how do we effectively measure the value of online biological data repositories? Which should receive sustained funding? Which should be folded into other resources? What efficiencies can be gained over how these repositories are currently operated?

Every day, it seems, a new biological data repository is launched. Flip through the annual NAR Database issue if you do not believe this.

Quick aside: I want to differentiate repositories from resources. Repositories are typically of two varieties. The first collects primary data in support of a single lab or a small consortium of labs. The second culls data from the scientific literature or collects it through author submissions before, during, or after the publication process. Resources may have some characteristics of repositories but are usually more tool oriented and aim to process data that users bring to them.

Encouraging the creation of a large number of repositories has been an important development in the rapid advancement of bioinformatics. These repositories, in turn, have played a critical role in the genomic- and post-genomic research world.

The current community of biological repositories allows for experimentation and innovation in data modeling and user interface design. These repositories provide opportunities for education. And they let small labs participate in the grand process of Scientific Curation: the documentation of scientific progress outside of the traditional prose-based publication narrative. We should continue to carefully create new repositories, when warranted, for the innovation and educational opportunities that they present.

On the flip side, these repositories are often brittle: graduate students and postdocs move on and create knowledge vacuums. Elements of the repository break due to software upgrades and security patches. Data becomes unreliable (e.g., as genomic coordinates are refined). Interest declines, yet the costs of maintaining the resource remain. And having many repositories carrying slightly different versions of data also introduces confusion for downstream analyses and hinders reproducibility.

Clearly, we need an effective way of measuring the reach and value of biological data repositories. When a repository falls below a certain threshold of value, its funding should be decreased or removed. Remaining funds should be used (or allocated if necessary) to port the repository to a parent resource for long-term maintenance.

How can we determine the value of a biological repository?

1. Page views.

Simple metrics like page views, taken in isolation, are wholly insufficient for assessing the value of a repository. Each repository, for example, may have different tolerances for robots, different definitions of what constitutes a robot, or different architectures that mitigate the importance of page views. Page views are only one element that should be taken into account. I personally believe they are one of the least effective ways of defining the value of a repository and worry that too much emphasis might be placed on them.

2. Size of the user community.

How big is the user community? This should include registered users (if the repository has such an implementation), users determined via analytics, and the rate of citation.

3. Collective funding of the core user community.

How much money has the core user community of the repository been entrusted with? How much of that money would be lost or wasted if the repository were to be placed in a maintenance mode or defunded altogether? There is no sense in throwing good money after bad, and sometimes tough choices must be made. But if the core research community, and all of the money vested in it, would be affected, funding choices for the repository should be weighed very, very carefully.

Don’t get me wrong: repositories that serve relatively small communities (with a relatively small amount of funding) have value. But the net value of such repositories cannot compare to one that serves a user community 10x the size with 100x the funding.

4. Frequency of use.

How frequently do members of the core user community access the repository? Is it essential for day-to-day life at the bench? Or is it used in a more referential manner, on a periodic basis?

5. Difficulty of assimilation.

How difficult and time consuming would it be to fold an existing repository into another? A repository containing very specialized data, data models, or analysis tools that still supports a moderately sized community with substantive funding could actually be MORE expensive to fold into another repository than to continue maintaining independently.

In sum, page views are insufficient. We need to define the size of the user community, the extent of its funding, and how critical the repository is to continued progress of that community. Finally, we need to carefully weigh sustained funding vs. maintenance funding vs. assimilation costs.

Without knowing these five parameters, making consistent decisions about the value of biological repositories will be challenging at best.

And even with these metrics in hand, the real question — and the one that is much more difficult to address — is:

What are the thresholds for continued support of a repository?
I will address my thoughts on this in an upcoming post.

What are your thoughts? What other metrics should we be using to determine the value of biological data repositories? Leave a comment or ping me on Twitter.