The Top Five 2017 AWS Re:Invent Announcements Impacting Bioinformatics

The sixth annual Amazon Web Services (AWS) Re:Invent conference was held last week in Las Vegas. As in years past, the 2017 Re:Invent was a dizzying week of new announcements, enhancements to existing products, and interesting prognostications on the future of cloud computing.

The first Re:Invent conference was held in 2012. 6000 attendees gorged themselves on keynotes, sessions, hackathons, bootcamps, cloud evangelism, and video games. Not much has changed, except for the number of attendees. 2017 boasted 43,000 attendees with a conference campus spread across multiple venues on the Las Vegas strip. It was not difficult to get your 10,000 steps in this year.

I’ve been fortunate to attend Re:Invent every year. In 2012, I’d already been using AWS for nearly 5 years and was thoroughly convinced of its utility and value. In those early days, there was still a lot of reluctance and skepticism of the cloud, particularly in academic settings. To some extent, these biases still exist, which partially explains the slower uptake of the cloud for academic projects. By now, I think it’s overwhelmingly clear that academic compute and basic research projects should be leveraging the many benefits of building and deploying on the cloud.

Without further ado, here are my top five announcements at the 2017 AWS Re:Invent impacting bioinformatics.

1. AWS Sagemaker
AWS Sagemaker is a fully managed service for building, training, and deploying machine learning workflows in the AWS cloud. Machine learning has always played an important role in bioinformatics. Simplifying training and deployment of ML workflows will have a profound impact on bioinformatics and big data. For one, Sagemaker offers the opportunity to introduce ML approaches to a broader audience, and to a broader range of research topics. Of the hundreds of announcements at Re:Invent, Sagemaker is the one I’m most excited to put to use.
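As a rough sketch of what driving Sagemaker through the API looks like, here is a training job parameter set in the shape boto3's create_training_job expects, built as a plain dict so it can be inspected without AWS credentials. Every name, bucket, role, and image below is a hypothetical placeholder.

```python
# Sketch of a SageMaker training job definition. Nothing here touches AWS;
# all identifiers are placeholders for illustration.
training_job = {
    "TrainingJobName": "variant-classifier-demo",        # hypothetical name
    "AlgorithmSpecification": {
        "TrainingImage": "<algorithm-container-image>",  # placeholder
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    "InputDataConfig": [{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/training-data/",    # hypothetical bucket
        }},
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/models/"},
    "ResourceConfig": {
        "InstanceType": "ml.m4.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
# With boto3 and credentials configured, this would be submitted via
#   boto3.client("sagemaker").create_training_job(**training_job)
```

The point of the managed service is that everything below this declaration (provisioning, training, tearing down) is Sagemaker's problem, not yours.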

2. Amazon Neptune
Bioinformatics is all about highly connected data. These relationships are often a bear to model in relational database management systems. Graph databases are a perfect fit for biological data. Amazon Neptune is the latest entry into the crowded graph database space. Many commercial options currently available force decisions and raise significant issues of cost and lock-in. Neptune is still in a preview phase and I haven’t had any direct interaction with it, so I can’t address how it will perform against these challenges. However, given its integration with AWS, I expect its rate of adoption to increase rapidly. As a highly available and scalable managed database supporting graph APIs and designed for the cloud, Neptune could be an amazing tool for bioinformatics projects.
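To see why graph models fit, here's a toy in-memory sketch of a gene interaction network and a multi-hop traversal, the kind of query that is painful as recursive SQL joins but natural in a graph store. This is purely illustrative; against Neptune you'd express it as a Gremlin or SPARQL traversal instead.

```python
# Toy property graph: genes as nodes, "interacts with" edges as adjacency
# lists. The gene names are illustrative examples from the C. elegans
# insulin signaling pathway.
edges = {
    "daf-2":  ["daf-16", "age-1"],
    "age-1":  ["akt-1"],
    "akt-1":  ["daf-16"],
    "daf-16": [],
}

def neighbors_within(graph, start, depth):
    """Collect every gene reachable from `start` in at most `depth` hops."""
    frontier, seen = {start}, set()
    for _ in range(depth):
        frontier = {n for g in frontier for n in graph.get(g, [])} - seen
        seen |= frontier
    return seen

print(sorted(neighbors_within(edges, "daf-2", 2)))
# → ['age-1', 'akt-1', 'daf-16']
```

In a graph database, that traversal stays a one-liner no matter how deep the network gets; in a relational schema, each extra hop is another self-join.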

3. AWS Fargate
AWS Fargate promises to bring the serverless revolution to containers. Containers already have a strong presence in bioinformatics and have greatly simplified the maintenance and deployment of applications that may be, ahem, short on documentation. Still, they’ve required managing the underlying infrastructure. Fargate is a launch type for Amazon ECS that simplifies launching containers without having to manage the underlying infrastructure. You don’t have to define instance type or family or manage scaling or clusters. Just define CPU and memory, IAM, and networking, and let Fargate handle the infrastructure. While we are on containers, AWS also introduced ECS for Kubernetes (EKS). Although it doesn’t rank in my top five, it does bear mention here.
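That "just define CPU and memory" declaration can be sketched as a task definition in the shape boto3's ECS register_task_definition expects. The family and image names are hypothetical; note there is no instance type or cluster sizing anywhere in it.

```python
# Sketch of a Fargate task definition, built as a plain dict. Registering
# it for real would require boto3 and AWS credentials.
task_definition = {
    "family": "blast-worker",                # hypothetical task family
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",                 # required for Fargate tasks
    "cpu": "256",                            # 0.25 vCPU
    "memory": "512",                         # MiB
    "containerDefinitions": [{
        "name": "blast",
        "image": "my-registry/blast:latest", # hypothetical container image
        "essential": True,
    }],
}
# With credentials configured:
#   boto3.client("ecs").register_task_definition(**task_definition)
```

Everything below those two numbers (which machines run the container, how they scale) is Fargate's concern.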

4. AWS Comprehend
Did I say that I was most excited about Sagemaker? Well, I’m also pretty psyched about the introduction of AWS Comprehend. Comprehend is a natural language processing (NLP) managed service that relies on machine learning to process text. At the end of the day, much of the most interesting information in bioinformatics is locked up in text. Comprehend offers a really cool way to get at that information. It can extract key phrases, known vocabularies, and custom lexica. It also does expected things like weighting occurrences and displaying them in context. Of course, it has an API and integrates with other AWS services, too.
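A minimal sketch of what key-phrase extraction looks like through the API: boto3's comprehend client exposes detect_key_phrases, and here we just assemble its arguments so the shape is visible without credentials. The abstract text is a made-up example.

```python
# Build the request for a key-phrase extraction call. The text is a
# hypothetical abstract snippet; no AWS call is made here.
abstract = (
    "The insulin/IGF-1 signaling pathway regulates lifespan "
    "and dauer formation in Caenorhabditis elegans."
)
request = {"Text": abstract, "LanguageCode": "en"}

# With credentials configured, the call would be roughly:
#   comprehend = boto3.client("comprehend")
#   response = comprehend.detect_key_phrases(**request)
# and response["KeyPhrases"] would list phrases with confidence scores
# and character offsets back into the source text.
```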

5. AWS Glacier Select
Last but not least is AWS Glacier Select. Really, you ask? A storage enhancement made my top five list? Yes. Here’s why. Biology (and bioinformatics) is about data. Data is expensive to generate and expensive to keep around. You either pay a lot for storage, throw your data away and commit to regenerating it later, or place it in essentially inaccessible archival storage.
That’s where Glacier Select comes in. Glacier is an AWS archival service for data that you don’t need immediately accessible. But Glacier Select actually lets you execute an SQL query against a Glacier archive. Since it’s archival storage, you also specify when you would like your results returned. Standard queries take 3–5 hours, and results can be deposited in an S3 bucket. Of course, there’s an API that you can build into existing applications. I’m super psyched about cheap archival storage that can still be queried and think this has many applications in bioinformatics.
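A sketch of what such a job looks like: boto3's glacier client starts a select with initiate_job, whose jobParameters carry the SQL expression, the retrieval tier, and the S3 destination. The archive ID, bucket, vault, and query below are all hypothetical.

```python
# Job parameters for a Glacier select, assembled locally for illustration.
# All identifiers are placeholders.
job_parameters = {
    "Type": "select",
    "ArchiveId": "EXAMPLE-ARCHIVE-ID",          # placeholder archive ID
    "Tier": "Standard",                         # results in roughly 3-5 hours
    "SelectParameters": {
        "ExpressionType": "SQL",
        "Expression": (
            "SELECT s.gene, s.tpm FROM archive s WHERE s.tpm > 100"
        ),                                      # hypothetical query/columns
        "InputSerialization": {"csv": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"csv": {}},
    },
    "OutputLocation": {
        "S3": {"BucketName": "my-results-bucket",   # hypothetical bucket
               "Prefix": "glacier-select/"}
    },
}
# With credentials configured, kicked off via:
#   boto3.client("glacier").initiate_job(
#       vaultName="my-vault", jobParameters=job_parameters)
```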

There were many, many other announcements that have direct applications in bioinformatics. I’d highly encourage you to watch the keynotes from Andy Jassy and Werner Vogels to dive in a little deeper.

Migrating to RDS: Converting MyISAM to InnoDB

If you want to leverage the RDS service on AWS, you’ll receive maximum benefit by converting MyISAM tables to InnoDB. Here’s a distillation of a useful approach outlined on another weblog on the Interweb.


# Create a backup of your database
mysqldump -u USER -p MYSQLDB | gzip -c > /mnt/backups/mysqldb.sql.gz


# Log in to your mysql instance and dump a .sql file to convert tables in batch
mysql> select concat('ALTER TABLE `',table_schema,'`.`',table_name,'` ENGINE=InnoDB;') from information_schema.tables where table_schema='mydb' and ENGINE='MyISAM' into outfile '/tmp/InnoBatchConvert.sql';
mysql> quit
shell> mysql -u root -p < /tmp/InnoBatchConvert.sql


# Confirm tables have been converted to InnoDB
mysql> select table_name, engine from information_schema.tables where table_schema = 'mydb';

An introduction to cloud computing for biologists (aka the 10-minute model organism database installation)

This tutorial will explain the basic concepts of cloud computing and get you up and running in minutes. No knowledge of system administration or programming is necessary. As an example, it describes how to launch your own instance of the model organism database WormBase.

Introduction to cloud computing

If you aren’t familiar with cloud computing, here’s all you need to know. At its simplest, cloud computing refers to using remote compute resources over the network as if they were a computer sitting on your desktop. These services are typically virtualized and used in an on-demand fashion.

Several vendors provide cloud computing options. Here, we’ll focus on
Amazon’s Elastic Compute Cloud (EC2).

On EC2, developers can create Amazon Machine Images (AMIs) which are essentially snapshots of a full computer system. For example, the WormBase AMI contains everything necessary to run WormBase — all software and databases with the operating system preconfigured.

Booting up an image is referred to as launching an “instance”. When you do so, you choose the size of the server to allocate (for example, how many cores and how much RAM) to run the instance with. You can start, stop, or reboot the instance at any time. Terminating the instance completely removes it from your account. The original reference AMI remains; you can launch a new instance from it any time. This is what Amazon means by elastic: you can provision and decommission servers with custom capacity in minutes, sidestepping overhead like data centers, surly IT departments, and draconian firewall regulations.

Amazon’s EC2 service is a “pay-for-what-you-use” service; running an instance is not free. You are charged nominal rates for 1) the size of the instance allocated; 2) the amount of disk space the instance requires even if it isn’t running; 3) the amount of bandwidth the instance consumes; 4) how long the instance is running.

A complicated model organism database like WormBase typically requires a “large” instance (see below). Running 24/7, the estimated cost would be approximately $2700/year. Costs can be mitigated by starting and stopping the instance as needed, pausing it in its current state; this is conceptually similar to putting a desktop computer to sleep. Alternatively, if you aren’t modifying the data on the server, you can safely terminate it when you are done, avoiding disk charges, too. Simply launch a new instance from the original WormBase AMI. Launching from an AMI requires slightly more time (several minutes) than restarting a stopped instance (under a minute). Purchasing a reserved instance in advance from Amazon further reduces the cost by approximately 30%. Caveat emptor: these are back-of-the-napkin calculations. Costs can vary dramatically, especially if you start making many, many requests to the website. Bandwidth charges for accessing the website are nominal.
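The arithmetic behind that back-of-the-napkin estimate looks like this. The $0.31/hour on-demand rate for a "large" instance is an assumed figure; real prices vary by instance family and region.

```python
# Back-of-the-napkin annual cost for a 24/7 instance. The hourly rate is
# an assumption for illustration, not a quoted AWS price.
hourly_rate = 0.31                   # assumed on-demand $/hour
hours_per_year = 24 * 365
on_demand_annual = hourly_rate * hours_per_year
# ~30% saving from committing to capacity in advance
advance_purchase_annual = on_demand_annual * 0.70

print(round(on_demand_annual))        # → 2716, in line with ~$2700/year
print(round(advance_purchase_annual)) # → 1901
```

Stopping the instance nights and weekends scales the first number down proportionally, which is exactly the elasticity argument above.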

Example: Personal Instances of WormBase through Amazon’s EC2

In the past, running a private instance of WormBase has been a time-consuming process requiring substantial computer science acumen.

Today I’m happy to announce WormBase Amazon Machine Images (wAMIs, pronounced “whammys”) for Amazon’s Elastic Compute Cloud (EC2). The WormBase AMI makes it absolutely trivial to run your own private version of WormBase.

Running your own instance gives you:
* Dedicated resources
* A feature-rich data mining platform
* Privacy

Contents of the WormBase AMI

* The WS226 (and beyond) version of the database
* The (mostly) full WormBase website
* The Genome Browser with 10 species
* A wealth of pre-installed libraries for data mining (to be covered in a subsequent post)

The first WormBase AMI is missing a few features:
* WormMart
* BLAST

Launching your own instance of WormBase

Here’s a really bad screencast. You might want to read through the rest of the tutorial for details.

View the screencast in full size.

The general steps for launching an instance of a new AMI are as follows. Note that in the management console it is possible to execute many of these steps during the process of launching any one specific instance, too.

1. Sign up for an Amazon Web Services account

Sign up for an account at aws.amazon.com. You’ll need a credit card.

2. Create a keypair

Note: You can also complete this step when you launch your instance if you prefer.

When you launch an instance Amazon needs to ensure that you are who you say you are (read: that you have the ability to pay for the resources that you consume), as well as give you a mechanism for logging into the server. This authentication process is handled through the use of secret keys. Even if you only intend to use the web interface of WormBase and not log in directly to the server, you will still need to generate a keypair.

To do this, log in to your Amazon AWS account and click on the EC2 tab. In the left hand pane, click on “Keypairs”. You’ll see a small button labeled “Create Keypair”. Click, and create a new keypair. You can name it whatever you like. When you click continue, a file will be downloaded to your computer. You will need this file if you intend to log on to the server. Store it in a safe place, as others can launch services using your account if they get access to this file!

3. Configure a new security group

Note: You can also complete this step when you launch your instance if you prefer.

Security groups are a list of firewall rules for what types of requests your instances respond to. They can be standard services on standard ports (HTTP on port 80) or custom, and they can range from allowing the entire internet to a single IP address. They are a quick way to lock down who gets to use your instance. For now, we’ll create a security group that is very permissive.

Click “Create new group”, give the group a name and description. From the dropdown, select “HTTP”. Click Add Rule. Repeat, this time selecting SSH. Although not required, enabling SSH will allow us to actually log into the server to perform administrative or diagnostic tasks. Click Add Rule, then Save.
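The console steps above boil down to two ingress rules. Here they are in the shape boto3's ec2 authorize_security_group_ingress expects, built locally for illustration; the 0.0.0.0/0 CIDR is the "very permissive" choice, open to the whole internet.

```python
# Firewall rules matching the console steps: HTTP for the website, SSH
# for administration. Assembled as data only; applying them for real
# would go through boto3 with credentials configured.
ingress_rules = [
    {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},   # HTTP
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},   # SSH
]
# e.g. boto3.client("ec2").authorize_security_group_ingress(
#          GroupName="wormbase", IpPermissions=ingress_rules)
```

To lock things down later, replace 0.0.0.0/0 on the SSH rule with your own IP address in CIDR form.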

4. Find and launch an instance of the WormBase (WS226) AMI

Now we’re ready to launch our own instance. See the video tutorial for description.

5. Get the public DNS entry for your new instance

Your new instance is elastic; it gets a new IP address every time it is launched (although Amazon has services that let it retain a static address, too). You need to get the hostname so that you can connect to the server. Click on “Instances”, select the running instance, and in the bottom pane, find the “Public DNS” entry. Copy this entry, open a new tab in your browser and paste in the URI. It will look something like this:

ec2-50-17-41-111.compute-1.amazonaws.com
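If you prefer to fetch that hostname programmatically, it lives in the describe_instances response. The response below is a hand-made stand-in for what boto3's ec2 client returns; with credentials configured you'd call the real API instead.

```python
# Mock of a describe_instances response, trimmed to the fields we need.
# The instance ID is a placeholder.
response = {
    "Reservations": [{
        "Instances": [{
            "InstanceId": "i-0123456789abcdef0",   # hypothetical
            "PublicDnsName": "ec2-50-17-41-111.compute-1.amazonaws.com",
        }],
    }],
}
instance = response["Reservations"][0]["Instances"][0]
print(instance["PublicDnsName"])
# → ec2-50-17-41-111.compute-1.amazonaws.com
```

Remember that this name changes on every stop/start cycle unless you attach a static (Elastic IP) address.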

6. Stopping your instance

When you are done with your instance, shut it down by going to the EC2 tab > Instances. Select the instance and, from the “Instance Actions” drop-down or by right-clicking, select “Stop”. Your instance will be paused where it is. Repeat these steps, selecting “Start”, to restart it. Note: you will continue to accumulate charges for disk storage while the instance is stopped, but will not incur compute charges. Alternatively, you can choose to “terminate” the instance. Once you do so, be sure to visit “Volumes” and delete the EBS volume that had been attached to the instance; at 150GB, keeping it around costs about $7/month.

In a subsequent tutorial, I’ll show you how to go beyond the web browser to use the powerful command line data mining tools packaged with every WormBase AMI.

Questions? Contact me at todd@wormbase.org.