Tuesday, 13 January 2015

Moving 20 Petabytes

EMBL-EBI's data resources are built on a constantly running compute and storage infrastructure. Over the past decade that infrastructure has grown exponentially, keeping pace with the rapid growth of molecular data and the corresponding need for computation. Terabytes of data flow every day on and off our storage systems, making up the hidden life-blood of data and knowledge that permeates much of modern molecular biology.

There is a somewhat bewildering complexity to all of this. We have 57 key resources: everything from low-level, raw DNA storage (ENA) through genome analysis (Ensembl and Ensembl Genomes), complex knowledge systems (UniProt) and 3D protein structures (PDBe). More than half a million users visit at least one of the EMBL-EBI websites each month, making 12 million web hits and downloading 35 terabytes each day. Each resource has its own release cycle, with different international collaborations (e.g. INSDC, wwPDB, ProteomeXchange) handling the worldwide data flow.

To achieve consistent delivery, we have a complex arrangement of compute hardware distributed around different machine rooms, some at Hinxton and some off site. Around two years ago we started the process to rebid our machine room space, and last year Gyron, a commercial machine-room provider in the southeast of England, won the next five-year contract. This was good news for efficiency (Gyron provided a similar level of service but at a sharper price) but posed an immediate problem for EMBL-EBI's systems, networking and service teams: to wit, how were we going to move our infrastructure without disruption?

Moving a mountain

The carefully laid plans were put into operation in October 2014 and the move was completed in December 2014. Over that time, the EMBL-EBI systems team moved 9,500 CPU cores and 22 petabytes of disk, and reconnected 3,400 cables/fibre with 850 power cables. Effectively, they moved half our storage infrastructure with no unscheduled downtime for any resource. In fact, most resources ran as usual throughout the entire operation. That the vast majority of users were totally unaware of the move is a huge tribute to the team, who had to work closely with each of the 57 resource groups to deliver constant service.

Much of this was due to good planning some five years ago, when EMBL-EBI originally outgrew its Hinxton-based machine rooms and started leasing machine-room space in London. Two key decisions were made. The first was that every service would run from two identical, isolated systems, such that one system could be incapacitated and the usage would switch to the other. The second decision was that only the technical groups (i.e. systems and web production) would be allowed direct access to the machines running the front-end services.

All the testing and development happened in a separate, cloned system (running in Hinxton), and deployment was carried out via a series of automated procedures. These procedures were designed to accommodate the different standard operations of each resource, and to support complicated issues around, for example, user-uploaded data. After a couple of (rather painful) years of fixing and fine tuning, our front-end services were logically and formally separated from testing and development. All of this is done in a highly virtualised environment (I don't think anyone at EMBL-EBI logs into or uses a "real" machine anymore), allowing yet more resilience and flexibility of the system.

This preparation made the conceptual process of moving relatively easy. One half of the system was brought down, and the active traffic was diverted to the other half. Then the machines were moved, reconnected in their new location, tested and brought back up. Once the new system checked out, the services were started up on the new location, and the second system went through the same process. 
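For readers who like to see the idea written down, here is a minimal sketch of that half-at-a-time move, assuming just two halves and a simple up/down flag for each. The names (`Half`, `move_half`, `rolling_move`, "old site", "new site") are invented for illustration and are not the tooling the EMBL-EBI teams actually used.

```python
from dataclasses import dataclass


@dataclass
class Half:
    """One of the two identical, isolated systems behind every service (hypothetical model)."""
    name: str
    location: str
    up: bool = True


def serving(halves):
    """Return the names of the halves currently able to take user traffic."""
    return [h.name for h in halves if h.up]


def move_half(half, halves, new_location, log):
    """Move one half while the remaining live half carries all the traffic."""
    others = [h for h in halves if h is not half and h.up]
    if not others:
        # Never take down the only half that is still serving users.
        raise RuntimeError("refusing to take down the last live half")
    half.up = False                  # bring this half down; traffic diverts to the other
    log.append((half.name, "down", serving(halves)))
    half.location = new_location     # move and reconnect the machines
    # Testing of the reinstalled hardware would happen here, before going live.
    half.up = True                   # bring the half back up in its new location
    log.append((half.name, "up", serving(halves)))


def rolling_move(halves, new_location):
    """Move every half in turn and return a log of (half, event, halves serving)."""
    log = []
    for half in halves:
        move_half(half, halves, new_location, log)
    return log


halves = [Half("A", "old site"), Half("B", "old site")]
log = rolling_move(halves, "new site")
for entry in log:
    print(entry)
```

The point of the sketch is the invariant rather than the code: at every step in the log, at least one half is still serving.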

Some of our largest resources (e.g. ENA storage and 1000 Genomes storage) went through a "read only" period to allow our technicians to transfer half the disk component in a safe manner. For our very rare single-point services, in particular the European Genotype-phenotype Archive (EGA), which operates with a far, far higher level of security, we scheduled one week of downtime. Our high-bandwidth, redundant link with JANET had to be up and running, and our internal network across the three machine rooms (Hinxton, backup and Gyron) had to be configured correctly.

The dreaded downtime

I have to admit, I was concerned about this move. It is all too easy to uncover some hidden dependencies in the way machines are configured with respect to each other, or to find some unexpected flaw deep in the workings of a network or subnetwork. Even though everything seemed fine in theory, I was dreading one or two days - perhaps even a week - of serious access problems. This kind of downtime might be alright once every five years, but the fundamental process of webpages being returned in a timely manner is the first test of a robust informatics infrastructure. Any time this fails, users lose a bit of confidence and that is the last thing we want.

Thanks, guys!

I am truly impressed at what the technical cluster at EMBL-EBI has achieved in this move. These four separate, closely interlinked teams keep the technical infrastructure working: Systems infrastructure, Systems applications, Web Production and Web Development. They are headed up by Steven Newhouse, who came into the job just six months before this all started. Many of the people who made the move possible worked with incredible dedication throughout: Pettri Jokenien, Rodrigo Lopez, Bren Vaughan, Andy Cafferkey, Jonathan Barker, Manuela Menchi, Conor McMenamin and Mary Barlow. All of them understood implicitly what needed to be done to make the system robust, and pulled out all the stops to make it happen.

People only really notice infrastructure when it goes wrong. When you switch on a light, do you marvel at the complexity of a system that constantly produces and ships a defined voltage and amperage to your house, to provide light the instant you wish it? When you use public transport, do you praise the metro and train network for delivering you on (or nearly on) time? Probably not. Similarly, when you click on a gene in Ensembl, look up a protein's function in UniProt, or run a GO enrichment analysis, you probably don't think about the scientific and technical complexity of delivering those results accurately and efficiently. And that's just how it should be.

So - many thanks to the EMBL-EBI technical cluster, who finished the job just before Christmas 2014. I hope you all enjoyed a well-deserved break.

(I've just about uncrossed my fingers and toes now....)


avilella said...

Spelled Petteri Jokinen..

Rutger Vos said...

So the first photo is not of Kraftwerk's concert at EBI?