ARUP Launches Cloud-Based NGS Analytics Platform for Massively Larger Scale Processing


NEW YORK (360Dx) – ARUP Laboratories has boosted the speed and capacity of its next-generation sequencing capabilities through a new analytics platform that shortens NGS processing turnaround time while tapping cloud computing to massively scale the number of NGS samples the lab can run simultaneously.

"At ARUP, there was a broad directive of being able to support 1,000 samples a day, and our legacy systems weren't able to support this, so we took the time to essentially redesign the entire infrastructure," said Elaine Gee, director of bioinformatics at ARUP Laboratories.

The legacy system, which the lab had called Pipeline.jar, was a Java-based system designed to process one sample at a time, and it could be scaled only by running the software across multiple servers at ARUP. As a result, the lab's sample processing capacity was constrained by the number of servers available, and the lab could run just a few samples at a time, according to Gee.

The new bioinformatics pipeline, nicknamed "Pipey," which replaces Pipeline.jar, was built as cloud computing infrastructure, able to scale across Amazon Web Services' compute capacity. Pipey went live in early January.

"Pipey essentially allows us to spin up or down the compute process depending on the length of the job queue," Gee said. "So the difference between this system and the previous one is that the workflows that have been designed are essentially reactive to the number of jobs that are waiting to be processed. From a job level, you can process one job or 100 jobs — it is all calculated based on how many jobs are waiting and the rules set up around it."

By tapping the cloud, the number of samples that the system could scale to run simultaneously is almost limitless, according to Gee. ARUP does not currently disclose how much demand it has for NGS testing.
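
The queue-driven scaling Gee describes can be illustrated with a minimal Python sketch. This is not ARUP's code: the function name, the `jobs_per_worker` ratio, and the `min_workers` and `max_workers` limits are hypothetical stand-ins for "the rules set up around it."

```python
import math


def desired_workers(queued_jobs: int, jobs_per_worker: int = 1,
                    min_workers: int = 0, max_workers: int = 100) -> int:
    """Return how many compute workers to run for the current queue length.

    All parameters are illustrative assumptions, not values from ARUP.
    """
    if queued_jobs < 0:
        raise ValueError("queued_jobs must be non-negative")
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # One worker per `jobs_per_worker` waiting jobs, rounded up.
    needed = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp to the configured floor and ceiling.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for queued in (0, 1, 100, 10_000):
        print(queued, "jobs waiting ->", desired_workers(queued), "workers")
```

With the hypothetical cap of 100 workers, a queue of 10,000 jobs would still be limited to 100 workers; raising `max_workers` is the kind of adjustment that would let the same rule expand further.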

"In theory, if 10,000 samples come through, you could process those samples at the same time as one sample, because you can expand out and run 10,000 nodes," Gee said, noting that some minor adjustments would have to be made to the current implementation to be able to scale that extensively.

As a cloud-based system for health information, the platform was built with key security measures to protect patient data in compliance with Health Insurance Portability and Accountability Act regulations, she noted.

Meanwhile, the legacy system took anywhere from a few hours to 18 hours to process one sample, depending on the amount of sequencing involved. That bioinformatics analytics process has also been redesigned.

"From an efficiency standpoint on the analytics side, the actual pipeline itself was also optimized so that the processing times went from 18-plus hours to two to four hours consistently. It required changes on the tooling itself, as well as bioinformatics pipeline optimization," Gee said.

Pipey speeds the part of the NGS process that occurs after DNA is extracted and sequenced with an Illumina platform, according to Gee. Pipey converts output from the Illumina sequencer, known as .bcl files, into FASTQ files, which contain the base calls, or reads of DNA fragments. The conversion to FASTQ files occurs on ARUP servers, and the remainder of the process is pushed into the cloud.

Because next-generation sequencing typically runs samples of multiple individuals at the same time, with each person's DNA barcoded, one of the initial steps of Pipey is to "demultiplex" data from the sequencer and assign the sequenced DNA back to individuals.
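
As a rough illustration of demultiplexing, the Python sketch below groups reads by sample. It assumes, for simplicity, that each read begins with a fixed-length barcode and that barcodes must match exactly; the function name, barcodes, and sample names are hypothetical, and this is not a description of how Pipey or Illumina software performs the step.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using the barcode at the start of each read.

    Assumes an inline, exact-match barcode (an illustrative simplification).
    Reads with an unknown barcode are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]  # sequence with the barcode trimmed off
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


if __name__ == "__main__":
    barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}  # hypothetical
    reads = ["ACGTTTGA", "TTAGCCCA", "ACGTGGGG", "NNNNAAAA"]
    for sample, inserts in demultiplex(reads, barcodes, 4).items():
        print(sample, inserts)
```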

The next step is to give genomic coordinates to each of the DNA reads that come off the sequencer.
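
To make the idea of assigning coordinates concrete, here is a deliberately naive Python sketch that reports where a read matches a reference sequence exactly. Real aligners use indexed references and tolerate mismatches and gaps; the function name and the toy sequences are assumptions for illustration only.

```python
def align_exact(read: str, reference: str) -> list[int]:
    """Return every 1-based reference position where the read matches exactly."""
    if not read:
        raise ValueError("read must not be empty")
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based coordinate
        start = reference.find(read, start + 1)
    return positions


if __name__ == "__main__":
    reference = "TTGACCGTAAGCTTGACC"  # toy reference, not real genomic sequence
    for read in ("CGTAAG", "TTGACC", "AAAA"):
        print(read, "->", align_exact(read, reference))
```

A read that matches more than one position, like the second one here, shows why repeated sequence makes placement ambiguous.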

"Reads that come off the sequencer are just fragmented DNA sequence, so you don't really know what gene it came from. Read alignment allows you to take those reads, align them back to the reference to give you contextual information about where in the genome that read came from," Gee said.

Pipey then corrects certain upstream sequencing issues in a process known as BAM polishing, named for the BAM file that is corrected. A common correction in the BAM polishing step is PCR duplicate removal, in which PCR-amplified reads are collapsed into representative reads that more closely mimic the genomic sample, according to Gee. This step also involves a quality score recalibration, which corrects for systematic errors.
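
The collapsing of duplicates can be sketched in Python as follows. The sketch assumes that reads sharing a chromosome, start position, and strand are duplicates and keeps the one with the highest mean quality; the `AlignedRead` type and its fields are hypothetical, and production tools weigh more information than this.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    """Minimal stand-in for an aligned read (hypothetical fields)."""
    name: str
    chrom: str
    pos: int
    strand: str
    mean_quality: float


def collapse_duplicates(reads):
    """Keep one representative read per (chrom, pos, strand) group."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        current = best.get(key)
        # Keep the highest-quality read seen so far for this position.
        if current is None or read.mean_quality > current.mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.pos, r.strand))


if __name__ == "__main__":
    reads = [
        AlignedRead("r1", "chr1", 100, "+", 30.0),
        AlignedRead("r2", "chr1", 100, "+", 35.0),  # duplicate of r1, better quality
        AlignedRead("r3", "chr1", 100, "-", 20.0),
        AlignedRead("r4", "chr2", 50, "+", 40.0),
    ]
    print([r.name for r in collapse_duplicates(reads)])
```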

The final steps involve variant calling, quality control, and annotation, Gee explained. Variant calling essentially lists the variants, or deviations from the reference, contained in the sample. Everyone's DNA contains multiple polymorphisms, or variants, so the annotation process essentially adds data to help interpret variants, including data on which variants are pathogenic, and data on how often specific variants appear. The quality control process generates statistics that help users of the system determine whether the sample was processed to a high enough quality.
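
In the simplest terms, calling and annotation amount to listing differences from the reference and then looking each one up. The Python sketch below handles only single-base substitutions in two equal-length sequences; the function names, the lookup table, and its values are invented for illustration and do not reflect any real variant database.

```python
def call_snvs(reference: str, sample: str, start: int = 1):
    """List single-base differences as (position, ref_base, alt_base) tuples."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length in this sketch")
    return [
        (start + i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def annotate(variants, known_variants):
    """Attach interpretation data from a lookup table to each variant."""
    annotated = []
    for pos, ref_base, alt_base in variants:
        info = known_variants.get((pos, ref_base, alt_base), {})
        annotated.append({
            "pos": pos,
            "ref": ref_base,
            "alt": alt_base,
            "significance": info.get("significance", "unknown"),
            "population_frequency": info.get("population_frequency"),
        })
    return annotated


if __name__ == "__main__":
    # Hypothetical annotation table with made-up values.
    known = {(3, "G", "C"): {"significance": "benign", "population_frequency": 0.12}}
    for row in annotate(call_snvs("ACGTAC", "ACCTAT"), known):
        print(row)
```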

A key new feature of Pipey is the use of a Mongo database to store data generated by the system, according to Gee. Mongo databases are open source databases often used in cloud computing and for "big data" storage. Pipey's Mongo database interfaces with ARUP's internal NGS.web system for NGS report generation.

The Mongo database enables users of ARUP's NGS.web report generation tool to access large stores of data in a searchable format and to build deeper data on sequencing for future assay improvement, according to Gee. Having a large database to query against allows users of NGS.web to access more up-to-date information and more easily access large variant sets, she noted.
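
One common way to make a large variant set browsable is to serve it in fixed-size pages rather than all at once. The Python sketch below pages through an in-memory list as a stand-in for a database query; the function name, page size, and record layout are assumptions, and a database would normally do this work on the server side.

```python
def iter_pages(records, page_size):
    """Yield successive fixed-size slices of a list of records."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(records), page_size):
        yield records[start:start + page_size]


if __name__ == "__main__":
    # Hypothetical variant records for a single sample.
    variants = [{"id": i} for i in range(55_000)]
    pages = list(iter_pages(variants, 1_000))
    print(len(pages), "pages; first page holds", len(pages[0]), "variants")
```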

"Typically, exomes generate about 50,000 to 60,000 variants per sample. Querying that type of dataset can be very hard to do," Gee said. "Storing the information in a database so it can be pulled back into NGS.web allows us to essentially page through large sample sets so we could eventually support things like whole genome."

The system was validated by sequencing samples for myeloid malignancy, according to Gee. Myeloid malignancy panels have a large variant class, and contain FLT3 internal tandem duplications, which can vary in position, length, and number of duplications. For the myeloid malignancy validation exercise, the team worked with Todd Kelley, ARUP's medical director for molecular hematopathology and co-scientific director for NGS and biocomputing, to ensure that the expected variant types were called and that the system met certain thresholds of detection.

Overall, the approximately 18-month project to build Pipey involved a cross-functional team of about 15 people that included bioinformaticians, software engineers, data engineers, and administrators with backgrounds in computer science and computational biology, according to Gee.

Lead data engineer Mark Monroe led efforts to design the cloud computing infrastructure, and medical directors and clinical variant scientists contributed information about which variant types the system should identify, Gee said. NGS.web developers collaborated on integration with the NGS.web system.

While there are other labs with cloud-based NGS infrastructures, Gee said the challenge in building Pipey was a lack of commercially available NGS platforms to use as a template.

"There weren't any off-the-shelf tools for this," she said. "We did a lot of testing and redesign during the process, and we discovered various hurdles and challenges along the way. Oftentimes the challenge didn't have a cookie-cutter answer, so it required a lot of ingenuity and innovation to design a cloud infrastructure essentially from scratch."