Upgrade Boosts Speed for Accessing Genomic Data With CyVerse
Genome browsers are "the primary way researchers access genomic data," according to Eric Lyons, associate professor at the University of Arizona in the School of Plant Sciences and School of Information, and CyVerse co-principal investigator.
These powerful, web-based tools provide researchers and geneticists with fast, easy ways to search, retrieve, and analyze genomic datasets and databases.
"Think of them as an atlas of all the DNA of an organism," Lyons continued. "Life science researchers today are generating many large datasets of genetic information, and want to be able to find their gene(s) of interest, see how they are expressed across different tissues and developmental stages, and understand how they are activated in response to changing conditions, such as water stress for a plant or disease in an animal. Reference genome browsers are the gold standard for genomic analyses."
Genome browsers allow their users to access and search through multiple data repositories, including dedicated servers maintained by scientists, Amazon Web Services, local data files, and data stored in CyVerse.
When a researcher makes a request, the browser searches a collection of genetic data files called a Track Hub, often a large number of files that a user has assembled and stored in a repository such as CyVerse's Data Store. The files themselves can be small or large, but the browser extracts only the pieces that have been requested. Researchers therefore reach the data of interest far faster than they would by downloading entire genomes or associated files.
One of the most popular and widely used genome browsers is hosted by the University of California – Santa Cruz (UCSC) Genome Institute. Its massive computational engine allows users to interactively visualize genomic data. Researchers can search in several ways: for genes associated with different attributes, by gene name, or by "track type," looking for gene segments that match specific criteria.
When a researcher requests to view a gene on the Browser, explained UCSC Quality Assurance Release Management Supervisor Brian Lee, "only the data from the files specific for that small section of the entire genome will be requested and transferred across the web."
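The partial transfer Lee describes can be sketched in a few lines of Python. This is only an illustration: the article does not say which mechanism the browser uses, so the sketch assumes standard HTTP byte-range requests, and the function names and the in-memory "track file" are hypothetical.

```python
# Illustrative sketch only. Assumes HTTP byte-range requests; the function
# names and the sample data below are hypothetical, not part of any real tool.

def range_header(start: int, length: int) -> dict:
    """Build an HTTP Range header asking for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return {"Range": f"bytes={start}-{end}"}


def read_range(data: bytes, start: int, length: int) -> bytes:
    """Return only the requested slice, standing in for the server's reply."""
    return data[start:start + length]


# A stand-in for a large track file held in a remote data store.
track_file = b"ACGT" * 1000

# Only 8 of the file's 4,000 bytes cross the network for this request.
headers = range_header(400, 8)
piece = read_range(track_file, 400, 8)
```

The point of the sketch is the ratio: the request names a small window of the file, and only that window is returned, regardless of how large the file is.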
Retrieving those pieces used to be a somewhat arduous process, as the browser pulled pieces of small files one at a time.
To design the solution, "we analyzed data access patterns of the UCSC Genome Browser," said Illyoung Choi, a CyVerse research and development engineer. Choi and CyVerse senior software engineer Tony Edgin installed Varnish, a web service that manages caches of data. Varnish was chosen because it supports caching for the type of data access used by the browser, Choi explained.
"When you make a request, Varnish assumes you're going to want more of the file," said Edgin. "If you request only the file header, for example, it downloads the entire file, so it's ready for your next request when you make it. The response to the next request then happens really quickly, up to one hundred times as fast as it did before."
Instead of taking a big file and uploading it, the system now pulls only the files that were requested. "By caching smaller files, we are making it much faster for the next researcher to access them, or for the same researcher to pull up the results when she makes a second request," said Lyons.
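The caching behavior Edgin and Lyons describe can be illustrated with a small Python sketch. It is not Varnish configuration and does not reflect how CyVerse set up its service; the class name, the counter, and the sample file are hypothetical, and a dictionary stands in for the remote data store.

```python
# Illustrative sketch only: a read-through cache that fetches the whole file
# on the first partial request and serves later requests from memory.
# All names here are hypothetical; this is not Varnish or CyVerse code.

class WholeFileCache:
    def __init__(self, origin: dict):
        self._origin = origin      # stands in for the remote data store
        self._cache = {}           # file name -> full file contents
        self.origin_fetches = 0    # how many times the slow origin was hit

    def get_range(self, name: str, start: int, length: int) -> bytes:
        """Return `length` bytes of `name` starting at `start`."""
        if name not in self._cache:
            # First request for this file: pull the entire file once.
            self._cache[name] = self._origin[name]
            self.origin_fetches += 1
        # Every request, including the first, is answered from the cache.
        return self._cache[name][start:start + length]


origin = {"sample.bw": b"HEADER" + b"x" * 1000}
cache = WholeFileCache(origin)

header = cache.get_range("sample.bw", 0, 6)   # triggers the one origin fetch
body = cache.get_range("sample.bw", 6, 10)    # served from the cache
```

After both calls the origin has been contacted only once, which is why the second request, whether from the same researcher or the next one, returns so much faster.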
The upgrade will especially improve access speed for educators running classes or workshops, where multiple people are viewing data in the CyVerse Data Store via a reference genome browser.
Tests of the new installation showed a 150 percent performance improvement for initial data access and a 330 percent improvement for repeated access to the same data. From the researcher's perspective, searching data stored in CyVerse is now as easy as if the data were stored locally at UCSC, without the typical delays involved in pulling data from hundreds or thousands of miles away.
"CyVerse's support of these small dynamic requests helps scientists with very large datasets explore their data visually on our site, where repeatedly transferring the entire files would be impossible," noted Lee.
"The recent improvements at CyVerse greatly speed the access of data stored at CyVerse's servers to display on the UCSC Genome Browser. We are grateful to have CyVerse as a resource to point to for making a research group's data dynamically available over the internet."
The CyVerse upgrade will improve user speeds for any genome browser, including Ensembl, NCBI Genome, the Integrated Genome Browser hosted by Ann Loraine at the University of North Carolina – Charlotte, and CoGe, a powered-by-CyVerse comparative genomics database hosted at UArizona and run by Andrew Nelson, an assistant professor at the Boyce Thompson Institute.