Cosmos Q&A: Dealing with data at top speed

September 16, 2020

Cosmos

Cosmos is a quarterly science magazine. We aim to inspire curiosity in ‘The Science of Everything’ and make the world of science accessible to everyone.

In a massive leap forward for computational genomic research, Australia’s CSIRO recently announced that its VariantSpark AI platform had processed over one trillion points of genomic data.

A research team led by Denis Bauer created the platform, which can analyse traits, such as diseases or susceptibilities, and uncover which genes may jointly cause them.

Lead researcher Denis Bauer. Credit: CSIRO

Using an algorithm based on machine learning, it processed the trillion-point dataset in around 15 hours. According to the researchers, traditional computational methods would likely take around 100,000 years to churn through the same collection.

While the trillion points of data were part of a synthetic dataset of 100,000 individuals – one generated by a computer that matches genomic rules and the complexity of real-world data – VariantSpark has also been working on human data.
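To put the quoted figures in perspective, here is a back-of-the-envelope calculation. It is only a sketch: it uses the rounded numbers from the announcement as reported above, and it assumes the trillion points are spread evenly across the 100,000 synthetic individuals, which the article does not state.

```python
# Rounded figures as quoted in the article; the even spread of data
# points across individuals is an assumption made for illustration.
HOURS_PER_YEAR = 365.25 * 24

data_points = 1_000_000_000_000   # "over one trillion points"
individuals = 100_000             # size of the synthetic cohort
variantspark_hours = 15           # "around 15 hours"
traditional_years = 100_000       # "around 100,000 years"

points_per_individual = data_points // individuals
speedup = traditional_years * HOURS_PER_YEAR / variantspark_hours

print(f"{points_per_individual:,} data points per individual")
print(f"roughly {speedup:,.0f} times faster")
```

On these rounded inputs, that is about 10 million data points per individual and a speed-up in the tens of millions.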

Cosmos spoke to Bauer about how the platform works – and how by combining thousands of people’s data, it could make medicine more personalised.

How is VariantSpark different to other AI approaches?

VariantSpark is based on the concept of machine learning, which has been around since the 1960s and is heavily used in big data analytics. But for analysing genomic data we had to reinvent it.

Usually, the big data approach relies on simply having data from more individuals. However, we had to approach it with more information about each individual. You can think of this as a spreadsheet, with rows holding individuals and columns holding the describing information per sample. With about two million genetic differences between one person and the next, and each one of those potentially a disease gene, there is a lot more information to process for each individual than just the relatively simple consumer habits or demographic information.

It took a team of IT and genomics experts close to two years to create VariantSpark – the normal machine learning approaches kept failing because of the quantity of data. We needed it to hold more of the genomic data in the memory of the machine, and that required some new thinking.

Machine learning works in a similar way to how a student learns: from examples. For our genomic work we have several thousand samples, half of them from patients with disease and the other half from healthy individuals. We show this information to VariantSpark’s machine learning algorithm and have it compare them until it finds the set of genes that consistently carry different letters in the disease cohort than in the healthy controls.

The machine learning method used in VariantSpark is a random forest, which builds a large number of individual decision trees to describe the data. VariantSpark is capable of building tens of thousands of those trees, to create a high-resolution representation of the data.
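The idea of learning from cases and controls with many small trees can be sketched in a few lines. This is not VariantSpark’s code: it is a heavily simplified toy in which every “tree” makes just one split, the data are invented, and all function names are hypothetical. Each tree sees a random resample of the individuals and a random handful of variants, and variants earn importance when they separate cases from controls.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, candidates):
    """Pick the candidate variant whose 0-vs-nonzero split lowers impurity most."""
    parent = gini(labels)
    best_feature, best_gain = None, 0.0
    for f in candidates:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] != 0]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_feature, best_gain = f, gain
    return best_feature, best_gain

def stump_forest_importance(rows, labels, n_trees=200, n_candidates=3, seed=0):
    """Grow many one-split trees and add up each variant's impurity reduction."""
    rng = random.Random(seed)
    n_samples, n_features = len(rows), len(rows[0])
    importance = Counter()
    for _ in range(n_trees):
        # Bootstrap: resample individuals with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        boot_rows = [rows[i] for i in idx]
        boot_labels = [labels[i] for i in idx]
        # Each tree only looks at a random subset of the variants.
        candidates = rng.sample(range(n_features), n_candidates)
        feature, gain = best_split(boot_rows, boot_labels, candidates)
        if feature is not None:
            importance[feature] += gain
    return importance

# Invented toy cohort: 20 cases, 20 controls, 8 variants.
# Variant 5 is made to track disease status; the rest are random noise.
rng = random.Random(42)
causal = 5
labels = [1] * 20 + [0] * 20
rows = []
for y in labels:
    row = [rng.choice([0, 1, 2]) for _ in range(8)]
    row[causal] = 1 if y == 1 else 0
    rows.append(row)

importance = stump_forest_importance(rows, labels)
top_variant = importance.most_common(1)[0][0]
print("highest-ranked variant:", top_variant)
```

A real random forest grows much deeper trees, which is how it can pick up variants that only matter in combination. The toy keeps just the resampling and ranking steps.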

The amazing thing about VariantSpark was the speed at which it processed the data. How did it process it so much faster than other computational methods?

We are using “distributed computing” at a whole new scale. Distributed computing is processing large jobs by distributing them to as many computers as possible. This was originally developed by Google to process search queries, where the aim was to join together lots and lots of commonly available hardware to create a scalable, cheap and powerful supercomputer.

For traditional big data, tasks are split on their largest dimension – samples or the rows of the spreadsheet. Using such a “horizontal splitting”, more individuals can be processed. However, our largest dimension is genomic information, that is the millions of genomic differences that describe the individuals, or the spreadsheet columns. So we developed a new way to split the data along this dimension, resulting in a vertical partitioning. We can now use more computers to break down the large genomic job and process its parts faster in parallel.
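The difference between the two ways of cutting the spreadsheet can be shown on a tiny table. The sketch below is an illustration only, not how VariantSpark is implemented: the genotype values are invented, the function names are hypothetical, and a thread pool on one machine stands in for a cluster of computers. It splits the table by columns, scores each block of variants separately, and merges the results.

```python
from concurrent.futures import ThreadPoolExecutor

def horizontal_split(matrix, n_parts):
    """Split by rows: each part holds every variant for a few individuals."""
    size = -(-len(matrix) // n_parts)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]

def vertical_split(matrix, n_parts):
    """Split by columns: each part holds a few variants for all individuals."""
    n_cols = len(matrix[0])
    size = -(-n_cols // n_parts)
    return [(start, [row[start:start + size] for row in matrix])
            for start in range(0, n_cols, size)]

def score_block(block, labels):
    """Per variant: allele count in cases minus allele count in controls."""
    start, rows = block
    scores = {}
    for j in range(len(rows[0])):
        cases = sum(row[j] for row, y in zip(rows, labels) if y == 1)
        controls = sum(row[j] for row, y in zip(rows, labels) if y == 0)
        scores[start + j] = cases - controls
    return scores

# Invented toy data: 4 individuals (rows) by 6 variants (columns).
genotypes = [
    [0, 1, 2, 0, 1, 0],
    [1, 0, 2, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 0, 2, 0, 1],
]
labels = [1, 1, 0, 0]  # 1 = disease, 0 = healthy control

row_parts = horizontal_split(genotypes, 2)
column_blocks = vertical_split(genotypes, 3)

# Each column block is scored independently, then the results are merged.
merged = {}
with ThreadPoolExecutor() as pool:
    for partial in pool.map(lambda b: score_block(b, labels), column_blocks):
        merged.update(partial)

print(merged)
```

Because each block carries all individuals for its variants, a per-variant score needs no data from any other block. That is what lets the work spread across more machines as the number of variants grows.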

Has VariantSpark identified any new avenues for investigation?

VariantSpark looks for the set of genomic changes that are present in people with disease and absent in healthy individuals. It’s similar to trying to find a needle in a haystack, except our “needles” look like hay and only turn into the genetic drivers of disease when looked at side by side with other drivers, making the problem so much harder.

With VariantSpark, we have identified a potentially novel disease gene causing Motor Neuron Disease (ALS) in the Australian population. We are currently working with an international consortium to apply our technology to their database of 22,000 patients with ALS as well as healthy controls. We also applied VariantSpark to investigate whether there are new virus strains of COVID-19 emerging with different pathogenicity.

What is the next step?

Credit: Hiroshi Watanabe / Getty Images

While the genome holds information about future disease risk, other datasets hold information about actual physiological changes. Joining these two information sources enables clinicians to get a better picture of the immediate disease risk and disease progression.

So with VariantSpark, we are working to join the millions of genomic variants with likely similar-sized datasets from imaging, electronic medical records and other molecular data. This will paint a more complete picture of a person’s health and will help doctors make treatments more personalised.

What are the privacy concerns around people’s genomic data?

VariantSpark was developed with data privacy top of mind, as each individual’s genome is unique, like a thumbprint, and personally identifiable. The standard practices of data randomisation and obfuscation do not work, as they either destroy the information or are not powerful enough to break the identifiable link between you and your genome.

Instead, we encrypt all data and use complex access layers to ensure that only the relevant sub-sections of the genome get decrypted at the same time. We also developed VariantSpark in a way that means it doesn’t necessarily have to disclose data to a third party (including us) at all.

People can use VariantSpark themselves. What do they need to run it?

VariantSpark is available to cater for different capabilities, requirements and use cases.

It’s freely available through AWS Marketplace, where the user can “subscribe” to VariantSpark, enabling the latest version of the software and the full environment to automatically spin up – kind of like logging into a computer.

VariantSpark is also available on Azure through Databricks or Google through Terra, with some manual steps required to install the software through this cloud deployment.

On the other end of the spectrum is VariantSpark’s open source code. Advanced users can access the source code to use in their own cloud or high-performance computing cluster systems. They can also contribute to the project by submitting their contributions or patches to the code.
