In Brief
  • The TeraStructure algorithm can analyze genome sets much larger than current systems can efficiently handle, including those as big as 100,000 or 1 million genomes.
  • Finding an efficient way to analyze genome databases would allow for personalized healthcare that takes into account any genetic mutations that could exist in a person's DNA.

Keeping Pace

Ever since the first complete mapping of the human genome in 2003, biologists and other scientists have been hard at work making the process easier and faster. They’ve gotten so good at it, in fact, that they’ve now sequenced the genomes of more than a million people and believe that number could rise to nearly 2 billion by 2025.

However, our ability to make sense of all this data remains lackluster. Machine learning could change that, though, as a new algorithm has been developed that can read large genome data sets for the goal of personalizing healthcare.

Currently being used for this purpose is the STRUCTURE algorithm, which was first described in 2000. It looks at all the variants in each genome in a data set before updating its model to characterizing ancestral populations and how they affect an individual’s own genome. Then it moves on to the next genome.

The new algorithm, TeraStructure, looks at one variant in all of the genomes in a data set before it updates its model to produce a working estimate of population structure. That allows it to create ancestry models more accurately and quickly — two to three times faster, in fact, on a simulated data set of 10,000 genomes. It can even analyze sets as large as 100,000 or 1 million genomes.

The accuracy of TeraStructure. Wei Hao/Princeton

Personalized Medicine

Each mapped genome is several billion characters long, and given that we could have as many as two billion mapped within the next decade, we need to find a efficient way to make use of this data. Therefore, machine learning could be an invaluable tool in the field of genomics as it could be used for genetic diagnostics, refining drug targets, pharmaceutical development, and personalized medicine, according to Brandon Frey, founder of company Deep Genomics.

Knowing the ancestry of an individual enables doctors to see which variants in an individual genome are due to normal genetic variation in a population and which are due to disease-causing variants passed down from ancestors. This allows for personalized healthcare — if your doctor knows which diseases you are more susceptible to due to mutations in your DNA, they can better treat or prevent those diseases. Ultimately, machine learning may provide a way for us to tailor medicine to provide better and faster relief for individuals.