A new report details how Artificial Intelligence (AI) can be used to create efficient models for genomic selection of sugarcane and forage grass varieties, while also predicting their field performance on the basis of their DNA.
This is the first time a highly efficient genomic selection method based on machine learning has been proposed for polyploid plants – in which cells have more than two complete sets of chromosomes.
The methodology, published in Scientific Reports, improved the predictive power of machine learning by more than 50%. This means that this model is much more accurate than traditional breeding techniques.
The complexities of breeding techniques
Machine learning is a branch of AI that draws on computational statistics and optimisation and has countless applications. Its main goal is to create algorithms that automatically extract patterns from datasets. In plant science, it can be used to predict traits such as plant performance and resistance or tolerance to biotic stresses and to abiotic stresses – such as cold, drought, salinity, and insufficient soil nutrients.
In traditional breeding programmes, crossing is the most widely used technique. Alexandre Hild Aono, a computer scientist and lead author of the study, said: “You establish populations by crossing plants that are interesting. In the case of sugarcane, you cross a variety that produces a lot of sugar with another that’s more resistant, for example. You cross them and then assess the performance of the resulting genotypes in the field.”
He continued: “But this assessment process takes a long time, and is very expensive. Our genomic selection method can predict the performance of these plants even before they grow. We succeeded in predicting yield on the basis of the genetic material. This is significant because it saves many years of assessment.”
In the case of sugarcane, plant performance prediction is highly complex. Traditional breeding techniques take between nine and 12 years and incur high costs, according to Anete Pereira de Souza, Professor of Plant Genetics at the State University of Campinas’ Center for Molecular Biology and Genetic Engineering.
“When breeders identify an interesting plant, they multiply it by cloning so that the genotype isn’t lost, but this takes time and costs a great deal. An extreme example is the breeding of rubber trees, which can take as long as 30 years,” Souza explained.
One method that can be used to overcome these difficulties is ‘Plant Breeding 4.0’, which makes intensive use of data analysis and highly efficient computational and statistical tools. Each genomic selection model can involve up to one billion sequences.
However, the main hurdle scientists face in breeding better varieties of polyploid plants is the complexity of their genomes. “In this case, we didn’t even know if genomic selection would be possible, given the scarce resources and the difficulty of working with this complexity,” Aono said.
Developing a new method to predict plant performance
The researchers began the genomic selection process with diploid plants, as they have similar chromosomes to polyploid plants. Souza said: “The problem is that high-value tropical plants like sugarcane aren’t diploids but polyploids, which is a complication.”
While humans and most animals are diploid, sugarcane may have as many as 12 copies of every chromosome. Any individual of the species Homo sapiens can have up to two variants of each gene, one inherited from the father and one from the mother. Sugarcane is more complex because any gene can have many variants in the same individual, with some genomes having six, eight, ten, or even 12 sets of chromosomes.
“The genetics is so complex that breeders work with sugarcane as if it were diploid,” Souza stated.
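To make the contrast concrete, the sketch below counts how many allele-dosage classes a single two-variant SNP can take at different ploidy levels, and shows one way of collapsing polyploid dosages into diploid-style codes. The function names and the collapsing rule are illustrative assumptions, not the procedure used in the study.

```python
def dosage_classes(ploidy: int) -> list[int]:
    """Possible counts of the alternative allele at one biallelic SNP."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return list(range(ploidy + 1))


def collapse_to_diploid(dosage: int, ploidy: int) -> int:
    """Assumed simplification: 0 = no alternative copies, 2 = all copies, 1 = any mix."""
    if not 0 <= dosage <= ploidy:
        raise ValueError("dosage must lie between 0 and ploidy")
    if dosage == 0:
        return 0
    if dosage == ploidy:
        return 2
    return 1


for ploidy in (2, 6, 8, 10, 12):
    classes = dosage_classes(ploidy)
    collapsed = sorted({collapse_to_diploid(d, ploidy) for d in classes})
    print(f"ploidy {ploidy:>2}: {len(classes)} dosage classes, collapsed to {collapsed}")
```

Under this coding, a plant with one copy of a variant and a plant with 11 copies out of 12 receive the same code, which hints at how much information a diploid-style treatment can discard.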
Can genomic selection work efficiently in plant breeding?
In 2001, Theodorus Meuwissen, a Dutch scientist, proposed genomic selection to predict complex traits in animals and plants in association with their phenotypes – which are observable characteristics resulting from the interaction of their genotypes with the environment. The advantage of this approach to plant breeding is the link between the phenotypic traits of interest, such as yield, sugar level or precocity, and single nucleotide polymorphisms (SNPs). An SNP is a genomic variant at a single base position in the DNA.
Souza explained: “It’s the difference in the genomes of any two individuals. For example, one may have an A [corresponding to the nucleotide adenine] that produces a little more than another with a G [guanine] at the same location in the genome. That changes everything.”
“When you find an association with what you’re looking for, like a high level of sugar production, and specific SNPs at different locations in the genome, you can sequence only the population on which your breeding work focuses.”
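As a rough illustration of such an association, the sketch below groups plants by the nucleotide they carry at one SNP and compares the mean yield of the two groups. The genotypes and yield figures are invented toy values, and a real analysis would apply a proper statistical test across many SNPs rather than a single difference of means.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (nucleotide at one SNP, sugar yield in arbitrary units).
plants = [
    ("A", 12.1), ("A", 11.8), ("A", 12.6),
    ("G", 10.2), ("G", 10.9), ("G", 10.4),
]


def mean_yield_by_allele(records):
    """Return the mean yield for each nucleotide observed at the SNP."""
    groups = defaultdict(list)
    for allele, yield_value in records:
        groups[allele].append(yield_value)
    return {allele: mean(values) for allele, values in groups.items()}


means = mean_yield_by_allele(plants)
# A positive difference means A carriers yield more in this toy data set.
effect = means["A"] - means["G"]
print(means, round(effect, 2))
```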
The genomic selection method proposed by the team dispenses with the need to plant and phenotype throughout the breeding cycle. “We do field experiments in the initial stages of the programme to obtain the phenotype of interest for each clone,” Souza said.
“In parallel, we sequence all the clones in the breeding population quite straightforwardly, without needing to have the whole genome for every clone. This is what’s called genotyping-by-sequencing – partial sequencing in search of the differences and similarities in the base pairs for the clones, and their association with each clone’s production.
“The association between phenotype and genome shows which clones produce more and which SNPs are associated with higher production. In this manner, we can identify clones with a large proportion of the SNPs that contribute to the higher production observed in the initial experiments and obtain the most productive variety faster and more cheaply.”
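The workflow Souza describes, learning SNP-yield associations from phenotyped clones and then ranking other clones by genotype alone, can be sketched as a small ridge regression. This is only an illustration under stated assumptions: it is not the machine-learning method from the study, and the dosage matrix, yields, clone names and penalty value below are invented toy inputs.

```python
def ridge_marker_effects(genotypes, phenotypes, penalty=1.0):
    """Estimate one effect per SNP by ridge regression on centred data."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    n, m = len(genotypes), len(genotypes[0])
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    y_mean = sum(phenotypes) / n
    x = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]
    y = [value - y_mean for value in phenotypes]

    # Normal equations (X'X + penalty * I) b = X'y, stored as an augmented matrix.
    a = [
        [sum(x[i][r] * x[i][c] for i in range(n)) + (penalty if r == c else 0.0)
         for c in range(m)]
        + [sum(x[i][r] * y[i] for i in range(n))]
        for r in range(m)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]

    effects = [a[j][m] / a[j][j] for j in range(m)]
    return effects, col_means, y_mean


def predict_yield(genotype, effects, col_means, y_mean):
    """Predict yield for one clone from its SNP dosages alone."""
    return y_mean + sum(e * (g - mu) for e, g, mu in zip(effects, genotype, col_means))


# Hypothetical phenotyped clones: SNP dosages (0, 1 or 2) and field yields.
train_genotypes = [[2, 0, 1], [1, 1, 0], [0, 2, 2], [2, 1, 2], [0, 0, 1], [1, 2, 0]]
train_yields = [14.5, 11.0, 9.0, 14.0, 10.5, 10.0]

# Hypothetical candidate clones that were genotyped but never grown in the field.
candidates = {"C1": [2, 0, 2], "C2": [0, 2, 0], "C3": [1, 1, 1]}

effects, col_means, y_mean = ridge_marker_effects(train_genotypes, train_yields)
predicted = {name: predict_yield(g, effects, col_means, y_mean)
             for name, g in candidates.items()}
ranking = sorted(predicted, key=predicted.get, reverse=True)
print(ranking)
```

The penalty shrinks the estimated SNP effects towards zero, which keeps the system solvable when there are more markers than phenotyped clones, as is typical in genomic data.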