Machine Learning & Metabolism - Undergraduate project by Tanusree Chepuri

Abigail LaBella
May 21, 2025

I am Tanusree Chepuri, a third-year senior at UNC Charlotte majoring in Biology (BS) and minoring in Bioinformatics and Data Science. In the Spring semester of 2025, I conducted undergraduate research in bioinformatics under the guidance of Dr. Abigail Leavitt LaBella. My project focused on codon usage bias, specifically Relative Synonymous Codon Usage (RSCU), across four yeast species: Saccharomyces cerevisiae, Saccharomyces paradoxus, Nakaseomyces glabrata, and Nakaseomyces delphensis. All four are ascomycetous yeasts, meaning they reproduce by forming asci and ascospores. However, they differ significantly in genetics, ecology, physiology, and relevance to humans.
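For readers new to RSCU, here is a minimal Python sketch of the standard definition: a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. It is not the project's code (the project was done in R). The function name `rscu`, the two-amino-acid table, and the toy sequence are illustrative assumptions only.

```python
from collections import Counter

# Partial synonymous-codon table from the standard genetic code.
# Only two amino acids are listed to keep the illustration short.
SYNONYMOUS = {
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Gln": ["CAA", "CAG"],
}

def rscu(cds, table=SYNONYMOUS):
    """Return RSCU per codon for one coding sequence (hypothetical helper)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for amino_acid, family in table.items():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                          # amino acid absent: RSCU undefined
        expected = total / len(family)        # equal-usage expectation
        for codon in family:
            values[codon] = counts[codon] / expected
    return values

# Toy sequence (made up): GCA GCA GCT CAA CAG
example = rscu("GCAGCAGCTCAACAG")
```

A value above 1 means the codon is used more often than equal usage would predict, and a value below 1 means it is used less often.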

The goal of my project was to explore whether genes that are active together in the same biological pathways tend to use similar codons. Our hypothesis was that coordinating codon usage between co-regulated genes could improve how efficiently proteins are made. To test this, I worked with data from the KEGG database, which connects genes to pathways like the Pentose Phosphate Pathway and Autophagy. I cleaned the codon usage data, merged it with the KEGG pathway data, and transformed the result into a format that was easier to analyze. I then used Random Forest, a machine learning algorithm, to build models for each species that could predict whether a gene belonged to a particular pathway based on its codon usage. One challenge was that some pathways had fewer genes than others. To address this, I used two approaches: balancing the data by randomly selecting non-pathway genes, and directly comparing genes from two different pathways. These strategies helped improve model accuracy, especially when comparing distinct pathways like the Pentose Phosphate Pathway and Autophagy, where out-of-bag error rates dropped below 12.5 percent.
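The two sampling strategies described above can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not the R code used in the project. The function names, the gene identifiers `g1` to `g8`, and the choice to drop genes annotated to both pathways are all hypothetical.

```python
import random

def balanced_sample(gene_pathways, target, seed=0):
    """Pair every gene in `target` with an equal number of randomly
    chosen non-pathway genes (label 1 = in pathway, 0 = not)."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    positives = sorted(g for g, p in gene_pathways.items() if target in p)
    negatives = sorted(g for g, p in gene_pathways.items() if target not in p)
    k = min(len(positives), len(negatives))
    sampled = rng.sample(negatives, k)        # downsample the majority class
    return [(g, 1) for g in positives] + [(g, 0) for g in sampled]

def pairwise_sample(gene_pathways, a, b):
    """Keep only genes in exactly one of the two pathways, labeled by
    pathway. Genes annotated to both are dropped (an assumption)."""
    rows = []
    for gene, pathways in sorted(gene_pathways.items()):
        in_a, in_b = a in pathways, b in pathways
        if in_a != in_b:
            rows.append((gene, a if in_a else b))
    return rows

# Made-up annotations for illustration only.
toy = {
    "g1": {"pentose"}, "g2": {"pentose"}, "g3": {"autophagy"},
    "g4": {"autophagy", "pentose"}, "g5": set(), "g6": set(),
    "g7": {"other"}, "g8": set(),
}
balanced = balanced_sample(toy, "pentose")
pairs = pairwise_sample(toy, "pentose", "autophagy")
```

Either labeled list could then be joined to per-gene codon usage values and passed to a Random Forest classifier.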

Certain codons, such as those encoding leucine, alanine, and glutamine, were important in making accurate predictions. We found, for example, that the alanine codon GCA occurs at a higher frequency than expected in S. cerevisiae autophagy genes and at a lower frequency than expected in Pentose Phosphate Pathway genes. Interestingly, S. cerevisiae and N. glabrata had the most codons in common, despite not being the most closely related.

CHROMAgar with Nakaseomyces glabratus (formerly Candida glabrata), Pichia kudriavzevii (formerly Candida krusei), Candida albicans and Candida tropicalis, annotated. Image by Mikael Häggström, MD. Public Domain (CC0 1.0)

Our results support the idea that genes working together often use similar codons, which likely helps cells make proteins more efficiently. We also saw differences in codon preferences between yeast species, which may reflect how each has adapted to its environment. Through this project, I gained valuable experience in data analysis, coding in R, using biological databases, and applying machine learning to real-world biological questions. This experience has strengthened my interest in bioinformatics and prepared me for future research in systems biology and computational genomics.
