Associate Professor, Albert Einstein College of Medicine, Brooklyn, New York, United States
Description: Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here, we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels.
Learning Objectives:
- Understand how protein language models are constructed.
- Appreciate how protein language models differ from alignment-based sequence homology methods.
- Demonstrate how new biologically meaningful labels of viral sequences can be developed from protein language models.