[Invited Talk] Predicting the animal host ranges of viruses using machine learning of deep language model

No.: 24 Manuscript No.: 40 Access: attendees only Updated: 2022-07-05 10:59:01 Invited Talk

Talk start: 2022-07-23 14:25 (Asia/Shanghai)

Duration: 20 min

Session: [S1] Parallel Session 1 » [S1-1] Biomedical Big Data and Artificial Intelligence


Abstract
Background
  Most novel pathogenic viruses originate in wild animals and emerge to cause human infectious diseases. The COVID-19 pandemic has demonstrated that emerging zoonotic viruses (infectious viruses that jump from animals to humans) can spread at unprecedented speed and cause severe socioeconomic impacts. Thanks to high-throughput sequencing, a growing number of novel viruses are detected in humans and wildlife every year. However, identifying the animal origin of a novel human virus, or determining the zoonotic risk of an animal virus, requires years of field research and laboratory studies. The situation is further complicated because some viruses naturally infect a phylogenetically broad range of hosts and undergo frequent host jumps. For example, arboviruses (arthropod-borne viruses), including Zika and dengue viruses, replicate efficiently in hosts as divergent as vertebrates and insects. Computational methods offer a rapid, low-cost alternative, and accumulating data make it feasible for alignment-based and machine learning models to identify potential hosts of novel viruses. However, the diversity of viruses and the limited knowledge of virus-host interactions have so far prevented existing methods from making accurate predictions.
Methods
  In the current study, we propose a deep language model (DLM) that predicts host ranges from viral genomic sequences. The model exploits the hierarchical structure of the host taxonomy and outperforms other computational methods. We first built a comprehensive Vertebrate Insect Virome (VIV) dataset from the virus-host interactions for vertebrates and insects recorded in the Virus-Host Database, VIRION, and InsectBase 2.0. In total, the VIV dataset includes 4,597 resolved virus “species”, covering 103 of the 189 virus families resolved by the International Committee on Taxonomy of Viruses (ICTV), 4,097 resolved vertebrate and insect species, and 17,454 unique interactions between virus and host species.
  To extract host-related features from the highly complex and heterogeneous viral genomes, we adopted the sequence embedding strategy of BERT (Bidirectional Encoder Representations from Transformers), a typical deep language model architecture that has achieved state-of-the-art performance in natural language processing tasks. To better capture the information encoded in viral genomes, we used transfer learning: the model was first pre-trained with a self-supervised task on the Virosaurus database, which provides a deduplicated reference for exploring virus genetic diversity. We also built a byte-pair encoding table for viral genomic sequences. In the next stage, we re-trained the model on our VIV dataset. The BERT layers extract sequence embeddings from the genome sequences and feed them to a Hierarchical Multi-label Classification Network (HMCN), which shares host label information globally across taxonomic levels. For the host labels, we balanced granularity against fidelity and subdivided a host category into lower taxonomic levels only when a subclass had more than 100 virus records. As a result, we obtained 59 host classes from the Phylum to the Species level (all within the kingdom Animalia).
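The label subdivision rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `select_host_labels` and `encode_hosts`, the tuple representation of a host lineage, and the choice to always keep the top rank are assumptions made for illustration, and records are counted per virus-host pair for simplicity.

```python
from collections import Counter


def select_host_labels(lineages, min_records=100):
    """Return the lineage prefixes that are kept as host classes.

    lineages: one tuple per virus-host record, ordered from the highest
    rank to the lowest, e.g. ("Chordata", "Mammalia", "Primates").
    Assumption: the top rank is always a class; a lower taxon becomes a
    class only if its parent is a class and it has more than
    `min_records` records.
    """
    counts = Counter()
    for lineage in lineages:
        for depth in range(1, len(lineage) + 1):
            counts[lineage[:depth]] += 1

    kept = set()
    # Shorter prefixes first, so each parent is decided before its children.
    for prefix in sorted(counts, key=len):
        if len(prefix) == 1:
            kept.add(prefix)
        elif prefix[:-1] in kept and counts[prefix] > min_records:
            kept.add(prefix)
    return kept


def encode_hosts(lineage, classes):
    """Multi-hot vector over the sorted class list for one host lineage."""
    ordered = sorted(classes)
    prefixes = {lineage[:d] for d in range(1, len(lineage) + 1)}
    return [1 if c in prefixes else 0 for c in ordered]


# Toy data with a threshold of 2 instead of 100.
toy = (
    [("Chordata", "Mammalia")] * 3
    + [("Chordata", "Aves")] * 1
    + [("Arthropoda", "Insecta")] * 2
)
classes = select_host_labels(toy, min_records=2)
vector = encode_hosts(("Chordata", "Mammalia"), classes)
```

With this toy input, only "Mammalia" is split out below its phylum, because the other subclasses do not exceed the threshold. A host is labeled positive for every kept class on its lineage, which is what makes the task hierarchical and multi-label.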
  We applied a re-weighting strategy that uses phylogenetic information to reduce the influence of unobserved positive interactions (true virus-host pairs that are missing from the dataset) and of the imbalanced positive-to-negative ratio in the dataset. Five-fold cross-validation was performed to evaluate the model.
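The abstract does not give the weighting formula, so the following Python sketch shows only one plausible form of phylogeny-aware re-weighting, not the method actually used. Everything here is hypothetical: the functions `tree_distance`, `negative_weight`, and `weighted_bce`, the exponential form, and the `scale` parameter. The idea is that an unobserved virus-host pair is trusted less as a negative when the candidate host is taxonomically close to a known host.

```python
import math


def tree_distance(a, b):
    """Number of taxonomy-tree edges between two lineages (tuples of taxa)."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)


def negative_weight(candidate, known_hosts, scale=2.0):
    """Weight in [0, 1) for an unobserved pair (hypothetical formula).

    The closer the candidate host is to any known host, the smaller the
    weight, since the pair is more likely to be an unobserved positive.
    """
    nearest = min(tree_distance(candidate, h) for h in known_hosts)
    return 1.0 - math.exp(-nearest / scale)


def weighted_bce(probs, labels, weights):
    """Weighted binary cross-entropy, normalized by the total weight."""
    eps = 1e-12
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total -= w * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / max(sum(weights), eps)


known = [("Chordata", "Mammalia", "Primates")]
w_rodent = negative_weight(("Chordata", "Mammalia", "Rodentia"), known)
w_insect = negative_weight(("Arthropoda", "Insecta", "Diptera"), known)
loss = weighted_bce([0.9, 0.4, 0.1], [1, 0, 0], [1.0, w_rodent, w_insect])
```

In this sketch a rodent host is down-weighted more strongly than an insect host for a virus known from primates. Class imbalance could be handled in the same loss by scaling the positive weights, which is also an assumption rather than a detail stated in the abstract.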
Results
  As described above, we developed a method based on the BERT deep language model, which we named VBERT, to predict the host range of viruses and to infer cross-species transmission between humans and animals. Because it relies only on viral genome sequences, VBERT can be applied on a broad scale, especially to newly discovered viruses. We used it to predict host information for viruses across 59 host taxonomic classes, from the Phylum to the Species level, including insects, birds, and mammals. In five-fold cross-validation, VBERT achieved a micro-average (each instance weighted equally) area under the receiver operating characteristic curve (AUC) of 0.906, a macro-average (each class weighted equally) AUC of 0.862, and a mean average precision (mean-AP) of 0.70; on the independent test set, it achieved a micro-average AUC of 0.920, a macro-average AUC of 0.879, and a mean-AP of 0.71. In contrast, BLAST achieved a micro-average AUC of 0.733, a macro-average AUC of 0.681, and a mean-AP of 0.44 in five-fold cross-validation, and a micro-average AUC of 0.730, a macro-average AUC of 0.679, and a mean-AP of 0.43 on the independent test set. By sharing taxonomic information and using the sequence embedding strategy, VBERT performed better both in classes with adequate examples and in classes with sparse examples. VBERT also performed in a relatively balanced way across host taxonomic levels, with macro-average AUC between 0.860 and 0.872 for the levels below Class, which indicates that the HMCN architecture can limit error propagation down the taxonomic hierarchy.
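The difference between the micro-average and macro-average AUC reported above can be made concrete with a small pure-Python sketch. The helper names `auc` and `micro_macro_auc` and the toy scores are illustrative assumptions, not part of the reported evaluation.

```python
def auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def micro_macro_auc(score_matrix, label_matrix):
    """Rows are viruses, columns are host classes."""
    n_classes = len(label_matrix[0])
    per_class = [
        auc([row[c] for row in score_matrix], [row[c] for row in label_matrix])
        for c in range(n_classes)
    ]
    # Macro: average the per-class AUCs, so every class counts equally.
    macro = sum(per_class) / n_classes
    # Micro: pool every (virus, class) decision, so every instance counts equally.
    flat_scores = [s for row in score_matrix for s in row]
    flat_labels = [y for row in label_matrix for y in row]
    micro = auc(flat_scores, flat_labels)
    return micro, macro


scores = [[0.9, 0.2], [0.8, 0.6], [0.3, 0.7], [0.1, 0.4]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
micro, macro = micro_macro_auc(scores, labels)
```

On this toy input the first class is ranked perfectly and the second is not, so the macro-average (0.875) is lower than the micro-average (0.9375). The same effect explains why a model can show a higher micro-average than macro-average AUC when rare host classes are harder to predict.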
Conclusions
  We developed a novel classification method based on a deep language model, named VBERT, to predict host range from virus genomes. We curated a virus-host interaction dataset with high-quality genome information, which can be used for future algorithm benchmarking. In addition, our model identifies high-risk animal hosts and virus families, narrowing the search scope for follow-up laboratory experiments.
Keywords
Virus host prediction; Deep language model; Hierarchical multi-label classification
Speaker
朱怀球
Professor, Peking University

Authors
朱怀球, Peking University