Abstract—General-purpose protein structure embeddings can be used for many important protein biology tasks, such as protein design, drug design, and binding affinity prediction. Recent research has shown that attention-based encoder layers are better suited to learning high-level features. Based on this key observation, we treat low-level representation learning and high-level representation learning separately, and propose a two-level general-purpose protein structure embedding neural network, called ContactLib-ATT. On the local embedding level, a simple yet meaningful hydrogen-bond representation is learned. On the global embedding level, attention-based encoder layers are employed for global representation learning. In our experiments, ContactLib-ATT achieves a SCOP superfamily classification accuracy of 82.4% (i.e., 6.7% higher than the state-of-the-art method) on the SCOP40 2.07 dataset. Moreover, ContactLib-ATT is shown to successfully simulate a structure-based search engine for remote homologous proteins: our top-10 candidate list contains at least one remote homolog with a probability of 91.9%.
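To make the top-10 retrieval figure concrete, the following is a minimal sketch of how such a hit rate could be computed from fixed-length embeddings. It is not the ContactLib-ATT implementation: the cosine similarity ranking, the function names `cosine` and `top_k_hit_rate`, the toy vectors, and the use of a shared label to stand in for "remote homolog" are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_hit_rate(embeddings, labels, k=10):
    """Fraction of queries whose k nearest neighbours (query excluded)
    contain at least one entry sharing the query's label."""
    n = len(embeddings)
    hits = 0
    for q in range(n):
        # Rank every other entry by similarity to the query, best first.
        ranked = sorted(
            (i for i in range(n) if i != q),
            key=lambda i: cosine(embeddings[q], embeddings[i]),
            reverse=True,
        )
        if any(labels[i] == labels[q] for i in ranked[:k]):
            hits += 1
    return hits / n if n else 0.0


# Hypothetical toy data: five 2-D "embeddings" with made-up class labels.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
labels = ["a", "a", "b", "b", "c"]

rate = top_k_hit_rate(embeddings, labels, k=1)
print(f"top-1 hit rate: {rate:.2f}")
```

In this toy set the fifth entry has no same-label partner, so it can never count as a hit; an actual evaluation would have to decide whether such queries are included in the denominator.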