Abstract:
Understanding the language of non-coding DNA is a major topic in genomic
research. The gene regulatory code is highly complex due to the presence of polysemy
and distant semantic relationships, which earlier informatics approaches frequently fail to
capture.
To address this difficulty, we used DNABERT, a pre-trained bidirectional
encoder representation that captures a global and transferable understanding of genomic
DNA sequences based on upstream and downstream nucleotide contexts. We compared
DNABERT to the most widely used tools for predicting genome-wide regulatory elements
and found that it was easier to use, more accurate, and more efficient. We demonstrate
that a single pre-trained transformer model can reach state-of-the-art performance in
the prediction of promoters, splice sites, and transcription factor binding sites after
simple fine-tuning on a modest amount of task-specific labeled data. Furthermore, DNABERT allows
for direct visualization of nucleotide-level importance and semantic relationships within input
sequences, resulting in improved interpretability and more accurate identification of
conserved sequence motifs and candidate functional genetic variants.