Protein language models and the semiotic language of Signal Peptides
Mariana Vitti Rodrigues, Claus Emmeche & Henrik Nielsen
Are language models learning to read the language of nature? This communication explores what can be learned from applying automated methods, specifically protein language models, to the prediction of signal peptides. Signal peptides (SPs) are short amino acid sequences whose function is to guide their proteins towards the secretory pathway in a cell. Being non-stable parts of proteins, signal peptides are difficult to identify experimentally. One approach to identifying SPs bioinformatically is to use protein language models, which are inspired by Natural Language Processing (NLP) algorithms (SignalP 6.0, Teufel et al. 2022). In this context, we ask: what is the role of the computer (language) model in our knowledge about signal peptides? To investigate this question, we will propose a semiotic analysis of protein language, inquiring into the extent to which language models are able to read bits of ‘the language of nature’. Our hypothesis (H1) is that the semiotic concept of the dicisign, or natural proposition (Stjernfelt 2014), understood as a kind of sign that conveys information, can shed light on the SP phenomenon. Peirce presented the dicisign as a double-sign structure constituted by the colocalization of an Icon and an Index (Peirce CP 2.312, 1903). The iconic part of a dicisign embodies the possible properties that can be attributed to the sign’s object; it is the description of that object. The indexical part of the dicisign indicates possible objects of attribution; it is the reference to that object. The function of a dicisign is to inform, that is, to convey the form of its object by its own abilities, in order to generate new interpretants which claim that the icon-index syntax holds. The SP, viewed as a dicisign, can be understood as performing a zipcode function: it guides a given package (icon/protein) to a certain location (index/transmembrane or outside the cell), conveying the form of its object (protein translocation).
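As a purely illustrative aid, the double-sign structure and its zipcode reading can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the names `Dicisign`, `proposition` and `read_signal_peptide`, as well as the example values, are hypothetical and introduced only for illustration. They are not part of SignalP 6.0, nor of Peirce's or Stjernfelt's own formalism.

```python
from dataclasses import dataclass


# Hypothetical illustration of the icon-index structure described above.
# Nothing here models real signal peptide prediction.
@dataclass(frozen=True)
class Dicisign:
    icon: str   # descriptive part: the "package" (here, the protein)
    index: str  # referential part: the "location" the sign points to

    def proposition(self) -> str:
        # An interpretant claims that the icon-index syntax holds,
        # i.e. that this package belongs at this location.
        return f"{self.icon} -> {self.index}"


def read_signal_peptide(protein: str, destination: str) -> Dicisign:
    """Pair a protein (icon) with its destination (index), zipcode-style."""
    return Dicisign(icon=protein, index=destination)


# Example values are placeholders, not experimental data.
sign = read_signal_peptide("secreted protein", "outside the cell")
print(sign.proposition())
```

The sketch only makes the syntax of the analogy explicit: one sign carries both a description and a reference, and reading it amounts to asserting that the two belong together.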
Within this framework, we will discuss the extent to which signal peptides can be considered part of semiosis, in order to investigate which semiotic aspects of SPs can be ‘translated’ (or digitalized) for analysis through language models (we will even pose this question to another language model). Finally, we invite the audience to reflect upon the extent to which a biosemiotic approach to SPs plays a relevant role as a source of new knowledge in the growing automation of scientific practice.
References
Teufel, F., Almagro Armenteros, J.J., Johansen, A.R. et al. (2022). SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 40, 1023–1025. https://doi.org/10.1038/s41587-021-01156-3
Peirce, C. S. (1958). The Collected Papers of Charles Sanders Peirce. Electronic edition. Vols. I-VI, Hartshorne, C., Weiss, P. (Eds.), 1931-1935. Vols. VII-VIII, Burks, A. W. (Ed.). Charlottesville: Intelex Corporation. Cambridge: Harvard University Press. [Quoted as CP, followed by the volume and paragraph].
Stjernfelt, F. (2014). Natural Propositions: The Actuality of Peirce’s Doctrine of Dicisigns. Boston, Mass.: Docent Press.