“Libraries and journal articles as we know them will cease to exist,” said Barend Mons at the symposium in honor of our library’s 25th anniversary (June 3rd). “Possibly we will have another kind of party in another 25 years…,” he continued, grinning.
What he had to say in the next half hour intrigued me. And although I had no pen with me (it was our party, remember), I thought it was interesting enough to devote a post to it.
I’m basing this post not only on my memory (we had a lot of Italian wine at the buffet), but also on an article Mons referred to [1], a Dutch newspaper article [2], other articles [3-6] and PowerPoints [7-9] on the topic.
This is a field I know little about, so I will try to keep it simple (also for my sake).
Mons started by touching on a problem that is very familiar to doctors, scientists and librarians: information overload caused by a growing web of linked data. He showed a picture that looked like the one at the right (though I’m sure those are Twitter networks).
As he said elsewhere:
(..) the feeling that we are drowning in information is widespread (..) we often feel that we have no satisfactory mechanisms in place to make sense of the data generated at such a daunting speed. Some pharmaceutical companies are apparently seriously considering refraining from performing any further genome-wide association studies (… whole genome association –…) as the world is likely to produce many more data than these companies will ever be able to analyze with currently available methods.
With the current search engines we have to do a lot of digging to get the answers. Computers are central to this digging, because there is no way people can stay up to date, even in their own field.
However, computers can’t deal with the current web and with scientific information as it is produced in classic articles (even the electronic versions), for the following reasons:
- Homonyms. Words that are written or sound the same but have a different meaning. Acronyms are notorious in this respect. Barend gave PSA as an example, but, without realizing it, he used a better one: PPI. To me this means Proton Pump Inhibitor, but apparently to him it means Protein-Protein Interaction.
- Redundancy. To keep journal articles readable we often use different words to denote the same thing. These do not add to the real new findings in a paper. In fact, the majority of digital information is duplicated repeatedly. For example, “Mosquitoes transfer malaria” is a factual statement repeated in many consecutive papers on the subject.
- The connection between words is not immediately clear (for a computer). For instance, TNF inhibitors can be used to treat skin disorders, but the same drugs can also cause them.
- Data are not structured beforehand.
- Weight: some “facts” are “harder” than others.
- Not all data are available or accessible. Many data are either not published (e.g. negative studies), not freely available or not easy to find. Some portals (GoPubMed, NCBI) provide structural information (fields, including keywords), but do not enable full-text searching.
- Data are spread. Data are kept in “data silos” not meant for sharing (ppt2). One would like to query 1000 databases simultaneously, but this would require semantic web standards for publishing, sharing and querying knowledge from diverse sources…
In a nutshell, the problem is as Barend put it: “Why bury data first and then mine it again?” 
Homonyms, redundancy and connection can be tackled, at least in the field Barend is working in (bioinformatics).
Different terms denoting the same concept (i.e. synonyms) can be mapped to a single concept identifier (i.e. a list of synonyms), whereas identical terms used to indicate different concepts (i.e. homonyms) can be resolved by a disambiguation algorithm.
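To make this mapping a bit more tangible, here is a minimal Python sketch. The concept identifiers, synonym lists and context words are all invented for illustration, and the word-overlap rule is just a stand-in for a real disambiguation algorithm, not the one Mons and colleagues use.

```python
# Hypothetical mini-thesaurus: each concept ID maps to its list of synonyms.
# All IDs, synonyms and context words below are invented for illustration.
THESAURUS = {
    "C_ESR1": {"esr1", "eralpha", "estrogen receptor 1"},
    "C_PPI_DRUG": {"ppi", "proton pump inhibitor"},
    "C_PPI_INTERACTION": {"ppi", "protein-protein interaction"},
}

# Words that hint at one particular sense of an ambiguous term.
CONTEXT_HINTS = {
    "C_PPI_DRUG": {"stomach", "acid", "omeprazole", "reflux"},
    "C_PPI_INTERACTION": {"binding", "complex", "pathway", "yeast"},
}


def candidates(term):
    """Return all concept IDs whose synonym list contains the term."""
    key = term.lower()
    return sorted(cid for cid, syns in THESAURUS.items() if key in syns)


def resolve(term, sentence):
    """Map a term to a single concept ID, or None if that is not possible.

    Synonyms resolve directly. Homonyms are resolved by counting how many
    context words of each candidate concept appear in the sentence.
    """
    found = candidates(term)
    if len(found) <= 1:
        return found[0] if found else None
    words = set(sentence.lower().replace(".", " ").replace(",", " ").split())
    scores = {cid: len(words & CONTEXT_HINTS.get(cid, set())) for cid in found}
    best = max(scores.values())
    winners = [cid for cid, score in scores.items() if score == best]
    # Only answer when exactly one sense is supported by the context.
    return winners[0] if len(winners) == 1 and best > 0 else None


print(resolve("ERalpha", ""))
print(resolve("PPI", "The PPI omeprazole reduces stomach acid."))
print(resolve("PPI", "A novel PPI in the yeast pathway"))
```

Note that the sketch refuses to answer when the context gives no hint, which is probably the safer choice for an ambiguous acronym.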
The shortest meaningful sentence is a triplet: a combination of subject, predicate and object. A triplet indicates the connection and its direction. “Mosquitoes cause/transfer malaria” is such a triplet, where mosquitoes and malaria are concepts. In the field of proteins: “UNIPROT 05067 is a protein” is a triplet (where UNIPROT 05067 and protein are concepts), as are “UNIPROT 05067 is located in the membrane” and “UNIPROT 0506 interacts with UNIPROT 0506”. Since these triplets (statements) derive from different databases, consistent naming and availability of information are crucial to find them. Barend and colleagues are the people behind WikiProteins, an open, collaborative wiki focusing on proteins and their role in biology and medicine [4-6].
Concepts and triplets are widely accepted in the world of bioinformatics. To get an idea of what this means for searching, see the search engine Quertle, which allows semantic search of PubMed and full-text biomedical literature, with automatic extraction of key concepts. Searching for ESR1 $BiologicalProcess will find abstracts mentioning all kinds of processes in which ESR1 (aka ERα, ERalpha, EStrogen Receptor 1) is involved. The search can be refined by choosing ‘narrower terms’ like “proliferation” or “transcription”.
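For readers who think in code, a triplet can be sketched as a small record with a subject, a predicate and an object. The Python below is only an illustration: the predicate names and the source labels (database_A, database_B) are my own placeholders, not the vocabulary of any real database.

```python
from collections import namedtuple

# One statement: subject -> predicate -> object, plus where it came from.
Triple = namedtuple("Triple", ["subject", "predicate", "obj", "source"])

# Example statements; predicate names and source labels are placeholders.
TRIPLES = [
    Triple("mosquito", "transfers", "malaria", "database_A"),
    Triple("UNIPROT 05067", "is_a", "protein", "database_A"),
    Triple("UNIPROT 05067", "located_in", "membrane", "database_B"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return the triples that match every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything known about one concept, regardless of the source database.
for t in query(TRIPLES, subject="UNIPROT 05067"):
    print(t.predicate, t.obj, "(from", t.source + ")")
```

The query only brings the two statements together because both sources use exactly the same name for the protein. The direction matters too: asking for triples with malaria as the subject returns nothing.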
The new aspect is that Mons wants to turn those triplets into (what he calls) nano-publications. Because not every statement is equally ‘hard’, nano-publications are weighted by assigning numbers from 0 (uncertain) to 1 (very certain). The nano-publication “mosquitoes transfer malaria” will get a number approaching 1.
Such nano-publications offer little shading and possibility for interpretation and discussion. Mons does not propose to entirely replace traditional articles by nano-publications. Quote:
While arguing that research results should be available in the form of nano-publications, we are emphatically not saying that traditional, classical papers should not be published any longer. But their role is now chiefly for the official record, the “minutes of science”, and not so much as the principal medium for the exchange of scientific results. That exchange, which increasingly needs the assistance of computers to be done properly and comprehensively, is best done with machine-readable, semantically consistent nano-publications.
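As a rough sketch of what such a weighted statement could look like in code, assuming nothing more than a triplet plus a certainty between 0 and 1: the field names, the example scores (other than “approaching 1” for malaria) and the 0.9 threshold below are my own assumptions, not part of Mons’ proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NanoPublication:
    subject: str
    predicate: str
    obj: str
    certainty: float  # 0 = uncertain, 1 = very certain
    source: str       # placeholder for a citation of the original finding

    def __post_init__(self):
        # Reject weights outside the 0..1 range described in the text.
        if not 0.0 <= self.certainty <= 1.0:
            raise ValueError("certainty must lie between 0 and 1")


def well_supported(pubs, threshold=0.9):
    """Keep only statements at or above the (assumed) certainty threshold."""
    return [p for p in pubs if p.certainty >= threshold]


# Invented example weights: only the first one reflects the text.
PUBS = [
    NanoPublication("mosquito", "transfers", "malaria", 0.99, "source_1"),
    NanoPublication("TNF inhibitor", "causes", "skin disorder", 0.4, "source_2"),
]

for pub in well_supported(PUBS):
    print(pub.subject, pub.predicate, pub.obj, pub.certainty)
```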
According to Mons, authors and their funders should start requesting and expecting the papers that they have written and funded to be semantically coded when published, preferably by the publisher and otherwise by libraries: the technology exists to provide Web browsers with the functionality for users to identify nano-publications, and annotate them.
Like the WikiProteins wiki, nano-publications will be entirely open access. It will suffice to properly cite the original finding/publication.
In addition there is a new kind of “peer review”. An expert network is set up to immediately assess a twittered nano-publication when it comes out, so that the publication is assessed by perhaps 1000 experts instead of 2 or 3 reviewers.
On a small scale, this is already happening. Nano-publications are sent as tweets to people like Gert Jan van Ommen (past president of HUGO and co-author of 5 of my publications (or v.v.)) who then gives a red (don’t believe) or a green light (believe) via one click on his BlackBerry.
As Mons put it, it looks like a subjective event, quite similar to “dislike” and “like” in social media platforms like Facebook.
Barend often referred to a PLOS ONE paper by van Haagen et al. [1], showing the superiority of the concept-profile based approach not only in detecting explicitly described PPIs, but also in inferring new PPIs.
[You can skip the part below if you’re not interested in details of this paper]
Van Haagen et al. first established a set of 61,807 known human PPIs and of many more probable Non-Interacting Protein Pairs (NIPPs) from online human-curated databases (and NIPPs also from the IntAct database).
For the concept-based approach they used the concept-recognition software Peregrine, which includes synonyms and spelling variations of concepts and uses simple heuristics to resolve homonyms.
This concept-profile based approach was compared with several other approaches, all depending on co-occurrence (of words or concepts):
- Word-based direct relation. This approach uses direct PubMed queries (words) to detect if proteins co-occur in the same abstract (thus the names of two proteins are combined with the boolean ‘AND’). This is the simplest approach and represents how biologists might use PubMed to search for information.
- Concept-based direct relation (CDR). This approach uses concept-recognition software to find PPIs, taking synonyms into account and resolving homonyms. Two concepts (here, two proteins) are considered related if they co-occur in the same abstract.
- STRING. The STRING database contains a text mining score which is based on direct co-occurrences in literature.
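The difference between a direct relation and a concept profile is easier to see in a toy example. The Python sketch below is my own simplification: the “abstracts” are invented sets of concepts, a profile is just a count of co-occurring concepts, and profiles are compared with cosine similarity. The weighting and similarity measure actually used in the paper may well differ.

```python
from collections import Counter
from math import sqrt

# Toy "abstracts", each reduced to the set of concepts recognised in it.
# protA, protB and protC are made-up proteins; the contents are invented.
ABSTRACTS = [
    {"protA", "Z-disc", "muscle"},
    {"protA", "dysferlin"},
    {"protB", "Z-disc", "dysferlin"},
    {"protB", "muscle"},
    {"protC", "ribosome"},
]


def direct_relation(a, b, abstracts):
    """True if both concepts occur together in at least one abstract."""
    return any(a in doc and b in doc for doc in abstracts)


def concept_profile(protein, abstracts):
    """Count the concepts that co-occur with the protein in any abstract."""
    profile = Counter()
    for doc in abstracts:
        if protein in doc:
            profile.update(doc - {protein})
    return profile


def similarity(p, q):
    """Cosine similarity between two concept profiles (0 if either is empty)."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


prof_a = concept_profile("protA", ABSTRACTS)
prof_b = concept_profile("protB", ABSTRACTS)
prof_c = concept_profile("protC", ABSTRACTS)

print(direct_relation("protA", "protB", ABSTRACTS))  # never in the same abstract
print(round(similarity(prof_a, prof_b), 3))          # yet the profiles overlap fully
print(round(similarity(prof_a, prof_c), 3))          # no shared context at all
```

In this toy set protA and protB never appear in the same abstract, so any co-occurrence method misses them, while their profiles share all their context concepts. That is the kind of indirect link a profile-based approach can pick up.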
The results show that, using concept profiles, 43% of the known PPIs were detected, with a specificity of 99%, and 66% of all known PPIs with a specificity of 95%. In contrast, the direct relations methods and STRING show much lower scores:
[Table: sensitivity at 99% specificity, sensitivity at 95% specificity and area under the curve for each method; the values are not reproduced here.]
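For those wondering how a figure like “sensitivity at a specificity of 99%” is obtained: every protein pair gets a score, a cut-off is chosen so that the required fraction of known non-interacting pairs falls at or below it, and one then counts how many known interacting pairs score above it. A minimal Python sketch, with invented scores and my own choice of cut-off rule (not taken from the paper):

```python
def sensitivity_at_specificity(pos_scores, neg_scores, target_spec):
    """Sensitivity at the lowest threshold that still reaches target_spec.

    A pair is called "interacting" when its score is above the threshold.
    pos_scores: scores of known interacting pairs (PPIs).
    neg_scores: scores of known non-interacting pairs (NIPPs).
    """
    # Candidate thresholds: below everything, then every observed score.
    thresholds = [float("-inf")] + sorted(set(pos_scores) | set(neg_scores))
    for threshold in thresholds:
        true_neg = sum(s <= threshold for s in neg_scores)
        if true_neg / len(neg_scores) >= target_spec:
            true_pos = sum(s > threshold for s in pos_scores)
            return true_pos / len(pos_scores)
    return 0.0


# Invented scores, purely for illustration.
ppi_scores = [0.9, 0.8, 0.6, 0.3]
nipp_scores = [0.7, 0.4, 0.2, 0.1, 0.05]

print(sensitivity_at_specificity(ppi_scores, nipp_scores, 0.8))
print(sensitivity_at_specificity(ppi_scores, nipp_scores, 1.0))
```

Demanding a higher specificity pushes the cut-off up, so fewer true interactions are detected. This is why the sensitivity at 99% specificity is lower than at 95%.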
These findings suggested that not all proteins with high similarity scores are known to interact, but that they may be related in another way, e.g. they could be involved in the same pathway or be part of the same protein complex without physically interacting. Indeed, concept-based profiling was superior in predicting relationships between proteins potentially present in the same complex or pathway (thus A-C inferred from the co-occurring protein pairs A-B and B-C).
Since there is often a substantial time lag between the first publication of a finding and the time the PPI is entered in a database, a retrospective study was performed to examine how many of the PPIs that would have been predicted by the different methods in 2005 were confirmed in 2007. Indeed, using concept profiles, PPIs could be efficiently predicted before they entered PPI databases and before their interaction was explicitly described in the literature.
The practical value of the method for discovery of novel PPIs is illustrated by the experimental confirmation of the inferred physical interaction between CAPN3 and PARVB, which was based on frequent co-occurrence of both proteins with concepts like Z-disc, dysferlin, and alpha-actinin. The relationships between proteins predicted are broader than PPIs, and include proteins in the same complex or pathway. Depending on the type of relationships deemed useful, the precision of the method can be as high as 90%.
In line with their open access policy, they have made the full set of predicted interactions available in a downloadable matrix and through the webtool Nermal, which lists the most likely interaction partners for a given protein.
According to Mons, this framework will be a very rich source for new discoveries, as it will enable scientists to prioritize potential interaction partners for further testing.
Barend Mons started with the statement that nano-publications will replace the classic articles (and the need for libraries). However, things are never as black as they seem.
Mons showed that a nano-publication is basically a “peer-reviewed, openly available” triplet. Triplets can be effectively retrieved and inferred from available databases/papers using a concept-based approach.
Nevertheless, effectiveness needs to be enhanced by semantically coding triplets when published.
What will this mean for clinical medicine? Bioinformatics is quite another discipline, with better structured and more straightforward data (interaction, identity, place). Interestingly, Mons and van Haagen plan to do further studies, in which they will evaluate whether concept profiles can also be applied in the prediction of other types of relations, for instance between drugs or genes and diseases. The future will tell whether the above-mentioned approach is also useful in clinical medicine.
Implementation of the following (implicit) recommendations would be advisable, independent of the possible success of nano-publications:
- Less emphasis on “publish or perish” (thus more on the data themselves, whether positive, negative, trendy or not)
- Better structured data, partly by structuring articles. This has already improved over the years by introducing structured abstracts, availability of extra material (appendices, data) online and by guidelines, such as STARD (The Standards for Reporting of Diagnostic Accuracy)
- Open Access
- Availability of full text
- Availability of raw data
One might argue that disclosing data is unlikely when pharma is involved. It is therefore encouraging that a group of major pharmaceutical companies has announced that they will share pooled data from failed clinical trials in an attempt to figure out what is going wrong in the studies and what can be done to improve drug development [10].
Unfortunately I don’t have Mons’ presentation. Instead, two other presentations about triplets, concepts and the semantic web are listed below.
1. van Haagen HH, ‘t Hoen PA, Botelho Bovo A, de Morrée A, van Mulligen EM, Chichester C, Kors JA, den Dunnen JT, van Ommen GJ, van der Maarel SM, Kern VM, Mons B, & Schuemie MJ (2009). Novel protein-protein interactions inferred from literature context. PLoS ONE, 4 (11). PMID: 19924298
2. Twitteren voor de wetenschap (Twittering for Science), Maartje Bakker, Volkskrant (2010-06-05)
3. Barend Mons and Jan Velterop (?) Nano-Publication in the e-science era (Concept Web Alliance, Netherlands BioInformatics Centre, Leiden University Medical Center.) http://www.nbic.nl/uploads/media/Nano-Publication_BarendMons-JanVelterop.pdf, accessed June 20th, 2010.
4. Mons, B., Ashburner, M., Chichester, C., van Mulligen, E., Weeber, M., den Dunnen, J., van Ommen, G., Musen, M., Cockerill, M., Hermjakob, H., Mons, A., Packer, A., Pacheco, R., Lewis, S., Berkeley, A., Melton, W., Barris, N., Wales, J., Meijssen, G., Moeller, E., Roes, P., Borner, K., & Bairoch, A. (2008). Calling on a million minds for community annotation in WikiProteins. Genome Biology, 9 (5). DOI: 10.1186/gb-2008-9-5-r89
5. Science Daily (2008/05/08) Large-Scale Community Protein Annotation: WikiProteins
6. Boing Boing (2008/05/28) WikiProteins: a collaborative space for biologists to annotate proteins
7. (ppt1) SWAT4LS 2009: Semantic Web Applications and Tools for Life Sciences, Amsterdam, Science Park, Friday, 20th of November 2009. http://www.swat4ls.org/
8. (ppt2) Michel Dumontier: triples for the people scientists liberating biological knowledge with the semantic web
9. (ppt3, only slide shown): Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus, by Duncan Hill (EMBL-EBI)
10. WSJ (2010/06/11) Drug Makers Will Share Data From Failed Alzheimer’s Trials