Will Nano-Publications & Triplets Replace The Classic Journal Articles?

23 06 2010

“Libraries and journal articles as we know them will cease to exist,” said Barend Mons at the symposium in honor of our library’s 25th anniversary (June 3rd). “Possibly we will have another kind of party in another 25 years…” he continued, grinning.

What he had to say in the next half hour intrigued me. And although I had no pen with me (it was our party, remember), I thought it was interesting enough to devote a post to it.

I’m basing this post not only on my memory (we had a lot of Italian wine at the buffet), but also on an article Mons referred to [1], a Dutch newspaper article [2], other articles [3-6] and PowerPoint presentations [7-9] on the topic.

This is a field I know little about, so I will try to keep it simple (also for my sake).

Mons started by touching on a problem that is very familiar to doctors, scientists and librarians: information overload caused by a growing web of linked data. He showed a picture that looked like the one on the right (though I’m sure those are Twitter networks).

As he said elsewhere [3]:

(..) the feeling that we are drowning in information is widespread (..) we often feel that we have no satisfactory mechanisms in place to make sense of the data generated at such a daunting speed. Some pharmaceutical companies are apparently seriously considering refraining from performing any further genome-wide association studies (… whole genome association –…) as the world is likely to produce many more data than these companies will ever be able to analyze with currently available methods .

With the current search engines we have to do a lot of digging to get the answers [8]. Computers are central to this digging, because there is no way people can stay updated, even in their own field.

However, computers can’t deal with the current web and with scientific information as produced in classic articles (even the electronic versions), for the following reasons:

  1. Homonyms. Words that sound or are spelled the same but have a different meaning. Acronyms are notorious in this respect. Barend gave PSA as an example, but, without realizing it, he used a better example: PPI. This means Proton Pump Inhibitor to me, but apparently Protein-Protein Interaction to him.
  2. Redundancy. To keep journal articles readable we often use different words to denote the same thing. These do not add to the real new findings in a paper. In fact the majority of digital information is duplicated repeatedly. For example “Mosquitoes transfer malaria” is a factual statement repeated in many consecutive papers on the subject.
  3. The connection between words is not immediately clear (for a computer). For instance, TNF inhibitors (anti-TNF drugs) can be used to treat skin disorders, but the same drugs can also cause them.
  4. Data are not structured beforehand.
  5. Weight: some “facts” are “harder” than others.
  6. Not all data are available or accessible. Many data are either not published (e.g. negative studies), not freely available or not easy to find. Some portals (GoPubmed, NCBI) provide structured information (fields, including keywords), but do not enable full-text searching.
  7. Data are spread. Data are kept in “data silos” not meant for sharing [8]. One would like to simultaneously query 1000 databases, but this would require semantic web standards for publishing, sharing and querying knowledge from diverse sources.

In a nutshell, the problem is as Barend put it: “Why bury data first and then mine it again?” [9]

Homonyms, redundancy and connection can be tackled, at least in the field Barend is working in (bioinformatics).

Different terms denoting the same concept (i.e. synonyms) can be mapped to a single concept identifier (i.e. a list of synonyms), whereas identical terms used to indicate different concepts (i.e. homonyms) can be resolved by a disambiguation algorithm.
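To make this a bit more tangible, here is a minimal Python sketch of both steps: a synonym table that maps terms to one concept identifier, and a very naive disambiguation of a homonym by counting context words. The concept identifiers, the cue words and the `resolve` function are all made up for illustration; real concept-recognition software is certainly more sophisticated than this.

```python
# Hypothetical mini-thesaurus: every synonym points to one concept identifier.
SYNONYMS = {
    "esr1": "C_ESR1",
    "eralpha": "C_ESR1",
    "estrogen receptor 1": "C_ESR1",
}

# Hypothetical homonym table: one term, several candidate concepts,
# each with a few context words that hint at that meaning.
HOMONYMS = {
    "ppi": {
        "C_PROTON_PUMP_INHIBITOR": {"omeprazole", "reflux", "stomach", "acid"},
        "C_PROTEIN_PROTEIN_INTERACTION": {"binding", "complex", "pathway", "proteins"},
    },
}


def resolve(term, context=""):
    """Return a concept identifier for `term`, or None if it stays ambiguous."""
    key = term.lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    if key in HOMONYMS:
        words = set(context.lower().split())
        candidates = HOMONYMS[key]
        # Pick the candidate whose cue words overlap most with the context.
        best = max(candidates, key=lambda c: len(candidates[c] & words))
        if candidates[best] & words:
            return best
    return None


print(resolve("ERalpha"))
print(resolve("PPI", "patients with reflux received omeprazole"))
print(resolve("PPI", "binding of two proteins in a complex"))
```

Without any context the acronym stays unresolved (`resolve("PPI")` returns `None`), which is exactly the situation a computer is in when it reads a bare abbreviation.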

The shortest meaningful sentence is a triplet: a combination of subject, predicate and object. A triplet indicates the connection and direction. “Mosquitoes cause/transfer malaria” is such a triplet, where mosquitoes and malaria are concepts. In the field of proteins: “UNIPROT 05067 is a protein” is a triplet (where UNIPROT 05067 and protein are concepts), as are: “UNIprotein 05067 is located in the membrane” and “UNIprotein 0506 interacts with UNIprotein 0506” [8]. Since these triplets (statements) derive from different databases, consistent naming and availability of information is crucial to find them. Barend and colleagues are the people behind WikiProteins, an open, collaborative wiki focusing on proteins and their role in biology and medicine [4-6].
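For readers who think in code, a triplet can be sketched as a tiny data structure. The example below is only an illustration in Python: the `Triple` type, the `match` helper and the placeholder names `protein_A` and `protein_B` are my own assumptions, not identifiers from any real database.

```python
from collections import namedtuple

# A triplet: subject, predicate, object. The order gives the direction.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# A toy store with placeholder protein names (not real database identifiers).
triples = [
    Triple("mosquitoes", "transfer", "malaria"),
    Triple("protein_A", "is_a", "protein"),
    Triple("protein_A", "located_in", "membrane"),
    Triple("protein_A", "interacts_with", "protein_B"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


# Everything we know about protein_A:
for t in match(triples, subject="protein_A"):
    print(t.subject, t.predicate, t.object)
```

The point of consistent naming becomes obvious here: the query only finds the three statements about `protein_A` because all three use exactly the same name for the subject.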

Concepts and triplets are widely accepted in the world of bioinformatics. To get an idea of what this means for searching, see the search engine Quertle, which allows semantic search of PubMed and full-text biomedical literature with automatic extraction of key concepts. Searching for ESR1 $BiologicalProcess will find abstracts mentioning all kinds of processes in which ESR1 (aka ERα, ERalpha, EStrogen Receptor 1) is involved. The search can be refined by choosing ‘narrower terms’ like “proliferation” or “transcription”.

The new aspect is that Mons wants to turn those triplets into (what he calls) nano-publications. Because not every statement is equally ‘hard’, nano-publications are weighted by assigning numbers from 0 (uncertain) to 1 (very certain). The nano-publication “mosquitoes transfer malaria” will get a number approaching 1.
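As I understand it, a nano-publication is then a triplet with a weight (and a pointer to the original finding) attached. A rough Python sketch under that assumption follows; the class name, the field names, the example weights and the 0.9 threshold are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NanoPublication:
    subject: str
    predicate: str
    obj: str
    confidence: float  # 0 = uncertain, 1 = very certain
    source: str = ""   # citation of the original finding (hypothetical field)

    def __post_init__(self):
        # Reject weights outside the 0..1 range described in the text.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie between 0 and 1")


def hard_facts(pubs, threshold=0.9):
    """Keep only statements at or above an (arbitrarily chosen) threshold."""
    return [p for p in pubs if p.confidence >= threshold]


pubs = [
    NanoPublication("mosquitoes", "transfer", "malaria", 0.99),
    NanoPublication("protein_A", "interacts_with", "protein_B", 0.40),
]

for p in hard_facts(pubs):
    print(p.subject, p.predicate, p.obj, p.confidence)
```

With the made-up weights above, only the malaria statement passes the filter.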

Such nano-publications offer little room for nuance, interpretation and discussion. Mons does not propose to entirely replace traditional articles with nano-publications. Quote [3]:

While arguing that research results should be available in the form of nano-publications, [we] are emphatically not saying that traditional, classical papers should not be published any longer. But their role is now chiefly for the official record, the “minutes of science”, and not so much as the principle medium for the exchange of scientific results. That exchange, which increasingly needs the assistance of computers to be done properly and comprehensively, is best done with machine-readable, semantically consistent nano-publications.

According to Mons, authors and their funders should start requesting and expecting the papers that they have written and funded to be semantically coded when published, preferably by the publisher and otherwise by libraries: the technology exists to provide Web browsers with the functionality for users to identify nano-publications, and annotate them.

Like the WikiProteins wiki, nano-publications will be entirely open access. It will suffice to properly cite the original finding/publication.

In addition there is a new kind of “peer review”. An expert network is set up to immediately assess a tweeted nano-publication when it comes out, so that the publication is assessed by perhaps 1000 experts instead of 2 or 3 reviewers.

On a small scale, this is already happening. Nano-publications are sent as tweets to people like Gert Jan van Ommen (past president of HUGO and co-author of 5 of my publications (or v.v.)), who then gives a red light (don’t believe) or a green light (believe) via one click on his BlackBerry.

As Mons put it, it looks like a subjective event, quite similar to “dislike” and “like” on social media platforms like Facebook.

Barend often referred to a PLoS ONE paper by van Haagen et al. [1], showing the superiority of the concept-profile based approach not only in detecting explicitly described PPIs, but also in inferring new PPIs.

[You can skip the part below if you’re not interested in details of this paper]

Van Haagen et al. first established a set of 61,807 known human PPIs and a much larger set of probable Non-Interacting Protein Pairs (NIPPs) from online human-curated databases (the NIPPs were also drawn from the IntAct database).

For the concept-based approach they used the concept-recognition software Peregrine, which includes synonyms and spelling variations of concepts and uses simple heuristics to resolve homonyms.

This concept-profile based approach was compared with several other approaches, all depending on co-occurrence (of words or concepts):

  • Word-based direct relation. This approach uses direct PubMed queries (words) to detect if proteins co-occur in the same abstract (thus the names of two proteins are combined with the boolean ‘AND’). This is the simplest approach and represents how biologists might use PubMed to search for information.
  • Concept-based direct relation (CDR). This approach uses concept-recognition software to find PPIs, taking synonyms into account and resolving homonyms. Here a relation between two concepts (in this case two proteins) is detected if they co-occur in the same abstract.
  • STRING. The STRING database contains a text mining score which is based on direct co-occurrences in literature.
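All three comparison methods boil down to counting how often two proteins are mentioned together. A toy Python version of such a direct co-occurrence count could look like this. The abstracts and the protein names `alpha1`, `beta2` and `gamma3` are invented, and the crude word splitting is my own simplification, not the procedure used in the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, protein_names):
    """Count, per protein pair, the number of abstracts mentioning both."""
    names = {n.lower() for n in protein_names}
    counts = Counter()
    for text in abstracts:
        # Very crude tokenization: lowercase, drop commas and periods, split.
        words = set(text.lower().replace(",", " ").replace(".", " ").split())
        present = sorted(names & words)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented toy abstracts with invented protein names.
abstracts = [
    "alpha1 binds beta2 in muscle cells.",
    "beta2 and alpha1 are found in one complex.",
    "gamma3 is expressed in liver.",
]

counts = cooccurrence_counts(abstracts, ["alpha1", "beta2", "gamma3"])
print(counts[("alpha1", "beta2")])   # pair mentioned together in two abstracts
print(counts[("alpha1", "gamma3")])  # pair never mentioned together
```

The weakness is easy to see: a pair that never shares an abstract always scores zero, however much the two proteins have in common.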

The results show that, using concept profiles, 43% of the known PPIs were detected at a specificity of 99%, and 66% at a specificity of 95%. In contrast, the direct relation methods and STRING show much lower scores:

Measure | Word-based | CDR | Concept profiles | STRING
Sensitivity at spec = 99% | 28% | 37% | 43% | 39%
Sensitivity at spec = 95% | 33% | 41% | 66% | 41%
Area under curve | 0.62 | 0.69 | 0.90 | 0.69
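In case “sensitivity at a given specificity” is not daily fare for you either: you set the score threshold so that the required share of the known non-interacting pairs stays at or below it, and then count which share of the known interacting pairs scores above it. A small Python sketch of that calculation, with made-up scores and a function name of my own choosing:

```python
import math


def sensitivity_at_specificity(pos_scores, neg_scores, specificity):
    """Share of positives scoring above a threshold that keeps at least
    `specificity` of the negatives at or below that threshold."""
    ranked = sorted(neg_scores)
    # Number of negatives that must end up at or below the threshold
    # (the small epsilon guards against floating-point rounding).
    k = math.ceil(specificity * len(ranked) - 1e-9)
    threshold = ranked[k - 1] if k > 0 else float("-inf")
    detected = sum(1 for s in pos_scores if s > threshold)
    return detected / len(pos_scores)


# Made-up similarity scores for known interacting and non-interacting pairs.
interacting = [0.2, 0.35, 0.5, 0.9]
non_interacting = [0.1, 0.2, 0.3, 0.4]

print(sensitivity_at_specificity(interacting, non_interacting, 0.75))
print(sensitivity_at_specificity(interacting, non_interacting, 1.0))
```

Demanding a higher specificity pushes the threshold up, so fewer true interactions are detected, which is why every method in the table scores lower at 99% than at 95%.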

These findings suggested that proteins with high similarity scores that are not known to interact may be related in another way, e.g. they could be involved in the same pathway or be part of the same protein complex without physically interacting. Indeed, concept-based profiling was superior in predicting relationships between proteins potentially present in the same complex or pathway (thus A-C inferred from the co-occurring protein pairs A-B and B-C).
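My understanding of why concept profiles can find such indirect relations: each protein gets a profile of the concepts it is mentioned with, and two proteins with similar profiles get a high score even if they never appear in the same abstract. The Python sketch below illustrates that idea with cosine similarity, which is my assumption and not necessarily the measure used in the paper; the proteins, concepts and weights are invented as well.

```python
import math


def cosine(p, q):
    """Cosine similarity between two concept profiles (concept -> weight)."""
    shared = set(p) & set(q)
    dot = sum(p[c] * q[c] for c in shared)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


# Invented profiles: how strongly each protein is associated with a concept.
profiles = {
    "protein_A": {"Z-disc": 3, "dysferlin": 2, "alpha-actinin": 1},
    "protein_C": {"Z-disc": 2, "dysferlin": 2, "alpha-actinin": 2},
    "protein_X": {"liver": 4, "bile": 1},
}

# A and C share their whole context, A and X share nothing.
print(round(cosine(profiles["protein_A"], profiles["protein_C"]), 2))
print(cosine(profiles["protein_A"], profiles["protein_X"]))
```

In this toy example protein_A and protein_C come out as closely related purely on the basis of shared context, which is the kind of A-C link a direct co-occurrence count would miss.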

Since there is often a substantial time lag between the first publication of a finding, and the time the PPI is entered in a database, a retrospective study was performed to examine how many of the PPIs that would have been predicted by the different methods in 2005 were confirmed in 2007. Indeed, using concept profiles, PPIs could be efficiently predicted before they enter PPI databases and before their interaction was explicitly described in the literature.

The practical value of the method for discovery of novel PPIs is illustrated by the experimental confirmation of the inferred physical interaction between CAPN3 and PARVB, which was based on frequent co-occurrence of both proteins with concepts like Z-disc, dysferlin, and alpha-actinin. The relationships between proteins predicted are broader than PPIs, and include proteins in the same complex or pathway. Dependent on the type of relationships deemed useful, the precision of the method can be as high as 90%.

In line with their open access policy, they have made the full set of predicted interactions available in a downloadable matrix and through the webtool Nermal, which lists the most likely interaction partners for a given protein.

According to Mons, this framework will be a very rich source for new discoveries, as it will enable scientists to prioritize potential interaction partners for further testing.

Barend Mons started with the statement that nano-publications will replace the classic articles (and the need for libraries). However, things are never as black as they seem. Mons showed that a nano-publication is basically a “peer-reviewed, openly available” triplet. Triplets can be effectively retrieved and inferred from available databases/papers using a concept-based approach. Nevertheless, effectiveness needs to be enhanced by semantically coding triplets when published.

What will this mean for clinical medicine? Bioinformatics is quite another discipline, with better structured and more straightforward data (interaction, identity, place). Interestingly, Mons and van Haagen plan to do further studies, in which they will evaluate whether concept profiles can also be used to predict other types of relations, for instance between drugs or genes and diseases. The future will tell whether the above-mentioned approach is also useful in clinical medicine.

Implementation of the following (implicit) recommendations would be advisable, independent of the possible success of nano-publications:

  • Less emphasis on “publish or perish” (thus more on the data themselves, whether positive, negative, trendy or not)
  • Better structured data, partly by structuring articles. This has already improved over the years by introducing structured abstracts, availability of extra material (appendices, data) online and by guidelines, such as STARD (The Standards for Reporting of Diagnostic Accuracy)
  • Open Access
  • Availability of full text
  • Availability of raw data

One might argue that disclosing data is unlikely when pharma is involved. It is therefore encouraging that a group of major pharmaceutical companies have announced that they will share pooled data from failed clinical trials in an attempt to figure out what is going wrong in the studies and what can be done to improve drug development [10].

Unfortunately I don’t have Mons’ presentation at my disposal. Therefore, here are two other presentations about triplets, concepts and the semantic web.



  1. van Haagen HH, ‘t Hoen PA, Botelho Bovo A, de Morrée A, van Mulligen EM, Chichester C, Kors JA, den Dunnen JT, van Ommen GJ, van der Maarel SM, Kern VM, Mons B, & Schuemie MJ (2009). Novel protein-protein interactions inferred from literature context. PLoS ONE, 4 (11) PMID: 19924298
  2. Twitteren voor de wetenschap (Twittering for Science), Maartje Bakker, de Volkskrant (2010-06-05)
  3. Barend Mons and Jan Velterop (?) Nano-Publication in the e-science era (Concept Web Alliance, Netherlands BioInformatics Centre, Leiden University Medical Center.) http://www.nbic.nl/uploads/media/Nano-Publication_BarendMons-JanVelterop.pdf, accessed June 20th, 2010.
  4. Mons, B., Ashburner, M., Chichester, C., van Mulligen, E., Weeber, M., den Dunnen, J., van Ommen, G., Musen, M., Cockerill, M., Hermjakob, H., Mons, A., Packer, A., Pacheco, R., Lewis, S., Berkeley, A., Melton, W., Barris, N., Wales, J., Meijssen, G., Moeller, E., Roes, P., Borner, K., & Bairoch, A. (2008). Calling on a million minds for community annotation in WikiProteins Genome Biology, 9 (5) DOI: 10.1186/gb-2008-9-5-r89
  5. Science Daily (2008/05/08) Large-Scale Community Protein Annotation — WikiProteins
  6. Boing Boing: (2008/05/28) WikiProteins: a collaborative space for biologists to annotate proteins
  7. (ppt1) SWAT4LS 2009: Semantic Web Applications and Tools for Life Sciences, Amsterdam, Science Park, Friday, 20th of November 2009. http://www.swat4ls.org/
  8. (ppt2) Michel Dumontier: triples for the people scientists liberating biological knowledge with the semantic web
  9. (ppt3, only slide shown): Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus – by Duncan Hill (EMBL-EBI)
  10. WSJ (2010/06/11) Drug Makers Will Share Data From Failed Alzheimer’s Trials

Week 8: In a Different Orbit

20 03 2008

Спутник, Спутник! This is George and Laika speaking to you from virtual space. A quiet revolt has taken place within the Spoetnik camp. Under the pretext of a toilet break, ground control, which had lost its grip on some participants, decided in an emergency meeting to exclude the most unruly astronauts from further participation. Banished, Спутник, we have been banished! While we thought we were happily on a voyage of discovery with the other Spoetnikkers, our Renegade Chinese commander ordered us to leave the spaceship at once in a space capsule. According to chairman Nol of the Presidium of the Supreme Soviet, we had violated the Decree of the People’s Commissariat by copying other people’s pieces into our blogs without permission. Self-assured, I, Laika, barked that you, Спутник, had surely profited enormously from this yourselves, since all the pings and trackbacks had brought you more visitors too. I, Brughagedis George, growled that thanks to my computer knowledge Spoetnik had, after all, reached greater heights. Hadn’t we both shown that we were doing a great deal of social tagging? Why the Presidium found it necessary to take a stand now, when it otherwise always follows the course of the program rather phlegmatically, was not clear. It may be that the machinations of Raspoetin Wout and his accomplice Perestroijka Bert, who noisily trumpeted our deeds around, had put this idea into their heads. Perhaps not entirely disinterestedly, because eliminating 2 competitors in the hunt for the silver trophy was a nice bonus.

That same day we had to leave the spaceship and were banished to the black hole Sagittarius A in the center of our Milky Way. The ground station in Baikonur let us know that the Soyuz spacecraft that would transport us there had already been launched and was waiting in low Earth orbit for docking. Both literally with our tails between our legs, we got ready for departure. We tried to make the best of it by remarking that we would then have plenty of time to keep blogging. But that was not appreciated by adjutant Pascal, who turned white as chalk. Now we had only each other to rely on, and where Brughagedis had first been a louse in the fur of the dog Laika, Laika now became a faithful companion of Brughagedis. Together they would make the best of it and show the rest a thing or two.
We were shot away with the waste machine. Although one of us had the necessary know-how about computers, this was more than we could handle. Would we now be hopelessly lost? Fortunately we still had contact with Mother Earth, and in one of the blogs we saw the saying: “wie schrijft, die blijft” (“who writes, remains”). So there was still hope. We agreed to continue our blogs and to bring in help via this route. As the means to do so we chose Google Word, since we were in the same cabin anyway and this kept us in step with the Spoetnik assignments. Now, it already seems difficult to us to edit one Google Word together from 2 computers: who knows when someone else is editing what. But sitting side by side and constantly wanting to hit the keyboard at the same time is not ideal either. In any case it worked, and we will show you the result shortly.

Meanwhile our distress signal had been picked up by WoW!ter, who had just missed out on the top prize at the Dutch Bloggies and was looking for support for his attempt next year. He was recruiting new bloggers and took a special interest in our fate. Especially for us, poor beginning bloggers, he drew up 17 tips. These tips can help us create a network that will make it possible for us to return. We had already accomplished some of the points. Pingbacks and trackbacks, for example :). LibraryThing too, although Laika did not see all of its possibilities, which greatly saddened WoW!ter. The next assignment was Technorati. We both signed up and immediately declared each other fans. We also did del.icio.us, became members of co-comment, et cetera. Networking through virtual space, we gained fans everywhere and help reached us from the most unlikely places.

But… wait a moment, wait a moment, what is that over there??? Magnificent, Спутник, magnificent!! We wish you could experience this with us. While you are busy exploring web2.0, we have, believe it or not, ended up in web3.0.
Hopefully our great example, WoW!ter, can pilot us through this from the ground station in Wageningen. If all goes well, Спутник, we will tell you about our adventures. On the advice of our figurehead WoW!ter we have backlinked ourselves so that you can keep following our adventures!!!

Comradely greetings, George and Laika
