When more is less: Truncation, Stemming and Pluralization in the Cochrane Library

5 01 2010

I’m on two mailing lists of the Cochrane Collaboration: the TSC list (TSC = Trials Search Coordinator) and the IRMG list (IRMG = Information Retrieval Methods Group). Sometimes difficult search problems are posted to these lists, and it is a challenge to try to solve them. I can’t remember a problem for which no solution was found.

A while ago a member of the list was puzzled why he got the following retrieval result from the Cochrane Library:

ID   Search                    Hits
#1   (breast near tumour*)     254
#2   (breast near tumour)      640
#3   (breast near tumor*)      428
#4   (breast near tumor)       640

where near = adjacent (thus breast should be just before tumour) and the asterisk * is the truncation symbol. An asterisk at the end of a word retrieves all terms that begin with that word root. Thus tumour* should find tumour, tumours and any other word starting with tumour, and should therefore broaden the search.

The results are odd, because #2 (without truncation) gives more hits than #1 (with truncation), and the same is true for #4 versus #3. One would expect truncation to give more results. What could be the reason behind it?

I suspected the problem had to do with the truncation. I searched for breast and tumour with and without truncation (#1 to #4), and only tumour* gave odd results: tumour* gave far fewer results than tumour. (To rule out an effect of the fields being searched, I searched only the fields ti (title), ab (abstract) and kw (keywords).)

Records found with tumour but not with tumour* contained the word tumor (not shown). Thus a search for tumour automatically searches for tumor as well (and vice versa). This process is called stemming.

According to the Help-function of the Cochrane Library:

Stemming: The stemming feature within the search allows words with small spelling variants to be matched. The term tumor will also match tumour.

In addition, as I realized later, the Cochrane Library has pluralization and singularization features:

Pluralization and singularization matches: Pluralized forms of words also match singular versions, and vice versa. The term drugs will find both drug and drugs. To match either just the singular or plural form of a term, use an exact match search and include the word in quotation marks.

Indeed, (tumor* OR tumour*) (or, in short, tumo*r*) retrieves a little more than tumor OR tumour: it also finds words like tumoral, tumorous and tumorectomy. That is not particularly useful, although it might not be a disadvantage when used adjacent to breast, as this will filter out most of the noise.
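To see which words a pattern like tumo*r* picks up, here is a small Python sketch. It assumes that * behaves like a shell-style wildcard (zero or more characters), which may not be exactly how the Cochrane Library treats it, and the word list is made up for illustration.

```python
from fnmatch import fnmatchcase

# Assumption: "*" matches zero or more characters, as in shell-style globbing.
# The Cochrane Library's own wildcard rules may differ.
PATTERN = "tumo*r*"

# Made-up word list, not taken from real records.
words = ["tumor", "tumour", "tumors", "tumours",
         "tumoral", "tumorous", "tumorectomy", "tumble"]

# Keep every word that fits the wildcard pattern.
matched = [w for w in words if fnmatchcase(w, PATTERN)]
print(matched)
```

Under this assumption the pattern catches both spellings, their plurals and the extra derived words, and leaves out unrelated words such as tumble.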

Tumor spelling variants searched in the title (ti) only: it doesn't matter how you spell tumor (#8, #9, #10, #11), as long as you don't truncate (while using a single variant).

Thus stemming, pluralization and singularization only work without truncation. When you truncate, you have to add the spelling variants yourself for any word that stemming or pluralization would otherwise have covered. Truncation is still useful if you’re interested in other word variants that are not automatically accounted for.
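The behaviour described here can be mimicked with a toy model in Python. This is only a sketch of the rule as I observed it, not the Cochrane Library's actual implementation: the variant table, the function names and the word list are all assumptions made for illustration.

```python
# Toy model only: the variant table and the matching rules are assumptions
# made for illustration, not the Cochrane Library's real implementation.

# Hypothetical table of spelling variants that "stemming" treats as equal.
SPELLING_VARIANTS = {
    "tumor": {"tumor", "tumour"},
    "tumour": {"tumor", "tumour"},
}


def expand(term):
    """Return a predicate telling whether a word matches the search term."""
    if term.endswith("*"):
        # Truncated term: plain prefix match, no stemming or pluralization.
        prefix = term[:-1]
        return lambda word: word.startswith(prefix)
    # Untruncated term: spelling variants plus singular and plural forms.
    singular = term[:-1] if term.endswith("s") else term
    variants = SPELLING_VARIANTS.get(singular, {singular})
    accepted = variants | {v + "s" for v in variants}
    return lambda word: word in accepted


def search(term, words):
    """Return the words that match the term, in their original order."""
    matches = expand(term)
    return [w for w in words if matches(w)]


# Made-up word list standing in for the words found in records.
words = ["tumor", "tumors", "tumour", "tumours", "tumoral", "tumorectomy"]

print(search("tumour", words))   # both spellings, singular and plural
print(search("tumour*", words))  # only words starting with "tumour"
print(search("tumor*", words))   # only words starting with "tumor"
```

In this model the untruncated tumour matches four of the words, while tumour* matches only two, which is the same "more is less" pattern as in the search results above.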

Put another way: knowing that stemming and pluralization take place, you can simply search for either the singular or the plural form, in either American or English spelling. So breast near tumor (or simply breast tumor) would have been o.k. This is the reason why these features were introduced in the first place. 😉

By the way, truncation and stemming (but not pluralization) are also features of PubMed. This can give similar problems, and others as well, but that will be dealt with in another blog post.
