When more is less: Truncation, Stemming and Pluralization in the Cochrane Library

5 01 2010

I’m on two mailing lists of the Cochrane Collaboration: one is the TSC list (TSC = Trials Search Coordinator) and the other is the IRMG list. IRMG stands for Information Retrieval Methods Group (of the Cochrane Collaboration). Sometimes difficult search problems are posted on the list, and it is challenging to try to find the solutions. I can’t remember a case where a solution was not found.

A while ago a member of the list was puzzled by the following retrieval result from the Cochrane Library:

ID   Search                   Hits
#1   (breast near tumour*)    254
#2   (breast near tumour)     640
#3   (breast near tumor*)     428
#4   (breast near tumor)      640

where near = adjacent (thus breast should be just before tumour) and the asterisk * is the truncation symbol. Placed at the end of a word, the asterisk retrieves all terms that begin with that word root. Thus tumour* should find both tumour and tumours, and thus broaden the search.

The results are odd, because #2 (without truncation) gives more hits than #1 (with truncation), and the same is true for #4 versus #3. One would expect truncation to give more results. What could be the reason behind it?
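The expectation can be written down as a small sanity check. The Python sketch below is only an illustration: the hit counts are copied from the search history above (parentheses dropped), and the function name truncation_anomalies is made up for this post. It flags every truncated query that returned fewer records than the same query without the asterisk.

```python
# Hit counts copied from the search history above (parentheses dropped).
hits = {
    "breast near tumour*": 254,
    "breast near tumour": 640,
    "breast near tumor*": 428,
    "breast near tumor": 640,
}


def truncation_anomalies(hits):
    """Return truncated queries that found fewer records than their untruncated form."""
    anomalies = []
    for query, count in hits.items():
        if query.endswith("*"):
            base = query[:-1]  # the same query without the truncation symbol
            if base in hits and count < hits[base]:
                anomalies.append((query, count, hits[base]))
    return anomalies


for query, truncated, untruncated in truncation_anomalies(hits):
    print(f"{query}: {truncated} hits, but {untruncated} without truncation")
```

Both truncated searches are flagged, which is exactly the puzzle.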

I suspected the problem had to do with the truncation. I searched for breast and tumour with or without truncation (#1 to #4) and only tumour* gave odd results: tumour* gave far fewer results than tumour. (To rule out an effect of the fields being searched, I searched only the fields ti (title), ab (abstract) and kw (keywords).)

Records found with tumour, but not with tumour*, contained the word tumor (not shown). Thus tumour automatically searches for tumor (and vice versa). This process is called stemming.

According to the Help-function of the Cochrane Library:

Stemming: The stemming feature within the search allows words with small spelling variants to be matched. The term tumor will also match tumour.

In addition, as I realized later, the Cochrane Library has pluralization and singularization features.

Pluralization and singularization matches: Pluralized forms of words also match singular versions, and vice versa. The term drugs will find both drug and drugs. To match either just the singular or plural form of a term, use an exact match search and include the word in quotation marks.
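To make the interplay concrete, here is a toy model in Python. It is not how the Cochrane Library is implemented: the spelling table, the naive singularization rule and the four sample records are all assumptions made for illustration. The one thing it takes from the results above is the behaviour: a term without an asterisk is normalized (spelling and number), while a truncated term is matched as a literal prefix.

```python
import re

# Hypothetical spelling-variant table; the real rules are not known to me.
SPELLING_VARIANTS = {"tumour": "tumor"}


def normalize(word):
    """Map a word to a canonical form: lowercase, singular, one spelling."""
    w = word.lower()
    if w.endswith("s") and len(w) > 3:
        w = w[:-1]  # naive singularization, good enough for this toy example
    return SPELLING_VARIANTS.get(w, w)


def matches(term, word):
    """Return True if a search term matches a single word of a record."""
    if term.endswith("*"):
        # Truncation: literal prefix match, no stemming or pluralization.
        return word.lower().startswith(term[:-1].lower())
    # No truncation: compare the normalized forms.
    return normalize(term) == normalize(word)


def count_hits(term, records):
    """Count the records containing at least one word that matches the term."""
    return sum(
        any(matches(term, w) for w in re.findall(r"[a-z]+", record.lower()))
        for record in records
    )


# Made-up records, one per spelling/number combination.
records = [
    "breast tumour growth",
    "breast tumours in mice",
    "breast tumor markers",
    "breast tumors and tumoral tissue",
]

for term in ("tumour", "tumour*", "tumor", "tumor*"):
    print(term, count_hits(term, records))
```

In this model tumour and tumor each find all four records, whereas tumour* and tumor* each find only the two records with their own spelling: the same "more is less" pattern as in the search history.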

Indeed (tumor* OR tumour*) (or, in short, tumo*r*) retrieves a little more than tumor OR tumour: words like tumoral, tumorous, tumorectomy. Not particularly useful, although it might not be disadvantageous when used adjacent to breast, as this will filter out most of the noise.

[Screenshot: tumor spelling variants searched in the title (ti) only. It doesn't matter how you spell tumor (#8, #9, #10, #11), as long as you don't truncate (while using a single variant).]

Thus stemming, pluralization and singularization only work without truncation. If you do truncate, you should add the spelling variants yourself, because stemming and pluralization no longer take place. Truncating is useful if you’re interested in other word variants that are not automatically accounted for.
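Adding the variants by hand is easy to get wrong when there are many of them, so a small helper can assemble the query string. The Python sketch below is just that, a sketch: the function name expand_truncated is made up, and nesting an OR group inside near is my assumption about the syntax, so check the result in the Cochrane Library before relying on it.

```python
def expand_truncated(variants, proximity_term=None):
    """Build an OR-ed query in which every spelling variant is truncated.

    variants: spelling variants to truncate, e.g. ["tumor", "tumour"].
    proximity_term: optional word to put next to the group using `near`
    (assumed syntax, modelled on the searches shown earlier in this post).
    """
    ored = " OR ".join(f"{variant}*" for variant in variants)
    query = f"({ored})"
    if proximity_term:
        query = f"({proximity_term} near {query})"
    return query


print(expand_truncated(["tumor", "tumour"]))
print(expand_truncated(["tumor", "tumour"], proximity_term="breast"))
```

The first call gives (tumor* OR tumour*), the query used above; the second wraps it in a near search with breast.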

Put another way: knowing that stemming and pluralization take place, you can simply search for the singular or plural form, in American or English spelling. So breast near tumor (or simply breast tumor) would have been o.k. This is the reason why these features were introduced in the first place. 😉

By the way, truncation and stemming (but not pluralization) are also features in PubMed, where they can give similar problems as well as other ones. But this will be dealt with in another blogpost.
