Tuesday, November 20, 2007

Access 2007 Wrap Up Part Three - The Importance of Being Relevant

Peter Binkley began his presentation, Searching the OPAC - The State of Play, with a simple and elegant comparison of a number of online library catalogues. He performed the same search in every catalogue he examined: a keyword search for the word canada.

It's an ingenious example. You see, if a user types a one-word search into a catalogue for, say, dogs and gets results in which all the books entitled 'dogs' come up first, you would be hard pressed to say that those results weren't relevant.

But if you search for the word canada in, say, most university library catalogues, you don't get books entitled Canada coming up first. Instead, the first result you get will likely be a document with a long name like Emergency food service : planning for disasters (U of A), Post-war Canadian housing and residential mortgage markets and the role of government (University of Windsor), or Progress report by the United States. Environmental Protection Agency. Great Lakes National Program Office and Environment Canada (University of Toronto).

This is because these library catalogues use a scoring system in which matches between search words and item words are used to assign points to items in order to determine relevancy. So a book by a Canadian government agency (which will frequently have the word Canada in its name twice - once for the English version and once for the French) about something Canadian will tend to outscore books with the single-word title.
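Here is a minimal sketch of that kind of scoring. It is not the algorithm of any particular catalogue: the records, the field names and the `naive_score` and `title_boost_score` functions are all made up for illustration. The first function just counts word matches, so the bilingual agency record wins; the second shows one possible bit of tinkering, a bonus for an exact title match.

```python
import re

def naive_score(query, title, author=""):
    """Count how often the query word occurs across title and author."""
    words = re.findall(r"\w+", f"{title} {author}".lower())
    return words.count(query.lower())

def title_boost_score(query, title, author="", boost=10):
    """Same count, plus a bonus when the whole title is the query word."""
    bonus = boost if title.strip().lower() == query.lower() else 0
    return naive_score(query, title, author) + bonus

# Hypothetical records, not taken from any real catalogue.
records = [
    {"title": "Canada", "author": "Smith, Jane"},
    {"title": "Annual report",
     "author": "Environment Canada / Environnement Canada"},
]

def rank(score_fn, query):
    """Return titles ordered from highest to lowest score."""
    ordered = sorted(
        records,
        key=lambda r: score_fn(query, r["title"], r["author"]),
        reverse=True,
    )
    return [r["title"] for r in ordered]

print(rank(naive_score, "canada"))        # agency report first
print(rank(title_boost_score, "canada"))  # the book titled Canada first
```

The size of the boost is arbitrary here; picking a value that helps the canada search without hurting other searches is exactly the hard part.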

Discovery layers don't correct this problem: La sécurité humaine et les femmes autochtones au Canada is the first result from McMaster's library catalogue; NCSU's first response is Compendium of plant disease and decay fungi in Canada, 1960-1980.

Librarians have largely left the responsibility of the library catalogue's search algorithm to the commercial library vendors who (putting this delicately) don't dedicate as much development time to it as, say, Google. But with the advent of open source indexing software such as Lucene, some librarians are beginning to tinker with relevancy rules. Is it just coincidence that the library catalogues running the open-source Koha and the open-source Evergreen both pass the canada keyword test?

Another means of improving relevancy is to put the most popular items at the top of the list, like Amazon does. Bibliocommons and Koha use circulation counts in their relevancy rankings, while WorldCat Local uses the number of global holdings in its ranking.
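How these systems actually combine popularity with the text match isn't something I know, so treat the following as a guess at the general idea rather than anyone's real formula. The `blended_score` function, the `weight` parameter and the sample numbers are assumptions; the logarithm is there so that a heavily circulated item gets a nudge rather than swamping the text score.

```python
import math

def blended_score(text_score, circulation_count, weight=0.5):
    """Add a damped popularity bonus to a text-match score."""
    return text_score + weight * math.log1p(circulation_count)

# Hypothetical items: (title, text-match score, circulation count).
items = [
    ("Long government report", 2.0, 0),
    ("Popular introduction", 1.0, 500),
]

ranked = sorted(items, key=lambda i: blended_score(i[1], i[2]), reverse=True)
for title, text_score, circ in ranked:
    print(title, round(blended_score(text_score, circ), 2))
```

With these numbers the popular item overtakes the report even though its text score is lower. The same shape would work with global holdings in place of circulation counts.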

But relevancy is too difficult a problem to be solved simply by tinkering with the numeric weights in the search algorithm.

A fundamental shortcoming of the library catalog is that it doesn't (and as currently designed can't) know the why for any given search. A search for the subject "breast cancer" in PVLD's catalog results in over 40 distinct subject headings listed in alphabetical order from the simple "Breast – Cancer" through "Breast – Cancer – Religious Life" to "Breast – Cancer – United States". The catalog doesn't know that one person is searching for books on this subject because she has to write a term paper, and another because his wife has just been diagnosed with the disease – and therefore it gives no clue as to which of these subject headings is most relevant to each person. [PVLD Director via Everything is Miscellaneous]

One of the means by which Bibliocommons addresses this issue is by re-sorting items so that items rated highly by trusted sources appear at the top of the list. This is one way of informing the library catalogue of who you are, or at least who you are as defined by who else you find trustworthy. Richard Wallis of Talis suggested during his Access 2007 presentation that if the catalogue knew a bit about you, it could present better search results to you. Using the example of the keyword lotus, an engineering student would get items about the car of the same name while the botanist would get items about the genus Nelumbo.

Mark Leggott, using his metaphor of the search box as Lego brick, suggested in his Access 2007 presentation that it shouldn't be difficult to embed a library search box into a course-specific page, with the search already constrained to the sources relevant to both the subject of the course and the course level.

Addendum: As Shibboleth allows some user attributes to be passed to an online source as part of authentication, there may be potential to tailor the interface or search results based on the user's subject specialty or readership level [Scholr 2.0: 3.2.2 Shibboleth].
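To make the lotus example concrete, here is a small sketch of re-ranking by user profile. None of it comes from the presentations: the `rerank_for_user` function, the subject tags and the two titles are invented, and a real system would have to get the user's subjects from somewhere (a login attribute, a course page, and so on).

```python
def rerank_for_user(results, user_subjects):
    """Move items that share a subject with the user to the front.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(
        results,
        key=lambda r: not (set(r["subjects"]) & user_subjects),
    )

# Hypothetical results for the keyword "lotus".
results = [
    {"title": "Lotus Elan workshop manual",
     "subjects": ["engineering", "automobiles"]},
    {"title": "The sacred lotus: Nelumbo", "subjects": ["botany"]},
]

engineer_view = rerank_for_user(results, {"engineering"})
botanist_view = rerank_for_user(results, {"botany"})

print(engineer_view[0]["title"])  # the car manual
print(botanist_view[0]["title"])  # the Nelumbo book
```

Same query, same catalogue, different first result depending on who is asking.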

That's the last of my Access 2007 wrap-up blog posts. Sorry for the irrelevant ending.
