Thursday 27 October 2011

Information Retrieval: a very quick introduction

Information Retrieval, or IR, is to do with information-seeking behaviour: searching for something relevant to the task at hand. In IR terms, relevance is defined as fulfilling the user's information needs (this is not so in philosophy, which I hope to blog about soon if I get time!). The difference between IR and querying a database is that IR returns results ranked by the estimated probability that each one matches your search, whereas a database will return an exact match (or nothing). Lancaster (1968) says that an information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request. (Taken from article on Information Retrieval here). It has also been defined as 'finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)' (see more here).


There are lots of different ways of defining IR, but there are three formal definitions, each based on a different perspective on the process. First, there is the user view, which centres on the user's 'anomalous state of knowledge' or ASK. This means they have knowledge gaps and are seeking information. Second is the systems view, which covers the hardware and software for IR. Finally, there is the sources view, which is the presentation of the information (usually not your information) using technology.


Searching for information happens in lots of different ways. Searching for a website, is not the same as searching for a new flat, is not the same as searching for some information about goldfish. Or so says Broder, who argues that these queries are split up into Navigational queries, Transactional queries and Informational queries respectively.


We experimented with this in the lab, trying out different types of queries using Google and Bing. We also tried searches using Boolean logic: operators like AND, OR and NOT can make a huge difference to your search. The same is true of using quotation marks to search for an exact phrase. It is suggested that many users will try to formulate their queries in natural language instead, although personally I wouldn't state that categorically without doing a bit more research!
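
To see why the operators matter, here is a minimal sketch of Boolean retrieval over a made-up four-document collection. The documents, the `term` and `phrase` helpers and the tiny index are all my own illustrative assumptions; real search engines like Google and Bing do something far more sophisticated. AND, OR and NOT are modelled as set intersection, union and difference, and the quoted phrase as a simple substring match.

```python
# Toy collection: document id -> text (invented for illustration).
DOCS = {
    1: "goldfish care and feeding",
    2: "flats to rent in london",
    3: "goldfish tank size guide",
    4: "feeding tropical fish",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word):
    """Documents containing a single word (empty set if none)."""
    return INDEX.get(word.lower(), set())


def phrase(text):
    """Documents containing the exact phrase (naive substring match)."""
    return {doc_id for doc_id, doc in DOCS.items() if text.lower() in doc.lower()}


# Boolean operators expressed as set operations.
goldfish_and_feeding = term("goldfish") & term("feeding")   # AND
goldfish_or_feeding = term("goldfish") | term("feeding")    # OR
goldfish_not_feeding = term("goldfish") - term("feeding")   # NOT
exact_phrase = phrase("goldfish tank")                       # "quoted phrase"

print(goldfish_and_feeding, goldfish_or_feeding, goldfish_not_feeding, exact_phrase)
```

Notice how AND narrows the result to a single document while OR widens it to three: the same two words, very different searches.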


Sometimes your first search doesn't give you results that you deem relevant. This is when it becomes necessary to modify your query. This can mean adding extra words, removing or changing operators, or trying synonyms to see if you get different results. You can evaluate the effectiveness of a search qualitatively or quantitatively. Qualitative analysis is from a user perspective: does it satisfy their information need? Establishing this is quite time-consuming, as you need to run user surveys and questionnaires to find out. Quantitative experiments test the speed of retrieval (efficiency) and how many relevant documents were retrieved (effectiveness).


There are two ways to measure effectiveness: precision and recall. Precision is the proportion of retrieved documents which are relevant. For example, if a search returns 5 results and 3 are relevant, you have 60% precision. The formula is relevant documents retrieved / total documents retrieved. For recall, the formula is relevant documents retrieved / total number of relevant documents in the database. This is more difficult to calculate, because you need to know how many relevant documents exist in the whole collection, but it is useful to know as it tells you how good the system is at detecting relevant documents in the index.
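
The two formulas are easy to try out for yourself. This is just a sketch: the function names and the document ids are invented, and I have assumed that the database holds 6 relevant documents in total, which in real life is exactly the number that is hard to know.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / total documents retrieved."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # nothing retrieved, so define precision as 0
    return len(retrieved & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """Relevant documents retrieved / total relevant documents in the database."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # nothing relevant exists, so define recall as 0
    return len(set(retrieved) & relevant) / len(relevant)


# A search returns 5 results, 3 of which are relevant (d1, d2, d3).
retrieved = ["d1", "d2", "d3", "d4", "d5"]
# Assumed for illustration: 6 relevant documents exist in the database.
relevant = ["d1", "d2", "d3", "d6", "d7", "d8"]

p = precision(retrieved, relevant)  # 3 / 5
r = recall(retrieved, relevant)     # 3 / 6
print(f"precision = {p:.0%}, recall = {r:.0%}")
```

So the search with 3 relevant results out of 5 has 60% precision, but it only found half of what was out there.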


The Information Retrieval Nirvana would be 100% recall with 100% precision. Unfortunately, this tends not to happen! Instead, there is an inverse relationship between the two. If a user searches for something very specific, they are likely to get a good level of precision but they might not get such good recall because they are not capturing everything they want. Likewise, if a user casts their net very widely they will retrieve a lot of relevant information, but they might also get a lot that they are not interested in, so precision is low.
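
One hedged way to picture the trade-off is to take a single ranked list of results and look at more and more of it. The relevance judgements below are made up, and I have assumed there are 5 relevant documents in the collection, one of which the search never finds. Looking at only the top few results is like the very specific search; looking at all ten is like casting the net widely.

```python
# Made-up relevance judgements for a ranked list of 10 results
# (True = relevant), best-ranked first.
ranked_is_relevant = [True, True, False, True, False,
                      False, True, False, False, False]

# Assumed total number of relevant documents in the collection.
# One of them never appears in the ranked list at all.
TOTAL_RELEVANT = 5


def precision_recall_at(k):
    """Precision and recall when only the top k results are looked at."""
    hits = sum(ranked_is_relevant[:k])
    return hits / k, hits / TOTAL_RELEVANT


narrow = precision_recall_at(2)    # only the top 2 results
wide = precision_recall_at(10)     # all 10 results

for k in (2, 5, 10):
    p, r = precision_recall_at(k)
    print(f"top {k:>2}: precision = {p:.0%}, recall = {r:.0%}")
```

With these numbers the top 2 results give perfect precision but only 40% recall, while taking all 10 doubles the recall to 80% and drags precision down to 40%.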


This has been a bit of a crash course in Information Retrieval (I mean, I didn't even mention indexing), but hopefully readers are inspired to go and find out more. If you have a knowledge gap, this would be a great opportunity to practise IR as a user. Do a Google search (or Bing, or Yahoo, or whatever) and play around with Boolean operators. I should be covering IR in a lot more detail next semester, so keep an eye out for another, more detailed, blog post then!
