This is a brief overview of my paper “Information Retrieval Performance Measurement Using Extrapolated Pr...,” which I’ll be presenting on June 8th at the DESI VI workshop at ICAIL 2015. The paper provides a novel method for extrapolating a precision-recall point to a different level of recall, and advocates making performance comparisons by extrapolating results for all systems to the same level of recall if the systems cannot be evaluated at exactly the same recall.

Recall, R, is the proportion of the relevant documents retrieved by the information retrieval (IR) system, and precision, P, is the proportion of retrieved documents that are relevant. It is sometimes desirable to have high recall while also having high precision in order to find most of the relevant documents without having a lot of non-relevant documents mixed in, but higher recall is usually accompanied by lower precision.
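The two definitions can be sketched directly from retrieval counts (the function name and example numbers below are my own, purely for illustration):

```python
# Illustrative only: recall and precision from retrieval counts.
def recall_precision(retrieved_relevant, total_relevant, total_retrieved):
    recall = retrieved_relevant / total_relevant      # share of relevant docs found
    precision = retrieved_relevant / total_retrieved  # share of retrieved docs that are relevant
    return recall, precision

# e.g. 60 of 100 relevant documents found among 150 retrieved:
r, p = recall_precision(60, 100, 150)  # r = 0.6, p = 0.4
```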

Some IR systems generate a relevance score for each document, allowing the documents to be sorted so that the ones that are deemed most likely to be relevant appear at the top of the list. A precision-recall curve can be generated for such systems by computing the precision and recall for each possible point in the sorted list, viewing the documents above that point as being the retrieved documents. Unfortunately, we sometimes need to compare performance of IR systems without knowing the entire precision-recall curve. For example, some systems only provide a binary yes/no relevance prediction instead of a relevance score, so we have only a single precision-recall point for such systems.
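The curve-generation step above can be sketched as follows: sweep a cutoff down the ranked list and record a (recall, precision) point at each depth. The function name is an assumption; `ranked_labels` holds 1 for a relevant document and 0 for a non-relevant one, sorted by descending relevance score.

```python
# Sketch: one precision-recall point per possible cutoff in the ranked list.
def precision_recall_curve(ranked_labels):
    total_relevant = sum(ranked_labels)
    points, hits = [], 0
    for depth, label in enumerate(ranked_labels, start=1):
        hits += label  # relevant documents seen so far
        points.append((hits / total_relevant, hits / depth))  # (recall, precision)
    return points

curve = precision_recall_curve([1, 1, 0, 1, 0, 0])
# first point: recall 1/3 at precision 1.0; last point: recall 1.0 at precision 0.5
```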

Making judgments about the relative performance of two IR systems knowing only a single precision-recall point for each system is problematic—if one system has higher recall but lower precision for a particular task, is it the better system for that task?

There are various performance measures like the F_{1} score that combine precision and recall into a single number to allow performance comparisons:

F_{1} = 2*P*R/(P + R)

The value of the F_{1} score will be between R and P, and it tends to be closer to the smaller of the two, so it is impossible to get a large F_{1} if either the precision or the recall is small.
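The pull toward the smaller value is easy to see numerically (a minimal sketch; the function name is mine):

```python
# The harmonic-style combination from the formula above.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# F1 lies between P and R but sits closer to the smaller of the two:
f1(0.9, 0.3)  # 0.45 -- dragged well below the midpoint of 0.9 and 0.3
```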

Unfortunately, such measures often assume a trade-off between precision and recall that is not appropriate for the actual information need (I’ve written about problems with the F_{1} score before). To understand the problem, it is useful to look at how F_{1} varies as a function of the recall where it is measured. Here are two precision-recall curves, with the one on the left being for an easy categorization task and the one on the right being for a hard task, with the F_{1} score corresponding to each point on the precision-recall curve superimposed:

If we pick a single point from the precision-recall curve and compute F_{1} for that point, the result is very sensitive to which point we choose. In the graph on the right, F_{1} is maximized at 46% recall, so the trade-off between precision and recall that F_{1} deems reasonable implies that it is not worthwhile to retrieve more than 46% of the relevant documents for that task: precision suffers too much when you push to higher recall (i.e., going farther down the list of documents sorted by relevance score, you encounter too many non-relevant documents for each additional relevant one). That trade-off is simply not compatible with the information need in some situations.

For example, a litigant can request that its opposition or a third party be compelled by the court to produce documents that satisfy some relevance criteria so that evidence that may resolve the dispute can be discovered. When electronic documents are involved this process is called e-discovery. High recall (e.g., 75% or more) is required to satisfy the court. High precision is desirable to reduce the number of non-relevant documents that will undergo expensive human review before the documents are turned over to the requesting party. When determining the recall that is necessary the court will take cost into account, but it does so in light of the value of the case and the likelihood that e-discovery will turn up useful evidence—this is called "proportionality." The F_{1} score cannot possibly strike the right balance between recall and precision because it is completely oblivious to the value of the case. Other problems with the F_{1} score are detailed in the paper.

The strong dependence that F_{1} has on recall as we move along the precision-recall curve means that it is easy to draw wrong conclusions about which system is performing better when performance is measured at different levels of recall. This strong dependence on recall occurs because the contours of equal F_{1} are not shaped like precision-recall curves, so a precision-recall curve will cut across many contours. In order to have the freedom to measure performance at recall levels that are relevant for e-discovery without drawing wrong conclusions about which system is performing best, the paper proposes a performance measure that has constant-performance contours that are shaped like precision-recall curves, so the performance measure depends much less on the recall level where the measurement is made than F_{1} does. In other words, the proposed performance measure aims to be sensitive to how well the system is working while being insensitive to the specific point on the precision-recall curve where the measurement is made. This graph compares the constant-performance contours for F_{1} to the measure proposed in the paper:

Since the constant-performance contours are shaped like typical precision-recall curves, we can view this measure as being equivalent to extrapolating the precision-recall point to some other target recall level, like 75%, by simply finding an idealized precision-recall curve that passes through the point and moving along that curve to the target recall. This figure illustrates extrapolation of precision measurements for three different systems at different recall levels to 75% recall for comparison:
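The paper defines its own idealized curve family; purely to illustrate the mechanics of the extrapolation step, the sketch below uses a hypothetical one-parameter family P(R) = 1 / (1 + a·R/(1−R)), fits the parameter a to the observed point, and reads off the precision at the target recall. Both the curve family and the function name are my assumptions, not the paper's model.

```python
# Illustration of extrapolating a precision-recall point to a target recall.
def extrapolate_precision(p_obs, r_obs, r_target=0.75):
    # Hypothetical curve family (NOT the paper's actual model):
    #   P(R) = 1 / (1 + a * R / (1 - R))
    # Solve for the parameter `a` of the curve passing through (r_obs, p_obs)...
    a = (1 / p_obs - 1) * (1 - r_obs) / r_obs
    # ...then slide along that curve to the target recall.
    return 1 / (1 + a * r_target / (1 - r_target))

# A system measured at 50% recall and 80% precision, extrapolated to 75% recall:
extrapolate_precision(0.8, 0.5)  # about 0.571 -- lower precision at higher recall
```

By construction, a point already at the target recall extrapolates to itself, which is the sanity check one would want from any such scheme.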

Finally, here is what the performance measure looks like if we evaluate it for each point in the two precision-recall curves from the first figure:

The blue performance curves are much flatter than the red F_{1} curves from the first figure, so the value is much less sensitive to the recall level where it is measured. As an added bonus, the measure is an extrapolated estimate of the precision that the system would achieve at 75% recall, so it is inversely proportional to the cost of the human document review needed (excluding training and testing) to reach 75% recall.

For more details, including the mathematical model used for the idealized precision-recall curves and some examples of how well the extrapolation works, read the paper.
