This is a brief overview of my paper “Information Retrieval Performance Measurement Using Extrapolated Pr...,” which I’ll be presenting on June 8th at the DESI VI workshop at ICAIL 2015. The paper provides a novel method for extrapolating a precision-recall point to a different level of recall, and advocates making performance comparisons by extrapolating results for all systems to the same level of recall if the systems cannot be evaluated at exactly the same recall.

Recall, R, is the proportion of the relevant documents retrieved by the information retrieval (IR) system, and precision, P, is the proportion of retrieved documents that are relevant. It is sometimes desirable to have high recall while also having high precision in order to find most of the relevant documents without having a lot of non-relevant documents mixed in, but higher recall is usually accompanied by lower precision.

Some IR systems generate a relevance score for each document, allowing the documents to be sorted so that the ones deemed most likely to be relevant appear at the top of the list. A precision-recall curve can be generated for such systems by computing the precision and recall at each possible cutoff in the sorted list, treating the documents above that cutoff as the retrieved set. Unfortunately, we sometimes need to compare the performance of IR systems without knowing the entire precision-recall curve. For example, some systems only provide a binary yes/no relevance prediction instead of a relevance score, so we have only a single precision-recall point for such systems.
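The cutoff-by-cutoff construction above can be sketched in a few lines of Python. This is an illustrative helper (the function name and toy ranking are my own, not from the paper): given a list of documents already sorted by descending relevance score, it computes a (recall, precision) point at every cutoff.

```python
from typing import List, Tuple

def precision_recall_curve(relevance: List[bool]) -> List[Tuple[float, float]]:
    """Given documents sorted by descending relevance score (True = relevant),
    return the (recall, precision) point for each cutoff in the ranking."""
    total_relevant = sum(relevance)
    points = []
    retrieved_relevant = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            retrieved_relevant += 1
        recall = retrieved_relevant / total_relevant      # relevant docs found so far
        precision = retrieved_relevant / k                # fraction of retrieved that are relevant
        points.append((recall, precision))
    return points

# Toy ranking: relevant docs cluster near the top, as a good scorer would produce.
curve = precision_recall_curve([True, True, False, True, False, False])
print(curve[0])   # after 1 doc: recall 1/3, precision 1.0
print(curve[-1])  # after all docs: recall 1.0, precision 0.5
```

A system that emits only a yes/no prediction gives you just one of these points, which is what motivates the extrapolation method discussed below.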

Making judgments about the relative performance of two IR systems knowing only a single precision-recall point for each system is problematic—if one system has higher recall but lower precision for a particular task, is it the better system for that task?

There are various performance measures like the F_{1} score that combine precision and recall into a single number to allow performance comparisons:

F_{1} = 2*P*R/(P + R)

The value of the F_{1} score will be between R and P, and it tends to be closer to the smaller of the two, so it is impossible to get a large F_{1} if either the precision or the recall is small.
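That pull toward the smaller value is just the behavior of the harmonic mean, and is easy to verify numerically (the values below are illustrative, not from the paper):

```python
def f1(p: float, r: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# High precision cannot rescue low recall: F1 lands much closer to 0.3 than to 0.9.
score = f1(0.9, 0.3)
print(score)  # 0.45
```

Here the arithmetic mean of 0.9 and 0.3 would be 0.6, but F_{1} is only 0.45, closer to the weaker of the two numbers.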

Unfortunately, such measures often assume a trade-off between precision and recall that is not appropriate for the actual information need (I’ve written about problems with the F_{1} score before). To understand the problem, it is useful to look at how F_{1} varies as a function of the recall where it is measured. Here are two precision-recall curves, with the one on the left being for an easy categorization task and the one on the right being for a hard task, with the F_{1} score corresponding to each point on the precision-recall curve superimposed:

If we pick a single point from the precision-recall curve and compute the value of F_{1} for that point, the resulting F_{1} is very sensitive to the precision-recall point we choose. F_{1} is maximized at 46% recall in the graph on the right, which means that the trade-off between precision and recall that F_{1} deems to be reasonable implies that it is not worthwhile to retrieve more than 46% of the relevant documents for that task because precision suffers too much when you push to higher recall (i.e., when you go farther down the list of documents sorted by relevance score to find additional relevant documents you encounter too many non-relevant ones). That is simply not compatible with the information need in some situations.
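The way F_{1} peaks partway along a precision-recall curve can be reproduced with a synthetic curve. The curve family below is a made-up stand-in for a "hard task" (it is not the curve from the paper's figure), but it shows the same qualitative effect: F_{1} is maximized well short of full recall, implicitly declaring higher recall "not worthwhile."

```python
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

# Hypothetical hard-task curve: precision decays as recall rises (illustrative only).
curve = [(r / 100, max(0.05, 0.9 * (1 - (r / 100) ** 2))) for r in range(1, 101)]

# Find the point on the curve where F1 is maximized.
best_r, best_p = max(curve, key=lambda pt: f1(pt[1], pt[0]))
print(best_r)  # well below 1.0: F1 "gives up" on the remaining relevant documents
```

On this synthetic curve, as in the paper's figure, the F_{1}-optimal point leaves a large fraction of the relevant documents unretrieved.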

For example, a litigant can request that its opposition or a third party be compelled by the court to produce documents that satisfy some relevance criteria so that evidence that may resolve the dispute can be discovered. When electronic documents are involved this process is called e-discovery. High recall (e.g., 75% or more) is required to satisfy the court. High precision is desirable to reduce the number of non-relevant documents that will undergo expensive human review before the documents are turned over to the requesting party. When determining the recall that is necessary the court will take cost into account, but it does so in light of the value of the case and the likelihood that e-discovery will turn up useful evidence—this is called "proportionality." The F_{1} score cannot possibly strike the right balance between recall and precision because it is completely oblivious to the value of the case. Other problems with the F_{1} score are detailed in the paper.

The strong dependence of F_{1} on recall as we move along the precision-recall curve means it is easy to draw the wrong conclusion about which system is performing better when performance is measured at different levels of recall. This dependence arises because the contours of equal F_{1} are not shaped like precision-recall curves, so a precision-recall curve cuts across many contours. To allow performance to be measured at recall levels that are relevant for e-discovery without drawing wrong conclusions, the paper proposes a performance measure whose constant-performance contours are shaped like precision-recall curves, so the measure depends much less than F_{1} does on the recall level where the measurement is made. In other words, the proposed performance measure aims to be sensitive to how well the system is working while being insensitive to the specific point on the precision-recall curve where the measurement is made. This graph compares the constant-performance contours for F_{1} to those of the measure proposed in the paper:

Since the constant-performance contours are shaped like typical precision-recall curves, we can view this measure as being equivalent to extrapolating the precision-recall point to some other target recall level, like 75%, by simply finding an idealized precision-recall curve that passes through the point and moving along that curve to the target recall. This figure illustrates extrapolation of precision measurements for three different systems at different recall levels to 75% recall for comparison:
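The extrapolation idea can be sketched concretely. The paper has its own mathematical model for the idealized curves, which is not reproduced here; the sketch below instead uses a simple hypothetical one-parameter family P(R) = 1 − aR², which at least has the right shape (precision near 1 at low recall, falling as recall rises), to show the mechanics of fitting a curve through the observed point and sliding along it to the target recall.

```python
def extrapolate_precision(p: float, r: float, target_r: float = 0.75) -> float:
    """Extrapolate the point (r, p) to target_r along an idealized curve.

    Assumes the hypothetical family P(R) = 1 - a*R^2 (NOT the paper's model):
    fit a so the curve passes through (r, p), then evaluate it at target_r."""
    a = (1.0 - p) / r ** 2
    return max(0.0, 1.0 - a * target_r ** 2)

# Three systems measured at different recall levels, all compared at 75% recall.
print(extrapolate_precision(0.90, 0.30))  # 0.375
print(extrapolate_precision(0.60, 0.75))  # 0.6 -- already at target, unchanged
```

Whatever curve family is used, a point already sitting at the target recall extrapolates to itself, which is what makes comparisons at a common recall level fair.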

Finally, here is what the performance measure looks like if we evaluate it for each point in the two precision-recall curves from the first figure:

The blue performance curves are much flatter than the red F_{1} curves from the first figure, so the value is much less sensitive to the recall level where it is measured. As an added bonus, the measure is an extrapolated estimate of the precision that the system would achieve at 75% recall, so it is inversely proportional to the cost of the human document review needed (excluding training and testing) to reach 75% recall.

For more details, including the mathematical model used for the idealized precision-recall curves and some examples of how well the extrapolation works, read the paper.

