Question answering tutorial with Hugging Face BERT

What is question answering AI?

Question answering AI refers to systems and models designed to understand natural language questions posed by users and provide relevant and accurate answers. These systems leverage techniques from natural language processing (NLP), machine learning, and sometimes deep learning to comprehend the meaning of questions and generate appropriate responses.

The goal of question answering AI is to enable machines to interact with users in a way that simulates human-like comprehension and communication. In the ever-evolving domain of NLP, models such as Bidirectional Encoder Representations from Transformers (BERT) have opened the door to profound advances, enabling machines to comprehend human language with unprecedented accuracy. As these models have grown more sophisticated, they have set benchmarks in a variety of tasks, from simple text classification to complex question answering.

Whether you are an NLP enthusiast or a professional looking to harness the potential of BERT for AI-powered QA, this guide walks through the steps of using BERT for question answering (QA).

How to build a question answering AI with BERT?

While BERT is a powerhouse pre-trained on massive amounts of text, it is not specialized for any particular task, so using it out of the box for QA is not ideal. By fine-tuning BERT on QA datasets, we help the model grasp the nuances of QA tasks, especially for domain-specific questions in fields such as medicine and law, while saving response time and computing resources. By leveraging its existing knowledge and then tailoring it to QA, we stand a better chance of getting impressive results, even with limited data.

There are two predominant methods for fine-tuning a BERT model for question answering: fine-tuning with questions and answers alone (Method 1), and fine-tuning with questions, answers, and context (Method 2).

Fine-tuning with questions and answers alone

In this approach, fine-tuning resembles a classification or regression task: the BERT model is given pairs of questions and their corresponding answers during training, and it essentially learns to map a given question to a specific answer.

Over time, the model memorizes these mappings. When presented with a familiar question (or something very similar) during inference, the model can recall or generate a suitable answer. However, this method has its limitations. It largely depends on the training data, and the model might not generalize well to questions outside of its training set, especially if they require contextual understanding or extraction of information from a passage.

Its main flaw is its reliance on memorization, which can lead to empty responses when the model is asked a question that it has never encountered and cannot relate to one it has seen.

Fine-tuning with questions, answers, and context

BERT models trained on a context (or passage), a related question, and an answer found within that context perform far better and are more adaptable. Here, the objective is not to make the model memorize the data it is trained on. Instead, the goal is to enhance the model’s ability to comprehend a given context or passage and extract the relevant information from it, much as humans identify and extract answers from reading comprehension passages. This method enables the model to generalize better to unseen questions and contexts, making it the preferred approach for most real-world QA applications.

While Method 1 stands as a direct mapping problem, Method 2 treats it as an information extraction problem. The choice between the two largely depends on the application at hand and the available data. If the goal is to create a QA model that can answer a wide range of questions based on diverse passages, then Method 2, involving contexts, is more appropriate. On the other hand, if the objective is to build a FAQ chatbot that answers a fixed set of questions, the first method might suffice.

Extractive question answering tutorial with Hugging Face

In this tutorial, we follow the Method 2 fine-tuning approach to build a question answering AI that uses context. Our goal is to improve the proficiency of a Hugging Face BERT question answering model so that it can handle a broader spectrum of conversational questions.

Conversational questions are queries that arise in natural, fluid dialogue between individuals. They are context-heavy and nuanced, and they might not be as straightforward as fact-based questions. Without fine-tuning on such questions, BERT might struggle to fully capture the underlying intent and context of these queries. Fine-tuning aims to give BERT the specialized skills it needs to handle a broader range of these conversational challenges.

Dataset used for fine-tuning

In this tutorial, we will be working with the Conversational Question Answering dataset, known as CoQA. CoQA is a substantial dataset, created by the Stanford NLP group and hosted on Hugging Face, designed for developing conversational question answering systems. The CoQA dataset is available on the Hugging Face Hub.

The dataset is provided in JSON format and includes several components: a given context (which serves as the source for extracting answers), a set of 10 questions related to that context, and the corresponding answers to each question. Additionally, it includes the start and end indexes of each answer within the main context text from which the answer is extracted.
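As a rough illustration of that structure, the sketch below flattens one passage with several questions into one training example per question. The record is a hand-written toy, not real CoQA data, and the field names (story, questions, answers, input_text, answer_start, answer_end) are assumptions about the schema that should be checked against the copy of the dataset you download. The end index is also assumed to be exclusive. In the real dataset the free-form answer text may not match the indexed span word for word.

```python
from typing import Dict, List

# Hand-written toy record. Field names are assumed, not a guaranteed schema.
record = {
    "story": "Anna has a cat named Milo. Milo is three years old.",
    "questions": ["What pet does Anna have?", "What is its name?"],
    "answers": {
        "input_text": ["a cat", "Milo"],
        "answer_start": [9, 21],   # character offsets into the story
        "answer_end": [14, 25],    # assumed to be exclusive
    },
}


def flatten(record: Dict) -> List[Dict]:
    """Turn one passage with many questions into one example per question."""
    examples = []
    answers = record["answers"]
    for i, question in enumerate(record["questions"]):
        start = answers["answer_start"][i]
        end = answers["answer_end"][i]
        examples.append({
            "context": record["story"],
            "question": question,
            "answer_text": answers["input_text"][i],
            # Slice the context to recover the span the indexes point at.
            "span_text": record["story"][start:end],
            "start_char": start,
            "end_char": end,
        })
    return examples


examples = flatten(record)
for ex in examples:
    print(ex["question"], "->", ex["span_text"])
```

Slicing the context with the stored indexes is a cheap sanity check: if span_text looks wrong, the offsets or the assumed index convention need another look before training.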

The primary objective of the CoQA challenge is to evaluate how well machines can comprehend a textual passage and provide answers to a series of interconnected questions that arise within a conversation.

Comparing the non-fine-tuned and the fine-tuned model performances

Non-fine-tuned BERT model evaluation

Below are the results for a BERT model that has not been fine-tuned, evaluated on the same dataset as the fine-tuned model.

Given the challenge of precisely predicting the start and end indices, we’ve implemented a function to accommodate minor deviations in our evaluations. We present the accuracy without any margin for error, followed by accuracies considering a leeway of 5 words and then 10 words. These error margins are also applied to evaluate the performance of the fine-tuned BERT model.
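A tolerance-based accuracy of this kind could look like the sketch below. It is not the exact function used to produce the numbers in this tutorial. The name accuracy_within and the sample indices are made up for illustration, and the tolerance is counted in token positions.

```python
from typing import Sequence


def accuracy_within(predicted: Sequence[int], gold: Sequence[int],
                    tolerance: int = 0) -> float:
    """Percentage of predictions within `tolerance` positions of the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if abs(p - g) <= tolerance)
    return 100.0 * hits / len(gold)


# Made-up indices purely for illustration, not results from the tutorial.
gold_starts = [12, 40, 7, 88]
pred_starts = [12, 44, 20, 79]

for tol in (0, 5, 10):
    acc = accuracy_within(pred_starts, gold_starts, tol)
    print(f"Start accuracy within {tol}: {acc:.2f}%")
```

With a tolerance of 0 this reduces to exact-match accuracy, so the same function covers all three settings reported here.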

Total processed data points: 468

Start Token Accuracy (Pre-trained BERT): 1.27%

End Token Accuracy (Pre-trained BERT): 0.00%

Start Token Accuracy Within Range (Pre-trained BERT, 5): 4.66%

End Token Accuracy Within Range (Pre-trained BERT, 5): 4.24%

Start Token Accuracy Within Range (Pre-trained BERT, 10): 7.63%

End Token Accuracy Within Range (Pre-trained BERT, 10): 7.63%

Fine-tuned BERT model evaluation

We fine-tuned the model for seven epochs, tracking the loss value across them. The accuracy of the fine-tuned model, measured with the same error margins as for the model before fine-tuning, is listed below.

Start Token Accuracy: 7.63%

End Token Accuracy: 5.51%

Start Token Accuracy Within Range (5): 34.75%

End Token Accuracy Within Range (5): 43.22%

Start Token Accuracy Within Range (10): 46.61%

End Token Accuracy Within Range (10): 51.27%
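For reference, the loss tracked when fine-tuning an extractive QA model is commonly the average of two cross-entropy terms, one over the start position and one over the end position. The sketch below computes it for a single example in plain Python. It assumes this standard objective rather than describing the exact training loop used here, and the function names and scores are illustrative.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits: List[float], end_logits: List[float],
              gold_start: int, gold_end: int) -> float:
    """Average of the start-position and end-position cross-entropy losses."""
    start_loss = cross_entropy(start_logits, gold_start)
    end_loss = cross_entropy(end_logits, gold_end)
    return (start_loss + end_loss) / 2


# Made-up scores over a five-token input; the gold answer spans tokens 2 to 3.
start_logits = [0.1, 0.3, 2.5, 0.2, 0.0]
end_logits = [0.0, 0.1, 0.4, 2.0, 0.3]

print(f"loss with correct labels: {span_loss(start_logits, end_logits, 2, 3):.4f}")
print(f"loss with wrong labels:   {span_loss(start_logits, end_logits, 0, 0):.4f}")
```

The loss is low when the model puts high scores on the gold start and end tokens, so a falling loss across epochs indicates that the predicted boundaries are moving toward the labeled ones.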

The initial performance of the pre-trained BERT model on the CoQA dataset was almost negligible. However, after fine-tuning on only about 6,000 data points, accuracy rose to between roughly 35% and 43% within a 5-word error range, and to between roughly 47% and 51% within a 10-word error range.

This is a notable improvement. To boost the model’s performance further, we could experiment with different learning rates, train for more epochs, and augment the training data. Indeed, a dataset of 6,000 points is often insufficient for many scenarios.