
How we used spaCy and Hunspell to handle typos in an AI chatbot

[Photo credit: Pexels]

Think back to the last time you texted a chatbot. Whether it was a concierge, a customer support assistant, or an AI virtual recruiter, chances are the bot guided you through a linear flow.

There’s very little “intelligence” in a bot unless it is trained with Natural Language Processing (NLP) and is able to engage a user.

Part of improving the user experience involves anticipating how a person will interact with the bot. And that has been my main focus over the last two weeks at impress.ai — figuring out how to train a chatbot so it can respond to a dialog with spelling errors.

What’s all the fuss about typos?

Our AI chatbot is used by recruiters to actively engage and shortlist candidates during the hiring process. One way we promote engagement is by letting users (candidates) ask the chatbot questions during the interview to learn more about the company, role, and career benefits.

With NLP techniques, the chatbot is able to separate each word within the question, check for errors, and then provide suggestions for what the candidate’s question could have been. This is like when you mistype a question into Google and the first thing that shows up is a ‘Did you mean?’ statement.

Typos in candidate questions limit the chatbot’s ability to respond with answers that are actually helpful. This is why we decided to create a way for the chatbot’s algorithm to recognise spelling errors, so dialogs wouldn’t be abruptly interrupted.

Recommended tools to create a spell correction feature

It’s important to understand that the number of questions any chatbot can answer is limited to the questions within the bot’s training dataset. Based on the use case, the information can be tailored to fit this dataset. Information that is either biasing or irrelevant can be discarded from the dataset, so the chatbot only learns answers to questions that a candidate is likely to ask.

Since our algorithm is deployed in an interactive chatbot environment, calling a third-party, web-based spelling or grammar correction tool for every message would significantly increase response times. It would also raise concerns around data privacy.

My team and I researched several tools to help meet our goal and these were the ones we used:

  • spaCy: spaCy is a free, open-source library for advanced NLP in Python
  • Hunspell: Hunspell is a free spell checker and morphological analyzer library
  • spacy_hunspell: Hunspell extension for spaCy
  • google-10000-english: List of the 10,000 most common English words in order of frequency

Algorithm employed for our spell correction feature

Once we had the tools in place, the next step involved brainstorming and executing the right techniques. Here is what our process looked like:

  1. Tokenization:
    1. Using the spaCy library, individual words and punctuation marks were identified.
    2. Each word was sent to the Hunspell library. If the word was misspelled, the tool provided a list of suggested replacements. If there were no replacements, the query was forwarded to the FAQ module without any corrections.
  2. Similarity matching:
    1. The suggested replacements were converted to word vectors using spaCy.
    2. These vectors were then compared to the vector representation of the original misspelled word.
    3. The comparison generates an initial similarity score that indicates how similar the suggestion is to the original word.
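
The two steps above can be sketched roughly as follows. This is an illustrative sketch rather than the production implementation, and it deliberately avoids external dependencies: `tokenize` is a regex stand-in for spaCy's tokenizer, `embed` builds a toy character-bigram vector in place of a spaCy word vector, and `ToyChecker` is a hypothetical object with `spell` and `suggest` methods, the shape that Python Hunspell bindings typically expose. All of these names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into word and punctuation tokens (stand-in for spaCy's tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def embed(word):
    """Toy vector: counts of character bigrams (stand-in for a spaCy word vector)."""
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


class ToyChecker:
    """Hypothetical spell checker backed by a fixed table of misspellings."""

    def __init__(self, suggestions):
        self.suggestions = suggestions

    def spell(self, word):
        # True means "spelled correctly": the word is not a known misspelling.
        return word.lower() not in self.suggestions

    def suggest(self, word):
        return self.suggestions.get(word.lower(), [])


def suggest_with_similarity(text, checker):
    """Map each misspelled token to a list of (suggestion, similarity) pairs."""
    results = {}
    for token in tokenize(text):
        # Skip punctuation and numbers, and words the checker accepts.
        if not token.isalpha() or checker.spell(token):
            continue
        original = embed(token)
        results[token] = [
            (candidate, cosine(original, embed(candidate)))
            for candidate in checker.suggest(token)
        ]
    return results


checker = ToyChecker({"salery": ["salary", "celery"]})
result = suggest_with_similarity("What is the salery?", checker)
```

With this toy data, `result` holds one entry for "salery", and "salary" scores higher than "celery" because it shares more character bigrams with the misspelled word. A query with no misspelled words yields an empty dictionary, which corresponds to forwarding the question to the FAQ module unchanged.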

However, a little testing revealed that we couldn’t rely on the similarity scores alone to choose sensible replacements. So we added a frequency check to the pipeline.

  1. Frequency ranking:
    1. Each suggestion was assigned a rank score based on its frequency in the English language. For this we used a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google’s Trillion Word Corpus.
  2. The similarity score and rank score were combined and normalized to provide a final score for each suggested replacement word.
  3. The suggested replacements for each word were sorted based on their final scores.
  4. We selected the top two suggested replacements for each misspelled word.
  5. Since there are likely to be multiple misspelled words in a sentence, we created a list of every combination of the different replacements. For example, if two words are misspelled with two suggested replacements each, we have a shortlist of 2 × 2 = 4 sentences. If there are three such words, we have a shortlist of 2 × 2 × 2 = 8 sentences.
  6. This list of replacement sentences was then sent to be matched within the FAQ module.
  7. Based on the matches found, the appropriate answer is shown to the candidate.
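
The scoring and shortlisting steps above could look something like the sketch below. The article does not say exactly how the similarity and rank scores were combined, so the equal-weight average and the sum-to-one normalisation here are assumptions, as are the function names and the tiny inline `frequency_list` (in practice this would be the full 10,000-word list, most common word first).

```python
from itertools import product


def rank_score(word, frequency_list):
    """Score close to 1.0 for very common words, 0.0 for words not in the list."""
    try:
        position = frequency_list.index(word.lower())
    except ValueError:
        return 0.0
    return 1.0 - position / len(frequency_list)


def final_scores(similarities, frequency_list, weight=0.5):
    """Blend similarity and rank scores, then normalise so the scores sum to 1.

    The weighted average is an assumed combination rule, not the article's.
    """
    raw = {
        word: weight * similarity + (1 - weight) * rank_score(word, frequency_list)
        for word, similarity in similarities.items()
    }
    total = sum(raw.values())
    return {word: (score / total if total else 0.0) for word, score in raw.items()}


def top_replacements(similarities, frequency_list, k=2):
    """Return the k best-scoring suggestions, best first."""
    scores = final_scores(similarities, frequency_list)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def candidate_sentences(tokens, replacements):
    """Build one sentence per combination of replacements.

    Tokens without replacements are kept unchanged. Joining with spaces is a
    simplification that ignores punctuation spacing.
    """
    options = [replacements.get(token, [token]) for token in tokens]
    return [" ".join(choice) for choice in product(*options)]


frequency_list = ["the", "what", "salary", "benefits", "celery"]
best = top_replacements({"salary": 0.71, "celery": 0.57, "slavery": 0.5}, frequency_list)
shortlist = candidate_sentences(
    ["what", "salery", "and", "benifits"],
    {"salery": best, "benifits": ["benefits", "benefit"]},
)
```

Here `shortlist` contains four sentences, one for each pairing of the two replacements for "salery" with the two for "benifits". Using `itertools.product` keeps the shortlist size at two to the power of the number of misspelled words, which is why capping the suggestions at two per word matters for response time.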

Is this something you see yourself doing everyday?

If you’d like to work at the forefront of innovative technology and use data to improve systems, I’d recommend you check out these jobs with one of the top technology consulting companies in Singapore and the Asia-Pacific region: