Natural language conversation is one of the most challenging problems in artificial intelligence: it involves language understanding, reasoning, and the use of common sense knowledge. Previous work in this direction has mainly focused on either rule-based or learning-based methods. These methods often rely on manual effort in designing rules, or on automatically training a model with a particular learning algorithm and a small amount of data, which makes it difficult to develop an extensible open-domain conversation system.
Chatbots (also known as Conversational Agents or Dialog-based Agents) make use of natural language conversation models. They are the latest trend. All the big players are investing heavily in building chatbots to attract customers, because the ability to solve problems in a more interactive way adds a coolness factor to any system, website, or application. Companies like Microsoft, Facebook (M), Apple (Siri), Google, WeChat, and Slack are investing heavily in building their own versions of bots.
Many companies are hoping to develop bots to have natural conversations that are as similar to human ones as possible, and many are claiming to be using NLP and Deep Learning techniques to make this possible. But with all the hype around AI it’s sometimes difficult to tell fact from fiction. In this article I will try to uncover what a chatbot consists of.
Any Chatbot can consist of the following components:
The user interface is the part of the chatbot that is exposed to the end user; the user interacts with the bot through it. The conversation can happen over multiple channels such as phones, laptops, dedicated hardware (Amazon Echo), kiosks, and desktops, and it can be text-based (messengers) or verbal (Siri, Alexa, etc.). The user interface is responsible for providing the capabilities through which the user interacts with and uses the system. It is also responsible for maintaining the user session, keeping track of user activity, and managing authentication/authorization and the user's cart (if applicable).
The communication mediator of a chatbot is responsible for routing the input coming in from the user interfaces to the service that can resolve the query. For verbal communication, it should also be able to convert speech to text and, when responding, generate speech from text. This can be done either by invoking a third-party API (Nuance’s ASR/TTS API, the Alexa API, etc.) or by using pre-generated speech files (in the case of rule-based models). This layer should also make sure that the right model is invoked to process the input further and generate the appropriate response. Some people consider the UI and communication layers to be one entity, but I think they are separate because they have different functions. I have also seen a few architects call this layer a conversation state machine, knowledge as a service (KaaS, a term often used by a good friend of mine, @Brian Martin), or an intent resolver.
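The routing role of this layer can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the handler functions, the keyword routing table, and the `speech_to_text` placeholder are hypothetical names invented for the example, and a real system would call an actual ASR service and a trained intent model instead.

```python
from typing import Callable, Dict, Union

# Hypothetical handler type: takes the user's text and returns the reply text.
Handler = Callable[[str], str]


def weather_handler(text: str) -> str:
    return "weather model invoked"


def orders_handler(text: str) -> str:
    return "orders model invoked"


def fallback_handler(text: str) -> str:
    return "Sorry, I did not understand that."


# Illustrative keyword-to-handler routing table (an assumption for this sketch).
ROUTES: Dict[str, Handler] = {
    "weather": weather_handler,
    "order": orders_handler,
}


def speech_to_text(audio: bytes) -> str:
    """Placeholder for a third-party ASR call; here the bytes are simply decoded."""
    return audio.decode("utf-8")


def mediate(user_input: Union[str, bytes], is_voice: bool = False) -> str:
    """Normalize the input to text, then invoke the first handler whose keyword matches."""
    text = speech_to_text(user_input) if is_voice else user_input
    lowered = text.lower()
    for keyword, handler in ROUTES.items():
        if keyword in lowered:
            return handler(text)
    return fallback_handler(text)
```

Calling `mediate("What is the weather today?")` would dispatch to the weather handler, while unrecognized input falls through to the fallback. Keeping this routing separate from the UI is what lets the same models serve several channels.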
Models are the brain of the bot. They help figure out the intent of the input. In an ideal scenario, the model should behave like a human brain: it understands what a person is asking for by relating events from the context it builds up over the course of the conversation, and it signals what response should be sent. A chatbot can be modeled in various ways. Depending on the availability of data and domain knowledge, one can build a high-performing bot. Here are a few commonly used models:
Rule-based models are easier to build because we predefine the set of sentences (questions or responses) and use some kind of heuristic to select the appropriate response based on the input and its context. Depending on the type of problem, the heuristic can be as simple as a few rules or as complex as a machine learning classifier. These systems are mostly hard-coded, and they pick responses from a fixed set stored in either a database or a file system. A lot of groundwork is needed to prepare for and cover all types of scenarios.
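As a rough sketch of the idea, the snippet below selects a canned response using a very simple heuristic: the number of words the user's input shares with each known question. The question/response pairs, the `min_overlap` threshold, and the function names are all made up for illustration; a production bot would load its fixed set from a database or file and would likely use a stronger heuristic.

```python
import re

# Illustrative fixed set of (known question, canned response) pairs.
RESPONSES = [
    ("what are your opening hours", "We are open from 9 am to 6 pm, Monday to Friday."),
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("where is my order", "You can track your order from the 'My orders' page."),
]
FALLBACK = "Sorry, I do not have an answer for that yet."


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_response(user_input, min_overlap=2):
    """Pick the response whose known question shares the most words with the input."""
    words = tokenize(user_input)
    best_score, best_response = 0, FALLBACK
    for question, response in RESPONSES:
        score = len(words & tokenize(question))
        if score > best_score:
            best_score, best_response = score, response
    # Require a minimum overlap so that unrelated input gets the fallback.
    return best_response if best_score >= min_overlap else FALLBACK
```

The groundwork mentioned above shows up here directly: every scenario the bot should handle needs its own entry in the fixed set.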
Template-based models are similar to rule-based ones, but here we predefine only the frame of the response and leave placeholders (tokens) that we fill in based on logic we write. The values can either be produced using deep learning techniques or be fetched from a query-based system. For example, we can have a template for reporting the weather: “Weather of <LOCATION> is <WEATHER_TYPE> and temperature will go <TEMPERATURE> F”. This simple example has three placeholders, and we can easily fetch the information to fill them when the question asked is “What is the weather in San Francisco?”. We can generalize these templates to train our model. Most existing chatbots use a template-based approach; however, companies are constantly trying to make templates smarter so that they can get away from hard-coding a large number of scenarios and generate more human-like responses.
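The weather template above can be filled in with a short Python sketch. The `WEATHER_DATA` dictionary stands in for a query-based system and its values are invented for the example; the regular expression used to pull out the location is likewise a simplifying assumption rather than a robust entity extractor.

```python
import re

TEMPLATE = "Weather of <LOCATION> is <WEATHER_TYPE> and temperature will go <TEMPERATURE> F"

# Hypothetical stand-in for a query-based weather service (made-up values).
WEATHER_DATA = {
    "san francisco": {"WEATHER_TYPE": "foggy", "TEMPERATURE": "58"},
}


def extract_location(question):
    """Naively pull the location that follows 'weather in' from the question."""
    match = re.search(r"weather in ([A-Za-z ]+)", question, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None


def fill_template(template, slots):
    """Replace every <PLACEHOLDER> token with its value from slots."""
    return re.sub(r"<([A-Z_]+)>", lambda m: str(slots[m.group(1)]), template)


def answer(question):
    """Return the filled template, or None if the location cannot be resolved."""
    location = extract_location(question)
    if location is None or location.lower() not in WEATHER_DATA:
        return None
    slots = {"LOCATION": location, **WEATHER_DATA[location.lower()]}
    return fill_template(TEMPLATE, slots)
```

With the sample data, `answer("What is the weather in San Francisco?")` yields "Weather of San Francisco is foggy and temperature will go 58 F". Only the slot values change between responses, which is why these bots sound repetitive.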
Generative models are a newer breed of models in which we generate responses using machine translation techniques. Deep neural networks with sequence-to-sequence modeling are commonly used to generate the response. Another widely used (and researched) class of models is autoencoders, which use the encoder-decoder paradigm to generate responses. These systems are much smarter: they don’t rely on predefined responses, and everything is generated on the fly. These models are slowly becoming popular. The problem is that they need a lot of training data, they still cannot reliably generate fully grammatical sentences, and they cannot generate long responses with high accuracy. For short answers, they are pretty good.
Sometimes the need is to generate one-liners, and sometimes the expectation is to generate a full paragraph in response. Bots can be modeled to generate either, but the two cases need to be handled differently. Bots that hold long conversations are harder to build. For these, we can use deep neural network generators (using LSTM or GRU units) that are capable of generating text summaries. Short answers can follow one of the approaches discussed above. Since the response is short and follows a pattern most of the time, we use a rule-based or template-based approach. For answers that are highly open-ended, we need to use generators.
This class has two categories: open-ended and closed-ended conversations. The model we build for these types of bots should be trained according to the type of responses it needs to send back. In an open-ended setting, the user can take the conversation anywhere. There isn’t necessarily a well-defined goal or intention. Conversations on social media sites like Twitter and Reddit are typically open-ended: they can go in all kinds of directions. The infinite number of topics and the fact that a certain amount of world knowledge is required to create reasonable responses make this a hard problem. In a closed-ended setting, the space of possible inputs and outputs is somewhat limited because the system is trying to achieve a very specific goal. Technical customer support and shopping assistants are examples of closed-ended problems. These systems don’t need to be able to talk about politics; they just need to fulfill their specific task as efficiently as possible. Sure, users can still take the conversation anywhere they want, but the system isn’t required to handle all these cases, and users don’t expect it to.
We can mix and match these models to get better results. Having worked on both generative and rule/template-based models, I feel that generative models still have a long way to go. Even though I was able to get a good number of correct one-line and one-word answers (94% accuracy), this is not sufficient in the real world.
Data is the bread and butter of bots. The model needs to be trained on business-specific data. The model consumes the data and becomes smarter; the more specific data we have, the better a bot can perform. When building rule-based or template-based bots, we can train on publicly available engines like wit.ai. If we are working on generative models, we need a lot of business-specific data to train the custom model.
The bot architecture should be pluggable, with exposed interfaces that connect the bot to enterprise applications and help it serve the purpose for which it was built. Once the bot figures out what the user is looking for, it can programmatically invoke these interfaces to add specifics to the response, such as product information in a shopping cart, an inventory list, or the last n orders. The interfaces can also represent content delivery networks that feed media and other business-specific data to the bot application.
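One way to picture such a pluggable design is a small registry that maps a resolved intent to an interface object. The sketch below is an assumption-laden illustration: `EnterpriseInterface`, `InMemoryOrders`, `Bot`, and the `last_orders` intent name are hypothetical, and the in-memory dictionary stands in for a real order system.

```python
from typing import Dict, List, Protocol


class EnterpriseInterface(Protocol):
    """Hypothetical contract that every pluggable backend must satisfy."""

    def fetch(self, user_id: str) -> List[str]: ...


class InMemoryOrders:
    """Stand-in for an order system; returns the last three orders of a user."""

    def __init__(self, orders: Dict[str, List[str]]) -> None:
        self._orders = orders

    def fetch(self, user_id: str) -> List[str]:
        return self._orders.get(user_id, [])[-3:]


class Bot:
    """Maps a resolved intent to whichever interface was registered for it."""

    def __init__(self) -> None:
        self._interfaces: Dict[str, EnterpriseInterface] = {}

    def register(self, intent: str, interface: EnterpriseInterface) -> None:
        self._interfaces[intent] = interface

    def respond(self, intent: str, user_id: str) -> str:
        interface = self._interfaces.get(intent)
        if interface is None:
            return "I cannot help with that yet."
        items = interface.fetch(user_id)
        if not items:
            return "I found nothing for you."
        return "Here is what I found: " + ", ".join(items)
```

A new backend (an inventory service, a content delivery network) can then be plugged in with one `register` call, without touching the bot's core logic.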
In this article, I tried to cover some of the basic nuts and bolts of a chatbot. It takes a lot to build one: proper planning, careful thought, an understanding of the business domain, content building, model selection, data gathering, preparation of scenarios and use cases, and identification of interfaces. Deep network technology is evolving, and new methods are being discovered all the time. I have built bots using most of the modeling methods I described in this article. I have a strong liking for generative models because they can be made smart and the responses they generate can be awe-inspiring, but they are still far from foolproof. I am continuing my research on building a smart bot. For developing high-performing bots, rule-based and template-based models are still used extensively.
We are in pursuit of a high-performing chatbot: a bot that can resolve issues right away with appropriate reasoning, unlike the answer “42” given by “Deep Thought” in "The Hitchhiker's Guide to the Galaxy". I am confident that we will get there soon.