Neural Network Based Dialogue Management

Jan Pichl May 14, 2017 at 8:41 pm

There are plenty of sub-tasks we need to deal with when we are developing chatbots. One of the most important sub-task is called intent recognition or dialogue management. It is a part of the chatbot which decides what the topic a user want to talk about is and then it runs the corresponding module or function.  Since it manages which part of the software should be triggered, it is called “dialogue management”. The “intent recognition” name comes from the fact that the decision is made according to the user intention which needs to be recognised.

All we want to do is to take the sentence of whatever just the user said and then decide which class is the most suitable. In Alquist, we have 16 classes. These classes correspond to the topics which the chatbot is capable of talking about.

We have experimented with several approaches to deal with this task. The first approach combines a logistic regressions classifier and cosine similarity of the GloVe word embeddings (similar to word2vec). The input of the classifier consists of the one-hot vectors of word uni- and bi-grams. The classifier estimates the probabilities of the coarse-grained classes such as chitchat, question answering, etc. More fine-grained classes are estimated using the cosine similarity distance between the average vector of the embeddings of the words from input sentence and the average of the embeddings of the words from reference sentences. The accuracy of this combined approach is 78%.

Another approach uses a neural network as an intent recognizer. The neural network has three different inputs. The first one is the actual utterance, the second one consists of the sequence of the concepts and the third one is the class of the previous utterance. The concepts are retrieved using heuristic linguistic rules and Microsoft Concept Graph. The previous label is the output of the very same neural network for the previous utterance or “START”  if the utterance is the first message. The structure of the neural network is shown in the following figure.

Network Structure

The network consists of separate convolutional filter for input sentence and the list of concepts. We use the filters of lengths from 1 to 5. Max pooling layer follows the convolutional layer and the outputs of the input sentence and the concepts branches are concatenated. Additionally, the class of the previous utterance is concatenated to the vector as well. Finally, we use two fully connected layers with dropout in the architecture.

This neural network based approach achieves the accuracy of 84% and it represents a more robust solution for the presented task. Additionally, it takes advantage of the information about the previous class which is often a crucial feature.