Intents in conversational AI: A Rethink

Published by:

Yaki Dunietz

When I started developing Chatbots in the late 1990s, there were no generic NLP/NLU systems available, and definitely none off the shelf. In fact, these terms were vaguely defined, and used for systems designed to extract meaning from a text string. So when we started to design a Chatbot development system, all we had as input was a single string of text: the user’s input.

Following Joseph Weizenbaum, we treated the situation as a pattern-matching problem, using good old regular expressions. The input would be matched against a mask, which includes literals, variables and wildcards. So the pattern “*hello*” will match every input that includes the word “hello”, while the pattern “hello*” will match only inputs that start with “hello” (because the second pattern has no leading “*”). Complicated patterns like “*my [family_members]’s name is [common_name] [1:l2]” will only match inputs that fit this pattern (note that only two words are allowed after [common_name]; if the input includes more than that, the pattern will not match).
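The wildcard masks above can be sketched in a few lines of Python. This is only an illustration of the idea, not the original system’s syntax: a mask is treated as literal text except for “*”, which matches anything.

```python
import re

# A minimal, illustrative sketch of the mask idea described above
# (not the original system's syntax): "*" is a wildcard over any text.
def mask_to_regex(mask: str) -> "re.Pattern[str]":
    escaped = re.escape(mask)               # treat everything as a literal first
    pattern = escaped.replace(r"\*", ".*")  # then restore "*" as a wildcard
    return re.compile(rf"^{pattern}$", re.IGNORECASE)

# "*hello*" matches any input containing "hello"...
assert mask_to_regex("*hello*").match("Well hello there")
# ...while "hello*" only matches inputs that start with it.
assert mask_to_regex("hello*").match("hello world")
assert not mask_to_regex("hello*").match("well hello")
```

Named collections like [common_name] would need an extra substitution step before compiling, but the anchoring and wildcard behavior shown here is the core of the mask approach.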

In how many ways can you say “yes”?

Very often in a dialogue, a yes/no question comes up. At first it seemed easy to recognize a “yes” input, but we quickly discovered that there are many ways to say “yes”, some less obvious than others. A few examples: if the bot asks “Are you a happy person?” and the user says “always”, that is clearly a “yes”. But obviously not every “always” is a “yes”! And what if the bot asks the user “Do you like to drink?” and the user says “occasionally”?

So we quickly collected all the evident ways to say “yes” (including “yeah”, “sure”, “yup” etc.) and put them all in a collection we called [yes]. We did the same for “no” (in the collection [no]), and repeated the exercise for a fairly large number of frequent user inputs. Such a collection of patterns (regular expressions) is simply all the different synonymous ways a person could convey a specific meaning – an Intent.
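A toy version of such a collection can be sketched as follows. The word lists are illustrative stand-ins, not the original [yes]/[no] collections:

```python
# Illustrative stand-ins for the [yes] and [no] collections described above.
YES = {"yes", "yeah", "yep", "yup", "sure", "of course", "definitely"}
NO = {"no", "nope", "nah", "never", "not really"}

def classify_yes_no(user_input: str) -> str:
    """Map an input to a yes/no Intent, or 'unknown' if neither collection matches."""
    text = user_input.strip().lower().rstrip("!.")
    if text in YES:
        return "yes"
    if text in NO:
        return "no"
    return "unknown"
```

Here classify_yes_no("Yup!") returns "yes", while an ambiguous "always" falls through to "unknown" – exactly the case that forces a context-aware decision.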

Alexa bursts into the scene

Years went by, and Amazon Echo appeared on the scene with Alexa. We figured this was great news: what could be easier than deploying our bots on the Echo/Alexa platform? It was only then that we realized that the concept of “Intent” (or “User Intent”) had taken root and become the holy grail of the new emerging NLP systems. So much so that when creating a new skill (a new Chatbot on Alexa), the platform itself analyzes the voice input. But instead of just passing it to the bot in text format (as our bots are used to), Alexa passes the Intent. The bot does not know what the user really said – it relies on Alexa’s interpretation of the input, in the form of an Intent.

Basically, Alexa limited the input alphabet from the entire English language to a relatively small number of pre-designed intents. If the user said “always”/“maybe”/“sometimes” to a yes/no question, Alexa would make the decision regarding the intent, not the bot. There is no way to distinguish between different flavors of “yes” and “no” in particular contexts – unless you resort to the raw user input.

Typical NLP systems take a user input and return one intent associated with that input. Such systems are often employed as the first step in most Chatbots, which then treat the intent produced by the NLP system as the user input. They do not bother with the raw user input; they get a ready-made intent from the NLP, and the conversation flow does not affect how the NLP functions. This poses a huge problem, because a “yes” is not always the same “yes”! Without the actual words said by the user, it is impossible to produce a good response when the context is deep.
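To make the objection concrete, here is a hypothetical sketch of such an intent-only pipeline. All names and classification rules are invented for illustration; the point is that by the time the bot sees the input, the nuance between “always” and “occasionally” is already gone.

```python
def nlp_classify(user_input: str) -> str:
    # Stand-in for a generic intent classifier (invented rules, for illustration).
    affirmations = {"yes", "sure", "always", "sometimes", "occasionally"}
    return "AffirmIntent" if user_input.lower() in affirmations else "FallbackIntent"

def intent_only_bot(intent: str) -> str:
    # The bot branches on the label alone; it never sees the raw words.
    return "Great!" if intent == "AffirmIntent" else "Sorry, I didn't get that."

# "always" and "occasionally" collapse to the same label, so the bot
# responds identically to two very different answers:
assert intent_only_bot(nlp_classify("always")) == intent_only_bot(nlp_classify("occasionally"))
```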

This brings me to the main problem with intents: Deep context. Typical voice assistants (Siri, Alexa, Cortana) are (almost) always without context: The dialog starts afresh. Sometimes there may be a follow-up question, but there are no long conversations with many context levels, as a typical dialog between two humans would have. The intents produced by popular NLP systems work perfectly in a flat-context scenario. Here, there’s no need to know what was discussed a moment ago or a few minutes ago. But once a real conversation starts, the meanings of user inputs cannot be placed in ready-made boxes. That’s because the context is specific, and so are the user intents. The bot must have access to the exact words uttered by the user, in order to properly react.
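One way to picture the alternative is an interpreter that receives both the raw words and the dialog state. The questions and response rules below are hypothetical, sketching the idea rather than any real system:

```python
def interpret(raw_input: str, context_question: str) -> str:
    """Resolve the same raw words differently depending on the question just asked."""
    text = raw_input.strip().lower()
    if context_question == "Are you a happy person?" and text in {"yes", "always", "mostly"}:
        return "Glad to hear it!"
    if context_question == "Do you like to drink?":
        if text == "occasionally":
            return "A social drinker, then."
        if text in {"yes", "always"}:
            return "Every day?"
    return "Tell me more."

# The same word lands differently in different contexts:
assert interpret("always", "Are you a happy person?") != interpret("always", "Do you like to drink?")
```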

In other words: good NLP must take the conversation flow and context into account. The NLP should be aware of the status of the dialog. NLP and conversation design cannot be separated – the two are intertwined.

When the VA has no context

The reason the concept of intents became so successful is the prevalent confusion between Chatbots and Voice Assistants. The main difference is that VAs have no context: they provide a function, not a conversation partner. And for VAs, intents are perfect. However, for Chatbots that conduct long, multi-turn, deep-context conversations, the concept of intent must be abandoned. What counts is the exact user input, not an interpretation by a stranger (the NLP) that knows nothing about the situation.

The way we solved the Alexa problem is by using only one single intent Alexa provides: an option to get the raw user input in text format. This works perfectly with our multi-turn conversational bots. We did discover that Alexa’s voice recognition system has problems when expected to provide an exact input transcript (as opposed to just producing the intent), but that is a topic for another blog entry.

Check out the conversation flow and context in our CoCo Bot here