Recent news in Artificial Intelligence applied to Natural Language Processing (NLP) has raised hopes of some sort of AI magic. This is the first in a series of articles examining the gap between stellar expectations and the results actually delivered in real-life business use cases.
Today, free machine translation services such as DeepL or Google Translate demonstrate the incredible progress made by AI. It simply works!
Yet some might have the impression that tasks such as Machine Translation or Question Answering are relatively 'easy', because they imagine giant correspondence tables hidden somewhere, with a supercomputer looking up, for every possible sentence in one language, its equivalent in another. (Spoiler: there are no such tables...) And, you know, machines are good at large-scale, systematic, repetitive tasks...
Does that mean that machines today can really understand and manipulate language almost like humans?
Business loves categorization
Businesses today are overwhelmed by a continuous flow of business-related conversations: customer requests and complaints, customer reviews, employee feedback, emails, tweets.
It is not surprising that categorization ranks among the top requests from brands and businesses:
- sentiment analysis
- email filtering
- customer reviews
- opinion polls
In many instances, it would be very cool if a handy program could classify automatically every single text item according to meaning. Detect what it's all about, whether it's positive, or negative... Definitely some productivity gains to mine.
So, how does the machine fare?
At first, there were dinosaurs
A very, very long time ago (on the Artificial Intelligence timeline, something like … 5 years ago), implementing such classification models took months: countless man-hours of dataset labelling, designing and implementing rules to account for lingo and vocabulary, building neural networks, and adapting everything to each new language…
And results were disappointing at best. Having the machine get it right 2 times out of 3, or even 3 out of 4, may seem "OK". But it is a showstopper for many business use cases. Don’t expect traders to rely on a client order filtering system with 75% accuracy.
The two main roadblocks holding back performance were:
- the inability to truly account for context when deciphering text. In fact, algorithms were essentially word-based. As if you were limited to expressing yourself only with ‘sentences’ of 2 or 3 isolated words in random order...
- the (human) cost of labelling huge, specific datasets in order to teach the model by showing it tens of thousands of correct examples.
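To make the first roadblock concrete, here is a minimal sketch of a purely word-based representation (a "bag of words"). It is an illustration only, not the code of any particular legacy system, and the example sentences and the `bag_of_words` helper are made up for the occasion. It shows that two reviews with opposite meanings can look identical once word order is thrown away.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count the words of a text, ignoring their order and basic punctuation."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)

# Two hypothetical customer reviews with opposite meanings
positive = bag_of_words("The service was good, not bad.")
negative = bag_of_words("The service was bad, not good.")

# Both reviews contain exactly the same words, so a word-based
# model receives the same input for both of them.
same = positive == negative
print(same)
```

A classifier fed only these word counts has no way to tell the two reviews apart, which is exactly the context problem described above.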
Transformers come to the rescue
The advent of a new generation of language models, based on a novel and powerful architecture - Transformers, with BERT among the first and most famous of them - has dramatically changed the picture.
How? By adding three magic ingredients to the pot:
. Context-aware representations
All of a sudden, we have numerical representations of sentences ("sentence embeddings") that capture some of the overall context of a sentence: what is more or less important in it, and how that affects the overall meaning.
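What does a "numerical representation" buy us? Once each sentence is a vector, closeness in meaning can be estimated with a simple measure such as cosine similarity. The sketch below assumes made-up 3-dimensional vectors standing in for real embeddings (a real model would produce hundreds of dimensions, and the values here are purely illustrative, not the output of any model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, invented for illustration only
embeddings = {
    "I love my new manager": [0.9, 0.1, 0.2],
    "My new boss is great": [0.8, 0.2, 0.3],
    "The invoice is overdue": [0.1, 0.9, 0.1],
}

sim_close = cosine_similarity(embeddings["I love my new manager"],
                              embeddings["My new boss is great"])
sim_far = cosine_similarity(embeddings["I love my new manager"],
                            embeddings["The invoice is overdue"])

# With these toy values, the two sentences about a manager score higher
print(sim_close > sim_far)
```

The point is not the numbers themselves but the mechanism: sentences that share no keyword ("manager" versus "boss") can still end up close to each other if the model has captured their meaning.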
. Unsupervised training
All of a sudden, we have methods to pre-train models on unlabelled data in the first place. This gives deep-pocketed players such as Google the opportunity to train models on MASSIVE data (remember, they have it) … and to open-source them.
. Transfer learning
And all these models happen to have transfer learning properties, meaning that when you train them on new data (say, your data, for your specific problem, with your specific angle), they ‘remember’ most of what they learned during pre-training, on top of the specific things they learn from your data. As if Google and others were giving you, for free, some of their data and the computing power to train your model, just to get you started.
Better results, lower constraints
Equipped with these new technologies, the performance bar has been dramatically raised.
Last week our Data Science team delivered a custom multilingual classification model that categorizes employee feedback from engagement surveys. In this case, there were three challenges to overcome:
- Very conceptual/abstract classes, such as "Culture" or "Vision"
- 15+ languages used
- Very different language styles, from colloquial to corporate
The team delivered a model with an accuracy score of 85%.
In itself, this score does not mean anything. It could be considered excellent or insufficient depending on the initial business objectives and constraints, and on the complexity of the model. (Actually, the client was aiming at 80%, so they were pretty happy about it 😉 )
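One simple way to see why a score needs context: accuracy is just the share of correct predictions, and it only becomes meaningful once compared with a reference point, for instance what you would get by always predicting the most frequent class. The sketch below uses a tiny made-up set of labels (the "Other" class and all values are hypothetical, not our client's data).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of items whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels for 10 feedback quotes, invented for illustration
y_true = ["Culture", "Vision", "Culture", "Other", "Culture",
          "Vision", "Other", "Culture", "Culture", "Vision"]
y_pred = ["Culture", "Vision", "Culture", "Other", "Vision",
          "Vision", "Other", "Culture", "Culture", "Culture"]

model_score = accuracy(y_true, y_pred)      # 8 correct out of 10
baseline_score = majority_baseline(y_true)  # "Culture" appears 5 times out of 10
print(model_score, baseline_score)
```

On this toy data the model reaches 0.8 against a baseline of 0.5. The same 0.8 would be far less impressive on a dataset where one class alone accounts for 75% of the quotes.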
Nevertheless, it is fair to say that a model that puts a quote in the right bucket 8 or 9 times out of 10, in a fraction of a millisecond, can prove useful on many occasions.
More importantly, what is also game-changing is that this level of performance was obtained:
- in weeks rather than months,
- with limited upfront human labelling effort: days rather than weeks,
- on affordable computing infrastructure.
Understandably, the range of business use cases for which current NLP state-of-the-art models qualify is growing as we speak.
New expectations, new limits
How to get closer to perfection? Reach 99%?
As a teaser for our next articles, let me mention a few directions towards better performance:
. Data: more, cleaner…
. Human augmentation: who is able to label a 1,000-line dataset in a consistent and rigorous way? Not me, and probably not you 😉 So we had better work together with the machine.
. Perception biases: Do we judge AI results against human performance … or against perfection?
We will dig into these questions in the next articles of this series.