Can Machines Talk?

And not just like a parrot? This is the ultimate test of AI. Language technology will change our lives, and we must build a future that doesn’t speak to us only in English.

I propose to consider the question, ‘Can machines think?’

– Alan Turing

In ordinary parlance, we often say of an uncommonly intelligent person: “So and so is like a computer.” But at the story’s beginning, it was the other way round. The challenge was to build a machine that could be like a human in the most elementary aspect possible: think, for instance. Even if not deeply, like Rodin’s Le Penseur, at least like a dew-fresh two-year-old. It still is, arguably, quite the unbreachable frontier. That deceptively simple question up there was posed, almost like a whisper that would set off a storm, by a tragic and enigmatic figure now universally seen to have founded both computer science and Artificial Intelligence (AI)—the British mathematician Alan Turing. Set out in his seminal 1950 paper in Mind, the idea of proving that a mechanical device could, even hypothetically, think like a human was as profoundly unsettling as it was exhilarating—for philosophers and commoners alike.

The famous Turing Test went on to become the axis on which a whole century pivoted. But the basic task assigned to any machine facing that test entails an even more elementary aspect of life all of us take for granted: human language. Can there be thinking without language? And even if not invested with a soul—with poetry, song and human emotions—can a machine be devised that could respond to and reproduce the subtle inner geometries of human language like a two-year-old? And so we could ask the subset question: can machines talk?

This problem lies at the heart of what we now call language technology, a nascent and fascinating field at the cutting edge and also at the centre of the whole AI universe, intersecting almost everything else. Alexa, could you please foretell what the future could look like? Well, as a maximal language tech fantasy, imagine a farmer from Punjab being able to converse with another from the Karbi Anglong hills or Tindivanam almost instantaneously, without understanding a word of each other’s language—through a smart device. Or a Korean manager with people at his car factory in Tamil Nadu. This would be a very simple thing for a human interpreter, but is still galaxies away for the smartest computer! But aiming to get there, and fulfilling any number of more achievable targets in the interim, is what makes the field—an interdisciplinary one that brings together cognitive scientists, neurobiologists, linguists and computer mavens—toss and froth like a primal sea aching to give birth to a planet.

The building blocks seem ridiculously simple: a machine that can listen and speak. And, most of all, get it. Problem is, the latter involves the mind, still a mystical thing for science—Descartes had posed his whole mind-body dualism on that. But even the mechanical aspects were no picnic. Lonely, untiring inventors kept at it in their attics and labs for a whole two centuries. The first attempts to develop talking machines with human-like speech capabilities were made as far back as the mid-18th century, and scientists soldiered on bravely right into the 1990s. The earliest such artefact used acoustic resonance tubes to mimic the human vocal tract. Back in 1773, Christian Kratzenstein, a German-born professor of physiology in Copenhagen who had worked at the Russian Academy of Sciences, connected these tubes to organ pipes to produce human vowel sounds. Four years before him, the Hungarian-born inventor Wolfgang von Kempelen—more famous for a chess-playing automaton called The Turk that turned out to be an elaborate hoax, but notionally prefigured IBM’s Deep Blue—started work on his Speaking Machine. It took him over 20 years of tinkering with stuff like a kitchen bellows, a bagpipe reed and a clarinet’s bell, and impressive work on human phonetics, to fashion his contraption. In the mid-1800s, Charles Wheatstone built an improved version of that, using resonators made of leather that could be controlled by hand to produce different speech-like sounds.

It was finally in 1971, after computer technology had begun maturing, that a definite collaborative shape was given to proceedings when the US Defense Advanced Research Projects Agency (DARPA), under its Speech Understanding Research (SUR) project, provided funding for many research groups to work on spoken language systems. This went on to become the basis of modern speech technology. The research initiatives at the time were largely based on Allen Newell’s 1971 study that sought to incorporate knowledge systems from signal processing and linguistics into technology. The speech technologists of the day believed these systems would work only if they were able to create computational models of how humans talk—a thesis that has remained central to all work on Artificial Intelligence till now.

It took over 20 more years—till around the mid-’90s—before speech technology stepped out of the realm of science fiction and into a real world of humans and machines talking to each other. Let’s get to 1995: the birth year of this magazine. This was not only the year of Windows 95 and the short-lived Microsoft Bob, but also the year when Microsoft released its very first Speech API, or programming interface. This suddenly allowed developers to “speechify” their Windows applications by integrating existing Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS) systems.
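
That basic idea—an application handing text to a system speech engine and getting audio back—survives in today’s toolkits. As a minimal sketch, the Python snippet below uses the open-source pyttsx3 library, which on Windows speaks through the descendants of that same Speech API; the library choice and the sample sentence are illustrative, not part of the original 1995 interface.

import pyttsx3

engine = pyttsx3.init()              # picks the platform's speech driver (SAPI on Windows)
engine.setProperty("rate", 160)      # speaking rate, roughly words per minute
engine.say("Can machines talk?")     # queue a sentence for synthesis
engine.runAndWait()                  # block until the audio has been spoken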

Now, in 2020, Alexa, Cortana and Siri are seeping into common culture, turning up in TV commercials as new playthings, almost on the way to attaining the mundanity of ovens and toasters. We are by now used to novel technology popping out of the ether like warmly toasted bread—we almost expect it. But what made these talking assistants possible? A whole series of scientific breakthroughs over five decades.

Primary amongst these was the advent of statistical machine learning in the 1980s-90s that made it possible to mathematically represent linguistic and speech features in a probabilistic model. Now we had something called the Baum-Welch algorithm—which was as useful as it seems hard to pronounce. It could do a fine job of figuring out the best way to tease the knowledge features out of ‘labelled’ speech and text—that is, boil a language input down to its phonemic and syllabic units. In natural language, those features are not random. They are patterned. There are broad patterns the layperson can understand: like the simpler consonant-vowel alternation in Japanese. There are also subtler patterns. Just as colours bleed into each other in tie-and-dye prints, consonants and vowels bleed onto each other too: that’s how real language works, if you see it under a microscope. How do you parse that? Here came another formidable-sounding ally: Hidden Markov Models (HMM). A statistical toolkit for sequence analysis—sometimes called the “Lego blocks of computational sequence analysis”—HMMs could be trained on this data to indicate whether an unknown utterance was indeed the phrase predicted by the model. And voila, speech recognition turned a corner, and started speeding down the autobahn.
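
For the technically curious, here is a minimal Python sketch of the classic recipe this describes: train one Gaussian HMM per phrase on labelled acoustic feature frames, then let each model score an unknown utterance and pick the best. It assumes the open-source hmmlearn library (whose fitting routine is Baum-Welch re-estimation) and uses random numbers in place of real speech features—an illustration of the technique, not a working recogniser.

import numpy as np
from hmmlearn import hmm

def train_phrase_models(training_data, n_states=5):
    # training_data maps a phrase label to a list of (n_frames, n_features)
    # arrays of acoustic features (MFCCs, say) for recordings of that phrase.
    models = {}
    for label, utterances in training_data.items():
        X = np.vstack(utterances)                # stack all frames together
        lengths = [len(u) for u in utterances]   # remember utterance boundaries
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=100)
        model.fit(X, lengths)                    # Baum-Welch re-estimation
        models[label] = model
    return models

def recognise(models, utterance):
    # The phrase whose model gives the unknown utterance the highest
    # log-likelihood is the recogniser's answer.
    scores = {label: m.score(utterance) for label, m in models.items()}
    return max(scores, key=scores.get)

# Toy usage, with random numbers standing in for real speech features.
rng = np.random.default_rng(0)
data = {"yes": [rng.normal(0, 1, (40, 13)) for _ in range(5)],
        "no": [rng.normal(2, 1, (40, 13)) for _ in range(5)]}
models = train_phrase_models(data)
print(recognise(models, rng.normal(2, 1, (40, 13))))   # most likely prints 'no'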

Why was this a paradigm shift? Because, while earlier models did use linguistic rules and basic pattern matching, now they made the soufflé rise with heavy data. Just like with a digital camera, more data meant more clarity and definition. Data has come to be the mainstay of all language technology now. The use of HMMs was not without initial controversy; many thought they did not offer the best representation of human language. Real speech is not entirely a series of discrete events—our perception of it may be so, but the signals on a spectrogram yield quite a continuous, analog blur of vowels, consonants, rhotics (‘r’ sounds), liquids (‘l’) and gradient nasals. So there was no truly discrete data—and no proof anyway that humans process language, even if it were clean and discrete, probabilistically. Also, for many, HMMs were too simple and did not consider the finer details of higher-level knowledge. But it was this simplicity that made HMMs popular and the only practical way to do speech recognition. That is, until very recently.

For, now, speech technology was marching in lockstep with computing power and machine learning, mirroring the advances made in those wider fields. This was not straying from the data-driven paradigm—rather, getting better at it. Over time, it has become possible to process enormous amounts of data at a much faster clip. And more data meant faster learning, especially since newer advances allowed for semi-supervised and unsupervised machine learning—you no longer need to feed fine-grained, pre-set labels to the machine. The accuracy of outputs increased with the amount of data the models were trained on. Speech-based systems went from toys and research prototypes to real-world applications—airline reservation systems to virtual assistants, educational tools to information access.
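
One current embodiment of that shift is the family of speech models pre-trained by self-supervision on raw, unlabelled audio and only later fine-tuned on transcribed speech. The sketch below—assuming the Hugging Face transformers and soundfile libraries and a hypothetical 16 kHz mono recording called utterance.wav—transcribes an audio file with one such publicly released model, wav2vec 2.0; it is an illustration, not an endorsement of any particular toolkit.

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# wav2vec 2.0 was pre-trained by self-supervision on unlabelled audio and only
# then fine-tuned on transcribed speech for recognition.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio, rate = sf.read("utterance.wav")                 # hypothetical 16 kHz mono recording
inputs = processor(audio, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits         # frame-by-frame letter scores
predicted_ids = torch.argmax(logits, dim=-1)           # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])        # the recognised text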

The Neural Path

Starting in the early 2000s, deep learning as a broad framework of methods went beyond HMMs and took speech and natural language processing system-building by storm. Essentially, all deep learning systems are a specific form of artificial neural networks—which seek to simulate human brain functioning with entire orchards of artificial neurons. The idea of neuromorphic computing has been around for a long time, but it regained currency in the mid-’80s with publications emerging from the parallel distributed processing group at San Diego—and forced its way centrestage.

Deep neural networks (DNN) have fundamentally revolutionised the approach to language processing systems in general, and ASR and TTS—Automatic Speech Recognition and Text-to-Speech—in particular. DNNs can model complex non-linear relationships by treating speech as a layered composition of simpler features. This network of layers, trained on huge amounts of data, allows very complex phenomena to be modelled efficiently and lets the system make predictions that may not have been explicit in the input data. Their success notwithstanding, one major concern has been a certain lack of interpretability. How are these machines doing what they are doing? Do they really get it? Lately, it has become clear that it is possible to fool a neural speech recogniser by just adding a small amount of specially constructed noise—so-called ‘adversarial’ examples. This raises serious concerns about the robustness, security and interpretability of modern deep neural networks. It also underlines the need for explainability of the representations the machines learn. Was going purely statistical leaving a vital gap?
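
To give a flavour of how little “noise” it takes, here is a minimal sketch of the textbook fast gradient sign method on a toy PyTorch classifier—not the specific attacks mounted on commercial recognisers, and the tiny network and random “feature frame” are stand-ins of our own.

import torch
import torch.nn as nn

def fgsm_perturb(model, x, label, epsilon):
    # Fast gradient sign method: nudge every input dimension a tiny step in
    # the direction that most increases the classifier's loss on 'label'.
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), label)
    loss.backward()                                  # gradient w.r.t. the input itself
    return (x + epsilon * x.grad.sign()).detach()

# Toy usage: a stand-in 'acoustic frame' classifier with random weights.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(1, 40)                     # one hypothetical feature frame
clean_label = model(x).argmax(dim=1)       # the model's own clean prediction
x_adv = fgsm_perturb(model, x, clean_label, epsilon=0.5)
print(clean_label.item(), "->", model(x_adv).argmax().item())   # often a different class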

So we are left with a vaguely dissatisfied feeling. Deep neural networks may have pulled off an engineering feat—yielding many practical gains in commercial and even assistive technologies—but come up short on some real scientific objectives. Frontier scientists now seek a crucial change of focus: they want the field to re-engage with the cognitive study of human languages. Newer deep learning architectures have responded by reorienting themselves to that real task. Natural language processing can’t be merely an engineering challenge: it has to be attentive to the subtle switches and fibres of human cognition, how the mind models language as a symbolic interface with a world full of meaning. A truly sophisticated model would deliver knowledge from the heart of that human world. This task it is yet to fulfil: the taunt of 18th-century philosopher Denis Diderot about a speaking parrot still resonates.

There’s another crippling problem in working with algorithms that are essentially data-greedy pattern-readers and follow a strict Garbage In, Garbage Out (GIGO) principle. The accuracy, relevance and overall reach of these technologies are only as good as the data they are trained on. Which means the asymmetries that exist in other domains—economic or material—seep into language technologies as well. Most data that exists in digital form, unsurprisingly, tends to be in English…and suchlike. So technologies are mostly built for dominant languages, simply because that’s where we have a surfeit of available data. Where does that leave minor, lesser-known, understudied languages? Or even mixed-language contexts? The latter is a very pervasive, real-world thing—and a major source of variation and complexity—yet remains largely understudied. So right now we are in a place where technologies built with major languages are being ported to global markets. Result? Poor performance. Just like a delicate Japanese CD player would choke in the Indian dust, that too after you’ve paid First World rates.

The Future Speaks

In 2016, Microsoft claimed its ASR system for English had achieved human parity—that is, it could function at accuracies similar to a human’s. A few months later, IBM claimed to have broken that record by a few decimal points. Since then, there have been claims and counter-claims by many commercial giants of incrementally decreasing error rates, at least for a handful of languages. But for machines to be able to achieve human-like abilities in language requires more than just speech recognition. It requires integration of speech technology with other natural language systems in a seamless manner. It requires a broader understanding of the cognitive, communicative and socio-cultural aspects of human language use. It requires computational frameworks that can make meaningful human-machine interactions possible in all the languages of the world. That Punjabi-Karbi farmer’s conversation is a faraway goal.

Theoretically, now that deep learning-powered language models like GPT-3 are generating text that’s nearly indistinguishable from human-generated text, it doesn’t seem impossible. We may sense a not-so-distant reality where real-world versions of Arthur C. Clarke’s HAL 9000 will greet us at home, at work and in almost every sphere of human existence. Except when we consider that GPT-3 was trained on 45 Terabytes of language data from the internet, and then look at what that data really is. Only about 100 of the approximately 7,000 languages of the world have a presence on the net. English constitutes more than a quarter of digital content, followed by Chinese and Spanish, and only 10 languages account for 78 per cent of the content on the internet. So no surprise that all that flawless GPT-3 output we have is (mostly) English.
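
GPT-3 itself sits behind a commercial API, but its smaller, freely downloadable predecessor GPT-2 shows the same trick of statistical text generation. A minimal sketch, assuming the Hugging Face transformers library; the prompt is ours, and the continuation will differ on every run.

from transformers import pipeline

# Download a small pre-trained language model and let it continue a prompt.
generator = pipeline("text-generation", model="gpt2")
result = generator("Can machines talk?", max_new_tokens=40, num_return_sequences=1)
print(result[0]["generated_text"])      # the prompt plus the model's continuation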

What would happen to global language use in such a scenario? Would large populations of future generations experience a disruptive language shift because ubiquitous technology can only be accessed in a handful of dominant languages? Would we witness an accelerated rate of language death because HAL will not talk to me in my mother tongue? As we peek into the future, what we see is the inequalities in every sphere bleeding onto a domain that should ideally set us free. Unscrupulous and ultimately unscientific approaches will all but ensure that—the intensely powerful architectures of speech technologies will deliver that coup de grace with greater ferocity than ever before. It is therefore imperative that the future be reconstructed now by shifting the design goals—bring in the cognitive, bring in the human, and curate the data engineering process to reflect the future we would like to have rather than a future we would be regressing into.

Kalika Bali and Indranil Dutta

Bali, a researcher at Microsoft Research India, is passionate about developing technology for Indian languages. Dutta, a professor at Jadavpur University, is a computational linguist.