Artificial Intelligence

An extract from an essay written in May 2001

Speaking Computers

This essay outlines the progress which has been made in creating programs which understand and produce natural language. It also identifies ways in which such programs will need to be improved in order to mimic more faithfully the human language user.


  1. Introduction
  2. Outline of progress
  3. Artificial Intelligence Programs
  4. Problems to overcome
  5. Improvements which are necessary
  6. Summary
  7. What does the future hold?

1. Introduction
Artificial Intelligence (AI) is concerned with using computers to simulate human thought processes and with devising computer programs that act intelligently and can adapt to changing circumstances. A key aim in AI is to enable humans to interact with computers using natural language - the ultimate in user-friendliness. This is not just to satisfy a desire to make computers more human-like; there are practical reasons. Speech is natural - we can speak before we can read or write. Speech is also efficient - people can generally speak many times faster than they can write or type. And speech is flexible - we do not have to touch or see anything to carry on a conversation.

Through its short modern history, progress in the field of AI has been slower than first predicted. Since its birth almost fifty years ago there have been a variety of AI programs. We will first look at the progress made over the last few decades and then look more closely at various programs developed over that period. We will see how these programs became more sophisticated in their understanding and production of language as our understanding of language and linguistics improved and as computing technology advanced. We will also look at the problems which need to be overcome, and then identify areas where these programs must improve in order to achieve one of the aims of AI: mimicking the human language user.

2. Outline of Progress
In the 1940s the computer was invented, providing an electronic means of processing data and making available the technology necessary for AI.

In the 1950s Newell and Simon (1955) developed the 'Logic Theorist', considered by many to be the first AI program. This problem-solving program represented each problem as a tree and attempted to solve it by selecting the branch most likely to lead to the correct conclusion. It had a great impact on the field of AI and on the general public. In 1956 John McCarthy (regarded as the father of AI) coined the phrase 'Artificial Intelligence' at a conference on machine intelligence, a conference which laid the groundwork for future AI research. The 'General Problem Solver' program, also developed by Newell and Simon, in 1957, was capable of solving a wider range of common-sense problems. The LISt Processing (LISP) language, developed by McCarthy in 1958, became the language of choice for most AI researchers.

In the 1960s, Marvin Minsky of MIT demonstrated that computer programs could solve spatial and logic problems when confined to a small subject area. Programs that appeared included STUDENT, which could solve algebra problems; SIR, which could understand simple English sentences; and ELIZA (1966), which could simulate a conversation by applying a simple syntactic analysis to the input and producing seemingly intelligent output. AI researchers also began to use the ideas of linguists, in particular Noam Chomsky, who suggested that language obeys a set of rules that can be expressed in mathematical terms.

The 1970s saw the advent of expert systems, which could mimic the performance of human experts in fields such as medicine and the stock market. Expert systems rely on a large knowledge base and a program that searches the knowledge base for the best possible answer, but they were flawed wherever common sense or intuition was necessary. In 1972 SHRDLU was developed, a reasonably intelligent program but one strictly limited to a single domain (a 'toy world'). The program could accept, and correctly respond to, natural language text commands of a greater degree of complexity than had been seen before. However, SHRDLU still lacked the necessary semantic and pragmatic analysis of language. In 1976 Ray Kurzweil developed the first print-to-speech reading machine for the blind, which used optical character recognition technology to scan and read aloud any printed material.

The 1980s saw further advances in expert systems, as more finance became available for research and development once the corporate sector developed greater interest. Interest in neural networks also resurfaced, as computer technology advanced and building neural computers became more realistic. Neural networks are an approach to machine learning in which programs 'learn' by example; this type of software is used in speech and character recognition.

The 1990s saw algorithms - sets of instructions for problem solving - used to a greater degree in speech and language processing. The rise of the World Wide Web emphasized the need for language-based information retrieval and information extraction. 1997 saw the introduction of continuous-speech dictation software, and in the same year Deep Blue defeated Garry Kasparov at chess. In 1998 the first continuous speech recognition program with the ability to understand natural language commands was developed by Ray Kurzweil.

3. Artificial Intelligence Programs
One of the first programs to be developed, in 1966, which understood and produced natural language was ELIZA (named after a character in Shaw's Pygmalion).

ELIZA simulates conversation. There were various versions of the program; one simulated a conversation between a psychotherapist and a patient, with ELIZA taking the place of the psychotherapist. The program works by a simple syntactic or grammatical analysis of the input, without any semantics. It is pre-programmed to respond to certain words and phrases, especially those to do with feelings and emotions, and gives the impression that it understands more than it actually does. However, it is easy to reveal ELIZA's lack of understanding by asking simple questions, which cause confusion and show that it never really understood what was being discussed.
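
ELIZA's keyword-and-template approach can be sketched in a few lines. The rules below are illustrative inventions, not Weizenbaum's actual script:

```python
import re

# A minimal ELIZA-style responder: purely syntactic pattern matching,
# with no semantics. Each rule pairs a trigger pattern with a template.
RULES = [
    (re.compile(r"\bI feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
    (re.compile(r"\bI am (.*)", re.IGNORECASE), "How long have you been {0}?"),
]
DEFAULT = "Please go on."

def respond(utterance):
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return DEFAULT  # no rule fired: the program has no real understanding
```

Asking `respond("I feel sad today")` yields a seemingly thoughtful reply, while any input outside the rule set exposes the lack of understanding with the stock "Please go on."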

SHRDLU (named after the second six most frequently used letters in English), the classic blocks-world system that engaged in convincing natural language dialogues, was created by Terry Winograd in 1972. SHRDLU is a program designed to understand language. It "carried on a dialogue with a person (via teletype) concerning the activity of a simulated robot arm in a tabletop world of toy objects (often referred to as the 'blocks world')". The program could answer questions, carry out commands, and incorporate new facts about its world. It displayed the simulated world on a video screen, showing the activities it carried out as it moved the objects around.

SHRDLU is reasonably intelligent within the limited domain of analysing a conversation about its 'toy world': blocks of various shapes, sizes and colours that sit on a table and can be picked up and moved around. It is a far more sophisticated program than ELIZA; it can be instructed to move blocks around and can answer questions about the colour, shape and position of the blocks. The computer keeps track of the blocks in its memory - it does not physically have to move them, though the system could be modified to do so if required. It provided a key illustration of 'symbolic AI'. SHRDLU uses inference: asked to place a red block on top of a green block, and a blue block on top of the red block, it could infer, when asked, that the green block was at the bottom. It can actually work things out. It can use knowledge of objects to make decisions; for example, it would not place a square block on top of a conical block, as it knew the square block would fall off. Its use of language depends on its having been told (programmed with) certain things about shape, colour and geometry, and its memory, or knowledge system, allows it to work out various arrangements in its mini universe. A vast number of rules, facts and relationships have been input into that memory.
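
The state-tracking and inference described above can be sketched as follows. The class and method names are a hypothetical illustration, not Winograd's implementation, but the red/green/blue example mirrors the scenario in the text:

```python
# A toy sketch of SHRDLU-style state tracking: the program keeps the
# arrangement of blocks in memory and infers which block is at the
# bottom of a stack by following the chain of supports.
class BlocksWorld:
    def __init__(self):
        self.on = {}  # block -> what it rests on ("table" or another block)

    def put(self, block, support):
        self.on[block] = support

    def bottom_of_stack(self, block):
        # Follow the supports downward until we reach something on the table.
        while self.on.get(block, "table") != "table":
            block = self.on[block]
        return block

# The example above: red on green, blue on red -> green is at the bottom.
world = BlocksWorld()
world.put("green", "table")
world.put("red", "green")
world.put("blue", "red")
```

Asking `world.bottom_of_stack("blue")` returns `"green"` - a fact the program was never told directly but works out from the stored relationships.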

Expert systems: another advance of the 1970s was the advent of expert systems, which predict the probability of a solution under set conditions. They encapsulate the specialist knowledge gained from a human expert and apply that knowledge automatically to make decisions, mimicking the performance of the human expert in a specific field. The applications in the marketplace were extensive, and over the course of ten years expert systems were introduced to forecast the stock market, analyse mineral-exploration data and help doctors diagnose disease. This was made possible by the systems' ability to store conditional rules as well as large amounts of information.

For example, the knowledge doctors use to diagnose a disease can be encapsulated in software. The process of acquiring the knowledge from the experts and their documentation, and successfully incorporating it in the software, is called knowledge engineering, and it requires considerable skill to perform well. MYCIN is an example of an expert system: a medical diagnostic package that helps doctors diagnose various bacterial infections. Other applications of expert systems include customer service and helpdesk support.
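
The conditional-rule storage described above can be sketched as a toy rule base: each rule pairs a set of conditions with a conclusion, and every rule whose conditions are all present fires. The "medical" rules below are invented for illustration and bear no relation to MYCIN's real rule base:

```python
# A minimal sketch of an expert system's rule base. Real systems such
# as MYCIN used hundreds of rules, together with certainty factors.
RULES = [
    ({"fever", "stiff neck"}, "possible meningitis"),
    ({"fever", "cough"}, "possible respiratory infection"),
]

def diagnose(symptoms):
    present = set(symptoms)
    # Fire every rule whose conditions are all among the reported symptoms.
    return [conclusion for conditions, conclusion in RULES
            if conditions <= present]
```

Given `["fever", "cough"]` the system suggests a respiratory infection; the knowledge engineering effort lies in eliciting and encoding rules like these from human experts.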

Some simpler forms of expert system are the 'help' facilities incorporated into operating systems and application programs: computer and network troubleshooting, the autocorrect and grammar-check features in word processors (earlier word processors had only spell-check features; grammar checking adds a syntactic dimension), the Microsoft Office assistant, and search engines on the World Wide Web. While much development work has been put into these, it is clear that they are generally unable to pinpoint what the user is asking. A simple search in a search engine will generally give a best match of associated words from a database, with categorisation and probability features.

ThoughtTreasure (1996), whose development is ongoing at the U.S. company Signiform, brings together much of the research and experience in linguistics and AI to date. It has a database with over 20,000 concept entries and 50,000 English and French words and phrases. It includes a syntactic and semantic parser (which describes the words in relation to the sentence and breaks the sentence into components to test conformity to a grammar), an English and French generator, and an interface enabling the user to converse with the program. It makes use of the latest findings in linguistics and artificial intelligence and has the long-term goal of understanding and producing natural language. The program, however, needs to be scaled up: its designers think that by adding more understanding agents to cope with the many possible state transitions, it will eventually be able to understand. But this work is slow and time-consuming.

Galaxy, which was started a few years ago and whose development is ongoing at the M.I.T. Laboratory for Computer Science, has five main functions: speech recognition, language understanding, information retrieval, language generation and speech synthesis. It is capable of communicating in several languages, and is almost as quick as two people having a normal conversation. However, even though the information it provides is up to date and can be accessed over the telephone, it can only deal with limited domains of knowledge, such as weather forecasts (500 cities worldwide) and flight schedules (4,000 commercial flights in the U.S.A. per day).

When you ask Galaxy a question, a server matches your spoken words to a stored library of phonemes - the irreducible units of sound that make up words in all languages. A ranked list of candidate sentences is then generated: the computer's guesses at what you actually said. To make sense of the best-guess sentence, the Galaxy system uses another server which applies basic grammatical rules to parse the sentence into its parts: subject, verb, object and so forth. It then formats the question as a semantic frame, a series of commands that the system can understand.

At this point, Galaxy is ready to search for answers. A third server converts the semantic frame into a query specially formatted for the database where the requested information lies. The system determines which database to search by analysing the user's question. Once the information is retrieved, it is arranged into a new semantic frame and this frame is then converted into a sentence in the user's language. Finally, a commercial speech synthesizer on yet another server turns the sentence into spoken words.
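
The semantic-frame idea - reducing a parsed question to a small structured record that later servers can act on - can be sketched as follows. The frame fields and the toy keyword spotting are assumptions for illustration, not Galaxy's actual format:

```python
# Reduce a user's question to a minimal "semantic frame": a structured
# record of what was asked, ready for conversion to a database query.
def to_frame(sentence):
    words = sentence.lower().rstrip("?").split()
    frame = {"domain": None, "city": None}
    if "weather" in words:
        frame["domain"] = "weather"
    if "in" in words and words.index("in") + 1 < len(words):
        frame["city"] = words[words.index("in") + 1]
    return frame
```

For "What is the weather in Boston?" this yields `{"domain": "weather", "city": "boston"}`, which a later stage can format as a query against the weather database.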

Galaxy can currently understand, interpret and respond correctly to about 80 percent of the queries from first-time users. The other 20 percent of queries are too ambiguous to resolve.

Cog (short for cognition) is a project currently under development by Rodney Brooks at MIT to build a humanoid robot which can interact with humans in a versatile manner in real time. Cog's talents will include speech recognition, using special-purpose signal-analysing software which should give Cog a fairly good chance of discriminating human speech sounds, and probably the capacity to distinguish different human voices. Cog will also have speech synthesis hardware and software, and eye-coordinated manipulation of objects.

Cog's nervous system is a massively parallel architecture capable of simultaneously training up an indefinite number of special-purpose networks or circuits. One talent that the researchers hope to teach Cog is a rudimentary capacity for human language. It is hoped that Cog will be able to design itself in certain respects, learning from infancy and building its own representation of its world in the terms that it innately understands. To be capable of interacting intelligently with a human being on human terms, Cog must have access to literally millions if not billions of logically independent items of world knowledge. Either these must be hand-coded individually by human programmers, or some way must be found for the artificial agent to learn its world knowledge from real interactions with the real world.

4. Problems to overcome
Today, natural language processing can still be viewed as being in its infancy. Computational linguists are only starting to learn how to capture word meanings in a computer and how to build programs that track the meanings of sentences and conversations. The knowledge of language a computer needs to accomplish this can be separated into six categories: phonetics and phonology, morphology, syntax, semantics, pragmatics and discourse.

The fundamental problem of artificial intelligence is how to get a computer to understand what we say and carry out the task the user requires. Computers can understand computer-programming languages, which have only mild ambiguities, such as the order of evaluation of arguments to a function. In contrast, natural language is highly ambiguous. By ambiguity we mean that some words have multiple meanings and that multiple alternative linguistic structures can be built for a given series of words (a sentence or phrase). Ambiguities arise in any of the six categories of linguistic knowledge. For instance, at the morphological level 'I' has at least the following uses in English:

  • The subject pronoun: I am reading a book.
  • The letter: I is the ninth letter of the alphabet. (A word processor's grammar check may suggest changing 'is' to 'am'.)
  • The Roman numeral: Napoleon I was born in Ajaccio.
  • An abbreviation for electric current, iodine, an integer, etc.

At the semantic level, the sentence 'the teacher taught the pupil a lesson' could mean that the teacher educated the pupil, as in a normal teacher-pupil relationship, or that the teacher disciplined the pupil for some wrongdoing. Many words have multiple meanings, and the intended meaning depends on the context of the situation. Also, some phrases are used metaphorically or figuratively.

In addition, the word 'taught' (to educate, impart knowledge) in the sentence above sounds similar to 'thought' (idea, notion), which causes further ambiguity for computers trying to recognise natural language.

Words can also be syntactically ambiguous in their part of speech, as some words can be used as either verbs or nouns. For instance, in the sentence 'He gave her a ring', 'ring' could be a verb or a noun (a telephone call or a piece of jewellery).

As we see above, a sentence which has a single clear meaning to a human can have numerous possible parses (and meanings) for a computer. Even if the computer is programmed with all these meanings, it is difficult for it to determine which one the user intends. Understanding and interpreting natural language requires much more knowledge than was thought in the early days of AI. A full interpretation of a sentence by humans requires an analysis of its syntactic, semantic and pragmatic dimensions, and computers must be capable of the same levels of interpretation if they are to mimic humans.

We will now look at areas requiring improvement before computers can imitate humans with regard to understanding and producing language.

5. Improvements which are necessary
For a computer to be able to interact successfully with humans using natural language, advanced technology in the field of speech and language processing is required. Speech and language processing can be divided into the following areas:

Speech Recognition
Speech-recognition software is now widely available. It is not a perfected technology, but competition in the field has led to vast improvements: the most recent software versions recognise vocabularies of tens of thousands of words, transcribe up to a hundred words per minute and handle extensive idioms. More research is still needed, especially in distinguishing human speech from other sounds and in enabling the computer to recognise different human voices.

Natural Language understanding
However, we need to be able to do more than convert audible signals to digital symbols and back again. We must continue our efforts to incorporate language-understanding programs into our systems so that computers can comprehend the meaning and context of spoken words. Computers must do more than what we say; they must do what we mean, through syntactic, semantic and pragmatic analysis of sentences. For the ambiguities that arise throughout the categories of language, computers must be able to resolve them using part-of-speech tagging and word-sense disambiguation.

Deciding whether the phrase 'boil the kettle' means to bring the water in the kettle to 100 °C, or to bring the metal and other kettle components to their boiling points, can be resolved by word-sense disambiguation. The computer must analyse a sentence into noun and verb phrases, dividing these phrases into smaller units such as nouns, verbs and adjectives (part-of-speech tagging) so that the function of each morpheme is clearly established. Probabilistic parsing is a method used to resolve syntactic ambiguity: probabilities are assigned to the possible parses of a sentence, and the parser selects the parse with the highest probability of being correct (that is, of being the meaning intended). However, resolving these problems is still a huge task: there are so many possibilities, so many wrong paths the program can take which need to be repaired, and so many new situations which always remain to be added.
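
At its simplest, probabilistic parse selection amounts to choosing the highest-probability analysis. A minimal sketch for the 'He gave her a ring' example; the probability figures are invented, whereas in practice they are estimated from large tagged corpora:

```python
# Two candidate analyses of "ring" in "He gave her a ring", each with
# an (invented) probability of being the intended reading.
candidate_parses = [
    ({"ring": "noun"}, 0.92),  # "a ring" = a piece of jewellery
    ({"ring": "verb"}, 0.08),  # "a ring" = a telephone call
]

def best_parse(candidates):
    # Probabilistic parsing: select the parse with the highest
    # probability of being the meaning intended.
    return max(candidates, key=lambda pair: pair[1])[0]
```

Here `best_parse(candidate_parses)` chooses the noun reading, which is far more likely after the determiner 'a'.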

Information Retrieval and Extraction
The computer needs to convert the request into a specially formatted query, determine which database to search and extract the pertinent information from that database. Computers can complete this task reasonably well: once sentences are syntactically parsed so that the correct meaning is understood, the correct data can be extracted. However, programs like Galaxy are currently confined to one domain at a time (e.g. weather); programs need to be able to draw data from all domains without being specifically asked to do so and without further input from the user. In reality there should be only one domain, containing vast amounts of knowledge, and advances in processing speed and memory should provide acceptable search speeds with no waiting period, allowing conversations to happen in real time. Also, the knowledge-acquisition process needs to be automated, so that agents (software which runs without constant human supervision) can learn the vast amounts of world knowledge; otherwise vast amounts of information would have to be input manually. It seems that however deep you go in trying to capture human knowledge, there are always more details to represent.
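
Determining which database to search by analysing the user's question can be sketched as simple keyword routing; the domain names and keyword sets below are invented for illustration:

```python
# Route a query to a database domain by spotting characteristic keywords.
DOMAIN_KEYWORDS = {
    "weather": {"weather", "forecast", "temperature", "rain"},
    "flights": {"flight", "airline", "departure", "arrival"},
}

def route(query):
    words = set(query.lower().split())
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if words & keywords:  # any keyword for this domain appears
            return domain
    return None  # no domain matched: the system must ask the user to clarify
```

The limitation discussed above is visible here: each query is sent to a single domain, whereas a truly general system would draw on all of its knowledge at once.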

Natural Language Generation
In natural language generation a computer automatically creates natural language from a computational representation. Once the information is retrieved, the program must plan and organise it, deciding on the scope of the information to generate, the lexical content and the syntactic structure. It then generates the information in the user's natural language, starting with the phrases and morphemes of a sentence and then translating these morphemes into phonemes. This technology is currently being used for automatic weather reports and for explanations in expert systems, but is still largely under development.

Speech Synthesis
The computer must be able to produce properly formed sentences and to engage in dialogue with the user to clarify ambiguities or mistakes in the user's questions. The dialogue must be natural-sounding, highly intelligible speech. It achieves this by using a database of recorded speech, obtained by recording someone whose voice we wish to emulate. This person records a set of words and phrases, which are then run through a computer program that stores them in a database in a form a synthesizer can access. When speech is synthesized, the longest continuous strings of speech are appended together. If someone is synthesizing a phrase or sentence that has been stored in the database, the resulting synthesized speech will sound as natural as recorded speech. If, however, the phrase or sentence is more unusual, shorter portions of speech are appended together; the result sounds slightly less natural, but it is still in the same voice as the person who was recorded. Synthesizing speech this way allows all possible words, phrases and sentences to be synthesized even though only a limited number of words and phrases have been recorded.
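
The "longest continuous strings" idea is essentially greedy unit selection. A minimal sketch, with plain strings standing in for the recorded audio clips:

```python
# Concatenative synthesis sketch: greedily append the longest recorded
# spans that match the front of the target phrase, falling back to
# shorter and shorter units when no long match exists.
RECORDED = {"good morning", "good", "morning", "everyone"}

def select_units(phrase):
    words, units, i = phrase.lower().split(), [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span first
            span = " ".join(words[i:j])
            if span in RECORDED:
                units.append(span)
                i = j
                break
        else:
            units.append(words[i])  # unrecorded word: least natural case
            i += 1
    return units
```

For "Good morning everyone" the whole recorded phrase "good morning" is reused, so the output sounds as natural as the original recording; an unusual sentence would be stitched from many short units and sound slightly less natural.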

6. Summary
Significant progress has been made both in fundamental research (search, logic, machine learning, knowledge representation, computational linguistics, robotics) and in deployed applications (speech recognition and synthesis, financial expert systems, chess-playing programs).

However, much more research is needed in natural language understanding, in resolving ambiguities (syntactic, semantic and pragmatic). Further research is needed in using models such as state machines (Markov models), formal rule systems (regular-grammar and context-free-grammar rule systems) and logic (predicate calculus). State-space search algorithms and dynamic programming algorithms are some of the other methods used. All of these techniques are boosted by probability theory, where we are basically concerned with finding the most probable sequence of words given a sentence or phrase: there are many different possible sequences, and we assign probabilities to each (e.g. in speech recognition, selecting the most probable sequence from those proposed by the speech analyser).
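
Finding the most probable word sequence can be sketched with a toy bigram model, applied to the 'taught'/'thought' confusion from section 4. All the probabilities are invented for illustration:

```python
# A toy bigram model: the probability of a sentence is approximated as
# the product of probabilities of its consecutive word pairs.
BIGRAM = {
    ("the", "teacher"): 0.01, ("the", "pupil"): 0.02,
    ("teacher", "taught"): 0.20, ("teacher", "thought"): 0.05,
    ("taught", "the"): 0.10, ("thought", "the"): 0.10,
}

def score(words):
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= BIGRAM.get(pair, 1e-6)  # unseen pairs get a tiny floor value
    return p

# Two hypotheses a speech analyser might propose for the same audio.
candidates = [
    "the teacher taught the pupil".split(),
    "the teacher thought the pupil".split(),
]
best = max(candidates, key=score)
```

Because 'taught' is more likely than 'thought' after 'teacher' in this (invented) model, the recogniser selects the intended sentence even though the two words sound alike.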

Also, the knowledge-acquisition process must somehow be automated so that the computers can read and learn on their own.

There are basically two schools of thought, with different approaches to overcoming these problems. The 'top-down' approach sees human thought processes as the result of rule-based symbol processing in the brain: the brain manipulates symbols according to set rules. The chess-playing program Deep Blue applied these methods when it dethroned chess champion Garry Kasparov. However, this approach requires enormous amounts of manual coding.

In contrast, the 'bottom-up' approach seeks to build a mechanism that can evolve useful systems by itself. Using neural networks, in which the computing structure of neurons in the brain is emulated, statistical language models can now perform many tasks once thought to require manually constructed rules, such as word-sense disambiguation. The hope is that machines will develop intelligence by learning from their surroundings - as humans do.

Neural networks are an approach to machine learning that uses simple processing units (neurons), organised in a layered and highly parallel architecture, to perform arbitrarily complex calculations. Learning is achieved through repeated minor modifications to selected neurons, which results in a very powerful classification system. Successful applications include stock market analysis and character and speech recognition.
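
Learning through repeated minor modifications can be illustrated with the simplest possible neural network: a single artificial neuron (a perceptron) trained by example to compute logical AND:

```python
# A single neuron trained by the perceptron rule: after each example,
# each weight is nudged slightly in the direction that reduces the error.
def train_perceptron(examples, epochs=20, lr=0.1):
    w0 = w1 = bias = 0.0
    for _ in range(epochs):
        for (x0, x1), target in examples:
            output = 1 if w0 * x0 + w1 * x1 + bias > 0 else 0
            error = target - output
            w0 += lr * error * x0
            w1 += lr * error * x1
            bias += lr * error
    return w0, w1, bias

# Learning by example: the four input/output pairs of logical AND.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w0, w1, b = train_perceptron(AND)

def predict(x0, x1):
    return 1 if w0 * x0 + w1 * x1 + b > 0 else 0
```

No rule for AND is ever coded explicitly; the correct behaviour emerges from the accumulated small weight adjustments, which is the sense in which such networks 'learn' by example.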

So some researchers see the solution in a system modelled on neural nets - a type of learning system. Others believe that symbol manipulation is the only viable approach. Perhaps a combination of both approaches is necessary to unravel the problems.

7. What does the future hold?
Can the human language user be replicated by a computer? Some researchers regard the intricate workings of the brain as an algorithm, a step-by-step program, which consequently can and will be run on a computer. But it will take decades, and will probably be developed on an artificial brain of silicon and plastic. Is there a critical thought level, like a critical mass? And if so, will mechanical or electronic minds start up by themselves when they reach a certain level of complexity? Perhaps they too will have a language acquisition device - that innate potential to develop language which humans are born with. Other researchers say that since we don't yet understand the basis of human consciousness or intelligence, how can we create it in machines? And the philosopher John Searle insists that the mere computational manipulation of symbols and successful running of algorithms does not constitute understanding and thinking.

How far are we from a computer that could coherently and articulately participate in dialogue with humans and exhibit Hockett's characteristics of language, like HAL in '2001: A Space Odyssey'? Perhaps developing computers that can understand and produce natural language will not happen until the computer can also mimic other cognitive faculties, including emotions, dreams, feelings and desires. Or perhaps these cognitive faculties will appear in parallel with the development of language understanding in computers.

On the other hand, with our artificial limbs, implants and pacemakers of today, and with the mechanical hearts, cochlear and retinal implants of tomorrow, perhaps we are already part cyborg. How long will it be before we implant integrated circuits into our brains, to enhance or repair functions? We will become cyborgs. There will be no distinction between man and machine. Homo Sapiens will vanish and evolve into Robo Sapiens - with far superior intelligence and unlimited life expectancy.