Measuring the IQ of intelligent machines

How can we know if intelligent machines are getting smarter? The simple answer is by measuring their IQ. Nevertheless, there are some obvious, and perhaps some less obvious, problems with such an approach. The most obvious hindrance is the plethora of approaches and methodologies that AI technologists follow in building their intelligent machines.

On one end of the spectrum are the “symbolists”, those who develop algorithms that manipulate symbols in universal Turing machines (such as your PC). Their most successful products so far are called “expert systems”. At the other end there are the “connectionists”; they mimic the human brain by building artificial neural networks. Many encouraging developments have come from connectionist architectures, mostly applied in pattern recognition and machine learning. Other technologists follow hybrid approaches that fall between those two extremes.

The problem with comparing the IQ of these diverse machines is this: if one takes I to be the information input to a machine and O its output, one lacks a common T, where T is the transformation of I into O. Any proposed universal method for testing the IQ of machines must therefore include a caveat: the method applies to all machines irrespective of their “internal” T. This means we agree to test for intelligence regardless of what happens “inside” the machine, which is equivalent to testing the IQ of intelligent biological beings that evolved on different planets.
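To make this concrete, here is a minimal sketch (in Python, with two toy “machines” invented purely for illustration) of what testing irrespective of T means: the machines have entirely different internals but receive the same inputs and are scored only on their outputs.

```python
# A minimal sketch of "black-box" testing: two hypothetical machines share
# the same inputs (I) and are scored only on their outputs (O), even though
# their internal transformations (T) differ completely.

def symbolist_machine(x: int) -> int:
    # T is explicit symbol manipulation: an enumerated rule table
    lookup = {n: n * 2 for n in range(10)}
    return lookup[x]

def connectionist_machine(x: int) -> int:
    # T is a (trivially "trained") one-weight network: output = weight * input
    weight = 2.0
    return round(weight * x)

def black_box_iq_test(machine, cases) -> float:
    """Score a machine purely on outcomes, ignoring its internal T."""
    correct = sum(1 for x, expected in cases if machine(x) == expected)
    return 100 * correct / len(cases)

cases = [(x, 2 * x) for x in range(10)]
for m in (symbolist_machine, connectionist_machine):
    print(m.__name__, black_box_iq_test(m, cases))  # both score 100.0
```

The test harness never inspects how either machine computes its answer; identical outcomes earn identical scores.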

The second, equally profound stumbling block for a universal IQ test has to do with definitions. What do we mean by the word “intelligence” anyway? Various people mean various things, so we must be specific. To get past the semantics of intelligence, it is helpful to remember the original aims of AI. Generally speaking, AI aims to achieve four broad objectives for intelligent machines:

1. Thinking humanly, i.e. being conscious of thinking.

2. Acting humanly, i.e. making decisions and taking actions by applying evolved moral reasoning, as well as appearing “human-like” in the action.

3. Thinking rationally, i.e. processing information in a rational manner.

4. Acting rationally, i.e. producing outcomes that comply with rational reasoning.

Most of the serious philosophical arguments bedevil the first two objectives, while a few milder ones take issue with the third. The fourth, however, the purely behavioral objective (wisely chosen by Alan Turing when he proposed his famous test), is where AI delivers its best. A machine may be said to act rationally if it appears to do so to human observers. It follows that if we endeavor to apply a universal method for testing machine IQ, we must ignore “how” the machine works; if we do not, we will fall prey to the philosophical wrangling around objectives 1 to 3.

So, in order to arrive at a universal IQ test, we must (a) ignore the internal mechanism by which the machine transforms inputs to outputs, and (b) measure only the degree to which its outcomes are rational. The next question is: how bad is that? It turns out that it is not bad at all. To see why, let us look at what happens when human beings test their own IQ.

The measurement of human intelligence was conceived in 1905 by the French psychologist Alfred Binet and his assistant Theodore Simon. The French government of the time wished to ensure that adequate education was given to mentally handicapped children, so the two psychologists were commissioned to find a way to measure the “beautiful pure intelligence” of the children. Binet observed that these children solved problems in the same way that younger, “normal” children did, so he tested the possibility that intelligence was related to age. The tests that he and Simon developed were thus adapted to age: if a child was able to answer the questions answerable by the majority of children aged 8, but unable to answer the respective questions for children aged 9, she was said to have a “mental age” of 8.

IQ (Intelligence Quotient) was therefore defined as: IQ = 100 × (mental age / chronological age)
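As a worked example (with made-up numbers): a 10-year-old who answers at the level of a typical 8-year-old scores 100 × 8/10 = 80, while one who answers at the level of a typical 12-year-old scores 120. A one-line sketch of the same arithmetic:

```python
def iq(mental_age: float, chronological_age: float) -> float:
    """Binet-Simon ratio IQ: 100 * (mental age / chronological age)."""
    return 100 * mental_age / chronological_age

print(iq(8, 10))   # 80.0  - performing below chronological age
print(iq(10, 10))  # 100.0 - exactly "normal"
print(iq(12, 10))  # 120.0 - performing above chronological age
```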

Plotting the distribution of these measurements (the number of individuals tested at each IQ score, for a given chronological age), one gets a “bell curve” with most individuals falling in the middle; that middle area of the curve is defined as “normal”.
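A quick simulation illustrates the shape, assuming (purely for illustration) that the mental ages of 10-year-olds are normally distributed around 10:

```python
import random

# Simulated ratio IQs for 10-year-olds; the mean (10) and spread (1.5) of
# mental age are assumed parameters chosen only to illustrate the bell curve.
random.seed(0)
scores = [100 * random.gauss(10, 1.5) / 10 for _ in range(100_000)]

# Crude text histogram: most individuals cluster around IQ 100.
for lo in range(55, 145, 10):
    n = sum(lo <= s < lo + 10 for s in scores)
    print(f"{lo:3d}-{lo + 9:3d} | {'#' * (n // 1500)}")
```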

Modern tests of human IQ follow the same principles established by Binet and Simon. They ignore internal brain mechanisms (the “T” of intelligent machines) and are interested only in outcomes (the answers to the questions). In developing a universal machine IQ test that tests and compares only rational outcomes, we are simply doing what humans already do for themselves.

Nevertheless, human IQ testing is riddled with controversy. Since its inception it has been noted that defining “normal” depends heavily on the statistical sample chosen for the measurement. For example, white, middle-class European children are better fed and better educated than poor black children in rural Africa. This difference in lifestyle skews IQ measurements because IQ testing does not factor in social circumstances, which modern neuroscience has shown to have an enormous impact on brain development.

Notably, Binet and Simon’s approach was first criticized by the Russian psychologist L.S. Vygotsky, who made the distinction between “really developed mental functions” and “potentially developed human functions”; IQ tests measure mostly the former. Since Vygotsky, many have taken issue with IQ testing, most notably H. Gardner, who suggested not one but seven different types of human intelligence, including linguistic, musical, and mathematical.

Measuring machine IQ may stumble upon similarly disputable definitions of “normalcy”. As machines develop further, issues of cultural influence may also creep in. Will Japanese robots score higher marks because Japanese culture is more robot-friendly?

An interesting approach to a universal test of machine intelligence has been proposed by Shane Legg and Marcus Hutter. Seeking to measure machine intelligence in a pure and abstract form, the two researchers suggest measuring the outcomes of intelligent agents’ performance in a probability game, scored by which strategies yield the best results and the biggest rewards over time.
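Formally, Legg and Hutter define the universal intelligence of an agent π as Υ(π) = Σ_μ 2^(−K(μ)) · V_μ^π: the agent’s expected total reward V_μ^π in every computable reward-generating environment μ, weighted by the Kolmogorov complexity K(μ) of that environment, so that simpler environments count for more. Since Kolmogorov complexity is incomputable, any practical test can only approximate the sum. The toy sketch below conveys the flavor only; the environments, complexity values, and agents are all invented for illustration and are not Legg and Hutter’s actual test.

```python
import random

# Toy sketch of the Legg-Hutter idea: score an agent's expected reward
# across several environments, each weighted by 2**(-complexity).
# The real definition ranges over ALL computable environments, weighted
# by their (incomputable) Kolmogorov complexity.

def env_always_left(action):      # a very simple environment
    return 1.0 if action == "left" else 0.0

def env_coin_flip(action):        # reward depends on a fair coin
    return 1.0 if action == random.choice(["left", "right"]) else 0.0

ENVIRONMENTS = [(env_always_left, 2), (env_coin_flip, 5)]  # (env, assumed K)

def universal_score(agent, episodes=10_000):
    total = 0.0
    for env, k in ENVIRONMENTS:
        avg = sum(env(agent()) for _ in range(episodes)) / episodes
        total += 2 ** -k * avg   # simpler environments count for more
    return total

random.seed(0)
print(universal_score(lambda: "left"))                           # ~0.27
print(universal_score(lambda: random.choice(["left", "right"]))) # ~0.14
```

The agent that consistently earns reward outscores the one acting at random, which is exactly the ordering the measure is meant to capture.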

Their suggestion appears viable in the context already defined, namely that we must be satisfied with measuring rational outcomes only and not ask the difficult “how” question. Sticking to AI objective 4, we can agree to define “universal intelligence” for machines in terms of acting rationally only.

Their proposition encapsulates an evolutionary dimension too: living creatures tend to seek rewards (food, mates, authority) while refining their strategies over time. By applying Legg and Hutter’s probability game at various stages of development in machine intelligence, one can compare different machines now, as well as monitor the development of machines over time. If you worry about machines becoming more “intelligent” than humans in the future, Legg and Hutter’s measurements should provide ample warning of the forthcoming “Singularity”.
Reference: Shane Legg and Marcus Hutter, “Universal Intelligence: A Definition of Machine Intelligence”, work supported by SNF grant 200020-107616.