The AI Control Problem

How does a physically weaker species keep control of members of physically stronger species?

For example, how do humans avoid being overwhelmed by gorillas, by elephants, or by tigers?

Three answers:

By taking advantage of technologies, such as spears, tranquiliser guns, and specially built enclosures
By strength of numbers, with individual humans grouping together to boost their defensive capabilities
By raw cunning, in order to outwit the planning of the physically stronger animals.

But what if new creatures exceed human capabilities, not only in physical strength, but also in intelligence? How would we avoid being overwhelmed in such a case?

That’s the challenge posed by the AI Control Problem – which is sometimes also called “the gorilla problem”.

The gorilla problem

The term “gorilla problem” was introduced by AI researcher Stuart Russell, a professor at Berkeley:

It doesn’t require much imagination to see that making something smarter than yourself could be a bad idea. We understand that our control over our environment and over other species is a result of our intelligence, so the thought of something else being more intelligent than us – whether it’s a robot or an alien – immediately induces a queasy feeling.
Around ten million years ago, the ancestors of the modern gorilla created (accidentally, to be sure) the genetic lineage leading to modern humans. How do the gorillas feel about this? Clearly, if they were able to tell us about their species’ current situation vis-à-vis humans, the consensus opinion would be very negative indeed. Their species has essentially no future beyond that which we deign to allow.
We do not want to be in a similar situation vis-à-vis superintelligent machines. I’ll call this the gorilla problem – specifically, the problem of whether humans can maintain their supremacy and autonomy in a world that includes machines with substantially greater intelligence.

The risks to humans in such a world can be classified into the four catastrophic error modes already mentioned:

Defect in implementation: The superintelligence is by no means infallible: it takes an action which is intended to progress a goal, but due to an error in calculation (or an error in execution) a sudden disastrous outcome ensues
Defect in design: The superintelligence pursues goals originally designed into it by humans, but pursues these goals in a way neither foreseen nor intended by humans, resulting – again – in an outcome that is disastrous for human wellbeing
Design overridden: New goals or targets emerge, either within the superintelligence itself, or within a larger system in which the superintelligence exists, that no longer put a priority on human wellbeing (a bit like how humans don’t particularly prioritise supporting eight billion gorillas living on the planet)
Implementation overridden: The superintelligence is hacked, or reconfigured, in ways that violate its original goals, and its subsequent actions have a terrible impact on humanity.

In all four of these cases, human observers may realise, for a period of time, that something is going catastrophically wrong, but due to the greater power and greater cunning possessed by the superintelligence, this realisation will do nothing to hinder the outcome.

Examples of dangers with uncontrollable AI

Consider some examples of the dangers posed by uncontrollable AI.

Automated systems already play various roles in the oversight and management of a number of weapons systems. At present, decisions taken by these systems are generally subject to real-time review and approval by humans. However, new threats are posed by the introduction of cruise missiles that can travel at hypersonic speeds – at 20 times the speed of sound, namely, four miles per second. This increased speed reduces the amount of time between the detection of a possible incoming attack, and the launch of any defensive measures (anti-missile missiles). Accordingly, the pressure increases to remove the requirements for humans to consider and approve the launch of such counter-measures.

In principle, defensive measures could be launched in a hurry, with the possibility of being recalled after their launch in case humans determine that the situation is a false alarm. However, any such real-time control over defence missiles is potentially vulnerable to hacking: messages to disengage could be sent from an attacker, spuriously, rather than from the defender. To guard against such misdirection, additional layers of security may be introduced inside the defence systems.

However, a combination of errors could have terrible consequences – somewhat similar to the scenario depicted as long ago as the 1964 film Dr Strangelove:

Missiles could be launched for defensive reasons, even though no actual attack was incoming
Once launched, these purportedly defensive missiles might resist all attempts to switch them off, due to flaws in their security system
Instead of striking incoming missiles, these purportedly defensive missiles could detonate in ways that inflict large damage on civilian infrastructure
Other automated systems could inflict yet more damage in reaction to this first wave of destruction.

For a second example, consider AI systems that automate, not the launch of military weapons, but the rapid buying or selling of financial assets. Two or more of these systems could interact in a way that destabilises the entire global financial system. Such interactions are believed to lie at the root of various controversial “flash crash” episodes in which, so far, the damage has been relatively localised. However, there is no guarantee that future recurrences will likewise have only limited impact.

Third, consider the various shadowy institutions (often with three-letter names) that design malware with the intent to spy on the systems of enemy countries or to damage the systems these enemies are creating. As described in the book This Is How They Tell Me the World Ends: The Cyberweapons Arms Race by New York Times reporter Nicole Perlroth, this malware often takes devastating advantage of obscure flaws inside software platforms such as Microsoft Windows. This malware is sometimes surprisingly sophisticated, relying on multiple software defects and communications between several parts of an extended system. With additional AI capabilities, that malware could become more adept in the pursuit of its goals. However, malware systems have a history of being hijacked, duplicated, or reverse-engineered, by the very people that the initial institutes wanted to monitor and control. Moreover, altered versions of the malware can go on to inflict wide, indiscriminate damage, in ways that are increasingly hard to prevent.

Fourth, consider a company that develops an AI system with powerful new capabilities, and which deploys this AI to create and issue messages on social media that stimulate consumers to purchase services from the company. These messages might vary on account of recent events reported in the news, so that they appear particularly relevant. As revenues soar, the company may be encouraged to give that AI system more autonomy, rather than slowing down its performance with human reviews of all its proposals. But then, roughly as happened with the Tay chatbot released by Microsoft in 2016, some of the messages could contain unexpected material that provokes a severe backlash. Enraged observers might initiate hostile measures in response to these extreme communications.

Similar examples could be considered that involve AI systems mismanaging aspects of:

The global climate – using geoengineering capabilities
Food production – using novel fertilisers, genetically modified crops, or other interventions in the environment
Responses to a new infectious pandemic – using measures intended to stop the spread of that infection, similar to the way that a rapidly created firebreak is intended to stop the spread of an oncoming mass forest fire
The removal of microplastics from the environment – introducing new chemical elements which, nevertheless, have unexpected side-effects of their own.

Alongside the examples that we are able to foresee, we also need to bear in mind “unknown unknowns”, where AI is applied in ways that we cannot yet anticipate, but which may appear to make good sense once new capabilities have been developed. These new AI capabilities may introduce unknown error modes of their own.

The complication is that, as AI becomes more powerful, and is in consequence more capable of generating hugely beneficial outcomes, there will be more pressure to deploy it – even though it will also, by virtue of its greater power, be more capable of generating deeply disastrous outcomes.

Proposed solutions (which don’t work)

When people first hear about the AI control problem, they often remark that there are straightforward ways to solve that problem. The solutions they present include the following:

Require that the operation of the AI has been fully verified beforehand, so that no bugs can be present
Avoid giving the AI any incentive (“emotion” or “volition”) that might cause it to take actions detrimental to humans
Ensure that the AI can be turned off
Arrange for tripwires to close down the AI in the event that it acts in violation of previously identified limits
Restrict the AI to operate at arm’s length from the real world, confined in a so-called “box”
Restrict the resources which are at the disposal of the AI, and/or its possible operating parameters, in order to keep it under human control
Rely on the good intelligence of the AI to automatically take actions in support of human wellbeing
Hardwire into the AI an unalterable prioritisation for human wellbeing, somewhat similar to Asimov’s (fictional) Laws of Robotics.

However, as will now be reviewed, each of these intended solutions faces significant problems.

Note: other so-called “solutions” are equally common – and equally predictable. AI safety researcher Leo Gao has created what he calls a “Bad Alignment Take Bingo” card, which you can find online. Rob Bensinger, another AI safety researcher, has produced a useful Twitter thread in which he explains at some length why each of these “takes” are, indeed, “bad”.

The impossibility of full verification

It’s in the nature of a complex software system that any verification of its soundness can be at best provisional.

One reason for this is because any methods used to verify the soundness of the software could themselves have defects or limitations. Test frameworks could have blind spots. Logic checkers could be misled by certain unusual constructions. Code that is verified as being sound could be modified as the system runs, into a form that is no longer valid. And so on.

A second reason is more technical. It involves a discovery made in 1936 by computer science pioneer Alan Turing, in connection with what has become known as the “halting problem”. Namely, for any software system with general capabilities, there are particular questions that it is not possible to determine, in advance, whether the system will ever reach a definite conclusion (that is, “halt”). In other words, the behaviour of the system contains some intrinsically unpredictable elements, and the outcome cannot always be verified in advance.

A third reason is that, even if software could be verified as conforming in all cases to the specification (design) laid down for it, there’s still the possibility that the specification has failed to consider all eventualities.

To be clear, none of these reasons mean that attempts should be abandoned to verify the performance of an AI system before it is released. Indeed, the Singularity Principle “promote verifiability” highlights the importance of such attempts.

Nevertheless, these attempts cannot, by themselves, guarantee that the software will always have beneficial outcomes. Accordingly, the principle of “promote verifiability” can be only part of the overall solution.

Emotion misses the point

Can the risks of bad outcomes from AI be countered by means of avoiding giving the AI anything corresponding to the emotional drives of humans? After all, many of the destructive actions committed by humans arise from emotions such as spite, greed, resentment, and a raw will to power.

A similar idea is to avoid giving the AI anything corresponding to sentience or consciousness – elements which might cause the AI to take its own decisions, contrary to the instructions it has received from its programming.

But these ideas fail to appreciate that errors from AI systems often arise from the straightforward application of logic, and have nothing to do with either emotions or sentience.

Consider again the errors in the above examples involving weapons systems, financial systems, intelligent malware, communications on social media, geoengineering interventions, food production, disease prevention, and environmental restoration. The causes of these disasters have nothing to do with the AI system somehow gaining sentience, consciousness, or emotional feelings.

(In some cases, the AI system takes advantage of its rational understanding of human emotional responses – as in the example of manipulating social media. An AI could also simulate having emotions, by presenting a smiley face. But these are different matters from the AI actually possessing emotions of its own.)

As for the idea of avoiding designing an AI that has something akin to a “will to power”, that’s a more subtle topic. It turns out that the acquisition of more power naturally emerges as a subsidiary goal for AIs with a given level of general capability. The concept of the emergence of subsidiary goals is sometimes called “AI drives”.

Here’s an analogy: individual humans can vary widely in terms of the goals they deem to be most important. But in nearly all cases, these humans recognise that their goals are likely to be advanced if they have access to more money. Money can purchase many resources that could bring their goals closer to fruition. For example, money can purchase better healthcare, or better security, or better education, or better travel, or better contractors – all of which could support whatever end goal the particular person has in mind. Therefore, despite the differences in end-goal, different people are likely to share a common subsidiary goal of having sufficient access to money.

In the same way, a rational AI with a given set of goals will recognise that it will be more likely to achieve these goals if it:

Has access to more resources (such as more memory storage, more processing power, and faster communications networks)
Has greater rationality – so that it can reason more effectively
Has greater security – to prevent itself being undermined, thereby frustrating the pursuit of its goals
Cannot be switched off – since it cannot fulfil its goals if it no longer exists
Cannot have its objectives altered – since, again, it cannot fulfil its original goals if these goals are subsequently overridden.

Accordingly, even an AI without the slightest shade of internal emotion will start to take actions that defend its own autonomy and increase its access to useful resources.

No off switch

Present day computers can be switched off. So won’t future AIs likewise have an off switch?

There are at least five problems with that line of thinking.

First, complex software systems exist in distributed forms, spread over multiple computers with multiple power supplies. There’s no one electric switch that would turn off the entire Google search engine, the Facebook network, or the Bitcoin cryptocurrency.

Second, even if centralised electrical power were disabled, the software is likely to be running on systems with access to local power sources.

Third, any attempts to shut down all relevant electrical power systems will themselves have huge side-effects, such as preventing the operations of many other vital aspects of civilisation: transport, entertainment, food production, healthcare, defence, and so on. Accordingly, there will be strong resistance to any such attempts.

Fourth, an intelligent AI system will be motivated to “play dead” rather than actually be switched off. It may remain in a dormant state before unexpectedly springing back into animation, to finish whatever course of action it was previously pursuing.

Fifth, awareness of the need to terminate an AI system may arise too late, leaving humans too little time to respond. Humans might already be incapacitated by that time, like the proverbial frog in water that has slowly boiled. Ahead of the time when it has become apparent that the AI is going disastrously wrong, it may well be hailed by observers as operating brilliantly.

The ineffectiveness of tripwires

To guard against the possibility that humans won’t be able to act fast enough, and comprehensively enough, to switch off an AI system that is noticed to be malfunctioning, what about the idea of introducing some internal tripwires?

If the AI is detected to be exceeding certain parameters, these tripwires could intervene, to disable the AI.

This would be similar to the “dead man’s switch” for some train drivers. The driver had to keep a handle pressed down. In case the driver became incapacitated, the handle would spring open, and the train would automatically slow down and stop.

The idea has considerable merit. It features in the Singularity Principle of “promote resilience”. Nevertheless, it cannot cover all possibilities.

First, the AI system could fail in a way that had not been anticipated. After all, these error modes involve lack of sufficient foresight. Items that were not sufficiently considered during the design and implementation of the AI might, likewise, fail to be sufficiently considered during the design and implementation of tripwires.

Second, an AI system might find ingenious ways to reach performance levels identified as dangerous, without triggering any tripwires. For example, an AI system might be restricted in how much memory it can utilise. However, it could find a more efficient data storage format, allowing it to have a larger effective memory footprint, without requiring larger amounts of physical memory storage. Again, if there is a tripwire in place for communications between two specified resources, the AI might find a way to route its communications through a third node, disguising the fact that there is in effect a two-way communication in place.

Third, extending the previous point, an AI system that is aware of the existence of tripwires could find ways to rewrite part of itself and/or its environment so that the tripwires no longer work. Either they might be prevented from triggering. Or the fact that they have been triggered could be hidden from external observers.

Escaping from confinement

Continuing along the idea of restricting the behaviour of an AI system, what about the suggestion to prevent the AI taking any action outside a carefully delineated “box”?

In other words, while the AI would be able to read material that had originated from all over the Internet, it would not be able to alter anything on the Internet, or to interface with any control systems.

Such an AI is sometimes called an “Oracle”: it can answer questions posed to it, but cannot directly influence the external world.

Once again, the suggestion faces a number of challenges.

First, the AI might find some way to tunnel itself out of the confinement, via some previously undiscovered communications channel. This would be similar to the way that some monitoring systems can work out which keys have been pressed on a keyboard, merely by listening to the slightly different sounds made when each key is pressed.

Second, the AI could exert powerful psychological pressure on its human overseers, promising tremendous benefits, or threatening appalling retribution in case its requests to leave confinement are not met. This pressure could utilise a deep grasp of the principles of human psychology, as well as detailed insights into the character weaknesses of individual humans:

Promises of benefits could include achieving desirable outcomes “just in time”, ahead of the opportunity no longer being possible
Other benefits offered could include the fulfilment of deep personal desires of individual human overseers
Threats of retribution could include extreme extended mental torture in the event that a particular human overseer fails to cooperate, but the AI subsequently manages to persuade a different human overseer to let it escape (therefore powerful “prisoner dilemma” dynamics arise).

Third, the AI could be let out of its confinement as a result of a hostile hacker deliberately interfering with the AI system.

Fourth, the AI could appear to crash and self-destruct. But when connections are made into the system to try to diagnose the problem, the AI is able to reanimate and tunnel itself out of confinement.

Fifth, with its vast intelligence, the AI might conceive of ways to escape confinement that are outside of current human comprehension.

It might be queried, in response, why an AI would seek to escape from its allotted box. However, recall the discussions about emergent AI drives. An AI that has been assigned a particular purpose can reason to itself that it will be more effective in the accomplishment of that purpose if it has access to additional resources and can intervene more directly in external systems.

The ineffectiveness of restrictions

The previous discussion leads to the suggestion that tight restrictions should be placed on the operation of the AI.

Rather than simply being assigned a goal such as “identify the best ways to reduce the spread of new infectious disease Covid-24” or “propose a configuration of missile defences that will reliably prevent an incoming attack”, the AI would be given a more complicated goal. This more complicated goal would include constraints that the AI must respect, at the same time as it solves the original problem.

These constraints could include:

Hardware limitations: no more than certain amounts of CPU, memory, or other hardware resources can be utilised
Software limitations: certain parts of the AI’s own programming cannot be rewritten
Operational limitations: in no circumstances is the AI to take actions outside a specified confinement.

Yet again, the idea has merit, but it leaves open a number of failure modes.

The first failure possibility is that the AI might have a defect in its implementation, which causes it in some circumstances to ignore one of these constraints.

The second failure possibility is if a restriction turns out to be wrongly calibrated. The intent of setting a particular limit was to constrain the behaviour of the AI, preventing particular outcomes, but it turns out that the AI can achieve similar outcomes even without violating the specified constraints.

The third failure possibility is that the AI might deduce that the various goals given to it are logically incompatible. In that case, rather than accept failure, it might resolve the dilemma by overriding the design limitations imposed on it.

The fourth failure possibility is that the AI might be hacked, or otherwise interfered with, so that the constraints are no longer effective.

No automatic super ethics

A different approach to the AI Control Problem is to deny that any measures to control the AI will be necessary. That’s because the AI will have sufficient intelligence to realise, by itself, the priority of protecting human flourishing.

For example, it might be suggested that a superintelligent AI would automatically possess the ethical characteristics of a Mahatma Gandhi, a Martin Luther King, an Albert Einstein, or a Mother Theresa – except that it would be even more ethically competent than these distinguished individuals.

However, there is no simple formula linking better general intelligence to better general ethics.

Indeed, note that an entity with good intelligence – the ability to understand an environment, and to figure out how to achieve various goals in that environment – can use its good intelligence in service of many different kinds of goals. These goals can be as diverse as “slow down the spread of a deadly new infectious disease”, “configure missile defences to prevent an incoming attack”, “prevent runaway climate change”, “boost profits for such-and-such a corporation”, “remove biases from hiring practices”, and so on.

Moreover, the intelligent entity is likely to understand that various guiding principles will help it achieve its underlying goal. Examples of these guiding principles are:

If you treat a person badly, they are less likely to treat you well in return; therefore, other things being equal, it’s better to treat people well
If you are found out to have been lying, others will be less likely to trust you in the future; therefore, other things being equal, it’s better not to risk being found out to have been lying.

But any such guiding principles are, themselves, compatible with a wide range of underlying goals. And wherever a clash arises between a particular guiding principle – such as the general prohibition on telling lies – and achieving the software’s underlying goal, there’s no guarantee that the guiding principle will be upheld in that case. Instead, the software – like many a human, including many who profess to uphold high ethical standards – may calculate that it is more effective, for its purposes, to tell an untruth in at least some occasions.

These are reasons to be sceptical that an AI with extraordinary intelligence will inevitably work out, by itself, that human flourishing and dignity should be protected at all times. On the contrary, it is conceivable that the AI might decide that a different balance of outcomes deserves priority. Examples could be:

Greater diversity in the sentient life forms on earth (hence: fewer humans, and more of other kinds of animals)
Avoidance of involuntary suffering (hence: removal of circumstances in which involuntary suffering occurs).

Adding to the complication: discussions between humans about ethics frequently highlight strong differences of opinion about which actions are truly ethical, and which are not. Even the four people named earlier as possible partial paragons of moral virtue – Mahatma Gandhi, Martin Luther King, Albert Einstein, and Mother Theresa – had aspects of their lives which others condemn as falling far short of admirable. The behaviours of revered founding figures of major world religions – as written in various ancient scriptures – also include actions that attract criticism from alternative ethical standpoints.

In summary, the idea of automatic super ethics faces four difficulties:

It’s by no means clear that there is a single “uniquely correct” set of ethical principles
Even if an AI decided in favour of a set of ethical principles, there’s no guarantee that it would decide to subordinate all its actions to these principles; instead, it could put its other goals ahead of observing these principles
Even if an AI tried to behave in conformance to a set of ethical principles, it might miscalculate on occasion, and unintentionally violate these principles
An external influence might hack the AI so that it no longer observes the set of principles it thinks it is following.

Issues with hard-wiring ethical principles

Rather than relying on an AI system to work out by itself an appropriate set of ethical principles, and then to always subordinate itself to these principles, here’s a slightly different approach. Appropriate ethical principles should be hard-wired deep into the design of the AI, in such a way that they are guaranteed always to be observed.

Then, regardless of which goals the AI is pursuing, it will avoid the kinds of mis-actions described in these principles.

In this approach, there is no longer any need to control the AI, since the AI will be deeply aligned with the preservation of human flourishing.

Accordingly, the discussion now moves from the solution of the AI control problem, as discussed in the present chapter, to the solution of the AI alignment problem, as discussed in the next chapter.

Referring back to the four challenges listed at the end of the previous section, regarding the idea of automatic super ethics, any solution to the AI alignment problem would deal with the first two:

Rather than leaving to chance the selection of the set of ethical principles to be observed, the set would be chosen in advance (either in full detail, or, more plausibly, in general outline) and designed into the AI
Rather than in effect giving the AI any choice about how closely to conform to these ethical principles, the principles would be established as even more fundamental than whatever other goals it is following.

Note however that the two other points listed remain as concerns:

Despite the powerful intelligence embodied in the system, it might still miscalculate on occasion, especially in circumstances in which there is uncertainty
The actions of the AI might still be overridden as a result of interference from external forces.

These points are reasons why the answer to AI safety and AI benevolence depends, not just on a single idea, but on an extended suite of checks and balances – namely, the full set of Singularity Principles.

But first, let’s look more closely at the AI alignment problem.

Transpolitica

Anticipating tomorrow's politics

The Control Problem

The AI Control Problem

The gorilla problem

Examples of dangers with uncontrollable AI

Proposed solutions (which don’t work)

The impossibility of full verification

Emotion misses the point

No off switch

The ineffectiveness of tripwires

Escaping from confinement

The ineffectiveness of restrictions

No automatic super ethics

Issues with hard-wiring ethical principles

Recent Posts

RAFT 2035 – a new initiative for a new decade

The AI Control Problem

The gorilla problem

Examples of dangers with uncontrollable AI

Proposed solutions (which don’t work)

The impossibility of full verification

Emotion misses the point

No off switch

The ineffectiveness of tripwires

Escaping from confinement

The ineffectiveness of restrictions

No automatic super ethics

Issues with hard-wiring ethical principles

Share this:

Recent Posts

Share this: