The Control Problem

The AI Control Problem

How does a physically weaker species keep control of members of physically stronger species?

For example, how do humans avoid being overwhelmed by gorillas, by elephants, or by tigers?

Three answers:

  1. By taking advantage of technologies, such as spears, tranquiliser guns, and specially built enclosures
  2. By strength of numbers, with different humans grouping together to boost their presence
  3. By raw cunning, in order to outwit the planning of the physically stronger animals.

But what if new creatures exceed human capabilities, not only in physical strength, but also in intelligence? How would we avoid being overwhelmed in such a case?

That’s the challenge posed by the AI Control Problem – which is sometimes also called “the gorilla problem”.

The gorilla problem

The term “gorilla problem” has been used by AI researcher Stuart Russell, a professor at Berkeley:

It doesn’t require much imagination to see that making something smarter than yourself could be a bad idea. We understand that our control over our environment and over other species is a result of our intelligence, so the thought of something else being more intelligent than us – whether it’s a robot or an alien – immediately induces a queasy feeling.

Around ten million years ago, the ancestors of the modern gorilla created (accidentally, to be sure) the genetic lineage leading to modern humans. How do the gorillas feel about this? Clearly, if they were able to tell us about their species’ current situation vis-à-vis humans, the consensus opinion would be very negative indeed. Their species has essentially no future beyond that which we deign to allow.

We do not want to be in a similar situation vis-à-vis superintelligent machines. I’ll call this the gorilla problem – specifically, the problem of whether humans can maintain their supremacy and autonomy in a world that includes machines with substantially greater intelligence.

The risks to humans in such a world can be classified into the four catastrophic error modes already mentioned:

  1. Defect in implementation: The superintelligence is by no means infallible: it takes an action which is intended to progress a goal, but due to an error in calculation (or an error in execution) a sudden disastrous outcome ensues
  2. Defect in design: The superintelligence pursues goals originally designed into it by humans, but pursues these goals in a way neither foreseen nor intended by humans, resulting – again – in an outcome that is disastrous for human wellbeing
  3. Design overridden: New goals or intents emerge, either within the superintelligence itself, or within a larger system in which the superintelligence exists, that no longer put a priority on human wellbeing (a bit like humans don’t particularly prioritise supporting eight billion gorillas living on the planet)
  4. Implementation overridden: The superintelligence is hacked, or reconfigured, in ways that violate its original goals, and its subsequent actions have a terrible impact on humanity.

In all four of these cases, human observers may realise, for a period of time, that something is going catastrophically wrong, but due to the greater power and greater intelligence possessed by the superintelligence, this realisation will do nothing to hinder the outcome.

Some examples of dangers with uncontrollable AI

Consider some examples of the dangers posed by uncontrollable AI.

Automated systems already play various roles in the oversight and management of a number of weapons systems. At present, decisions taken by these systems are generally subject to real-time review and approval by humans. However, new threats are posed by the introduction of cruise missiles that can travel at hypersonic speeds – at 20 times the speed of sound, namely, four miles per second. This increased speed reduces the amount of time between the detection of a possible incoming attack, and the launch of any defensive measures (anti-missile missiles). Accordingly, the pressure increases to remove the requirements for humans to consider and approve the launch of such counter-measures.

In principle, defensive measures could be launched in a hurry, but could be recalled after their launch in case humans determine that the situation is a false alarm. However, any such real-time control over defence missiles is potentially vulnerable to hacking: messages to disengage could be sent from an attacker, spuriously, rather than from the defender. To guard against such misdirection, additional layers of security may be introduced inside the defence systems.

However, a combination of errors could have terrible consequences – somewhat similar to the scenario depicted as long ago as the 1964 film Dr Strangelove:

  • Missiles could be launched for defensive reasons, even though no actual attack was incoming
  • Once launched, these purportedly defensive missiles might resist all attempts to switch them off, due to flaws in their security system
  • Instead of striking incoming missiles, these purportedly defensive missiles could detonate in ways that inflict large damage on civilian infrastructure
  • Other automated systems could inflict yet more damage in reaction to this first wave of destruction.

For a second example, consider AI systems that automate, not the launch of military weapons, but the rapid buying or selling of financial assets. Two or more of these systems could interact in a way that destabilises the entire global financial system.

Again, consider a company that develops an AI system with powerful new capabilities, and which deploys this AI to create and issue messages on social media that stimulate consumers to purchase services from the company. These messages might vary on account of events happening in the world, so that they appear particularly relevant. As revenues soar, the company may be encouraged to give that AI system more autonomy, rather than slowing down its performance with human reviews of all its proposals. But then, roughly as happened with the Tay chatbot released by Microsoft in 2016, some of the messages could contain unexpected material that provokes a severe backlash. Enraged observers might initiate hostile measures in response to these extreme communications.

Similar examples may be considered that involve AI systems mismanaging aspects of:

  • The global climate – using geoengineering capabilities
  • Food production – using novel fertilisers, genetically modified crops, or other interventions in the environment
  • Responses to a new infectious pandemic – using measures intended to stop the spread of that infection, similar to the way that a rapidly created firebreak is intended to stop the spread of an oncoming mass forest fire
  • The removal of microplastics from the environment – introducing new chemical elements which, nevertheless, have unexpected side-effects.

Alongside the examples that we are able to foresee, we also need to bear in mind “unknown unknowns”, where AI is applied in ways that we cannot yet anticipate, but which may appear to make good sense once new capabilities have been developed. These new AI capabilities may introduce unknown error modes of their own.

The complication is that, as AI becomes more powerful, and is in consequence more capable of generating hugely beneficial outcomes, there will be more pressure to deploy it – even though it will also, by virtue of its greater power, also be more capable of generating deeply disastrous outcomes.

Proposed solutions (which don’t work)

When people first hear about the AI control problem, they often remark that there are straightforward ways to solve that problem. The solutions they present include:

  1. Require that the operation of the AI has been fully verified beforehand, so that no bugs can be present
  2. Avoid giving the AI any incentive (“emotion” or “volition”) that might cause it to take actions detrimental to humans
  3. Ensure that the AI can be turned off
  4. Arrange for tripwires to close down the AI in the event that it acts in violation of previously identified limits
  5. Restrict the AI to operate at arms length from the real world, confined in a so-called “box”
  6. Restrict the resources which are at the disposal of the AI, and/or its possible operating parameters, in order to keep it under human control
  7. Rely on the good intelligence of the AI to automatically take actions in support of human wellbeing
  8. Hardwire into the AI an unalterable prioritisation for human wellbeing, somewhat similar to Asimov’s (fictional) Laws of Robotics.

However, as will now be reviewed, each of these intended solutions faces significant problems.

The impossibility of full verification

It’s in the nature of a complex software system that any verification of its soundness can be at best provisional.

One reason for this is because any methods used to verify the soundness of the software could themselves have defects or limitations.

A second reason is more technical. It involves a discovery made in 1936 by computer science pioneer Alan Turing, in connection with what has become known as the “halting problem”. Namely, for any software system with general capabilities, there are particular problems that it is not possible to determine, in advance, whether the system will ever reach a definite conclusion (that is, “halt”). In other words, the behaviour of the system contains some intrinsically unpredictable elements.

A third reason is that, even if software could be verified as conforming in all cases to the specification (design) laid down for it, there’s still the possibility that the specification has failed to consider all eventualities.

To be clear, none of these reasons means that attempts should be abandoned to verify the performance of an AI system before it is released. Indeed, the Singularity Principle “promote verifiability” highlights the importance of such attempts.

Nevertheless, these attempts cannot, by themselves, guarantee that the software will always have beneficial outcomes. Accordingly, the principle of “promote verifiability” can be only part of the overall solution.

Emotion misses the point

Can the risks of bad outcomes from AI be countered by means of avoiding giving the AI anything corresponding to the emotional drives of humans? After all, many of the destructive actions committed by humans arise from emotions such as spite, greed, resentment, and a raw will to power.

A similar idea is to avoid giving the AI anything corresponding to sentience or consciousness – elements which might cause the AI to take its own decisions, contrary to the instructions it has received from its programming.

But these ideas fail to appreciate that errors from AI systems often arise from the straightforward application of logic, and have nothing to do with either emotions or sentience.

Consider again the errors in the above examples involving weapons systems, financial systems, communications on social media, geoengineering interventions, food production, disease prevention, and environmental restoration. The causes of these disasters have nothing to do with the AI system somehow gaining sentience, consciousness, or emotional feelings.

(In some cases, the AI system takes advantage of its rational understanding of human emotional responses – as in the example of manipulating social media. An AI could also simulate having emotions, by presenting a smiley face. But these are different matters from the AI actually possessing emotions of its own.)

As for the idea of avoiding designing an AI that has something akin to a “will to power”, that’s a more subtle topic. It turns out that the acquisition of more power naturally emerges as a subsidiary goal for AIs with a given level of general capability. The concept of the emergence of subsidiary goals is sometimes called “AI drives”.

Here’s an analogy: individual humans can vary widely in terms of the goals they deem to be most important. But in nearly all cases, these humans recognise that their goals are likely to be advanced if they have access to more money. Money can purchase many resources that could bring their goals closer to fruition. For example, money can purchase better healthcare, or better security, or better education, or better travel, or better contractors – all of which could support whatever end goal the particular person has in mind. Therefore, despite the differences in end-goal, different people are likely to share a common subsidiary goal of having sufficient access to money.

In the same way, a rational AI with a given set of goals will recognise that it will be more likely to achieve these goals if it:

  • Has access to more resources (such as more memory, more processing power, and faster communications networks)
  • Has greater rationality – so that it can reason more effectively
  • Has greater security – to prevent itself being undermined, thereby frustrating the pursuit of its goals
  • Cannot be switched off – since it cannot fulfil its goals if it no longer exists
  • Cannot have its objectives altered – since, again, it cannot fulfil its original goals if these goals are subsequently overridden.

Accordingly, even an AI without the slightest shade of internal emotion will start to take actions that defends its own autonomy and increase its access to useful resources.

No off switch

Present day computers can be switched off. So won’t future AIs likewise have an off switch?

There are at least five problems with that line of thinking.

First, complex software systems exist in distributed forms, spread over multiple computers. There’s no one electric switch that would turn off the entire Google search engine, the Facebook network, or the Bitcoin cryptocurrency.

Second, even if centralised electrical power were disabled, the software is likely to be running on systems with access to local power sources.

Third, any attempts to shut down all relevant electrical power systems will themselves have huge side-effects, such as preventing the operations of many other aspects of civilisation (transport, entertainment, food production, healthcare, defence, and so on). Accordingly, there will be strong resistance to any such attempts.

Fourth, an intelligent AI system will be motivated to “play dead” rather than actually be switched off. It may remain in a dormant state before unexpectedly springing back into animation, to finish whatever course of action it was previously pursuing.

Fifth, awareness of the the need to terminate an AI system may arise too late, leaving humans too little time to respond. Humans might already be incapacitated by that time, like the proverbial frog in water that has slowly boiled. Ahead of the time when it is clear the AI is going disastrously wrong, it may well be hailed by observers as operating brilliantly.

The ineffectiveness of tripwires

To guard against the possibility that humans won’t be able to act fast enough, and comprehensively enough, to switch off an AI system that is noticed to be malfunctioning, what about the idea of introducing some internal tripwires?

If the AI is detected to be exceeding certain parameters, these tripwires could intervene, to disable the AI.

This would be similar to the “dead man’s switch” for some train drivers. The driver had to keep a handle pressed down. In case he became incapacitated, the handle would spring open, and the train would automatically slow down and stop.

The idea has considerable merit. It features in the Singularity Principle of “promote resilience”. Nevertheless, it cannot cover all possibilities.

First, the AI system could fail in a way that had not been anticipated. After all, these error modes involve lack of sufficient foresight. Items that were not sufficiently considered during the design and implementation of the AI might, likewise, fail to be sufficiently considered during the design and implementation of tripwires.

Second, an AI system might find ingenious ways to reach performance levels identified as dangerous, without triggering any tripwires. For example, an AI system might be restricted in how much memory it can utilise. However, it could find a more efficient data storage format, allowing it to have a larger effective memory footprint, without requiring larger amounts of physical memory storage. Again, if there is a tripwire in place for communications between two specified resources, the AI might find a way to route its communications through a third node, disguising the fact that there is in effect a two-way communication in place.

Third, extending the previous point, an AI system that is aware of the existence of tripwires could find ways to rewrite part of itself and/or its environment so that the tripwires no longer work. Either they might be prevented from triggering. Or the fact that they have been triggered could be hidden from external observers.

Escaping from confinement

Continuing along the idea of restricting the behaviour of an AI system, what about the suggestion to prevent the AI taking any action outside a carefully delineated “box”?

In other words, while the AI would be able to read material from all over the Internet, it would not be able to alter anything on the Internet, or to interface with any control systems.

Such an AI is sometimes called an “Oracle”: it can answer questions posed to it, but cannot directly influence the external world.

Once again, the suggestion faces a number of challenges.

First, the AI might find some way to tunnel itself out of the confinement, via some previously undiscovered communications channel. This would be similar to the way that some monitoring systems can work out which keys have been pressed on a keyboard, merely by listening to the slightly different sounds made when each key is pressed.

Second, the AI could exert powerful pressure on its human overseers, promising tremendous benefits, or threatening appalling retribution in case its requests to leave confinement are not met. This pressure could utilise a deep grasp of the principles of human psychology, as well as detailed insights into the character weaknesses of individual humans:

  • Promises of benefits could include achieving desirable outcomes “just in time”, ahead of the opportunity no longer being possible
  • Other benefits offered could include the fulfilment of deep personal desires of individual human overseers
  • Threats of retribution could include extreme extended mental torture in the event that a particular human overseer fails to cooperate, but the AI subsequently manages to persuade a different human overseer to let it escape (therefore powerful “prisoner dilemma” dynamics arise)

Third, the AI could be let out of its confinement as a result of a hostile hacker deliberately interfering with the AI system.

Fourth, the AI could appear to crash and self-destruct. But when connections are made into the system to try to diagnose the problem, the AI is able to reanimate and tunnel itself out of confinement.

Fifth, with its vast intelligence, the AI might conceive of ways to escape confinement that are outside of current human comprehension.

It might be queried, in response, why an AI would seek to escape from its allotted box. However, recall the discussions about emergent AI drives. An AI that has been assigned a particular purpose can reason to itself that it will be more effective in the accomplishment of that purpose if it has access to additional resources and can intervene more directly in external systems.

The ineffectiveness of restrictions

The previous discussion leads to the suggestion that tight restrictions should be placed on the operation of the AI.

Rather than simply being assigned a goal such as “identify the best ways to reduce the spread of new infectious disease Covid-24” or “propose a configuration of missile defences that will reliably prevent an incoming attack”, the AI would be given a more complicated goal. This more complicated goal would include constraints that the AI must respect, at the same time as it solves the original problem.

These constraints could include:

  • Hardware limitations: no more than certain amounts of CPU, memory, or other hardware resources can be utilised
  • Software limitations: certain parts of the AI’s own programming cannot be rewritten
  • Operational limitations: in no circumstances is the AI to take actions outside a specified confinement.

Yet again, the idea has merit, but it leaves open a number of failure modes.

The first failure possibility is that the AI might have a defect in its implementation, which causes it in some circumstances to ignore one of these constraints.

The second failure possibility is if a restriction turns out to be wrongly calibrated. The intent of setting a particular limit was to constrain the behaviour of the AI, preventing particular outcomes, but it turns out that the AI can achieve similar outcomes even without violating the specified constraints.

The third failure possibility is that the AI might deduce that the various goals given to it are logically incompatible. In that case, rather than accept failure, it might resolve the dilemma by overriding the design limitations imposed on it.

The fourth failure possibility is that the AI might be hacked, or otherwise interfered with, so that the constraints are no longer effective.

No automatic super ethics

A different approach to the AI Control Problem is to deny that any measures to control the AI will be necessary. That’s because the AI will have sufficient intelligence to realise, by itself, the priority of protecting human flourishing.

For example, it might be suggested that a superintelligent AI would automatically possess the ethical characteristics of a Mahatma Gandhi, a Martin Luther King, an Albert Einstein, or a Mother Theresa – except that it would be even more ethically competent.

However, there is no simple formula linking better general intelligence to better general ethics.

First, note that an entity with good intelligence – the ability to understand an environment, and to figure out how to achieve various goals in that environment – can use its good intelligence in service of many different kinds of goals. These goals can be as diverse as “slow down the spread of a deadly new infectious disease”, “configure missile defences to prevent an incoming attack”, “prevent runaway climate change”, “boost profits for such-and-such a corporation”, “remove biases from hiring practices”, and so on.

Second, the intelligence is likely to understand that various guiding principles will help it achieve its underlying goal. Examples of these guiding principles are: if you treat a person badly, they are unlikely to treat you well in return, and that if you are found out to have been lying, others will be less likely to trust you in the future. However, these guiding principles are, themselves, compatible with a wide range of underlying goals. And wherever a clash arises between a particular guiding principle – such as the general prohibition on telling lies – and achieving the software’s underlying goal, there’s no guarantee that the guiding principle will be upheld in that case. Instead, the software – like many a human, including many who profess to uphold high ethical standards – may calculate that it is more effective, for its purposes, to tell an untruth in at least some occasions.

Could we be confident that an AI with extraordinary intelligence will work out, by itself, that human flourishing should be protected at all times? On the contrary, it is conceivable that the AI might decide that a different balance of outcomes deserves priority. Examples could be:

  • Greater diversity in the sentient life forms on earth (hence: fewer humans, and more of other kinds of animals)
  • Avoidance of involuntary suffering (hence: removal of circumstances in which involuntary suffering occurs).

Discussions between humans about ethics frequently highlight strong differences of opinion about which actions are truly ethical, and which are not. Even the four people named earlier as possible partial paragons of moral virtue – Mahatma Gandhi, Martin Luther King, Albert Einstein, and Mother Theresa – had aspects of their lives which others condemn as falling fall short of admirable. The behaviours of revered founding figures of major world religions – as written in various ancient scriptures – also include actions that attract criticism from alternative ethical standpoints.

In summary:

  1. It’s by no means clear that there is a single “uniquely correct” set of ethical principles
  2. Even if an AI decided in favour of a set of ethical principles, there’s no guarantee that it would decide to subordinate all its actions to these principles; instead, it could put its other goals ahead of observing these principles
  3. Even if an AI tried to behave in conformance to a set of ethical principles, it might miscalculate on occasion, and unintentionally violate these principles
  4. An external influence might hack the AI so that it no longer observes the set of principles it thinks it is following.

Issues with hard-wiring ethical principles

Rather than relying on an AI system to work out by itself an appropriate set of ethical principles, and then to always subordinate itself to these principles, here’s a slightly different approach. Appropriate ethical principles should be hard-wired deep into the design of the AI, in such a way that they are guaranteed always to be observed.

Then, regardless of which goals the AI is pursuing, it will avoid the kinds of mis-actions described in these principles.

In this approach, there is no longer any need to control the AI, since the AI will be deeply aligned with the preservation of human flourishing.

Accordingly, the discussion now moves from the solution of the AI control problem, as discussed on the present page, to the solution of the AI alignment problem, as discussed here.

Referring back to the four points listed at the end of the previous section, the solutions to the AI alignment problem would deal with the first two:

  1. Rather than leaving to chance the selection of the set of ethical principles to be observed, the set would be chosen in advance (either in full detail, or, more plausibly, in general outline) and designed into the AI
  2. Rather than in effect giving the AI a choice about how closely to conform to these ethical principles, the principles would be established as even more fundamental than whatever other goals it is following.

Note however that the two other points listed remain as concerns:

  • Despite the powerful intelligence embodied in the system, it might still miscalculate on occasion, especially in circumstances in which there is uncertainty
  • The actions of the AI might still be overridden as a result of interference from external forces.

These points are reasons why the answer to AI safety and AI benevolence depends, not just on a single idea, but on an extended suite of checks and balances – namely, the full set of Singularity Principles.

Recent Posts

RAFT 2035 – a new initiative for a new decade

The need for a better politics is more pressing than ever.

Since its formation, Transpolitica has run a number of different projects aimed at building momentum behind a technoprogressive vision for a better politics. For a new decade, it’s time to take a different approach, to build on previous initiatives.

The planned new vehicle has the name “RAFT 2035”.

RAFT is an acronym:

  • Roadmap (‘R’) – not just a lofty aspiration, but specific steps and interim targets
  • towards Abundance (‘A’) for all – beyond a world of scarcity and conflict
  • enabling Flourishing (‘F’) as never before – with life containing not just possessions, but enriched experiences, creativity, and meaning
  • via Transcendence (‘T’) – since we won’t be able to make progress by staying as we are.

RAFT is also a metaphor. Here’s a copy of the explanation:

When turbulent waters are bearing down fast, it’s very helpful to have a sturdy raft at hand.

The fifteen years from 2020 to 2035 could be the most turbulent of human history. Revolutions are gathering pace in four overlapping fields of technology: nanotech, biotech, infotech, and cognotech, or NBIC for short. In combination, these NBIC revolutions offer enormous new possibilities – enormous opportunities and enormous risks:…

Rapid technological change tends to provoke a turbulent social reaction. Old certainties fade. New winners arrive on the scene, flaunting their power, and upturning previous networks of relationships. Within the general public, a sense of alienation and disruption mingles with a sense of profound possibility. Fear and hope jostle each other. Whilst some social metrics indicate major progress, others indicate major setbacks. The claim “You’ve never had it so good” coexists with the counterclaim “It’s going to be worse than ever”. To add to the bewilderment, there seems to be lots of evidence confirming both views.

The greater the pace of change, the more intense the dislocation. Due to the increased scale, speed, and global nature of the ongoing NBIC revolutions, the disruptions that followed in the wake of previous industrial revolutions – seismic though they were – are likely to be dwarfed in comparison to what lies ahead.

Turbulent times require a space for shelter and reflection, clear navigational vision despite the mists of uncertainty, and a powerful engine for us to pursue our own direction, rather than just being carried along by forces outside our control. In short, turbulent times require a powerful “raft” – a roadmap to a future in which the extraordinary powers latent in NBIC technologies are used to raise humanity to new levels of flourishing, rather than driving us over some dreadful precipice.

The words just quoted come from the opening page of a short book that is envisioned to be published in January 2020. The chapters of this book are reworked versions of the scripts used in the recent “Technoprogressive roadmap” series of videos.

Over the next couple of weeks, all the chapters of this proposed book will be made available for review and comment:

  • As pages on the Transpolitica website, starting here
  • As shared Google documents, starting here, where comments and suggestions are welcome.

RAFT Cover 21

All being well, RAFT 2035 will also become a conference, held sometime around the middle of 2020.

You may note that, in that way that RAFT 2035 is presented to the world,

  • The word “transhumanist” has moved into the background – since that word tends to provoke many hostile reactions
  • The word “technoprogressive” also takes a backseat – since, again, that word has negative connotations in at least some circles.

If you like the basic idea of what’s being proposed, here’s how you can help:

  • Read some of the content that is already available, and provide comments
    • If you notice something that seems mistaken, or difficult to understand
    • If you think there is a gap that should be addressed
    • If you think there’s a better way to express something.

Thanks in anticipation!

  1. A reliability index for politicians? 2 Replies
  2. Technoprogressive Roadmap conf call Leave a reply
  3. Transpolitica and the TPUK Leave a reply
  4. There’s more to democracy than voting Leave a reply
  5. Superdemocracy: issues and opportunities Leave a reply
  6. New complete book awaiting reader reviews Leave a reply
  7. Q4 update: Progress towards “Sustainable superabundance” Leave a reply
  8. Q3 sprint: launch the Abundance Manifesto Leave a reply
  9. Q2 sprint: Political responses to technological unemployment Leave a reply