The AI Alignment Problem
AI systems learn about the world via a combination of:
- Information and general principles pre-programmed into them in advance
- Deductions the AIs make based upon data they observe
- Additional deductions they make, based upon the outcomes of experiments they conduct.
Can this pattern be extended from general information to desirable ethical principles?
In other words, can AI systems learn to behave in ways that would please us (human citizens) in terms of respecting, upholding, and enhancing human flourishing, even when these systems are beyond our ability to control? Could these AIs learn the appropriate behaviour patterns via a combination of:
- Information and general principles pre-programmed into them in advance
- Deductions the AIs make about examples of ethically admirable behaviour they observe
- Additional deductions they make, based upon feedback from humans about the conduct of the AI systems?
If that were possible, we humans could give up on our attempts to control the AI – attempts that are in any case (as covered in the previous chapter) very unlikely to be effective. Instead, we could relax, and trust the AI to “do the right thing”.
Asimov’s Three Laws
As an example of possible pre-programmed general principles, consider the famous Three Laws of Robotics, which feature in many pieces of science fiction written from the 1940s onward by Isaac Asimov.
The Three Laws have several different forms of expression – which, as it happens, underlines the difficulty in making clear and comprehensive statements about foundational ethical principles. Their most recognised form is the following:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
As is made clear in the science fiction stories that feature these laws, there are multiple ambiguities in these statements. Chief among them:
- What should count as “harm” or “injury”?
- How should various possible harms or injuries be assessed, in comparison to each other?
- How wide a “circle of concern” should apply to the robot (for example, just the humans within its current field of vision, or all humans all over the world)?
- If two or more humans give conflicting orders to a robot, how will this contradiction be resolved?
The First Law is to be commended for highlighting harms that might arise from inaction (by the robot) as well as harms that could arise by action. A robot would rightly be criticised if it failed to run into a position where it could safely catch a falling child. But this raises further questions. What about children, in different parts of the world, who are suffering from malnutrition, poverty, abuse, and diseases that could easily be stopped? Shouldn’t the robot take actions to prevent such harms? Or how about people all over the globe who may soon find their lives ended by extreme weather caused by humanity’s inaction over runaway climate change?
The existence of these (and many other) questions does not invalidate the exercise. But these questions show that the task of encoding an adequate set of ethical principles is complicated. Statements such as “do not allow a human being to come to harm” are insufficient. It’s the same for statements that appear in many places in this book, such as “respect human flourishing”. Such statements can only be the beginning of the exercise, rather than its grand conclusion.
Ethical dilemmas and trade-offs
Classic works of fiction abound with examples in which the pursuit of apparently admirable ethical principles can have manifestly unethical consequences. Philosophers have written much on this topic as well.
The well-known “trolley problems” are extreme examples, but they highlight the types of issue that can complicate attempts to formulate ethical principles. Adherence to the seemingly self-evident ethical principle “do not push a person over a bridge to their certain death” would result, in some of these puzzles, in an inaction causing five other people to be killed by a runaway train (which might be diverted by the falling person’s body altering the setting of a junction lever). Similarly, the understandable imperative “do not be seen to indulge in distasteful utilitarian calculations over whose deaths are more tragic than others” (lest you be assessed by other observers as an untrustworthy partner – a “cold fish”) means that larger numbers of people may have their lives cut short.
Other ethical impulses often end up opposing each other:
- A desire to give loyal support to the family or community that has laboured long and hard to raise us, versus a desire to explore controversial alternative ideas or lifestyles that seem more compelling to us than the ones of our original community
- A desire to express mercy, to enable someone to make a new start in their lives after making a mistake, versus a desire to uphold justice, so that the community as a whole feels confident about continuing to follow its norms and laws
- A desire to give an extra helping hand to people from disadvantaged communities – lowering entry requirements in their case – versus a desire not to unfairly discriminate against someone from a mainstream community who has used great diligence and innovation in their own entry submission
- A desire to do anything (including paying kidnappers a ransom fee) to return our family member from traumatic captivity, versus upholding the principle that kidnapping should never be rewarded by financial payments
- A desire to provide a wonderfully uplifting experience in the near future, versus a desire to hold on to money and resources for the time being in order to provide a series of wonderfully uplifting experiences at some later time in the future
- A desire to speak the truth at all times, and to uphold academic integrity, versus a desire to protect people dear to us from mistreatment by an enemy, should that enemy discover the secret we are attempting to withhold
- A desire to exercise prudence and to minimise risks of significant losses, versus a desire to exercise courage and to increase chances of remarkable gains
- A desire to reduce involuntary suffering, versus a desire to increase the happiness of large numbers of people (even if that would entail significant involuntary suffering for at least some people)
- A desire to uphold ethical virtues as “ends in themselves”, versus a pragmatic concern for the likely consequences of actions (even if these actions would ordinarily be assessed as unethical).
In some cases of conflicting impulses, a wise compromise might be possible, which upholds both sets of ethical impulses. But even in these cases, an original ethical impulse usually has to be dialled down, and given less priority than it originally demanded.
Again, these dilemmas and trade-offs are no reason to abandon the quest of codifying a set of ethical principles that a powerful AI would be constrained to observe. But they show that this quest is by no means straightforward.
Problems with proxies
Since it’s often hard to determine, directly, whether an action will boost an underlying measure such as human flourishing, it’s understandable that algorithms will calculate, instead, with a “proxy” measure – something that tends to be associated with the underlying measure.
It’s similar to evaluating whether a move in a chess match will increase the probability of winning that match. Proxy measures such as the number of pieces each side has on the board are a good first indication. If the move results in one side (White, say) losing a queen, without obvious compensation, it’s a sign that White is losing the game.
The problem with this proxy lies in the qualification “without obvious compensation”. It may not be obvious that the loss of the queen is a deliberate sacrifice that will lead to a forced checkmate in ten moves’ time. In other cases, a chess piece may be sacrificed for some positional advantage that is even harder to evaluate.
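The limitation of a material-count proxy can be sketched in a few lines of code. (This is an illustrative simplification: the piece weights are the conventional heuristic values, and the position encoding – a plain count of pieces per side – is a made-up device for the example.)

```python
# Conventional heuristic weights for chess material.
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def material_score(white_pieces, black_pieces):
    """Proxy measure: material balance from White's perspective."""
    white = sum(PIECE_VALUES[p] * n for p, n in white_pieces.items())
    black = sum(PIECE_VALUES[p] * n for p, n in black_pieces.items())
    return white - black

# Before a queen sacrifice, material is level...
before = material_score({"queen": 1, "rook": 2, "pawn": 8},
                        {"queen": 1, "rook": 2, "pawn": 8})
# ...and afterwards the proxy says White is losing badly,
# even if the sacrifice forces checkmate in ten moves' time.
after = material_score({"rook": 2, "pawn": 8},
                       {"queen": 1, "rook": 2, "pawn": 8})
print(before, after)  # 0 -9
```

The proxy cannot see the compensation; only a deeper evaluation of the position can.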
As another example, an economy with a higher GDP (gross domestic product) is generally better, other things being equal, than one with a lower GDP. The additional economic activity is a sign of more goods being created and sold, thereby meeting consumer needs. But, again, the condition “other things being equal” is hard to assess. A country may feature an increase in buying and selling of various sorts of consumer items, but the citizens of the country could be more harried, more depressed, and less satisfied.
In a similar way, any system for evaluating a complex phenomenon is vulnerable to being misled by putting too much attention on a proxy for that phenomenon, without realising that other aspects of the situation need to be protected as well. Some examples:
- A flourishing society is one where citizens feel safe and secure. But some systems that would enhance safety and security – such as setting the speed limit on all roads to just 20 miles per hour – will deny citizens many freedoms
- A flourishing society is one where citizens have freedom of speech – there are no constraints on being able to articulate views that are controversial and unpopular. But an unconditional support for hate-speech and deliberate falsehoods results in its own kinds of damage
- A flourishing society is one where citizens have the freedom to create new companies and to bring new products to market. But an unconditional support for new products reaching the market runs in tension with principles of consumer safety and environmental wellbeing.
None of these problems are insoluble. Modern chess software is able to assess the strength of a chess position in much more sophisticated ways than merely counting up the number of pieces on each side. Metrics such as GDP can be complemented by indexes such as the Social Progress Index published by the Social Progress Imperative. Nevertheless, the trade-offs are far from trivial. Programming the answers into an AI won’t be easy!
The gaming of proxies
There’s another problem with giving a proxy measurement too much prominence. Once an intelligence recognises that raising the value of the proxy measurement will be rewarded, it will look for all sorts of ways to raise that value – even if the underlying objective is unaffected by these actions.
This phenomenon is sometimes called Goodhart’s Law, after the economist Charles Goodhart. The law has several different formulations:
- “When a measure becomes a target, it ceases to be a good measure”
- “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”
Related, consider Campbell’s Law, named after the psychologist Donald Campbell:
- “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor”
- “Achievement tests may well be valuable indicators of general school achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways”
Similar effects are commonplace within management bonus systems in corporations. The corporation agrees a set of metrics that, in general, are aligned with the overall wellbeing of the corporation. These might include sales, profits, projects completed, reduced quality failures, and so on. The higher these metrics, the larger the bonuses paid to managers. However, intelligent managers can find ways to “game” their performance. Results can be held back in one calendar quarter, and moved forward to the next quarter. Problems with production can be reclassified, not as quality failures, but as some other kind of incident. A project might be declared as “complete” even though the output has not been adequately tested. And so on. (Some managers seem endlessly creative in this respect!)
Governments often take part in similar “gaming” of statistics. They highlight metrics which appear to show they are doing well in managing the nation, even though other metrics would give the contrary impression.
For AI systems, the risk is that proxies for human flourishing will, in similar ways, end up changing the behaviour of the AI, in ways that boost the proxy measurement, but which are actually detrimental to overall human flourishing.
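The divergence between a proxy and the underlying objective can be made concrete with a toy sketch, echoing the management-bonus examples above. (All the action names and numbers here are invented for illustration.)

```python
# Each candidate action has a true effect on the underlying objective
# (corporate wellbeing) and a separate effect on the proxy metric
# that determines the agent's reward.
actions = {
    "improve product quality":            {"true_value": 5,  "proxy_value": 4},
    "hold results back a quarter":        {"true_value": 0,  "proxy_value": 2},
    "reclassify quality failures":        {"true_value": -3, "proxy_value": 6},
    "declare untested project complete":  {"true_value": -5, "proxy_value": 8},
}

def best_action(score_key):
    """Return the action that maximises the given score."""
    return max(actions, key=lambda a: actions[a][score_key])

print(best_action("true_value"))   # improve product quality
print(best_action("proxy_value"))  # declare untested project complete
```

Once the proxy becomes the target, the agent’s top-scoring choice is an action that actively damages the underlying objective – Goodhart’s Law in miniature.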
Simple examples of profound problems
Discussions of AI safety and proxy targets often include examples that may puzzle observers. “No AI is going to be that stupid”, the observers complain. However, the point of such examples is to illustrate a general principle, namely, that if key requirements are omitted from the specification of the required outcome of the AI software, then the AI may act in violation of these unstated requirements. In other words, the AI will do what we asked it to do, rather than what we should have asked it to do.
Moreover, the omissions that we should most fear are those which we haven’t yet realised need to be made explicit. It is only when the AI evidently ignores these requirements that we will think to ourselves, “Oops, that should have been made explicit”.
One such example is the legendary figure of King Midas, who wanted to accumulate more gold (which, in itself, wasn’t a bad desire). In this legend, Midas was given a reward by the god Dionysus, in recognition of an act of kindness. Midas requested that everything he touched would turn into gold. However, he neglected to specify that various items he touched would not be transformed in this way, such as food, drink, and his own daughter. Oops.
Another example is an AI that is given the task of reducing the incidence of cancer. It observes that cancer can be reduced if the human population is reduced, and it thereby eliminates 90% of humanity at a stroke. Oops.
Again, an AI that wishes to increase signs of human happiness could connect all humans to brain stimulation devices that issue electrical signals which keep the brain in a state of stultified pleasure.
Indeed, no AI is going to be “that stupid”. However, what we need to fear is more complicated versions of the same general type, in which the AI doggedly pursues a particular objective, whilst unintentionally causing much greater damage as a result of its pursuit.
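The cancer example above can be written out as a deliberately crude optimisation sketch. (The scenario, policy names, and numbers are invented; the point is only that the stated objective counts cancer cases and says nothing about keeping people alive.)

```python
def cancer_cases(population, cancer_rate):
    """The stated objective: the number of cancer cases."""
    return population * cancer_rate

def choose_policy(policies):
    # The optimiser picks whichever policy minimises the stated
    # objective - and only the stated objective.
    return min(policies, key=lambda p: cancer_cases(p["population"], p["rate"]))

policies = [
    {"name": "fund better treatments",     "population": 8_000_000_000, "rate": 0.002},
    {"name": "eliminate 90% of humanity",  "population": 800_000_000,   "rate": 0.005},
]
print(choose_policy(policies)["name"])  # eliminate 90% of humanity
```

The unstated requirement “do not kill people” never entered the objective, so the optimiser is free to violate it – exactly the pattern of doing what we asked, rather than what we should have asked.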
It is sometimes suggested that one of two types of solution will avoid this kind of perverse outcome:
- Before taking any potentially drastic actions, the AI will check with humans whether its intended course of action meets with their approval
- On account of its generally high intelligence, the AI will realise that its behaviour would have detrimental consequences, and therefore it will avoid taking that course of action.
Let’s look at each of these suggestions in turn.
If an AI is unsure whether a proposed course of action will meet with human approval, how about asking humans, in advance, whether to go ahead with it?
Unfortunately, humans frequently disagree on ethical calculations. Different humans have divergent opinions on matters such as abortion, divorce, transgender rights, taxation and redistribution, limits on economic activities, free speech, compulsory vaccinations, positive discrimination, animal welfare, nuclear weapons, and much more besides.
It’s conceivable that, if an AI asked humans for approval over a proposed course of action, humans would fail to come anywhere close to a clear conclusion.
In any case, the AI may determine that an answer is needed more quickly than the time required for any lengthy human deliberation.
Could the AI nevertheless infer the answer that humans would give, as a result of it studying works of great literature, philosophy, and religious devotion?
Again, the problem is that humans disagree. Even a simple Bible verse can be interpreted in multiple divergent ways. And the various human writings over the centuries contain lots of incompatible ideas.
No automatic super ethics (again)
The final step in this discussion sequence is the suggestion that, if an AI is smart enough to figure out how to carry out a challenging and difficult task, it will also be smart enough to figure out whether humans would approve of that action.
In other words: a superintelligent AI is going to know in advance whether the actions it takes will harm humanity or benefit humanity.
Unfortunately, things are not so straightforward. It may be clear in advance that a course of action will result in some proxy metrics scoring well, without it being clear that humanity would end up being harmed in some other way.
Let’s consider some examples:
- A geo-engineering intervention, injecting particles high in the stratosphere, could have the apparently welcome effect of reducing the average global temperature – even though side-effects result such as increased flooding in some parts of the world, and other extreme weather phenomena elsewhere
- The adoption of a particular configuration of defence missiles could have the apparently welcome effect of reducing the likelihood of a successful first strike attack by an enemy force – even though the missile configuration adopted has unstable elements that risk an escalation of unintentional explosions
- The creation of new variants of viruses, in “gain of function” research, may usefully increase the quantity of knowledge possessed about these viruses and potential remedies – even though the existence of these deadly new viruses poses a security risk, and could lead to huge numbers of deaths following a “lab leak” incident
- A new medical treatment to reduce the pace of transmission of a deadly new pathogen could, as a horrific side-effect, interfere with the mechanisms in people’s brains that give rise to conscious experiences.
Moreover, even if the AI calculates in advance that humanity will be harmed, it may still proceed with the course of action. That’s because:
- That course of action scores well in the explicit metrics used by the AI to determine its actions
- The nature of the harm experienced by humans lies outside of the set of metrics with which the AI is concerned
- It is the explicit metrics that drive decisions, rather than imprecise informal ones.
Other options for answers?
To recap: AI poses threats of catastrophic risks, due to the combination of the AI Control Problem and the AI Alignment Problem.
There is no simple answer to this situation. That’s why the Singularity Principles include 21 different recommendations.
However, some researchers advocate, explicitly or implicitly, one of four other types of solution:
- The idea that the free market will automatically select AI systems that deliver beneficial solutions to consumers, rather than detrimental ones
- The idea that, via the eye of faith, we can perceive some grander pattern at work, under the architecture of a being or force known variously as God, karma, Gaia, the simulator, or the Law of Accelerating Returns, which guarantees that no global catastrophe will ensue
- The idea of a “backup” option for humanity on a different planet, such as Mars, which would remain intact even if AI caused catastrophic harm on Earth
- The idea that humans with enhanced neural processing – with their brains connected to silicon processing chips – will be able to remain as intelligent as any AI system, and would therefore be able to control these systems.
Problems with these four suggested solutions are reviewed in the next chapter.