Measuring progress toward AGI
Estimates of progress toward the creation of AGI have a colourful history.
It’s easy to find past examples when AGI was predicted to happen within a short space of time.
For example, two pioneering AI researchers, Herbert Simon and Allen Newell, made the following predictions at a conference in Pittsburgh in November 1957 for events that would happen “within the next ten years” (that is, by 1967):
1. A digital computer will be the world’s chess champion
2. A digital computer will discover and prove an important new mathematical theorem
3. A digital computer will write music that will be accepted by critics as possessing considerable aesthetic value
4. Most theories in psychology will take the form of computer programs, or qualitative statements about the characteristics of computer programs
From the vantage point of the 2020s, the timescale elements of these predictions appear ridiculous. That’s despite the fact that both Simon and Newell had distinguished academic careers. The two shared the Turing Award in 1975, bestowed by the Association for Computing Machinery (ACM) for “contributions to artificial intelligence, the psychology of human cognition, and list processing”, and Simon went on to win the Nobel Prize for Economics in 1978. Evidently, neither of these awards guarantees the soundness of your predictions about the future.
It’s also easy to find past examples in which apparent experts were sure that various landmark accomplishments would not take place within the foreseeable future, yet AI progress reached those levels regardless.
For example, consider a survey published in May 2014 by Wired journalist Alan Levinovitz of expert opinions about when computers would be able to outperform the best humans at the game of Go. The different experts broadly split into three groups. AI researchers tended to think the task would take “maybe ten years”. Professional Go players were more sceptical: the full depths of Go gameplay would more likely resist AI mastery for around twenty years. A third group suggested the task would forever exceed the ability of any mechanical brain – that some kind of “wall” was about to be reached, defying further progress.
However, any simple projection of previous trends was to prove misleading. A significant new disruption arrived, namely the programming methods utilised by Google’s London-based DeepMind subsidiary. Rather than taking ten or even twenty years, the breakthrough took less than two, culminating in March 2016 in an emphatic 4-1 victory by the AlphaGo software over human Go-playing legend Lee Sedol in Seoul, South Korea. All the apparent domain experts surveyed by Levinovitz just two years earlier proved to be over-pessimistic by a factor of at least five.
I have heard that some of these forecasters subsequently remarked they never expected that any one organisation would apply to this problem the vast scale of resources that Google chose to deploy. Such a possibility lay outside of what was presumed to be the landscape of plausible scenarios. That misperception is a reminder to all of us to appreciate the potential multiplicative effect of changed human outlook. As in a time of war, or any other major crisis, coordinated human activity can produce results that transcend previous expectations.
Aggregating expert opinions
A number of organisations run projects to aggregate expert forecasts on possible future occurrences. In the systems used by these organisations, greater weight is placed on the forecasts from contributors who have gained good reputation scores following their previous forecasts, and through community ratings of the quality of the explanations they offer in justification of their forecasts.
One such organisation is Metaculus. One of its founders, physicist Anthony Aguirre, has explained the reasoning behind it with the following example:
Take the probability that the US will become engaged in a nuclear war. It’s (we hope!!) quite small. But how small? There are many routes by which a nuclear war might happen, and we’d need to identify each route, break each route into components, and then assign probabilities to each of these components…
Each of these component questions is much easier to address, and together can indicate a reasonably well-calibrated probability for one path toward nuclear conflict. This is not, however, something we can generally do ‘on the fly’ without significant thought and analysis.
What if we do put the time and energy into assessing these sequences of possibilities? Assigning probabilities to these chains of mutually exclusive possibilities would create a probability map of a tiny portion of the landscape of possible futures. Somewhat like ancient maps, this map must be highly imperfect, with significant inaccuracies, unwarranted assumptions, and large swathes of unknown territory. But a flawed map is much better than no map!
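The arithmetic behind such a probability map can be sketched in a few lines of Python. This is only an illustration: the route names and every number below are hypothetical placeholders rather than estimates of anything, and the sketch assumes that each step's probability is conditional on the earlier steps in its route having occurred, and that the routes are mutually exclusive.

```python
from math import prod

# Hypothetical routes toward some outcome. Each route is a chain of steps,
# and each number is the assumed probability of that step given that the
# previous steps in the same route have already happened.
# All values are illustrative placeholders, not real estimates.
routes = {
    "route_a": [0.10, 0.20, 0.05],
    "route_b": [0.02, 0.50],
}


def route_probability(steps):
    """Probability that every step in one chain occurs (chain rule)."""
    return prod(steps)


def total_probability(all_routes):
    """Overall probability, assuming the routes are mutually exclusive."""
    return sum(route_probability(steps) for steps in all_routes.values())


for name, steps in routes.items():
    print(name, route_probability(steps))
print("total", total_probability(routes))
```

If the routes could overlap, a simple sum would overstate the total, which is one reason such a map remains imperfect.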
Aguirre then set out what would be involved in creating a probability map for forecasts of the sort he envisaged:
First, it would take a lot of people combining their knowledge and expertise. The world – and the set of issues at hand – is a very complex system, and even enumerating the possibilities, let alone assigning likelihoods to them, is a large task. Fortunately, there are good precedents for crowdsourced efforts: Wikipedia, Quora, Reddit, and other efforts have created enormously valuable knowledge bases using the aggregation of large numbers of contributions.
Second, it would take a way of identifying which people are really really good at making predictions. Many people are terrible at it – but finding those who excel at predicting, and aggregating their predictions, might lead to quite accurate ones. Here also, there is very encouraging precedent. The Aggregative Contingent Estimation project run by IARPA, one component of which is the Good Judgement Project, has created a wealth of data indicating that (a) prediction is a trainable, identifiable, persistent skill, and (b) by combining predictions, well-calibrated probabilities can be generated for even complex geopolitical events.
Finally, we’d need a system to collect, optimally combine, calibrate, and interpret all of the data. This was the genesis of the idea for Metaculus…
Since the public launch of Metaculus in 2015, the project has attracted a significant community of dedicated forecasters, and it regularly publishes updates on the collective performance of the site.
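The idea of giving greater weight to forecasters with stronger track records can be illustrated with a simple weighted average. This is a minimal sketch under stated assumptions: it is not the method Metaculus actually uses, and the function name, the reputation weights, and the sample forecasts are all hypothetical.

```python
def weighted_forecast(forecasts):
    """Combine (probability, reputation_weight) pairs into one estimate.

    A plain weighted mean: an assumed, simplified stand-in for whatever
    aggregation scheme a real forecasting platform applies.
    """
    total_weight = sum(weight for _, weight in forecasts)
    if total_weight <= 0:
        raise ValueError("at least one forecast needs a positive weight")
    return sum(p * weight for p, weight in forecasts) / total_weight


# Hypothetical forecasts: the first contributor has a stronger track
# record, so their estimate counts three times as much as the others.
forecasts = [(0.10, 3.0), (0.30, 1.0), (0.20, 1.0)]

print(weighted_forecast(forecasts))                    # reputation-weighted
print(sum(p for p, _ in forecasts) / len(forecasts))   # unweighted mean
```

With these placeholder numbers, the weighted estimate sits closer to the first contributor's forecast than the unweighted mean does.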
Metaculus has a number of questions regarding the timing of the advent of AGI.
One such question is “Date Weakly General AI is Publicly Known”, with the following definition:
- Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.
- Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the “Winogrande” challenge or comparable data set for which human performance is at 90+%
- Be able to score 75th percentile (as compared to the corresponding year’s human students; this was a score of 600 in 2016) on all the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpuses of math problems is fair game as long as they are arguably distinct from SAT exams.)
- Be able to learn the classic Atari game “Montezuma’s revenge” (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play.
As of June 2022, 497 forecasters had submitted a total of 1,540 predictions of that date. The community average prediction at that point was October 2028.
Another Metaculus question sets a more demanding definition of AGI: “Date of Artificial General Intelligence”. The criteria for this prediction are:
- Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An ‘adversarial’ Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of an AI passing such a Turing test, or one that is sufficiently similar, will be sufficient for this condition, so long as the test is well-designed to the estimation of Metaculus Admins.
- Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model. A single demonstration of this ability, or a sufficiently similar demonstration, will be considered sufficient.
- High competency at a diverse fields of expertise, as measured by achieving at least 75% accuracy in every task and 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al.
- Able to get top-1 strict accuracy of at least 90.0% on interview-level problems found in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al. Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected.
Again as of June 2022, 239 forecasters had submitted a total of 581 predictions of this date. The community average prediction in this case was April 2038.
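The third criterion above combines two thresholds: a floor on every individual task and a higher bar on the mean. A short sketch shows how the two interact. The function name, task names, and accuracy figures here are hypothetical, and the sketch assumes the mean is an unweighted average over tasks.

```python
def meets_qa_criterion(task_accuracies, min_per_task=0.75, min_mean=0.90):
    """Check the two-part threshold: every task and the overall mean.

    task_accuracies maps a task name to an accuracy between 0 and 1.
    The mean is assumed to be a simple unweighted average over tasks.
    """
    if not task_accuracies:
        return False
    values = list(task_accuracies.values())
    mean_accuracy = sum(values) / len(values)
    return min(values) >= min_per_task and mean_accuracy >= min_mean


# Hypothetical results, for illustration only.
strong_everywhere = {"task_a": 0.95, "task_b": 0.97, "task_c": 0.80}
one_weak_task = {"task_a": 0.99, "task_b": 0.99, "task_c": 0.70}
low_mean = {"task_a": 0.85, "task_b": 0.88, "task_c": 0.90}

print(meets_qa_criterion(strong_everywhere))  # passes both thresholds
print(meets_qa_criterion(one_weak_task))      # fails the per-task floor
print(meets_qa_criterion(low_mean))           # fails the mean threshold
```

The second case has the highest mean of the three yet still fails, which shows why the per-task floor makes the criterion harder to satisfy with narrow strengths.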
Neither of these two dates – 2028 or 2038 – should be regarded as somehow fixed or exemplary. Indeed, these two dates regularly change, as the community of Metaculus forecasters continues to take stock of the latest news and theories about AI performance.
The particular value of these forecasts comes from paying attention to significant changes in the dates predicted.
By the way, in case you’re shaking your head in disbelief at the prospect of AGI arriving as early (in various formats) as 2038 or even 2028 – in case you suspect the Metaculus community is dominated by crazed fantasists – here’s something to consider. It’s a brief list of examples of how a single conceptual breakthrough transformed an entire field of study, leading to progress much faster than the previous trend would have predicted:
- Once physicists Werner Heisenberg and Erwin Schrödinger had established the basics of quantum mechanics in 1925 and 1926, a flood of new results came thick and fast in the next few years. Another pioneer of that subject, Paul Dirac, later referred to this period as a “golden age in theoretical physics”: “For a few years after that it was easy for any second rate student to do first rate work.”
- Earlier, the invention of calculus by Isaac Newton and Gottfried Leibniz, and the formulation of Newton’s second law of motion (“F=ma”), had transformed the pace of progress in numerous areas of physics.
- In biology, a similar status is held by the principle of evolution via natural selection – although it took considerable time from the original formulation of that principle by Charles Darwin before its full explanatory significance was widely understood. Prominent geneticist Theodosius Dobzhansky put it like this in an essay in 1973: “Nothing in biology makes sense except in the light of evolution.”
- Once inventor James Watt had created steam engines that operated with sufficient efficiency, these engines were soon at work not only in coalmines (where they had originated, as a means to pump out water) but also in tin mines and copper mines. They were, moreover, soon revolutionising the operation of mills producing flour, cotton, paper, and iron, as well as distilleries that produced alcohol. Not long after that, steam engines gained a further lease of life once they were built into railway locomotives and paddle ships.
- The breakthrough success in computer vision in 2012 of the “AlexNet” convolutional neural network, designed by Alex Krizhevsky with support from Ilya Sutskever and Geoffrey Hinton, caused the field of neural networks to move rapidly from being of only fringe interest in the field of Artificial Intelligence, to take centre stage, transforming AI application after AI application – speech recognition, facial recognition, text translation, stock market prediction, and so on.
The possibility of similar “overhang breakthroughs” in AI research – perhaps arising from any of the items in the rich “supply pipeline” I listed in the chapter “The question of urgency” – means that AGI could arise considerably sooner than would be predicted from a cursory look at existing trends of progress.
In other words, the Metaculus predictions may not be so crazy after all!
Alternative canary signals for AGI
The Metaculus questions mentioned above include their own tests for how progress toward AGI can be recognised. Different researchers have made alternative proposals. That is to be welcomed. In each case, a discussion about the new milestones proposed can shed light on risks and opportunities ahead. Again in each case, when progress toward the proposed milestone is either faster than expected or slower than expected, it’s a reason for analysts to reflect on their assumptions and, probably, to revise them.
As a good example of a set of milestones that appear both clear and challenging, consider this recent proposal by AI researcher Gary Marcus, which lists five possible canary signals:
(1) Whether AI is able to watch a movie and tell you accurately what is going on. Who are the characters? What are their conflicts and motivations? etc.
(2) Whether AI is able to read a novel and reliably answer questions about plot, character, conflicts, motivations, etc.
(3) Whether AI is able to work as a competent cook in an arbitrary kitchen (extending Steve Wozniak’s cup of coffee benchmark).
(4) Whether AI is able to reliably construct bug-free code of more than 10,000 lines from natural language specification or by interactions with a non-expert user. (Gluing together code from existing libraries doesn’t count.)
(5) Whether AI is able to take arbitrary proofs from the mathematical literature written in natural language and convert them into a symbolic form suitable for symbolic verification.
AI index reports
A different approach to measuring progress with AI is taken by a number of organisations that regularly publish their own reports. These are worth reviewing for signs of change from year to year – and for the reasons given for these changes. Examples include:
- The AI Index published by Stanford University
- The Global AI Index published by Tortoise Media
- The AI Index Report published by the OECD
A potential shortcoming of these reports is the extent to which they prioritise descriptions of current AI capabilities (and the threats and opportunities arising), rather than forecasts of potential future developments.
However, better foresight generally arises from better hindsight. Accordingly, a sober assessment of how AI capabilities have improved in the recent past is a vital input to deciding the best advice on the management of future AI capabilities.