**THOUGHTS ON AGI ASSESSMENT AND THE ARC BENCHMARK**

![](arc.png)

I have been thinking about the ARC benchmark for AGI a bit. The Abstraction and Reasoning Corpus, developed by Francois Chollet, is supposed to be a test of intelligence "modulo memorization". However, I am not convinced it is measuring general intelligence. The reason is that it is a single contrived benchmark, while general intelligence is more properly defined as the correlation between multiple separate capabilities across a diverse population of models.

To estimate a meaningful quantitative measure of general intelligence, we would need techniques similar to, or in some way inspired by, structural equation modeling (SEM) and factor analysis (FA), which were developed in psychometrics for the purpose of assessing human cognitive abilities. Alternatively, we should at least perform a spectral analysis of correlations across a (preferably large) number of models on a diverse set of benchmarks.

Recently, I have been involved in some efforts to measure the general intelligence of LLMs using spectral analysis, and the results suggest that general intelligence is highly correlated with scale across model classes. This has been known for some time, but a rigorous framing further quantifies these general observations. Below is a figure showing how general model capabilities (extracted using spectral analysis) depend on the total compute used to derive the model weights. Training FLOPs is a measure of scale that combines model size and the amount of training data into an estimate of the total floating point operations required to train the model. Clearly, scaling improves general intelligence (as measured by diverse benchmark correlations across multiple model classes). (The "bitter lesson" (cf. Richard Sutton's writings) continues to apply in the context of current SoTA LLMs.)

![](capabilities_scale_dark.png)

Furthermore, comparing base models to their instruction-tuned versions reveals further insights into this relationship (a paper on this with more details is in progress, but I won't say much more here because it is part of another project that is still underway). Some benchmarks naturally improve with scale across all model classes. Others, however, become correlated with scale only after tuning techniques such as RLHF are applied to the base models. The figure below summarizes this idea. (To make claims about "general intelligence" we need a diverse collection of models and benchmarks, as well as a proper quantitative analysis of the relevant correlations. Some benchmarks are solved with scale only after proper tuning techniques are applied.)

![](spectral_benchmarking_dark.png)

It is possible that ARC becomes scale-dependent after some yet-to-be-developed tuning techniques are applied to currently existing LLMs. If that is the case, ARC will eventually be solved with scale at future levels of emergent phase transitions, and no ARC-specific work will need to be done.

The general issue with the ARC benchmark is that it makes extraordinary claims without extraordinary evidence. If we accept, following Chollet, that it is the only currently existing "test for AGI", as he has repeated on various podcasts, then we are saying that all of the other existing benchmarks are orthogonal to general intelligence. (Francois: I apologize if this is a misinterpretation of what you meant. If you read this post, feel free to comment with a correction and I will edit this part.)
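As an aside, to make the spectral-analysis idea above concrete, here is a minimal sketch of the kind of computation I have in mind. Everything in it is a stand-in: the score matrix and all its values are invented for illustration. The leading eigenvector of the benchmark correlation matrix plays the role of a general-capability ("g") direction, and each model's projection onto it gives its g-score.

```python
import numpy as np

# Hypothetical score matrix: rows = models, columns = benchmarks.
# All values are made up; in practice this would be assembled from
# published evals of many models across many model classes.
scores = np.array([
    [0.45, 0.20, 0.15, 0.70, 0.05],
    [0.60, 0.35, 0.30, 0.80, 0.07],
    [0.70, 0.55, 0.45, 0.85, 0.10],
    [0.80, 0.75, 0.60, 0.90, 0.12],
    [0.86, 0.88, 0.75, 0.93, 0.15],
])

# Standardize each benchmark so correlations, not raw scales, drive the result.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Benchmark-by-benchmark correlation matrix and its spectrum.
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The leading eigenvector is the candidate "general capability" direction.
g_direction = eigvecs[:, 0]
if g_direction.sum() < 0:  # eigenvector sign is arbitrary; make loadings positive
    g_direction = -g_direction

print("variance explained by g:", eigvals[0] / eigvals.sum())
print("benchmark loadings on g:", np.round(g_direction, 2))
print("per-model g-scores:", np.round(z @ g_direction, 2))
```

On real data, the per-model g-scores are the kind of quantity plotted against training FLOPs in the capabilities figure above.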
Furthermore, we are making this extraordinary claim without proper psychometric or otherwise statistically sound analysis. Instead, I suggest two alternative interpretations of the underperformance of SoTA LLMs on ARC.

First, it is strongly plausible that ARC is biased towards our unique evolutionary "pre-training". Humans are not initialized randomly (as LLMs are) but rather come with a smart initialization distilled over billions of years from amounts of data many orders of magnitude larger, and incomparably more complex, than what current SoTA AI systems are trained on. It is actually surprising that LLMs appear so convincingly intelligent given these constraints (which can itself be interpreted as a sign of intelligence). I doubt a "randomly initialized human neocortex in a vat", further deprived of most senses and given only a sparse and limited signal to learn from, could come anywhere close. Underperformance on a particular benchmark does not by itself suggest a lack of general intelligence.

As an analogy, consider the "cognitive trade-off hypothesis" developed by researchers at the Primate Research Institute at Kyoto University in Japan. The institute is famous for developing cognitive "benchmarks" on which chimpanzees outperform humans by a large margin. These benchmarks exploit the particular differences between how human and chimpanzee brains process information. The cognitive trade-off hypothesis suggests that our linguistic ability is the cause of our underperformance on these tasks. ARC could be measuring something similar, in the sense that AI underperformance could be explained by the benchmark's bias towards peculiarities of how the human brain processes the type of information represented by ARC tasks. We would need a much larger variety of independent tasks to make any claims about AGI. (Humans underperform relative to other primates on certain contrived benchmarks, such as this sequence memorization puzzle.)

![](cognitive_tradeoff.jpg)

Second, it is equally possible that ARC measures an independent factor of general intelligence (cf. FA in psychometrics). Without a proper process of SEM and FA, we cannot make any strong claim that LLMs are not generally intelligent at all and that we are making zero progress towards AGI by scaling GPT architectures, as the messaging coming from the ARC challenge seems to suggest. It is likely that scaling GPT does lead to further emergence at scale, and that we are on the right track towards AGI. It is more plausible that these models sit somewhere on a spectrum than that they are completely missing some magical ingredient that can only be measured with a single contrived benchmark.

However, I do think that ARC is making contributions to AGI, either by revealing human evolutionary biases that might be useful for AI to learn, or by uncovering a second factor of intelligence that is orthogonal to the "g-factor" estimated by the remaining benchmarks. It is unlikely that ARC is the only direction in benchmark space associated with "true general intelligence". Solving ARC by itself will not mean AGI is achieved. We would need many more separate benchmarks that measure capabilities in the component that ARC belongs to. The good thing about ARC is that it encourages the creation of new benchmarks uncorrelated with the currently measured capabilities.
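To illustrate what testing the "second factor" interpretation could look like: with enough models and benchmarks, one could fit a two-factor model and check whether ARC-style tasks load on a factor orthogonal to the one carrying the usual benchmarks. Below is a minimal sketch on synthetic data (all numbers invented, with the orthogonal structure planted by construction); scikit-learn's FactorAnalysis with a varimax rotation stands in for a full SEM/FA pipeline.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic scores for 200 hypothetical models on 6 benchmarks.
# Benchmarks 0-4 share a general factor; benchmark 5 ("ARC-like")
# is driven mostly by an independent second factor.
g = rng.normal(size=200)           # general-capability factor
arc_factor = rng.normal(size=200)  # hypothesized orthogonal factor
noise = rng.normal(scale=0.3, size=(200, 6))
scores = np.column_stack([
    0.9 * g, 0.8 * g, 0.85 * g, 0.7 * g, 0.75 * g,  # "usual" benchmarks
    0.2 * g + 0.9 * arc_factor,                     # "ARC-like" benchmark
]) + noise

# Fit a two-factor model; varimax rotation makes the loadings interpretable.
fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(scores)

# Rows = factors, columns = benchmarks. If the ARC-like column loads on
# one factor while the rest load on the other, ARC is measuring something
# statistically orthogonal to the usual g-direction.
print(np.round(fa.components_, 2))
```

If real evaluation data showed this loading pattern, it would support the second-factor reading; if ARC instead loaded on the same factor as everything else (perhaps only after tuning), the scale-dependence reading would win.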
I hope the AI community develops a diverse set of benchmarks with properties similar to ARC (e.g. not following scaling laws) that allow us to get a fuller picture of the spectrum of general intelligence. Solving ARC will not be enough to make any strong claims about AGI, but it is a step in the right direction. General intelligence, especially one that followed a different developmental path from ours, is most likely far more complex and exotic than we can currently imagine. We need much greater diversity in benchmark design, coupled with more rigorous quantitative analysis of the various correlations that arise: correlations between model classes, benchmark types, model scale, training data, and more.
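As a final concrete example of the kind of correlation analysis I mean, here is a sketch of regressing extracted g-scores on log10 training FLOPs. The FLOP counts and scores are invented, and `g_scores` stands for the kind of quantity computed in the first sketch above.

```python
import numpy as np

# Hypothetical per-model data: training FLOPs and extracted g-scores
# (both invented; in practice g_scores would come from the spectral analysis).
train_flops = np.array([1e21, 3e21, 1e22, 5e22, 2e23, 1e24])
g_scores = np.array([-1.8, -1.1, -0.3, 0.4, 1.0, 1.8])

# Least-squares fit of g-score against log10(FLOPs).
x = np.log10(train_flops)
slope, intercept = np.polyfit(x, g_scores, 1)
r = np.corrcoef(x, g_scores)[0, 1]

print(f"g = {slope:.2f} * log10(FLOPs) + {intercept:.2f}  (r = {r:.2f})")

# A benchmark that tracks g should show a similar fit on the same axis;
# an ARC-like benchmark that does not follow scaling laws would show
# a correlation r close to zero.
```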