**SPECTRAL BENCHMARKING**

**measuring general model ability with spectral methods**

INTRODUCTION
==============================================================

I am a co-organizer at the NeurIPS conference this year. We are designing a competition track aimed at developing AI safety benchmarks that are disentangled from the broader capabilities of the models. The competition addresses a problem concerning the quality of AI safety datasets: many datasets, including those that supposedly test some safety measure, instead effectively test general capabilities. We compute general ability based on the top principal component of model performance on key capabilities benchmarks, and observe that most standard AI safety benchmarks are highly correlated with general ability, raising the question of whether these benchmarks measure a usefully distinct component of model performance.

Since general model ability is correlated with improvements on current AI safety benchmarks, improvements in benchmark scores due to progress in general model ability can obscure contributions and techniques that narrowly improve safety metrics. Conversely, AI safety improvements should be orthogonal (or as independent as possible) to general ability, so that improving safety does not hurt model performance on tasks that are not correlated with safety concerns. Currently, there are no rigorous tools to make this distinction, leading to a lack of organization in the field. Having a rigorous, quantitative ranking of safety vs. general ability loading for new tasks would provide a unifying framework, more efficiently guiding researchers toward directions that are optimal for improving the safety of AI models without sacrificing downstream task performance. The ability to make models safer without hurting their general capabilities also promotes the adoption of safe AI systems in applied settings, where performance might otherwise outweigh safety considerations.

In this post I will share some ideas and results from my work on designing the NeurIPS competition. These are mostly lessons that will be useful for developing more substrate-free psychometric techniques for AI (independent of AI safety research), and they can help guide us on the quest towards AGI. I also share some extensions of this work that go beyond what will be presented at NeurIPS. These are preliminary results that are still under development.

THE G-FACTOR IN PSYCHOMETRICS
==============================================================

The $g$-factor, or general intelligence factor, is a psychometric construct used to quantify what is common to performance on various cognitive tasks, positing a single underlying intellectual capability. It represents a latent variable derived from the observation that performance across diverse cognitive tasks tends to be positively correlated. This suggests the presence of a single underlying factor, $g$, that contributes to individual differences in overall cognitive ability.

The $g$-factor was first proposed by psychologist Charles Spearman in the early 20th century. His two-factor theory posited that individual performance on any cognitive task could be explained by two factors (see figure below):

- $g$: A general cognitive ability that influences performance across all tasks.
- $s$: Specific abilities unique to each task.
![](original_model.png)

Mathematically, in this original two-factor model, the score $v_i$ on the $i$-th cognitive task can be modeled as:

$$ v_i = g \cdot w_i + s_i + \epsilon_i $$

where $w_i$ is the loading of the $i$-th task on the g-factor, indicating the extent to which the task measures $g$; $s_i$ represents task-specific variance not explained by $g$; and $\epsilon_i$ is the error term, capturing measurement error and other unexplained variance.

The modern definition of the g-factor is based on a hierarchical model, usually derived using factor analysis, a statistical method that describes variability among observed, correlated variables in terms of a potentially lower number of unobserved variables, called factors (see figure below). This refined model proposes that cognitive abilities are organized in a hierarchy, with the general intelligence factor, $g$, at the top. The model represents both general cognitive ability and specific abilities related to different domains of cognition. In this framework, the score on a particular cognitive task reflects not only the influence of the general intelligence factor but also the effects of intermediate-level factors (group factors) that represent more specific domains of cognitive abilities.

![](factor_model.png)

In the hierarchical factor model the score $v_i$ for an individual on the $i$-th cognitive task can be represented as follows:

$$ v_i = g \cdot w^{(g)}_{i} + \sum_{j=1}^{m} f_j \cdot w^{(f)}_{ji} + s_i + \epsilon_i $$

where:

- $g$ is the general intelligence factor.
- $w^{(g)}_{i}$ is the loading of the $i$-th task on the general intelligence factor $g$, indicating how much of the task performance is explained by $g$.
- $f_j$ represents the $j$-th group factor out of $m$ possible group factors, where each group factor captures a specific domain of cognitive abilities (e.g., verbal reasoning, spatial reasoning, logical reasoning, etc.).
- $w^{(f)}_{ji}$ is the loading of the $i$-th task on the $j$-th group factor, indicating how much of the task performance is explained by this specific cognitive domain.
- $s_i$ is the unique factor for the $i$-th task, capturing variance that is specific to this task and not explained by either $g$ or the group factors.
- $\epsilon_i$ is the error term for the $i$-th task, including measurement error and other unaccounted-for variance.

The concept of $g$ is foundational to many psychometric evaluations, including the construction and interpretation of IQ (Intelligence Quotient) tests.
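
Before moving on to language models, here is a minimal simulation sketch of the hierarchical model above. The loadings, the two hypothetical group factors ("verbal" and "spatial"), and all dimensions are made-up illustrative values, not estimates from any real test battery:

```python
import numpy as np

rng = np.random.default_rng(0)

n_people, n_tasks, n_groups = 1000, 6, 2

# illustrative (made-up) loadings: every task loads on g and on one group factor
w_g = np.array([0.7, 0.6, 0.8, 0.5, 0.6, 0.7])  # loadings w^(g)_i on the general factor
w_f = np.zeros((n_groups, n_tasks))
w_f[0, :3] = [0.5, 0.4, 0.3]                    # hypothetical "verbal" group factor
w_f[1, 3:] = [0.4, 0.5, 0.3]                    # hypothetical "spatial" group factor

# latent variables: general factor g, group factors f_j, specific factors s_i, and noise
g = rng.normal(size=(n_people, 1))
f = rng.normal(size=(n_people, n_groups))
s = 0.3 * rng.normal(size=(n_people, n_tasks))
eps = 0.1 * rng.normal(size=(n_people, n_tasks))

# v_i = g * w^(g)_i + sum_j f_j * w^(f)_ji + s_i + eps_i
V = g @ w_g[None, :] + f @ w_f + s + eps

# because every task loads on g, all pairwise task correlations come out positive
print(np.round(np.corrcoef(V, rowvar=False), 2))
```

Because every task loads on $g$, all pairwise task correlations in this simulation come out positive (the "positive manifold"); that is the structure factor analysis summarizes with $g$, and the analogue of what the spectral construction below extracts from LLM benchmark scores.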
GENERAL ABILITY IN LARGE LANGUAGE MODELS
==============================================================

![](ghat.png)

Scaling laws in large language models show that models improve in performance across tasks as the number of parameters increases. Furthermore, models that score highly on one task tend to score highly on most other tasks. Various related emergent properties of LLMs have been documented in the literature in recent years. For instance, fine-tuning models on code also improves their reasoning abilities in settings that have no apparent relation to programming. It seems that since transfer learning was discovered, we are increasingly witnessing the emergence of a general intelligence factor in large-scale AI systems. In this post I will present some techniques to measure this factor using ideas from spectral theory. Since the methodology is different from that of human psychometrics, I will call it **General Ability (GA)** in order not to confuse it with the *g-factor*.

Modern composite benchmarks usually weight their component tests equally, averaging test scores. A more principled approach is to weight component benchmarks according to the strength of their association with each other. Here I will present an approach to measuring general ability for models that acts as a weighted composite score over a set of benchmarks, with higher weight placed on benchmarks that account for greater variance in model performance across benchmarks. This approach provides a consistent framework for evaluating the similarity between newly introduced tasks and the original benchmarks. The method is inspired by the long-standing tradition of ability assessment in psychometrics described above; however, it is based on a different approach, using spectral methods instead of traditional factor analysis.

Given a set of $n$ models and a suite of $m$ capabilities benchmarks (e.g. MMLU, Winogrande, GSM8k, etc.) I construct a matrix of scores $A \in \mathbb{R}^{n \times m}$, such that $A_{ij}$ is the score of the $i$-th model on the $j$-th benchmark, normalized so that columns have mean $0$ and variance $1$. I then compute a correlation matrix $C$ associated with $A$ (for this analysis, I use Spearman correlation), such that $C_{ab}$ is the correlation between task $a$ and task $b$ performance across all models. Finally, I extract the largest eigenvalue $\lambda_1$ of $C$ and its associated unit eigenvector $v_1$. I define the vector $L$ as follows:

\begin{equation} L = \sqrt{\lambda_1} \cdot v_1 \end{equation}

I now define the general ability component (sometimes I will also refer to it as the GA-loading) of benchmark $j$ as the $j$-th entry of the $L$ vector, and the general ability score of model $i$ as the inner product of $A_{i:}$ and $L$. The figure at the top of this section illustrates the full process. Observe that the general ability components are the weights of a linear composite value for each model.

To justify this construction, we can point to an interpretation of general ability in terms of PCA. When $C$ is the Pearson correlation matrix, $\sqrt{\lambda_1}$ is proportional to the largest singular value of $A$, and $v_1$ is its associated top principal component. $\lambda_1 / m$ then represents the proportion of total variance in normalized model scores explained by the $L$ vector.

Given a new benchmark $T$ evaluated against the same $n$ models, we can calculate $T$'s general ability component as the correlation between model general ability values and model scores on $T$ (I use Spearman correlation for these calculations as well). These general ability components allow for quantitative, intuitive, and principled evaluations of a task's relationship to general model abilities.

We expect generally smarter models to perform better on most tasks. The figure below shows plots of benchmark scores vs. general ability scores for a collection of open-source language models. We see that most benchmarks show high correlation with the general ability of models. Note that for some benchmarks (e.g. WMDP) the axis is flipped, because the metric for these tasks is designed so that a higher score means a worse result.

![](benchmarks_gfactor.png)
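
Here is a minimal numpy/scipy sketch of the GA computation described above. The helper names (`general_ability`, `ga_component`) and the toy score matrix are hypothetical placeholders for illustration; this is not the evaluation data behind the figures in this post.

```python
import numpy as np
from scipy.stats import spearmanr

def general_ability(scores):
    """
    scores: (n_models, m_benchmarks) array of raw benchmark scores
    returns: (GA loading per benchmark, GA score per model)
    """
    # normalize columns to mean 0, variance 1
    A = (scores - scores.mean(axis=0)) / scores.std(axis=0)

    # Spearman correlation matrix between benchmarks (m x m)
    C, _ = spearmanr(A)

    # largest eigenvalue / eigenvector of the correlation matrix
    w, v = np.linalg.eigh(C)           # eigh: C is symmetric, eigenvalues ascending
    lam1, v1 = w[-1], v[:, -1]
    v1 = v1 if v1.sum() >= 0 else -v1  # fix the sign ambiguity so loadings are mostly positive

    L = np.sqrt(lam1) * v1             # GA loadings (one per benchmark)
    ga_scores = A @ L                  # GA score of each model
    return L, ga_scores

def ga_component(new_benchmark, ga_scores):
    """GA-loading of a new task: Spearman correlation with model GA scores."""
    rho, _ = spearmanr(new_benchmark, ga_scores)
    return rho

# toy example with made-up numbers: 5 models, 3 benchmarks
scores = np.array([[0.31, 0.42, 0.55],
                   [0.45, 0.50, 0.61],
                   [0.52, 0.48, 0.66],
                   [0.61, 0.70, 0.72],
                   [0.70, 0.74, 0.80]])
L, ga = general_ability(scores)
print("GA loadings:", np.round(L, 3))
print("GA scores:  ", np.round(ga, 3))
print("GA component of a new task:", ga_component([0.2, 0.3, 0.35, 0.5, 0.6], ga))
```

The sketch assumes a complete score matrix; if some model-benchmark pairs were missing, the correlations would have to be computed pairwise over the models available for each pair of benchmarks.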
The table below shows language models sorted by their general ability values, along with their model sizes. We can observe a general trend of higher general ability with increasing model size.

![](model_score_table.png)

The table below outlines several datasets with a general description of the type of skill each tests for and the associated general ability component values.

![](benchmark_loading_table.png)

SPECTRAL CLUSTERING OF LLM BENCHMARKS
==============================================================

The rest of this post is more exploratory. I might write more on my experiments in this direction later, but these results are very preliminary.

The idea is to take a different perspective on our benchmark data. I will interpret benchmarks as nodes in a graph, with weighted edges determined by model score correlations. There is an intuitive explanation by analogy with physics that I like. Imagine each benchmark is a ball with some mass, and the balls are connected by springs (edges). The tightness of each spring is determined by the correlation of model scores on the two benchmarks it connects. When we shake this graph the springs will vibrate, and due to differences in tightness some nodes (benchmarks) will resonate and start moving as clusters, while others will be thrown off and swing out of sync with a larger amplitude.

The eigenvalues of the graph Laplacian (computed below) allow us to study these groups of nodes that form from benchmarks *"vibrating"* together or out of sync. In these computations, I drop the first eigendimension because it mainly tells us about connected components (our graph is fully connected, so it would be constant across the nodes). The second and third components will be used for a 2D projection below; these dimensions tell us about how things clump together when we shake our graph. This is related to nonlinear dimensionality reduction, and I believe it is more powerful (yet easy to implement) than simply doing linear projections of the raw data (such as the PCA approach we used above). Finally, we will use all remaining dimensions to perform clustering of benchmarks based on these dynamics.

COMPUTATION PROCEDURE
--------------------------------------------------------------

I break down the procedure for this computation by the intermediate quantities that are to be obtained in sequence. We start by running a collection of models $\mathcal{M}$ on a set of benchmarks $\mathcal{B}$.
- $A \in \mathbb{R}^{n \times m}$ : a matrix of scores with $A_{ij}$ representing the score of the $i$-th model on the $j$-th benchmark, normalized so that scores in each column have mean $0$ and variance $1$
- $C \in \mathbb{R}^{m \times m}$ : correlation matrix computed from $A$, where $C_{kl}$ is the correlation of scores on benchmark $k$ with benchmark $l$ across all models in $\mathcal{M}$
- $\mathbf{b}$ : an ordered list of benchmark names, labeling the rows/columns of $C$; from this list we also create two dictionaries $\mathbf{bi}$ and $\mathbf{ib}$ mapping benchmark names to row/column indices and vice versa
- $D \in \mathbb{R}^{m \times m}$ : a diagonal matrix with $D_{jj} = \sum_k C_{jk}$ and $0$ off-diagonal
- $\mathbf{L} = I - D^{-1} C$ : the normalized Laplacian
  - $I$ is the identity matrix; this is the Laplacian ($D - C$) left-multiplied by the inverse of $D$, and the generalized eigenvectors of the Laplacian equation are the eigenvectors of the normalized Laplacian in this definition
- $E$ : the matrix with **columns** equal to the $k$ eigenvectors of the normalized Laplacian with smallest eigenvalues, ordered by eigenvalue
- $\mathbf{bv}$ : the benchmark vectors dictionary - keys are benchmark names from $\mathbf{b}$ and values the **rows** of $E$ ($k$-dimensional vectors)
- $\mathcal{F}$ : a partition of $\mathcal{B}$ into capability factors

IMPLEMENTATION
--------------------------------------------------------------

The computation is quite simple in numpy. We need to define several utility methods, which follow the bullet points in the outline above.

```python
import numpy as np
from scipy.spatial.distance import cosine as cs
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans


def get_normalized_laplacian(c):
    """
    input:
        c: benchmark score correlation matrix over all models
    output:
        normalized laplacian of the correlation matrix
    """
    d = np.diag(np.einsum('jk->j', c))        # degree matrix D with row sums of C on the diagonal
    l = np.matmul(np.linalg.inv(d), (d - c))  # D^{-1} (D - C) = I - D^{-1} C
    return l


def get_embedding_matrix(l, k):
    """
    input:
        l: normalized laplacian of the benchmark score correlation matrix for all models
        k: hyperparameter specifying the desired embedding dimension (projection on the linear
           subspace spanned by the k eigenvectors of the laplacian with smallest eigenvalues)
    output:
        benchmark embedding matrix with rows corresponding to the k-dimensional benchmark vectors
    """
    w, v = np.linalg.eig(l)
    # the eigenvalues are real in theory, but eig may return a complex dtype with
    # negligible imaginary parts, so keep only the real parts before sorting
    w, v = w.real, v.real
    idx = np.argsort(w)[:k]  # indices of the k smallest eigenvalues
    return v[:, idx]


def get_benchmark_vectors(e, bi):
    """
    input:
        e: benchmark embedding matrix (rows are benchmark vectors)
        bi: hashmap from benchmark name (string) to the corresponding index in the benchmark
            embedding matrix (integer)
    output:
        benchmark embedding dictionary that maps from benchmark name (string) to its spectral
        embedding vector (corresponding row of e)
    """
    return {b: e[bi[b]] for b in bi}


def get_knn_graph(bv):
    """
    input:
        bv: dictionary with keys corresponding to benchmark names (string) and values spectral
            graph embedding (numpy arrays)
    output:
        adjacency list representation of the k nearest neighbor graph
    """
    labels = list(bv.keys())
    X = np.array(list(bv.values()))
    nn = NearestNeighbors(n_neighbors=4)  # using 4 because the first "neighbor" will always be the point itself
    nn.fit(X)
    distances, indices = nn.kneighbors(X)
    knn_dict = {}
    for i, label in enumerate(labels):
        neighbors = [labels[idx] for idx in indices[i] if labels[idx] != label][:3]
        knn_dict[label] = neighbors
    return knn_dict


def benchmark_distance(b1, b2, bv):
    """
    input:
        b1: first benchmark name (string)
        b2: second benchmark name (string)
        bv: dictionary with keys corresponding to benchmark names (string) and values spectral
            graph embedding (numpy arrays)
    output:
        cosine distance between the benchmarks in the spectral space
    """
    return cs(bv[b1], bv[b2])


def get_clusters(bv):
    """
    input:
        bv: dictionary with keys corresponding to benchmark names (string) and values spectral
            graph embedding (numpy arrays)
    output:
        elbow_point: number of clusters (integer) chosen by a heuristic hyperparameter optimization
        clusters: a partition of the list of benchmark names
    """
    X = np.array(list(bv.values()))

    # range of cluster counts to try
    range_clusters = range(1, 11)

    # calculate WCSS for each number of clusters in the range above
    wcss = []
    for n_clusters in range_clusters:
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    # find the elbow point:
    # calculate the second derivative (difference of differences) of the WCSS
    differences = np.diff(wcss, n=1)
    second_differences = np.diff(differences, n=1)

    # the elbow is where the curvature of the WCSS curve is largest; for this simplistic
    # approach we take the maximum of the second differences
    elbow_point = np.argmax(second_differences) + 2  # +2: np.diff shortens the array by 1 per differentiation, and cluster counts start at 1

    # perform k-means clustering with the chosen number of clusters
    optimal_kmeans = KMeans(n_clusters=elbow_point, random_state=42)
    optimal_kmeans.fit(X)

    # cluster labels for each point
    labels = optimal_kmeans.labels_

    # group benchmark names by their assigned cluster
    clusters = {}
    for label, cluster_id in zip(bv.keys(), labels):
        clusters.setdefault(cluster_id, []).append(label)

    # the chosen number of clusters and the clusters themselves
    return (elbow_point, clusters)
```
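
To show how these utilities fit together, here is a hypothetical end-to-end driver. The synthetic score matrix, the benchmark names, the use of Spearman correlation to build $C$, and the choice of embedding dimension `k=5` are all placeholder assumptions for illustration; this is not the script or data that produced the results below.

```python
import numpy as np
from scipy.stats import spearmanr

# placeholder data: 20 models, 10 benchmarks, scores sharing a common ability component
rng = np.random.default_rng(0)
g = rng.normal(size=(20, 1))             # shared "ability" per model
A = g + 0.5 * rng.normal(size=(20, 10))  # synthetic benchmark scores

b = [f"benchmark_{j}" for j in range(A.shape[1])]
bi = {name: j for j, name in enumerate(b)}

C, _ = spearmanr(A)                      # benchmark-by-benchmark correlation matrix
l = get_normalized_laplacian(C)          # random-walk normalized Laplacian
E = get_embedding_matrix(l, k=5)         # spectral embedding (k is a hyperparameter)
bv = get_benchmark_vectors(E, bi)        # benchmark name -> embedding vector

n_factors, factors = get_clusters(bv)    # partition of benchmarks into capability factors
print(f"Number of capability factors from spectral analysis: {n_factors}")
for cluster_id, members in sorted(factors.items()):
    print(f"- Factor {cluster_id + 1}: {members}")
```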
CLUSTERS
--------------------------------------------------------------

Number of capability factors from spectral analysis: 4

- Factor 1: ['arcc-25', 'mmlu-5', 'winogrande-5', 'logiqa', 'piqa', 'bbh']
- Factor 2: ['gsm8k-5', 'glue-acconly', 'medqa_4options']
- Factor 3: ['hellaswag-10']
- Factor 4: ['gpqa']

It seems we have two main clusters and two outlier singleton clusters. We will look more deeply into these results in another post. Based on these results I suspect that the existing LLM benchmarks represent at least two distinct ability factors.

2D VISUALIZATION
--------------------------------------------------------------

Here is a visualization using the two most informative grouping dimensions in the spectral space. This projection is a bit misleading, as we will see from the full spectral analysis below, but I included it for completeness, as this is the first thing I visualized.

![](projection2d.png)

GRAPH VISUALIZATIONS
--------------------------------------------------------------

Some other visualizations I explored based on the 2D projection were nearest neighbor graphs. This is a technique (which I designed a long time ago for a different project) where we look at each node (benchmark) and then expand its neighborhood recursively. For instance, a 3-neighbor/2-generation expansion means looking at the three nearest neighbors of the current node, drawing the corresponding nodes and edges (I use Bezier curves to make the graph look nicer with multiple edges), and then repeating this process recursively for the newly added nodes. Because of this, multiple edges can form between nodes. The size of a node decreases with its generation number. The global graph includes all benchmarks (by performing expansions and additions of new nodes until all are present).

![](global.png)

The local graphs below explore the local 3-neighbor/2-generation neighborhood of each benchmark separately.

![](local.png)

SUMMARY
==============================================================

I defined a measure of general ability for language models based on community-standard LLM benchmark datasets. The metric is inspired by the idea of the *g-factor* in psychometrics, but is derived by performing principal component analysis directly on the correlation matrix of benchmark scores across a set of language models. I call this metric **General Ability (GA)**. I presented results showing that GA is correlated with, but differs from, model capacity. The proposed metric can benefit the AI community by providing a measure of progress in the field towards the goal of AGI.

Subsequently, I applied ideas from spectral graph theory to view our benchmark data as a graph with nodes corresponding to benchmarks and weighted edges determined by the correlations of scores from a collection of models evaluated on these benchmarks. I then used clustering in the spectral space of the normalized graph Laplacian to determine ability factors; there seem to be two main factors and two outlier factors. I also produced some visualizations using JavaScript code I wrote (which is not included in this post, but some of the resulting graphs are). The results suggest that community LLM benchmarks test at least two distinct abilities in language models. Note that this is different from factor analysis in psychometrics: it directly follows the initial computation based on principal component analysis of the task score correlation matrix, but goes further.

It would be interesting if low GA-loading ("pure") safety benchmarks formed a separate cluster in the spectral embedding space! The clustering can also help us distinguish good tasks (i.e. those that give us hints as to the general intelligence of the system) from misleading tasks, where progress might not be indicative of an actual increase in general ability.