The primary goal of this series of articles is to explore the topology of the representation manifolds arising in autoregressive neural language models trained on raw text, and to examine their relationship to the word manifold introduced previously. The embedding space comes with a canonical notion of shape, derived from the learned metric, and I will compare the shape of this representation to that of the word manifold from the previous article. For this purpose, I introduce mathematical apparatus based on the theory of modules, as well as algorithmic procedures derived from differential geometry and computational topology. Because the data is the same as in the previous article, the three manifolds arising from it can be compared directly: the word manifold, the embedding manifold, and the hidden state trajectory manifold. These give three different views of natural language data, and the following articles explore their structures and mutual relationships in detail. This article begins with an introduction to the mathematical apparatus needed for the first set of experiments: the notion of topological persistence.
The computations I will perform in upcoming articles transform the sentences of our corpus into sets of points in an ambient metric space of a chosen dimension. Each sentence is thus represented as a point cloud in a vector space, and this representation of language can be studied from both topological and geometric perspectives.
As we saw in the previous articles, the corpus data used to derive such representations does not involve geometry a priori. However, natural language text can be naturally viewed through a topological lens (cf. the previous article on word manifolds). This is because topological notions, including continuity of maps, rely solely on the grouping of points into neighborhoods (open sets), and these can be defined using word co-occurrence patterns in the context of n-grams. I argue that a topological representation of the corpus captures the essence of the grammar more directly, because geometric features, such as the exact positions of vectorized representations in the ambient space, depend on the labeling of the words by a particular model with a particular loss function. By contrast, topological features result from the fact that not all strings of words appear in the corpus; instead, words form particular patterns dictated by the grammatical structure of the source language. This restriction on possible string patterns is precisely the source of the topological features seen in the associated word manifolds.
When distributed vector space representations of words are induced by neural language models of the same text, the resulting point clouds are geometric objects. However, their shape (which can be studied quantitatively by topological analysis of this data) should bear a natural relation to the shape of the corpus as measured by the word manifold associated to it directly. This is because neural language models are trained to predict which words can fill given contexts based on their vector space representations. Therefore the topological aspects of the point clouds formed by these representations must reflect the topological structure of the contexts of raw n-grams, which are the basis for the word manifold construction. The topology of the vector representations can be decomposed into parts corresponding to the model and training objective, and parts influenced by the grammar of the language and the choice of text corpus. Understanding this decomposition, and the relation of the point cloud topology to the word manifold, can lead to reparametrization techniques that reduce the size of the parameter space, and hence to energy efficiency gains both during training and in deployment of language models. This can result in significant model compression and more sustainable neural NLP systems.
In the previous article we studied the topological features of the raw corpus data. In order to establish a correspondence between the shape of words and the shape of their embeddings into vector spaces, we need to introduce tools from topological data analysis, a young field within computational topology that has seen significant progress in the past decade. In the next three articles, we will study the topological aspects (i.e. the shape) of the point clouds induced from sentences of natural language text by several classes of neural language models. This article introduces the first method that we will use for this purpose.
Section 2

In order to associate topological spaces to the sentences of the corpus, I compute the Vietoris-Rips Complex (VRC). It can be defined for sets of vectors using a technique inspired by hierarchical clustering methods.
Given a set of data points in a metric space and a scale parameter $\varepsilon > 0$, the VRC has the data points as vertices, and a finite subset of points spans a simplex whenever all of its pairwise distances are at most $\varepsilon$.
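The definition translates directly into code. Below is a minimal brute-force sketch in Python (my own illustration; the function name and parameters are placeholders, and this is not the optimized construction used in the experiments):

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, eps, max_dim=2):
    """Brute-force Vietoris-Rips complex at scale eps: a subset of
    points spans a simplex iff all pairwise distances are <= eps.
    Suitable only for small point clouds."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    simplices = [(i,) for i in range(n)]             # vertices are always present
    for k in range(2, max_dim + 2):                  # subsets of size 2, 3, ...
        for subset in combinations(range(n), k):
            if all(dist[i, j] <= eps for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices
```

Note that the complex is determined entirely by pairwise distances, which is why it extends to any finite metric space, not just point clouds in Euclidean space.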
A major difference between the word manifold from the previous article and the representation manifolds in neural language models is that the former arises from discrete data without a canonical choice of topology. In contrast, linguistic unit representations exist in a metric space, which naturally comes equipped with the open-ball topology induced by the metric.
The solution to both these issues resulted from the work of Edelsbrunner et al. and was further developed by Zomorodian and Carlsson: rather than fixing a single scale parameter, one tracks topological features across all scales at once.
Because the VR complexes for larger scale parameters contain those built at smaller scales, the complexes form a filtration, and the persistent homology of this filtration records the scale at which each topological feature appears (its birth) and the scale at which it disappears (its death).
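In practice one does not build each complex separately; TDA libraries construct the whole filtration and compute its persistence in one pass. Here is a sketch using the gudhi package (one of several available libraries; the random point cloud and the parameter values are placeholders, not the data or settings of my experiments):

```python
import numpy as np
import gudhi

# Placeholder point cloud standing in for one sentence's vectors.
points = np.random.default_rng(0).normal(size=(100, 3))

# One simplex tree encodes the VR complexes at all scales up to max_edge_length.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)

# Each entry is (dimension, (birth scale, death scale)) -- one bar of the barcode.
diagram = simplex_tree.persistence()
```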
We do not actually need a metric, but only a nested family of spaces (a filtration) whose inclusion maps induce simplicial maps on the corresponding abstract simplicial complexes. Then, in a fixed dimension, we obtain a quiver representation consisting of the homology groups together with the induced maps on homology. We could therefore extend the word manifold of the previous article to a varying n-gram window, and compute a version of persistent homology on the word manifolds generated by increasing window sizes, as in the sketch below.
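A sketch of that idea, assuming a hypothetical helper `word_complex(w)` that returns the simplices of the word manifold at n-gram window size `w` (its construction was the subject of the previous article). Since gudhi's SimplexTree accepts arbitrary filtration values, the window size can play the role of the scale parameter, provided the complexes are nested as `w` grows:

```python
import gudhi

def window_persistence(word_complex, windows):
    """Persistent homology of word manifolds filtered by window size.
    word_complex(w) is assumed to return the simplices (tuples of
    word ids) present at window size w, nested as w increases."""
    st = gudhi.SimplexTree()
    for w in sorted(windows):
        for simplex in word_complex(w):
            if not st.find(list(simplex)):           # record earliest appearance
                st.insert(list(simplex), filtration=float(w))
    return st.persistence()
```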
The barcodes generated from a point cloud can be interpreted as measuring the shape of the space in the following way.
In dimension zero, the barcode is equivalent to a dendrogram of a hierarchical clustering of the points in the data set.
In dimension one, bars represent loops.
In dimension two, bars represent spherical cavities.
Higher dimensions correspond to higher-dimensional voids.
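The dimension-zero claim can be checked directly: for a Vietoris-Rips filtration, the death times of the finite $H_0$ bars are exactly the merge heights of single-linkage clustering. A small sanity check, assuming gudhi and scipy are available (the two-cluster data is synthetic):

```python
import numpy as np
import gudhi
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(3.0, 0.1, size=(20, 2))])   # two tight clusters

st = gudhi.RipsComplex(points=X).create_simplex_tree(max_dimension=1)
deaths = sorted(d for dim, (b, d) in st.persistence()
                if dim == 0 and d != float('inf'))

merge_heights = sorted(linkage(X, method='single')[:, 2])
print(np.allclose(deaths, merge_heights))              # True
```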
Formally, topological persistence can be computed using tools from the commutative algebra of modules over principal ideal domains; the computation involves a Smith Normal Form decomposition, in a fashion similar to the algorithm from the previous article.
The computation is more intricate than the process used in the previous article, because we have to keep track of the extra scale parameter.
The result is a significantly more efficient algorithm for computing persistence intervals from a filtration of simplicial complexes.
Suppose we have generated a filtration of abstract simplicial complexes from a point cloud by considering an increasing range of parameter values. For a finite set of initial points, only a finite number of simplices can appear, and we end up with a finite nested sequence of simplicial complexes $$ X_0 \subseteq X_1 \subseteq \dots \subseteq X_n $$
Every simplex has a definite birth time: the smallest index $i$ (equivalently, the smallest scale value) at which it appears in $X_i$.
The goal of the persistence algorithm is to compute homology once, simultaneously for all time steps, while also reducing the sizes of matrices involved.
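For orientation, here is how that simultaneous computation is typically organized in software: the standard column-reduction algorithm over $\mathbb{Z}/2$. This is a simplification relative to the $\mathbb{Q}[t]$-module picture developed below, but it produces the same pairing of births and deaths. A minimal sketch:

```python
def persistence_pairs(boundary_columns):
    """Standard persistence reduction over Z/2.
    boundary_columns[j] is the set of row indices where column j of the
    boundary matrix is nonzero; simplices are indexed in filtration order.
    Returns (pairs, unpaired): a pair (i, j) means the feature created by
    simplex i is destroyed by simplex j; unpaired simplices create
    features that live forever."""
    cols = [set(c) for c in boundary_columns]
    owner = {}                                   # pivot row -> column index
    pairs = []
    for j in range(len(cols)):
        while cols[j]:
            pivot = max(cols[j])                 # lowest nonzero entry
            if pivot not in owner:
                owner[pivot] = j
                pairs.append((pivot, j))
                break
            cols[j] ^= cols[owner[pivot]]        # add the owning column mod 2
    paired = {i for p in pairs for i in p}
    unpaired = [i for i in range(len(cols)) if i not in paired]
    return pairs, unpaired

# The minimal example discussed below: A (index 0), B (index 1), edge AB (index 2).
# Vertices have empty boundary; the edge's boundary is {A, B}.
print(persistence_pairs([set(), set(), {0, 1}]))
# -> ([(1, 2)], [0]): B's component dies when AB appears; A's lives forever.
```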
In order to do that, we compute the homology of a persistent chain complex; the result, the persistence module, encodes information about the homology of all the subcomplexes in the sequence simultaneously.
The idea behind this construction is to label the simplices within the filtration with their corresponding birth times.
Algebraically, we convert the boundary matrices, as defined in the previous article, from integer entries (really rational ones, since we compute over the field $\mathbb{Q}$) to entries in the polynomial ring $\mathbb{Q}[t]$.
For the remainder of the discussion, we define persistent chain groups as free modules over $\mathbb{Q}[t]$ generated by the simplices of the filtration, with the boundary maps acting as $\mathbb{Q}[t]$-module homomorphisms.
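In the standard graded formulation of Zomorodian and Carlsson, this reads as follows (writing $\deg(\sigma)$ for the birth time of a simplex $\sigma$, and $K^{(k)}$ for the set of $k$-simplices of the final complex; the notation here is mine): $$ C_k = \bigoplus_{\sigma \in K^{(k)}} \mathbb{Q}[t] \cdot \sigma, \qquad \partial_k(\sigma) = \sum_{\tau \text{ a facet of } \sigma} \pm\, t^{\deg(\sigma) - \deg(\tau)}\, \tau $$ The exponent of $t$ records how much later a simplex is born than each of its facets.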
It is informative to see a persistent homology computation for a minimal example before describing large scale experiments. Consider the event of an edge appearing between two existing vertices, as in figure 4: the vertices $A$ and $B$ are born at time $0$, and the edge $AB$ at time $1$.
Figure 4: an edge $AB$ appearing between two existing vertices $A$ and $B$.

The persistent boundary matrix in dimension 1 is of the form: $$ \partial_1 = \begin{array}{cc} & \begin{array}{c} AB \end{array} \\ \begin{matrix} A \\ B \end{matrix} & \begin{pmatrix} -t \\ t \end{pmatrix} \end{array} $$
Note that the exponents of $t$ record relative birth times: the entry relating the edge $AB$, born at time $1$, to a vertex born at time $0$ is $\pm t^{1-0} = \pm t$.
The kernel of the persistent boundary map in dimension 0 is all of $C_0$ (since $\partial_0 = 0$): a free $\mathbb{Q}[t]$-module of rank two, generated by the two vertices $A$ and $B$.
Since modules are not uniquely characterized by their rank (as regular vector spaces are characterized by dimension, up to isomorphism), we cannot read off the homology from ranks alone and must work out the quotient explicitly.
We need to compute the quotient: $$ PH_0 = \text{ker}\, \partial_0 / \text{im}\, \partial_1 = \left\{ \begin{pmatrix} q(t) \\ r(t) \end{pmatrix} + \left\{ \begin{pmatrix} -t\,p(t) \\ t\,p(t) \end{pmatrix} : p \in \mathbb{Q}[t] \right\} : q, r \in \mathbb{Q}[t] \right\} $$
Observe that we can always find a representative of each coset whose first component is a constant, say $a \in \mathbb{Q}$: choosing $p(t) = (q(t) - q(0))/t$ (a polynomial, since $q(t) - q(0)$ is divisible by $t$) and adding the corresponding image element turns the first component into $q(0) = a$, while changing the second component only by a multiple of $t$.
This argument implies that the quotient is generated by two elements. In fact we can now write: $$ PH_0 = \left\{ \begin{pmatrix} a \\ r(t) + t\tilde{p}(t) \end{pmatrix} + \left\{ \begin{pmatrix} -t\,p(t) \\ t\,p(t) \end{pmatrix} : p \in \mathbb{Q}[t] \right\} : a \in \mathbb{Q},\ r \in \mathbb{Q}[t] \right\} $$ and split these cosets into two summands of the form $$ \left\{ \begin{pmatrix} a \\ 0 \end{pmatrix} + \left\{ \begin{pmatrix} -t\,p(t) \\ t\,p(t) \end{pmatrix} : p \in \mathbb{Q}[t] \right\} : a \in \mathbb{Q} \right\} \\ + \left\{ \begin{pmatrix} 0 \\ r(t) + t\tilde{p}(t) \end{pmatrix} + \left\{ \begin{pmatrix} -t\,s(t) \\ t\,s(t) \end{pmatrix} : s \in \mathbb{Q}[t] \right\} : r \in \mathbb{Q}[t] \right\} $$
The above set is a sum of two spans (up to isomorphism):
$$
PH_0 \cong \mathbb{Q} \oplus \text{span}_{\mathbb{Q}[t]} \{ \begin{pmatrix} 0 \\ 1 \\ \end{pmatrix} \}
$$
where the first summand comes from the constant first components; as a $\mathbb{Q}[t]$-module it is isomorphic to $\mathbb{Q}[t]/(t)$, i.e. it is annihilated by $t$. The second summand is free of rank one.
This homology group encodes more data than the regular (i.e. not persistent) homology we used previously.
In particular, we can read off the Betti numbers at every scale.
They are usually represented as barcodes (such as those in figure 1).
From our computation we can conclude that there are two bars in dimension 0 (figure 5).
The first term corresponds to the finite bar $[0, 1)$: a connected component born at time $0$ that dies at time $1$, when the edge $AB$ merges the two components into one. The second term, the free summand, corresponds to the infinite bar $[0, \infty)$ of the component that survives.
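As a sanity check, the same two bars fall out of a library computation. A sketch with gudhi, placing the two vertices at distance $1$ so that the edge appears at scale $1$:

```python
import gudhi

# Two points at distance 1: the minimal example computed by hand above.
rips = gudhi.RipsComplex(points=[[0.0], [1.0]], max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=1)
print(st.persistence())
# Expected bars (in some order): (0, (0.0, inf)) and (0, (0.0, 1.0)),
# matching the free summand and the summand that dies at time 1.
```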
In practice, there are many optimizations that can be applied while computing persistent homology. In particular, the boundary matrices are very sparse, and the algorithms used in practice take advantage of that fact. There are also topological optimizations, based on Morse-theoretic ideas, that identify and remove contractible parts of the simplicial complex before computing homology (since the resulting spaces are homotopy equivalent and have isomorphic homology groups).
This article introduces the mathematical techniques in more detail than is possible in my journal publications and conference talks. It should be useful as an introduction to some of the mathematical constructions I use, especially for a computational linguistics audience unfamiliar with these ideas.