The primary goal of this series of articles is to explore the topology of the representation manifolds arising in autoregressive neural language models trained on raw text, and to examine their relationship to the word manifold introduced previously. The embedding space comes with a canonical notion of shape, derived from the learned metric, and I will compare the shape of this representation with that of the word manifold from the previous article. For this purpose I introduce mathematical apparatus based on the theory of modules, as well as algorithmic procedures derived from differential geometry and computational topology. Because the data is the same as in the previous article, the three manifolds arising from it can be compared directly: the word manifold, the embedding manifold, and the hidden state trajectory manifold. These give three different views of natural language data, and the following articles explore their structures and relationships to one another in detail. This article starts with an introduction to the mathematical apparatus needed for the first set of experiments: the notion of topological persistence.

SECTION 1

INTRODUCTION

The computations I will perform in upcoming articles transform sentences of our corpus into sets of points in an ambient metric space of a chosen dimension. Each sentence is thus represented as a point cloud in a vector space, and this representation of language can be studied from topological and geometric perspectives.

As we saw in the previous articles, the corpus data used to derive such representations does not involve geometry a priori. However, natural language text can be naturally viewed through a topological lens (cf. the previous article on word manifolds). This is because topological notions, including continuity of maps, rely solely on the grouping of points into neighborhoods (open sets), and these can be defined using word co-occurrence patterns in the context of n-grams. I argue that a topological representation of the corpus captures the essence of the grammar more directly: geometric features, i.e. the exact positions of the vectorized representations in the ambient space, are tied to the labeling of words by a particular model trained with a particular loss measure. By contrast, topological features result from the fact that not all strings of words appear in the corpus; words can only form particular patterns dictated by the grammatical structure of the source language. This restriction on possible string patterns is precisely the source of the topological features seen in the associated word manifolds.

When distributed vector space representations of words are induced by neural language models of the same text, the resulting point clouds are geometric objects. However, their shape (which can be quantitatively studied by topological analyses of this data) should bear a natural relation to the shape of the corpus as measured by the word manifold associated to it directly. This is because neural language models are trained to predict which words can go into given contexts based on their vector space representations. Therefore the topological aspects of the point clouds formed by these representations must reflect the topological structure of the contexts of raw n-grams, which are the basis for the word manifold construction. The topology of the vector representations can be decomposed into parts corresponding to the model and training objective, and parts influenced by the grammar of the language and the choice of text corpus. Understanding this decomposition, and the relation of the point cloud topology to the word manifold, can lead to reparametrization techniques aimed at reducing the size of the parameter space, and hence to energy efficiency gains both during training and in deployment of language models. This can result in significant model compression and more sustainable neural NLP systems.

In the previous article we studied the topological features of the raw corpus data. In order to establish a correspondence between the shape of words and their embeddings into vector spaces, we need to introduce tools from topological data analysis - a new field in computational topology which has seen significant progress in the past decade. In the next three articles, we will study topological aspects (i.e. the shape) of the point clouds induced by several classes of neural language models from sentences of natural language text. This article introduces the first method that we will use for this purpose.

SECTION 2

TOPOLOGICAL PERSISTENCE

In order to associate topological spaces to the sentences of the corpus, I compute the Vietoris-Rips complex (VRC). It can be defined for sets of vectors using a technique inspired by hierarchical clustering methods.

Given a set of data points \mathbb{X} = \{x_1, \ldots, x_n\} embedded in a vector space with metric d, and a real number \epsilon \geq 0 , we define the Vietoris-Rips complex \text{VRC}_{\epsilon}(\mathbb{X}) as the set of simplices \sigma = [x_{i_0}, \ldots, x_{i_k}] such that d(x_{i_a}, x_{i_b}) \leq \epsilon for all a, b \in \{0, \ldots, k\} . This definition produces an abstract simplicial complex \mathbb{K} with vertex set \mathbb{X} and a set \Sigma of subsets of \mathbb{X} (the simplices), with the property that for any \sigma \in \Sigma , all nonempty subsets of \sigma also belong to \Sigma . A useful (if slightly informal) way of thinking about this process is as sliding an \epsilon ball across the ambient space, while recording simplices spanned by vertices that are captured inside the boundary of the ball. Note that as \epsilon grows, the set of simplices can only grow, and the VRCs generated by increasing values of \epsilon form a filtration of abstract simplicial complexes, each contained in the next.
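To make the definition concrete, here is a minimal brute-force sketch in Python (the function name vietoris_rips is illustrative and this is not the implementation used in the experiments) that enumerates the simplices of \text{VRC}_{\epsilon} for a small point cloud. It is exponential in the number of points and only meant to illustrate the pairwise-distance condition.

```python
# A minimal brute-force sketch of the Vietoris-Rips complex at a fixed scale
# epsilon: a subset of points spans a simplex iff all pairwise distances are
# at most epsilon. Exponential in the number of points; for illustration only.
import itertools
import numpy as np

def vietoris_rips(points, epsilon, max_dim=2):
    """Return simplices of VRC_epsilon(points) up to dimension max_dim,
    as tuples of point indices."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances in the ambient metric space.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    simplices = [(i,) for i in range(n)]            # vertices are always included
    for k in range(1, max_dim + 1):                 # a k-simplex has k + 1 vertices
        for sigma in itertools.combinations(range(n), k + 1):
            if all(dist[i, j] <= epsilon
                   for i, j in itertools.combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

# Three points on a line: at epsilon = 1.5 only the first two are connected.
cloud = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
print(vietoris_rips(cloud, epsilon=1.5))
# [(0,), (1,), (2,), (0, 1)]
```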

A major difference between the word manifold from the previous article and the representation manifolds in neural language models is that the former arises from discrete data without a canonical choice of topology. In contrast, linguistic unit representations live in a metric space, which naturally comes equipped with the open \epsilon-ball topology. However, it is not obvious what value of \epsilon we should choose for the construction of the VRC. Furthermore, even a small amount of noise can alter the homotopy type of a VRC associated to a random sample of points, by generating simplices that alter topological information in significant ways. These issues are the main reasons why, until recently, topology was not very useful in dealing with real world data.

The solution to both of these issues resulted from the work of Edelsbrunner et al. and was further developed by Zomorodian and Carlsson in the form of persistent homology. For each fixed value of \epsilon in a Vietoris-Rips filtration, we can compute the homology groups of the resulting complex and depict each generator as an interval whose endpoints correspond to the birth and death \epsilon values of the homology class it represents. Note that as \epsilon grows, higher dimensional simplices appear in the VRC filtration, and these simplices have boundaries that were cycles generating homology classes for earlier values of \epsilon. Because of this, as \epsilon grows we will see bars appear and disappear, until eventually only a single bar remains in dimension zero, corresponding to the connected component of the entire point cloud once it is enclosed by a large enough \epsilon ball (see figure 1). The interval representation is often referred to as a persistence barcode, and was initially inspired by the quiver representation of a sequence of vector spaces. This construction solves the first issue by taking all values of \epsilon > 0 into account. It also solves the second issue by interpreting the longest bars as informative topological features and the shortest bars as topological noise.

FIGURE 1
A filtration of Vietoris-Rips complexes with distance parameter \epsilon on a set of points embedded in an ambient metric space, and the associated persistent homology barcodes.
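In practice, barcodes like those in figure 1 are computed with existing software. The following is a hedged sketch assuming the GUDHI library is available (any other persistent homology package, e.g. Ripser, could be substituted); the point cloud is a synthetic noisy circle, chosen only because its expected barcode is easy to anticipate.

```python
# A hedged sketch of how barcodes like those in figure 1 can be computed,
# assuming the GUDHI library (https://gudhi.inria.fr); any other persistent
# homology package (e.g. Ripser) could be substituted.
import numpy as np
import gudhi

# A noisy circle: we expect one long bar in dimension 0 (a single connected
# component) and one long bar in dimension 1 (the loop).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=60)
cloud = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(60, 2))

# Build the Vietoris-Rips filtration up to a maximal scale, then take homology.
rips = gudhi.RipsComplex(points=cloud, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
diagram = simplex_tree.persistence()        # list of (dimension, (birth, death))

for dim, (birth, death) in diagram:
    print(f"H_{dim} bar: [{birth:.3f}, {death:.3f})")
```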

Because the VR complexes for larger \epsilon values contain those for smaller values, we have a nested family of simplicial complexes. That means there is a simplicial map from the set of simplices for a given \epsilon value into the set of simplices generated by any larger value. Simplicial maps induce homomorphisms on homology groups. If we compute homology with field coefficients, we can think of the object generated by all possible \epsilon values as a sequence of vector spaces with linear maps between them. Such a sequence of vector spaces and linear maps is known as a quiver representation, and the barcode diagram for that quiver representation corresponds to the persistence diagram generated by the given data sample. Eventually, as \epsilon increases, the number of bars in each dimension decreases, as homology classes disappear. With a finite number of points there are a finite number of bars and a finite number of merge events.

We do not actually need a metric, but only a nested family of spaces (a filtration) where inclusion maps induce simplicial maps on the corresponding abstract simplicial complexes associated to each space. Then, in a fixed dimension, we have a quiver representation of homology groups with the associated induced maps on homology. We could therefore extend the idea of the word manifold of the previous article to include a varying n-gram window, and compute a version of persistent homology on the word manifolds generated by varying window sizes.

The barcodes generated from a point cloud can be interpreted as measuring the shape of the space in the following way. In dimension zero, the barcode is equivalent to a dendrogram of a hierarchical clustering of the points in the data set. In dimension one, bars represent loops. In dimension two, bars represent spherical cavities (voids). Higher dimensions correspond to ( n-1 )-dimensional spheres wrapped around n-dimensional cavities.
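The equivalence between the dimension-zero barcode and single-linkage hierarchical clustering can be made concrete with a small union-find sketch (illustrative names, not the experimental code): every vertex is born at \epsilon = 0, and each merge of two connected components ends exactly one bar.

```python
# A minimal union-find sketch of the dimension-zero barcode (single-linkage
# clustering): every vertex is born at epsilon = 0 and each merge of two
# connected components ends exactly one bar. Illustrative, not optimized.
import itertools
import numpy as np

def zero_dim_barcode(points):
    points = np.asarray(points, dtype=float)
    n = len(points)
    parent = list(range(n))                       # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path compression
            i = parent[i]
        return i

    # Edges of the Rips filtration, ordered by the scale at which they appear.
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    bars = []
    for eps, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                      # a genuine merge event
            parent[root_i] = root_j
            bars.append((0.0, eps))               # one component dies at eps
    bars.append((0.0, float("inf")))              # the component that never dies
    return bars

print(zero_dim_barcode([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]))
# [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```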

SECTION 3

MODULE BASED ALGORITHM

Formally, topological persistence can be computed using tools from the commutative algebra of modules over principal ideal domains, which involves a Smith Normal Form decomposition in a fashion similar to the algorithm from the previous article. The computation is more intricate than the process used in the previous article, because we have to keep track of the extra scale parameter. The result is an algorithm for computing persistence intervals from a filtration of simplicial complexes that is significantly more efficient than performing a separate homology computation for each space in the filtration. It is informative to see an example of such a computation before presenting results of large scale experiments with corpus data.

Suppose we generated a filtration of abstract simplicial complexes from a point cloud by considering a range of parameter values. For a finite set of initial points, there will be only a finite number of simplices that appear, and we end up with a finite sequence of simplicial complexes that contain each other. $$ X_0 \subseteq X_1 \subseteq X_2 \subseteq \dots $$

Every simplex \sigma has a definite birth time t, which is the index of the first complex within the filtration containing it, i.e. \sigma \in X_t and \sigma \notin X_{t-1}. We can arrange the boundary maps as columns, and the inclusion induced homomorphisms between the chain complexes of each simplicial complex in the sequence as rows, into a commutative diagram as in figure 2.

FIGURE 2
Chain groups of a filtration of spaces. The columns show the boundary chain complex of each complex in the sequence. The rows show maps induced by the inclusion of each subcomplex within a larger complex corresponding to higher parameter values. The lower index marks the dimension within a complex, and the upper index specifies the ordering within the filtration.
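In a Vietoris-Rips filtration the birth time of a simplex is simply its diameter, i.e. the largest pairwise distance among its vertices, which guarantees that every face is born no later than the simplex it bounds. A small illustrative sketch (the function name is my own, not part of any library):

```python
# A small illustrative sketch of birth times in a Vietoris-Rips filtration:
# a simplex enters the filtration at its diameter (the largest pairwise
# distance among its vertices), so every face is born no later than the
# simplex it bounds.
import itertools
import numpy as np

def birth_time(simplex, dist):
    """Filtration value at which `simplex` (a tuple of vertex indices) appears."""
    if len(simplex) == 1:
        return 0.0                                # vertices are present from the start
    return max(dist[i, j] for i, j in itertools.combinations(simplex, 2))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

for sigma in [(0,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]:
    print(sigma, round(birth_time(sigma, dist), 3))
# (0,) 0.0, (0, 1) 1.0, (0, 2) 2.0, (1, 2) 2.236, (0, 1, 2) 2.236
```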

The goal of the persistence algorithm is to compute homology once, simultaneously for all time steps, while also reducing the sizes of the matrices involved. In order to do that, we compute homology of the persistence module, which encodes information about the homology of all the subcomplexes in the sequence simultaneously. The idea behind this construction is to label the simplices within the filtration with their corresponding birth times. Algebraically, we convert the boundary matrices, as defined in the previous article, from integers (really \{-1, 0, 1\}) to polynomials in the birth time variable t, which we adjoin to the ring of integers \mathbb{Z}. Let X be the whole complex, and X_k be the subcomplex in the filtration where a simplex \sigma appears for the first time. The strategy of the persistence algorithm is to track birth times algebraically. Rather than letting C_n(X) contain combinations of simplices with coefficients in \mathbb{Z}, we upgrade the coefficients to polynomials in \mathbb{Z}[t]. Given an n-simplex \sigma with birth time k, we represent it by t^k \sigma. If the simplex existed from the start (time index 0), we just get \sigma as usual. If a simplex shows up at the first time step, we represent it by t \sigma; if it showed up in the second complex within the filtration, we would represent it as t^2 \sigma, and so forth. With this convention, our boundary maps now contain polynomials in t. Reduction of matrices over \mathbb{Z}[t] involves polynomial division, and comes with algorithmic issues. However, we can also perform this computation over polynomials with field coefficients, which makes the calculation more straightforward. This hides torsion related phenomena, but if we are just interested in computing the Betti numbers, we can work with free modules over \mathbb{Q}[t] and leverage existing linear algebra packages such as numpy to perform the reduction operations.

FIGURE 3
A filtration of 3 spaces. Note that the vertices v_0 and v_1 forming the boundary of edges e_0 and e_1 have different birth times, which is encoded by the exponents of t in the entries of the boundary operator.

For the remainder of the discussion, we define persistent chain groups PC_k to be free modules over the polynomial ring \mathbb{Q}[t], generated by the simplices of X. In order to make this computation work, we need to define the boundary maps in a way that satisfies the chain complex condition necessary to define homology (cf. the previous article on computing homology of word manifolds). The following definition of the boundary matrices works. The k-th boundary map \partial_k: PC_k \rightarrow PC_{k-1} is a linear map over \mathbb{Q}[t]. In order to represent it as a matrix, we label the rows and columns of the boundary matrices with simplices of X, similarly to the definition in the previous article. The difference between it and the boundary map we used in the previous article is that the entries now encode not only the presence of simplices, but also their relative birth times (the difference between the indices of the spaces in the filtration at which they first appear). Suppose that a column of the persistent boundary matrix corresponds to a k-simplex [v_0, v_1, \dots, v_k] (spanned by k+1 vertices), which first appeared in the filtration at index B (for birth time). In this case, the i-th entry of this column will be of the form (-1)^i t^{B-b}, which corresponds to the face [v_0, \dots, \hat{v}_i, \dots, v_k] with birth time b. Note that the birth times of the simplices marking the columns must be at least as large as those marking the rows (since a simplex cannot appear unless its boundary is already present), so the exponent of t is non-negative. Zomorodian and Carlsson proved that this definition of the boundary operator induces a chain complex (hence we can take homology quotients). For instance, the persistent boundary operator in dimension 1 (that is, a matrix representation of \partial_1: PC_1 \rightarrow PC_0) for the filtration in figure 3 would be written as follows under this convention. $$ \partial_1 = \begin{array}{rcl} &\color{green}\begin{array}{c}\ e_0 \ \ \ \ \ e_1 \ \ \ \ \ e_2 \ \ \ \ \ \ e_3 \end{array} \\ \color{blue}\begin{matrix}v_0 \\ v_1 \\ v_2 \end{matrix}\color{black}\hspace{-1em} &\begin{pmatrix}-t & -t & \ \ t^2 & \ \ t^2 \\ \ \ 1 & \ \ 1 & \ \ 0 & \ \ 0 \\ \ \ 0 & \ \ 0 & -1 & -1 \end{pmatrix}&\hspace{-1em}\\ \color{blue}\end{array} $$ which can be column reduced to $$ \begin{pmatrix}-t & \ \ \ \ t^2 & \ 0 & 0 \\ \ \ 1 & \ \ 0 & \ 0 & 0 \\ \ \ 0 & -1 & \ 0 & 0 \end{pmatrix} $$ and is hence of rank 2.
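The rank computation above can be double-checked symbolically. The following sketch uses SymPy (an assumption about tooling, not part of the article's pipeline) to build the persistent boundary matrix of figure 3 and confirm that it has rank 2; symbolic elimination over the fraction field \mathbb{Q}(t) gives the same rank as the column reduction over \mathbb{Q}[t] shown above.

```python
# A symbolic double-check of the example above, assuming SymPy is available:
# the persistent boundary matrix of figure 3 over Q[t] has rank 2.
import sympy as sp

t = sp.symbols("t")
d1 = sp.Matrix([
    [-t, -t, t**2, t**2],   # row v_0
    [ 1,  1,    0,    0],   # row v_1
    [ 0,  0,   -1,   -1],   # row v_2
])

print(d1.rank())            # 2
print(d1.columnspace())     # two independent columns span the image
```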

It is informative to see a persistent homology computation for a minimal example before describing large scale experiments. Consider the event of an edge appearing between two vertices, as in figure 4.

FIGURE 4
An edge appears at some value of \epsilon between two vertices A and B.

The persistent boundary matrix in dimension 1 is of the form: $$ \partial_1 = \begin{array}{rcl} &\color{green}\begin{array}{c}AB\ \end{array} \\ \color{blue}\begin{matrix}A \\ B\end{matrix}\color{black}\hspace{-1em} &\begin{pmatrix}-t\\t\end{pmatrix}&\hspace{-1em}\\ \color{blue}\end{array} $$

Note that the exponents of t in the entries above are the differences between the birth time of the edge AB and the birth times of its boundary vertices A and B (both differences are equal to 1). Thus the image of \partial_1 (i.e. the space of boundaries) is of rank one.

The kernel of the persistent boundary map in dimension 0 is all of PC_0: a free module of rank two, generated by the two vertices A and B with birth time 0. $$ \text{span}_{\mathbb{Q}[t]} \{ \begin{pmatrix} 0 \\ 1 \\ \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ \end{pmatrix} \} = \{ \begin{pmatrix} q(t) \\ r(t) \\ \end{pmatrix} : q, r \in \mathbb{Q}[t] \} $$

Since modules are not uniquely characterized by their dimension (as vector spaces are, up to isomorphism), we cannot simply do arithmetic on dimensions (which would wrongly suggest that the quotient is one dimensional), but instead need to work over the polynomial ring to figure out the rank of the homology.

We need to compute the quotient: $$ PH_0 = \text{ker} \partial_0 / \text{im} \partial_1 \\ = \{ \begin{pmatrix} q(t) \\ r(t) \\ \end{pmatrix} + \{ \begin{pmatrix} -tp(t) \\ tp(t) \\ \end{pmatrix} : p \in \mathbb{Q}[t] \} : q, r \in \mathbb{Q}[t] \} $$

Observe that we can always find a representative in each coset whose first component is a constant, say a, simply by choosing p so that q(t) - tp(t) reduces to a constant for the given polynomial q \in \mathbb{Q}[t] (set the coefficients of p so they cancel all but the constant term of q; we cannot zero out the constant term because tp(t) is at least of degree 1). $$ \begin{pmatrix} a \\ r(t) + t\tilde{p}(t) \\ \end{pmatrix} $$ where \tilde{p} has been fixed (from an arbitrary polynomial p, in order to make the first component constant), but r is still an arbitrary polynomial.

This argument implies that the quotient splits into two summands. In fact we can now write: $$ PH_0 = \{ \{ \begin{pmatrix} a \\ r(t) + t\tilde{p}(t) \\ \end{pmatrix} + \begin{pmatrix} -tp(t) \\ tp(t) \\ \end{pmatrix}: p \in \mathbb{Q}[t] \} : a \in \mathbb{Q},\ r \in \mathbb{Q}[t] \} $$ and split these cosets into two summands of the form $$ \{ \{ \begin{pmatrix} a \\ 0 \\ \end{pmatrix} + \begin{pmatrix} -tp(t) \\ tp(t) \\ \end{pmatrix} : p \in \mathbb{Q}[t] \} \\ + \{ \begin{pmatrix} 0 \\ r(t) + t\tilde{p}(t) \\ \end{pmatrix} + \begin{pmatrix} -ts(t) \\ ts(t) \\ \end{pmatrix}: s \in \mathbb{Q}[t] \} : a \in \mathbb{Q},\ r \in \mathbb{Q}[t] \} $$

The above set is, up to isomorphism, a direct sum of two summands: $$ PH_0 \cong \mathbb{Q} \oplus \text{span}_{\mathbb{Q}[t]} \{ \begin{pmatrix} 0 \\ 1 \\ \end{pmatrix} \} $$ where the first summand comes from a being an arbitrary constant term of the erased polynomial q, which gives \mathbb{Q} as the coset represented by \begin{pmatrix} a \\ 0 \\ \end{pmatrix} in the isomorphism above.

This homology group encodes more data than the regular (i.e. not persistent) homology we used previously. In particular we can read off Betti numbers over time. They are usually represented as barcodes (such as those in figure 1). From our computation we can conclude that there are two bars in dimension 0 (figure 5). The first term \mathbb{Q} generates the first bar, which corresponds to a connected component that exists at time zero but dies at time 1; we can think of it as the connected component of vertex A (or B). The second term in the direct sum above is a longer bar, representing the connected component that survives from time 1 onwards. This component is the entire space, as A and B are joined by an edge.

FIGURE 5
Barcode corresponding to an edge appearing between two vertices. The two bars are marked with the direct summands of the homology group of the associated persistence module in dimension 0, as in the example calculation. The arrow head signifies that the corresponding homology class survives indefinitely.
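The same two bars can be obtained from software, which makes for a useful sanity check of the hand computation. The sketch below assumes the GUDHI library: we insert the two vertices at filtration value 0 and the edge at value 1, and read off the dimension-zero intervals.

```python
# A sanity check of the two-bar computation, assuming the GUDHI library:
# insert vertices A and B at filtration value 0 and the edge AB at value 1,
# then read off the dimension-zero persistence intervals.
import gudhi

st = gudhi.SimplexTree()
st.insert([0], filtration=0.0)      # vertex A, born at time 0
st.insert([1], filtration=0.0)      # vertex B, born at time 0
st.insert([0, 1], filtration=1.0)   # edge AB, born at time 1

print(st.persistence())
# [(0, (0.0, inf)), (0, (0.0, 1.0))] -- one bar dies when the edge appears,
# the other (the whole connected space) survives indefinitely, as in figure 5.
```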

In practice, there are many optimizations that can be applied while computing persistent homology. In particular, the boundary matrices are very sparse, and algorithms used in practice take advantage of that fact. There are also topological optimizations based on Morse theoretic ideas, which identify and remove contractible parts of the simplicial complex before computing homology (since the resulting spaces are homotopy equivalent and have isomorphic homology groups).
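For concreteness, here is a sketch of the standard column reduction algorithm over \mathbb{Z}/2, the matrix form of the persistence computation that most implementations start from and then optimize (this is the classical reduction algorithm, not the module-theoretic presentation above). Simplices are assumed to be listed in filtration order, and each boundary column is stored as a set of row indices, which exploits the sparsity just mentioned; all names are illustrative.

```python
# A sketch of the standard persistence reduction algorithm over Z/2: reduce
# each boundary column by adding earlier columns until its lowest nonzero
# entry is unique, then read births and deaths off the pairing.
def reduce_boundary(columns):
    """columns[j]: set of indices of the faces in the boundary of simplex j.
    Returns (birth, death) index pairs; unpaired simplices create classes
    that never die."""
    low_to_col = {}                    # lowest row index -> index of reduced column
    pairs = []
    for j in range(len(columns)):
        col = set(columns[j])
        # Add earlier reduced columns (mod 2) while another column shares the lowest entry.
        while col and max(col) in low_to_col:
            col ^= columns[low_to_col[max(col)]]
        columns[j] = col
        if col:
            low_to_col[max(col)] = j
            pairs.append((max(col), j))   # simplex max(col) is born, simplex j kills it
    return pairs

# The filtration of figure 4: vertex A, vertex B, then the edge AB.
cols = [set(), set(), {0, 1}]          # vertices have empty boundary
print(reduce_boundary(cols))
# [(1, 2)]: the class created by simplex 1 dies when simplex 2 (the edge) appears;
# simplex 0 remains unpaired and generates the bar that survives indefinitely.
```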

Remarks

This article introduces mathematical techniques in more detail than is possible in my journal publications and conference talks. It is useful as an introduction to some of the mathematical constructions I use, especially for a computational linguistics audience unfamiliar with these ideas.