This article presents a novel method, based on ideas from algebraic topology, for the analysis of raw natural language text.
The article introduces the notion of a word manifold - a simplicial complex, whose topology encodes grammatical structure expressed by the corpus.
Results of experiments with a variety of natural and synthetic corpora are presented, showing that the homotopy type of the word manifold is influenced by linguistic structure.
The analysis includes a new approach to the Voynich Manuscript - an unsolved puzzle in corpus linguistics.
In contrast to existing topological data analysis approaches, I do not rely on the apparatus of persistent homology.
Instead, I develop a method of generating topological structure directly from strings of words.
The inspiration for the work presented here was an observation I made while working on a separate project involving vector space representations of natural language data.
Such techniques are now common in Computational Linguistics and Natural Language Processing.
Initially, shallow pre-training of early model layers became standard in NLP research through methods such as word2vec.
Advancements in NLP resulted in progressively deeper hierarchical language representations, such as those derived using self-attention mechanisms in transformer-based architectures .
Neural language modelling methods, based on techniques such as BERT , use vector space representations of linguistic units to predict words based on context found in raw text corpora.
Interestingly, word vectors obtained through such embedding methods do not fill the ambient space uniformly, but rather form lower dimensional structures within it.
Figure 1 shows a graph projection of word embeddings into an ambient vector space obtained from articles in French and English.
The embedding method uses word contexts to bring vector representations of words that appear in similar contexts close together.
While looking at those projections, it is hard to ignore the extreme differences in shape between languages.
The work presented here aims at studying a notion of shape in the context of natural language data, by developing theory and algorithms for probing linguistic structure from a topological viewpoint.
Subsequent research presented in the next article aims to investigate links between topological aspects of natural language data and its representations in vector spaces, which are at the core of recent developments in the field of NLP.
The predominant method of deriving linguistic unit representations within modern language models takes advantage of self-supervised techniques of masked sequence training.
This amounts to gradient descent optimization, resulting in representations that are informative towards predicting the contexts in which words appear within the corpus.
Because of this, there must be a natural relationship between the topology of raw text data (as expressed by word co-occurrence patterns), and the shape of the resulting embedding manifold (as expressed by the distributions of word vectors).
In order to study natural language on the side of LM based vector space embeddings from a topological perspective, we can use existing tools of persistent homology.
These techniques take advantage of the fact that such linguistic unit representations exist in a metric space, which naturally comes equipped with the open \epsilon-ball topology, and its inner product encodes linguistic structure due to the autoregressive model training process.
For this reason, topology based on proximity in the embedding space must hold a natural relationship to the word and context topologies based on word co-occurrence patterns.
Among the techniques we can use here are topological persistence modules from a Vietoris-Rips complex filtration.
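For illustration, such a persistence computation on the embedding side can be carried out with off-the-shelf tools. The sketch below is only an illustration of that idea, assuming an array of word vectors and using the ripser package as one possible Vietoris-Rips implementation; neither the data nor the library choice is prescribed by this article.
import numpy as np
from ripser import ripser
# Assumption: `vectors` is an (n_words, dim) array of word embeddings;
# random data stands in for real embeddings in this sketch.
vectors = np.random.randn(500, 50)
diagrams = ripser(vectors, maxdim=2)['dgms']  # persistence diagrams in dimensions 0, 1, 2
for dim, dgm in enumerate(diagrams):
    print(f"H_{dim}: {len(dgm)} persistence pairs")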
I would like to understand the relationship between those vector space representations and the raw text data used to induce them.
Although we have potential tools to study the vector representations from this perspective, there is currently no analogue on the side of raw text data.
This article develops a method of analyzing raw text data from a topological perspective.
Using such representation, we can then study the relationship between neural language representations and raw text data in a uniform fashion, by mapping both into the category of topological spaces and continuous maps.
The reason we cannot apply the same techniques of persistent homology on the side of the corpus directly is that natural language data comes in the form of discrete word tokens, which do not carry a canonical notion of topology such as the one available on the side of the embedding.
However, literature in pure mathematics provides theoretical foundation, from which we can build algorithmic solutions to this problem.
Over the past century, we have seen the emergence of new branches of Mathematics exploring the structure of high dimensional objects.
These ideas came initially from the Erlangen Program outlined by Felix Klein in his seminal work on the formalization of geometry as the study of invariants under algebraically defined groups of transformations .
The body of research produced through this program led to the development of category theory as a unifying language connecting previously isolated branches of Mathematics.
A particularly powerful type of such category theoretical relationship is expressed by the homotopy and homology functors linking the realms of topology and abstract algebra.
These notions belong to the branch of Mathematics known as algebraic topology, which has been in accelerated development over the past century, producing a set of powerful tools for probing global and local properties of manifolds.
The first goal of this article is to build a bridge between topology and language.
Such a connection will then allow us to adapt powerful machinery developed over the past century in pure Mathematics for the analysis of topological manifolds, and employ it for the study of linguistic structure encoded in corpus data.
In particular, we will be concerned with the development of algorithms for associating topological manifolds directly to raw corpus data.
The requirement for these algorithms is that the topology of the resulting manifold encodes word co-occurrence patterns, and that no extrinsic information is imported in this association process.
The distributional hypothesis, which forms a fundamental assumption behind neural language model training (best expressed in words of John Rupert Firth - "you shall know a word by the company it keeps"), states that syntactic and semantic relationships between words can be inferred from their context (i.e. co-occurrence patterns with other words in the corpus, see figure 2).
I use this idea as a basis for the construction of a simplicial complex, where 0-dimensional cells are identified with word tokens, and higher-dimensional cells are determined by the n-gram patterns within a given corpus of text.
The resulting topological structure will be called the word manifold, and exploration of its properties is one of the primary goals of this work.
The word manifold is a topological space obtained directly from raw natural language data, without the use of any neural encoders.
It is a topological manifold, and it does not need a metric space embedding.
Instead, the topology expressed by its simplicial complex structure, encodes word-context relationships directly.
Gunnar Carlsson et al. applied topological data analysis to patches of pixels from naturally occurring images.
The analysis led to a conclusion that the shape of the image manifold under study could be approximated by a Klein bottle.
This realization led to a novel compression algorithm for images taking advantage of a parametrization of the pixel space that mapped 3x3 patches of images onto points on a sub-manifold homeomorphic to the Klein bottle within the image manifold.
To the best of my knowledge such approaches have not been applied in the field of Computational Linguistics.
Understanding the topological structure of natural language representations can lead to reparametrization techniques, allowing for significant model compression.
My current research efforts aim to develop methods for deriving more parameter-efficient language modeling techniques, by studying the topological structure of natural language data, which can then be used to reparametrize the embedding or introduce topological regularization techniques into the field of NLP.
We start with a corpus of natural language text \mathcal{C}, and a window size k, which defines the maximum n-gram size under consideration.
The first step in constructing the word manifold, is the induction of what I will call skeleta.
In order to define skeleta, I first have to introduce word simplices.
Abstractly they will correspond to the raw n-grams of words found in the corpus, but additional structure will come from the maps that I define in the subsequent sections.
In general, we define an n-dimensional word simplex (or n-simplex for short) to be an ordered tuple of n+1 unique word tokens found in the corpus - [ w_0, w_1, ..., w_n ] - where the w_i are elements of the lexicon.
For instance, a single word [ w_0 ] is a 0-simplex.
A bigram [ w_0, w_1] is a 1-simplex.
A context of three words [ w_0, w_1, w_2 ] is a 2-simplex.
Note that the dimension of the simplex composed of n+1 words is n, so the lexicon is zero dimensional.
Given a simplex [ w_0, ..., w_n ], we denote the subsimplex obtained by removing the word w_i from it by [w_0, ..., \hat{w_i}, ..., w_n].
We will call these subsimplices the faces of the original simplex.
For instance \{ [\text{"to"}, \text{"the"}], [\text{"go"}, \text{"the"}], [\text{"go"}, \text{"to"}] \} are faces of the word simplex [\text{"go"}, \text{"to"}, \text{"the"}].
Although the word simplices are purely combinatorial objects, it is useful to think of them as geometric entities.
Word simplices can be given a geometric realization via the following construction (see figure 3).
We define a geometric n-simplex to be the set of all points in \mathbb{R}^{n+1} of the form \sum_{i=1}^{n+1} c_i e^{(i)}, where e^{(i)} is the i-th standard basis vector, and the coefficients in this sum satisfy \sum_{i} c_i = 1 \wedge \forall_i c_i \geq 0 (that is, they form a convex linear combination of the basis vectors).
We will denote this geometric realization of an n-simplex as \Delta^n.
Given a sequence of k words found somewhere in the corpus, we call it simplicial iff it is composed of k unique words.
For instance the 3-gram "as soon as" is not simplicial because it contains only 2 unique word tokens "as", "soon".
By contrast, the 3-gram "soon as possible" is simplicial.
In order to produce the skeleta of a corpus, we first slide a window of k word tokens within each sentence, generating word simplices from all the simplicial subsequences of the window at each position.
For instance, given a sentence "Call me as soon as possible!" found in the corpus, for k=3 there are no 3-simplices (as they would require k \geq 4).
We get the following set of 2-simplices \{ [\text{"call"}, \text{"me"}, \text{"as"}], [\text{"me"}, \text{"as"}, \text{"soon"}], [\text{"soon"}, \text{"as"}, \text{"possible"}] \}.
Note that one 3-gram was dropped from this collection, because it is not simplicial.
We would also add 1-dimensional simplices \{ [\text{"call"}, \text{"me"}], [\text{"me"}, \text{"as"}], [\text{"call"}, \text{"as"}], [\text{"as"}, \text{"soon"}], [\text{"me"}, \text{"soon"}], [\text{"soon"}, \text{"as"}], [\text{"as"}, \text{"possible"}], [\text{"soon"}, \text{"possible"}] \}
Finally, the 0-dimensional word simplices would correspond to the unique words found in this sentence: \{ [\text{"call"}], [\text{"me"}], [\text{"as"}], [\text{"soon"}], [\text{"possible"}] \}.
We call the collection of all word simplices of all dimensions extracted this way from the corpus, the cells of the corpus.
Dimensionality of a cell is the dimension of its word simplex.
We now define the n-skeleton S_n to be the set of all n-dimensional cells of the corpus.
The n-skeleton is therefore the collection of all (n+1)-gram simplices [w_0, w_1, ..., w_n] for which the following condition is satisfied.
Every subsequence, of any length, composed of words from \{w_0, w_1, ..., w_n\} appears somewhere in \mathcal{C} within a window of k word tokens.
Observe that the 0-skeleton S_0 is generated by the lexicon extracted directly from \mathcal{C}.
Furthermore, the (k-1)-skeleton (top dimensional cells of the complex) will be composed of k-grams found in the corpus.
Note that the collection of all skeleta defined this way on the n-grams of the corpus, forms an abstract simplicial complex.
Furthermore, the simplices encode the distribution of words within the predefined n-gram window size, and the orientation of simplices expresses the order of words as they appear in the sentences of the corpus.
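As an illustration of the construction described above, the following sketch extracts skeleta from tokenized sentences by sliding a window of size k and keeping the simplicial subsequences; it is a minimal illustration rather than the exact implementation used in the experiments.
from itertools import combinations

def corpus_skeleta(sentences, k):
    # sentences: iterable of lists of word tokens; k: maximum window size.
    # Returns a dict mapping dimension n to the set of n-dimensional cells,
    # each cell being an ordered tuple of n+1 unique words.
    skeleta = {n: set() for n in range(k)}
    for sentence in sentences:
        for start in range(max(1, len(sentence) - k + 1)):
            window = sentence[start:start + k]
            # every subsequence of the window made of unique words (i.e. simplicial)
            # contributes a cell of dimension len(sub) - 1
            for size in range(1, len(window) + 1):
                for sub in combinations(window, size):
                    if len(set(sub)) == size:
                        skeleta[size - 1].add(tuple(sub))
    return skeleta

# The example sentence from the text, with k = 3:
sentences = [["call", "me", "as", "soon", "as", "possible"]]
print(corpus_skeleta(sentences, 3)[2])  # the three 2-simplices listed above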
CONTEXT GROUPS
Given an n-skeleton S_n extracted from the corpus of natural language text \mathcal{C}, I use it as a basis for an algebraic construction.
In particular, I define the n-th context group C_n to be the free abelian group with the generator set equal to S_n.
In other words, the word n-simplices extracted from \mathcal{C} form the basis for this group.
Its elements are formal integral sums of n-grams found in the corpus.
For instance 2 \times [\text{"you"}, \text{"know"}, \text{"a"}]- 3 \times [\text{"word"}, \text{"by"}, \text{"the"}]+ [\text{"company"}, \text{"it"}, \text{"keeps"}] could be an element of C_2 given some corpus of English text.
In general the group contains all finite formal combinations of n-gram contexts with integer coefficients.
\sum_{\alpha \in \mathcal{A}} k_{\alpha} \sigma_{\alpha}^n \text{ with } k_{\alpha} \in \mathbb{Z}
where \mathcal{A} is an index set and \sigma_{\alpha}^n range over the n-skeleton S_n.
CONTEXT HOMOMORPHISMS
There is a natural homomorphism which maps the n-dimensional context group into the (n-1)-dimensional context group.
In order to define it, I first introduce the notion of a word simplex boundary, which I will denote with the \partial symbol.
Intuitively, we would like to define the boundary of an n-simplex to be composed of (n-1)-simplices representing its faces.
As we will see later, we would also like the boundary map to be nilpotent.
The easiest way of defining such boundary maps in a coherent way is to consider the natural orientation of the faces induced from the linear ordering of the words within the simplex.
It is useful to consider specific examples before stating the general formula.
Suppose we have a bigram of words such as "go to".
We can visualize its orientation as an arrow pointing from the word "go" to the word "to" (see figure 4 for a visualization).
We then define the boundary of this bigram as the sum of its words with coefficients being -1 at the source and +1 at the target.
This is analogous to a vector pointing from a source vector to the target vector being the difference between the target and the source.
That is:
\partial [\text{"go"}, \text{"to"}] = [\text{"to"}] - [\text{"go"}]
For a trigram "go to the", we can visualize it as a triangle with vertices identified with the words it contains, and the orientation being a closed path around its edges.
As previously we will trace this path writing down the faces in a summation.
Note that previously the faces had different signs.
Similarly, we will assign signs to the bigram edges.
A visual way to do this, would be to consider a bigram edge positive if the orientation on the path we trace agrees with the ordering of the words within the larger trigram whose boundary we are computing.
Note the last edge on the path is the word simplex [\text{"the"}, \text{"go"}].
This orientation is opposite to the ordering of these words when viewed as a subsequence of the trigram simplex [\text{"go"}, \text{"to"}, \text{"the"}], as it would be obtained from it by removing the middle word "to".
The correct face is the bigram simplex [\text{"go"}, \text{"the"}], which we will identify with the negation of the simplex [\text{"the"}, \text{"go"}].
With this observation in mind, we can write down the boundary of our trigram as:
$$
\begin{align*}
& \partial [\text{"go"}, \text{"to"}, \text{"the"}] \\
& = [\text{"go"}, \text{"to"}] + [\text{"to"}, \text{"the"}] + [\text{"the"}, \text{"go"}] \\
& = [\text{"to"}, \text{"the"}] - [\text{"go"}, \text{"the"}] + [\text{"go"}, \text{"to"}] \\
& = (-1)^0 \cdot [\hat{\text{"go"}}, \text{"to"}, \text{"the"}] + (-1)^1 \cdot [\text{"go"}, \hat{\text{"to"}}, \text{"the"}] \\
& + (-1)^2 \cdot [\text{"go"}, \text{"to"}, \hat{\text{"the"}}]
\end{align*}
$$
Observe, that in the last line of the above equation, we rearranged the terms in the following way.
Each term represents the original trigram, with one word removed (marked by the hat symbol above it).
That is, each term is actually a bigram representing a face of the trigram whose boundary we are computing.
Furthermore, we alternate the signs of these bigram terms by raising negative one to the power corresponding to the index of the word we removed.
This has the result of flipping the orientation of the previously incorrectly ordered bigram [\text{"the"}, \text{"go"}] to its proper orientation [\text{"go"}, \text{"the"}] (as dictated by the order of the words appearing in the original trigram).
Another reason for rewriting this boundary equation in such form, is that it suggests a generalization to arbitrary n-grams.
We are now ready to define a boundary of a word simplex.
Given an n-dimensional word simplex, generated by an (n+1)-gram of words w_0 \dots w_n, we define its boundary as an integer linear combination of n-gram simplices:
$$
\partial [w_0, \dots, w_n] = \sum_i (-1)^i [w_0, \dots, \hat{w_i}, \dots, w_n]
$$
For completeness, let us compute the boundary of the full 4-gram in figure 4:
$$
\begin{align*}
& \partial [\text{"go"}, \text{"to"}, \text{"the"}, \text{"park"}] \\
& = [\text{"to"}, \text{"the"}, \text{"park"}] - [\text{"go"}, \text{"the"}, \text{"park"}] \\
& + [\text{"go"}, \text{"to"}, \text{"park"}] - [\text{"go"}, \text{"to"}, \text{"the"}]
\end{align*}
$$
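The boundary formula translates directly into code. The sketch below (an illustration with hypothetical names, not necessarily the article's implementation) represents a chain as a dictionary from word simplices to integer coefficients and reproduces the 4-gram computation above.
def simplex_boundary(simplex):
    # simplex: tuple of words (w_0, ..., w_n).
    # Returns a dict mapping each face (the simplex with the i-th word removed)
    # to its coefficient (-1)**i.
    boundary = {}
    for i in range(len(simplex)):
        face = simplex[:i] + simplex[i + 1:]
        boundary[face] = boundary.get(face, 0) + (-1) ** i
    return boundary

print(simplex_boundary(("go", "to", "the", "park")))
# {('to', 'the', 'park'): 1, ('go', 'the', 'park'): -1,
#  ('go', 'to', 'park'): 1, ('go', 'to', 'the'): -1}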
Using this construction, given an arbitrary linear combination of word n-simplices c = \sum_{\alpha \in \mathcal{A}} k_{\alpha} \sigma_{\alpha}^n \text{ with } k_{\alpha} \in \mathbb{Z}, where each \sigma_{\alpha}^n is an n-simplex composed of n+1 words [w_{\alpha_0}, \dots, w_{\alpha_n}], we define its n-dimensional boundary operator \partial_n by extending the simplex boundary map linearly:
$$
\begin{align*}
& \partial_n \left( c \right) \\
& = \partial_n \sum_{\alpha \in \mathcal{A}} k_{\alpha} \sigma_{\alpha}^n \\
& = \sum_{\alpha \in \mathcal{A}} k_{\alpha} \partial_n \sigma_{\alpha}^n \\
& = \sum_{\alpha \in \mathcal{A}} k_{\alpha} \sum\limits_{i=0}^{n} \left( -1 \right)^i [w_{\alpha_0}, \dots, \hat{w_{\alpha_i}}, \dots, w_{\alpha_n}]
\end{align*}
$$
It can be easily verified that \partial_n is a group homomorphism C_n \longrightarrow C_{n-1}.
Furthermore, it is nilpotent, in the sense that \partial_{n-1} \circ \partial_n = 0 (see figure 5 for a geometric analogy).
The image of the context homomorphism in dimension n is a subgroup of the context group in dimension n-1.
Furthermore, since context groups are abelian, every subgroup, and in particular this image, is normal.
This last property will allow us to take quotients of context groups later in this article.
For the computation we will perform later, it is useful to represent context homomorphisms as integer matrices.
First, we impose alphabetical ordering on the 0-skeleton.
Then the ordering for higher dimensional skeleta follows from the canonical ordering on the cartesian product (i.e. an n-gram is ordered as if it were an element of the cartesian product of n copies of the lexicon).
We then define an n-indexed collection of matrices \mathcal{B} := \left\{B_n | n \in \{0, \dots, k-1\} \right\}, where k is the maximum window size, as previously.
We will call the elements of \mathcal{B} the boundary matrices.
They are defined in the following way.
The n-dimensional boundary matrix B_n \in \mathcal{B} is an r \times c matrix, whose columns are identified with the n-skeleton, while rows correspond to the elements of the (n-1)-skeleton in the ordering defined above.
That is c = |S_{n}| and r = |S_{n-1}|.
The entries of the boundary matrix B_n are then defined as follows.
For each column j, we look at the corresponding n-simplex s_j^{(n)} \in S_n and compute its formal boundary \partial_n \left( s_j^{(n)} \right).
Recall that the word simplex boundary extends to a full group homomorphism between free abelian groups generated by S_n and S_{n-1}.
The boundary matrix is then a matrix representation of this group homomorphism - that is the entries of B_n are the coefficients (-1)^i in the boundary map formula.
More concretely, the j-th column of B_n is the vector of coefficients k_j \in \{0, \pm 1\} in the boundary map summation defined earlier in this article, when applied to an element of S_n.
The resulting collection of matrices \mathcal{B} will then be used in computations to analyze the topological structure of the word manifold.
The boundary matrices obtained from natural language corpora are highly sparse matrices over the ring of integers, which allows for efficient computations using BLAS algorithms.
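As a concrete sketch of this construction (again an illustration under the ordering conventions above, not necessarily the code used for the experiments; scipy's sparse matrices are one possible choice for the sparse representation), the boundary matrices can be assembled from the skeleta as follows, reusing the simplex_boundary helper sketched earlier.
import numpy as np
from scipy.sparse import lil_matrix

def boundary_matrix(skeleton_n, skeleton_n_minus_1):
    # Sparse matrix of the context homomorphism C_n -> C_{n-1}.
    # Columns are indexed by the (lexicographically sorted) n-skeleton,
    # rows by the sorted (n-1)-skeleton.
    cols = sorted(skeleton_n)
    rows = sorted(skeleton_n_minus_1)
    row_index = {cell: i for i, cell in enumerate(rows)}
    B = lil_matrix((len(rows), len(cols)), dtype=np.int64)
    for j, cell in enumerate(cols):
        for face, coeff in simplex_boundary(cell).items():
            if face in row_index:  # defensive: all faces should be present by construction
                B[row_index[face], j] = coeff
    return B.tocsr()

# Sanity check of the chain complex condition (boundary of a boundary is zero):
# B1 = boundary_matrix(skeleta[1], skeleta[0])
# B2 = boundary_matrix(skeleta[2], skeleta[1])
# assert (B1 @ B2).count_nonzero() == 0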
CONTEXT COMPLEXES
After inducing all the context groups from a corpus (generated by the skeleta defined by simplicial n-grams) we have a sequence of abelian groups together with group homomorphisms:
$$
\varnothing \rightarrow C_n \xrightarrow{\partial_{n}} C_{n-1} \xrightarrow{\partial_{n-1}} \dots \rightarrow C_1 \xrightarrow{\partial_1} C_0 \rightarrow \varnothing
$$
where \varnothing represents the trivial group, n is the highest cell dimension (which is bounded by one less than the maximum n-gram size considered, but could be less depending on the grammatical structure manifested by the corpus), and the boundary map satisfies \forall_n \partial_{n-1} \circ \partial_n = 0.
A sequence of groups together with nilpotent group homomorphisms of this sort is called a chain complex.
In order to rigorously construct a topological space from the type of data represented by natural language text, we need to add another layer on top of our word simplex construction.
We will call this space a context complex.
To define it, we first take a step back and introduce the idea of a cell complex, which presents a more general inductive procedure for constructing topological spaces.
Subsequently, we will link it back to our n-gram simplices, and define our context complex as a special case of this general notion, known as a delta complex.
A cell complex, also known as a CW complex (the letters standing for closure-finite, weak topology), is a topological space built inductively from the following data:
0-skeleton: a discrete set of points (in our case they will be words)
(k+1)-skeleton: obtained from the k-skeleton \mathcal{X}^k by attaching (k+1)-cells via attaching maps \mathbb{S}^{k} \rightarrow \mathcal{X}^k
It is useful to explain this definition in more detail, and look at an example before discussing its restricted case that we will use to induce topological structure from strings of words.
The construction starts with a discrete set of points (in our case a finite set, with each point corresponding to a unique word token extracted from a corpus of raw natural language text).
These points form the skeleton of 0-dimensional cells of the complex.
We can think of these points as 0-dimensional disks \mathbb{D}^0 (they have no boundary).
Subsequently, we build a graph, whose vertex set is this 0-skeleton of points, by attaching edges (1-dimensional cells).
These edges are topologically 1-dimensional disks \mathbb{D}^1, and their boundary is a pair of points (the endpoints), which is the zero dimensional sphere \mathbb{S}^0.
Following the above construction, we build the 1-skeleton by pasting in these 1-cells (the pasting is done by identification with the image of the attaching map) in such a way that their endpoints are identified with points already in the complex (i.e. the initial set of points).
In our setting, these 1-cells will represent bigrams of words, and their endpoints are words in the bigram.
The bigrams will therefore have the effect of linking the initially discrete set of points (the lexicon) into a possibly disconnected collection of graphs (a 1-dimensional complex).
The construction continues inductively.
In the next step, we would look at 2-cells, which are topologically 2-dimensional disks \mathbb{D}^2, whose boundary is the circle \mathbb{S}^1.
The 2-skeleton in our case will be induced from the trigrams.
These disks are homeomorphic to triangles with vertices equal to the three words of a trigram, and edges representing faces (bigrams obtained by removing one word from the trigram at any position - i.e. contexts of the removed word).
The boundary of a trigram of words "x y z" is composed of the bigram contexts of each word in the trigram: "_ y z", "x _ z", "x y _".
This procedure continues to higher dimensions (3-balls are homeomorphic to tetrahedra, which we will identify with 4-grams of words).
In general we are attaching n-dimensional topological closed balls (or n-dimensional disks) \mathbb{D}^n by identifying their boundaries, which are homeomorphic to (n-1)-dimensional spheres \mathbb{S}^{n-1}, to the previously constructed skeleton.
In our case this process will terminate at dimension one less than the largest n-gram size extracted from the corpus.
The entire space is then the union of all the skeleta (the identification quotient was taken at each inductive skeleton construction step).
The notion of a CW complex is too general for our purpose.
Not every valid CW structure could arise from n-gram patterns.
Natural language text is composed of strings of discrete word tokens.
Our open sets will be generated by co-occurrence patterns of these tokens within n-gram contexts.
These n-gram contexts naturally fit into the notion of abstract simplices.
We want to map these discrete word sequences onto a manifold, whose topology is built up from the n-grams.
In order to do so we will consider a restricted version of the CW construction, which links it to the word simplices discussed previously.
A \Delta-complex structure on a space \mathcal{X} is defined by a collection of continuous maps \sigma_{\alpha} : \Delta^n \longrightarrow \mathcal{X} (where n depends on \alpha) s.t.
\sigma_{\alpha} |_{\overset{\circ}{\Delta}} is injective, and every point of \mathcal{X} is in the image of exactly one of these restrictions
in the quotient space obtained by identifying each face of \Delta^n with \Delta^{n-1}, the restrictions of \sigma_{\alpha} to those faces become exactly equal to the corresponding maps used to attach those faces (the identification is done in the simplex space via a linear map - see figure 7)
To define topology, we can say that A \subseteq \mathcal{X} is open \iff \sigma_{\alpha}^{-1}(A) is open \forall \alpha .
Note that this is a special (more restrictive) case of a CW-complex. The restriction comes from the structure of the attaching maps. The simplices are homeomorphic to discs.
In case of oriented simplices, we want the identification maps of faces with lower dimensional simplices to preserve orientation.
Figure 8 shows an example of delta complex structure on the torus.
CORPUS HOMOLOGY
Homology theory constructs functorial mappings from the category of topological spaces and continuous maps to the category of abelian groups and group homomorphisms.
In this translation homotopy equivalent manifolds induce isomorphic homology groups.
This allows us to employ powerful theorems from commutative algebra for the study of topological spaces.
Homology theory quantifies the structure of cavities (holes in different dimensions) in the topological space under consideration.
There are many possible procedures resulting in equivalent homology theories (i.e. the resulting homology groups are isomorphic).
For the purposes of studying the types of finite structures that exist in natural language text, we will employ a combinatorial version of this theory known as simplicial homology.
Recall that the key property of the context group homomorphism \partial_\bullet defined above is that \forall_n \partial_{n-1} \circ \partial_n = 0, which implies that \text{im} \partial_n \subseteq \ker \partial_{n-1}.
The elements of \ker \partial_{\bullet} are called cycles, and the elements of \text{im} \partial_{\bullet} are called boundaries.
Homology of the word manifold is then defined as the group theoretic quotient of cycles modulo boundaries.
$$H_n(C_{\bullet}, \partial_{\bullet}) = \frac{\ker \partial_n}{\text{im} \partial_{n+1}}$$
These groups form algebraic invariants of topology, so homeomorphic word manifold structures will result in same groups up to isomorphism.
Homology generalizes graph theoretic methods, providing information about higher dimensional structures and their interaction with lower dimensional structures within the manifold.
For instance, homology in the zeroth dimension counts the number of connected components of a space.
Homology in the first dimension is the abelianization of the fundamental group, which can be thought of as studying loops, and their interaction with holes within the manifold.
Homology here can be thought of as a coarser approximation to homotopy by a finitely generated abelian group, which allows us to develop algorithmic approaches to its computation.
The reader is encouraged to consult the literature for proofs and more details on these constructions.
We are often interested not in the particular groups but just in their ranks, known as Betti numbers, which can be interpreted as counting cavities in all dimensions of the manifold, and thus give a coarse description of its shape (see figure 9 for an example).
In the following sections I develop algorithms to automatically compute these numbers from raw text files.
However, it is informative to see an example done by hand first.
We will improve upon this computation in later sections by taking advantage of matrix factorizations.
Figure 10 shows a 2-dimensional complex composed of three 0-cells \{x, y, z\}, four 1-cells \{a, b, c, d\}, and two 2-cells \{A, B\}.
The elements of each skeleton will form the generating sets for free abelian groups C_0, C_1, C_2 respectively.
$$
\begin{align*}
C_0 &= < x, y, z > \\
C_1 &= < a, b, c, d > \\
C_2 &= < A, B >
\end{align*}
$$
The elements of these groups are integer linear combinations of these cells.
For instance 3x - 5y +7z \in C_0 is a 0-chain, 5d - 3a \in C_1 is a 1-chain, 2A \in C_2 is a 2-chain.
Referring to figure 10, the boundary maps act on generators as \partial_1 (a) = y - x, \partial_1 (b) = z - y, \partial_1 (c) = \partial_1 (d) = x - z, and \partial_2 (A) = \partial_2 (B) = c - d.
The other boundary maps are zero maps.
We can summarize this in the following chain complex diagram.
$$
\varnothing \longrightarrow C_2 \xrightarrow{\partial_2} C_1 \xrightarrow{\partial_1} C_0 \longrightarrow \varnothing
$$
Let us compute homology of this manifold.
That is we are interested in computing the following group quotients.
$$
\begin{align*}
H_0(\mathcal{X}) &= C_0 / \text{im} \partial_1 \\
H_1(\mathcal{X}) &= \ker \partial_1 / \text{im} \partial_2 \\
H_2(\mathcal{X}) &= \ker \partial_2
\end{align*}
$$
The first quotient measures the number of connected components, and is easy to compute: the image of \partial_1 is a free group on 2 generators (for instance y-x and z-y), which is isomorphic to two copies of the integers \mathbb{Z}\times\mathbb{Z}, and the kernel of \partial_0 is all of C_0, which is 3 copies of \mathbb{Z}, because \partial_0 is the zero map.
Taking the quotient we are left with a single copy of the infinite cyclic group, that is H_0(\mathcal{X}) = \mathbb{Z}, which expresses the fact that there is a single connected component.
In order to compute homology in the other dimensions we need to perform some linear algebra operations computing the kernels.
Let us look at homology in the first dimension, which is a measure of 1-dimensional holes (or loops) in our space.
Recalling our previous remark that homology can be summarized as looking at cycles modulo boundaries, let us first figure out what the cycles in \mathcal{X} are.
Visually inspecting figure 10, it seems that the 1-dimensional chains a+b+c and a+b+d form cycles.
To verify it, let us compute their boundaries.
$$
\begin{align*}
& \partial_1 \left( a+b+c \right) \\
& = \partial_1 (a) + \partial_1 (b) + \partial_1 (c) \\
& = (y-x) + (z-y) + (x-z) \\
& = 0
\end{align*}
$$
$$
\begin{align*}
& \partial_1 \left( a+b+d \right) \\
& = \partial_1 (a) + \partial_1 (b) + \partial_1 (d) \\
& = (y-x) + (z-y) + (x-z) \\
& = 0
\end{align*}
$$
The chain c-d is also a cycle as: \partial_1 (c-d) = (x-z) - (x-z) = 0.
In order to find the group of all 1-dimensional cycles, we need to compute the kernel of \partial_1.
A cycle is just any integer linear combination of 1-cells, \alpha a + \beta b + \gamma c + \delta d, which is mapped to 0 by the boundary homomorphism.
We thus have
$$
\begin{align*}
\partial (\alpha a + \beta b + \gamma c + \delta d)
&= \alpha \partial (a) + \beta \partial (b) + \gamma \partial (c) + \delta \partial (d) \\
&= \alpha (y-x) + \beta (z-y) + \gamma (x-z) + \delta (x-z) \\
&= (- \alpha + \gamma + \delta) x + (\alpha - \beta) y + (\beta - \gamma - \delta) z
\end{align*}
$$
For this boundary to vanish, the coefficient of each 0-cell must be zero, so we need to solve the system of linear equations
$$
\begin{align*}
- \alpha + \gamma + \delta &= 0 \\
\alpha - \beta &= 0 \\
\beta - \gamma - \delta &= 0
\end{align*}
$$
Equivalently, we find the nullspace of the matrix representation of \partial_1 (solve Ax=0):
$$
\begin{pmatrix}
-1 & 0 & 1 & 1 \\
1 & -1 & 0 & 0 \\
0 & 1 & -1 & -1
\end{pmatrix}
\begin{pmatrix}
\alpha \\ \beta \\ \gamma \\ \delta
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ 0
\end{pmatrix}
$$
which means the space of solutions can be parametrized by
$$
\begin{align*}
\alpha &= r+s \\
\beta &= r+s \\
\gamma &= r \\
\delta &= s
\end{align*}
$$
hence any 1-cycle can be written as
$$
\begin{pmatrix}
\alpha \\
\beta \\
\gamma \\
\delta
\end{pmatrix}
=
r
\begin{pmatrix}
1 \\
1 \\
1 \\
0
\end{pmatrix}
+
s
\begin{pmatrix}
1 \\
1 \\
0 \\
1
\end{pmatrix}
$$
and so the space of cycles is 2-dimensional with a basis \{a+b+c, a+b+d\}.
Going back to our boundary operator, we found that
$$
\ker \partial_1 = < a+b+c, a+b+d > \cong \mathbb{Z} \oplus \mathbb{Z}
$$
and since the image of the boundary from one dimension above is \text{im} \partial_2 = < c-d > \cong \mathbb{Z}, we have
$$
H_1(\mathcal{X}) = \ker \partial_1 / \text{im} \partial_2 = (\mathbb{Z} \oplus \mathbb{Z}) / \mathbb{Z} \cong \mathbb{Z}
$$
That is, our manifold has a single hole in dimension 1, generating the infinite cyclic homology group H_1.
For completeness, we also have a single spherical cavity (2-dimensional hole), generated by the empty space between the 2-cells A and B.
$$
H_2(\mathcal{X}) = \ker \partial_2 = < A-B > \cong \mathbb{Z}
$$
Therefore, the complete set of Betti numbers for the manifold in figure 10 is \left\{ 1, 1, 1 \right\}.
Thanks to the Universal Coefficient Theorem, we can lift our algebraic description of homology to the realm of vector spaces over the rationals, and express the manual computation presented previously in terms of linear algebra operations.
This will allow us to take advantage of efficient Basic Linear Algebra Subprograms (BLAS), and develop code to automatically obtain invariants such as Betti numbers from raw text files.
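As a sanity check of this linear algebra viewpoint, the following sketch recomputes the Betti numbers of the example complex using the rank-nullity relation \beta_n = \dim \ker \partial_n - \text{rank} \, \partial_{n+1}; the boundary matrices below are my transcription of the boundary maps used in the hand computation above.
import numpy as np

# Columns of B1 are the 1-cells a, b, c, d; rows are the 0-cells x, y, z.
B1 = np.array([[-1,  0,  1,  1],
               [ 1, -1,  0,  0],
               [ 0,  1, -1, -1]])
# Columns of B2 are the 2-cells A, B; rows are a, b, c, d (both 2-cells bound c - d).
B2 = np.array([[0, 0], [0, 0], [1, 1], [-1, -1]])

def betti(boundary_n, boundary_n_plus_1):
    # beta_n = dim ker(d_n) - rank(d_{n+1}), computed over the rationals
    nullity = boundary_n.shape[1] - np.linalg.matrix_rank(boundary_n)
    return nullity - np.linalg.matrix_rank(boundary_n_plus_1)

d0 = np.zeros((1, 3))  # the zero map on C_0
d3 = np.zeros((2, 1))  # nothing above dimension 2
print(betti(d0, B1), betti(B1, B2), betti(B2, d3))  # 1 1 1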
ITERATIVE APPROACH
Before presenting the main approach used in the experiments, I want to briefly mention earlier ideas that I experimented with, prior to arriving at the simultaneous change of basis algorithm.
When we switch homology coefficients from the ring of integers \mathbb{Z} to the field of rational numbers \mathbb{Q}, our n-gram cells become basis vectors, and their formal sums become linear combinations over the field of rationals.
One naive approach, which is computationally valid but inefficient, is to think of homology classes as a basis for a subspace of a vector space contained in the kernel of the boundary map \partial_k but outside the image of \partial_{k+1}.
Thinking this way, we can use the following algorithm.
Compute a basis \mathcal{B} for the kernel of \partial_k.
Loop over the image of the set of (k+1)-cells under \partial_{k+1}, eliminating boundaries from \mathcal{B}.
However, instead of thinking of spaces directly, we can instead take a more category theoretic approach, and work with the boundary maps without the need to loop over generators.
This leads to a more efficient and elegant solution - the cokernel algorithm.
COKERNEL ALGORITHM
Because we have a context complex, which satisfies the chain complex condition \forall_k \partial_k \circ \partial_{k+1} = 0, the image of \partial_{k+1} is a subspace of the kernel of \partial_k, and we can factor the boundary map through the kernel as in figure 11.
This factorization gives us a linear map m, such that \partial_{k+1} = \ker \partial_k \circ m.
We can compute a matrix representation of m using the numpy linalg module.
import numpy as np
# ... - setup code resulting in:
# kdk - matrix whose columns form a basis of the kernel of the boundary in dimension k
# dkp1 - matrix of the boundary map in dimension k+1
# np.linalg.lstsq returns a tuple; its first element is the least-squares solution for m
m, *_ = np.linalg.lstsq(kdk, dkp1, rcond=None)
FIGURE 11
Once we obtain a matrix for m, we can compute its cokernel.
Let us denote the matrix representation of this cokernel map by \text{coker}(m) (see figure 11).
Each column of \text{coker}(m) is a homology generator.
However, we want to think of these vectors as elements of the context group C_k.
This translation can be done by simply composing the maps - or equivalently performing the matrix multiplication:
$$
\text{ker}(\partial_k) \text{coker}(m)
$$
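Putting these pieces together, a minimal sketch of the cokernel algorithm over the rationals might look as follows; the use of scipy's null_space to obtain kernel and cokernel bases is one possible choice, not a prescription, and the variable names mirror the snippet above.
import numpy as np
from scipy.linalg import null_space

def homology_generators(Bk, Bkp1):
    # Bk, Bkp1: dense matrices of the boundary maps in dimensions k and k+1.
    # Returns a matrix whose columns are homology generators expressed in C_k.
    kdk = null_space(Bk)                      # columns span ker(d_k)
    # factor d_{k+1} through the kernel inclusion: d_{k+1} = kdk @ m
    m, *_ = np.linalg.lstsq(kdk, Bkp1, rcond=None)
    coker_m = null_space(m.T)                 # basis of the cokernel of m
    return kdk @ coker_m                      # map generators back into C_k

# For the example complex used earlier, the number of columns returned for
# dimension 1 equals the first Betti number:
# homology_generators(B1, B2).shape[1] == 1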
This algorithm is faster than looping over images of boundary generators, since we are using efficient numpy subroutines, which call highly optimized procedures (some written in FORTRAN), and we can obtain Betti numbers in a single step.
This leads to the final approach, yielding the algorithm used in the experiments.
In this algorithm, similarly to the cokernel computation, we do not need to loop over generators, and instead we work with matrix representations of boundary maps directly.
The computation however, relies on a simultaneous change of basis for the two consecutive context groups.
SIMULTANEOUS CHANGE OF BASIS ALGORITHM
We are now ready to derive an algorithm for computing the shape of words - i.e. the topology of the word manifold defined in the previous sections.
In what follows, we will think of our context groups as vector spaces, whose bases are the elements of the skeleta (S_n) of the word manifold defined previously.
Recall, that these are determined by the n-grams of tokens found in the corpus.
Our goal is to understand how we can compute homology in dimension n of the word manifold in a procedural manner, that can be translated into efficient code.
In order to do that, we have to understand the context homomorphism C_n \xrightarrow{\partial_n} C_{n-1}.
In our vector space interpretation, this homomorphism becomes a linear map between vector spaces with bases determined by n-grams and (n-1)-grams of tokens extracted from the corpus.
For instance, if we are talking about trigrams of words of the form w_0 w_1 w_2, the context homomorphism maps them to linear combinations of the bigram contexts of each word in the trigram, extended linearly to the entire vector space spanned by the trigrams.
For a single basis vector w_0 w_1 w_2, the bigram contexts are \_ w_1 w_2, w_0 \_ w_2, and w_0 w_1 \_, and the context homomorphism maps it into the formal sum w_1 w_2 - w_0 w_2 + w_0 w_1.
This is a vector in the space of bigrams, whose standard representation in that basis is a sparse vector of all 0's except two 1's and a single -1 in the components corresponding to those bigrams (in some fixed ordering of the basis - lexicographic in our computations).
Because we have those bases, we can write a matrix representation of this operator with respect to those two bases, by recording the coefficients in the mapping of each basis vector.
It is informative to be more explicit here, for the benefit of future readers of this article.
Suppose we are studying some corpus, and we derived the skeleta as discussed previously, with the 0-skeleton being the set of unique tokens extracted from it.
In order to present a manageable example, let us focus on a subset of 7 words, and look at the patterns of their co-occurrence.
For simplicity of notation we will denote these words with integers 1,\dots,7.
Suppose now, that in the given corpus the triples of words \{ \{ 1,2,3 \}, \{ 1,2,4 \}, \{ 1,3,4 \}, \{ 2,3,4\} \} appeared in the same context, but the entire quadruple \{1,2,3,4\} did not.
This also means that the pairs \{ \{ 1,2 \}, \{ 1,3 \}, \{ 1,4 \}, \{ 2,3 \}, \{ 2,4 \}, \{ 3,4 \} \} appear in the same context based on the triples given.
Furthermore, suppose that the pairs of words \{ \{ 3,5 \}, \{ 4,5 \} \} do share contexts, but the triple \{3,4,5\} does not.
Let us also suppose that there is a bigram \{6,7\} somewhere in the corpus, which does not share contexts with the other words 1,\dots,5.
Figure 12 shows a visualization of a submanifold of the word manifold generated by these seven words.
Let us now consider the relationship between bigrams and the words.
We can define two vector spaces using this data.
The first space C_0 is spanned by linear combinations of words - i.e it is a 7-dimensional space with a basis corresponding to the seven words 1,\dots,7.
The second space, C_1, represents the bigrams, and in our example it is a 9-dimensional space with basis corresponding to the set of pairs \{ \{ 1,2 \}, \{ 1,3 \}, \{ 1,4 \}, \{ 2,3 \}, \{ 2,4 \}, \{ 3,4 \}, \{ 3,5 \}, \{ 4,5 \}, \{6,7\} \}.
Recall that the context homomorphism will map each n-gram to a linear combination with alternating signs (a vector in the context space) of (n-1)-grams representing the contexts of each word in the n-gram.
In case of a bigram, the contexts are just the other word in the bigram, and the result is a difference between two word vectors.
For example the basis element \{ 1, 3 \} of C_1 will be mapped to the vector \{ 3 \} - \{ 1 \} within C_0, because the context of the word 1 in this case is the word 3, and vice versa.
Looking at figure 12, we can write down a matrix representation of the linear operator \partial_1 (that is the bigram homomorphism) with respect to the basis for bigrams and the basis for words in the following way.
We list the bigram basis as labels for the columns of a matrix, and the words as labels for the rows.
We then record the coefficients in those bases of where each bigram is mapped within the word space.
Thus, in our example, \{ 1, 3 \} would label the column vector with the representation [-1, 0, 1, 0, 0, 0, 0] with respect to the ordered basis for the word space C_0.
The full matrix representation of the bigram homomorphism, with columns ordered by the bigram basis listed above and rows by the words 1,\dots,7, is therefore:
$$
\partial_1 =
\begin{pmatrix}
-1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & -1 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$
We also note that there is an implicit zero map \partial_0, which maps all words to the trivial group of no context.
It is represented by the 1x7 matrix of all zeros (hence rank 0).
Equivalently, \ker (\partial_0) = \mathbb{Q}^{\text{dim}C_0} = \mathbb{Q}^{7}.
We have the following chain complex.
$$
C_2 \xrightarrow{\partial_2} C_1 \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} \varnothing
$$
Where C_2 is determined by triples of words, C_1 is the space of bigrams and C_0 is the space of words.
Let A be the matrix representation of \partial_1, and B be the matrix representation of \partial_0 (that is the zero matrix of a single row) with respect to the bigram and word bases as above.
Since B is just the zero matrix, we don't need to study it further.
The A matrix is, however, nontrivial, and it is not clear what its rank is from the representation above.
The idea behind the algorithm developed here is to use reversible row and column operations to transform this matrix into a reduced representation.
These operations correspond to change of basis in the corresponding bigram and word spaces.
In particular elementary column operations on our matrix \partial_1 correspond to changing the basis for the space of bigrams, and elementary row operations correspond to a change of basis in the word space.
The key to determining the space of 1-dimensional cycles will be to perform simultaneous change of basis in the bigram and word spaces in such a way that the A matrix becomes nearly diagonal (composed of a diagonal submatrix and zeros outside that submatrix).
In particular, by performing the following operations:
multiplying column/row by -1
exchanging column/row i with column/row j
adding an integer multiple of column/row i to column/row j (for i \neq j)
The two representations of \partial_1 obtained this way, A and \tilde{A}, are thus equivalent matrices: \tilde{A} = Q A P for invertible integer matrices Q and P accumulated from the row and column operations.
Equivalent matrices represent the same linear map, but between vector spaces expressed in different bases.
The matrices Q and P encode the necessary change of bases in the word and bigram spaces respectively, in order for \tilde{A} to be a matrix representation of \partial_1.
It turns out we can always reduce a matrix representation of the context homomorphism to such form by performing the necessary manipulation of n-gram bases.
Concretely, we can always reduce it to a matrix of the form:
$$
\tilde{A} =
\begin{pmatrix}
a_1 & & & & \\
 & a_2 & & & \\
 & & \ddots & & \\
 & & & a_k & \\
 & & & & 0
\end{pmatrix}
$$
where \forall_i \, a_i \geq 1 and a_1 | a_2 | \dots | a_k (that is, each diagonal value divides the next), and all entries outside the diagonal block are zero.
This fact can be seen by the following argument.
Find the lowest absolute value entry in the A matrix.
Suppose that this value is a_{ij}.
Check that it divides all the other entries in the same row.
If there is an entry a_{ik} which is not divisible by a_{ij}, we can write a_{ik} = qa_{ij}+r.
The Euclidean algorithm implies \text{gcd}\left( a_{ik}, a_{ij} \right) = \text{gcd}\left( a_{ij}, r \right).
We can replace the entry a_{ik} with the remainder r by elementary column operation: subtract q-multiple of column j from column k.
Now we start with a_{ik} as our smallest absolute value entry and repeat the process until the smallest entry divides all entries in the same row.
We can similarly perform elementary row operations (subtracting multiples of rows) so that the smallest entry also divides all entries in the same column.
Eventually we have the condition that the smallest entry, say a_{ij} divides all entries in the same row and the same column.
Suppose it does not, however, divide some off-diagonal entry a_{kl}.
From the previous steps we know that there exists some m \in \mathbb{Z} such that a_{kj} = ma_{ij}.
We can perform the elementary row operation: replace row k with the difference A_{k,:} - mA_{i,:}.
We then have a_{kj} = 0, and adding row k to row i reduces back to the case where some entries in the same row are not divisible by a_{ij}.
Repeating this procedure, we can eventually reduce the matrix to a situation where the smallest entry divides all other entries in the matrix.
If the entry is negative, we can multiply it by negative one to flip the sign.
Finally we can exchange rows and columns to bring the nonzero entries to the form where the smallest entry is in the upper left corner and all other entries in the same row and column are zero:
$$
\begin{pmatrix}
a_1 & 0 \\
0 & C
\end{pmatrix}
$$
Observe that a_1 now divides all entries in the submatrix C.
By repeating the entire procedure on this smaller submatrix C, we eventually arrive at the desired diagonal form displayed above.
Note that the dimension of the bigram space C_1 is equal to the number of columns in matrix representation \tilde{A} of the context homomorphism \partial_1.
Similarly, the dimension of the word space is equal to the number of rows in \tilde{A}.
Furthermore, we can read off the dimension of the subspace of the word space corresponding to boundaries (image of \partial_1, which is collapsed in the homology quotient) as the number of nonzero entries in \tilde{A} - that is k.
Finally, what remains is the number of columns of all zeros in \tilde{A} (n-k where n is the number of columns, and k the number of nonzero entries a_i), which gives the dimension of the space of cycles in the bigram space.
The form of matrix \tilde{A} obtained from A by elementary row and column operations as above is called the Smith Normal Form (SNF) of A.
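For completeness, here is a compact, unoptimized transcription of the reduction just described; it diagonalizes an integer matrix by elementary row and column operations but, for brevity, omits both the final divisibility normalization (making a_1 | a_2 | \dots | a_k) and the bookkeeping of the change-of-basis matrices Q and P, since only the number of nonzero diagonal entries matters for the Betti number computations below.
import numpy as np

def smith_normal_form(A):
    # Diagonalize an integer matrix by elementary row and column operations.
    A = np.array(A, dtype=np.int64)
    m, n = A.shape
    t = 0
    while t < min(m, n) and A[t:, t:].any():
        block = A[t:, t:]
        # move the smallest nonzero entry (in absolute value) to the pivot position
        nz = np.argwhere(block != 0)
        i, j = min(nz, key=lambda p: abs(block[p[0], p[1]]))
        A[[t, t + i]] = A[[t + i, t]]
        A[:, [t, t + j]] = A[:, [t + j, t]]
        if A[t, t] < 0:
            A[t, :] *= -1
        pivot = A[t, t]
        # clear below the pivot (row operations) and to its right (column operations);
        # floor division leaves remainders strictly smaller than the pivot, so if any
        # remain, the loop repeats with a smaller pivot and must terminate
        for r in range(t + 1, m):
            A[r, :] -= (A[r, t] // pivot) * A[t, :]
        for c in range(t + 1, n):
            A[:, c] -= (A[t, c] // pivot) * A[:, t]
        if not A[t, t + 1:].any() and not A[t + 1:, t].any():
            t += 1  # the pivot row and column are clear, move on to the next block
    return A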
We can generalize this example into the following theorem, which we will use to perform computations in the following sections.
Proposition: Let A be an m \times n matrix, and B be an l \times m integer matrix with BA = 0. Then \ker (B) / \text{im} (A) = \bigoplus\limits_{i=1}^{r} \mathbb{Z}/\alpha_i \oplus \mathbb{Z}^{m-r-s}, where r = \text{rank} (A), s = \text{rank} (B), and \alpha_1, \ldots, \alpha_r are the nonzero elements of the Smith normal form of A.
Returning to our running example, the matrix B is the zero map \partial_0, a 1 \times 7 matrix of all zeros, so \ker (\partial_0) = C_0 \cong \mathbb{Z}^{7}.
Computing the Smith normal form of the bigram boundary matrix A above gives r = \text{rank} (\tilde{A}) = 5, so H_0 = \mathbb{Z}^{7-5-0} = \mathbb{Z}^2, which visually agrees with the two connected components in figure 12.
Similar computation gives H_1 = \mathbb{Z} and H_2 = \mathbb{Z}, which correspond to the one and two dimensional cavities (represented by a loop of bigrams, and the empty space within the tetrahedron of trigrams) in figure 12.
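The proposition itself reduces to a few lines of code; the sketch below reuses the smith_normal_form routine above, and the commented usage refers to the boundary matrices of the seven-word example (A being the 7 x 9 bigram boundary matrix written earlier and B the 1 x 7 zero map).
import numpy as np

def free_rank_of_homology(A, B):
    # Free rank of ker(B)/im(A) via the proposition: m - r - s,
    # where m is the number of rows of A (= number of columns of B).
    m = A.shape[0]
    r = np.count_nonzero(np.diag(smith_normal_form(A)))
    s = np.count_nonzero(np.diag(smith_normal_form(B)))
    return m - r - s

# free_rank_of_homology(A, B) == 2 for the seven-word example,
# matching the two connected components in figure 12.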
I performed a total of 1320 experiments with eight corpora of natural language text.
Each experiment involves a random sampling of 5000 sentences.
The sampling for each experiment was repeated 10 times, and the results were summarized with a box plot.
The experiments took about a month on a single Ryzen TR node with 48 CPU cores, 256 GB of RAM, and an NVIDIA RTX 3090 GPU.
The first experiment derives word manifolds from a smaller test corpus constructed manually by linguists at Brown University .
In order to ablate the contribution of linguistic structure in the formation of topological features within the word manifold, I also construct synthetic corpora, obtained from this corpus by procedural generation targeting specific effects.
This artificially generated data is compared to natural data in the following three experiments.
The next six sets of experiments target larger corpora of modern languages obtained by crawling native news sources over the same period of time: Arabic, English, French, German, Japanese, Russian.
Finally, in the eighth set of experiments, I apply the same analysis to transliterations of the Voynich Manuscript, shedding new light on this unsolved puzzle in corpus linguistics.
Figure 13 shows box plots comparing free ranks of homology groups in the first four dimensions of the word manifold.
Those results can be interpreted as a coarse summary of the structure of cavities in different dimensions.
First observation is that dimensions 1 and 2 seem to be most sensitive to grammatical differences between languages.
Voynich is a clear outlier in dimensions 0 and 1.
This is interesting, as the experiments with synthetic data in the next subsection suggest that features in dimension 0 (connected components) are related to parts of speech, and holes in dimension 1 encode sentence level syntactic information.
In dimension 1, French has more topological structure than English, while in dimension 2 English exhibits more complex features than French.
This is interesting when looking at visualization of their vector space embeddings in figure 1, where English appears more ball like (which would make sense, since dimension 2 is defined by sphere like objects)!
The study of topological relationship between word manifolds associated to raw natural language text, and their vector space embeddings induced by language models is the topic of the next article.
The word manifold whose dimension 1 structure most closely matches that of the Voynich Manuscript is the Russian one, which would align with the known history of the manuscript, providing support to several mainstream theories about its possible origin.
Japanese is a clear outlier in dimension 3 (see figure 14), manifesting nontrivial topological structure.
This is interesting since it is the only Asian language included in the study, suggesting that higher dimensions of the word manifold possibly measure differences between language groups.
It is a topic I plan to investigate in future work.
In this study I took a corpus of natural language text in English, and used it as a basis for generating additional corpora targeting specific structures.
These corpora were constructed from the natural language text by means of three transforms, aimed to be progressively more destructive to linguistic structure: permutation, zipf, and uniform.
The permutation transform simply permutes the order of words within each sentence of the corpus.
This operation preserves many word interactions, since sentence boundaries still limit the possible contexts of any given word.
The zipf transform generates sentences randomly by unigram sampling from a Zipf's law distribution.
Finally, the uniform transform samples words uniformly at random.
Both zipf and uniform transforms maintain the natural distribution in sentence length.
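For concreteness, minimal sketches of the three transforms could look like the following; the exact sampling details used in the experiments (for instance the Zipf exponent, taken here as 1.5 purely for illustration) may differ.
import numpy as np

rng = np.random.default_rng(0)

def permutation_transform(sentences):
    # shuffle word order within each sentence, preserving sentence membership
    return [list(rng.permutation(s)) for s in sentences]

def zipf_transform(sentences, lexicon, a=1.5):
    # resample each sentence from a Zipf-like unigram distribution over the lexicon,
    # preserving the natural distribution of sentence lengths
    ranks = np.arange(1, len(lexicon) + 1, dtype=float)
    p = ranks ** (-a)
    p /= p.sum()
    return [list(rng.choice(lexicon, size=len(s), p=p)) for s in sentences]

def uniform_transform(sentences, lexicon):
    # resample each sentence uniformly at random from the lexicon, preserving lengths
    return [list(rng.choice(lexicon, size=len(s))) for s in sentences]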
I then derived the word manifolds from both the natural and synthetic corpora and compared them against each other.
Figure 15 shows box plots of free ranks of homology groups of those word manifolds associated to natural, and three synthetically generated corpora.
We observe significant differences in dimension 1 and 2.
In dimension 1, the uniform corpus is an outlier, with significantly less topological features.
This suggests that dimension 1 is at least partially associated with sentence structure, as uniform transform destroys sentences.
On the other hand, permutation and zipf transforms produce too many holes in dimension 1, which suggests that dimension 1 must also encode syntax and semantics to some degree, going beyond simple word statistics.
Interestingly, dimension 0, which measures the number of connected components in the manifold, is possibly related to the parts of speech.
Figure 16 shows a zoomed in view of the zeroth dimension.
We note that the natural corpus seems to have around 8 connected components, which would correlate with 8 classical parts of speech .
The permutation corpus also exhibits nontrivial homology in this dimension.
By contrast the uniform and zipf transforms produce fully connected word manifolds.
This makes sense, since the permutation transform still retains sentence structure to some degree, limiting possible word interactions, while the other two transforms completely destroy sentence boundaries, which allows for arbitrary connections between words to be formed.
Figure 17 shows summary of dimension 2.
In this dimension the permutation and natural corpora are the closest, and clear outliers from the rest.
This suggests that higher dimensions of the word manifold go beyond statistical correlations between words and involve higher level linguistic structure.
Recall that the uniform and zipf corpora are generated using simple unigram distributions, which destroys most of the grammatical structure.
By contrast the permutation transform will retain a lot of n-gram context data, which contains syntactic and semantic information about words.
This is clearly manifested in the 2-dimensional cavity structure of the associated word manifolds.
I presented a novel approach to the study of linguistic structure through a topological lens.
I defined the notion of a word manifold, and described an algorithm for its induction from raw natural language text.
I also discussed the computation of algebraic invariants of the word manifold in form of homology groups.
I then applied my method to a variety of languages, synthetically generated data, and an unknown script from the 15th century, currently held at the Yale University library.
These results show that the topology of the word manifold is influenced by linguistic structure expressed by the corpus.
Furthermore, we can interpret dimensions of the word manifold by comparing natural and synthetic data.
In the next articles, I explore the relationship of the word manifolds associated to raw text data, and vector space representations of linguistic units that arise in neural language models.
Remarks
This article contains results I presented at the International Conference on Machine Learning (ICML 2022 in Baltimore, MD, USA).