In this article, I give a general overview of a series of experiments, grouped into ten coherent categories, that form phase one of the shape of words project. The experiment descriptions also introduce new concepts and cover some of the previously discussed ideas in more detail.
In the previous article, the shape of words, I described in general terms a research program aimed at understanding natural language data through the lens of algebraic topology.
The central theme of the shape of words project is exploring to what degree topological structure plays a role in deriving linguistic unit representations, and how topological constructs can be employed to gain deeper insights into neural language models.
A better understanding of representation manifolds inherent to neural language models can lead to more informative or compressed representations.
Recent developments in deep learning hint at the utility of understanding the topological structure of data for improving the representational power of neural information processing systems.
For instance, in
In order to attempt answering the questions posed in the shape of words project, I first develop methods for assigning meaningful topology to the two sides of a language model: the raw text data, and its vector space representations induced by the neural network. For this purpose, two sets of techniques will be used. On the corpus side, I construct a simplicial complex that encodes word co-occurrence patterns. On the embedding side, I construct Vietoris-Rips filtrations based on representations induced by the neural language model. A starting observation I make here is that these two structures must be naturally related. This is because the language model induces representations via self-supervised back-propagation learning from masked sequences of words appearing in the corpus. The representations that are useful for predicting masked words from their contexts must therefore encode the word co-occurrence patterns expressed by the n-grams extracted from the corpus.
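To make the corpus-side construction concrete, here is a minimal sketch (my own illustration, not the project's implementation) in which each n-gram observed in a token stream contributes a simplex whose vertices are its word types, and the complex is closed under the face relation:

```python
# A sketch of the word-manifold construction from n-gram co-occurrence.
from itertools import combinations

def word_simplices(tokens, n=3):
    """Maximal simplices (frozensets of words) contributed by each n-gram window."""
    return {frozenset(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def faces(simplex):
    """All nonempty faces of a simplex, to close the complex under the face relation."""
    return {frozenset(f) for k in range(1, len(simplex) + 1) for f in combinations(simplex, k)}

tokens = "the grass was green and the sky was blue".split()
complex_ = set().union(*(faces(s) for s in word_simplices(tokens, n=3)))
```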
The project progresses by exploring the topological structure of the raw n-grams and their representations, investigating relationships between them, and later attempting to formalize conclusions in the language of Category Theory. Before that is possible, a series of experiments needs to be carried out. Those experiments are aimed at understanding the structure of the topological manifolds associated with the corpus, as well as their vector space representations within neural language models. The experiments also investigate learning dynamics, representations of sentences and documents, and the utility of topological methods for inducing more informative representations (resulting in model compression). The following sections outline the goals of each experiment to be performed.
SECTION 2
In this experiment, I will compute the word manifold homology of several synthetic corpora in order to gauge the behavior of the algorithm developed for this purpose. The resulting analysis will form a baseline before moving on to real natural language data. I will perform the analysis for varying sizes of n-grams (which determine the maximum possible dimensionality of the resulting simplicial complex), and qualitatively examine the outputs. The main mathematical apparatus used here will be homology of finite simplicial complexes computed using the Smith normal form of the boundary map.
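As an illustration of the kind of computation involved, the following sketch computes Betti numbers of a small complex from ranks of boundary matrices over Z/2; this is a simplification of the Smith normal form approach mentioned above (over Z/2, Gaussian elimination suffices), and the helper names are my own:

```python
# Betti numbers from boundary matrix ranks: beta_k = (#k-simplices - rank d_k) - rank d_{k+1}.
from itertools import combinations
import numpy as np

def boundary_matrix(k_simplices, km1_simplices):
    """Z/2 boundary matrix from k-simplices to (k-1)-simplices."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    D = np.zeros((len(km1_simplices), len(k_simplices)), dtype=np.uint8)
    for j, simplex in enumerate(k_simplices):
        for face in combinations(simplex, len(simplex) - 1):
            D[index[face], j] = 1
    return D

def rank_mod2(D):
    """Rank of a binary matrix over Z/2 by Gaussian elimination."""
    D = D.copy() % 2
    rank, pivot_row = 0, 0
    for col in range(D.shape[1]):
        pivot = next((r for r in range(pivot_row, D.shape[0]) if D[r, col]), None)
        if pivot is None:
            continue
        D[[pivot_row, pivot]] = D[[pivot, pivot_row]]
        for r in range(D.shape[0]):
            if r != pivot_row and D[r, col]:
                D[r] ^= D[pivot_row]
        pivot_row += 1
        rank += 1
    return rank

# Example: a hollow triangle has one connected component and one 1-dimensional hole.
vertices = [('a',), ('b',), ('c',)]
edges = [('a', 'b'), ('b', 'c'), ('a', 'c')]
r1 = rank_mod2(boundary_matrix(edges, vertices))
beta_0 = len(vertices) - r1        # = 1
beta_1 = (len(edges) - r1) - 0     # no 2-simplices, so rank d_2 = 0; = 1
print(beta_0, beta_1)
```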
SECTION 3
In this experiment, I will examine what kind of linguistic signal affects the topological structure of the word manifold the most. The mathematical and algorithmic methods will be the same as in the previous experiment, but I will use natural language data instead. The natural language corpora will be the moving piece, and they will be selected to ablate syntactic and semantic signal in the data. The experiment has two parts. In the first part, I construct word manifolds from corpora in two languages (English and French) that are translations of the same articles. This is meant to keep semantic signal constant while changing the syntax. In the subsequent part, I study documents from the same language (English) but taken from different styles of writing, and focused on different topics. This is meant to keep the syntax constant while varying semantics. I then compare the linguistic manifolds corresponding to varying these two dimensions of natural language data.
SECTION 4
This experiment came from the realization that there are two complementary ways of viewing a corpus of text. One is to consider words as the objects of study, and build structure from the relationships between words based on their surrounding contexts (figure 1 shows a visualization of a context extracted from data used in this experiment). A dual view that I want to explore in this experiment leads to the construction of the context manifold. This is another simplicial complex built from the same data as the word manifold in the previous experiments, but the atomic objects of study here are n-grams of words, which I call contexts. The structure of the manifold is then induced from the relationships between contexts determined by the words that appear in them.
To exemplify this idea, it is useful to give a simple illustration. Consider the following three 3-grams taken from some corpus of natural language text: "grass was green", "grass is green", "it was good". We could look at the middle word in each 3-gram: "was", "is", "was". The remaining two words are then the context of the middle word: "grass __ green", "grass __ green", "it __ good". In the first view (focused on words), we could say that the word "was" and the word "is" are related because they share the same context "grass __ green". In the second view (focused on contexts), we can say that the contexts "grass __ green" and "it __ good" are related because they share the same word "was".
The two views can be summarized as follows: in the word view, words are the objects of study, and two words are related when they share a context; in the context view, contexts are the objects of study, and two contexts are related when they share a word.
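A small sketch of the two bookkeeping structures, using the three 3-grams from the example above (an illustration only, not the project's code):

```python
# Dual views over 3-grams: words related by shared contexts, contexts related by shared words.
from collections import defaultdict

trigrams = [("grass", "was", "green"), ("grass", "is", "green"), ("it", "was", "good")]

words_by_context = defaultdict(set)   # context -> middle words appearing in it
contexts_by_word = defaultdict(set)   # middle word -> contexts it appears in
for left, middle, right in trigrams:
    context = (left, right)
    words_by_context[context].add(middle)
    contexts_by_word[middle].add(context)

# Word view: "was" and "is" are related via the shared context ("grass", "green").
print(words_by_context[("grass", "green")])   # {'was', 'is'}
# Context view: ("grass", "green") and ("it", "good") are related via the shared word "was".
print(contexts_by_word["was"])                # {('grass', 'green'), ('it', 'good')}
```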
This experiment will construct the dual manifold from the same data as the previous word manifold experiments, and compare the differences in emergent topological structure.
SECTION 5
In this experiment, I train a neural language model on corpora of synthetic and natural language text using the masked sequence prediction objective. In this general task, the model is given sentences from the corpus with random words missing, and the task is to predict the missing words given their context. In order to make the predictions, the model induces a vector space representation of the corpus that is informative for approximating the word co-occurrence distribution present in its sentences. The ambient space is high dimensional (I consider a range of 50-500 dimensions), but the word embeddings form a point cloud in a neighborhood of a lower dimensional submanifold. The goal of this experiment is to explore the topological features of this manifold by approximating it with a sequence of simplicial complexes of lower dimension. This involves projecting the point cloud onto a lower dimensional space and clustering the preimages of an open cover in this projection. I then associate a simplicial complex to this clustering, where 0-simplices are groups of words, and higher dimensional simplices provide a coarse sketch of their distribution within the high-dimensional embedding manifold. This description is reductionist, and is meant to give a more qualitative view of the data. The following experiments investigate topological features of this manifold more directly (without employing this type of dimensionality reduction).
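The procedure described here resembles the Mapper construction from topological data analysis. Below is a minimal sketch under assumed parameters (a 1-dimensional lens, an overlapping interval cover, and single-linkage clustering of each preimage), intended only to fix ideas:

```python
# Toy Mapper-style construction: lens, cover, clustering, and the nerve's 1-skeleton.
import numpy as np

def mapper_sketch(points, n_intervals=10, overlap=0.3, eps=0.5):
    lens = points[:, 0]                          # toy lens; a learned projection in practice
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    clusters = []
    for i in range(n_intervals):
        a = lo + (i - overlap) * width
        b = lo + (i + 1 + overlap) * width
        remaining = {j for j in range(len(points)) if a <= lens[j] <= b}
        # group each preimage into eps-connected components (single linkage)
        while remaining:
            seed = remaining.pop()
            comp, frontier = {seed}, [seed]
            while frontier:
                p = frontier.pop()
                near = [q for q in remaining if np.linalg.norm(points[p] - points[q]) <= eps]
                for q in near:
                    remaining.discard(q)
                    comp.add(q)
                    frontier.append(q)
            clusters.append(comp)
    # nerve: connect clusters that share at least one point
    edges = {(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))
             if clusters[i] & clusters[j]}
    return clusters, edges
```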
In order to have full access to and understanding of the language model, and the ability to ablate the contribution of its parts, I construct the model from scratch using my own framework developed for this purpose. The code does not import any third party libraries, but instead uses Basic Linear Algebra Subprograms to perform the minimal required computation in a transparent way, which allows for full understanding of the representation induction process. This will be beneficial both to me, as I conduct the experiments, and to readers and collaborators, as they will be able to gain a fuller understanding of the computational processes involved, and thus follow the arguments without requiring additional assumptions.
SECTION 6
In this experiment, I study the representation manifold of the neural language model developed in the previous experiment more directly.
For this purpose, I employ the methodology of topological data analysis (see figure 2 for an overview diagram of the process).
In particular, I will compute the persistent homology of the point cloud of vectors corresponding to the words of the corpus, in a converged neural language model trained on this data.
There are various representation methods based on the idea of topological persistence.
I plan to look at a few canonical techniques, such as persistence images.
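As a rough sketch of this pipeline, the following assumes the third-party ripser package for Vietoris-Rips persistence and rasterizes the resulting diagram into a persistence image directly (grid size, kernel width, and the persistence weighting are assumed choices):

```python
# Word embeddings -> persistence diagram -> persistence image (a simplified rasterization).
import numpy as np
from ripser import ripser

def persistence_image(diagram, resolution=20, sigma=0.1):
    """Rasterize a diagram in (birth, persistence) coordinates with a Gaussian kernel."""
    pts = np.array([(b, d - b) for b, d in diagram if np.isfinite(d)])
    if len(pts) == 0:
        return np.zeros((resolution, resolution))
    xs = np.linspace(pts[:, 0].min(), pts[:, 0].max(), resolution)
    ys = np.linspace(0.0, pts[:, 1].max(), resolution)
    gx, gy = np.meshgrid(xs, ys)
    img = np.zeros_like(gx)
    for b, p in pts:
        img += p * np.exp(-((gx - b) ** 2 + (gy - p) ** 2) / (2 * sigma ** 2))  # persistence-weighted
    return img

embeddings = np.random.randn(100, 50)        # placeholder for the model's word vectors
dgms = ripser(embeddings, maxdim=1)['dgms']  # H0 and H1 persistence diagrams
img_h1 = persistence_image(dgms[1])
```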
SECTION 7
In this experiment, I will examine the structure of local word neighborhoods within the embedding manifold.
Given a word, I construct its local neighborhood by taking its n nearest neighbors in the embedding space, and then repeatedly expanding each of those neighbors in the same way for g generations; I refer to the resulting set of words as the n/g-expansion of the starting word.
Collapsing the vertices to form multiple edges visually highlights strongly related groups of words like the set {'federal', 'national', 'central', 'local', 'international'} in figure 5. I also highlight surrounding blocks of words forming coherent groups.
[Figure 5: a word neighborhood graph with collapsed vertices, highlighting strongly related groups of words]
I implemented a visualization tool in JavaScript, which allows me to explore a corpus in real time by generating word neighborhoods for each word in the corpus at varying levels of neighbors and generations.
The resulting graph is then embedded in the plane using an energy minimization algorithm.
The experiment builds on this construction as follows. Starting with a given word token, I generate its n/g-expansion in the representation space induced by a neural language model. I then construct a Vietoris-Rips filtration on the point cloud corresponding to this expansion. Subsequently, I extract topological persistence diagrams for varying choices of the hyperparameters n and g. I am particularly interested in the relative p-Wasserstein distances between diagrams corresponding to growing values of the hyperparameters. The hypothesis I want to test is that those distances will become smaller, and at some point converge to 0. When this happens, the topology of the growing neighborhoods stabilizes, meaning that further expanding the size of the neighborhood does not change the homotopy type of the manifold represented by the n/g-expansion. I take this stable point as the definition of the homotopy type of a word neighborhood.
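A minimal sketch of this stability check, assuming the third-party ripser and persim packages and a dictionary vectors mapping word tokens to their embedding vectors (all parameter values are placeholders):

```python
# n/g-expansion of a word, its H1 persistence diagram, and diagram distances for growing g.
import numpy as np
from ripser import ripser
from persim import wasserstein

def ng_expansion(word, vectors, n=5, g=2):
    """Expand a word's neighborhood: n nearest neighbors, repeated for g generations."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words])
    frontier, seen = {word}, {word}
    for _ in range(g):
        nxt = set()
        for w in frontier:
            dists = np.linalg.norm(matrix - vectors[w], axis=1)
            nxt.update(words[i] for i in np.argsort(dists)[1:n + 1])
        frontier = nxt - seen
        seen |= nxt
    return np.array([vectors[w] for w in seen])

def h1_diagram(points):
    return ripser(points, maxdim=1)['dgms'][1]

# Distances between diagrams of successive expansions; the hypothesis is that they shrink toward 0:
# d_g = wasserstein(h1_diagram(ng_expansion(w, vectors, n=5, g=g)),
#                   h1_diagram(ng_expansion(w, vectors, n=5, g=g + 1)))
```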
SECTION 8
In this experiment, I will apply topological methods to the modelling of time series associated with the hidden state vector of a gated recurrent language model. Such models have a vector representation of the cell state, which is updated sequentially as data from a time series is fed into the system. In deep learning NLP methods, this form of neural representation has been used to encode sentences and documents. Vector representations of each word in a sentence are fed into the recurrent cell, which evolves the hidden state. We can then use the final hidden state, or some function (e.g. the arithmetic mean) of all the states (after each word was processed), as a vector space representation of the entire sequence (e.g. a sentence from the corpus). I am particularly interested in analyzing the structure of such sequences (i.e. sentence representations in a recurrent language model) from a topological perspective. For this purpose, I reinterpret a time series in a geometric way, by associating a topological manifold to the hidden state trajectory. This procedure involves two steps: sliding window embedding, and the construction of a simplicial complex. The first step reinterprets the time series as a point cloud in a different space, where the axes correspond to samples within the window. The second step then constructs a topological space from the resulting point cloud.
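The first step can be sketched as follows; for simplicity the vector-valued trajectory is reduced to one scalar per timestep (its norm), which is an assumption of this illustration rather than part of the method:

```python
# Sliding window (time-delay) embedding of a hidden-state trajectory.
import numpy as np

def sliding_window_embedding(series, window=4, stride=1):
    """Turn a 1-D time series into a point cloud where each point is one window."""
    return np.array([series[i:i + window] for i in range(0, len(series) - window + 1, stride)])

hidden_states = np.random.randn(50, 128)             # placeholder (T, d) trajectory
signal = np.linalg.norm(hidden_states, axis=1)        # one scalar summary per timestep
cloud = sliding_window_embedding(signal, window=4)    # points in R^4, ready for a Rips filtration
```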
SECTION 9
In this experiment, I will use the tools from previous experiments to analyze how various topological markers evolve during the induction process of language representations. Concretely, I will train neural language models on corpus data while freezing embedding layers at various timesteps and computing topological descriptors. I will then analyze how the topology of linguistic unit representations evolves as the model is trained. The purpose of this experiment is to test if there are topological patterns of convergence, and if they can be used to identify overfitting and derive early stopping heuristics for neural language models.
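One possible marker, sketched here under the assumption that ripser is available and that checkpoints holds (step, embedding matrix) pairs, is the total persistence of the H1 diagram tracked across training:

```python
# Tracking a simple topological descriptor across training checkpoints.
import numpy as np
from ripser import ripser

def total_persistence(points, dim=1):
    """Sum of (death - birth) over finite points of the H_dim diagram."""
    dgm = ripser(points, maxdim=dim)['dgms'][dim]
    finite = dgm[np.isfinite(dgm[:, 1])]
    return float(np.sum(finite[:, 1] - finite[:, 0]))

# trajectory = [(step, total_persistence(emb)) for step, emb in checkpoints]
# A plateau in this trajectory is one candidate signal for convergence or early stopping.
```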
SECTION 10
In this experiment, I investigate if the knowledge about the structure of natural language data learned in previous experiments can be used to augment a neural architecture in a way that takes advantage of topological information. In particular, I am interested in testing the hypothesis that the homotopy type of the word and embedding manifolds can carry useful linguistic information, and can be used as a prior in language model training. For this purpose, I implement a form of regularization that encourages the representation manifold of the language model to approximate a pre-determined homotopy type. I then perform experiments where I condition a new untrained model on the topological information contained in the corpus, and measure differences in convergence times, as well as in the power of the derived representations. I call this procedure topological conditioning, and the hypothesis tested in this experiment is whether it can lead to more informative representations or improved convergence for neural language models.
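A simplified, non-differentiable sketch of such a penalty (assuming the ripser and persim packages; a practical version would require a differentiable persistence computation):

```python
# A penalty measuring how far the current embedding diagram is from a target diagram.
from ripser import ripser
from persim import wasserstein

def topological_penalty(embeddings, target_diagram, dim=1):
    current = ripser(embeddings, maxdim=dim)['dgms'][dim]
    return wasserstein(current, target_diagram)

# total_loss = masked_lm_loss + lam * topological_penalty(word_embeddings, corpus_diagram)
```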
SECTION 11
In this experiment, I incorporate topological information at the supra-lexical level.
In particular, I introduce topological priors into document representation, starting with a definition of a 1-skeleton on larger substrings within the text.
This can be done in practice via an adjacency matrix that determines relationships between parts of the document (e.g. sentences or paragraphs).
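One hedged illustration of such an adjacency matrix, built here from a toy bag-of-words cosine similarity with an assumed threshold (the actual relationship measure is left open in the text):

```python
# Adjacency matrix over sentences: the 1-skeleton of a document-level complex.
import numpy as np

def sentence_adjacency(sentences, threshold=0.2):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            counts[i, index[w]] += 1
    norms = np.maximum(np.linalg.norm(counts, axis=1, keepdims=True), 1e-12)
    sims = (counts / norms) @ (counts / norms).T
    adjacency = (sims >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)   # edges of the 1-skeleton, no self-loops
    return adjacency
```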
The majority of the prior experiments in this project focus on intrinsic evaluation of language representations.
The objective of this experiment is to evaluate topologically enriched document representations extrinsically.
Extrinsic evaluation does not test directly for linguistic information in the embedding manifold, but rather examines the induced representations by using them as input to downstream NLP tasks, under the assumption that performance on real-world tasks can serve as a proxy for the amount of linguistic information contained within.
For this purpose, I introduce a test model for multi-document question answering.
I test the contribution of topological information by implementing a topologically informed QA system and evaluating it on the HotpotQA dataset.
The hypothesis I want to test in this experiment is that topologically informed representations can produce more sustainable NLP systems by significantly reducing model complexity, and hence lowering the cost of model training (similarly to how understanding the topology of planetary motion allows us to reparametrize geocentrism into heliocentrism, leading to significant model compression, as pictured in figure 10).
This article is an idea paper on a work in progress.