This article introduces a procedure that allows us to derive topological objects from each dimension of the embedding manifold separately. The technique involves two steps. First, we interpret each component of a hidden state vector in a language model as a time series over the word tokens of a sentence consumed by the model. We then view this time series as a topological manifold and compute its algebraic invariants.

SECTION 1

INTRODUCTION

In the previous approaches, we looked at each sentence as a point cloud within the representation space of our neural language model. Although the order of words within each sentence is implicitly captured by the structure of the point cloud (because of the way word vectors are induced by the LM), we did not explicitly take it into consideration when inducing topological features. In this approach we take the ordering of the embeddings directly into account by performing a re-representation step designed to model time series data. This allows us to study homological properties of each dimension within the representation manifold of our language model. When words from a corpus are fed into the neural network implementation of the language model, its hidden state vector traces out a path in the embedding space. We can interpret topological properties of these paths, and their relationship to corpus data, by analyzing each dimension of the hidden state vector as a time series.
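To make this view concrete, here is a minimal sketch assuming we already have the matrix of hidden-state vectors for one sentence, one row per consumed token. The randomly generated activations are only a stand-in for real language model output, and all names and sizes are illustrative.

```python
import numpy as np

# Stand-in for real LM activations: a sentence of w tokens, each producing
# an m-dimensional hidden-state vector, stacked into a (w x m) matrix.
w, m = 12, 4                              # toy sizes
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(w, m))   # replace with actual LM output

# Each column is one coordinate of the hidden state traced over the sentence,
# i.e. the time series f_i(t) analyzed in the next section.
f_2 = hidden_states[:, 2]                 # component i = 2 as a time series
print(f_2.shape)                          # (12,) -- one value per token
```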

SECTION 2

TOPOLOGICAL MAPPER

Every sentence of the corpus generates multiple sequences of floating point numbers, one in each dimension of the representation manifold. We can transform those sequences into topological objects and study a notion of shape for each factor of the word embedding. In order to do this, we slide a window over the time series of the hidden states associated to the LM and compute topological invariants of the resulting point clouds (see Figure 1 for an illustration of the idea).

FIGURE 1
We can reinterpret a time series of values as a geometric object by performing a sliding window embedding. The resulting point cloud can then be interpreted as a set of noisy samples from an underlying manifold. The topology of this manifold can be studied using tools from computational algebraic topology, and it reveals intrinsic properties of the original time series that are not easily captured by standard methods.

The first step is the construction of the sliding window embedding. This step depends on two parameters: $\tau$ for the delay and $d$ for the dimension. Let $f_i(t)$ be the value of the $i$-th component of the hidden state vector of our language model after it has consumed $t$ words of the sentence being analyzed. We collect the values $f_i(t), f_i(t+\tau), \ldots, f_i(t+(d-1)\tau)$, which yields a vector of $d$ values.

$$ SW_{d,\tau}f_i(t) = \begin{bmatrix} f_i(t) \\ f_i(t+\tau ) \\ \vdots \\ f_i(t+(d-1)\tau) \end{bmatrix} \in \mathbb{R}^{d} $$
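In code, this construction is just a strided slice over the series. The sketch below is a straightforward NumPy implementation of $SW_{d,\tau}$; the function name and the toy series are illustrative, not part of any particular library.

```python
import numpy as np

def sliding_window_embedding(f, d, tau):
    """Sliding-window embedding SW_{d,tau} of a 1-D time series.

    Each output row is (f(t), f(t + tau), ..., f(t + (d - 1) * tau)),
    so the result has shape (len(f) - (d - 1) * tau, d).
    """
    f = np.asarray(f, dtype=float)
    n_points = len(f) - (d - 1) * tau
    if n_points <= 0:
        raise ValueError("time series too short for this choice of d and tau")
    return np.stack([f[t : t + (d - 1) * tau + 1 : tau] for t in range(n_points)])

# Example: a window of dimension d = 3 with delay tau = 1
series = [0.1, 0.4, -0.2, 0.3, 0.0, -0.5]
cloud = sliding_window_embedding(series, d=3, tau=1)
print(cloud.shape)   # (4, 3): four points in R^3
```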

The dimension $d$ in our case corresponds to the chosen n-gram size. For instance, for a 3-gram analysis we slide a window of 3 values over the representations of each sentence. We do this in each dimension separately, and then analyze the collection of such vectors obtained from each sentence of the corpus. Thus a sentence with $w$ words, embedded by a neural language model with an $m$-dimensional hidden state and analyzed using an n-gram size of $n$, produces $m$ point clouds of $n$-dimensional points (note that the number and dimensionality of these point clouds do not depend on the number of words $w$; the sliding window summarizes sentences of any length). We then analyze each of these point clouds as a sample from an underlying topological manifold using the techniques from the previous articles; that is, we compute Vietoris–Rips filtrations and their persistent homology. This produces a topological summary of each dimension of our representation space.
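Putting the pieces together, the following sketch runs the pipeline for one sentence, reusing `sliding_window_embedding` and the stand-in `hidden_states` from the earlier sketches. It assumes the `ripser` package as the persistent-homology backend for the Vietoris–Rips computation; any other TDA library would do, and the function name is illustrative.

```python
from ripser import ripser   # one possible Vietoris-Rips / persistent homology backend

def per_dimension_diagrams(hidden_states, d, tau=1, maxdim=1):
    """Persistence diagrams for each hidden-state dimension of one sentence.

    hidden_states: (w x m) array of LM activations, one row per token.
    Returns a list of m diagram collections, one per hidden dimension.
    """
    diagrams = []
    for i in range(hidden_states.shape[1]):
        # Sliding-window point cloud for dimension i of the hidden state
        cloud = sliding_window_embedding(hidden_states[:, i], d, tau)
        # Vietoris-Rips filtration and its homology up to degree maxdim
        diagrams.append(ripser(cloud, maxdim=maxdim)["dgms"])
    return diagrams

# Toy usage: a 3-gram-sized window over the stand-in activations above
dgms = per_dimension_diagrams(hidden_states, d=3, tau=1)
print(len(dgms))          # m summaries, one per dimension of the representation
print(dgms[0][0].shape)   # H_0 diagram of the first dimension: (birth, death) pairs
```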

Remarks

This article introduces mathematical techniques in more detail than is possible in my journal publications and conference talks. It is useful as an introduction to some of the mathematical constructions I use, especially for a computational linguistics audience unfamiliar with these ideas.