I will look at the representation manifolds induced by neural language models from raw corpus data in several complementary ways. The technique explained in this article forms the basis for one of these perspectives. It can be thought of as a topological dimensionality reduction method, where the goal is to summarize the shape of our representation space with a rough sketch in the form of a low-dimensional topological manifold.

SECTION 1

INTRODUCTION

The goal of this project is to summarize the shape of our representation space with a rough sketch in the form of a low-dimensional topological manifold. This reduced representation can be thought of as a map approximating the shape of our embedding space. Such a description can be visually inspected by a human, while remaining more topologically informative than a naive projection. Instead of growing \epsilon-balls around the data points directly, we can map the points to a different space first, define an open cover of their image, and then cluster the original points within the preimage of each cover set. This summarizes the topological features of the embedding as a simplicial complex of a chosen dimension. In later experiments, we generate 1-dimensional simplicial complexes (i.e. graphs) from every sentence in a corpus of raw text, as interpreted by a neural language model. These graphs allow for a visual exploration of the shapes of these high-dimensional embedding manifolds, in order to identify linguistic phenomena correlated with increased topological complexity in the representation space.

SECTION 2

TOPOLOGICAL MAPPER

Figure 1 shows a visualization of this process for a point cloud sampled from the circle (\mathbb{S}^1). The general procedure can be summarized as follows.

Given data points \mathbb{X} = \{x_1, \ldots, x_n\}, x_i \in \mathbb{R}^d, a function f: \mathbb{R}^d \rightarrow \mathbb{R}^m, m < d, and a cover \mathcal{U} = \{U_i\}_{i \in \mathcal{I}} of the image f(\mathbb{X}) (where \mathcal{I} is some index set), we construct a simplicial complex as follows:

For each U_i \in \mathcal{U}, cluster the preimage f^{-1}(U_i) \cap \mathbb{X} into k_{U_i} clusters C_{U_i,1}, \ldots, C_{U_i,k_{U_i}}.
The clusters \bigsqcup_{U_i \in \mathcal{U}} \{C_{U_i,1}, \ldots, C_{U_i,k_{U_i}}\} now define a cover of \mathbb{X}; calculate the nerve of this cover.
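The two steps above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of my own, not the implementation used in the experiments: the filter f is taken to be projection onto the first coordinate (so m = 1), the cover consists of equal-length overlapping intervals, and the clustering is plain single linkage with a distance threshold eps. The names mapper_graph, interval_cover and single_linkage, and all parameter values, are hypothetical. Only the 1-skeleton of the nerve (a graph) is computed.

```python
import math
from itertools import combinations

def single_linkage(points, idx, eps):
    """Cluster the points indexed by idx into connected components of the
    graph that links two points whenever their distance is at most eps."""
    unvisited = set(idx)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, stack = {seed}, [seed]
        while stack:
            p = stack.pop()
            near = {q for q in unvisited
                    if math.dist(points[p], points[q]) <= eps}
            unvisited -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters

def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with n_intervals equal-length intervals; neighbours
    overlap by the fraction `overlap` of an interval's length."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length)
            for i in range(n_intervals)]

def mapper_graph(points, f, n_intervals=4, overlap=0.3, eps=0.3):
    """Return (nodes, edges): nodes are clusters (sets of point indices),
    edges join clusters that share at least one point."""
    values = [f(p) for p in points]
    tol = 1e-9  # guard against floating-point error at interval ends
    nodes = []
    for a, b in interval_cover(min(values), max(values), n_intervals, overlap):
        preimage = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(single_linkage(points, preimage, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Deterministic "noisy" samples near the unit circle in R^2
# (the sine term stands in for noise so the example is reproducible).
n = 200
points = []
for k in range(n):
    theta = 2 * math.pi * k / n
    r = 1 + 0.05 * math.sin(7 * k)
    points.append((r * math.cos(theta), r * math.sin(theta)))

nodes, edges = mapper_graph(points, f=lambda p: p[0])
print(len(nodes), "nodes,", len(edges), "edges")
```

With these settings the two middle intervals each have a preimage that splits into an upper and a lower arc, so the resulting graph should close up into a single loop, which is the topological feature of the circle we hoped to recover. Other choices of filter, cover or clustering method can give quite different sketches of the same data.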

The nerve is defined as follows. Given a cover \mathcal{U} = \{U_i\}_{i \in \mathcal{I}}, the nerve of \mathcal{U} is the simplicial complex \mathcal{C}(\mathcal{U}) whose 0-skeleton is formed by the sets in the cover (each U_i is a vertex) and in which \sigma = [U_{j_0}, \ldots, U_{j_k}] is a k-simplex \iff \bigcap\limits_{l=0}^{k} U_{j_l} \neq \emptyset.
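To make the definition concrete, here is a small sketch that computes the nerve of a finite cover by brute force. It assumes each cover set is given as a finite Python set; the function name nerve and the max_dim cutoff are hypothetical conveniences of mine, and the enumeration over all index subsets is only practical for small covers.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Return the simplices of the nerve up to dimension max_dim.
    Each simplex is a tuple of indices into `cover`; it is included
    exactly when the corresponding sets have a common element."""
    simplices = []
    for k in range(max_dim + 1):
        for idx in combinations(range(len(cover)), k + 1):
            if set.intersection(*(set(cover[i]) for i in idx)):
                simplices.append(idx)
    return simplices

# Three sets that intersect pairwise but have no common point:
# the nerve is a hollow triangle (three vertices, three edges).
hollow = nerve([{1, 2}, {2, 3}, {3, 1}])

# Adding a point shared by all three sets fills in the 2-simplex.
filled = nerve([{1, 2, 4}, {2, 3, 4}, {3, 1, 4}])

print(hollow)
print(filled)
```

The first cover has pairwise but no triple intersections, so its nerve has edges but no 2-simplex; the second has a common point, so the simplex on all three vertices appears as well.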

FIGURE 1

Inducing topological structure from a point cloud representing noisy samples from a neighborhood of a 1-dimensional submanifold of \mathbb{R}^2

Remarks

This article introduces mathematical techniques in more detail than is possible in my journal publications and conference talks. It is intended as an introduction to some of the mathematical constructions I use, especially for a Computational Linguistics audience unfamiliar with these ideas.