I will look at the representation manifolds induced by neural language models from raw corpus data in several complementary ways. The technique explained in this article forms the basis for one of these perspectives. It can be thought of as a topological dimensionality reduction method, where the goal is to summarize the shape of our representation space with a rough sketch in the form of a low-dimensional topological manifold.



This reduced representation can be thought of as a map approximating the shape of our embedding space. Such a description can be visually inspected by a human, while remaining more topologically informative than a naive projection. Instead of growing \epsilon-balls around points directly, we can map the points to a different space first, define an open cover there, and then cluster the original points within the preimage of each cover set. This summarizes the topological features of the embedding as a simplicial complex of a chosen dimension. In later experiments, we generate 1-dimensional simplicial complexes (i.e. graphs) from every sentence in a corpus of raw text, as interpreted by a neural language model. These graphs allow for a visual exploration of the shapes of these high-dimensional embedding manifolds, in order to identify linguistic phenomena correlated with increased topological complexity in the representation space.



Figure 1 shows a visualization of this process for a point cloud sampled from the circle (\mathbb{S}^1). The general procedure can be summarized as follows.

Given data points \mathbb{X} = \{x_1, \ldots, x_n\}, x_i \in \mathbb{R}^d, a function f: \mathbb{R}^d \rightarrow \mathbb{R}^m, m < d, and a cover \mathcal{U} = \{U_i\}_{i \in \mathcal{I}} of the image f(\mathbb{X}) (where \mathcal{I} is some index set), we construct a simplicial complex as follows:

  1. For each U_i \in \mathcal{U}, cluster f^{-1}(U_i) into k_{U_i} clusters C_{U_i,1}, \ldots, C_{U_i,k_{U_i}}.
  2. The clusters \underset{U_i \in \mathcal{U}}{\bigsqcup} \{C_{U_i,1}, \ldots, C_{U_i,k_{U_i}}\} now define a cover of \mathbb{X}; calculate the nerve of this cover.

The nerve is defined as follows. Given a cover \mathcal{U} = \{U_i\}_{i \in \mathcal{I}}, the nerve of \mathcal{U} is the simplicial complex \mathcal{C}(\mathcal{U}) whose 0-skeleton is formed by the sets in the cover (each U_i is a vertex) and in which \sigma = [U_{j_0}, \ldots, U_{j_k}] is a k-simplex \iff \bigcap\limits_{l=0}^{k} U_{j_l} \neq \emptyset.
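
To make the two steps and the nerve construction concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption chosen for illustration rather than the setup used in my experiments: the function names (`interval_cover`, `single_linkage`, `mapper_graph`), the filter f (projection onto the first coordinate), the cover (four overlapping intervals), and the clustering method (single linkage with a fixed distance threshold `eps`). Only the 1-skeleton of the nerve is computed, which matches the graph case described above.

```python
import math
from itertools import combinations


def interval_cover(values, n_intervals, overlap):
    """Cover [min, max] of the filter values with overlapping intervals (assumed scheme)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_intervals
    pad = overlap * step  # each interval is widened by `pad` on both sides
    return [(lo + i * step - pad, lo + (i + 1) * step + pad)
            for i in range(n_intervals)]


def single_linkage(points, indices, eps):
    """Cluster points[indices] as connected components of the eps-neighborhood graph."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(indices, 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    clusters = {}
    for i in indices:
        clusters.setdefault(find(i), set()).add(i)
    return [frozenset(c) for c in clusters.values()]


def mapper_graph(points, filter_fn, n_intervals=4, overlap=0.25, eps=0.2):
    """Return (vertices, edges): the 1-skeleton of the nerve of the clustered cover."""
    values = [filter_fn(p) for p in points]
    vertices = []
    # Step 1: cluster the preimage of every cover set.
    for a, b in interval_cover(values, n_intervals, overlap):
        preimage = [i for i, v in enumerate(values) if a <= v <= b]
        vertices.extend(single_linkage(points, preimage, eps))
    # Step 2: two clusters span an edge iff they share at least one data point.
    edges = [(u, v) for u, v in combinations(range(len(vertices)), 2)
             if vertices[u] & vertices[v]]
    return vertices, edges


# 60 evenly spaced points on the unit circle, filtered by their x-coordinate.
circle = [(math.cos(2 * math.pi * k / 60), math.sin(2 * math.pi * k / 60))
          for k in range(60)]
vertices, edges = mapper_graph(circle, filter_fn=lambda p: p[0])
print(len(vertices), len(edges))  # prints "6 6"
```

With these parameters the two end intervals each contribute one cluster and the two middle intervals contribute two clusters each (the upper and lower arcs), so the resulting graph is a single 6-cycle: a rough sketch that preserves the one loop of the circle. Higher-dimensional simplices of the nerve would be obtained in the same way by testing intersections of three or more clusters.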

Figure 1: Inducing topological structure from a point cloud of noisy samples drawn from a neighborhood of a 1-dimensional submanifold of \mathbb{R}^2.


This article introduces the mathematical techniques in more detail than is possible in my journal publications and conference talks. It is meant as an introduction to some of the mathematical constructions I use, especially for a computational linguistics audience unfamiliar with these ideas.