In this article, we explore some simple ways to impose a metric space structure on partitions of the lexicon. We develop code in Python to compare partitions of a finite set. This will be used later as a utility function in some of the experiments outlined in the previous article, Empirical Road Map.

SECTION 1

INTRODUCTION

Corpus data is discrete in nature. Although the underlying linguistic structure can be viewed through a distributed vector space representation, and our ultimate goal is to impose a notion of continuity on raw corpora of natural language text, we still need a way to reason about the types of discrete topologies that arise in the construction of the word manifold. In this article, we will concern ourselves with a very particular notion that arises in this setting - that of a partition of a set. In future articles we develop methods to deal with intersections of subsets, and derive metrics for comparing discrete topologies.

In what follows we will develop some tools to map out the space of partitions of a discrete set of points. I published the resulting code implementation in this git repository for future reference. In particular, given two different partitions of a set (which in our future usage will refer to a lexicon of a language extracted from the corpus), we wish to measure the similarity of these partitions. We will use this code to measure certain features of the word manifold. However, there are many other settings in which we might like to perform such a comparison while working with natural language data. For instance, we might want to compare results obtained from applying unsupervised syntactic category induction algorithms to a given corpus of text. The resulting groups of words form a partition of the lexicon, and the code developed here can be used to obtain a quantitative comparison of the results. Furthermore, since our code produces a similarity metric, we could use it to perform clustering.

Definition: Let \mathbb{S} be a set. A family of sets \mathcal{P} \subset 2^{\mathbb{S}} is a partition of \mathbb{S} \iff \mathbb{S} = \underset{\mathbb{A} \in \mathcal{P}}{\bigsqcup} \mathbb{A} with all \mathbb{A} \neq \emptyset .

In other words, a partition of a set is a collection of its disjoint nonempty subsets whose union equals the whole set.

For instance, let \mathbb{S} = \{1, 2, 3, 4, 5, 6, 7, 8, 9\} \subset \mathbb{Z} . One possible partition is the collection \mathcal{P} = \left\{ \{3\}, \{7, 9\}, \{1, 2, 6\}, \{4, 5, 8\} \right\} . Another partition is \mathcal{P}' = \left\{ \{3, 6\}, \{8, 9\}, \{1, 4, 7\}, \{2, 5\} \right\} . Figure 1 shows a visualization of these two partitions with every element of the partition represented by a different color.

FIGURE 1
Two partitions \mathcal{P} = \left\{ \{3\}, \{7, 9\}, \{1, 2, 6\}, \{4, 5, 8\} \right\} and \mathcal{P}' = \left\{ \{3, 6\}, \{8, 9\}, \{1, 4, 7\}, \{2, 5\} \right\} of the set \mathbb{S} = \{1, 2, 3, 4, 5, 6, 7, 8, 9\} \subset \mathbb{Z} .
FIGURE 2
Partitions of a five element set.

Figure 2 shows a visualization of the space of all possible partitions of the 5-element set \mathbb{S} = \{1, 2, 3, 4, 5\} .

Given a partition \mathcal{P} of a set \mathbb{S} , we call the elements of \mathcal{P} cells. For any element x \in \mathbb{S} we write [x] to designate the element of \mathcal{P} containing x . Since elements of the partition are disjoint subsets of \mathbb{S} , the [\bullet] operation is a well defined mapping \mathbb{S} \rightarrow \mathcal{P} . For a particular element of the partition A \in \mathcal{P} , any element x \in A is called a representative of the cell A = [x] . The choice of a representative defines an injection from the partition to the set. The mapping from elements of the set to their corresponding cells in the partition is referred to as the canonical surjection.

This surjective map is often called the canonical projection when we view the partition as the quotient of the set by a given equivalence relation. Note that the space of partitions of a set \mathbb{S} is in natural bijection with the space of equivalence relations on \mathbb{S} . In particular, given a symmetric, reflexive, and transitive relation \thicksim on \mathbb{S} , we can define the associated partition \mathcal{P} = \mathbb{S} / \sim by taking the quotient of our set over the given relation. Conversely, given a partition \mathcal{P} we can define the associated equivalence relation on \mathbb{S} by x \sim_{\mathcal{P}} y \iff x \in [y] . For this reason, we often use the terms partition and equivalence relation in similar contexts.

The cardinality of the set of all partitions of a finite set with n elements is called the Bell number and denoted B_n . It satisfies the recursive formula B_{n+1} = \sum\limits_{k=0}^{n} \binom{n}{k} B_k . These numbers can be calculated using the Bell triangle in the following way. Start with the number 1 in the first row. Then repeat recursively: start a new row with the rightmost element of the previous row, and construct consecutive elements of the current row by adding the most recently written number to the number directly above it, until there is no number above the most recent entry. An animated illustration of the procedure is shown in figure 3. The sequence of Bell numbers can then be read off the diagonal (last entries in each row) of the triangle. The recursive formula follows by conditioning on the cell that contains a newly added element: if k of the original n elements lie outside that cell, there are \binom{n}{k} ways to choose them and B_k ways to partition them. A related construction generates all partitions of an (n+1)-element set from the partitions of an n-element set by inserting the new element into each existing cell or appending it as a separate singleton cell. This produces a tree whose leaves are the possible partitions (see figure 4 for an example).

FIGURE 3
Computation of the sizes of partitions using the Bell triangle. There is one partition of a singleton, two partitions of a set with two elements, five partitions of a set with three elements, \ldots . The consecutive Bell numbers are the last entries in each row.
FIGURE 4
Every level of this tree represents partitions of a finite set. Partitions of the set \mathbb{S} = \{1, 2, 3\} are the leaves of this tree.
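To make the triangle construction from figure 3 concrete, here is a minimal Python sketch (not part of the comparison tool developed below; the function name bell_numbers is just illustrative) that generates Bell numbers exactly as described above:

def bell_numbers(count):
    """Generate the first `count` Bell numbers using the Bell triangle."""
    # Start the triangle with a single row containing 1.
    row = [1]
    bells = []
    for _ in range(count):
        # The last entry of the current row is the next Bell number.
        bells.append(row[-1])
        # A new row starts with the last entry of the previous row; each
        # subsequent entry is the previous entry plus the one above it.
        new_row = [row[-1]]
        for above in row:
            new_row.append(new_row[-1] + above)
        row = new_row
    return bells

print(bell_numbers(6))  # [1, 2, 5, 15, 52, 203]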
SECTION 2

COMPARING PARTITIONS

Counting all possible partitions is a good way to start thinking about the structure of the partition space. However, a more interesting and natural question immediately follows.

Given a particular partition \mathcal{P} of some set \mathbb{S} and a collection of different partitions of this set, or possibly different sets sharing a nonempty intersection with \mathbb{S} , can we rank-order these partitions using some coherent notions of similarity to \mathcal{P} ?

In what follows, we will refer to the particular chosen partition \mathcal{P} of the given set \mathbb{S} as the gold standard. The question we want to explore is how to compare alternative ways of partitioning our set to the gold standard partition under consideration.

We will develop several independent measures of similarity between partitions. They will be based on the idea of a confusion matrix, but inspired by differing viewpoints. The main ideas behind these measures will be: counting pairs of elements, summation over set intersections, and information theoretic criteria such as mutual information. The subsequent section combines these approaches into a single metric that we implement in Python.

Before moving forward, it is good to examine edge cases (a practice often followed in the mathematical sciences). As before, let \mathbb{S} be a set, and \mathcal{P} a chosen partition. There are two obvious edge cases, which we will call trivial or degenerate partitions. Namely, the partition consisting of a single cell (the entire set, \mathcal{P} = \left\{ \mathbb{S} \right\} ) and the partition whose cardinality is equal to that of the set itself (composed of all singletons - i.e. every element of the set is in a separate cell, \mathcal{P} = \left\{ \{x\} \mid x \in \mathbb{S} \right\} ).

Now suppose we are given two partitions \mathcal{P} = \left\{ C_1, \ldots, C_k \right\} and \mathcal{P}' = \left\{ C_1', \ldots, C_l' \right\} of \mathbb{S} with k and l cells respectively. The confusion matrix or contingency table M = \left( m_{i,j} \right) associated to the pair \mathcal{P}, \mathcal{P}' is the k \times l matrix whose i,j -th entry is m_{i,j} = \left| C_i \cap C_j' \right| . Furthermore, we will say that \mathcal{P}' is a refinement of \mathcal{P} , or equivalently that \mathcal{P} is a coarsening of \mathcal{P}' , if each cell of \mathcal{P}' is contained in some cell of \mathcal{P} . That is: \forall C_j' \in \mathcal{P}' \ \exists C_i \in \mathcal{P}: C_j' \subseteq C_i . We will also define the product of the two partitions \mathcal{P} and \mathcal{P}' , with cells as above, to be their coarsest common refinement: \mathcal{P} \times \mathcal{P}' = \left\{ C_i \cap C_j' \mid C_i \in \mathcal{P}, C_j' \in \mathcal{P}', C_i \cap C_j' \neq \emptyset \right\} . Note that this product is again a partition of \mathbb{S} , and if \mathcal{P}' is a refinement of \mathcal{P} , then \mathcal{P} \times \mathcal{P}' = \mathcal{P}' .
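As a small illustration of these definitions, here is a sketch (illustrative helper names; partitions represented as lists of Python sets) that builds a contingency table, tests refinement, and forms the product of two partitions. The partition Q below is just a hypothetical coarsening of the example partition \mathcal{P} from figure 1.

from itertools import product

def contingency(P, Q):
    """Contingency table m[i][j] = |C_i intersected with C_j'| for partitions given as lists of sets."""
    return [[len(Ci & Cj) for Cj in Q] for Ci in P]

def is_refinement(Q, P):
    """True if every cell of Q is contained in some cell of P."""
    return all(any(Cj <= Ci for Ci in P) for Cj in Q)

def partition_product(P, Q):
    """Coarsest common refinement: all nonempty pairwise intersections of cells."""
    return [Ci & Cj for Ci, Cj in product(P, Q) if Ci & Cj]

P = [{3}, {7, 9}, {1, 2, 6}, {4, 5, 8}]
Q = [{1, 2, 3, 6}, {4, 5, 7, 8, 9}]
print(contingency(P, Q))        # [[1, 0], [0, 2], [3, 0], [0, 3]]
print(is_refinement(P, Q))      # True: every cell of P sits inside a cell of Q
print(partition_product(P, Q))  # equals P, since P refines Q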

Chi Squared

Given those definitions, a basic statistically inspired similarity measure between two partitions is the chi squared coefficient. There are many variations on this measure, which was originally introduced by Pearson in 1900. The coefficient, in its basic form, can be computed (using our prior notation for partitions and cells) as:

\chi^2 (\mathcal{P}, \mathcal{P}') = \sum\limits_{i=1}^{k} \sum\limits_{j=1}^{l} \frac{\left( m_{i,j} - \frac{|C_i||C_j'|}{|\mathbb{S}|} \right)^2}{\frac{|C_i||C_j'|}{|\mathbb{S}|}}
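A direct transcription of this formula (a minimal sketch only, assuming the partitions are given as lists of Python sets over the same underlying set):

def chi_squared(P, Q):
    """Pearson chi squared statistic between two partitions (lists of sets) of the same set."""
    n = sum(len(Ci) for Ci in P)
    total = 0.0
    for Ci in P:
        for Cj in Q:
            observed = len(Ci & Cj)
            expected = len(Ci) * len(Cj) / n  # never zero, since cells are nonempty
            total += (observed - expected) ** 2 / expected
    return total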

Four Subsets of \mathbb{S} \times \mathbb{S}

In what follows, we will consider four special subsets of the Cartesian product \mathbb{S} \times \mathbb{S} of our set with itself, which arise naturally when comparing its two distinct partitions \mathcal{P} and \mathcal{P}' . Considering unordered pairs of distinct elements of \mathbb{S} , let S_{1,1} be the set of pairs whose two elements share a cell in both \mathcal{P} and \mathcal{P}' ; S_{1,0} the set of pairs sharing a cell in \mathcal{P} but not in \mathcal{P}' ; S_{0,1} the set of pairs sharing a cell in \mathcal{P}' but not in \mathcal{P} ; and S_{0,0} the set of pairs whose elements lie in different cells in both partitions.

Define n_{i,j} to be the size of S_{i,j} , and let n = |\mathbb{S}| , then

n_{0,0} + n_{0,1} + n_{1,0} + n_{1,1} = \binom{n}{2}
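A brute-force way to obtain these four counts is to iterate over all unordered pairs of distinct elements. The sketch below (illustrative names; partitions as lists of sets) does exactly that, and the four returned counts always sum to \binom{n}{2} :

from itertools import combinations

def pair_counts(P, Q):
    """Count pairs of distinct elements by co-membership in partitions P and Q (lists of sets)."""
    # Map each element to the index of its cell in each partition.
    cell_of_P = {x: i for i, Ci in enumerate(P) for x in Ci}
    cell_of_Q = {x: j for j, Cj in enumerate(Q) for x in Cj}
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for x, y in combinations(cell_of_P, 2):
        same_P = int(cell_of_P[x] == cell_of_P[y])
        same_Q = int(cell_of_Q[x] == cell_of_Q[y])
        counts[(same_P, same_Q)] += 1
    return counts  # keys correspond to n_{1,1}, n_{1,0}, n_{0,1}, n_{0,0}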

Rand Index

Another simple measure we could use is inspired by classification problems. The Rand Index ( \mathcal{R} ) extends the simple idea of counting correctly classified examples into a measure on the space of partitions. It arises naturally if we think of cells of the partitions \mathcal{P} and \mathcal{P}' as classes into which we assign elements of the set \mathbb{S} . The index ranges from 0 (no pair of elements is classified identically in both schemes) to 1 (the classification is exactly the same).

\mathcal{R} \left( \mathcal{P}, \mathcal{P}' \right) = \frac{2 \left( n_{0,0} + n_{1,1} \right)}{n \left( n-1 \right)}

It is a good starting measure to compare partitions, inspired by classification accuracy. However, it has numerous issues that make the raw index unsuitable for comparing discrete topologies of our word manifolds. The Rand Index converges to 1 as the number of cells increases, and it is unstable with respect to the number of cells even for moderate size partitions. In fact the expected value of \mathcal{R} computed from two randomly sampled partitions does not converge to a constant.

For this reason, we will use an adjusted version of the Rand Index to mitigate these issues. We assume that the two partitions \mathcal{P} and \mathcal{P}' are randomly sampled from a generalized hypergeometric distribution. This gives a null hypothesis that the two partitions are generated randomly with a fixed number of cells and a fixed number of elements per cell. These constants are possibly different between the two partitions. We will call this adjusted index \widetilde{\mathcal{R}} , and define it to be the normalized difference between our original Rand Index and its expected value under the null hypothesis.

\widetilde{\mathcal{R}} \left( \mathcal{P}, \mathcal{P}' \right) = \frac{ \sum\limits_{i=1}^{k}\sum\limits_{j=1}^{l} \binom{m_{i,j}}{2} - t_3} {\frac{1}{2} \left( t_1 + t_2 \right) - t_3 }

where t_1 = \sum\limits_{i=1}^{k}\binom{|C_i|}{2}, t_2 = \sum\limits_{j=1}^{l} \binom{|C_j'|}{2}, t_3 = \frac{2 t_1 t_2}{n(n-1)}

With this modification, our index has expected value of zero in case of completely independent partitions, and maximum value of one for identical ones. There are still some issues with this measure, such as possibility of negative values for some choices of partition pairs, as well as the statistical assumptions on the distribution of clusterings. We will smooth them out in our final metric by combining other approaches.
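For reference, here is a direct implementation of the adjusted index \widetilde{\mathcal{R}} from a contingency table (a sketch with illustrative names; note that the tool in the next section uses a different, metric-modified variant under a similar name):

from math import comb

def adjusted_rand_index(M):
    """Adjusted Rand index from a contingency table M (list of lists of intersection counts)."""
    n = sum(sum(row) for row in M)
    sum_ij = sum(comb(m, 2) for row in M for m in row)
    t1 = sum(comb(sum(row), 2) for row in M)                                # pairs within cells of P
    t2 = sum(comb(sum(row[j] for row in M), 2) for j in range(len(M[0])))   # pairs within cells of P'
    t3 = t1 * t2 / comb(n, 2)                                               # expected value under the null
    # Undefined in degenerate cases where the denominator vanishes.
    return (sum_ij - t3) / (0.5 * (t1 + t2) - t3)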

Fowlkes-Mallows Index

Another index that, after some adjustment, can be used in our combined metric comes from the unsupervised machine learning literature. It was originally introduced as a measure for comparing hierarchical clusterings. However, we can also apply it to flat clustering, and if we view our partitions \mathcal{P} and \mathcal{P}' as results of running two clustering schemes on the points of the set \mathbb{S} , we can use it to measure similarity between them.

A version of this index also appears in the context of information retrieval, where it can be interpreted as the geometric mean of precision and recall. Statistical properties of the Fowlkes-Mallows Index are similar to those of our adjusted Rand Index. However, it comes with other issues such as numerical instability even for small partition sizes. I include it in this article as an alternative that might work better in some situations, but prefer using the adjusted Rand Index by default.

The Fowlkes-Mallows Index in its generalized form can be computed in our setting by the following formula:

\mathcal{FM}\left(\mathcal{P}, \mathcal{P}'\right) = \frac{\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{l}m_{i,j}^2-n}{\sqrt{\left( \sum_i |C_i|^2 - n \right) \left( \sum_j |C_j'|^2 - n \right)}} = \frac{n_{1,1}}{\sqrt{\left( n_{1,1} + n_{1,0} \right) \left( n_{1,1} + n_{0,1}\right)}}
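The same formula, written out from a contingency table (a minimal sketch with illustrative names):

def fowlkes_mallows(M):
    """Fowlkes-Mallows index from a contingency table M (list of lists of intersection counts)."""
    n = sum(sum(row) for row in M)
    tk = sum(m * m for row in M for m in row) - n                          # 2 * n_{1,1}
    pk = sum(sum(row) ** 2 for row in M) - n                               # 2 * (n_{1,1} + n_{1,0})
    qk = sum(sum(row[j] for row in M) ** 2 for j in range(len(M[0]))) - n  # 2 * (n_{1,1} + n_{0,1})
    # Undefined when either partition consists entirely of singletons (pk or qk is zero).
    return tk / (pk * qk) ** 0.5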

Mirkin Metric

This is a true metric on the space of clusterings of a set (i.e. it is symmetric, satisfies the identity of indiscernibles, and obeys the triangle inequality). It comes from the Hamming distance in the information theory literature. In our setting, each partition is represented as a vector in \left( \mathbb{Z}/2\mathbb{Z} \right)^{\binom{|\mathbb{S}|}{2}} indexed by unordered pairs of distinct elements of \mathbb{S} : a coordinate is 1 if the two elements of the pair belong to a common cell and 0 if they belong to different cells. The Mirkin Metric is then twice the Hamming distance between the vectors representing \mathcal{P} and \mathcal{P}' .

This measure is related to the Rand Index by:

\mathcal{M} \left( \mathcal{P}, \mathcal{P}' \right) = 2 \left( n_{0,1} + n_{1,0} \right) = n \left( n - 1 \right) \left( 1 - \mathcal{R} \left( \mathcal{P}, \mathcal{P}' \right) \right)

where n = |\mathbb{S}| and the n_{0,1}, n_{1,0} are the sizes of the subsets defined previously.
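The pair-vector interpretation can be checked directly with a short sketch (illustrative names; partitions as lists of sets covering the same elements): the Mirkin distance is exactly twice the Hamming distance between the two binary pair vectors.

from itertools import combinations

def pair_vector(P, elements):
    """0/1 vector over unordered pairs of elements: 1 if both elements share a cell of P."""
    cell = {x: i for i, Ci in enumerate(P) for x in Ci}
    return [int(cell[x] == cell[y]) for x, y in combinations(elements, 2)]

def mirkin(P, Q):
    """Mirkin metric: twice the Hamming distance between the pair vectors of P and Q."""
    elements = sorted({x for Ci in P for x in Ci})  # assumes P and Q partition the same set
    u, v = pair_vector(P, elements), pair_vector(Q, elements)
    return 2 * sum(a != b for a, b in zip(u, v))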

Jaccard Index

This index originated in ecology literature for measuring biodiversity. It disregards pairs that are in different cells in both partitions.

\mathcal{J} \left( \mathcal{P}, \mathcal{P}' \right) = \frac{n_{1,1}}{n_{1,1} + n_{1,0} + n_{0,1}}

van Dongen Measure

Another measure, derived from the combinatorics literature, is a symmetric dissimilarity based on maximum intersections of cells. It is also a true metric on the space of clusterings (similarly to the Mirkin metric).

Recall our contingency table M = \left( m_{i,j} \right) introduced earlier in this section.

\mathcal{D} \left( \mathcal{P}, \mathcal{P}' \right) = 2n - \sum\limits_{i=1}^{k}\max_j m_{i,j} - \sum\limits_{j=1}^{l}\max_i m_{i,j}

Variation of Information

All of the measures we have defined so far come from counting: either entries of the contingency table, or the sizes of the pair subsets of \mathbb{S} \times \mathbb{S} determined by the given pair of partitions.

Our final metric is inspired by the notion of entropy as defined by Claude Shannon in his seminal paper introducing information theory in the context of noisy channel communication.

Suppose we pick elements from our underlying set \mathbb{S} uniformly at random. The probability that a given element belongs to cell i is proportional to cell size p(i) = \frac{|C_i|}{n} . We will define the entropy associated with the partition \mathcal{P} to be:

\mathcal{H} \left( \mathcal{P} \right) = - \sum\limits_{i=1}^{k} p(i) \log_2 p(i)

This entropy measures the degree of uncertainty about which cell a randomly sampled element belongs to. The negative log probability is a positive number (I call it the plog) which diverges to infinity for infinitesimally small probability values and decays to zero as the probability approaches 1. Hence this definition of entropy can be thought of as the expected plog under the given partition. As always, it is useful to look at edge cases. In the case of the trivial (degenerate) partition \mathcal{P} = \left\{ \mathbb{S} \right\} (where the whole set forms a single cell), the entropy is zero, since there is no uncertainty about which cell any given point belongs to - they all belong to the only cell in the partition. The other extreme case of all singleton cells produces the maximum entropy value of \log_2 n (the degree of uncertainty is maximized).
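A quick sketch of this entropy, together with the two degenerate cases discussed above (illustrative function name; cells given as Python sets):

from math import log2

def partition_entropy(P):
    """Entropy of a partition P given as a list of cells (sets)."""
    n = sum(len(Ci) for Ci in P)
    return sum(-len(Ci) / n * log2(len(Ci) / n) for Ci in P)

S = set(range(1, 10))
print(partition_entropy([S]))               # 0.0: a single cell, no uncertainty
print(partition_entropy([{x} for x in S]))  # log2(9), about 3.17: all singletons, maximal uncertainty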

A related notion that we need to discuss before we can define our information theoretic metric is that of mutual information.

\mathcal{I} \left( \mathcal{P}, \mathcal{P}' \right) = \sum\limits_{i=1}^{k} \sum\limits_{j=1}^{l} p(i,j) \log_2 \frac{p(i,j)}{p(i)p(j)}

This can be interpreted as the reduction in uncertainty about the cell of a sampled element of \mathbb{S} in partition \mathcal{P} given the knowledge of its cell assignment in another partition \mathcal{P}' . In the formula above, the p(i,j) is the joint probability of an element belonging to cell C_i in \mathcal{P} and cell C_j' in \mathcal{P}' . That is p(i,j) = \frac{|C_i \cap C_j'|}{n} .

Mutual information is already a reasonable measure of association between two partitions of \mathbb{S} : it is symmetric, nonnegative, and bounded by the minimum of the two entropies, \mathcal{I} \left( \mathcal{P}, \mathcal{P}' \right) \leq \min \left\{ \mathcal{H} \left( \mathcal{P} \right), \mathcal{H} \left( \mathcal{P}' \right) \right\} . However, the entropy itself is unbounded, so mutual information is difficult to interpret in this raw form. We could normalize it by the geometric mean of the entropies of the two partitions to counter that effect. Instead, we will use it as part of another metric with nicer properties.

We are now ready to define the final metric, which we call Variation of Information (as it is analogous to the notion of total variation in functional analysis):

\mathcal{VI} \left( \mathcal{P}, \mathcal{P}' \right) = \mathcal{H} \left( \mathcal{P} \right) + \mathcal{H} \left( \mathcal{P}' \right) - 2 \mathcal{I} \left( \mathcal{P},\mathcal{P}' \right)

It is informative to rewrite this definition as a sum of two terms:

\mathcal{VI} \left( \mathcal{P}, \mathcal{P}' \right) = \left[ \mathcal{H} \left( \mathcal{P} \right) - \mathcal{I} \left( \mathcal{P},\mathcal{P}' \right) \right] + \left[ \mathcal{H} \left( \mathcal{P}' \right) - \mathcal{I} \left( \mathcal{P},\mathcal{P}' \right) \right]

In this way, the first summand can be interpreted as the amount of information about \mathcal{P} that was lost during transition to \mathcal{P}' , while the second summand corresponds to the amount of information about \mathcal{P}' gained in this process.
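Written directly in terms of cell sizes and intersections, the definition looks as follows (a compact sketch, assuming partitions given as lists of Python sets; the next section gives the contingency-table version used in the actual tool):

from math import log2

def variation_of_information(P, Q):
    """VI(P, Q) = H(P) + H(Q) - 2 I(P, Q) for partitions given as lists of sets."""
    n = sum(len(Ci) for Ci in P)
    h_p = sum(-len(Ci) / n * log2(len(Ci) / n) for Ci in P)
    h_q = sum(-len(Cj) / n * log2(len(Cj) / n) for Cj in Q)
    # Mutual information: sum over nonempty intersections only.
    mi = sum(len(Ci & Cj) / n * log2((len(Ci & Cj) * n) / (len(Ci) * len(Cj)))
             for Ci in P for Cj in Q if Ci & Cj)
    return h_p + h_q - 2 * mi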

Before continuing onto the next section of this article, it is useful to wrap up this part by summarizing some properties of \mathcal{VI} .

Our Variation of Information metric is: a true metric on the space of partitions of \mathbb{S} (it is symmetric, vanishes exactly for identical partitions, and satisfies the triangle inequality); bounded above by \log_2 n for a set of n elements; and computable directly from the contingency table, without reference to the individual elements of \mathbb{S} .

There are possible improvements to the Variation of Information metric, which we will not pursue for now, as our usage of the metric will be limited, and we will develop more fine-tuned ways of comparing topologies based on homological algebra in future articles. One relaxation that works nicely in the information theoretic setting would be to consider a probability distribution over the cells an element belongs to, instead of a hard assignment. We could also extend this to weighted elements.

SECTION 3

COMBINED METRIC

In this section we implement the ideas discussed in the previous section, guided by optimizations sourced from the relevant literature. The result will be a tool that computes similarity scores between partitions of word groups provided in the form of text files extracted from the corpus. This program computes a metric on the space of all partitions of a lexicon into word categories. The resulting metric is a convex combination (with equal weights) of scores based on the following five measures: the normalized Variation of Information, the adjusted Rand index (modified into a metric), the Jaccard index (modified into a metric), the normalized Mirkin metric, and the normalized van Dongen metric.

The first measure on the list differs in theoretical properties from the remaining four. It is based on information theoretic considerations as opposed to point-pair counts. Variation of Information makes no prior assumptions on the data and it comes with very desirable mathematical properties (discussed earlier in this article).

The Jaccard index and the adjusted Rand index have been modified into metrics, so that identical partitions score zero. All of the selected measures are symmetric and have been normalized to map into the [0, 1] interval of the reals.

Input consists of text files with space-separated words. Each line of a file represents one cluster (cell), and lines starting with # are treated as comments.

import math
import sys
from functools import reduce

import numpy as np


def loadVocabulary(vocabularyFileName):
    # Read a vocabulary file: one word per line, '#' lines are comments.
    vocabularyList = []
    vocabularyFile = open(vocabularyFileName)
    for line in vocabularyFile:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == '#':
            continue
        vocabularyList.append(tokens[0])
    vocabularyFile.close()
    return vocabularyList


def loadPartition(partitionFileName):
    # Read a partition file: each line is one cell of space separated words.
    partition = []
    partitionFile = open(partitionFileName)
    for line in partitionFile:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == '#':
            continue
        partition.append(tokens)
    partitionFile.close()
    return partition


def loadPartitions(partitionFileNameList):
    partitions = []
    for i in range(len(partitionFileNameList)):
        partitionFileName = partitionFileNameList[i]
        partitions.append(loadPartition(partitionFileName))
    return partitions


def computeContingencyTable(partition1, partition2):
    # Entry (i, j) is the size of the intersection of cell i and cell j.
    contingencyTable = np.zeros((len(partition1), len(partition2)))
    for i in range(len(partition1)):
        for j in range(len(partition2)):
            contingencyTable[i][j] = len(set(partition1[i]).intersection(set(partition2[j])))
    return contingencyTable


def clusterProbability(partition, k):
    nk = 1.0 * len(partition[k])
    n = 1.0 * len(reduce(lambda x, y: x + y, partition))
    return nk / n


def partitionEntropy(partition):
    result = 0
    for k in range(len(partition)):
        pk = clusterProbability(partition, k)
        result += pk * math.log(pk, 2)
    result *= -1.0
    return result


def mutualInformation(contingencyTable):
    result = 0
    n = contingencyTable.sum()
    for i in range(contingencyTable.shape[0]):
        for j in range(contingencyTable.shape[1]):
            pij = 1.0 * contingencyTable[i][j] / n
            if pij != 0:
                pi = contingencyTable.sum(axis=1)[i] / n
                pj = contingencyTable.sum(axis=0)[j] / n
                result += pij * math.log(pij / (pi * pj), 2)
    return result


def variationOfInformation(contingencyTable):
    # VI = H(P) + H(P') - 2 I(P, P').
    k1, k2 = contingencyTable.shape
    n = contingencyTable.sum()
    p1 = contingencyTable.sum(axis=1) / n
    p2 = contingencyTable.sum(axis=0) / n
    h1 = 0
    h2 = 0
    i12 = mutualInformation(contingencyTable)
    for k in range(k1):
        pk = p1[k]
        h1 += pk * math.log(pk, 2)
    h1 *= -1
    for k in range(k2):
        pk = p2[k]
        h2 += pk * math.log(pk, 2)
    h2 *= -1
    return (h1 + h2 - 2.0 * i12)


def normalizedVariationOfInformation(contingencyTable):
    # VI divided by its upper bound log2(n).
    return variationOfInformation(contingencyTable) / math.log(contingencyTable.sum(), 2)


def n11(contingencyTable):
    # Pairs placed in the same cell in both partitions.
    n = contingencyTable.sum()
    tmp = (contingencyTable**2).sum()
    return 0.5 * (tmp - n)


def n10(contingencyTable):
    # Pairs in the same cell of the first partition but not the second.
    rs = contingencyTable.sum(axis=1)
    tmp1 = (rs**2).sum()
    tmp2 = (contingencyTable**2).sum()
    return 0.5 * (tmp1 - tmp2)


def n01(contingencyTable):
    # Pairs in the same cell of the second partition but not the first.
    cs = contingencyTable.sum(axis=0)
    tmp1 = (cs**2).sum()
    tmp2 = (contingencyTable**2).sum()
    return 0.5 * (tmp1 - tmp2)


def n00(contingencyTable):
    # Pairs in different cells in both partitions.
    n = contingencyTable.sum()
    ses = (contingencyTable**2).sum()
    rs = contingencyTable.sum(axis=1)
    cs = contingencyTable.sum(axis=0)
    tmp1 = (rs**2).sum()
    tmp2 = (cs**2).sum()
    return 0.5 * (n * n + ses - tmp1 - tmp2)


def adjustedRandIndex(contingencyTable):
    # Rand-based score modified into a distance (0 for identical partitions).
    n = contingencyTable.sum()
    return (1.0 - 2.0 * (n11(contingencyTable) + n00(contingencyTable)) / (n * n - n))


def jacardIndex(contingencyTable):
    # Jaccard index modified into a distance (0 for identical partitions).
    return (1.0 - n11(contingencyTable) / (n11(contingencyTable) + n01(contingencyTable) + n10(contingencyTable)))


def mirkinMetric(contingencyTable):
    ses = (contingencyTable**2).sum()
    rs = contingencyTable.sum(axis=1)
    cs = contingencyTable.sum(axis=0)
    tmp1 = (rs**2).sum()
    tmp2 = (cs**2).sum()
    return (tmp1 + tmp2 - 2.0 * ses)


def normalizedMirkinMetric(contingencyTable):
    n = contingencyTable.sum()
    return mirkinMetric(contingencyTable) / (n * n)


def vanDongenMetric(contingencyTable):
    k1, k2 = contingencyTable.shape
    n = contingencyTable.sum()
    tmp1 = 0
    tmp2 = 0
    for i in range(k1):
        # Largest overlap of row i with any column.
        aux = 0
        for j in range(k2):
            if contingencyTable[i][j] > aux:
                aux = contingencyTable[i][j]
        tmp1 += aux
    for j in range(k2):
        # Largest overlap of column j with any row.
        aux = 0
        for i in range(k1):
            if contingencyTable[i][j] > aux:
                aux = contingencyTable[i][j]
        tmp2 += aux
    return (2.0 * n - tmp1 - tmp2)


def normalizedVanDongenMetric(contingencyTable):
    n = contingencyTable.sum()
    return vanDongenMetric(contingencyTable) / (2 * n)


def combinationMetric(contingencyTable):
    # Equal-weight convex combination of the five normalized measures.
    return (0.2 * normalizedVariationOfInformation(contingencyTable) +
            0.2 * adjustedRandIndex(contingencyTable) +
            0.2 * jacardIndex(contingencyTable) +
            0.2 * normalizedMirkinMetric(contingencyTable) +
            0.2 * normalizedVanDongenMetric(contingencyTable))


def main():
    if len(sys.argv) >= 3:
        goldStandard = loadPartition(sys.argv[1])
        partitions = loadPartitions(sys.argv[2:])
        for i in range(len(partitions)):
            partition = partitions[i]
            contingencyTable = computeContingencyTable(goldStandard, partition)
            print('comparing gold standard file to partition file: ' + sys.argv[i + 2])
            print('\tVariation of Information: ' + str(variationOfInformation(contingencyTable)))
            print('\tNormalized Variation of Information: ' + str(normalizedVariationOfInformation(contingencyTable)))
            print('\tAdjusted Rand Index (metric modified): ' + str(adjustedRandIndex(contingencyTable)))
            print('\tJacard Index (metric modified): ' + str(jacardIndex(contingencyTable)))
            print('\tMirkin Metric: ' + str(mirkinMetric(contingencyTable)))
            print('\tNormalized Mirkin Metric: ' + str(normalizedMirkinMetric(contingencyTable)))
            print('\tvan Dongen Metric: ' + str(vanDongenMetric(contingencyTable)))
            print('\tNormalized van Dongen Metric: ' + str(normalizedVanDongenMetric(contingencyTable)))
            print('\tCombination Metric: ' + str(combinationMetric(contingencyTable)))
            print('\n')
    else:
        print("USAGE: python partition-metric.py gold-standard-partition [partition-files]")


if __name__ == "__main__":
    main()

For a simple example run, consider the following three files representing partitions of the set \mathbb{S} = \left\{ a, b, c, d, e, f, g, h \right\}

gold standard

a b c
d e
f g
h

partition 1

a b c
d e h
f g

partition 2

a d f h
b e g c

Running our program with the three files as inputs produces the following combination metric distances between the gold standard and partitions 1 and 2 respectively: 0.11938596727993364 and 0.6613502529942193

All three partitions have different numbers of cells and different distributions of set elements. However, we see that partition 1 is much closer to the gold standard than partition 2. This is expected, because there is significant similarity in cell membership between the gold standard and partition 1. In contrast, partition 2 contains two clusters whose members were mostly distributed among four different cells in the gold standard partition (with the exception of elements b and c, which were in the same cell).

Remarks

This article is meant to distill some ideas in code. Future articles will develop more optimized algorithms to deal with discrete topologies of linguistic units.