Second-order co-occurrence pointwise mutual information

In computational linguistics, second-order co-occurrence pointwise mutual information (SOC-PMI) is a method used to measure semantic similarity, or how close in meaning two words are. The method does not require the two words to appear together in a text. Instead, it works by analyzing the "neighbor" words that typically appear alongside each of the two target words in a large body of text (corpus). If the two target words frequently share the same neighbors, they are considered semantically similar.

For example, the words "cemetery" and "graveyard" may not appear in the same sentence often, but they both frequently appear near words like "buried," "dead," and "funeral." SOC-PMI uses this shared context to determine that they have a similar meaning.

The method is called "second-order" because it does not rely on the direct co-occurrence of the target words (which would be first-order), but on the co-occurrence of each target word with its neighbors (a second level of association). The strength of these associations is quantified using pointwise mutual information (PMI).

History

The method builds on earlier work like the PMI-IR algorithm, which used the AltaVista search engine to calculate word association probabilities. The key advantage of a second-order approach like SOC-PMI is its ability to measure similarity between words that do not co-occur often, or at all. The British National Corpus (BNC) has been used as a source for word frequencies and contexts for this method.

Methodology

The SOC-PMI algorithm measures the similarity between two words, $w_{1}$ and $w_{2}$ , in several steps.

Step 1: Score neighboring words with PMI

First, for each target word ($w_{1}$ and $w_{2}$), the algorithm identifies its "neighbor" words within a fixed text window (e.g., within 5 words to the left or right) across a large corpus. The strength of the association between a target word, denoted $t_{i}$ here, and a neighbor $w$ is calculated using pointwise mutual information (PMI). A higher PMI value means the two words appear together more often than would be expected by chance.
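The neighbor-collection step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `window_cooccurrences`, the pre-tokenized input list, and the symmetric window are all assumptions made for the example.

```python
from collections import Counter


def window_cooccurrences(tokens, target, window=5):
    """Count how often each word falls within `window` positions of `target`.

    `tokens` is assumed to be a list of already-tokenized words.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Clamp the window to the bounds of the token list.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts


tokens = "the dead are buried in the old cemetery near the church".split()
neighbors = window_cooccurrences(tokens, "cemetery", window=2)
```

With a window of 2, the neighbors of "cemetery" in this toy sentence are "the" (twice), "old", and "near". Each count plays the role of a windowed co-occurrence count for the target word.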

The PMI between a target word $t_{i}$ and a neighbor word $w$ is calculated as:

f^{\text{pmi}}(t_{i},w)=\log _{2}{\frac {f^{b}(t_{i},w)\times m}{f^{t}(t_{i})f^{t}(w)}}

where:

$f^{b}(t_{i},w)$ is the number of times $t_{i}$ and $w$ appear together in the context window.
$f^{t}(t_{i})$ is the total number of times $t_{i}$ appears in the corpus.
$f^{t}(w)$ is the total number of times $w$ appears in the corpus.
$m$ is the total number of tokens (words) in the corpus.
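
The formula translates directly into a small function. The sketch below assumes the four counts have already been collected. The name `pmi` and its parameter names are hypothetical, and returning negative infinity for a pair that never co-occurs is a choice made for this example, not something specified by the method.

```python
import math


def pmi(joint_count, target_count, neighbor_count, total_tokens):
    """PMI of a target word and a neighbor word, in bits.

    joint_count    -- f^b(t_i, w): co-occurrences inside the window
    target_count   -- f^t(t_i): occurrences of the target in the corpus
    neighbor_count -- f^t(w): occurrences of the neighbor in the corpus
    total_tokens   -- m: number of tokens in the corpus
    """
    if joint_count == 0:
        # Assumption for this sketch: log2(0) is treated as -infinity.
        return float("-inf")
    return math.log2(joint_count * total_tokens / (target_count * neighbor_count))


# Toy counts, invented for illustration only.
score = pmi(joint_count=2, target_count=2, neighbor_count=1, total_tokens=4)
```

Here the argument of the logarithm is (2 × 4) / (2 × 1) = 4, so `score` is 2.0: the pair co-occurs four times more often than independence would predict.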

Step 2: Create a semantic 'signature' for each word

For each target word ( $w_{1}$ and $w_{2}$ ), the algorithm creates a list of its most significant neighbors. This is done by taking the top $\beta$ neighbor words, sorted in descending order by their PMI score with the target word. This list of top neighbors, $X^{w}$ , acts as a semantic "signature" for the word $w$ .

X^{w}=\{X_{i}^{w}\},\quad i=1,2,\ldots ,\beta

The size of this list, $\beta$ , is a parameter of the method.
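
Building the signature amounts to a sort followed by a cut-off. In the sketch below, the function name `signature` and the dictionary layout (neighbor word mapped to its PMI with the target) are assumptions made for illustration, and the PMI values are invented.

```python
def signature(pmi_scores, beta):
    """Return the `beta` neighbors with the highest PMI, best first.

    `pmi_scores` maps each neighbor word to its PMI with the target word.
    """
    ranked = sorted(pmi_scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:beta]]


# Invented PMI values for neighbors of a target word.
scores = {"buried": 3.2, "dead": 2.5, "the": 0.1, "funeral": 2.9}
top_two = signature(scores, beta=2)
```

For these values `top_two` is `["buried", "funeral"]`. Because Python's sort is stable, neighbors with equal PMI keep the order they had in the dictionary.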

Step 3: Compare the signatures

The algorithm then compares the signatures of $w_{1}$ and $w_{2}$ . It looks for words that are present in both signatures. The similarity of $w_{1}$ to $w_{2}$ is calculated by summing the PMI scores of $w_{2}$ with every word in $w_{1}$ 's signature list.

The $\beta$ -PMI summation function defines this score. The score for $w_{1}$ with respect to $w_{2}$ is:

f(w_{1},w_{2},\beta )=\sum _{i=1}^{\beta }(f^{\text{pmi}}(X_{i}^{w_{1}},w_{2}))^{\gamma }

This sum only includes terms where the PMI value is positive. The exponent $\gamma$ (with a value > 1) is used to give more weight to neighbors that are more strongly associated with $w_{2}$ .

This calculation is done in both directions:

The similarity of $w_{1}$ with respect to $w_{2}$ :

f(w_{1},w_{2},\beta _{1})=\sum _{i=1}^{\beta _{1}}(f^{\text{pmi}}(X_{i}^{w_{1}},w_{2}))^{\gamma }

The similarity of $w_{2}$ with respect to $w_{1}$ :

f(w_{2},w_{1},\beta _{2})=\sum _{i=1}^{\beta _{2}}(f^{\text{pmi}}(X_{i}^{w_{2}},w_{1}))^{\gamma }
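
A possible rendering of the $\beta$-PMI summation in Python is shown below. It is a sketch under stated assumptions: `beta_pmi_sum` is a hypothetical name, the PMI values are passed in as a dictionary, neighbors missing from that dictionary are treated as contributing nothing, and the default `gamma=3.0` is only an illustrative value greater than 1.

```python
def beta_pmi_sum(signature_a, pmi_with_b, gamma=3.0):
    """Sum the positive PMI values between word b and the signature of word a.

    signature_a -- the top-beta neighbor list of word a
    pmi_with_b  -- maps a word to its PMI with word b
    gamma       -- exponent (> 1) that emphasizes strong associations
    """
    total = 0.0
    for neighbor in signature_a:
        score = pmi_with_b.get(neighbor, 0.0)  # assumption: unseen pair adds nothing
        if score > 0:  # only positive PMI values enter the sum
            total += score ** gamma
    return total


# Invented values for illustration.
sig_w1 = ["buried", "dead", "tomb"]
pmi_w2 = {"buried": 2.0, "dead": 1.0, "tomb": -1.0, "grass": 4.0}
forward = beta_pmi_sum(sig_w1, pmi_w2, gamma=2.0)
```

In this toy case `forward` is 2.0² + 1.0² = 5.0. "tomb" is skipped because its PMI with the second word is negative, and "grass" is ignored because it is not in the first word's signature. Swapping the roles of the two words gives the score in the other direction.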

Step 4: Calculate final similarity score

Finally, the total semantic similarity is the sum of the two scores from the previous step, each divided by the size of the corresponding signature list ($\beta _{1}$ or $\beta _{2}$).

\mathrm {Sim} (w_{1},w_{2})={\frac {f(w_{1},w_{2},\beta _{1})}{\beta _{1}}}+{\frac {f(w_{2},w_{1},\beta _{2})}{\beta _{2}}}

This score can be normalized to fall between 0 and 1. For example, using this method, the words "cemetery" and "graveyard" achieve a high similarity score of 0.986 (with specific parameter settings).
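
The unnormalized score from the formula above can be sketched end to end as follows. All names (`soc_pmi_similarity`, `_directed_sum`) and all PMI values are hypothetical. The sketch takes $\beta _{1}$ and $\beta _{2}$ to be the lengths of the supplied signature lists and does not attempt the normalization step, whose details are not given here.

```python
def _directed_sum(signature_a, pmi_with_b, gamma):
    """Sum of positive PMI values (raised to gamma) over a's signature."""
    return sum(
        pmi_with_b[word] ** gamma
        for word in signature_a
        if pmi_with_b.get(word, 0.0) > 0
    )


def soc_pmi_similarity(sig1, pmi1, sig2, pmi2, gamma=3.0):
    """Unnormalized Sim(w1, w2) = f(w1, w2) / beta1 + f(w2, w1) / beta2.

    sig1, sig2 -- signature lists of w1 and w2 (assumed non-empty)
    pmi1, pmi2 -- map a word to its PMI with w1 and with w2, respectively
    """
    beta1, beta2 = len(sig1), len(sig2)
    forward = _directed_sum(sig1, pmi2, gamma)   # w1's neighbors scored against w2
    backward = _directed_sum(sig2, pmi1, gamma)  # w2's neighbors scored against w1
    return forward / beta1 + backward / beta2


# Invented values for illustration.
sig1, pmi1 = ["buried", "dead"], {"buried": 2.0, "dead": 1.0, "funeral": 1.0}
sig2, pmi2 = ["buried", "funeral"], {"buried": 1.0, "funeral": 3.0, "dead": -0.5}
sim = soc_pmi_similarity(sig1, pmi1, sig2, pmi2, gamma=2.0)
```

With these numbers the forward term is 1.0² / 2 = 0.5 and the backward term is (2.0² + 1.0²) / 2 = 2.5, so `sim` is 3.0. The function is symmetric in its two words, since swapping them only swaps the two terms of the sum.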

References

Islam, A. and Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2, 2 (Jul. 2008), 1–25.
Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.