Second-order co-occurrence pointwise mutual information

In computational linguistics, second-order co-occurrence pointwise mutual information (SOC-PMI) is a method used to measure semantic similarity, or how close in meaning two words are. The method does not require the two words to appear together in a text. Instead, it works by analyzing the "neighbor" words that typically appear alongside each of the two target words in a large body of text (corpus). If the two target words frequently share the same neighbors, they are considered semantically similar.

For example, the words "cemetery" and "graveyard" may not appear in the same sentence often, but they both frequently appear near words like "buried," "dead," and "funeral." SOC-PMI uses this shared context to determine that they have a similar meaning.

The method is called "second-order" because it doesn't look at the direct co-occurrence of the target words (which would be first-order), but at the co-occurrence of their neighbors (a second level of association). The strength of these associations is quantified using pointwise mutual information (PMI).

History

edit

The method builds on earlier work like the PMI-IR algorithm, which used the AltaVista search engine to calculate word association probabilities.[citation needed] The key advantage of a second-order approach like SOC-PMI is its ability to measure similarity between words that do not co-occur often, or at all. The British National Corpus (BNC) has been used as a source for word frequencies and contexts for this method.

Methodology

edit

The SOC-PMI algorithm measures the similarity between two words,   and  , in several steps.

Step 1: Score neighboring words with PMI

edit

First, for each target word (  and  ), the algorithm identifies its "neighbor" words within a certain text window (e.g., within 5 words to the left or right) across a large corpus. The strength of the association between a target word   and its neighbor   is calculated using pointwise mutual information (PMI). A higher PMI value means the two words appear together more often than would be expected by chance.

The PMI between a target word   and a neighbor word   is calculated as:

 

where:

  •   is the number of times   and   appear together in the context window.
  •   is the total number of times   appears in the corpus.
  •   is the total number of times   appears in the corpus.
  •   is the total number of tokens (words) in the corpus.

Step 2: Create a semantic 'signature' for each word

edit

For each target word (  and  ), the algorithm creates a list of its most significant neighbors. This is done by taking the top   neighbor words, sorted in descending order by their PMI score with the target word. This list of top neighbors,  , acts as a semantic "signature" for the word  .

 , for  

The size of this list,  , is a parameter of the method.

Step 3: Compare the signatures

edit

The algorithm then compares the signatures of   and  . It looks for words that are present in both signatures. The similarity of   to   is calculated by summing the PMI scores of   with every word in  's signature list.

The  -PMI summation function defines this score. The score for   with respect to   is:

 

This sum only includes terms where the PMI value is positive. The exponent   (with a value > 1) is used to give more weight to neighbors that are more strongly associated with  .

This calculation is done in both directions:

  • The similarity of   with respect to  :
 
  • The similarity of   with respect to  :
 

Step 4: Calculate final similarity score

edit

Finally, the total semantic similarity is the average of the two scores from the previous step.

 

This score can be normalized to fall between 0 and 1. For example, using this method, the words cemetery and graveyard achieve a high similarity score of 0.986 (with specific parameter settings).

References

edit