Similarity queries

Python code

Comparing two words/phrases

A semantic space can be used to compute a similarity score between two words or phrases, given a similarity measure:

#ex06.py
#-------
from composes.utils import io_utils
from composes.similarity.cos import CosSimilarity

#load a space
my_space = io_utils.load("./data/out/ex01.pkl")

print my_space.cooccurrence_matrix
print my_space.id2row

#compute similarity between two words in the space 
print my_space.get_sim("car", "car", CosSimilarity())
print my_space.get_sim("car", "book", CosSimilarity())

See the list of available similarity measures.
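As a rough illustration of what a cosine-based score measures, the sketch below computes the cosine between two plain Python lists. It does not use the toolkit: the `cosine` helper, the example vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration, not a description of how CosSimilarity is implemented.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returning 0.0 when either vector is all zeros is a choice made
    for this sketch, not documented toolkit behaviour.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence vectors, not the contents of ex01.pkl.
car = [3.0, 4.0, 0.0]
book = [0.0, 4.0, 3.0]

sim_same = cosine(car, car)    # a vector compared with itself
sim_diff = cosine(car, book)   # two vectors sharing one dimension
```

With these made-up vectors, comparing a vector with itself gives the maximum score, while the two different vectors score lower because they overlap in only one dimension.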

The words/phrases to be compared do not have to be stored in the same semantic space, although they must be represented by the same number of dimensions. The following example computes the similarity between elements from two different spaces:

#ex07.py
#-------
from composes.utils import io_utils
from composes.similarity.cos import CosSimilarity

#load two spaces
my_space = io_utils.load("./data/out/ex01.pkl")
my_per_space = io_utils.load("./data/out/PER_SS.ex05.pkl")

print my_space.id2row
print my_per_space.id2row

#compute similarity between a word and a phrase in the two spaces
print my_space.get_sim("car", "sports_car", CosSimilarity(), 
                       space2 = my_per_space)
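The dimensionality requirement for cross-space comparison can be sketched with two plain dictionaries standing in for the two spaces. Everything here is hypothetical (the `get_sim_across` helper, the dictionaries, and the vectors); the sketch only shows why both vectors need the same number of dimensions, not how the toolkit performs the lookup.

```python
import math


def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either norm is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def get_sim_across(word1, word2, space1, space2):
    """Look up word1 in space1 and word2 in space2, then compare them.

    The two spaces are hypothetical dicts mapping a word or phrase to
    its vector; a missing word simply raises KeyError in this sketch.
    """
    u = space1[word1]
    v = space2[word2]
    if len(u) != len(v):
        raise ValueError("the two spaces must have the same number of dimensions")
    return cosine(u, v)


# Made-up vectors: a core space with words, a peripheral space with phrases.
core = {"car": [3.0, 4.0, 0.0]}
peripheral = {"sports_car": [6.0, 8.0, 0.0]}

score = get_sim_across("car", "sports_car", core, peripheral)
```

Because the phrase vector is a scaled copy of the word vector, the score is the maximum possible; passing a peripheral vector with a different length raises an error instead of returning a meaningless number.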

Finding neighbours

A semantic space can also be used to compute the k-nearest neighbours of a word or phrase, according to a similarity measure:

#ex08.py
#-------
from composes.utils import io_utils
from composes.similarity.cos import CosSimilarity

#load a space
my_space = io_utils.load("./data/out/ex01.pkl")

#get the top 2 neighbours of "car"
print my_space.get_neighbours("car", 2, CosSimilarity())
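Conceptually, a k-nearest-neighbour query scores every row of a space against the query vector and keeps the k best. The sketch below does this over a plain dictionary. The `nearest_neighbours` helper and the data are assumptions for illustration only; the sketch does not exclude the query word from its own neighbour list, and whether the toolkit does is not stated here.

```python
import math


def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either norm is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def nearest_neighbours(word, k, space):
    """Return the k (neighbour, score) pairs most similar to word, best first."""
    target = space[word]
    scored = [(other, cosine(target, vec)) for other, vec in space.items()]
    # Highest similarity first; the query word itself is not filtered out.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical space: word -> vector (not the contents of ex01.pkl).
space = {
    "car": [3.0, 4.0, 0.0],
    "book": [0.0, 4.0, 3.0],
    "bike": [3.0, 4.0, 1.0],
}

top2 = nearest_neighbours("car", 2, space)
```

Sorting the whole space is the simplest approach and is fine for small examples; for large spaces a partial selection or matrix-based computation would be the more usual choice.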

Again, the neighbours can be extracted from a different space:

#ex09.py
#-------
from composes.utils import io_utils
from composes.similarity.cos import CosSimilarity

#load two spaces
my_space = io_utils.load("./data/out/ex01.pkl")
my_per_space = io_utils.load("./data/out/PER_SS.ex05.pkl")

print my_space.id2row
print my_space.cooccurrence_matrix
print my_per_space.id2row
print my_per_space.cooccurrence_matrix

#get the top two neighbours of "car" in a peripheral space 
print my_space.get_neighbours("car", 2, CosSimilarity(), 
                              space2 = my_per_space)

Command-line tools

Comparing a list of word or phrase pairs

Usage:

python2.7 compute_similarities.py [options] [config_file]

Options:

-i, --input input_comparison_file

Input file containing the list of word/phrase pairs to be compared. NB: if an element of a pair is not in the space(s) used for the comparison, the pair is assigned a similarity of 0.

-c, --columns columns_in_the_input_file

Columns in the input file containing the words/phrases to be compared. For example -c 1,2 if the words/phrases are given as the first two columns.

-o, --output directory

Output directory. After the command runs, this directory contains new text files named SIMS.input_comparison_file.space_file.similarity_measure (e.g., SIMS.word_pairs1.txt.CORE_SS.myfile.ppmi.euclidean) or, when two spaces are given, SIMS.input_comparison_file.space_file1.space_file2.similarity_measure. Such names quickly become unwieldy, as in SIMS.word_pairs2.txt.CORE_SS.myfile.ppmi.nmf_200.PER_SS.perfile.CORE_SS.myfile.ppmi.nmf_200.cos. A separate file is created for each input semantic space or pair of semantic spaces, and for each similarity measure. Each output file contains the lines of the input file with the similarity score of the word/phrase pair appended: if the input contains the line car book, the output might contain car book 0.438529009654; if the input contains the line car book 1, the output will contain car book 1 0.438529009654.
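The naming scheme and the appended-score format can be sketched in a few lines. This is not the script's own code: the helper names are hypothetical, and the sketch assumes that the .pkl extension is dropped from space file names (as the example names suggest), that columns are whitespace-separated and numbered from 1, and that the score is appended after a single space.

```python
import os


def output_name(input_file, space_files, measure, prefix="SIMS"):
    """Build PREFIX.input_file.space_file[.space_file2].measure."""
    parts = [prefix, os.path.basename(input_file)]
    for space_file in space_files:
        base = os.path.basename(space_file)
        # Assumption: the .pkl extension is not part of the output name.
        if base.endswith(".pkl"):
            base = base[:-len(".pkl")]
        parts.append(base)
    parts.append(measure)
    return ".".join(parts)


def pair_from_line(line, columns):
    """Pick the two compared items from a line, given a value such as "1,2"."""
    fields = line.split()
    first, second = [int(c) for c in columns.split(",")]
    return fields[first - 1], fields[second - 1]


def append_score(line, score):
    """Return the input line with the similarity score appended."""
    return "{0} {1}".format(line.rstrip("\n"), score)


name = output_name("../in/word_pairs1.txt",
                   ["../out/CORE_SS.myfile.ppmi.pkl"], "euclidean")
pair = pair_from_line("car book 1", "1,2")
out_line = append_score("car book 1", 0.438529009654)
```

Under these assumptions the helpers reproduce the file names and output lines quoted above, including the long two-space name.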

-s, --space space_file or space_file1,space_file2

File(s) containing the space(s) to be used. If a second file is provided, the second element of each pair is retrieved from the second space. Pickle format (and .pkl extension) required. One of -s or --in_dir is required.

--in_dir directory

Input directory for the semantic spaces. If provided, all files with the .pkl extension in this directory are loaded one at a time and the -s value is ignored. Output files are produced for every input file, but cross-space measurements, which the -s option allows, cannot be requested. One of -s or --in_dir is required.

-m, --sim_measure similarity_measures

Comma-separated list of similarity measures. Example: cos,lin. See the list of available similarity measures.

-l, --log file

Logger output file. Optional, by default no logging output is produced.

-h, --help

Displays help message.

Examples:

python2.7 compute_similarities.py -i ../examples/data/in/word_pairs1.txt -c 1,2 -s ../examples/data/out/ex01.pkl -o ../examples/data/out/ -m cos,euclidean        
python2.7 compute_similarities.py -i ../examples/data/in/word_pairs2.txt -c 1,2 -s ../examples/data/out/ex01.pkl,../examples/data/out/PER_SS.ex05.pkl -o ../examples/data/out/ -m cos,euclidean        

Finding the neighbours of a list of words/phrases

Usage:

python2.7 compute_neighbours.py [options] [config_file]

Options:

-i, --input input_file

Input file containing the list of words/phrases, one per line.

-o, --output directory

Output directory. Naming conventions as for compute_similarities.py with prefix NEIGHBOURS instead of SIMS. The output file contains each input word/phrase on a separate line, followed by tab-prefixed lines showing the neighbours and corresponding similarity scores.
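The layout described above, each input word on its own line followed by tab-prefixed neighbour lines, could be produced along the following lines. The `format_neighbours` helper is hypothetical, and the single space between a neighbour and its score is an assumption; the actual separator written by the script is not specified here.

```python
def format_neighbours(word, neighbours):
    """Format one word and its (neighbour, score) pairs as an output block."""
    lines = [word]
    for neighbour, score in neighbours:
        # Assumption: neighbour and score are separated by a single space.
        lines.append("\t{0} {1}".format(neighbour, score))
    return "\n".join(lines)


# Made-up neighbours and scores, for illustration only.
block = format_neighbours("car", [("car", 1.0), ("book", 0.5)])
```

Each word in the input file would contribute one such block to the NEIGHBOURS output file.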

-s, --space space_file or space_file1,space_file2

File(s) containing the space(s) to be used. If a second file is provided, the neighbours are extracted from the additional space. Pickle format (and .pkl extension) required.

-m, --sim_measure similarity_measure

Similarity measure. Example: cos. See the list of available similarity measures.

-n, --no_neighbours number_of_neighbours

Number of neighbours to be returned. Optional, default: 20.

-l, --log file

Logger output file. Optional, by default no logging output is produced.

-h, --help

Displays help message.

Examples:

python2.7 compute_neighbours.py -i ../examples/data/in/word_list.txt -n 2 -s ../examples/data/out/ex01.pkl -o ../examples/data/out/ -m cos    
python2.7 compute_neighbours.py -i ../examples/data/in/word_list.txt -n 2 -s ../examples/data/out/ex01.pkl,../examples/data/out/PER_SS.ex05.pkl -o ../examples/data/out/ -m cos