Creating a semantic space

Jump to command-line usage.

Python code

Creating a space from co-occurrence counts

A space is created from co-occurrence counts, which can be read from a file in sparse (sm) or dense (dm) format. See information about the formats.

#ex01.py
#-------
from composes.semantic_space.space import Space

#create a space from co-occurrence counts in sparse format
my_space = Space.build(data = "./data/in/ex01.sm",
                       rows = "./data/in/ex01.rows",
                       cols = "./data/in/ex01.cols",
                       format = "sm")

#export the space in sparse format
my_space.export("./data/out/ex01", format = "sm")
    
#export the space in dense format
my_space.export("./data/out/ex01", format = "dm")

Here is what the input files of this example look like.
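To make the sparse format more concrete, the sketch below parses lines of the form "row column count" into a nested dictionary and expands them into a dense table. It is illustrative only: the sample counts and the helpers parse_sparse and to_dense are hypothetical, and the one-triple-per-line layout is an assumption about the sm format, so the format documentation remains the authoritative description.

```python
# Illustrative only: assumes each sm line is "row column count".
def parse_sparse(lines):
    """Collect whitespace-separated (row, column, count) triples."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row, col, value = line.split()
        counts.setdefault(row, {})[col] = float(value)
    return counts


def to_dense(counts, rows, cols):
    """Expand the nested dict into a list of lists; missing cells become 0."""
    return [[counts.get(r, {}).get(c, 0.0) for c in cols] for r in rows]


# Made-up data standing in for the .sm, .rows and .cols files.
sample_sm = ["car red 5", "car blue 1", "book red 3", "book blue 6"]
rows = ["car", "book"]
cols = ["red", "blue"]

dense = to_dense(parse_sparse(sample_sm), rows, cols)
print(dense)
```

The row and column lists fix the order of the dense table, which is the role the .rows and .cols files play in the example above.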

Alternatively, space objects can be saved and loaded using pickle:

#ex02.py
#-------
from composes.semantic_space.space import Space
from composes.utils import io_utils

#create a space from co-occurrence counts in sparse format
my_space = Space.build(data = "./data/in/ex01.sm",
                       rows = "./data/in/ex01.rows",
                       cols = "./data/in/ex01.cols",
                       format = "sm")

#print the co-occurrence matrix of the space
print my_space.cooccurrence_matrix

#save the Space object in pickle format
io_utils.save(my_space, "./data/out/ex01.pkl")
    
#load the saved object
my_space2 = io_utils.load("./data/out/ex01.pkl")

#print the co-occurrence matrix of the loaded space
print my_space2.cooccurrence_matrix
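For readers unfamiliar with pickle, the following standalone sketch shows the kind of round trip that ex02.py performs. The save and load functions here are hypothetical stand-ins written with the standard pickle module; they are not the io_utils implementation, and the dictionary is only a placeholder for a Space object.

```python
import os
import pickle
import tempfile


def save(obj, path):
    """Serialize obj to path (hypothetical stand-in, not io_utils.save)."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, pickle.HIGHEST_PROTOCOL)


def load(path):
    """Read back an object written by save."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict used as a placeholder for a Space object.
toy_space = {"rows": ["car", "book"], "matrix": [[5.0, 1.0], [3.0, 6.0]]}

path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
save(toy_space, path)
restored = load(path)
print(restored == toy_space)
```

One general property of pickle worth keeping in mind: a pickled object can only be loaded where its class definition is importable, so a saved space needs the toolkit to be installed when it is loaded.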

Applying transformations on spaces

A transformation (e.g., weighting, dimensionality reduction) can be applied to a semantic space, yielding a new semantic space.

Weighting

Example:

#ex03.py
#-------
from composes.utils import io_utils
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

#load a space saved in pickle format
my_space = io_utils.load("./data/out/ex01.pkl")

#print the co-occurrence matrix of the space
print my_space.cooccurrence_matrix

#apply ppmi weighting
my_space = my_space.apply(PpmiWeighting())

#print the co-occurrence matrix of the transformed space
print my_space.cooccurrence_matrix

Here is the list of available weighting schemes.
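As a rough illustration of what positive PMI weighting computes, the sketch below applies the textbook definition, max(0, log(count * total / (row_sum * column_sum))), to a small made-up count matrix. The ppmi function is a hypothetical pure-Python helper, not the PpmiWeighting class, and details such as the logarithm base are assumptions.

```python
import math


def ppmi(matrix):
    """Textbook positive PMI on a list-of-lists matrix of non-negative counts."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    weighted = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, count in enumerate(row):
            if count == 0:
                new_row.append(0.0)  # zero counts stay zero
                continue
            pmi = math.log(float(count) * total / (row_sums[i] * col_sums[j]))
            new_row.append(max(0.0, pmi))  # negative PMI is clipped to 0
        weighted.append(new_row)
    return weighted


counts = [[5, 1], [3, 6]]  # made-up co-occurrence counts
weighted = ppmi(counts)
for row in weighted:
    print(row)
```

Cells that co-occur less often than chance would predict end up at zero, which is why the weighted matrix is typically sparser than the raw counts.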

Dimensionality reduction

Example:

#ex04.py
#-------
from composes.utils import io_utils
from composes.transformation.dim_reduction.svd import Svd

#load a space
my_space = io_utils.load("./data/out/ex01.pkl")

#print the co-occurrence matrix and the columns of the space
print my_space.cooccurrence_matrix
print my_space.id2column

#apply svd reduction
my_space = my_space.apply(Svd(2))

#print the transformed space
print my_space.cooccurrence_matrix
print my_space.id2column

After dimensionality reduction, the space contains no information about the columns (context features) of the space.

Here is the list of available dimensionality reduction methods.
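The loss of column labels can be seen in a generic rank-1 example. The sketch below finds the leading right singular vector of a tiny made-up matrix by power iteration and projects each row onto it; the single remaining dimension is a mixture of the original columns, so no original column name applies to it. This is a pure-Python illustration under simplifying assumptions (rank 1, non-zero matrix), not the routine behind the Svd class.

```python
import math


def top_right_singular_vector(matrix, iterations=200):
    """Power iteration on A^T A; returns a unit-length vector v."""
    n_cols = len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u
        w = [sum(row[j] * u[i] for i, row in enumerate(matrix))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


counts = [[3.0, 0.0], [0.0, 1.0]]  # made-up matrix with singular values 3 and 1
v = top_right_singular_vector(counts)

# Project every row onto v: each row is now described by one latent dimension.
reduced = [[sum(a * b for a, b in zip(row, v))] for row in counts]
print(reduced)
```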

Other transformations

And here is the full set of transformations available.

Peripheral elements

A peripheral space contains elements that are specializations of the elements of another “core” space. For example, counts of phrases such as sports car, history book can be interpreted as peripheral to a core noun space containing car, book and so on. This affects the way some transformations are applied on these elements (as explained in the introduction, global/marginal statistics will be extracted from the core element data only).

A peripheral space can only be instantiated given a core space. All the transformations of the core space are automatically applied to the peripheral elements.
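The idea that marginal statistics come from the core data only can be sketched with PPMI. In the hypothetical helper below, the column sums and the grand total are taken from the core counts, while the row sum is taken from the peripheral row itself. Exactly which statistics the toolkit reuses is an assumption made for this illustration, and the counts are made up.

```python
import math


def ppmi_with_core_marginals(rows, core_col_sums):
    """PPMI for peripheral rows using column sums computed on the core space."""
    total = sum(core_col_sums)  # grand total also comes from the core
    weighted = []
    for row in rows:
        row_sum = sum(row)
        new_row = []
        for count, col_sum in zip(row, core_col_sums):
            if count == 0 or col_sum == 0:
                new_row.append(0.0)
            else:
                pmi = math.log(float(count) * total / (row_sum * col_sum))
                new_row.append(max(0.0, pmi))
        weighted.append(new_row)
    return weighted


core_counts = [[5, 1], [3, 6]]                           # e.g. rows "car", "book"
core_col_sums = [sum(col) for col in zip(*core_counts)]  # marginals from the core
peripheral_counts = [[2, 0]]                             # e.g. row "sports car"

weighted = ppmi_with_core_marginals(peripheral_counts, core_col_sums)
print(weighted)
```

With this design, adding or removing peripheral rows never changes the weights of the core rows, because the peripheral counts do not enter the marginals.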

Creating a peripheral space:

#ex05.py
#-------
from composes.utils import io_utils
from composes.semantic_space.peripheral_space import PeripheralSpace
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting


#load a space and apply ppmi on it
my_space = io_utils.load("./data/out/ex01.pkl")
my_space = my_space.apply(PpmiWeighting())

print my_space.cooccurrence_matrix
print my_space.id2row

#create a peripheral space 
my_per_space = PeripheralSpace.build(my_space,
                                     data="./data/in/ex05.sm",
                                     cols="./data/in/ex05.cols",
                                     format="sm")

print my_per_space.cooccurrence_matrix
print my_per_space.id2row

#save the space
io_utils.save(my_per_space, "./data/out/PER_SS.ex05.pkl")

Here are the corresponding input files.

IMPORTANT! The columns of the peripheral space have to be identical to those in the core space (including their order)!
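A quick check before building the peripheral space can catch a column mismatch early. The check_columns helper below is hypothetical and works on plain lists of column names, such as those read from the two .cols files; it is not part of the toolkit.

```python
def check_columns(core_cols, peripheral_cols):
    """Raise ValueError unless both lists hold the same columns in the same order."""
    if list(core_cols) == list(peripheral_cols):
        return
    if sorted(core_cols) == sorted(peripheral_cols):
        raise ValueError("same columns, different order")
    raise ValueError("column sets differ")


core_cols = ["red", "blue"]

check_columns(core_cols, ["red", "blue"])  # identical: passes silently

try:
    check_columns(core_cols, ["blue", "red"])  # same names, wrong order
except ValueError as error:
    message = str(error)
    print(message)
```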

Space class: documentation; PeripheralSpace class: documentation.

Command-line tools

Building core spaces

Usage:

python2.7 build_core_space.py [options] [config_file]

The options are:

-i, --input input_file_prefix

Prefix of the input files.

--input_format input_format

Input format of the file containing co-occurrence counts: one of sm (sparse matrix), dm (dense matrix), pkl (pickle), see information about the input formats.

-o, --output directory

Output directory. For each specification of space creation parameters, a file named CORE_SS.inputname.parameters.format is written to this directory. For example, if the input co-occurrence file has the prefix myfile, then CORE_SS.myfile.pkl contains just the co-occurrence data in pickle format; CORE_SS.myfile.ppmi.pkl contains the corresponding space transformed by Positive PMI; CORE_SS.myfile.nmf_100.pkl contains the space reduced to 100 dimensions via NMF; CORE_SS.myfile.ppmi.nmf_100.pkl contains the space after both transformations; and CORE_SS.myfile.ppmi.nmf_100.dm contains the same information in dense matrix format.

-w, --weighting weighting_schemes

Comma-separated list of weighting schemes. Example: ppmi,plmi,plog,epmi,none. Optional. See the list of available weighting schemes.

-r, --reduction dimensionality_reduction_methods

Comma-separated list of dimensionality reduction methods, each with the target dimensionality appended. Example: svd_100,nmf_100,nmf_300,none. Optional. See the list of available dimensionality reduction methods.

-s, --selection feature_selection_methods

Comma-separated list of feature selection methods, each with the number of features (dimensions) to be preserved appended. Example: top_sum_2000,top_length_1000. Optional. See the information about currently supported feature selection methods.

-n, --normalization normalization_methods

Comma-separated list of normalization methods. Example: row,all. Optional. See the information about normalization options.

--output_format additional_output_format

Additional output format for the semantic space: one of sm (sparse matrix), dm (dense matrix). It will be generated in addition to the default pickle output. Optional.

--gz True/False

If True, the input file is assumed to be a gzipped archive with .gz extension. Optional, default False.

-l, --log file

Logger output file. Optional, by default no logging output is produced.

-h, --help

Displays help message.
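The comma-separated option values above share a simple shape: a method name, optionally followed by an underscore and a number. The sketch below shows one way such values could be split; parse_spec_list is a hypothetical helper written for this illustration and says nothing about how build_core_space.py parses its arguments internally.

```python
def parse_spec_list(option_value):
    """Split a value such as "svd_100,none" into (name, size) pairs.

    size is None when the item carries no trailing number.
    """
    specs = []
    for item in option_value.split(","):
        item = item.strip()
        name, sep, size = item.rpartition("_")
        if sep and size.isdigit():
            specs.append((name, int(size)))
        else:
            specs.append((item, None))
    return specs


print(parse_spec_list("svd_100,nmf_300,none"))
print(parse_spec_list("top_sum_2000,top_length_1000"))
print(parse_spec_list("ppmi,plmi"))
```

Splitting on the last underscore rather than the first keeps method names such as top_sum intact.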

Examples:

python2.7 build_core_space.py -i ../examples/data/in/ex01 --input_format sm -o ../examples/data/out/ 
python2.7 build_core_space.py -i ../examples/data/in/ex01 --input_format sm --output_format dm -w ppmi,plog -r svd_2 -n none,row -o ../examples/data/out/ -l ../examples/data/out/ex01.log
#or
python2.7 build_core_space.py ../examples/data/in/config1.cfg
python2.7 build_core_space.py ../examples/data/in/config2.cfg

As the last two examples show, the script can read parameters from configuration files such as these.
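The output naming scheme described under the -o option can be sketched as follows. The helper core_output_names is hypothetical and reproduces only the four pickle names given in that description, assuming one file per combination of weighting and reduction, with none meaning that the step is skipped. How feature selection and normalization show up in file names is not stated above, so they are deliberately left out.

```python
from itertools import product


def core_output_names(prefix, weightings, reductions, fmt="pkl"):
    """Build CORE_SS.<prefix>.<parameters>.<format> names for each combination."""
    names = []
    for weighting, reduction in product(weightings, reductions):
        parts = ["CORE_SS", prefix]
        # "none" means the step is not applied, so it adds nothing to the name.
        parts.extend(step for step in (weighting, reduction) if step != "none")
        parts.append(fmt)
        names.append(".".join(parts))
    return names


names = core_output_names("myfile", ["none", "ppmi"], ["none", "nmf_100"])
for name in names:
    print(name)
```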

Adding peripheral elements

Usage:

python2.7 build_peripheral_space.py [options] [config_file]

The options are:

-i, --input input_file_prefix

Prefix of the input files.

--input_format input_format

Input format of the file containing co-occurrence counts: one of sm (sparse matrix), dm (dense matrix), pkl (pickle).

-o, --output directory

Output directory. The names of the files written to it begin with PER_SS.input_file_prefix., followed by the name of the corresponding core space file.

-c, --core core_space_file

File containing the core space. Pickle format (and .pkl extension) required. One of -c or --core_in_dir has to be provided.

--core_in_dir directory_of_core_space_files

If provided, it loads all space files found in the directory and computes the current space as peripheral to each of them. One of -c or --core_in_dir has to be provided.

--core_filter filter_string

If core_in_dir is provided, it acts as a filter on the file names contained in it: file names not containing the filter string are ignored. Optional, by default no filter is applied.

--output_format additional_output_format

Additional output format for the semantic space: one of sm (sparse matrix), dm (dense matrix). This is in addition to default pickle output. Optional.

--gz True/False

If True, the input file is assumed to be a gzipped archive with .gz extension. Optional, default False.

-l, --log file

Logger output file. Optional, by default no logging output is produced.

-h, --help

Displays help message.
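The effect of the --gz option can be pictured with the standard gzip module. The sketch below is a hypothetical helper, open_text, that returns a text stream for either a plain or a gzipped file; it writes a tiny made-up file first so that it is self-contained. It does not show how the scripts themselves open their input, and the "row column count" content is an assumption about the sparse format.

```python
import gzip
import io
import os
import tempfile


def open_text(path, gz=False):
    """Open path as UTF-8 text, decompressing on the fly when gz is true."""
    if gz:
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
    return io.open(path, "r", encoding="utf-8")


# Write a tiny gzipped file with made-up counts.
path = os.path.join(tempfile.mkdtemp(), "toy.sm.gz")
with gzip.open(path, "wb") as handle:
    handle.write(b"car red 5\nbook blue 6\n")

with open_text(path, gz=True) as handle:
    lines = [line.split() for line in handle]
print(lines)
```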

Example:

python2.7 build_peripheral_space.py -i ../examples/data/in/ex05 --input_format sm -o ../examples/data/out/ -c ../examples/data/out/CORE_SS.ex01.ppmi.svd_2.pkl
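
The interplay of --core_in_dir, --core_filter and the output naming can be sketched as below. Both helpers are hypothetical: select_core_files assumes that only files with a .pkl extension count as space files, and peripheral_output_name assumes that the input prefix is used without its directory part. The file names created in the temporary directory are empty placeholders.

```python
import os
import tempfile


def select_core_files(directory, filter_string=None):
    """List .pkl files in directory whose names contain filter_string."""
    selected = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".pkl"):
            continue  # assumption: only pickle files are core spaces
        if filter_string is not None and filter_string not in name:
            continue  # names without the filter string are ignored
        selected.append(name)
    return selected


def peripheral_output_name(input_prefix, core_file_name):
    """PER_SS.<input prefix>. followed by the core space file name."""
    return "PER_SS.%s.%s" % (input_prefix, core_file_name)


# Create empty placeholder files standing in for previously built core spaces.
core_dir = tempfile.mkdtemp()
for name in ["CORE_SS.ex01.pkl", "CORE_SS.ex01.ppmi.pkl",
             "CORE_SS.ex01.ppmi.svd_2.pkl", "ex01.log"]:
    open(os.path.join(core_dir, name), "w").close()

chosen = select_core_files(core_dir, "ppmi")
outputs = [peripheral_output_name("ex05", name) for name in chosen]
for name in outputs:
    print(name)
```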