A space is created from co-occurrence counts, which can be read from a file in sparse (sm) or dense (dm) format. See information about the formats.
    #ex01.py
    #-------
    from composes.semantic_space.space import Space

    #create a space from co-occurrence counts in sparse format
    my_space = Space.build(data = "./data/in/ex01.sm",
                           rows = "./data/in/ex01.rows",
                           cols = "./data/in/ex01.cols",
                           format = "sm")

    #export the space in sparse format
    my_space.export("./data/out/ex01", format = "sm")

    #export the space in dense format
    my_space.export("./data/out/ex01", format = "dm")
Here is what the input files of this example look like.
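For orientation, the sparse format can be thought of as one `row column count` triple per line. The sketch below parses such text into a nested dictionary using only the standard library. The sample rows and the `read_sparse` helper are hypothetical and not part of the toolkit, and the one-triple-per-line layout is an assumption about the sm format.

```python
from collections import defaultdict

# Hypothetical sample, assuming the sparse (sm) format holds one
# whitespace-separated "row column count" triple per line.
SAMPLE_SM = """\
car red 5
car blue 1
book red 3
"""

def read_sparse(text):
    """Parse sparse-format text into {row: {column: count}}."""
    counts = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row, col, value = line.split()
        counts[row][col] = float(value)
    return {row: dict(cols) for row, cols in counts.items()}

matrix = read_sparse(SAMPLE_SM)
print(sorted(matrix))
print(matrix["car"]["red"])
```

Cells that never appear in the file are simply absent from the dictionary, which is what makes the format compact for mostly-zero co-occurrence data.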
Alternatively, space objects can be saved and loaded using pickle:
    #ex02.py
    #-------
    from composes.semantic_space.space import Space
    from composes.utils import io_utils

    #create a space from co-occurrence counts in sparse format
    my_space = Space.build(data = "./data/in/ex01.sm",
                           rows = "./data/in/ex01.rows",
                           cols = "./data/in/ex01.cols",
                           format = "sm")

    #print the co-occurrence matrix of the space
    print my_space.cooccurrence_matrix

    #save the Space object in pickle format
    io_utils.save(my_space, "./data/out/ex01.pkl")

    #load the saved object
    my_space2 = io_utils.load("./data/out/ex01.pkl")

    #print the co-occurrence matrix of the loaded space
    print my_space2.cooccurrence_matrix
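As a rough illustration of what a pickle round trip involves, here is a standard-library sketch that saves and reloads a plain dictionary standing in for a Space object. It is an assumption that `io_utils.save` and `io_utils.load` behave along these lines; the `toy_space` object and the file name are made up for the example.

```python
import os
import pickle
import tempfile

# Stand-in for a Space object: any picklable Python object will do.
toy_space = {"rows": ["car", "book"], "counts": [[5, 1], [3, 0]]}

path = os.path.join(tempfile.mkdtemp(), "toy_space.pkl")

# Save: serialize the object to a binary file.
with open(path, "wb") as handle:
    pickle.dump(toy_space, handle, pickle.HIGHEST_PROTOCOL)

# Load: read an equal copy of the object back.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == toy_space)
```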
A transformation (e.g., weighting or dimensionality reduction) can be applied to a semantic space, yielding a new semantic space.
Example:
    #ex03.py
    #-------
    from composes.utils import io_utils
    from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

    #load the space saved in ex02.py
    my_space = io_utils.load("./data/out/ex01.pkl")

    #print the co-occurrence matrix of the space
    print my_space.cooccurrence_matrix

    #apply ppmi weighting
    my_space = my_space.apply(PpmiWeighting())

    #print the co-occurrence matrix of the transformed space
    print my_space.cooccurrence_matrix
Here is the list of available weighting schemes.
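To make the effect of Positive PMI weighting concrete, the following sketch computes it for a small dense count matrix in plain Python. It uses the textbook definition, max(0, log(p(r,c) / (p(r) p(c)))), with the natural logarithm. The `ppmi` function is a hypothetical helper, and the toolkit's `PpmiWeighting` may differ in details.

```python
import math

def ppmi(counts):
    """Return the Positive PMI weighting of a dense count matrix (list of rows)."""
    total = float(sum(sum(row) for row in counts))
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weighted = []
    for i, row in enumerate(counts):
        new_row = []
        for j, count in enumerate(row):
            if count == 0:
                new_row.append(0.0)  # log(0) is undefined; PPMI maps it to 0
                continue
            pmi = math.log(count * total / (row_sums[i] * col_sums[j]))
            new_row.append(max(0.0, pmi))  # clip negative associations to 0
        weighted.append(new_row)
    return weighted

toy_counts = [[4, 0], [0, 4]]
for row in ppmi(toy_counts):
    print(row)
```

Cells that co-occur more often than their marginals predict get a positive weight; everything else becomes zero.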
Example:
    #ex04.py
    #-------
    from composes.utils import io_utils
    from composes.transformation.dim_reduction.svd import Svd

    #load a space
    my_space = io_utils.load("./data/out/ex01.pkl")

    #print the co-occurrence matrix and the columns of the space
    print my_space.cooccurrence_matrix
    print my_space.id2column

    #apply svd reduction
    my_space = my_space.apply(Svd(2))

    #print the transformed space
    print my_space.cooccurrence_matrix
    print my_space.id2column
After dimensionality reduction, the space contains no information about the columns (context features) of the space.
Here is the list of available dimensionality reduction methods.
And here is the full set of transformations available.
A peripheral space contains elements that are specializations of the elements of another “core” space. For example, counts of phrases such as sports car or history book can be interpreted as peripheral to a core noun space containing car, book and so on. This affects the way some transformations are applied to these elements (as explained in the introduction, global/marginal statistics will be extracted from the core element data only).
A peripheral space can only be instantiated given a core space. All the transformations of the core space are automatically applied to the peripheral elements.
Creating a peripheral space:
    #ex05.py
    #-------
    from composes.utils import io_utils
    from composes.semantic_space.peripheral_space import PeripheralSpace
    from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

    #load a space and apply ppmi on it
    my_space = io_utils.load("./data/out/ex01.pkl")
    my_space = my_space.apply(PpmiWeighting())

    print my_space.cooccurrence_matrix
    print my_space.id2row

    #create a peripheral space
    my_per_space = PeripheralSpace.build(my_space,
                                         data="./data/in/ex05.sm",
                                         cols="./data/in/ex05.cols",
                                         format="sm")

    print my_per_space.cooccurrence_matrix
    print my_per_space.id2row

    #save the space
    io_utils.save(my_per_space, "./data/out/PER_SS.ex05.pkl")
And the corresponding input files.
IMPORTANT! The columns of the peripheral space have to be identical to those in the core space (including their order)!
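Before building a peripheral space, it can help to verify this requirement on the column lists directly. The sketch below is a hypothetical stand-alone check, not a toolkit function; it assumes the columns of both spaces are available as plain Python lists.

```python
def check_columns(core_columns, peripheral_columns):
    """Raise ValueError unless both column lists match element by element."""
    if len(core_columns) != len(peripheral_columns):
        raise ValueError("column count differs: %d vs %d"
                         % (len(core_columns), len(peripheral_columns)))
    for index, (core, per) in enumerate(zip(core_columns, peripheral_columns)):
        if core != per:
            raise ValueError("column %d differs: %r vs %r" % (index, core, per))

core_cols = ["red", "blue", "fast"]
check_columns(core_cols, ["red", "blue", "fast"])      # passes silently

try:
    check_columns(core_cols, ["blue", "red", "fast"])  # same set, wrong order
except ValueError as error:
    print(error)
```

Comparing position by position, rather than comparing sets, is what catches a reordered column file.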
Space class: documentation; PeripheralSpace class: documentation.
Usage:
    python2.7 build_core_space.py [options] [config_file]

The options are:
- -i, --input input_file_prefix
Prefix of the input files.
- --input_format input_format
Input format of the file containing co-occurrence counts: one of sm (sparse matrix), dm (dense matrix), pkl (pickle). See information about the input formats.
- -o, --output directory
Output directory. For each specification of space creation parameters, a file named CORE_SS.inputname.parameters.format is written to this directory. For example, if the input co-occurrence file has prefix myfile, then CORE_SS.myfile.pkl contains just the co-occurrence data in pickle format, CORE_SS.myfile.ppmi.pkl contains the corresponding space transformed by Positive PMI, CORE_SS.myfile.nmf_100.pkl is the space reduced to 100 dimensions via NMF, CORE_SS.myfile.ppmi.nmf_100.pkl underwent both transformations, and CORE_SS.myfile.ppmi.nmf_100.dm contains the same information in dense matrix format.
- -w, --weighting weighting_schemes
Comma-separated list of weighting schemes. Example: ppmi,plmi,plog,epmi,none. Optional. See the list of available weighting schemes.
- -r, --reduction dimensionality_reduction_methods
Comma-separated list of dimensionality reduction methods, each with the target dimensionality appended. Example: svd_100,nmf_100,nmf_300,none. Optional. See the list of available dimensionality reduction methods.
- -s, --selection feature_selection_methods
Comma-separated list of feature selection methods, each with the number of features (dimensions) to keep appended. Example: top_sum_2000,top_length_1000. Optional. See the information about currently supported feature selection methods.
- -n, --normalization normalization_methods
Comma-separated list of normalization methods. Example: row,all. Optional. See the information about normalization options.
- --output_format additional_output_format
Additional output format for the semantic space: one of sm (sparse matrix), dm (dense matrix). It will be generated in addition to the default pickle output. Optional.
- --gz True/False
If True, the input file is assumed to be a gzipped archive with .gz extension. Optional, default False.
- -l, --log file
Logger output file. Optional, by default no logging output is produced.
- -h, --help
Displays help message.
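To illustrate how a feature selection specification such as top_sum_2000 might be interpreted, here is a plain-Python sketch. It assumes that "top_sum" means keeping the columns with the largest column sums, and it keeps the surviving columns in their original order. Both points are assumptions, and `parse_selection` and `top_sum_columns` are hypothetical helpers rather than toolkit functions.

```python
def parse_selection(spec):
    """Split a spec such as 'top_sum_2000' into (criterion, k)."""
    prefix, criterion, k = spec.split("_")
    if prefix != "top":
        raise ValueError("unsupported selection spec: %r" % spec)
    return criterion, int(k)

def top_sum_columns(counts, columns, k):
    """Keep the k columns with the largest column sums, in original order."""
    sums = [sum(col) for col in zip(*counts)]
    # Rank column indices by descending sum; ties go to the earlier column.
    ranked = sorted(range(len(columns)), key=lambda j: (-sums[j], j))
    keep = sorted(ranked[:k])
    kept_columns = [columns[j] for j in keep]
    kept_counts = [[row[j] for j in keep] for row in counts]
    return kept_counts, kept_columns

criterion, k = parse_selection("top_sum_2")
toy_counts = [[5, 1, 3], [2, 0, 4]]
print(top_sum_columns(toy_counts, ["red", "blue", "fast"], k))
```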
Examples:
    python2.7 build_core_space.py -i ../examples/data/in/ex01 --input_format sm -o ../examples/data/out/
    python2.7 build_core_space.py -i ../examples/data/in/ex01 --input_format sm --output_format dm -w ppmi,plog -r svd_2 -n none,row -o ../examples/data/out/ -l ../examples/data/out/ex01.log
    #or
    python2.7 build_core_space.py ../examples/data/in/config1.cfg
    python2.7 build_core_space.py ../examples/data/in/config2.cfg
As the last two examples show, the script can also read its parameters from configuration files such as these.
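The file naming scheme described for the -o option can be sketched as a small function that enumerates one output name per combination of weighting scheme and reduction method. This is only an illustration: `core_space_names` is a hypothetical helper, it assumes that "none" contributes no name segment, and it ignores any segments that feature selection or normalization may add.

```python
from itertools import product

def core_space_names(prefix, weightings, reductions, fmt="pkl"):
    """List CORE_SS.<prefix>.<parameters>.<fmt> names for every combination."""
    names = []
    for weighting, reduction in product(weightings, reductions):
        # Assumption: a "none" setting leaves no trace in the file name.
        parameters = [p for p in (weighting, reduction) if p != "none"]
        names.append(".".join(["CORE_SS", prefix] + parameters + [fmt]))
    return names

for name in core_space_names("myfile", ["none", "ppmi"], ["none", "nmf_100"]):
    print(name)
```

With the inputs above, the loop yields the four pickle names used as examples in the description of -o.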
Usage:
    python2.7 build_peripheral_space.py [options] [config_file]

The options are:
- -i, --input input_file_prefix
Prefix of the input files.
- --input_format input_format
Input format of the file containing co-occurrence counts: one of sm (sparse matrix), dm (dense matrix), pkl (pickle).
- -o, --output directory
Output directory. The files left in it have a name beginning with PER_SS.input_file_prefix., followed by the corresponding core space file names.
- -c, --core core_space_file
File containing the core space. Pickle format (and .pkl extension) required. One of -c or --core_in_dir has to be provided.
- --core_in_dir directory_of_core_space_files
If provided, it loads all space files found in the directory and computes the current space as peripheral to each of them. One of -c or --core_in_dir has to be provided.
- --core_filter filter_string
If core_in_dir is provided, it acts as a filter on the file names contained in it: file names not containing the filter string are ignored. Optional, by default no filter is applied.
- --output_format additional_output_format
Additional output format for the semantic space: one of sm (sparse matrix), dm (dense matrix). This is in addition to default pickle output. Optional.
- --gz True/False
If True, the input file is assumed to be a gzipped archive with .gz extension. Optional, default False.
- -l, --log file
Logger output file. Optional, by default no logging output is produced.
- -h, --help
Displays help message.
Example:
    python2.7 build_peripheral_space.py -i ../examples/data/in/ex05 --input_format sm -o ../examples/data/out/ -c ../examples/data/out/CORE_SS.ex01.ppmi.svd_2.pkl
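The interplay of --core_in_dir, --core_filter and the PER_SS output naming can be sketched as follows. The function works on a list of file names rather than a real directory so that it stays self-contained. `peripheral_outputs` is a hypothetical helper, and restricting the candidates to .pkl files is an assumption based on the pickle requirement stated for -c.

```python
def peripheral_outputs(input_prefix, core_files, core_filter=None):
    """Return the PER_SS output name for each core space file that is selected."""
    outputs = []
    for name in sorted(core_files):
        if not name.endswith(".pkl"):
            continue  # assumption: only pickled core spaces are considered
        if core_filter is not None and core_filter not in name:
            continue  # names not containing the filter string are ignored
        outputs.append("PER_SS.%s.%s" % (input_prefix, name))
    return outputs

directory_listing = [
    "CORE_SS.ex01.pkl",
    "CORE_SS.ex01.ppmi.svd_2.pkl",
    "ex01.log",
]
print(peripheral_outputs("ex05", directory_listing, core_filter="ppmi"))
```

Without a filter, every pickled core space in the listing would produce its own peripheral space file.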