To get better acquainted with the toolkit, we recommend trying the following exercises, which walk through a typical series of steps in Compositional Distributional Semantics on a realistically sized dataset. They are also a chance to get a glimpse of how DISSECT behaves on real data.
The exercises are based on a small dataset from our own experiments, which includes:
- co-occurrence counts for a core space of words (core.sm, with row and column labels in core.rows and core.cols)
- co-occurrence counts for noun-verb phrases (sv.sm, sv.rows, sv.cols)
- training and testing pairs (training_pairs.txt, testing_pairs.txt)
- gold standard similarity scores (gold.txt)
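For reference, the .sm files use DISSECT's sparse matrix format: one co-occurrence triple per line, consisting of a row word, a column word, and a count, separated by whitespace. The lines below are invented for illustration; core.sm looks like:

eat-v meat-n 1380
eat-v apple-n 527
book-n read-v 972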
We also provide solutions for each exercise, both in Python and with the command-line tools.
DATA_PATH in the exercises below stands for the directory where you downloaded and decompressed the dataset. Adjust the paths in the solutions to match your system.
BUILD CORE SEMANTIC SPACE
1. Create a semantic space from the co-occurrence counts in DATA_PATH/core.sm, using the words in DATA_PATH/core.rows as rows and the words in DATA_PATH/core.cols as columns
2. Apply PPMI weighting to it (see the reminder below)
3. Apply SVD-500 to it
4. Save the space in pickle format (.pkl)
5. Print the top 10 neighbors of "eat-v" in this space
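As a reminder, PPMI (positive pointwise mutual information) reweights a raw co-occurrence count between a word w and a context c as

ppmi(w, c) = max(0, log( p(w, c) / (p(w) * p(c)) ))

so that contexts co-occurring no more often than chance would predict receive weight 0. SVD-500 then reduces the weighted matrix to its 500 strongest latent dimensions.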
BUILD PERIPHERAL SEMANTIC SPACE
1. Create a semantic space for noun-verb phrases on top of the core semantic space from exercise 1, using the counts in DATA_PATH/sv.sm, the dimensions in DATA_PATH/sv.cols, and the rows in DATA_PATH/sv.rows (see the sample lines below)
2. Print the top 10 neighbors of "delivery-n_pay-v" in this space
3. Print the top 10 neighbors of "delivery-n_pay-v" in the core space
4. Save the space in pickle format (.pkl)
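The rows of sv.sm are phrase strings following the noun-n_verb-v naming convention used throughout this demo; the counts below are invented for illustration:

delivery-n_pay-v today-n 14
delivery-n_pay-v customer-n 31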
TRAIN AND APPLY A COMPOSITION MODEL
1. Load training data from the file DATA_PATH/training_pairs.txt (its expected format is sketched below)
2. Load the core space (argument space) from exercise 1 and the peripheral space (observed phrase space) from exercise 2
3. Train a lexical function model on the two spaces using Ridge Regression with lambda=2
4. Load the testing pairs from DATA_PATH/testing_pairs.txt (the list of elements to be composed)
5. Apply the trained lexical function model to these pairs and save the resulting phrase space
6. Print the top 10 neighbors of "conflict-n_erupt-v" in the composed phrase space
7. Print the top 10 neighbors of "conflict-n_erupt-v" in the argument space
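Both pair files contain three whitespace-separated columns. If the dataset follows DISSECT's usual convention for the lexical function model, the first column is the functor (here the verb), the second its argument noun, and the third the name of the resulting phrase; the lines below are invented for illustration:

erupt-v conflict-n conflict-n_erupt-v
pay-v delivery-n delivery-n_pay-v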
EVALUATE A COMPOSITIONAL MODEL
1. Load the composed space you saved in exercise 3
2. Use this space to compute the similarities of the pairs in DATA_PATH/gold.txt (columns 1 and 2 contain the pairs; a sample is shown below)
3. Evaluate the computed similarities against the gold standard scores in column 3 of DATA_PATH/gold.txt, using Pearson correlation
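Each line of gold.txt pairs two phrase strings and gives their gold standard similarity score in the third column; the values below are invented for illustration:

conflict-n_erupt-v war-n_begin-v 6.1
delivery-n_pay-v food-n_eat-v 2.3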
Python solution for exercise 1:

from composes.semantic_space.space import Space
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting
from composes.transformation.dim_reduction.svd import Svd
from composes.transformation.feature_selection.top_feature_selection import TopFeatureSelection
from composes.similarity.cos import CosSimilarity
from composes.utils import io_utils
from composes.utils import log_utils

if __name__ == '__main__':

    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    core_cooccurrence_file = data_path + "core.sm"
    core_row_file = data_path + "core.rows"
    core_col_file = data_path + "core.cols"
    core_space_file = data_path + "core.pkl"

    # config log file
    log_utils.config_logging(log_file)

    print "Building semantic space from co-occurrence counts"
    core_space = Space.build(data=core_cooccurrence_file,
                             rows=core_row_file,
                             cols=core_col_file,
                             format="sm")

    print "Applying ppmi weighting"
    core_space = core_space.apply(PpmiWeighting())

    # print "Applying feature selection"
    # core_space = core_space.apply(TopFeatureSelection(5000))

    print "Applying svd 500"
    core_space = core_space.apply(Svd(500))

    print "Saving the semantic space"
    io_utils.save(core_space, core_space_file)

    print "Finding 10 neighbors of \"eat-v\""
    neighbors = core_space.get_neighbours("eat-v", 10, CosSimilarity())
    print neighbors
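Besides listing neighbours, a space can score an individual word pair directly with get_sim. A minimal follow-up sketch, assuming both words occur in core.rows ("food-n" is a hypothetical row used for illustration):

# cosine similarity between two rows of the core space
# ("food-n" is a hypothetical word; substitute any word from core.rows)
print core_space.get_sim("eat-v", "food-n", CosSimilarity())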
Python solution for exercise 2:

from composes.semantic_space.space import Space
from composes.semantic_space.peripheral_space import PeripheralSpace
from composes.similarity.cos import CosSimilarity
from composes.utils import io_utils, log_utils

if __name__ == '__main__':

    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    core_space_file = data_path + "core.pkl"
    per_cooccurrence_file = data_path + "sv.sm"
    per_row_file = data_path + "sv.rows"
    per_col_file = data_path + "sv.cols"
    per_space_file = data_path + "sv.pkl"

    # config log file
    log_utils.config_logging(log_file)

    print "Building peripheral space"
    core_space = io_utils.load(core_space_file, Space)
    per_space = PeripheralSpace.build(core_space,
                                      data=per_cooccurrence_file,
                                      cols=per_col_file,
                                      rows=per_row_file,
                                      format="sm")

    print "Saving peripheral space"
    io_utils.save(per_space, per_space_file)

    print "Finding neighbors of \"delivery-n_pay-v\" in the peripheral space"
    neighbors = per_space.get_neighbours("delivery-n_pay-v", 10, CosSimilarity())
    print neighbors

    print "Finding neighbors of \"delivery-n_pay-v\" in the core space"
    neighbors = per_space.get_neighbours("delivery-n_pay-v", 10, CosSimilarity(), core_space)
    print neighbors
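get_neighbours and get_sim both accept a second space as their last argument; this is how the solution above searches the core space for neighbours of a peripheral phrase. The same mechanism scores a phrase against a single word. A minimal sketch, assuming "pay-v" occurs in core.rows:

# cosine similarity between a phrase row of the peripheral space
# and a word row of the core space ("pay-v" assumed to be in core.rows)
print per_space.get_sim("delivery-n_pay-v", "pay-v", CosSimilarity(), core_space)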
Python solution for exercise 3:

from composes.utils import io_utils, log_utils
from composes.composition.lexical_function import LexicalFunction
from composes.utils import regression_learner
from composes.similarity.cos import CosSimilarity

if __name__ == '__main__':

    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    core_space_file = data_path + "core.pkl"
    per_space_file = data_path + "sv.pkl"
    training_pair_file = data_path + "training_pairs.txt"
    testing_pair_file = data_path + "testing_pairs.txt"
    composed_space_file = data_path + "composed.pkl"

    # config log file
    log_utils.config_logging(log_file)

    print "Reading in train data"
    train_data = io_utils.read_tuple_list(training_pair_file, fields=[0, 1, 2])

    print "Training Lexical Function compositional model"
    core_space = io_utils.load(core_space_file)
    per_space = io_utils.load(per_space_file)
    # comp_model = WeightedAdditive()
    comp_model = LexicalFunction(learner=regression_learner.RidgeRegressionLearner(param=2))
    comp_model.train(train_data, core_space, per_space)

    print "Composing phrases"
    test_phrases = io_utils.read_tuple_list(testing_pair_file, fields=[0, 1, 2])
    composed_space = comp_model.compose(test_phrases, core_space)

    print "Saving composed space"
    io_utils.save(composed_space, composed_space_file)

    print "Finding neighbors of \"conflict-n_erupt-v\" in the composed space"
    neighbors = composed_space.get_neighbours("conflict-n_erupt-v", 10, CosSimilarity())
    print neighbors

    print "Finding neighbors of \"conflict-n_erupt-v\" in the core space"
    neighbors = composed_space.get_neighbours("conflict-n_erupt-v", 10, CosSimilarity(), core_space)
    print neighbors
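The commented-out WeightedAdditive line points at a simpler baseline: composing a phrase as a weighted sum of its parts. A minimal sketch of swapping it in, trained on the same triples and spaces as above:

from composes.composition.weighted_additive import WeightedAdditive

# additive baseline: phrase = alpha * functor + beta * argument,
# with alpha and beta estimated from the training data
comp_model = WeightedAdditive()
comp_model.train(train_data, core_space, per_space)
composed_space = comp_model.compose(test_phrases, core_space)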
Python solution for exercise 4:

from composes.utils import io_utils, scoring_utils, log_utils
from composes.similarity.cos import CosSimilarity

if __name__ == '__main__':

    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    composed_space_file = data_path + "composed.pkl"
    gold_standard_file = data_path + "gold.txt"

    # config log file
    log_utils.config_logging(log_file)

    print "Reading similarity test data..."
    test_pairs = io_utils.read_tuple_list(gold_standard_file, fields=[0, 1])
    gold = io_utils.read_list(gold_standard_file, field=2)

    print "Loading composed space"
    composed_space = io_utils.load(composed_space_file)

    print "Computing similarity with lexical function..."
    pred = composed_space.get_sims(test_pairs, CosSimilarity())

    print "Scoring lexical function based on computed similarities..."
    print scoring_utils.score(gold, pred, "pearson")
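Pearson correlation measures linear agreement with the gold scores; if only the ranking of the pairs matters, the same helper also supports Spearman:

print scoring_utils.score(gold, pred, "spearman")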
Command-line solution for exercise 1:

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log
#**************************************************************************************
echo exercise 1

echo STARTING BUILDING CORE SPACE

export CORE_IN_FILE_PREFIX=core
export CORE_SPC=CORE_SS.core.ppmi.svd_500.pkl

# run build core space pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/build_core_space.py -i $DATA_DIR/$CORE_IN_FILE_PREFIX --input_format=sm -o $OUT_DIR -w ppmi -r svd_500 -l $LOG_FILE

echo FINISHED BUILDING CORE SPACE

# find neighbors
echo "eat-v" > $DATA_DIR/word_list.txt
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$CORE_SPC -o $OUT_DIR -m cos
Command-line solution for exercise 2:

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log
#**************************************************************************************
echo exercise 2

echo STARTING PERIPHERAL PIPELINE

export CORE_SPC=CORE_SS.core.ppmi.svd_500.pkl
export PER_RAW_FILE=$DATA_DIR/sv
export PER_SPC=PER_SS.sv.CORE_SS.core.ppmi.svd_500.pkl

# run build peripheral space pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/build_peripheral_space.py -i $PER_RAW_FILE --input_format sm -c $OUT_DIR/$CORE_SPC -o $OUT_DIR -l $LOG_FILE

echo FINISHED BUILDING PERIPHERAL SPACE

# find neighbors, first in the peripheral space, then in the core space
echo "delivery-n_pay-v" > $DATA_DIR/word_list.txt
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$PER_SPC -o $OUT_DIR -m cos
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$PER_SPC,$OUT_DIR/$CORE_SPC -o $OUT_DIR -m cos
Command-line solution for exercise 3:

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log
#**************************************************************************************
echo exercise 3

export CORE_SPC=CORE_SS.core.ppmi.svd_500.pkl
export PER_SPC=PER_SS.sv.CORE_SS.core.ppmi.svd_500.pkl
export TRAIN_FILE=$DATA_DIR/training_pairs.txt
export TRNED_MODEL=TRAINED_COMP_MODEL.lexical_func.training_pairs.txt.pkl
export COMP_FILE=$DATA_DIR/testing_pairs.txt
export COMP_SPC=COMPOSED_SS.LexicalFunction.testing_pairs.txt.pkl

echo STARTING TRAINING
export MODEL=lexical_func

# run training pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/train_composition.py -i $TRAIN_FILE -m $MODEL -o $OUT_DIR -a $OUT_DIR/$CORE_SPC -p $OUT_DIR/$PER_SPC --regression ridge --intercept True --crossvalidation False --lambda 2.0 -l $LOG_FILE

echo FINISHED TRAINING

echo STARTING COMPOSING SPACE

# run apply composition pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/apply_composition.py -i $COMP_FILE --load_model $OUT_DIR/$TRNED_MODEL -o $OUT_DIR -a $OUT_DIR/$CORE_SPC -l $LOG_FILE

# find neighbors, first in the composed space, then in the core space
echo "conflict-n_erupt-v" > $DATA_DIR/word_list.txt
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$COMP_SPC -o $OUT_DIR -m cos
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$COMP_SPC,$OUT_DIR/$CORE_SPC -o $OUT_DIR -m cos
Command-line solution for exercise 4:

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log
#**************************************************************************************
echo exercise 4

echo STARTING COMPUTING SIMS

export COMP_SPC=COMPOSED_SS.LexicalFunction.testing_pairs.txt.pkl
export SIM_DIR=$OUT_DIR/similarity
export TEST_FILE=$DATA_DIR/gold.txt

# create the output directory for similarity files if it doesn't exist
if [ ! -d "$SIM_DIR" ]; then
    mkdir "$SIM_DIR"
fi

# run similarity pipeline: predictions are appended as an extra column
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_similarities.py -i $TEST_FILE -s $OUT_DIR/$COMP_SPC -o $SIM_DIR -m cos -c 1,2 -l $LOG_FILE

echo FINISHED COMPUTING SIMS

echo STARTING EVALUATING SIMS

# run evaluation pipeline: column 3 holds the gold scores, column 4 the predictions
$PYTHON $TOOLKIT_DIR/src/pipelines/evaluate_similarities.py --in_dir $SIM_DIR -m pearson -c 3,4 -l $LOG_FILE

echo FINISHED EVALUATING SIMS