Practice with realistic data

To get better acquainted with the toolkit, we recommend trying the following exercises, which show how to perform a typical series of steps in Compositional Distributional Semantics with a realistically sized dataset. This is also a chance for users to get a glimpse of how DISSECT works on real data.

The practice is based on a small dataset from our own experiments that includes:

  • Co-occurrence counts for nouns and verbs, extracted from the Wikipedia, BNC and ukWaC corpora (core.sm). The two files core.rows and core.cols contain the lists of target words and contexts, respectively
  • Co-occurrence counts for subject-verb phrases (sv.sm, sv.rows, sv.cols), also extracted from Wikipedia, BNC and ukWaC
  • The subject-intransitive verb dataset of Mitchell and Lapata (2008), modified to fit our format (gold.txt). This dataset is used for evaluating compositional models
  • A list of noun-verb phrases used in training compositional models (training_pairs.txt) and a list of the subject-intransitive verb phrases that appear in Mitchell and Lapata’s dataset (testing_pairs.txt)

We also provide solutions for each exercise, both in Python and with the command-line tools.

DATA_PATH in the exercises below indicates the directory where you downloaded and decompressed the dataset. You should modify the paths in the solutions to match your system.
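
All co-occurrence files are in DISSECT's sparse matrix (sm) format, with one "row-word column-word count" triple per line. As a purely illustrative sketch (these words and counts are invented, not taken from the dataset), the first lines of core.sm could look like:

eat-v food-n 2350
eat-v meal-n 681
book-n read-v 1204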

Exercise 1

BUILD CORE SEMANTIC SPACE
1. Create a semantic space from the co-occurrence counts in DATA_PATH/core.sm, using the words in DATA_PATH/core.rows
as rows and the words in DATA_PATH/core.cols as columns
2. Apply PPMI weighting to it (see the note after this list)
3. Apply SVD-500 to it
4. Save the space in pickle format (.pkl)
5. Print the top 10 neighbors of "eat-v" in this space
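
A note on step 2: PPMI (positive pointwise mutual information) weighting replaces each raw count with the pointwise mutual information of target word $w$ and context $c$, clipped at zero, which down-weights contexts that co-occur frequently with everything:

$$\mathrm{ppmi}(w,c) = \max\!\left(0,\; \log\frac{P(w,c)}{P(w)\,P(c)}\right)$$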

Exercise 2

BUILD PERIPHERAL SEMANTIC SPACE
1. Create a semantic space for noun-verb phrases on top of the core semantic space from Exercise 1,
using the counts in DATA_PATH/sv.sm, the dimensions in DATA_PATH/sv.cols and the rows in DATA_PATH/sv.rows
2. Print the top 10 neighbors of "delivery-n_pay-v" in this space
3. Print the top 10 neighbors of "delivery-n_pay-v" in the core space
4. Save the space in pickle format (.pkl)

A peripheral space re-applies the transformations recorded in the core space, here PPMI weighting and the SVD projection, to the new phrase rows, so the two spaces end up with the same dimensions and their vectors can be compared.

Exercise 3

TRAIN AND APPLY A COMPOSITION MODEL
1. Load training data from the file DATA_PATH/training_pairs.txt
2. Load the core space (argument space) from Exercise 1 and the peripheral space (observed phrase space) from Exercise 2
3. Train a lexical function model on the two spaces using Ridge Regression with lambda=2 (see the sketch after this list)
4. Load testing pairs from DATA_PATH/testing_pairs.txt (the list of elements to be composed)
5. Apply the trained lexical function model to these pairs and save the resulting phrase space
6. Print the top 10 neighbors of "conflict-n_erupt-v" in the composed phrase space
7. Print the top 10 neighbors of "conflict-n_erupt-v" in the argument space
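
A sketch of what step 3 estimates, following the lexical function approach of Baroni and Zamparelli (2010): each verb $v$ is represented as a matrix $A_v$ that maps a subject noun vector $\mathbf{n}$ to a phrase vector, $\mathbf{p} = A_v\,\mathbf{n}$. Ridge regression fits $A_v$ to the observed (noun, phrase) vector pairs drawn from the two spaces, roughly:

$$\hat{A}_v = \operatorname*{arg\,min}_{A} \sum_i \lVert A\,\mathbf{n}_i - \mathbf{p}_i \rVert^2 + \lambda \lVert A \rVert^2, \qquad \lambda = 2$$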

Exercise 4

EVALUATE A COMPOSITIONAL MODEL
1. Load the composed space you saved in Exercise 3
2. Use this space to compute the similarities of the pairs in DATA_PATH/gold.txt (columns 1 and 2 contain the word pairs)
3. Evaluate the similarities against the gold standard scores in column 3 of DATA_PATH/gold.txt, using Pearson correlation (see the note after this list)
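
Pearson correlation measures the linear agreement between the model's similarity scores $x_i$ and the human gold-standard scores $y_i$:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$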

Python solutions

Exercise 1

from composes.semantic_space.space import Space
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting
from composes.transformation.dim_reduction.svd import Svd
from composes.transformation.feature_selection.top_feature_selection import TopFeatureSelection
from composes.similarity.cos import CosSimilarity
from composes.utils import io_utils
from composes.utils import log_utils



if __name__ == '__main__':
    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    core_cooccurrence_file = data_path + "core.sm"
    core_row_file = data_path + "core.rows"
    core_col_file = data_path + "core.cols"
    core_space_file = data_path + "core.pkl"
    
    # config log file
    log_utils.config_logging(log_file)
    
    print "Building semantic space from co-occurrence counts"
    core_space = Space.build(data=core_cooccurrence_file, rows=core_row_file,
                             cols=core_col_file, format="sm")
    
    print "Applying ppmi weighting"
    core_space = core_space.apply(PpmiWeighting())
    # print "Applying feature selection"
    # core_space = core_space.apply(TopFeatureSelection(5000))
    print "Applying svd 500"
    core_space = core_space.apply(Svd(500))
    
    print "Saving the semantic space"
    io_utils.save(core_space, core_space_file)
    
    print "Finding 10 neighbors of \"eat-v\""
    neighbors = core_space.get_neighbours("eat-v", 10, CosSimilarity())
    print neighbors
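
Once saved, the space can be reloaded and queried for pairwise similarities as well. A minimal sketch using the toolkit's get_sim method, assuming the same paths as above (the word pair is only illustrative):

from composes.semantic_space.space import Space
from composes.similarity.cos import CosSimilarity
from composes.utils import io_utils

# reload the space saved above and compare two of its row words
core_space = io_utils.load("/home/dissect/demo/core.pkl", Space)
print core_space.get_sim("eat-v", "drink-v", CosSimilarity())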
    

Exercise 2

from composes.semantic_space.space import Space
from composes.semantic_space.peripheral_space import PeripheralSpace
from composes.similarity.cos import CosSimilarity
from composes.utils import io_utils, log_utils
import os

if __name__ == '__main__':
    
    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    core_space_file = data_path + "core.pkl"
    per_cooccurrence_file = data_path + "sv.sm"
    per_row_file = data_path + "sv.rows"
    per_col_file = data_path + "sv.cols"
    per_space_file = data_path + "sv.pkl"
    
    # config log file
    log_utils.config_logging(log_file)
    
    print "Building peripheral space"
    core_space = io_utils.load(core_space_file, Space)
    per_space = PeripheralSpace.build(core_space, data=per_cooccurrence_file, cols=per_col_file,
                                      rows=per_row_file, format="sm")
    
    print "Saving peripheral space"
    io_utils.save(per_space, per_space_file)
    
    print "Finding neighbors of \"delivery-n_pay-v\" in the peripheral space"
    neighbors = per_space.get_neighbours("delivery-n_pay-v", 10, CosSimilarity())
    print neighbors
    
    print "Finding neighbors of \"delivery-n_pay-v\" in the core space"
    neighbors = per_space.get_neighbours("delivery-n_pay-v", 10, CosSimilarity(), core_space)
    print neighbors
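
To check which phrases actually made it into the peripheral space, its rows can be listed directly. A small sketch, assuming the space object exposes its row strings through the id2row list (continuing from the script above):

# the rows of the peripheral space are the subject-verb phrases
print per_space.id2row[:5]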
    
    
    

Exercise 3

from composes.utils import io_utils, log_utils
from composes.composition.lexical_function import LexicalFunction
from composes.utils import regression_learner
from composes.similarity.cos import CosSimilarity

if __name__ == '__main__':
    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    core_space_file = data_path + "core.pkl"
    per_space_file = data_path + "sv.pkl"
    training_pair_file = data_path + "training_pairs.txt"
    testing_pair_file = data_path + "testing_pairs.txt"
    composed_space_file = data_path + "composed.pkl"
    
    # config log file
    log_utils.config_logging(log_file)
    
    print "Reading in train data"
    train_data = io_utils.read_tuple_list(training_pair_file, fields=[0,1,2])
    
    print "Training Lexical Function compositional model"
    core_space = io_utils.load(core_space_file)
    per_space = io_utils.load(per_space_file)
    
    # comp_model = WeightedAdditive()
    comp_model = LexicalFunction(learner=regression_learner.RidgeRegressionLearner(param=2))
    comp_model.train(train_data, core_space, per_space)
    
    print "Composing phrases"
    test_phrases = io_utils.read_tuple_list(testing_pair_file, fields=[0,1,2])
    composed_space = comp_model.compose(test_phrases, core_space)
    
    print "Saving composed space"
    io_utils.save(composed_space, composed_space_file)
    
    print "Finding neighbors of \"conflict-n_erupt-v\" in the composed space"
    neighbors = composed_space.get_neighbours("conflict-n_erupt-v", 10, CosSimilarity())
    print neighbors
    
    print "Finding neighbors of \"conflict-n_erupt-v\" in the core space"
    neighbors = composed_space.get_neighbours("conflict-n_erupt-v", 10, CosSimilarity(), core_space)
    print neighbors
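
The commented-out WeightedAdditive() line above hints that other DISSECT composition models plug into the same pipeline. As a sketch, reusing train_data, core_space, per_space and test_phrases from the script above, a weighted additive model (the phrase vector as a weighted sum of the two input vectors) would be trained and applied like this:

from composes.composition.weighted_additive import WeightedAdditive

# fits scalar weights alpha and beta such that
#   phrase ~ alpha * subject_vector + beta * verb_vector
add_model = WeightedAdditive()
add_model.train(train_data, core_space, per_space)
add_composed_space = add_model.compose(test_phrases, core_space)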

Exercise 4

from composes.utils import io_utils, scoring_utils, log_utils
from composes.similarity.cos import CosSimilarity

if __name__ == '__main__':
    # set constants
    data_path = "/home/dissect/demo/"
    log_file = data_path + "all.log"
    composed_space_file = data_path + "composed.pkl"
    gold_standard_file = data_path + "gold.txt"
    
    # config log file
    log_utils.config_logging(log_file)
    
    print "Reading similarity test data..."
    test_pairs = io_utils.read_tuple_list(gold_standard_file, fields=[0,1])
    gold = io_utils.read_list(gold_standard_file, field=2)
    
    print "Loading composed space"
    composed_space = io_utils.load(composed_space_file)
    print "Computing similarity with lexical function..."
    pred = composed_space.get_sims(test_pairs, CosSimilarity())
    
    print "Scoring lexical function based on computed similarities..."
    print scoring_utils.score(gold, pred, "pearson")
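
If a rank-based measure is preferred, and assuming scoring_utils.score also accepts "spearman" (as the toolkit's documentation suggests), the same predictions can be re-scored in one line:

# Spearman rank correlation between gold scores and predictions
print scoring_utils.score(gold, pred, "spearman")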

Command-line solutions

Exercise 1

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log

#**************************************************************************************
echo exercise 1
echo STARTING BUILDING CORE
export CORE_IN_FILE_PREFIX=core
export CORE_SPC=CORE_SS.core.ppmi.svd_500.pkl

# run build core space pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/build_core_space.py -i $DATA_DIR/$CORE_IN_FILE_PREFIX --input_format=sm -o $OUT_DIR -w ppmi -r svd_500 -l $LOG_FILE
echo FINISHED BUILDING CORE SPACE

# find neighbors
echo "eat-v" > $DATA_DIR/word_list.txt
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$CORE_SPC -o $OUT_DIR -m cos

Exercise 2

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log

#**************************************************************************************
echo exercise 2
echo STARTING PERIPHERAL PIPELINE

export CORE_SPC=CORE_SS.core.ppmi.svd_500.pkl
export PER_RAW_FILE=$DATA_DIR/sv
export PER_SPC=PER_SS.sv.CORE_SS.core.ppmi.svd_500.pkl

# run build peripheral space pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/build_peripheral_space.py -i $PER_RAW_FILE --input_format sm -c $OUT_DIR/$CORE_SPC -o $OUT_DIR -l $LOG_FILE
echo FINISHED BUILDING PERIPHERAL SPACE

# find neighbors
echo "delivery-n_pay-v" > $DATA_DIR/word_list.txt
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$PER_SPC -o $OUT_DIR -m cos
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$PER_SPC,$OUT_DIR/$CORE_SPC  -o $OUT_DIR -m cos

Exercise 3

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log

#**************************************************************************************
echo exercise 3

export CORE_SPC=CORE_SS.core.ppmi.svd_500.pkl
export PER_SPC=PER_SS.sv.CORE_SS.core.ppmi.svd_500.pkl
export TRAIN_FILE=$DATA_DIR/training_pairs.txt
export TRNED_MODEL=TRAINED_COMP_MODEL.lexical_func.training_pairs.txt.pkl
export COMP_FILE=$DATA_DIR/testing_pairs.txt
export COMP_SPC=COMPOSED_SS.LexicalFunction.testing_pairs.txt.pkl

echo STARTING TRAINING
export MODEL=lexical_func

# run training pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/train_composition.py -i $TRAIN_FILE -m $MODEL -o $OUT_DIR -a $OUT_DIR/$CORE_SPC -p $OUT_DIR/$PER_SPC --regression ridge --intercept True --crossvalidation False --lambda 2.0 -l $LOG_FILE

echo FINISHED TRAINING

echo STARTING COMPOSING SPACE
# run apply composition pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/apply_composition.py -i $COMP_FILE --load_model $OUT_DIR/$TRNED_MODEL -o $OUT_DIR -a $OUT_DIR/$CORE_SPC -l $LOG_FILE

# find neighbors
echo "conflict-n_erupt-v" > $DATA_DIR/word_list.txt
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$COMP_SPC -o $OUT_DIR -m cos
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_neighbours.py -i $DATA_DIR/word_list.txt -n 10 -s $OUT_DIR/$COMP_SPC,$OUT_DIR/$CORE_SPC  -o $OUT_DIR -m cos

Exercise 4

# set pythonpath
export PYTHONPATH=/home/dissect/git/toolkit/src:$PYTHONPATH
export PYTHON=/opt/python/bin/python2.7
export TOOLKIT_DIR=/home/dissect/git/toolkit
export OUT_DIR=/home/dissect/demo
export DATA_DIR=/home/dissect/demo
export LOG_FILE=$OUT_DIR/all.log

#**************************************************************************************
echo exercise 4

echo STARTING COMPUTING SIMS

export COMP_SPC=COMPOSED_SS.LexicalFunction.testing_pairs.txt.pkl
export SIM_DIR=$OUT_DIR/similarity
export TEST_FILE=$DATA_DIR/gold.txt

# create output directory for similarity if the directory doesn't exist
if [ ! -d "$SIM_DIR" ]; then
    mkdir $SIM_DIR
fi

# run sim pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/compute_similarities.py -i $TEST_FILE -s $OUT_DIR/$COMP_SPC -o $SIM_DIR -m cos -c 1,2 -l $LOG_FILE

echo FINISHED COMPUTING SIMS
echo STARTING EVAL SIMS

# run evaluation pipeline
$PYTHON $TOOLKIT_DIR/src/pipelines/evaluate_similarities.py --in_dir $SIM_DIR -m pearson -c 3,4 -l $LOG_FILE
echo FINISHED EVAL SIMS