Main features
Embed in Python code or use command-line tools
DISSECT is written in Python. Users with basic familiarity with the language can use DISSECT directly inside their own scripts. However, DISSECT also provides many standard functionalities through a set of powerful command-line tools. For example, building several semantic spaces at once, weighted with either positive PMI or positive log and reduced to 200 dimensions by SVD or NMF, is as simple as:
python2.7 build_core_space.py -i input-matrix --input_format sm \
    --output_format dm -w ppmi,plog -r svd_200,nmf_200 \
    -o outdir
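For users who prefer to stay inside Python, the same kind of pipeline can be scripted against the DISSECT API. The following is a minimal sketch covering only the PPMI/SVD combination, written in the style of the DISSECT tutorial; the file names are placeholders:

# build a core space from sparse-format co-occurrence counts,
# reweight it with positive PMI and reduce it to 200 dimensions
from composes.semantic_space.space import Space
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting
from composes.transformation.dim_reduction.svd import Svd
from composes.utils import io_utils

core_space = Space.build(data="input-matrix.sm",
                         rows="input-matrix.rows",
                         cols="input-matrix.cols",
                         format="sm")

# each apply() returns a new, transformed space
reduced_space = core_space.apply(PpmiWeighting()).apply(Svd(200))

# save the space for later reuse (placeholder output path)
io_utils.save(reduced_space, "outdir/core_space.pkl")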
Semantic space creation from co-occurrence matrices
The vectors representing words or other linguistic units in a semantic space ultimately encode values derived from co-occurrence counts extracted from corpora (or other sources). The pipeline from corpora to semantic spaces can be roughly split into two major steps: pre-processing the corpus to collect the relevant counts, and processing the extracted counts mathematically. The first step is highly language- and project-dependent: how do you tokenize the corpus? Which elements and linguistic contexts do you count? And so on. DISSECT does not handle pre-processing or counting, but directly takes a (dense or sparse) matrix of co-occurrence counts as input. This lets us focus on the more general mathematical side, where DISSECT provides various association measures for reweighting the counts, dimensionality reduction (supporting not only singular value decomposition but also the less commonly implemented non-negative matrix factorization), and the creation of multiple semantic spaces with a single command-line call.
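For concreteness, each line of DISSECT's sparse input format (the sm format used above) holds one space- or tab-delimited "target context count" triple, with companion .rows and .cols files listing the target and context elements; the toy counts below are invented for illustration:

maggot slimy-j 15
maggot eat-v 28
meat rotten-j 40
meat eat-v 12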
Composition functions
The core purpose of DISSECT is to make it easy to use a wide range of vector composition functions that have been proposed in the literature, including those of Mitchell and Lapata (2010), of Baroni and Zamparelli (2010) (which we call Lexical Function) and of Guevara (2010) (which we call Full Additive).
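As a sketch of how a composition function is applied in practice (following the conventions of the DISSECT tutorial; the word pairs are invented and my_space stands for any previously built semantic space containing the input words):

# compose phrase vectors with a weighted additive model,
# one of the Mitchell and Lapata (2010) functions: p = alpha*u + beta*v
from composes.composition.weighted_additive import WeightedAdditive

add_model = WeightedAdditive(alpha=1, beta=1)  # plain vector addition

# each triple is (word1, word2, label for the composed vector)
composed_space = add_model.compose([("slimy", "maggot", "slimy_maggot"),
                                    ("rotten", "meat", "rotten_meat")],
                                   my_space)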
Function estimation
Generalizing the estimation methods of Baroni and Zamparelli and of Guevara, we estimate the parameters of composition functions by approximating corpus-extracted vectors that exemplify their outputs. For example, to optimize an adjective-noun composition function, we minimize (in a least-squares sense) the distance between the vectors the function produces for phrases such as rotten meat, carnivorous zombie and toxic waste and vectors for the same phrases extracted directly from the corpus (or obtained in some other way). We have a paper, currently under review, that explains this approach in detail: please contact us if you want a copy.
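In the toolkit, this estimation step amounts to calling a model's train method on triples that pair input words with corpus-extracted phrase vectors. A minimal sketch with a Full Additive model, where arg_space and phrase_space stand for pre-built word and phrase spaces:

# estimate a Full Additive model (p = A*u + B*v) by least squares
from composes.composition.full_additive import FullAdditive

# (adjective, noun, observed phrase) triples; the phrase labels
# must name rows of phrase_space, the words rows of arg_space
train_data = [("rotten", "meat", "rotten_meat"),
              ("carnivorous", "zombie", "carnivorous_zombie"),
              ("toxic", "waste", "toxic_waste")]

fa_model = FullAdditive()
fa_model.train(train_data, arg_space, phrase_space)

# the trained model can then compose unseen pairs
new_space = fa_model.compose([("slimy", "maggot", "slimy_maggot")], arg_space)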
Peripheral elements
Lexical semantic spaces assume a finite vocabulary of words, each represented by vectors reflecting disjoint distributions in the source corpus. However, compositional spaces contain a potentially infinite number of vectors representing phrases and sentences, and such vectors will not record independent co-occurrences: for example, the vector for slimy maggot will presumably record a subset of the contexts that are also recorded in the vector of maggot. Since both weighting schemes and dimensionality reduction techniques depend on the set of target and context elements in a co-occurrence matrix, these redundancies risk biasing them in unwanted ways: for example, all maggot contexts that are also slimy maggot contexts would be artificially counted twice, distorting the computation of association measures.
We solve this problem by allowing the user to specify a separate co-occurrence matrix for peripheral elements (typically, phrases or sentences), whose values are weighted and reduced using global statistics collected from the core elements only (typically, single words), and which are in turn unaffected by the peripheral elements. In this way, you can keep adding new elements to the space without worrying about how they affect its overall characteristics.
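In the toolkit, this corresponds to building a peripheral space on top of an existing core space. A minimal sketch with placeholder file names (the peripheral matrix must be defined over the same context columns as the core space):

# add phrase vectors as peripheral elements of a previously built core space
from composes.semantic_space.peripheral_space import PeripheralSpace

# the phrase counts are transformed with the weighting statistics and the
# projection already fixed by core_space, leaving the core space untouched
per_space = PeripheralSpace.build(core_space,
                                  data="phrases.sm",
                                  cols="phrases.cols",
                                  format="sm")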