The input required for creating a semantic space (starting from text data, rather than pickles) consists of three files: a row file, a column file and a file containing co-occurrence counts (or other numerical values representing the association of rows and columns). For command-line usage, these three files are required to have the same base name and differ only in their file extensions.
The same matrix formats are used by the toolkit when generating textual output matrices.
A row file is a text file consisting of a list of strings, each string corresponding to a row in the matrix. The extension for this file is .rows.
Example:
man
woman
child
Similarly, a column file contains a list of strings, where each string corresponds to a column in the matrix. The extension for this file is .cols.
Example:
toy
tv
book
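As a rough illustration, the labels in a row or column file could be read with a few lines of Python. The helper name read_labels is hypothetical and not part of the toolkit, and the sketch assumes that labels contain no whitespace, so splitting on whitespace yields one label per entry.

```python
def read_labels(text):
    """Return the whitespace-separated labels found in the text of a
    .rows or .cols file (assumes labels contain no whitespace)."""
    return text.split()


# Contents of the example row and column files shown above.
rows = read_labels("man\nwoman\nchild\n")
cols = read_labels("toy\ntv\nbook\n")
print(rows)
print(cols)
```

In practice the text would come from reading the file, for example with open(path).read(), where path is whatever file name you are using.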
A matrix file in dense format contains m rows and n + 1 columns, where (m, n) is the shape of the input matrix. Each line contains a row string followed by its associated vector (n fields). The extension for this file is .dm.
Example:
man 3 5 0
woman 0 5 6
child 43 0 0
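The dense layout described above (a row string followed by n numeric fields on each line) can be parsed with a short sketch like the following. The function name parse_dense is hypothetical, not a toolkit function, and the sketch assumes whitespace-separated fields and row strings without internal whitespace.

```python
def parse_dense(lines):
    """Split dense-format lines into row labels and numeric vectors."""
    labels, vectors = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        labels.append(fields[0])                        # first field: row string
        vectors.append([float(x) for x in fields[1:]])  # remaining n fields
    return labels, vectors


# The three lines of the dense example, one matrix row per line.
dense_lines = ["man 3 5 0", "woman 0 5 6", "child 43 0 0"]
labels, vectors = parse_dense(dense_lines)
print(labels)
print(vectors)
```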
Each line in a sparse format matrix file contains three fields: the row string, the column string and the count. The extension for this file is .sm.
The corresponding sparse format file for the previous example:
man toy 3
man tv 5
woman tv 5
woman book 6
child toy 43
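To make the relationship between the two formats concrete, the sketch below expands sparse triples into a dense matrix, using the row and column lists to fix the ordering. The function name sparse_to_dense is hypothetical and not part of the toolkit. The sketch assumes that pairs missing from the sparse file have value 0, that a repeated pair overwrites the earlier value, and that an unknown row or column string raises a KeyError; the toolkit's own handling of these cases may differ.

```python
def sparse_to_dense(lines, rows, cols):
    """Build a dense matrix (list of lists) from 'row column value' lines."""
    row_index = {r: i for i, r in enumerate(rows)}
    col_index = {c: j for j, c in enumerate(cols)}
    # Start from all zeros: pairs absent from the sparse file stay 0.
    matrix = [[0.0] * len(cols) for _ in rows]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row, col, value = fields
        matrix[row_index[row]][col_index[col]] = float(value)
    return matrix


rows = ["man", "woman", "child"]
cols = ["toy", "tv", "book"]
sparse_lines = [
    "man toy 3",
    "man tv 5",
    "woman tv 5",
    "woman book 6",
    "child toy 43",
]
matrix = sparse_to_dense(sparse_lines, rows, cols)
print(matrix)
```

The result has the same values as the dense example, with rows ordered as in the row file and columns as in the column file.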