Text formats for matrices¶

Creating a semantic space from text data (rather than from pickles) requires three input files: a row file, a column file, and a matrix file containing co-occurrence counts (or other numerical values representing the association between rows and columns). For command-line usage, the three files must share the same base name and differ only in their file extension.

The toolkit uses the same matrix formats when it writes matrices out as text.

Row file¶

A row file is a text file consisting of a list of strings, each string corresponding to a row in the matrix. The extension for this file is .rows.

Example:
man
woman
child

Column file¶

Similarly, a column file contains a list of strings, where each string corresponds to a column in the matrix. The extension for this file is .cols.

Example:
toy
tv
book
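
As a rough illustration of how row and column files could be read, the sketch below parses the two examples above into Python lists. It assumes one string per line and ignores blank lines. The helper name `parse_labels` is hypothetical and is not part of the toolkit.

```python
def parse_labels(text):
    """Return the non-empty lines of a .rows or .cols file as a list of strings.

    Assumption: one label per line; surrounding whitespace is trimmed.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


# Contents of the example row and column files shown above.
rows = parse_labels("man\nwoman\nchild\n")
cols = parse_labels("toy\ntv\nbook\n")

print(rows)
print(cols)
```

The position of each string in the list gives the index of the corresponding row or column in the matrix.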

Matrix file - dense format¶

A matrix file in dense format contains m rows and n + 1 columns, where (m, n) is the shape of the input matrix. Each line contains a row string followed by its associated vector (n fields). The extension for this file is .dm.

Example:
man 3 5 0
woman 0 5 6
child 43 0 0
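
The dense layout can be read with a few lines of Python. This is a minimal sketch, not the toolkit's own reader. It assumes that fields are separated by whitespace, and the function name `parse_dense` is hypothetical.

```python
def parse_dense(text):
    """Split dense-format text into row labels and their numeric vectors.

    Assumption: fields are whitespace-separated; the first field is the
    row string and the remaining n fields are the vector.
    """
    labels, vectors = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        labels.append(fields[0])
        vectors.append([float(value) for value in fields[1:]])
    return labels, vectors


# The dense example from above: 3 rows, 3 + 1 fields per line.
dense_text = "man 3 5 0\nwoman 0 5 6\nchild 43 0 0\n"
labels, vectors = parse_dense(dense_text)

print(labels)
print(vectors)
```

Each line yields one label and a vector of n values, so the example produces a matrix of shape (3, 3).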

Matrix file - sparse format¶

Each line in a sparse format matrix file contains three fields: the row string, the column string and the count. The extension for this file is .sm.

The corresponding sparse format file for the previous example (pairs with a zero count are left out):
man toy 3
man tv 5
woman tv 5
woman book 6
child toy 43
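
To show how the sparse and dense examples relate, the sketch below expands sparse-format text into a full matrix using the row and column lists from the examples above. It is illustrative only. It assumes whitespace-separated fields and treats every (row, column) pair not listed as zero. If a pair appears twice, this sketch keeps the last value, which is not necessarily what the toolkit does. The name `sparse_to_dense` is hypothetical.

```python
def sparse_to_dense(sparse_text, rows, cols):
    """Expand 'row column count' triples into a list-of-lists matrix.

    Assumptions: whitespace-separated fields; pairs that are not listed
    have value 0; a repeated pair overwrites the earlier value.
    """
    row_index = {label: i for i, label in enumerate(rows)}
    col_index = {label: j for j, label in enumerate(cols)}
    matrix = [[0.0] * len(cols) for _ in rows]
    for line in sparse_text.splitlines():
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        row, col, value = fields
        matrix[row_index[row]][col_index[col]] = float(value)
    return matrix


rows = ["man", "woman", "child"]
cols = ["toy", "tv", "book"]
sparse_text = (
    "man toy 3\n"
    "man tv 5\n"
    "woman tv 5\n"
    "woman book 6\n"
    "child toy 43\n"
)

matrix = sparse_to_dense(sparse_text, rows, cols)
for label, vector in zip(rows, matrix):
    print(label, vector)
```

The result has the same values as the dense example, with the row and column order taken from the row and column lists.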
