FemSMA Experimentation Platform Documentation

Overview

The FemSMA Experimentation Platform is a tool for training and evaluating models that predict the gender of the author of a (short) text based on features extracted from a training corpus. It works on texts stored in the FemSMA database and uses the same tokenization and token feature generation mechanisms as the FemSMA Corpus Workbench. For database details, token feature descriptions, etc., refer to the Corpus Workbench Documentation as well.

Training and evaluating a model involves the following steps:

Selection of the classification features to be used by the model
Selection of a set of texts to be used as training (and test) set
Creation of the training (and test) set
Training and evaluating the model

Selection of classification features

Classification features are grouped into the following groups:

Character based features
Token distribution based features
Token category based features
Token type based features
Part-of-speech features
Sentiment features
Sentiment distribution features
LIWC category features
Pronoun distribution features
Other features

Most of them are percentages, e.g. the words of a certain category as a percentage of the total number of words in a posting. A full, dynamically generated description of all currently available features can be found below.
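As an illustration of such a percentage feature, the following minimal sketch computes the share of tokens that belong to a word category. The function name, the example category and the whitespace tokenization are assumptions made for this example only; the platform uses its own tokenizer and category definitions.

```python
def category_percentage(tokens, category_words):
    """Tokens belonging to a category, in percent of all tokens in the posting."""
    if not tokens:
        return 0.0  # assumption: an empty posting yields 0 rather than an error
    hits = sum(1 for tok in tokens if tok.lower() in category_words)
    return 100.0 * hits / len(tokens)


# Hypothetical category, used only for illustration.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

# Whitespace splitting stands in for the platform's real tokenization.
posting = "I think my phone is broken".split()
print(category_percentage(posting, FIRST_PERSON))  # 2 of 6 tokens match
```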

Selection of the training set

The combined training and test set is selected from the database by

restricting the user
restricting the resource
restricting the markup
restricting the content

Creation of the training and the test set

After the classification features and the combined training/test set have been selected, the actual input files for the learning modules are written. At this point the percentage of data to be retained as test data has to be chosen (default: 10%).

Once the training files have been generated, a description of the selected features is displayed along with descriptive statistics, with significant features highlighted. In particular, the mean and standard deviation of every feature are shown for both subgroups (female and male). Cohen's d is shown as a measure of the effect a feature has in distinguishing between the female and male cases, but only if there is a significant effect. Cf. Table 1 in Newman, M.L., C.J. Groom, L.D. Handelman, and J.W. Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes 45:211-246, for a similar breakdown of the main effects of gender on language use.
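The two computations described above can be sketched as follows. This is not the platform's code: the random split, the function names and the pooled-standard-deviation form of Cohen's d are assumptions, since the documentation does not state how the test data is drawn or which variant of d is used. The significance test that decides whether d is shown is not covered here.

```python
import random
from math import sqrt
from statistics import mean, stdev


def split_holdout(items, test_fraction=0.10, seed=0):
    """Shuffle the items and retain test_fraction of them as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: a random split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def cohens_d(group_a, group_b):
    """Cohen's d based on the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    s_a, s_b = stdev(group_a), stdev(group_b)
    pooled = sqrt(((n_a - 1) * s_a ** 2 + (n_b - 1) * s_b ** 2) / (n_a + n_b - 2))
    # A feature that is constant in both groups gives pooled == 0 and must be
    # handled by the caller; this sketch does not guard against it.
    return (mean(group_a) - mean(group_b)) / pooled


train, test = split_holdout(range(20))
print(len(train), len(test))

# Made-up feature values for the two subgroups.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
```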

Training and evaluation of the model

Currently, two learning algorithms are available: a support vector machine (SVM) and a decision tree.

SVM model

The SVM engine used is LIBLINEAR (A Library for Large Linear Classification). The 'C' parameter of LIBLINEAR can be adjusted (default: 1). After training and testing, the output of LIBLINEAR is displayed together with the confusion matrix.
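To make the data flow concrete, here is a minimal sketch of writing an instance in LIBLINEAR's sparse text format (a label followed by 1-based index:value pairs) and of tallying a confusion matrix from gold and predicted labels. The function names, the label encoding and the feature values are assumptions for illustration; how the platform actually encodes F and M in its input files is not documented here.

```python
def to_liblinear_line(label, features):
    """Format one instance as '<label> <index>:<value> ...' with 1-based indices."""
    # Zero-valued features are omitted, as the sparse format allows.
    parts = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + parts)


def confusion_matrix(gold, predicted, labels=("F", "M")):
    """Rows are gold labels, columns are predicted labels."""
    index = {lab: k for k, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, predicted):
        matrix[index[g]][index[p]] += 1
    return matrix


# Assumed encoding: 1 for one class, -1 for the other.
print(to_liblinear_line(1, [33.3, 0.0, 5.0]))
print(confusion_matrix(["F", "F", "M", "M"], ["F", "M", "M", "M"]))
```

With the LIBLINEAR command-line tools, the C parameter corresponds to the `-c` option of the `train` program.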

Decision Tree model

The algorithm used is YaDT (Yet another Decision Tree builder). The adjustable parameter is the minimum number of instances required for a node to be split (default: 10). Larger values result in a more general tree; smaller values may lead to overfitting.
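The effect of the minimum split size can be seen in a toy tree builder. The sketch below is not YaDT's algorithm: it handles a single numeric feature, uses Gini impurity as a stand-in splitting criterion, and all names and data are hypothetical. It only shows that a node with fewer instances than the minimum becomes a leaf, so a larger minimum yields a smaller tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build(rows, min_split=10):
    """rows: list of (value, label) pairs. Returns a leaf label or a nested dict."""
    labels = [lab for _, lab in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is too small to split or already pure.
    if len(rows) < min_split or len(set(labels)) == 1:
        return majority
    best = None
    values = sorted({v for v, _ in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2  # candidate threshold between two distinct values
        left = [lab for v, lab in rows if v <= thr]
        right = [lab for v, lab in rows if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, thr)
    if best is None:
        return majority  # all feature values identical, nothing to split on
    thr = best[1]
    return {
        "threshold": thr,
        "le": build([r for r in rows if r[0] <= thr], min_split),
        "gt": build([r for r in rows if r[0] > thr], min_split),
    }


def count_leaves(tree):
    if not isinstance(tree, dict):
        return 1
    return count_leaves(tree["le"]) + count_leaves(tree["gt"])


# Made-up, deliberately noisy data: labels alternate with the feature value.
rows = [(v, "F" if v % 2 else "M") for v in range(1, 9)]
print(count_leaves(build(rows, min_split=2)))   # small minimum: one leaf per instance
print(count_leaves(build(rows, min_split=10)))  # larger minimum: a single leaf
```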

As a result, confusion matrices on the training data and on the test data are displayed, along with the decision tree.

A visual representation of the decision tree is built from YaDT's XML output via XSLT. Nodes are colored according to the ratio of F and M instances they cover (100% F is red, 100% M is blue), and branches can be opened and closed for inspection.
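One way to derive such a node color is sketched below. Only the two endpoints (100% F is red, 100% M is blue) come from the description above; the linear interpolation between them, the grey fallback for empty nodes and the function name are assumptions, and the platform itself does this in XSLT rather than Python.

```python
def node_color(n_female, n_male):
    """Map the F/M ratio of a node to a hex color between blue and red."""
    total = n_female + n_male
    if total == 0:
        return "#808080"  # assumption: neutral grey for a node without instances
    share_f = n_female / total
    red = round(255 * share_f)         # more F instances: more red
    blue = round(255 * (1 - share_f))  # more M instances: more blue
    return f"#{red:02x}00{blue:02x}"


print(node_color(10, 0))
print(node_color(0, 10))
print(node_color(3, 7))
```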