FemSMA Experimentation Platform Documentation

The FemSMA Experimentation Platform is a tool for training and evaluating models that predict the gender of the author of a (short) text based on features extracted from a training corpus. It works on texts stored in the FemSMA database and employs the same tokenization and token feature generation mechanisms as the FemSMA Corpus Workbench; for database details, token feature descriptions, etc., refer to the Corpus Workbench Documentation as well.

Training and evaluating a model involves the following steps:
Selection of classification features
Classification features are grouped into
  • Character based features
  • Token distribution based features
  • Token category based features
  • Token type based features
  • Part-of-speech features
  • Sentiment features
  • Sentiment distribution features
  • LIWC category features
  • Pronoun distribution features
  • Other features
Most of them are percentages, e.g. words of a certain category as a percentage of the total number of words in the posting. A full description (dynamically generated) of all currently available features can be found below.
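A percentage-based feature of the kind described above can be sketched as follows. This is an illustrative example, not the platform's actual code; the function name and the category set are hypothetical.

```python
def category_percentage(tokens, category_words):
    """Share of tokens falling into a given word category,
    as a percentage of all tokens in the posting (hypothetical sketch)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in category_words)
    return 100.0 * hits / len(tokens)

# 2 of 5 tokens belong to the (made-up) "intensifier" category:
posting = "I really really like this".split()
print(category_percentage(posting, {"really", "very"}))  # 40.0
```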
Selection of the training set
The combined training and test set must be selected from the database by
  • restricting the user
  • restricting the resource
  • restricting the markup
  • restricting the content
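Conceptually, the four restrictions above act as conjunctive filters on the set of postings drawn from the database. A minimal sketch, assuming postings are represented as dictionaries (the field names and the function are hypothetical; the platform itself selects via database queries):

```python
def select_postings(postings, user=None, resource=None, markup=None, content=None):
    """Keep only postings matching every given restriction;
    None means 'no restriction' (hypothetical sketch)."""
    selected = []
    for p in postings:
        if user is not None and p["user"] != user:
            continue
        if resource is not None and p["resource"] != resource:
            continue
        if markup is not None and p["markup"] != markup:
            continue
        if content is not None and content not in p["text"]:
            continue
        selected.append(p)
    return selected

postings = [
    {"user": "a", "resource": "forum", "markup": "none", "text": "hello world"},
    {"user": "b", "resource": "forum", "markup": "none", "text": "hi"},
]
print(len(select_postings(postings, user="a")))  # 1
```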
Creation of the training and the test set
After selecting the classification features and the combined training/test set, the actual input files for the learning modules are written; the percentage of test data to be retained can be selected (default: 10%). After training file generation, a description of the selected features is displayed, along with descriptive statistics and highlighting of significant features. In particular, the mean and standard deviation of each feature are shown for both subgroups (female and male), and, as a measure of a feature's effect for distinguishing between the female and male cases, Cohen's d is shown (only if there is a significant effect). Cf. Newman, M.L., C.J. Groom, L.D. Handelman, and J.W. Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes 45:211-246, Table 1, for a similar breakdown of main effects of gender on language use.
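Cohen's d, as used above, is the difference of the two group means divided by the pooled standard deviation. A minimal stdlib-only sketch (the function name is ours, not the platform's):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: standardized difference between the means of two
    samples, using the pooled (sample) standard deviation."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Two toy feature-value samples; both have sd 1, means differ by 3:
print(cohens_d([1, 2, 3], [4, 5, 6]))  # -3.0
```

By the usual rule of thumb, |d| around 0.2 is a small effect, 0.5 medium, and 0.8 large.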
Training and Evaluation of the model
Currently, two learning algorithms are available: a support vector machine and a decision tree.
SVM model
The SVM engine used is LIBLINEAR -- A Library for Large Linear Classification. The 'C' (regularization) parameter of LibLinear can be adjusted (default: 1). After training/testing, LibLinear's output is displayed, together with the confusion matrix.
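LIBLINEAR reads training data in a sparse text format, one instance per line: the class label followed by ascending `index:value` pairs (indices are 1-based). The input files mentioned above presumably follow this format; here is a minimal writer sketch for one instance (the function name is hypothetical):

```python
def to_liblinear_line(label, features):
    """Format one instance in LIBLINEAR's sparse input format:
    '<label> <index>:<value> ...' with 1-based, ascending indices."""
    parts = [str(label)]
    for idx in sorted(features):
        parts.append(f"{idx}:{features[idx]:g}")
    return " ".join(parts)

# Label 1 (e.g. female), feature 1 = 2.0, feature 3 = 0.5:
print(to_liblinear_line(1, {3: 0.5, 1: 2.0}))  # 1 1:2 3:0.5
```

Features equal to zero are simply omitted, which keeps the files small for the mostly-percentage features used here.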
Decision Tree model
The algorithm used is YaDT: Yet another Decision Tree builder. The adjustable parameter is the minimum number of instances required for a node to be split (default: 10; larger values result in a more general tree, smaller values may lead to overfitting).

As a result, confusion matrices on the training data and on the test data are displayed, along with the decision tree.
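A confusion matrix of the kind displayed here simply counts, for each gold label, how often each label was predicted. A small stdlib-only sketch for the two-class F/M case (names are ours):

```python
def confusion_matrix(gold, pred, labels=("F", "M")):
    """Count (gold, predicted) label pairs for a two-class problem."""
    counts = {(g, p): 0 for g in labels for p in labels}
    for g, p in zip(gold, pred):
        counts[(g, p)] += 1
    return counts

gold = ["F", "F", "M", "M"]
pred = ["F", "M", "M", "M"]
cm = confusion_matrix(gold, pred)
print(cm[("F", "F")], cm[("F", "M")])  # 1 1  (one F correct, one misclassified as M)
```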

A visual representation of the decision tree is built from YaDT's XML output via XSLT. Nodes are colored according to the ratio of F and M instances they cover (100% F is red, 100% M is blue), and branches can be opened and closed for inspection.
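The node coloring described above can be realized as a linear interpolation between red and blue over the F ratio. A sketch of one plausible mapping (the exact color scheme used by the platform may differ):

```python
def node_color(f_count, m_count):
    """Map a node's F/M instance ratio to an RGB hex color:
    100% F -> red (#ff0000), 100% M -> blue (#0000ff),
    mixed nodes get proportionally blended shades."""
    total = f_count + m_count
    ratio = f_count / total if total else 0.5  # empty node: neutral purple
    red = round(255 * ratio)
    blue = 255 - red
    return f"#{red:02x}00{blue:02x}"

print(node_color(10, 0))  # #ff0000
print(node_color(0, 10))  # #0000ff
```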
Available Classification Features
These are the currently available features: