FemSMA Corpus Workbench Documentation

Overview
The FemSMA Corpus Workbench is a tool for browsing, searching, and annotating the documents of the FemSMA corpus.

Screen Layout
Annotation Selector
The annotation selector specifies which annotations are displayed. Annotations are organized in label groups; the annotator is the person who created the annotation. There are two 'pseudo-annotators', Automatic and All:
  • All is best used when displaying search results, where it is desirable to view all existing annotations.
  • Automatic is selected when viewing the Tokens pseudo label group (see the Tokenization section).
Annotations can only be created when a 'real' annotator is selected.
Resource Selector
Documents are organized into resources, grouped by the source from which they were downloaded. When a resource is selected, a resource description with some statistics and the first document of the resource are displayed.

There exist three 'pseudo-resources':
  • SEARCH: opens a window to specify search criteria
  • LAST SEARCH RESULTS: displays the results of the last document search
  • LAST ANNOTATION LIST: displays a clickable list of the results of the last annotation search.
Document Annotation Display
The document display shows
  • A user description (colored according to the assigned gender of the user, if available; if there are several gender assignments, majority voting is used). Clicking the button labeled '+' displays more user information. It is also possible to start a search for all postings of this user with a single click.
  • The (possibly annotated) document
  • Text annotation buttons (these are only displayed if a label group is selected)
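The majority voting mentioned for the user description can be sketched as follows. This is only an illustration: the function name `majority_gender`, the gender codes 'F' and 'M', and the handling of ties (no color is chosen) are assumptions, not the workbench's documented behavior.

```python
from collections import Counter

def majority_gender(records):
    """Return the most frequent gender code among a user's gender records.

    Returns None if there are no records or if the vote is tied
    (assumption: a tie leaves the user description uncolored).
    """
    if not records:
        return None
    counts = Counter(records).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the two most frequent codes
    return counts[0][0]
```

For example, a user with the records 'F', 'F', 'M' would be displayed with the color for 'F' under these assumptions.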
Annotation
One of the main purposes of the FemSMA Corpus Workbench is to support users in annotating documents. Annotation means that a span of text in the document is assigned a Label. Labels are organized in Label Groups.
Important properties of annotations as implemented in FemSMA are:
Annotating a document
Before a document can be annotated or its annotations modified:
  • It has to be loaded (via resource browsing or search).
  • The proper annotator has to be selected.
  • The desired Label Group has to be selected.
After finishing annotation and before moving to the next document, the annotations have to be saved by clicking the 'Save Annotations' button!
Adding an annotation
  1. Select the phrase to be annotated by dragging the mouse (if the phrase consists of a single word, that word may be double-clicked).
  2. Click the appropriate colored label button.
NB: Annotations may not overlap! Therefore selections starting or ending within another annotation will not work!
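The non-overlap rule can be illustrated with a small check. This is a minimal sketch, assuming that annotations are stored as character offsets with an exclusive end; the function names `overlaps` and `can_add` are hypothetical and not part of the workbench.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open spans [a_start, a_end) and [b_start, b_end) share a character."""
    return a_start < b_end and b_start < a_end

def can_add(selection, existing):
    """True if the selected span does not overlap any existing annotation span."""
    start, end = selection
    return all(not overlaps(start, end, s, e) for s, e in existing)
```

With an existing annotation at (5, 10), a selection (8, 12) would be rejected, while (0, 5) and (10, 15) only touch the annotation and would be accepted.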
Modifying an annotation
  • If the label of an annotated phrase has to be changed, simply double-click it and click the new colored label button.
  • If the boundaries of an annotation are to be changed, delete it first (double-click it and then click the 'Delete Selected' button), and then add the annotation again (see above).
Deleting annotations
  • If a single annotation has to be deleted, just double click it and then click the 'Delete Selected' button. Repeat if necessary.
  • All annotations (of the current document with the current annotator and label group) may be deleted by clicking the 'Delete All' button.
NB: Don't forget to save the annotations after deleting some or all of them, otherwise they will reappear the next time the document is loaded.
Tips
It may simplify work if the annotator has an idea of what the phrases she would like to annotate look like. Then she may conduct a content search (see search examples below) and iterate over the messages in the search results.
The matches are highlighted and may be double-clicked to be selected (just like an annotation) and then assigned a label. (Hint: when searching for word parts, it may be useful to surround the search pattern with '\w*' to extend the match to the whole word.)
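The effect of the '\w*' hint can be tried out with a short script. The sketch below uses Python's `re` module rather than the Perl regular expressions of the workbench; for this pattern the two behave alike, but that equivalence is an assumption worth checking for more complex patterns.

```python
import re

text = "ich als fußballbegeisterte Frau"

# Searching for a word part matches only that part.
part = re.search(r"begeister", text).group()

# Surrounding the pattern with \w* extends the match to the whole word.
whole = re.search(r"\w*begeister\w*", text).group()

print(part)   # begeister
print(whole)  # fußballbegeisterte
```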
Adding an annotator
In order to add a new annotator, click on the button to the right of the annotator selector. In the form that appears, enter the full name and the login of the new annotator. Don't forget to click the 'Add' button, otherwise the data will not be entered into the database. The form can also be closed without adding a new annotator by clicking the button again.
Adding a Label Group
In order to add a new label group, click on the button to the right of the label group selector. Enter the description of the label group. Labels can be added by clicking the button to the right of the 'Labels:' header. See 'Editing a Label Group' below for further details.
Editing a Label Group
In order to edit an existing label group, first select the label group to be edited and then click on the button to the right of the label group selector. A form with the label group description and the defined labels appears. Labels can be added by clicking the button to the right of the 'Labels:' header.
For each label, three fields must be defined:
  • the label name (without blanks! Only alphabetic characters, numbers, and underscores are allowed! Avoid accented characters and umlauts)
  • the label description (don't leave it out, since the description serves as a tooltip for the label buttons and may be helpful during annotation)
  • the label color (click into the field and select a color from the color editor that appears).
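The naming rule for labels can be expressed as a regular expression. The following is a sketch only: it reads "alphabetic characters" as unaccented ASCII letters (following the advice to avoid accented characters and umlauts), and the function name `is_valid_label_name` is hypothetical.

```python
import re

# Assumption: only ASCII letters, digits, and underscores; at least one character.
LABEL_NAME = re.compile(r"[A-Za-z0-9_]+")

def is_valid_label_name(name):
    """True if the whole name consists of allowed characters only."""
    return LABEL_NAME.fullmatch(name) is not None
```

Under this reading, 'gender_index' is accepted, while 'gender index' (blank) and 'Größe' (umlaut and ß) are rejected.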
Deleting Labels:
Online deletion of labels is not implemented, since it may result in loss of data. Instead, rename the label to 'delete'. During database maintenance, the label and its associated annotations may be removed after confirmation.

NB: Don't forget to save your edits by clicking the 'Save' button. If you want to cancel the editing operation, click the button again and the editing form will be removed from the screen.
Search
Search mode is entered if the pseudo-resource 'Search' is selected.
Generally, search is performed on corpus documents (user postings), yielding a subset of the corpus documents. There are two different ways to display the results: as a list of documents or as a list of annotations. The results of the last document search and the last annotation list are cached and available as temporary resources.
Restricting the user
When checking the 'Restrict User' checkbox the search returns only documents that have been posted by a user matching the required criteria.
Two kinds of restrictions are implemented:
  • the user name (i.e. the screen name of the user that appears with the posting) can be restricted. The search is implemented via the LIKE operator, which means:
    • the search is case insensitive
    • '%' and '_' can be used as wildcard characters (see the SQLite documentation of the LIKE operator for details)
  • the gender of the user can be restricted.
    NB: it is possible that several gender records from different sources are attached to a single user, and these may even conflict! Furthermore, there may be users without any gender information.
    The match succeeds for a user if s/he has at least one matching gender record.
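The two user restrictions can be sketched against the user and gender tables from the Database Schema section. This is an illustration under stated assumptions, not the query the workbench actually runs: the sample rows, the gender codes 'F' and 'M', and the function name `find_users` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id integer primary key autoincrement,
                   host text, screen_name text, name text);
CREATE TABLE gender (user_id integer, gender char, how text, assigned_by text);
-- Invented sample data: user 1 has two conflicting gender records,
-- user 2 has none.
INSERT INTO user (id, screen_name) VALUES (1, 'Anna_82'), (2, 'annabell'), (3, 'Bernd');
INSERT INTO gender (user_id, gender) VALUES (1, 'F'), (1, 'M'), (3, 'M');
""")

def find_users(conn, name_pattern, gender=None):
    """Return screen names matching a LIKE pattern and, optionally, a gender."""
    sql = "SELECT u.screen_name FROM user u WHERE u.screen_name LIKE ?"
    params = [name_pattern]
    if gender is not None:
        # A user matches if at least one gender record has the requested value.
        sql += (" AND EXISTS (SELECT 1 FROM gender g"
                " WHERE g.user_id = u.id AND g.gender = ?)")
        params.append(gender)
    sql += " ORDER BY u.id"
    return [row[0] for row in conn.execute(sql, params)]

print(find_users(conn, "anna%"))       # ['Anna_82', 'annabell']
print(find_users(conn, "anna%", "M"))  # ['Anna_82']
```

Note that SQLite's LIKE is case insensitive only for ASCII characters by default, so screen names with umlauts may still be matched case sensitively.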
Restricting the resource
When the 'Restrict Resource' checkbox is checked, the list of all currently available resources is displayed. Only postings from checked resources will be selected.
Restricting the markup
It is possible to select only postings that
  • have no annotation at all, or
  • contain a certain annotation.
When the 'Restrict Markup' checkbox is checked, the annotations sought can be selected by clicking them.
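Both markup restrictions map naturally onto the annotation table. The sketch below uses simplified versions of the msg, label, and annotation tables (only the columns needed here) and invented sample rows; the actual queries used by the workbench are not documented, so these are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE msg (id integer primary key autoincrement, text text);
CREATE TABLE label (id integer primary key autoincrement,
                    group_id integer, name text);
CREATE TABLE annotation (msg_id integer, label_id integer,
                         annotator_id integer, "start" integer, "end" integer);
-- Invented sample data: message 1 carries one annotation, message 2 none.
INSERT INTO msg (id, text) VALUES (1, 'Ich als Frau finde das gut'), (2, 'keine Ahnung');
INSERT INTO label (id, group_id, name) VALUES (1, 1, 'gender_index');
INSERT INTO annotation VALUES (1, 1, 1, 0, 12);
""")

# Postings that have no annotation at all.
UNANNOTATED = """
SELECT m.id FROM msg m
WHERE NOT EXISTS (SELECT 1 FROM annotation a WHERE a.msg_id = m.id)
ORDER BY m.id
"""

# Postings that contain at least one annotation with the given label name.
WITH_LABEL = """
SELECT DISTINCT m.id FROM msg m
JOIN annotation a ON a.msg_id = m.id
JOIN label l ON l.id = a.label_id
WHERE l.name = ?
ORDER BY m.id
"""

unannotated = [row[0] for row in conn.execute(UNANNOTATED)]
labeled = [row[0] for row in conn.execute(WITH_LABEL, ("gender_index",))]
print(unannotated, labeled)  # [2] [1]
```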
Restricting the content
When checking the 'Restrict Content' checkbox it is possible to restrict the search to postings that contain certain words or phrases. Content matching is performed via Perl regular expressions.
A brief explanation of regular expressions can be found here; a more technical reference is available here.
NB: regexp search is case sensitive.
Examples
Say you want to search for and annotate gender index expressions. A first try would be to search for
ich als frau
This search returns only one document, containing the phrase "sich als frau", which is not really what we wanted.
So the next try would be to allow for case variation:
[iI]ch als [fF]rau
This query yields more matches, but it still returns the same unspecific phrase as the first one. In order to specify that we are searching for the word "ich", we can use the word-boundary metacharacter "\b":
\b[iI]ch als [fF]rau
This indeed yields only matches where the first word is "ich".
Finally, if we are interested in finding phrases starting with the words "ich" and "als" and ending with the word "Frau", with a maximum of 30 intervening characters, none of which is a sentence boundary, we could use:
\b[iI]ch als\b[^\.\!\?]{0,30}\b[fF]rau\b
This search also finds the phrase ich als fußballbegeisterte Frau.
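The four example patterns can be compared side by side in a short script. The sketch uses Python's `re` module instead of Perl regular expressions (the patterns behave the same way in both for these cases), and the three sample postings are invented for illustration.

```python
import re

# Invented sample postings, one for each phrase discussed above.
postings = [
    "Das muss sich als frau erst mal zeigen.",
    "Ich als Frau sehe das anders.",
    "ich als fußballbegeisterte Frau schaue jedes Spiel.",
]

patterns = [
    r"ich als frau",
    r"[iI]ch als [fF]rau",
    r"\b[iI]ch als [fF]rau",
    r"\b[iI]ch als\b[^\.\!\?]{0,30}\b[fF]rau\b",
]

def first_matches(pattern, texts):
    """Return the first matching phrase of each text that matches the pattern."""
    found = []
    for text in texts:
        m = re.search(pattern, text)
        if m:
            found.append(m.group())
    return found

for pattern in patterns:
    print(pattern, "->", first_matches(pattern, postings))
```

On these sample postings, the first pattern finds only the unwanted "ich als frau" inside "sich als frau", the third finds only "Ich als Frau", and the last one additionally finds "ich als fußballbegeisterte Frau".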
Match highlighting
Results from a search involving a content restriction provide highlighting of the matching phrase(s). A highlighted phrase can be selected by double-clicking it, which makes it easy to assign it a label. However, if the matching phrase already has a label assigned (given the current annotator and label group), highlighting cannot be performed. Similarly, if the matching phrase overlaps an already annotated phrase, only the non-overlapping parts of the match are highlighted.
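The highlighting rule can be pictured as subtracting the annotated spans from the match span. The following is a minimal sketch under the assumption that spans are character offsets with an exclusive end; the function name `highlight_spans` is hypothetical and the workbench may implement this differently.

```python
def highlight_spans(match, annotations):
    """Return the parts of the match span not covered by any annotation.

    `match` is a (start, end) pair; `annotations` is a list of
    non-overlapping (start, end) pairs. Ends are exclusive.
    """
    start, end = match
    parts = []
    cursor = start
    for a_start, a_end in sorted(annotations):
        if a_end <= cursor or a_start >= end:
            continue  # annotation lies outside the remaining match span
        if a_start > cursor:
            parts.append((cursor, a_start))
        cursor = max(cursor, a_end)
    if cursor < end:
        parts.append((cursor, end))
    return parts
```

A match that coincides with an existing annotation yields an empty list, which corresponds to the case where no highlighting is performed.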
Tokenization
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens may be the input of further processing stages (e.g. parsing). In FemSMA each token has a token category and may carry a set of additional features derived from lexical resources.
Token Categories
ELL       Ellipsis
EMO       Emoticon
EXCLAM    Exclamation
HASHTAG   Twitter Hashtag
LQUOT     Left quotation mark
NUM       Number
PUNCT     Punctuation
RQUOT     Right quotation mark
URL       URL
USER      Reference to user
WORD      Word
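How a tokenizer might assign some of these categories can be sketched with a single regular expression of named alternatives. This is not the FemSMA tokenizer: all patterns below are assumptions chosen for illustration (for instance, EXCLAM is read as a run of exclamation marks), and LQUOT and RQUOT are left out.

```python
import re

# Order matters: more specific categories are tried first.
TOKEN_PATTERNS = [
    ("URL",     r"https?://\S+"),
    ("USER",    r"@\w+"),
    ("HASHTAG", r"#\w+"),
    ("EMO",     r"[:;]-?[)(DP]"),
    ("ELL",     r"\.{2,}|…"),
    ("NUM",     r"\d+(?:[.,]\d+)?"),
    ("WORD",    r"\w+"),
    ("EXCLAM",  r"!+"),
    ("PUNCT",   r"[^\w\s]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)

def tokenize(text):
    """Return a list of (category, token) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tokenize("@anna Ich als Frau... :-) #fussball 2 Tore!"))
```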
Word Features
Word features are either derived from the orthographic appearance of the token or by lookup in appropriate lexical resources. The following resources are currently used:
General word features
ABBR      Abbreviation
CAP       Allcap orthography
CCC       Character reduplication
FIRSTNF   Female first name
FIRSTNM   Male first name
ITJ       Interjection
SWEAR     Swear word
QUEST     Question
Part-of-Speech Features
ADJA      attributive adjective
ADJD      predicative or adverbial adjective
ADV       adverb
APPO      postposition
APPR      preposition
APPRART   preposition with article
APZR      circumposition, right part
ART       definite/indefinite article
CARD      cardinal number
CM        comma
KOKOM     comparative particle
KON       coordinating conjunction
KOUI      subordinating conjunction with infinitive
KOUS      subordinating conjunction with clause
NE        proper noun
NN        noun
PAV       pronominal adverb
PDAT      attributive demonstrative pronoun
PDS       substituting demonstrative pronoun
PIAT      attributive indefinite pronoun without determiner
PIDAT     attributive indefinite pronoun with determiner
PIS       substituting indefinite pronoun
PPER      irreflexive personal pronoun
PPOSAT    attributive possessive pronoun
PPOSS     substituting possessive pronoun
PRELAT    attributive relative pronoun
PRELS     substituting relative pronoun
PRF       reflexive personal pronoun
PTKA      particle with adjective or adverb
PTKANT    answer particle
PTKNEG    negation particle
PTKVZ     separated verb prefix
PTKZU     "zu" before infinitive
PWAT      attributive interrogative pronoun
PWAV      adverbial interrogative pronoun
PWS       substituting interrogative pronoun
SENT      sentence-final punctuation
VAFIN     finite auxiliary verb
VAIMP     auxiliary verb, imperative
VAINF     auxiliary verb, infinitive
VAPP      auxiliary verb, past participle
VMFIN     finite modal verb
VMINF     modal verb, infinitive
VMPP      modal verb, past participle
VVFIN     finite full verb
VVIMP     full verb, imperative
VVINF     full verb, infinitive
VVIZU     full verb, infinitive with "zu"
VVPP      full verb, past participle
Sentiment Features
SENT1     positive sentiment
SENT2     strong positive sentiment
SENT3     very strong positive sentiment
SENT4     extremely strong positive sentiment
SENT1     negative sentiment
SENT2     strong negative sentiment
SENT3     very strong negative sentiment
SENT4     extremely strong negative sentiment
LIWC features
LIWC1     Pronoun
LIWC2     I
LIWC3     We
LIWC4     Self
LIWC5     You
LIWC6     Other
LIWC7     Negate
LIWC8     Assent
LIWC9     Article
LIWC10    Preps
LIWC11    Numbers
LIWC12    Affect
LIWC13    Positive emotion
LIWC14    Positive feeling
LIWC15    Optimism
LIWC16    Negative emotion
LIWC17    Anxiety
LIWC18    Anger
LIWC19    Sad
LIWC20    Cognitive mechanism
LIWC21    Cause
LIWC22    Insight
LIWC23    Discrepancy
LIWC24    Inhibition
LIWC25    Tentative
LIWC26    Certain
LIWC27    Senses
LIWC28    See
LIWC29    Hear
LIWC30    Feel
LIWC31    Social
LIWC32    Communication
LIWC33    Other reference
LIWC34    Friends
LIWC35    Family
LIWC36    Humans
LIWC37    Time
LIWC38    Past
LIWC39    Present
LIWC40    Future
LIWC41    Space
LIWC42    Up
LIWC43    Down
LIWC44    Incl
LIWC45    Excl
LIWC46    Motion
LIWC47    Occup
LIWC48    School
LIWC49    Job
LIWC50    Achieve
LIWC51    Leisure
LIWC52    Home
LIWC53    Sports
LIWC54    TV
LIWC55    Music
LIWC56    Money
LIWC57    Metaph
LIWC58    Relig
LIWC59    Death
LIWC60    Physical
LIWC61    Body
LIWC62    Sex
LIWC63    Eat
LIWC64    Sleep
LIWC65    Grooming
LIWC66    Swear
LIWC67    Non-fluency
LIWC68    Filler
Token display as pseudo-annotation
In order to view the results of tokenization and explore the word features, token features can be displayed in a similar way to annotations by selecting the label group Tokens. Buttons for all token classes and token features appearing in the current document are then displayed.
  • By hovering the mouse cursor over a token class or feature button, the explanation of that feature is displayed as a tooltip.
  • By clicking a token class or feature button, all tokens having that class or feature are highlighted in the text.
  • By hovering the mouse cursor over the document, the token class and features of the token under the cursor are displayed as a tooltip.
Implementation Details
FemSMA Database
All FemSMA corpus data is contained in an SQLite relational database. This allows for
  • easy addition of new resources
  • easy maintenance of annotations originating from different annotators and for different label sets
  • easy searching of the corpus data with flexible search criteria
Database Schema
CREATE TABLE annotation (
  msg_id        integer,
  label_id      integer,
  annotator_id  integer,
  start         integer,
  end           integer
);
CREATE TABLE annotator (
  id            integer primary key autoincrement,
  name          text,
  login         text
);
CREATE TABLE gender (
  user_id       integer,
  gender        char,
  how           text,         -- how the gender information was established
  assigned_by   text
);
CREATE TABLE label (
  id            integer primary key autoincrement,
  group_id      integer,
  name          text,
  descr         text,
  color         text
);
CREATE TABLE label_group (
  id            integer primary key autoincrement,
  descr         text
);
CREATE TABLE msg (
  id            integer primary key autoincrement,
  resource_id   integer,
  relpos        integer,    -- relative position within the resource
  date          datetime,
  user          text,
  text          text
);
CREATE TABLE quotation (
  msg_id        integer,
  relpos        integer,    -- relative position within the msg
  text          text
);
CREATE TABLE resource (
  id            integer primary key autoincrement,
  type          text,
  topic         text,
  descr         text,
  code          text,
  host          text,
  dominance     char,       -- F, M, G
  length_restr  integer,
  url           text,
  file          text
);
CREATE TABLE user (
  id            integer primary key autoincrement,
  host          text,
  screen_name   text,
  name          text
);
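To show how the tables fit together, the following sketch lists annotations together with the annotated text span, roughly what an annotation list might contain. It rests on assumptions the schema does not state: that start and end are 0-based character offsets into msg.text with an exclusive end, and that the sample rows (annotator, label, message) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (msg_id integer, label_id integer,
                         annotator_id integer, "start" integer, "end" integer);
CREATE TABLE annotator (id integer primary key autoincrement,
                        name text, login text);
CREATE TABLE label (id integer primary key autoincrement, group_id integer,
                    name text, descr text, color text);
CREATE TABLE msg (id integer primary key autoincrement, resource_id integer,
                  relpos integer, date datetime, user text, text text);
-- Invented sample data.
INSERT INTO annotator (id, name, login) VALUES (1, 'Example Annotator', 'example');
INSERT INTO label (id, group_id, name) VALUES (1, 1, 'gender_index');
INSERT INTO msg (id, text) VALUES (1, 'Ich als Frau sehe das anders.');
INSERT INTO annotation VALUES (1, 1, 1, 0, 12);
""")

# substr() is 1-based in SQLite, hence "start" + 1 (assuming 0-based offsets).
rows = conn.execute("""
SELECT an.login, l.name, substr(m.text, a."start" + 1, a."end" - a."start")
FROM annotation a
JOIN msg m ON m.id = a.msg_id
JOIN label l ON l.id = a.label_id
JOIN annotator an ON an.id = a.annotator_id
ORDER BY a.msg_id, a."start"
""").fetchall()

print(rows)  # [('example', 'gender_index', 'Ich als Frau')]
```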