FemSMA Corpus Workbench Documentation

Overview

The FemSMA Corpus Workbench is a tool for browsing, searching, and annotating the FemSMA corpus.

The annotation selector specifies which annotations are displayed. Annotations are organized in label groups; the annotator is the person who created the annotation. There are two 'pseudo-annotators', ALL and Automatic:

ALL is best used when displaying search results, where it is desirable to view all existing annotations.
Automatic is selected when viewing the Tokens pseudo label group (see the Tokenization section).

Annotations can only be created when a 'real' annotator is selected.

Resource Selector

Documents are organized as resources, grouped by the source from which they were downloaded. When a resource is selected, a resource description with some statistics and the first document of the resource are displayed.

There exist three 'pseudo-resources':

SEARCH: opens a window for specifying search criteria
LAST SEARCH RESULTS: displays the results of the last document search
LAST ANNOTATION LIST: displays a clickable list of the results of the last annotation search.

Document Annotation Display

The document display shows

A user description (colored according to the gender assigned to the user, if available; if there are several gender assignments, majority voting is used). Clicking the button labeled '+' displays more user information. It is also possible to start a search for all postings of this user with a single click.
The (possibly annotated) document
Text annotation buttons (these are only displayed if a label group is selected)
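
The majority voting mentioned for the user description can be pictured with the following minimal Python sketch. The function name, the gender codes 'F' and 'M', and the tie rule (first value seen wins) are assumptions made for illustration; the workbench's actual behaviour in the case of a tie is not documented here.

```python
from collections import Counter

def majority_gender(records):
    """Return the most frequent gender code, or None if there are no records.

    Ties go to the value that was seen first. This tie rule is an
    assumption of the sketch, not documented FemSMA behaviour.
    """
    if not records:
        return None
    return Counter(records).most_common(1)[0][0]

print(majority_gender(["F", "M", "F"]))   # two of three records say 'F'
print(majority_gender([]))                # no gender information at all
```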

Annotation

One of the main purposes of the FemSMA Corpus Workbench is to support users in annotating documents. Annotation means that a span of text in the document is assigned a label. Labels are organized in label groups.
Important properties of annotations as implemented in FemSMA are:

There may be different annotations for one document.
The person who created an annotation is stored with it.
For each document, every Label Group/Annotator combination can occur only once.
Within one document the labels in a Label Group/Annotator combination may not overlap!
Overlapping labels may occur if one selects 'ALL' as annotator restriction (e.g. for displaying search results). In that case, the first of two overlapping labels is not displayed!
Document annotation is only possible if a person is selected as annotator.

Annotating a document

Before a document can be annotated or its annotations modified:

It has to be loaded (via resource browsing or search).
The proper annotator has to be selected.
The desired Label Group has to be selected.

After finishing the annotation and before moving to the next document, the annotations have to be saved by clicking the 'Save Annotations' button!

Adding an annotation

Select the phrase to be annotated by dragging the mouse (if the phrase consists of a single word, that word may be double-clicked).
Click the appropriate colored label button.

NB: Annotations may not overlap! Therefore selections starting or ending within another annotation will not work!
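
The non-overlap rule can be sketched as a simple interval test. This is only an illustration: it assumes that annotations are stored as half-open character spans (start inclusive, end exclusive), and the function names are hypothetical, not part of the workbench.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open spans [a_start, a_end) and [b_start, b_end) share a character."""
    return a_start < b_end and b_start < a_end

def selection_allowed(start, end, existing):
    """A new selection is allowed only if it overlaps none of the existing spans."""
    return all(not overlaps(start, end, s, e) for s, e in existing)

existing = [(10, 20), (30, 35)]               # annotations already in the document
print(selection_allowed(20, 30, existing))    # touches both neighbours, overlaps neither
print(selection_allowed(15, 25, existing))    # starts inside the first annotation
```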

Modifying an annotation

If the label of an annotated phrase has to be changed, simply double-click it and click the new colored label button.
If the boundaries of an annotation are to be changed, delete it first (double-click it and then click the 'Delete Selected' button), and then add the annotation again (see above).

Deleting annotations

If a single annotation has to be deleted, just double click it and then click the 'Delete Selected' button. Repeat if necessary.
All annotations (of the current document with the current annotator and label group) may be deleted by clicking the 'Delete All' button.

NB: Don't forget to save the annotations after deleting some or all of them, otherwise they will reappear the next time the document is loaded.

Tips

It may simplify the work if the annotator has an idea of what the phrases she wants to annotate look like. She can then conduct a content search (see the search examples below) and iterate over the messages in the search results.
The matches are highlighted and can be double-clicked to select them (just like an annotation) and then assigned a label. (Hint: when searching for word parts, it may be useful to surround the search pattern with '\w*' to extend the match to the whole word.)
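
The effect of the '\w*' hint can be tried out with any regular expression engine. The sketch below uses Python's re module rather than the Perl regular expressions used by the workbench; for this simple pattern the two behave alike. The sample text is invented.

```python
import re

text = "ich als fußballbegeisterte Frau"   # invented sample posting

part = re.search(r"begeister", text)
whole = re.search(r"\w*begeister\w*", text)

print(part.group(0))    # only the word part that was searched for
print(whole.group(0))   # the match extended to the whole word
```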

Adding an annotator

In order to add a new annotator, click the button to the right of the annotator selector. In the form that appears, enter the full name and the login of the new annotator. Don't forget to click the 'Add' button, otherwise the data will not be entered into the database. The form can also be closed without adding a new annotator by clicking the button again.

Adding a Label Group

In order to add a new label group, click the button to the right of the label group selector. Enter the description of the label group. Labels can be added by clicking the button to the right of the 'Labels:' header. See 'Editing a Label Group' below for further details.

Editing a Label Group

In order to edit an existing label group, select it and click the button to the right of the label group selector. A form with the label group description and the defined labels appears. Labels can be added by clicking the button to the right of the 'Labels:' header.
For each label, three fields have to be defined:

the label name (no blanks; only alphabetic characters, digits, and the underscore are allowed; avoid accented characters and umlauts)
the label description (don't leave it out, since the description serves as a tooltip for the label buttons and may be helpful during annotation)
the label color (click into the field and select a color from the color editor that appears).
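
The naming rule for labels can be checked with a short regular expression. The following Python sketch only illustrates the rule as stated above (unaccented letters, digits, and underscore); the function name is hypothetical, and whether the workbench enforces the rule in exactly this way is an assumption.

```python
import re

# Unaccented ASCII letters, digits, and underscore only; at least one character.
LABEL_NAME = re.compile(r"[A-Za-z0-9_]+")

def is_valid_label_name(name):
    """True if the whole name consists of allowed characters only."""
    return LABEL_NAME.fullmatch(name) is not None

print(is_valid_label_name("gender_index"))   # letters and underscore
print(is_valid_label_name("gender index"))   # contains a blank
print(is_valid_label_name("männlich"))       # contains an umlaut
```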

Deleting Labels:
Online deletion of labels is not implemented, since it may result in loss of data. Instead, rename the label to 'delete'. During database maintenance, the label and its associated annotations may be removed after confirmation.

NB: Don't forget to save your edits by clicking the 'Save' button. If you want to cancel the editing operation, click the button again and the editing form will be removed from the screen.

Search

Search mode is entered when the pseudo-resource 'Search' is selected.
Generally, a search is performed on corpus documents (user postings) and yields a subset of the corpus documents. However, there are two different ways to display the results:

the documents themselves are displayed (just like a subcorpus resource), or
a list of the annotated phrases found in these documents is displayed, ordered by label group, label, and then alphabetically. These phrases are clickable; clicking one starts a search for documents containing that phrase (regardless of whether it is annotated there).
The search phrase is highlighted in the found documents (see section on content search below).

The results of the last document search and the last annotation list are cached and available as a temporary resource.

Restricting the user

When the 'Restrict User' checkbox is checked, the search returns only documents posted by a user matching the given criteria.
Two kinds of restrictions are implemented:

the user name (i.e. the screen name of the user that appears with the posting) can be restricted. The search is implemented via the SQL LIKE operator, which means:
- the search is case insensitive
- '%' (any sequence of characters) and '_' (any single character) can be used as wildcard characters
the gender of the user can be restricted.
NB: several gender records from different sources may be attached to a single user, and they may even conflict! Furthermore, there may be users without any gender information.
The match succeeds for a user if he or she has at least one matching gender record.
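
How the two user restrictions might translate into a query can be sketched with Python's sqlite3 module and the user and gender tables from the database schema. The query, the function name, the sample rows, and the gender codes 'F' and 'M' are assumptions for illustration; the workbench's actual query is not shown in this documentation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user (id integer primary key autoincrement,
                   host text, screen_name text, name text);
CREATE TABLE gender (user_id integer, gender char, how text, assigned_by text);
-- invented sample data
INSERT INTO user (screen_name) VALUES ('Anna_82'), ('annabell'), ('Bernd');
INSERT INTO gender (user_id, gender) VALUES (1, 'F'), (1, 'M'), (3, 'M');
""")

def find_users(name_pattern, gender):
    """Screen names matching the LIKE pattern with at least one matching gender record."""
    rows = con.execute(
        """
        SELECT u.screen_name FROM user AS u
        WHERE u.screen_name LIKE ?
          AND EXISTS (SELECT 1 FROM gender AS g
                      WHERE g.user_id = u.id AND g.gender = ?)
        ORDER BY u.id
        """,
        (name_pattern, gender),
    )
    return [row[0] for row in rows]

print(find_users("anna%", "F"))   # LIKE ignores case; 'annabell' has no gender record
print(find_users("%", "M"))       # 'Anna_82' has conflicting records and matches as well
```

The EXISTS subquery expresses the "at least one matching gender record" rule: a user with conflicting records matches a search for either gender.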

Restricting the resource

When the 'Restrict Resource' checkbox is checked, the list of all currently available resources is displayed. Only postings from checked resources will be selected.

Restricting the markup

It is possible to select only postings that

have no annotation at all, or
contain a certain annotation.

When the 'Restrict Markup' checkbox is checked, the annotations sought can be selected by clicking them.
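
The two markup restrictions correspond to two simple queries on the annotation table. The sketch below uses Python's sqlite3 module with a reduced msg table and invented rows; the label id 7 is hypothetical, and the column names "start" and "end" are quoted only as a precaution.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE msg (id integer primary key, text text);
CREATE TABLE annotation (msg_id integer, label_id integer, annotator_id integer,
                         "start" integer, "end" integer);
-- invented sample data
INSERT INTO msg (id, text) VALUES (1, 'ich als Frau'), (2, 'hallo'), (3, 'ich als Mann');
INSERT INTO annotation VALUES (1, 7, 1, 0, 12), (3, 8, 1, 0, 12);
""")

# Postings that have no annotation at all
unannotated = [row[0] for row in con.execute(
    "SELECT id FROM msg WHERE NOT EXISTS "
    "(SELECT 1 FROM annotation WHERE annotation.msg_id = msg.id) ORDER BY id")]

# Postings that contain an annotation with a certain label (hypothetical id 7)
with_label_7 = [row[0] for row in con.execute(
    "SELECT DISTINCT msg_id FROM annotation WHERE label_id = ? ORDER BY msg_id", (7,))]

print(unannotated, with_label_7)
```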

Restricting the content

When the 'Restrict Content' checkbox is checked, it is possible to restrict the search to postings that contain certain words or phrases. Content matching is performed via Perl regular expressions.
Both introductory explanations of regular expressions and more technical references are available, for example in the Perl documentation (perlre).
NB: regexp search is case sensitive.

Examples

Say, you want to search and annotate gender index expressions. A first try would be to search for

ich als frau

This search returns only one document, which contains the phrase 'sich als frau'; this is not really what we wanted.
So the next try would be to allow for case variation:

[iI]ch als [fF]rau

This query yields more matches, but they still include the unspecific phrase found by the first query. To specify that we are searching for the word "ich", we can use the word-boundary metacharacter "\b":

\b[iI]ch als [fF]rau

This indeed yields only matches where the first word is "ich".
Finally, if we are interested in phrases that start with the words "ich als" and end with the word "Frau", with at most 30 intervening characters, none of which is a sentence boundary, we could use:

\b[iI]ch als\b[^\.\!\?]{0,30}\b[fF]rau\b

This search also finds the phrase ich als fußballbegeisterte Frau.
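
The four queries can be compared side by side outside the workbench. The sketch below uses Python's re module, which treats these particular patterns the same way as Perl regular expressions; the sample postings are invented and only mimic the cases discussed above.

```python
import re

posts = [                                   # invented sample postings
    "er sieht sich als frau",
    "Ich als Frau finde das gut",
    "ich als fußballbegeisterte Frau",
    "ich als Mann. Meine Frau sagt",
]

patterns = [
    r"ich als frau",
    r"[iI]ch als [fF]rau",
    r"\b[iI]ch als [fF]rau",
    r"\b[iI]ch als\b[^\.\!\?]{0,30}\b[fF]rau\b",
]

def matching_posts(pattern):
    """Indices of the sample postings in which the pattern matches somewhere."""
    return [i for i, post in enumerate(posts) if re.search(pattern, post)]

for pattern in patterns:
    print(pattern, matching_posts(pattern))
```

The last posting is never matched: the only "Frau" after "ich als" lies beyond a full stop, which the character class in the final pattern excludes.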

Match highlighting

Results from a search involving a content restriction provide highlighting of the matching phrase(s). A highlighted phrase can be selected by double-clicking it, which makes it easy to assign it a label. However, if the matching phrase already has a label assigned (given the current annotator and label group), highlighting cannot be performed. Similarly, if the matching phrase overlaps an already annotated phrase, only the non-overlapping parts of the match are highlighted.
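
The highlighting rule amounts to subtracting the annotated spans from the matched span. The following Python sketch shows one way to compute this, assuming half-open character spans; the function name is hypothetical and the workbench may implement the rule differently.

```python
def visible_parts(match, annotations):
    """Return the parts of the half-open span `match` not covered by any annotation."""
    parts = [match]
    for a_start, a_end in annotations:
        remaining = []
        for start, end in parts:
            if a_end <= start or end <= a_start:   # no overlap, keep the part as it is
                remaining.append((start, end))
                continue
            if start < a_start:                    # piece left of the annotation
                remaining.append((start, a_start))
            if a_end < end:                        # piece right of the annotation
                remaining.append((a_end, end))
        parts = remaining
    return parts

print(visible_parts((5, 20), [(0, 8), (15, 17)]))   # two pieces remain highlighted
print(visible_parts((5, 10), [(0, 12)]))            # fully annotated: nothing to highlight
```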

Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens may be the input of further processing stages (e.g. parsing). In FemSMA each token has a token category and may carry a set of additional features derived from lexical resources.

Token Categories

ELL	Ellipsis
EMO	Emoticon
EXCLAM	Exclamation
HASHTAG	Twitter Hashtag
LQUOT	Left quotation mark
NUM	Number
PUNCT	Punctuation
RQUOT	Right quotation mark
URL	URL
USER	Reference to user
WORD	Word
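
To give an impression of what assigning token categories involves, here is a deliberately small regular-expression tokenizer in Python. It is not the FemSMA tokenizer: the patterns are simplified assumptions (for example, EXCLAM is taken to be a run of exclamation marks), LQUOT and RQUOT are left out, and no lexical resources are consulted.

```python
import re

# Order matters: more specific categories are tried before WORD and PUNCT.
TOKEN_SPEC = [
    ("URL",     r"https?://\S+"),
    ("USER",    r"@\w+"),
    ("HASHTAG", r"#\w+"),
    ("EMO",     r"[:;]-?[()DPp]"),
    ("ELL",     r"\.{3,}"),
    ("NUM",     r"\d+(?:[.,]\d+)?"),
    ("WORD",    r"\w+"),
    ("EXCLAM",  r"!+"),
    ("PUNCT",   r"[^\w\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (category, token) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]

for category, token in tokenize("@anna schau mal: http://example.org :-) #toll 3 Tore!!!"):
    print(category, token)
```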

Word Features

Word features are derived either from the orthographic appearance of the token or by lookup in appropriate lexical resources. The following resources are currently used:

Classical and Japanese Emoticons (as regular expressions, derived from http://sentiment.christopherpotts.net/tokenizing.html and http://de.wikipedia.org/wiki/Emoticon)
SentiStrength_DE, a collection of German lexicon files to be used for sentiment classification with SentiStrength from the University of Wolverhampton. From http://www.ofai.at/research/interact/resources/SentiStrength_DE/download_form.html. This resource is implemented as a finite state transducer (FST).
A collection of interjections (as a list of regular expressions)
A list of internet abbreviations collected from http://netforbeginners.about.com/od/internetglossary/a/glossary-of-internet-jargon-and-abbreviations.htm, implemented as an FST.
A morphological lexicon holding part-of-speech information, developed by OFAI, implemented as an FST.
A list of first names with gender information derived from http://www.heise.de/ct/ftp/07/17/182/, implemented as an FST.
A German version of the Linguistic Inquiry and Word Count (LIWC) dictionary, implemented as an FST.

General word features

ABBR	Abbreviation
CAP	All-caps orthography
CCC	Character reduplication
FIRSTNF	Female first name
FIRSTNM	Male first name
ITJ	Interjection
SWEAR	Swear word
QUEST	Question
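
Features such as CAP and CCC depend only on the spelling of a token. The Python sketch below shows how such orthographic features could be derived; the exact criteria (at least two letters for CAP, the same character three times in a row for CCC) are assumptions, as the documentation does not define them.

```python
import re

def orthographic_features(token):
    """Derive CAP and CCC from the spelling of a token (criteria are assumed)."""
    features = set()
    if len(token) > 1 and token.isalpha() and token.isupper():
        features.add("CAP")                      # written entirely in capitals
    if re.search(r"(.)\1\1", token):
        features.add("CCC")                      # same character three times in a row
    return features

print(orthographic_features("SUPER"))
print(orthographic_features("sooooo"))
print(orthographic_features("Frau"))
```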

Part-of-Speech Features

ADJA	attributive adjective
ADJD	predicative or adverbial adjective
ADV	adverb
APPO	postposition
APPR	preposition
APPRART	preposition fused with article
APZR	circumposition, right part
ART	definite or indefinite article
CARD	cardinal number
CM	comma
KOKOM	comparative particle
KON	coordinating conjunction
KOUI	subordinating conjunction with infinitive
KOUS	subordinating conjunction with clause
NE	proper noun
NN	common noun
PAV	pronominal adverb
PDAT	attributive demonstrative pronoun
PDS	substituting demonstrative pronoun
PIAT	attributive indefinite pronoun without determiner
PIDAT	attributive indefinite pronoun with determiner
PIS	substituting indefinite pronoun
PPER	non-reflexive personal pronoun
PPOSAT	attributive possessive pronoun
PPOSS	substituting possessive pronoun
PRELAT	attributive relative pronoun
PRELS	substituting relative pronoun
PRF	reflexive personal pronoun
PTKA	particle with adjective or adverb
PTKANT	answer particle
PTKNEG	negation particle
PTKVZ	separated verb prefix
PTKZU	"zu" before infinitive
PWAT	attributive interrogative pronoun
PWAV	adverbial interrogative pronoun
PWS	substituting interrogative pronoun
SENT	sentence-final punctuation
VAFIN	finite auxiliary verb
VAIMP	auxiliary verb, imperative
VAINF	auxiliary verb, infinitive
VAPP	auxiliary verb, past participle
VMFIN	finite modal verb
VMINF	modal verb, infinitive
VMPP	modal verb, past participle
VVFIN	finite full verb
VVIMP	full verb, imperative
VVINF	full verb, infinitive
VVIZU	full verb, infinitive with "zu"
VVPP	full verb, past participle

Sentiment Features

SENT1	positive sentiment
SENT2	strong positive sentiment
SENT3	very strong positive sentiment
SENT4	extremely strong positive sentiment
SENT1	negative sentiment
SENT2	strong negative sentiment
SENT3	very strong negative sentiment
SENT4	extremely strong negative sentiment

LIWC features

LIWC1	Pronoun
LIWC2	I
LIWC3	We
LIWC4	Self
LIWC5	You
LIWC6	Other
LIWC7	Negate
LIWC8	Assent
LIWC9	Article
LIWC10	Preps
LIWC11	Numbers
LIWC12	Affect
LIWC13	Positive emotion
LIWC14	Positive feeling
LIWC15	Optimism
LIWC16	Negative emotion
LIWC17	Anxiety
LIWC18	Anger
LIWC19	Sad
LIWC20	Cognitive mechanism
LIWC21	Cause
LIWC22	Insight
LIWC23	Discrepancy
LIWC24	Inhibition
LIWC25	Tentative
LIWC26	Certain
LIWC27	Senses
LIWC28	See
LIWC29	Hear
LIWC30	Feel
LIWC31	Social
LIWC32	Communication
LIWC33	Other reference
LIWC34	Friends
LIWC35	Family
LIWC36	Humans
LIWC37	Time
LIWC38	Past
LIWC39	Present
LIWC40	Future
LIWC41	Space
LIWC42	Up
LIWC43	Down
LIWC44	Incl
LIWC45	Excl
LIWC46	Motion
LIWC47	Occup
LIWC48	School
LIWC49	Job
LIWC50	Achieve
LIWC51	Leisure
LIWC52	Home
LIWC53	Sports
LIWC54	TV
LIWC55	Music
LIWC56	Money
LIWC57	Metaph
LIWC58	Relig
LIWC59	Death
LIWC60	Physical
LIWC61	Body
LIWC62	Sex
LIWC63	Eat
LIWC64	Sleep
LIWC65	Grooming
LIWC66	Swear
LIWC67	Non-fluency
LIWC68	Filler

Token display as pseudo-annotation

In order to view the results of tokenization and explore the word features, token features can be displayed in a similar way to annotations by selecting the label group Tokens. Buttons for all token classes and token features appearing in the current document are then displayed.

By hovering the mouse cursor over a token class or feature button, the explanation of that feature is displayed as a tooltip.
By clicking a token class or feature button, all tokens having that class or feature are highlighted in the text.
By hovering the mouse cursor over the document, the token class and features of the token under the cursor are displayed as a tooltip.

Implementation Details

FemSMA Database

All FemSMA corpus data is contained in an SQLite relational database. This allows for

easy addition of new resources
easy maintenance of annotations originating from different annotators and for different label sets
easy searching of the corpus data with flexible search criteria

Database Schema

CREATE TABLE annotation (
  msg_id        integer,
  label_id      integer,
  annotator_id  integer,
  start         integer,
  end           integer
);
CREATE TABLE annotator (
  id            integer primary key autoincrement,
  name          text,
  login         text
);
CREATE TABLE gender (
  user_id       integer,
  gender        char,
  how           text,         -- how the gender information was established
  assigned_by   text
);
CREATE TABLE label (
  id            integer primary key autoincrement,
  group_id      integer,
  name          text,
  descr         text,
  color         text
);
CREATE TABLE label_group (
  id            integer primary key autoincrement,
  descr         text
);
CREATE TABLE msg (
  id            integer primary key autoincrement,
  resource_id   integer,
  relpos        integer,    -- relative position within the resource
  date          datetime,
  user          text,
  text          text
);
CREATE TABLE quotation (
  msg_id        integer,
  relpos        integer,    -- relative position within the msg
  text          text
);
CREATE TABLE resource (
  id            integer primary key autoincrement,
  type          text,
  topic         text,
  descr         text,
  code          text,
  host          text,
  dominance     char,       -- F, M, G
  length_restr  integer,
  url           text,
  file          text
);
CREATE TABLE user (
  id            integer primary key autoincrement,
  host          text,
  screen_name   text,
  name          text
);
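
As an illustration of how the schema can be queried, the following Python sketch builds a list of annotated phrases ordered by label group, label, and phrase, similar to the annotation list described in the Search section. It rests on assumptions that the schema itself does not state: start and end are taken to be zero-based character offsets into msg.text with end exclusive, the sample rows are invented, and the function name is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE label_group (id integer primary key autoincrement, descr text);
CREATE TABLE label (id integer primary key autoincrement, group_id integer,
                    name text, descr text, color text);
CREATE TABLE msg (id integer primary key autoincrement, resource_id integer,
                  relpos integer, date datetime, user text, text text);
CREATE TABLE annotation (msg_id integer, label_id integer, annotator_id integer,
                         "start" integer, "end" integer);
-- invented sample data
INSERT INTO label_group (descr) VALUES ('Gender index');
INSERT INTO label (group_id, name) VALUES (1, 'self_reference');
INSERT INTO msg (text) VALUES ('Also ich als Frau finde das gut'), ('wir als Eltern');
INSERT INTO annotation VALUES (1, 1, 1, 5, 17), (2, 1, 1, 0, 14);
""")

def annotation_list():
    """Return (label group, label, phrase) rows in the order of the annotation list."""
    rows = con.execute("""
        SELECT g.descr, l.name,
               substr(m.text, a."start" + 1, a."end" - a."start") AS phrase
        FROM annotation AS a
        JOIN label AS l ON l.id = a.label_id
        JOIN label_group AS g ON g.id = l.group_id
        JOIN msg AS m ON m.id = a.msg_id
        ORDER BY g.descr, l.name, phrase
    """)
    return rows.fetchall()

for row in annotation_list():
    print(row)
```

SQLite's substr() counts positions from 1, hence the "+ 1" when converting from the assumed zero-based start offset.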