There are a number of data files involved in this challenge. Each type of file is available for each language.
First and foremost, there is a list of word forms. The words have been extracted from a text corpus, and each word in the list is preceded by its frequency in the corpus used.
For instance, a subset of the supplied English word list looks like this:
...
1 barefoot's
2 barefooted
6699 feet
653 flies
2939 flying
1782 foot
64 footprints
...
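The word list format described above (a frequency followed by the word) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_word_list` is hypothetical, and it assumes one entry per line with whitespace between the frequency and the word.

```python
def parse_word_list(lines):
    """Parse 'frequency word' lines into a dict mapping word -> frequency.

    Assumption: one entry per line, frequency first, then whitespace, then the word.
    """
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count, word = line.split(None, 1)  # split on the first run of whitespace
        freqs[word] = int(count)
    return freqs


sample = ["1 barefoot's", "2 barefooted", "6699 feet", "653 flies"]
word_freqs = parse_word_list(sample)
```

The frequencies are only needed on the input side; the list returned to the organizers must not contain them.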
The participants' task is to return a list containing exactly the same words as in the input, with morpheme analyses provided for each word. The list returned shall not contain the word frequency information.
A submission for the above English words may look like this:
...
barefoot's	BARE FOOT +GEN
barefooted	BARE FOOT +PAST
feet	FOOT +PL
flies	FLY_N +PL, FLY_V +3SG
flying	FLY_V +PCP1
foot	FOOT
footprints	FOOT PRINT +PL
...
There are a number of things to note about the result file:
Each line of the file contains a word (e.g., "feet") separated from its analysis (e.g., "FOOT +PL") by one TAB character. The word needs to look exactly as it does in the input; no capitalization or change of character encoding is allowed.
The analysis contains morpheme labels separated by spaces. The order in which the labels appear does not matter; e.g., "FOOT +PL" is equivalent to "+PL FOOT".
The labels are arbitrary: e.g., instead of using "FOOT" you might use "morpheme784" and instead of "+PL" you might use "morpheme2". However, we strongly recommend that you use intuitive labels when possible, since they make it easier for anyone to get an idea of the quality of the result by looking at it.
If a word has several interpretations, all interpretations should be supplied: e.g., the word "flies" may be the plural form of the noun "fly" (insect) or the third person singular present tense form of the verb "to fly". The alternative analyses must be separated using a comma, as in: "FLY_N +PL, FLY_V +3SG". The existence of alternative analyses makes the task challenging, and we leave it to the participants to decide how much effort they will put into this aspect of the task. In English, for instance, in order to get a perfect score, it would be necessary to distinguish the different functions of the ending "-s" (plural or person ending) as well as the different parts-of-speech of the stem "fly" (noun or verb). As the results will be evaluated against reference analyses (our so-called gold standard), it is worth reading about the guiding principles used when constructing the gold standard.
As far as we understand, you can use any characters in your morpheme labels except whitespace and comma (,). However, we cannot guarantee that the evaluation scripts will work properly, if your labels contain some "strange" characters.
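The formatting rules above can be checked mechanically before submitting. The following is a minimal sketch, not the organizers' code: the helper names `format_result_line` and `parse_result_line` are hypothetical, and comparing analyses as sorted tuples is just one way to express that label order does not matter.

```python
def format_result_line(word, analyses):
    """Build one result-file line: word, TAB, comma-separated analyses.

    `analyses` is a list of label lists, one list per alternative analysis.
    """
    for labels in analyses:
        for label in labels:
            # Labels may not contain whitespace or commas.
            if not label or "," in label or any(ch.isspace() for ch in label):
                raise ValueError(f"invalid morpheme label: {label!r}")
    return word + "\t" + ", ".join(" ".join(labels) for labels in analyses)


def parse_result_line(line):
    """Return (word, analyses), where analyses is a set of sorted label tuples."""
    word, analysis_field = line.rstrip("\n").split("\t", 1)
    analyses = {tuple(sorted(part.split())) for part in analysis_field.split(",")}
    return word, analyses


line = format_result_line("flies", [["FLY_N", "+PL"], ["FLY_V", "+3SG"]])
```

Because each analysis is stored as a sorted tuple, `parse_result_line("feet\tFOOT +PL")` and `parse_result_line("feet\t+PL FOOT")` compare equal.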
The word list (input data) has been constructed by collecting word forms occurring in text corpora. The text corpora have been obtained by combining collections from the University of Leipzig (Germany), CLEF, and the Europarl corpus. The corpus sizes are 18.8 million sentences for English, 6.6 million for Finnish, 9.7 million for German, and 1 million for Turkish. The corpora have been preprocessed for the Morpho Challenge (tokenized, lower-cased, some conversion of character encodings).
If participants wish, they can use the corpora to obtain information about the contexts in which the different words occur.
The desired "correct" analyses for a random sample of circa 1000 words are supplied for each language. These samples can be used as a training set for a semi-supervised algorithm. Also given is a development set that can be used to get a rough estimate of the performance of the participants' morpheme-analyzing algorithm. If independent training and development sets are not needed by the participant, a combined set can be used as a larger training or development set.
The format of the gold standard file is exactly the same as that of the result file to be submitted. That is, each line contains a word and its analysis. The word is separated from the analysis by a TAB character. Morpheme labels in the analysis are separated from each other by a space character. For some words there are multiple correct analyses. These alternative analyses are separated by a comma (,). Examples:
|Language||Word||Analysis|
|English||baby-sitters||baby_N sit_V er_s +PL|
|English||indoctrinated||in_p doctrine_N ate_s +PAST|
|Finnish||linuxiin||linux_N +ILL|
|Finnish||makaronia||makaroni_N +PTV|
|German||choreographische||choreographie_N isch +ADJ-e|
|German||zurueckzubehalten||zurueck_B zu be halt_V +INF|
|Turkish||kontrole||kontrol +DAT|
|Turkish||popUlerliGini||popUler +DER_lHg +POS2S +ACC, popUler +DER_lHg +POS3 +ACC3|
The English and German gold standards are based on the CELEX database. The Finnish gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft, Inc. The Turkish gold-standard analyses have been obtained from a morphological parser developed at Boğaziçi University; it is based on Oflazer's finite-state machines, with a number of changes. We are indebted to Ebru Arısoy for making the Turkish gold standard available to us.
The morphological analyses are morpheme analyses. This means that only grammatical categories that are realized as morphemes are included. For instance, for none of the languages will you find a singular morpheme for nouns or a present-tense morpheme for verbs, because these grammatical categories do not alter or add anything to the word form, in contrast to, e.g., the plural form of a noun (house vs. house+s), or the past tense of verbs (help vs. help+ed, come vs. came).
The morpheme labels that correspond to inflectional (and sometimes also derivational) affixes have been marked with an initial plus sign (e.g., +PL, +PAST). This is due to a feature of the evaluation script: in addition to the overall performance statistics, evaluation measures are also computed separately for the labels starting with a plus sign and those without an initial plus sign. It is thus possible to make an approximate assessment of how accurately affixes are analyzed vs. non-affixes (mostly stems). If you use the same naming convention when labeling the morphemes proposed by your algorithm, these statistics will also be available for your output (see the evaluation page for more information).
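The plus-sign convention makes it easy to separate the two groups of labels yourself. The sketch below only illustrates the naming convention; `split_labels` is a hypothetical helper and is not part of the evaluation script.

```python
def split_labels(analysis):
    """Split one analysis string into (affix labels, non-affix labels).

    Affix labels are those that start with a plus sign, e.g. '+PL'.
    """
    labels = analysis.split()
    affixes = [label for label in labels if label.startswith("+")]
    non_affixes = [label for label in labels if not label.startswith("+")]
    return affixes, non_affixes


affixes, non_affixes = split_labels("baby_N sit_V er_s +PL")
```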
The morpheme labels that have not been marked as affixes (no initial plus sign) are typically stems. These labels consist of an intuitive string, usually followed by an underscore character (_) and a part-of-speech tag, e.g., "baby_N", "sit_V". In many cases, especially in English, the same morpheme can function as different parts-of-speech; e.g., the English word "force" can be a noun or a verb. In the majority of these cases, however, if there is only a difference in syntax (and not in meaning), the morpheme has been labeled as either a noun or a verb, throughout. For instance, the "original" part-of-speech of "force" is a noun, and consequently both noun and verb inflections of "force" contain the morpheme "force_N":
|forces||force_N +3SG, force_N +PL|
Thus, there is not really a need for your algorithm to distinguish between different meanings or syntactic roles of the discovered stem morphemes. However, in some rare cases, if the meanings of the different parts-of-speech do differ clearly, there are two variants, e.g., "train_N" (vehicle), "train_V" (to teach), "fly_N" (insect), "fly_V" (to move through the air). But again, if there are ambiguous meanings within the same part-of-speech, these are not marked in any way, e.g., "fan_N" (device for producing a current of air) vs. "fan_N" (admirer). This notation is a consequence of using CELEX and FINTWOL as the sources for our gold standards. We could have removed the part-of-speech tags, but we decided to leave them there, since they carry useful information without making the task significantly more difficult. There are no part-of-speech tags in the Turkish gold standard.
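If you prefer to work with stem labels without their tags, the underscore convention can be undone as sketched below. The helper `split_stem_label` is hypothetical, and treating everything after the last underscore as the tag is an assumption based on the examples shown here, not a documented rule.

```python
def split_stem_label(label):
    """Split a non-affix label such as 'baby_N' into ('baby', 'N').

    Assumption: the tag is whatever follows the last underscore.
    """
    if label.startswith("+"):
        return label, None  # affix labels are left untouched
    name, sep, tag = label.rpartition("_")
    if not sep:
        return label, None  # no underscore, e.g. Turkish 'kontrol'
    return name, tag
```

For example, `split_stem_label("force_N")` yields `("force", "N")`, while a Turkish label such as `"kontrol"` is returned unchanged with `None` as its tag.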
If you want to run evaluation with the development set, you need to download a randomly generated so-called word pairs file for each language to be tested. The word pairs files for the development set contain 300 randomly selected words that are used for estimating recall. For each morpheme of these words, there is another randomly selected word that contains the same morpheme. Read more about this on the evaluation page.
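The idea behind the word pairs can be illustrated as follows. This sketch only mirrors the description above (sample some words, then pair each of their morphemes with another random word sharing that morpheme); it does not reproduce the format of the provided word pairs files or the official sampling procedure, and the function name `sample_word_pairs`, its single-analysis input, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def sample_word_pairs(gold, num_words, seed=0):
    """Pair each morpheme of sampled words with another word sharing it.

    `gold` maps a word to a list of morpheme labels (one analysis per word,
    a simplification). Returns word -> list of (label, partner word or None).
    """
    rng = random.Random(seed)
    by_morpheme = defaultdict(set)
    for word, labels in gold.items():
        for label in labels:
            by_morpheme[label].add(word)

    pairs = {}
    for word in rng.sample(sorted(gold), min(num_words, len(gold))):
        partners = []
        for label in gold[word]:
            candidates = sorted(by_morpheme[label] - {word})
            # None marks a morpheme that no other word shares.
            partners.append((label, rng.choice(candidates) if candidates else None))
        pairs[word] = partners
    return pairs


gold = {"feet": ["FOOT", "+PL"], "foot": ["FOOT"], "flies": ["FLY_N", "+PL"]}
pairs = sample_word_pairs(gold, num_words=3)
```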
For some of the languages, we also provide segmentations corresponding to the surface forms of the morpheme labels. The gold standard segmentations should be useful for semi-supervised training of algorithms based solely on segmentation.
The segmentation files contain the segments and their morpheme labels combined. The format is similar to that of the result file, but for each morpheme, there are two parts separated by a colon (:). The surface form is on the left side of the colon, and the morpheme label is on the right side. If either the surface form or the morpheme does not exist, it is marked by a tilde (~). For example, the English file contains the following lines:
...
adversaries	advers:adverse_A ari:ary_s es:+PL
...
bit	bit:bite_V ~:+PAST, bit:bit_N
...
Here, the past tense of "bite" does not have a corresponding segment. If there is a colon in the word form, it is backslashed (\:) in the segment:
...
mtk:hon	mtk\::mtk hon:+ILL
...
If you want only the segments, the following UNIX shell command should do the trick:
cat goldstd_trainset.segmentation.eng | sed 's/\([^\]\):[^ , ]*\( \|,\|$\)/\1\2/g' | \
  sed 's/\\:/:/g' | sed 's/~ *//g' | sed 's/ *,/,/g' | sed 's/ *$//' > goldstd_trainset.segments.eng
Note: The whitespace between "," and "]" in the first sed command should be a single TAB character.
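The same extraction can be done in Python, which also gives access to the surface/label pairs. This is a sketch under assumptions: the word and its analysis are taken to be separated by a TAB, as in the result file, and the names `parse_segmentation_line` and `segments_only` are hypothetical.

```python
import re

# A colon that is not preceded by a backslash separates surface form and label.
UNESCAPED_COLON = re.compile(r"(?<!\\):")


def parse_segmentation_line(line):
    """Return (word, analyses); each analysis is a list of (surface, label).

    A tilde on either side is returned as None; '\\:' is unescaped to ':'.
    """
    word, field = line.rstrip("\n").split("\t", 1)
    analyses = []
    for alternative in field.split(","):
        pairs = []
        for token in alternative.split():
            surface, label = UNESCAPED_COLON.split(token, maxsplit=1)
            surface = None if surface == "~" else surface.replace("\\:", ":")
            label = None if label == "~" else label.replace("\\:", ":")
            pairs.append((surface, label))
        analyses.append(pairs)
    return word, analyses


def segments_only(analyses):
    """Keep only the surface segments, dropping morphemes without a segment."""
    return [[s for s, _ in pairs if s is not None] for pairs in analyses]


word, analyses = parse_segmentation_line("bit\tbit:bite_V ~:+PAST, bit:bit_N")
```

Unlike the sed pipeline, this keeps the morpheme labels alongside the segments, so both views are available from a single pass over the file.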
While the words are the same as in the files containing only the morpheme labels, please note that the given analyses may be different, due to the morphemes that do not correspond to any segment and the segments that do not correspond to any morpheme.
English and Finnish segmentations are based on the Hutmegs 1.0 package. Ebru Arısoy has provided the Turkish segmentations. Unfortunately, there are no segmentations available for German.
In the source data used for the different languages, there is variation in how accurately certain distinctions are made when letters are rendered. This makes it hard to apply a unified character encoding scheme for all the languages (such as UTF-8). Thus, the following encodings have been used, in which all letters are encoded as one-byte (8-bit) characters:
For strictly unsupervised learning, you need only the word lists. For semi-supervised learning, you can use either the gold standard segmentations (for those languages for which they are available) or the gold standard labels directly (for any language). Furthermore, there are two independent sets for semi-supervised learning: the training set is meant to be used directly in training, whereas the development set is meant to be used for tuning the (possible) parameters of the learning method. If you need only one of them, you can use the combined set instead. In the extended abstract of your submission, you should clearly state which of the data sets were used and for what purpose.
|Language||Word list||Training set (Gold standard labels)||Training set (Gold standard segmentations)||Development set (Gold standard labels)||Development set (word pairs file)||Combined Training+Development set (Gold standard labels)||Combined Training+Development set (Gold standard segmentations)|
|German||Text, Text gzipped||Text||(not available)||Text||Text||Text||(not available)|
Page maintained by webmaster at cis.hut.fi, last updated Tuesday, 29-Jun-2010 09:47:03 EEST