Abstracts of Accepted Papers
- Morphology within the Multi-Layered Annotation Scenario of the Prague Dependency Treebank (invited talk)
Morphological annotation constitutes a separate layer in the multi-layered annotation scenario of the Prague Dependency Treebank. At this layer, morphological categories expressed by a word form are captured in a positional part-of-speech tag. According to the Praguian approach based on the relation between form and function, functions (meanings) of morphological categories are represented as well, namely as grammateme attributes at the deep-syntactic (tectogrammatical) layer of the treebank.
In the present paper, we first describe the role of morphology in the Prague Dependency Treebank, and then outline several recent topics based on Praguian morphology: named entity recognition in Czech, formeme attributes encoding morpho-syntactic information in a dependency-based machine translation system, and the development of a lexical database of derivational relations based partially on information provided by the morphological analyser.
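To make the idea of a positional part-of-speech tag concrete, here is a minimal sketch that decodes one. The 15 position names and the sample tag are assumptions based on the commonly documented Czech positional tagset, not details taken from this abstract, and should be checked against the treebank documentation.

```python
# Position names are an ASSUMPTION (commonly documented 15-position Czech
# tagset); they are not given in the abstract itself.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag: str) -> dict[str, str]:
    """Map each filled position of a positional tag to its category name."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    # "-" marks a category that does not apply to this word form.
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # Hypothetical sample tag: noun, feminine, singular, nominative, affirmative.
    print(decode_tag("NNFS1-----A----"))
```

Each character position stands for one morphological category, so a tag can be read without a lookup table of whole-tag names.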
- Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language
We consider the statistical lemmatization problem in which lemmatizers are trained on (word form, lemma) pairs. In particular, we consider this problem for ancient Latin, a language with a high degree of morphological variability. We investigate whether general purpose string-to-string transduction models are suitable for this task, and find that they typically perform (much) better than more restricted lemmatization techniques/heuristics based on suffix transformations. We also experimentally test whether string transduction systems that perform well on one string-to-string translation task (here, G2P) perform well on another (here, lemmatization) and vice versa, and find that a joint n-gram model performs better on G2P than a discriminative model of our own making but that this relationship is reversed for lemmatization. Finally, we investigate how the learned lemmatizers can complement lexicon-based systems, e.g., by tackling the OOV and/or the disambiguation problem.
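As an illustration of the restricted, suffix-transformation style of lemmatizer that the abstract uses as a point of comparison, the following sketch learns suffix rewrite rules from (word form, lemma) pairs. It is a toy baseline under assumed names and data, not the authors' system or any of the transduction models they evaluate.

```python
from collections import Counter, defaultdict
from os.path import commonprefix


def extract_rule(form: str, lemma: str) -> tuple[str, str]:
    """Strip the longest common prefix; return (form_suffix, lemma_suffix)."""
    k = len(commonprefix([form, lemma]))
    return form[k:], lemma[k:]


def train(pairs):
    """Count how often each form suffix is rewritten to each lemma suffix."""
    rules = defaultdict(Counter)
    for form, lemma in pairs:
        src, tgt = extract_rule(form, lemma)
        rules[src][tgt] += 1
    return rules


def lemmatize(form: str, rules) -> str:
    """Apply the most frequent rule for the longest matching suffix."""
    for start in range(len(form) + 1):  # longest suffix first, empty suffix last
        src = form[start:]
        if src in rules:
            tgt = rules[src].most_common(1)[0][0]
            return form[:start] + tgt
    return form  # no rule matched: return the form unchanged


if __name__ == "__main__":
    # Toy training pairs, chosen for illustration only.
    toy_pairs = [("amabat", "amo"), ("laudabat", "laudo"),
                 ("monebat", "moneo"), ("rosam", "rosa")]
    toy_rules = train(toy_pairs)
    for word in ["portabat", "videbat", "puellam"]:
        print(word, "->", lemmatize(word, toy_rules))
```

Because such rules only ever touch the end of the word, they cannot capture stem-internal changes, which is one reason general string-to-string models may do better.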
- Morphological Disambiguation of Classical Sanskrit
Sanskrit, the sacred language of Ancient India, is a morphologically rich Indo-Iranian language that has received some attention in NLP during the last decade. This paper describes a system for the tokenization and morphosyntactic analysis of Sanskrit. The system combines a morphological rule base with a statistical selection of the most probable analysis of an input text. After an introduction to the research history and the linguistic peculiarities of Sanskrit that are relevant to the task, the paper describes the present architecture of the system and new extensions that increase its accuracy when analyzing morphologically ambiguous forms. The algorithms are tested on a gold-annotated data set of 3,587,000 words.
- A Multi-Purpose Bayesian Model for Word-Based Morphology
This paper introduces a probabilistic model of morphology based on a word-based morphological theory. Morphology is understood here as a system of rules that describe systematic correspondences between full word forms, without decomposing words into any smaller units. The model is formulated in the Bayesian learning framework and can be trained in both supervised and unsupervised settings. Evaluation is performed on the tasks of generating unseen words, lemmatization and inflected form production.
- Using HFST-Helsinki Finite-State Technology for Recognizing Semantic Frames
(Krister Lindén, Sam Hardwick, Miikka Silfverberg, and Erik Axelson)
To recognize semantic frames in languages with a rich morphology, we need computational morphology. In this paper, we look at one particular framework, HFST (Helsinki Finite-State Technology), and how to use it for recognizing semantic frames in context. HFST enables tokenization, morphological analysis, tagging and frame annotation in one single framework.
- Morpho-SLaWS: An API for the Morphosyntactic Annotation of the Serbian Language
(Toma Tasovac, Saša Rudan and Siniša Rudan)
Serbian Lexical Web Service (SLaWS) is a resource-oriented web service designed to offer multiple functionalities, including morphosyntactic, lexicographic and canonical text services, to create the backbone of a digital humanities infrastructure for the Serbian language. In this paper, we describe a key component of this service called Morpho-SLaWS, the atomic morphosyntactic component of the service infrastructure. The goal of Morpho-SLaWS is to offer a reliable, programmatic way of extracting morphosyntactic information about word forms using a revised version of the MULTEXT-East specification. As a service-oriented lexical tool, Morpho-SLaWS can be deployed in a variety of contexts and combined with other linguistic and DH tools.
- A Universal Feature Schema for Rich Morphological Annotation
(John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que)
Semantically detailed and typologically informed morphological analysis that is broadly applicable cross-linguistically has the potential to improve many NLP applications, including machine translation, n-gram language models, information extraction, and co-reference resolution. In this paper, we present a universal morphological feature schema, which is a set of features that represent the finest distinctions in meaning that are expressed by inflectional morphology across languages. We first present the schema’s guiding theoretical principles, construction methodology, and contents. We then present a method of measuring cross-linguistic variability in the semantic distinctions conveyed by inflectional morphology along the multiple dimensions spanned by the schema. This method relies on representing inflected wordforms from many languages in our universal feature space, and then testing for agreement across multiple aligned translations of pivot words in a parallel corpus (the Bible). The results of this method are used to assess the effectiveness of cross-linguistic projection of a multilingual consensus of these fine-grained morphological features, both within and across language families. We find high cross-linguistic agreement for a diverse range of semantic dimensions expressed by inflectional morphology.
- Morphological Analysis and Generation of Monolingual and Bilingual Medical Lexicons
(Serena Pelosi, Annibale Elia and Alessandro Maisto)
Efficiently extracting and managing extremely large quantities of meaningful data in a sensitive sector such as healthcare calls for sophisticated linguistic strategies and computational solutions. In this research we address the semantics of the word-formation elements of medical terms in monolingual and bilingual settings. The purpose is to automatically build Italian-English medical lexical resources by grounding their analysis and generation on the manipulation of the morphemes of which the terms are formed. This approach has a significant impact on the automatic analysis of neologisms, which are typical of the medical domain. In detail, we created two electronic dictionaries of morphemes and a morphological Finite State Transducer which together, by finding every possible combination of Prefixes, Confixes and Suffixes, are able to annotate/translate the terms contained in a medical corpus according to the meaning of the morphemes that compose them. To enable the machine to “understand” medical multiword expressions as well, we designed a syntactic grammar net that includes several paths based on different combinations of nouns, adjectives and prepositions.
- Dsolve – Morphological Segmentation for German using Conditional Random Fields
(Kay-Michael Würzner and Bryan Jurish)
We describe Dsolve, a system for the segmentation of morphologically complex German words into their constituent morphs. Our approach treats morphological segmentation as a classification task, in which the locations and types of morph boundaries are predicted by a Conditional Random Field model trained from manually annotated data. The prediction of morph-boundary types in addition to their locations distinguishes Dsolve from similar approaches previously suggested in the literature. We show that the use of boundary types provides a (somewhat counter-intuitive) performance boost with respect to the simpler task of predicting only segment locations.
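Treating segmentation as classification means giving every character a label that says whether a morph boundary follows it and, if so, of what type. The sketch below shows one possible encoding and its inverse. The boundary symbols, label names and example words are hypothetical, and no CRF training is shown. It only illustrates the difference between typed and untyped boundary labels.

```python
# Hypothetical annotation symbols and label names, chosen for illustration.
BOUNDARY_SYMBOLS = {"~": "PREFIX", "+": "SUFFIX", "#": "COMPOUND"}


def to_labels(segmented: str) -> list[tuple[str, str]]:
    """Turn 'un~klar+heit' into (character, label) pairs.

    The label names the boundary type that follows the character,
    or 'NONE' if no morph boundary follows it.
    """
    pairs: list[tuple[str, str]] = []
    for ch in segmented:
        if ch in BOUNDARY_SYMBOLS:
            if not pairs:
                raise ValueError("boundary symbol before the first character")
            pairs[-1] = (pairs[-1][0], BOUNDARY_SYMBOLS[ch])
        else:
            pairs.append((ch, "NONE"))
    return pairs


def to_segmented(pairs, typed: bool = True) -> str:
    """Rebuild a segmented string; with typed=False every boundary becomes '|'."""
    inverse = {label: symbol for symbol, label in BOUNDARY_SYMBOLS.items()}
    out = []
    for ch, label in pairs:
        out.append(ch)
        if label != "NONE":
            out.append(inverse[label] if typed else "|")
    return "".join(out)


if __name__ == "__main__":
    labels = to_labels("un~klar+heit")
    print(labels)
    print(to_segmented(labels, typed=False))
```

A sequence labeller trained on the typed labels has a harder target than one trained on boundary locations alone. The abstract reports that, somewhat counter-intuitively, the typed setting nonetheless performs better.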
- Morphological Analysis and Generation for Pali
(David Alfter and Jürgen Knauth)
In this paper we describe a system that performs morphological generation and analysis for Pali. We discuss the morphological aspects of the tasks our system performs, with emphasis on Pali-specific characteristics and difficulties, and present insights into how this system is integrated into a technical infrastructure used in research on Pali.
- Grammar Debugging
Perhaps the dominant method for building morphological parsers is to use finite state transducer toolkits. The problem with this approach is that finite state transducers require one to think of grammar writing as a programming task, rather than as providing a declarative linguistic description. We have therefore developed a method for representing the morphology and phonology of natural languages in a way which is closer to traditional linguistic descriptions, together with a method for automatically converting these descriptions into parsers, thus allowing the linguistic descriptions to be tested against real language data.
But there is a drawback to this approach: the fact that the descriptive level is different from the implementation level makes debugging of the grammars difficult, and in particular it provides no aid to visualizing the steps in deriving surface forms from underlying forms. We have therefore developed a debugging tool, which allows the linguist to see each intermediate step in the generation of words, without needing to know anything about the finite state implementation. The tool runs in generation mode; that is, the linguist provides an expected parse, and the debugger shows how that underlying form is converted into a surface form given the grammar. (Debugging in the opposite direction—starting from an expected surface form—might seem more natural, but in fact is much harder if that form cannot be parsed, as presumably it cannot be if the grammar needs debugging.)
The tool allows tracing the application of feature checking constraints (important when there is multiple exponence) and phonological rules. It will soon allow viewing the application of suppletive allomorphy constraints, although we describe some theoretical linguistic issues with how the latter should work. The tool can be run from the command line (useful when repeatedly testing the same wordforms while tweaking the grammar), or from a Graphical User Interface (GUI) which prompts the user for the necessary information. The output can be displayed in a browser.
In addition to its use in debugging, the debugger could have an educational use in explicating the forms in a paradigm chart: each cell of the paradigm could be run through the debugger to produce the cell’s derivation, showing how forms which might seem counter-intuitive or irregular are derived. We have not yet implemented this.
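The kind of generation-mode trace described in this abstract can be illustrated with a toy sketch: an underlying form is passed through an ordered list of rewrite rules, and every step that changes the form is recorded. The rule names, patterns and example forms below are invented for illustration. The sketch uses plain regular expressions instead of finite-state machinery and is not the tool described.

```python
import re

# Toy, hypothetical rules for an English-like plural; "+" marks a morpheme boundary.
TOY_RULES = [
    ("epenthesis", r"([sz])\+z", r"\1+ez"),   # insert a vowel after a sibilant
    ("devoicing", r"([ptk])\+z", r"\1+s"),    # z becomes s after a voiceless stop
    ("boundary deletion", r"\+", ""),         # remove the boundary marker
]


def trace_derivation(underlying: str, rules=TOY_RULES) -> list[tuple[str, str]]:
    """Apply the rules in order and record each step that changes the form."""
    steps = [("underlying form", underlying)]
    form = underlying
    for name, pattern, replacement in rules:
        new_form = re.sub(pattern, replacement, form)
        if new_form != form:
            steps.append((name, new_form))
        form = new_form
    return steps


if __name__ == "__main__":
    for word in ["bus+z", "cat+z", "dog+z"]:
        for rule_name, form in trace_derivation(word):
            print(f"{rule_name:>18}: {form}")
        print()
```

Running every cell of a paradigm through such a trace would give the per-cell derivations that the abstract suggests for educational use.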