ISSN: 1135-5948

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014. ISSN: 1135-5948

Artículos

- Learning to map variation-standard forms using a limited parallel corpus and the standard morphology. Izaskun Etxeberria, Iñaki Alegria, Mans Hulden, Larraitz Uria
- Fénix: a flexible information exchange data model for natural language processing. José M. Gómez, David Tomás, Paloma Moreda
- Collecting and POS-tagging a lexical resource of Japanese biomedical terms from a corpus. Carlos Herrero-Zorita, Leonardo Campillos-Llanos, Antonio Moreno-Sandoval
- TASS: A Second Step in Reputation Analysis in Spanish. Julio Villena-Román, Janine García-Morera, Sara Lana-Serrano, José Carlos González-Cristóbal
- Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models. Pablo Ruiz, Montse Cuadros, Thierry Etchegoyhen
- From constituents to syntax-oriented dependencies. Benjamin Kolz, Toni Badia, Roser Saurí
- Automatic prediction of emotions from text in Spanish for expressive speech synthesis in the chat domain. Benjamin Kolz, Juan María Garrido, Yesika Laplaza
- Función de las secuencias narrativas en la clasificación de la polaridad de reviews. John Roberto, Maria Salamó Llorente y Maria Antònia Martí Antonín
- New experiments on speaker diarization for unsupervised speaking style voice building for speech synthesis. B. Martínez-González, J. M. Pardo, J. D. Echeverry-Correa, J. M. Montero

Tesis

- Diseño y generación semi-automática de patrones adaptables para el Reconocimiento de Entidades. Mónica Marrero
- Oral Expression in Spanish as a Foreign Language: Interlanguage and Corpus-Based Error Analysis. Leonardo Campillos Llanos
- Etiquetación y desambiguación automáticas en gallego: el sistema XIADA. Eva María Domínguez Noya

Información General

- XXX Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
- Información para los autores
- Impresos de inscripción para empresas
- Impresos de inscripción para socios
- Información adicional

Sociedad Española para el Procesamiento del Lenguaje Natural


Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014. ISSN: 1135-5948

Comité Editorial

Consejo de redacción:
- L. Alfonso Ureña López, Universidad de Jaén (Director)
- Patricio Martínez Barco, Universidad de Alicante (Secretario)
- Manuel Palomar Sanz, Universidad de Alicante
- Felisa Verdejo Maillo, UNED

ISSN: 1135-5948
ISSN electrónico:
Depósito Legal: B:
Editado en: Universidad de Jaén
Año de edición: 2014

Editores:
- Mariona Taulé Delor, Universidad de Barcelona
- Mª Teresa Martín Valdivia, Universidad de Jaén

Publicado por: Sociedad Española para el Procesamiento del Lenguaje Natural. Departamento de Informática, Universidad de Jaén, Campus Las Lagunillas, Edificio A3, Despacho, Jaén.

Consejo asesor:
- José Gabriel Amores, Universidad de Sevilla
- Toni Badía, Universidad Pompeu Fabra
- Manuel de Buenaga, Universidad Europea de Madrid
- Irene Castellón, Universidad de Barcelona
- Arantza Díaz de Ilarraza, Universidad del País Vasco
- Antonio Ferrández, Universidad de Alicante
- Mikel Forcada, Universidad de Alicante
- Ana García-Serrano, UNED
- Koldo Gojenola, Universidad del País Vasco
- Xavier Gómez Guinovart, Universidad de Vigo
- Julio Gonzalo, UNED
- José Miguel Goñi, Universidad Politécnica de Madrid
- José Mariño, Universidad Politécnica de Cataluña
- M. Antonia Martí, Universidad de Barcelona
- M. Teresa Martín, Universidad de Jaén
- Patricio Martínez-Barco, Universidad de Alicante
- Raquel Martínez, UNED
- Lidia Moreno, Universidad Politécnica de Valencia
- Lluís Padró, Universidad Politécnica de Cataluña
- Manuel Palomar, Universidad de Alicante

2014 Sociedad Española para el Procesamiento del Lenguaje Natural

- Ferrán Pla, Universidad Politécnica de Valencia
- German Rigau, Universidad del País Vasco
- Horacio Rodríguez, Universidad Politécnica de Cataluña
- Kepa Sarasola, Universidad del País Vasco
- Emilio Sanchís, Universidad Politécnica de Valencia
- Mariona Taulé, Universidad de Barcelona
- L. Alfonso Ureña, Universidad de Jaén
- Felisa Verdejo, UNED
- Manuel Vilares, Universidad de A Coruña
- Ruslan Mitkov, Universidad de Wolverhampton, Reino Unido
- Sylviane Cardey-Greenfield, Centre de recherche en linguistique et traitement automatique des langues, Francia
- Leonel Ruiz Miyares, Centro de Lingüística Aplicada de Santiago de Cuba, Cuba
- Luis Villaseñor-Pineda, Instituto Nacional de Astrofísica, Óptica y Electrónica, México
- Manuel Montes y Gómez, Instituto Nacional de Astrofísica, Óptica y Electrónica, México
- Alexander Gelbukh, Instituto Politécnico Nacional, México
- Nuno J. Mamede, Instituto de Engenharia de Sistemas e Computadores, Portugal
- Bernardo Magnini, Fondazione Bruno Kessler, Italia
- Juan-Manuel Torres-Moreno, Laboratoire Informatique d'Avignon, Université d'Avignon, Francia

Revisores adicionales:
- Fernándo Sanchez Vega, Instituto Nacional de Astrofísica, Óptica y Electrónica, México
- Marina Lloberes, Universidad de Barcelona, España
- Victor Darriba, Universidad de Vigo, España
- Yoan Gutierrez Vazquez, Universidad de Alicante, España
- Jesús Peral Cortés, Universidad de Alicante, España
- Juan Manuel Lucas Cuesta, Universidad Politécnica de Madrid, España
- Doroteo Torre Toledano, Universidad Autónoma de Madrid, España
- Eugenio Martínez Cámara, Universidad de Jaén, España

Preámbulo

La revista "Procesamiento del Lenguaje Natural" pretende ser un foro de publicación de artículos científico-técnicos inéditos de calidad relevante en el ámbito del Procesamiento del Lenguaje Natural (PLN), tanto para la comunidad científica nacional e internacional como para las empresas del sector. Además, se quiere potenciar el desarrollo de las diferentes áreas relacionadas con el PLN, mejorar la divulgación de las investigaciones que se llevan a cabo, identificar las futuras directrices de la investigación básica y mostrar las posibilidades reales de aplicación en este campo. Anualmente la SEPLN (Sociedad Española para el Procesamiento del Lenguaje Natural) publica dos números de la revista, que incluyen artículos originales, presentaciones de proyectos en marcha, reseñas bibliográficas y resúmenes de tesis doctorales. Esta revista se distribuye gratuitamente a todos los socios y, con el fin de conseguir una mayor expansión y facilitar el acceso a la publicación, su contenido es libremente accesible por Internet.

Las áreas temáticas tratadas son las siguientes:

- Modelos lingüísticos, matemáticos y psicolingüísticos del lenguaje
- Lingüística de corpus
- Desarrollo de recursos y herramientas lingüísticas
- Gramáticas y formalismos para el análisis morfológico y sintáctico
- Semántica, pragmática y discurso
- Lexicografía y terminología computacional
- Resolución de la ambigüedad léxica
- Aprendizaje automático en PLN
- Generación textual monolingüe y multilingüe
- Traducción automática
- Reconocimiento y síntesis del habla
- Extracción y recuperación de información monolingüe, multilingüe y multimodal
- Sistemas de búsqueda de respuestas
- Análisis automático del contenido textual
- Resumen automático
- PLN para la generación de recursos educativos
- PLN para lenguas con recursos limitados
- Aplicaciones industriales del PLN
- Sistemas de diálogo
- Análisis de sentimientos y opiniones
- Minería de texto
- Evaluación de sistemas de PLN
- Implicación textual y paráfrasis

El ejemplar número 52 de la revista de la Sociedad Española para el Procesamiento del Lenguaje Natural contiene trabajos correspondientes a dos apartados diferenciados:

comunicaciones científicas y resúmenes de tesis. Todos ellos han sido aceptados mediante el proceso de revisión tradicional en la revista. Queremos agradecer a los miembros del Comité asesor y a los revisores adicionales la labor que han realizado. Se recibieron 25 trabajos para este número, de los cuales 22 eran artículos científicos y 3 correspondían a resúmenes de tesis. De entre los 22 artículos recibidos, 9 han sido finalmente seleccionados para su publicación, lo cual fija una tasa de aceptación del 40,9%. El Comité asesor de la revista se ha hecho cargo de la revisión de los trabajos. Este proceso de revisión es de doble anonimato: se mantiene oculta la identidad de los autores que son evaluados y la de los revisores que realizan las evaluaciones. En un primer paso, cada artículo ha sido examinado de manera ciega o anónima por tres revisores. En un segundo paso, para aquellos artículos que tenían una divergencia mínima de tres puntos (sobre siete) en sus puntuaciones, sus tres revisores han reconsiderado su evaluación en conjunto. Finalmente, la evaluación de aquellos artículos que estaban en una posición muy cercana a la frontera de aceptación ha sido supervisada por más miembros del comité. El criterio de corte adoptado ha sido la media de las tres calificaciones, siempre y cuando hayan sido iguales o superiores a 5 sobre 7.

Marzo de 2014
Los editores

Preamble

The Natural Language Processing journal aims to be a forum for the publication of quality unpublished scientific and technical papers on Natural Language Processing (NLP), for both the national and international scientific community and companies. Furthermore, we want to strengthen the development of the different areas related to NLP, widen the dissemination of the research carried out, identify the future directions of basic research, and demonstrate the real possibilities of application in this field. Every year, the Spanish Society for Natural Language Processing (SEPLN) publishes two issues of the journal, which include original articles, ongoing projects, book reviews and summaries of doctoral theses. All issues are distributed free of charge to all members, and their contents are freely available online.

The subject areas addressed are the following:

- Linguistic, Mathematical and Psycholinguistic Models of Language
- Grammars and Formalisms for Morphological and Syntactic Analysis
- Semantics, Pragmatics and Discourse
- Computational Lexicography and Terminology
- Linguistic Resources and Tools
- Corpus Linguistics
- Speech Recognition and Synthesis
- Dialogue Systems
- Machine Translation
- Word Sense Disambiguation
- Machine Learning in NLP
- Monolingual and Multilingual Text Generation
- Information Extraction and Information Retrieval
- Question Answering
- Automatic Text Analysis
- Automatic Summarization
- NLP Resources for Learning
- NLP for Languages with Limited Resources
- Business Applications of NLP
- Sentiment Analysis
- Opinion Mining
- Text Mining
- Evaluation of NLP Systems
- Textual Entailment and Paraphrases

The 52nd issue of the Procesamiento del Lenguaje Natural journal contains scientific papers and doctoral dissertation summaries. All of these were accepted through the journal's traditional peer-review

process. We would like to thank the Advisory Committee members and the additional reviewers for their work. Twenty-five papers were submitted for this issue, of which twenty-two were scientific papers and three were dissertation summaries. From these twenty-two papers, we finally selected nine for publication, an acceptance rate of 40.9%. The Advisory Committee of the journal has reviewed the papers in a double-blind process: the identities of the authors and of the reviewers are hidden from each other. In the first step, each paper was reviewed blindly by three reviewers. In the second step, the three reviewers gave a second overall evaluation to those papers with a difference of three or more points (out of seven) in their individual reviewer scores. Finally, the evaluation of those papers that were very close to the acceptance threshold was supervised by the editorial board. The cut-off criterion adopted was the average of the three scores, which had to be equal to or greater than 5 out of 7.

March 2014
Editorial board



Artículos


Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014. ISSN: 1135-5948

Learning to map variation-standard forms in Basque using a limited parallel corpus and the standard morphology

Aprendizaje de correspondencias variante-estándar usando un corpus paralelo limitado y la morfología del estándar

Izaskun Etxeberria*, Iñaki Alegria*, Mans Hulden**, Larraitz Uria*
*IXA group, UPV-EHU. M. Lardizabal 1, Donostia
**Department of Modern Languages. PO Box 24, University of Helsinki

Resumen: Este artículo explora tres métodos diferentes de aprendizaje de las variantes de un idioma (formas dialectales o diacrónicas) a partir de un pequeño corpus paralelo, suponiendo que la morfología estándar está disponible.
Palabras clave: normalización léxica, morfología, bibliotecas digitales

Abstract: This paper explores three different methods of learning to map variant word forms (dialectal or diachronic) to standard ones from a limited parallel corpus of standard and variant texts, given that a computational description of the standard morphology is available.
Keywords: lexical normalization, morphology, digital libraries

1 Introduction

In our work with the Basque language, a morphological description and analyzer is already available for the standard language, along with other tools for processing the language (Alegria et al., 2002). However, it would be convenient to be able to analyze variant forms as well. As the dialectal differences within the Basque language are largely lexical and morphophonological, analyzing the dialectal forms would require a separate morphological analyzer able to handle the unique lexical items in the dialect together with the differing affixes and phonological changes. Likewise, diachronic variants cannot be analyzed by standard morphological analyzers and stemmers.
For example, when searching in digital libraries containing old texts, it is impossible to find the corresponding old forms of a modern word without linguistic knowledge. Morphological analyzers are traditionally hand-written by linguists, most commonly using some variant of the popular finite-state morphology approach (Beesley and Karttunen, 2002). The construction of an analyzer entails having an expert who models a lexicon, inflectional and derivational paradigms, as well as phonological alternations, and then produces a morphological analyzer/generator in the form of a finite-state transducer. As the development of such wide-coverage morphological analyzers is labor-intensive, the hope is that an analyzer for a variant could be automatically learned from a limited parallel standard/variant corpus, given that an analyzer already exists for the standard language. This is an interesting problem because a good solution to it could be applied to many other tasks as well: to enhance access to digital libraries (containing diachronic and dialectal variants), for example, or to improve the processing of informal registers such as microblogging texts (some of the techniques described here were used in our participation in the TweetNorm_es shared task at SEPLN 2013).1

In this paper we evaluate three methods to learn, from a standard/variant parallel corpus, a model that translates a given word of the dialect to its equivalent standard form (called Batua). All the methods are based on finite-state phonology. The first two methods have been previously reported (Hulden et al., 2011). In this context, the use of statistical machine translation (SMT) technology is not adequate, since no large parallel corpus is available.

1 ActasSEPLN.pdf

The variant we have used so far for our experiments is Lapurdian,2 a dialect of Basque spoken in the Lapurdi (fr. Labourd) region of the Basque Country. As Basque is an agglutinative and highly inflected language, we believe some of the results can be extrapolated to many other languages facing similar challenges. The differences between the dialect and the standard are minor overall; the word order and syntax are usually unaffected, and only a few lexical items differ. However, even such relatively small discrepancies cause great problems for the potential reuse of current tools designed for the standard forms. We have experimented with three approaches that attempt to improve on a simple baseline of memorizing word pairs in the dialect and the standard.

The first approach is based on the work by Almeida et al. (2010) on contrasting orthography in Brazilian Portuguese and European Portuguese. In this approach, differences between substrings in distinct word pairs are memorized, and these transformation patterns are then applied whenever novel words are encountered in the evaluation. To prevent overgeneration, the output of this learning process is later subject to a morphological filter in which only actual standard-form outputs are retained. The second approach is an Inductive Logic Programming-style (ILP) learning algorithm (Muggleton and De Raedt, 1994) in which phonological transformation rules are learned from word pairs. The goal is to find a minimal set of transformation rules that is both necessary and sufficient to be compatible with the learning data, i.e. the word pairs found in the training data. The third approach uses Phonetisaurus, a weighted finite-state transducer (WFST) driven phonology tool (Novak et al., 2012), in order to learn the changes using a noisy channel model. Based on the improved results using Phonetisaurus, we decided to explore morphophonological changes rather than phonological ones.
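The generate-and-filter pipeline shared by these approaches can be sketched as follows. This is a minimal Python sketch, not the authors' transducer-based implementation: the rule list, the standard lexicon and the frequency table are hypothetical stand-ins for the learned rules, the standard morphological analyzer and corpus counts.

```python
def normalize(word, rules, standard_lexicon, freq):
    """Generate candidate standard forms by applying learned replacement
    patterns, keep only forms accepted by the standard morphology, and
    pick the most frequent surviving candidate."""
    candidates = set()
    for pattern, replacement in rules:
        if pattern in word:
            candidates.add(word.replace(pattern, replacement))
    # Morphological filter: keep only legitimate standard-Basque words.
    candidates = {c for c in candidates if c in standard_lexicon}
    if not candidates:
        return word  # fall back to the input form unchanged
    return max(candidates, key=lambda c: freq.get(c, 0))
```

For example, with the single learned pattern ait -> at, `normalize("joaiten", [("ait", "at")], {"joaten"}, {"joaten": 12})` yields "joaten".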
The paper is organized as follows. Section 2 describes related work. The characteristics of the corpus used in our experiments are described in section 3. Sections 4 and 5 describe the steps and variations of the methods we have applied and how they are evaluated. Section 6 presents the experimental results, and finally, section 7 discusses the results and presents possibilities for future work in this field.

2 Sometimes also called Navarro-Labourdin or Labourdin.

2 Related work

The general problem of supervised learning of dialectal variants or morphological paradigms has been discussed in the literature with various connections to computational phonology, morphology, machine learning, and corpus-based work. For example, Kestemont et al. (2010) present a language-independent system that can learn intra-lemma spelling variation. Koskenniemi (1991) provides a sketch of a discovery procedure for phonological two-level rules. The idea is to start from a limited number of paradigms, essentially pairs of input-output forms, where the input is the surface form of a word and the output a lemmatization plus analysis. Mann and Yarowsky (2001) present a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages. Bilingual lexicons within language families are induced using probabilistic string edit distance models. Inspired by that paper, Scherrer (2007) uses a generate-and-filter approach quite similar to our first method. He compares different measures of graphemic similarity applied to the task of bilingual lexicon induction between Swiss German and Standard German.

3 The corpus and the baseline

3.1 The corpus

The parallel corpus used in this research was built in the TSABL project developed by the IKER research group in Baiona (fr. Bayonne).3 It contains sentences written in the Lapurdian dialect as well as their equivalent sentences in standard Basque.
Table 1 presents the details of the corpus, which consists of 2,117 parallel sentences, totaling 12,150 words (roughly 3,600 types). In order to provide data for our learning algorithms as well as to test their performance, we have divided the corpus into two parts: 80% of the corpus is used for the learning task (1,694 sentences) and the remaining 20% (423

3 Towards a Syntactic Atlas of the Basque Language, web site: -tsabl-towards-a-syntactic-atlas-of-.html

sentences) for the evaluation. As the data show, roughly 23% of the word pairs are distinct.

                      Corpus    Dev      Test
  Sentences           2,117     1,694    423
  Words               12,150    9,734    2,417
  Unique words:
    Standard Basque   3,553     3,080    1,192
    Lapurdian         3,830     3,292    1,239
  Filtered pairs      3,610     3,108    1,172
  Identical pairs     2,532     2,...    ...
  Distinct pairs      1,...     ...      ...

Table 1: Characteristics of the parallel corpus used for the experiments.

3.2 The baseline

The baseline of our experiments is a simple method based on a dictionary of correspondences extracted from the learning portion (80%) of the corpus. This dictionary contains all the distinct word pairs seen between the dialectal and standard forms, and the baseline consists simply of memorizing these pairs and applying this knowledge during the evaluation task. That is, if an input word during the evaluation has been seen in the training data, we provide the corresponding previously known output word as the answer.

4 Previous work

In our previous work, we employed two different methods to produce an application that attempts to extract generalizations from the training corpus so as to produce the standard word equivalent to a given variant word. The first method is based on existing work by Almeida et al. (2010) that extracts all substrings from lexical pairs that are different. From this knowledge we then produce a number of phonological replacement rules that model the differences between the input and output words. In the second method, we likewise produce a set of phonological replacement rules, using an ILP approach that directly induces the rules from the pairs of words in the training corpus.
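The substring-difference extraction underlying the first method can be approximated with Python's difflib. This is a rough sketch, not the actual lexdiff implementation, but it produces the same kind of counted correspondences:

```python
from collections import Counter
import difflib

def substring_changes(pairs):
    """Count the differing substrings between variant/standard word
    pairs, mimicking lexdiff output such as '76 ait -> at'."""
    changes = Counter()
    for variant, standard in pairs:
        matcher = difflib.SequenceMatcher(None, variant, standard)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                changes[(variant[i1:i2], standard[j1:j2])] += 1
    return changes
```

For example, `substring_changes([("joaiten", "joaten"), ("emaiten", "ematen")])` records the deletion of i twice, the raw material for a candidate replacement rule.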
The core difference between the two methods is that, while both extract replacement patterns from the word pairs, the first method does not consider negative evidence when formulating the replacement rules. Instead, the existing morphological analyzer is used as a filter after applying the rules to unknown text, to prevent overapplication. The second method, however, uses negative evidence from the word pairs in delineating the replacement rules, as is standard in ILP approaches, and the subsequent morphological filter for the output plays a much smaller role.

4.1 Format of rules

Two of the evaluated methods involve learning a set of string-transformation rules that convert words, morphemes, or individual letters (graphemes) in the dialectal forms to the standard variant. The rules learned are in the format of so-called phonological replacement rules (Beesley and Karttunen, 2002), which we have later converted into equivalent finite-state transducers using the freely available foma toolkit (Hulden, 2009). The reason for this conversion of the rule set to finite-state transducers is twofold: first, the transducers are easy to apply rapidly to input data using available tools, and second, the transducers can be further modified and combined with the standard morphology already available to us as a finite-state transducer. In its simplest form, a replacement rule is of the format

  A -> B || C _ D    (1)

where the arguments A, B, C, D are all single symbols or strings. Such a rule dictates the transformation of a string A to B whenever A is found between the strings C and D. Both C and D are optional arguments in such a rule, and there may be multiple, comma-separated, conditioning environments for the same rule. For example, the rule

  h -> 0 || p _ , t _ , l _ , _ a s o    (2)

would dictate a deletion of h in a number of contexts: when the h is preceded by a p, t, or l, or succeeded by the sequence aso, for instance transforming ongiethorri (Lapurdian) to ongietorri (Batua).
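Rule (2) can be approximated in Python with a regular expression. This is a sketch only; the authors compile such rules into foma transducers rather than regexes:

```python
import re

# Delete h when preceded by p, t, or l, or when followed by "aso",
# applying all matches in parallel, as a foma replacement rule would.
H_DELETION = re.compile(r"(?<=[ptl])h|h(?=aso)")

def apply_h_deletion(word):
    return H_DELETION.sub("", word)
```

For example, `apply_h_deletion("ongiethorri")` returns "ongietorri", since the h is preceded by t.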
4.2 Method 1 (lexdiff) details

The first method is based on the idea of identifying sequences inside word pairs

where the output differs from the input. This was done with the already available tool lexdiff, which has been used in the automatic migration of texts between different Portuguese orthographies (Almeida et al., 2010). The lexdiff program tries to identify sequences of changes from seen word pairs and outputs string correspondences such as, for example: 76 ait -> at ; 39 dautz -> diz (stemming from pairs such as joaiten/joaten and dautzut/dizut), indicating that ait has changed into at 76 times in the corpus, etc., thus directly providing suggestions as to phonologically regular changes between two texts, with frequency information included. With such information about word pairs we generate a variety of replacement rules, which are then compiled into finite-state transducers with the foma application. Even though the lexdiff program provides a direct string-to-string change in a format that is directly compilable into a phonological rule transducer, we have experimented with some possible variations of the specific type of phonological rule we want to output:

- We can restrict the rules by frequency and require that a certain type of change be seen at least n times in order to apply that rule.
- We can limit the number of rules that can be applied to the same word.
- We can control the application mode of the rules: sequential or parallel.
- We can compact the rules output by lexdiff by eliminating redundancies and constructing context-sensitive rules. For example, given a rule such as r k u n -> r p e n, we can convert this into

    k u -> p e || r _ n    (3)

  This has a bearing on the previous point and will allow more rewritings within a single word in parallel replacement mode, since fewer characters overlap.

Once a set of rules is compiled with some instantiation of the various parameters discussed above and converted to a transducer, we modify the transducer in various ways to improve on the output.
Firstly, we restrict the output from the conversion transducer to allow only those words that are legitimate words in standard Basque. Secondly, in case multiple outputs remain even after applying the Batua filter, we simply choose the most frequent word.

4.3 Method 2 (ILP) details

The second method we have employed works directly from a collection of word pairs (dialect/standard in this case). We have developed an algorithm that, from a collection of such pairs, seeks a minimal hypothesis in the form of a set of replacement rules that is consistent with all the changes found in the training data. This approach is generally in line with ILP-based machine learning methods (Muggleton and De Raedt, 1994). However, in contrast to standard ILP, we do not learn statements of first-order logic that fit a collection of data, but rather string-to-string replacement rules. The two parameters to be induced are (1) the collection of string replacements X -> Y needed to characterize the training data, and (2) the minimal conditioning environment for each rule, such that the collection of rules models the string transformations found in the training data. The procedure employed for the learning task is as follows:

1. Align all word pairs (using minimum edit distance by default).
2. Extract a collection of phonological rewrite rules.
3. For each rule, find counterexamples.
4. For each rule, find the shortest conditioning environment such that the rule applies to all positive examples and none of the negative examples, and restrict the rule to be triggered only in this environment.

The following simple example should illustrate the method. Assume we have a corpus of only two word pairs:

  emaiten   ematen
  igorri    igorri

From this data we would gather that the only active phonological rule is i -> 0, since

all other symbols are unchanged in the data. However, we find two counterexamples to this rule (step 3), namely the two i-symbols in igorri, which do not alternate with 0. The shortest conditioning environment that accurately models the data and produces no overgeneration (does not apply to any of the i's in igorri) is therefore

  i -> 0 || a _    (4)

the length of the conditioning environment being 1 (one symbol needs to be seen to the left, plus zero symbols to the right).

5 Learning WFSTs using the noisy channel model

We wanted to test the use of WFSTs (very popular in speech technology) in the task of learning Lapurdian/standard Basque correspondences, in order to obtain new methods and results and to compare them with the previous ones. The first tool we used was Carmel, but the results obtained were worse than the previous ones (and the tuning process was very challenging). We then experimented with a more modern tool for the purpose. The Phonetisaurus tool was presented at the FSMNLP workshop in 2012 by J. Novak as a WFST-driven grapheme-to-phoneme (g2p) framework suitable for rapid development of high-quality g2p or p2g systems. It is a new, open-source, easy-to-use alternative, and its authors report promising results. The framework includes three functions: (1) sequence alignment, (2) model training and (3) decoding (Novak et al., 2012). The alignment algorithm is capable of learning many-to-many relationships and includes three modifications to the basic toolkits: (a) a constraint is imposed such that only many-to-one and one-to-many alignments are considered during training; (b) during initialization, a joint alignment lattice is constructed for each input entry, and any unconnected arcs are deleted; (c) all arcs, including deletions and insertions, are initialized to and constrained to maintain a nonzero weight.
The model training works as follows: (a) convert the aligned sequence pairs to sequences of aligned joint label pairs; (b) train an n-gram model on the output of (a); (c) convert the n-gram model to a WFST. Step (c) may be performed with any language modeling toolkit. The default decoder provided by the distribution simply extracts the shortest path through the phoneme lattice created via composition with the input word.

5.1 Using Phonetisaurus

We have used the Phonetisaurus tool to obtain a grapheme-to-grapheme system, i.e. not a g2p or p2g tool. In practice, applying the tool is straightforward and can be described in two steps:

1. Prepare the data from which the model has to learn. In our case, this is a dictionary of word pairs obtained from the corpus, collecting identical and distinct word pairs such as izan/izan, guziek/guztiek, and so on.
2. Train a model using this data. A language model training toolkit is necessary in this step for the n-gram calculations, and there are different possibilities, as the author mentions in the tutorial (mitlm, NGramLibrary, SRILM, the SRILM MaxEnt extension, CMU-Cambridge SLM). We used NGramLibrary for our experiments.

Once the model has been trained and converted to WFST format, it can be used to generate correspondences for previously unseen words. In contrast to the Carmel tool, it is not necessary to infer an FST, because the tool builds it through the alignment and n-gram training process; only the data needs to be supplied in the appropriate format. There are two parameters to fix when we ask the WFST to generate correspondences for new words: the number of transductions the WFST is to return for each word, and the size of the search beam. As is usual in n-gram based decoding, increasing the search beam evaluates more hypotheses, but at a cost in decoding speed. The default value leads to a reduced number of hypotheses. We carried out a tuning process to decide the best values for those parameters.
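The dictionary preparation in step 1 amounts to writing one entry per word pair. A minimal sketch follows; the tab-separated, space-split grapheme format is an assumption about the expected input layout, modeled on the example pairs shown later in the paper, not a definitive description of Phonetisaurus's interface:

```python
def training_entry(variant, standard):
    # One training line: the variant word, then the target standard
    # form spelled out as space-separated graphemes.
    return f"{variant}\t{' '.join(standard)}"

def write_training_file(pairs, path):
    # Dump all (variant, standard) pairs, identical and distinct alike.
    with open(path, "w", encoding="utf-8") as f:
        for variant, standard in pairs:
            f.write(training_entry(variant, standard) + "\n")
```

For example, `training_entry("emaiten", "ematen")` produces the line "emaiten&#9;e m a t e n".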
In order to perform the experiments, we divided the development corpus (80% of the total corpus) into four complementary subsets to

apply a cross-validation technique, looking for the best values of the mentioned parameters. Dividing the corpus into four subsets allows us to run four experiments in which the test subset is the same size as the final test set. The conclusions of those experiments are presented in section 6.2.

When there are multiple answers for a corresponding variant, it becomes necessary to perform some filtering. The first filter is obvious: we eliminate the answers that do not correspond to accepted standard words. Among the remaining words, we select the most probable answer according to Phonetisaurus. In total, we have performed three different experiments with Phonetisaurus, giving it different pairs to learn from.

5.2 Word-word and word-morphemes pairs

In the next three subsections the three different experiments are presented.

5.2.1 Word-word

In the first experiment, we provided the tool with all the word pairs obtained from the development corpus, including identical pairs and distinct pairs. For example:

emaiten    e m a t e n
nehori     n e h o r i

5.2.2 Word-morphemes

In the second and third experiments we provide Phonetisaurus with different pairs to train on. In the second part of the dictionary we have marked the morphological analysis of the corresponding standard word instead of the word itself. Using the morphological analysis we have performed two experiments. In the first one, the analysis includes morphophonemes and diacritics. For the words above, this would look like:

emaiten    e m a N + t e n
nehori     n e h o Q + R i

Here, N, Q and R are morphophonemes expressing epenthetic n, r in lemmas, and r in suffixes. In the second experiment the analyses have been slightly simplified by converting morphophonemes to their equivalent grapheme form and by deleting diacritics. The result is the concatenation of the morphemes using their canonical forms.
For the words above this would be:

emaiten    e m a n + t e n
nehori     n e h o r + r i

The hypothesis is that some morphophonemes and diacritics have a very low probability and are difficult to integrate into the learning process. In both experiments, because the WFST generates a fixed number of candidates for a dialectal word, which in this case are morphological analyses, a new step is necessary in order to find the corresponding standard forms. In addition, an analysis may generate more than one standard form, and in this case we have to select only one of them (the most frequent one in our implementation).

6 Evaluation and results

We have measured the quality of the different approaches by the usual parameters of precision, recall and their harmonic combination, the F1-score, and analyzed how the different options in the approaches affect the results. For the WFST solution, three different runs have been evaluated, corresponding to the three possible representations of the standard form: word, morpheme sequence, and simplified morpheme sequence. As mentioned above, the learning process has made use of 80% of the corpus, leaving 20% of the corpus for evaluation of the abovementioned approaches. In the evaluation, we have only tested those words in the dialect that differ from words in the standard (which are in the minority). In total, we have tested the 301 words that differ between the dialect and the standard in the evaluation part of the corpus.

The results for the baseline, i.e. simple memorization of word-word correspondences, are (in %): P = 95.62, R = and F1 = . As expected, the precision of the baseline is high: when the method gives an answer it is usually the correct one. But the recall of the baseline is low, as expected: slightly less than half the words in the evaluation corpus have been encountered before.⁶

⁶ The reason the baseline does not show 100% precision is that the corpus contains minor inconsistencies or accepted alternative spellings, and our method of measuring the precision suffers from such examples by providing both learned alternatives to a dialectal word, while only one is counted as being correct.

6.1 Previous results

The results for the first two approaches were published in detail in a previous paper (Hulden et al., 2011).

6.1.1 Results using lexdiff

After the experiments we may note that applying more than one rule within a word has a negative effect on precision while not substantially improving recall. Applying the unigram filter, which chooses the most frequent candidate, yields a significant improvement: much better precision but also slightly worse recall. Choosing either parallel or sequential application of rules (when more than one rule is applied to a word) does not change the results significantly. Finally, compacting the rules and producing context-sensitive ones is clearly the best option. In all cases the F1-score improves if the unigram filter is applied, sometimes significantly and sometimes only slightly. The best result is shown in table 2. The options that yield this result are: frequency 2; 2 rules applied; in parallel; with contextual conditioning.

6.1.2 Results using ILP

The only variable parameter in the ILP method dictates how many times a word pair must be seen to be used as learning evidence for creating a replacement rule. As expected, the strongest result is obtained by using all word pairs, i.e. setting the threshold to 1. This is the result shown in table 2. Interestingly, adding the unigram filter that improved results markedly in method 1 to the output of the ILP method slightly worsens the results in most cases, and gives no discernible advantage in others. In other words, in those cases where the method provides multiple outputs, choosing the most frequent one on a unigram frequency basis gives no improvement over not doing so.

6.2 Results using WFST

The experiments done by cross-validation on the development corpus, to decide the best values of the parameters for the test, have consisted in increasing the number of retrieved answers (1, 3, 5, 10, 20 or 30) and varying the search beam (default value or 5,000).

Table 2: The best results (per F1-score) obtained with the first two methods (P, R and F1 for the Baseline, Lexdiff and ILP).

Table 3: Average results obtained by cross-validation on the development corpus with the three WFSTs (P, R and F1 for WFST1 (N=5), WFST2 (N=20) and WFST3 (N=20)). N is the number of requested answers. WFST1: word/word. WFST2: word/morph-seq. WFST3: word/simpl-morph-seq. In all cases the search beam is 5,000.

In all the WFSTs, specifying a search beam of 5,000 is better than using the default beam. As regards the number of answers, retrieving more answers yields a better F1-score in the three WFSTs until an upper limit is reached. The upper limit is reached at 20 answers using morpheme sequences (WFST2 and WFST3). In WFST1, the plateau is reached at 5. Table 3 shows the results obtained. Another important conclusion is that by managing this value N we can balance precision and recall.

Finally, table 4 shows the final results obtained using 80% of the corpus to train and 20% to test. As the table shows, the best results are obtained using the last WFST. The differences among them are not statistically significant (p-values > 0.1 using Bhapkar's test). In any case, identifying morphemes is interesting for our future work (learning paradigms). These results are overall consistently better than the ones obtained with the previous methods (see table 2).
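The precision, recall and F1 measures used throughout this evaluation can be computed as in the following sketch (toy data, not the figures reported in the tables):

```python
def precision_recall_f1(answers, gold):
    """Score proposed standard forms: precision over the words
    the system answered, recall over all differing test words,
    F1 their harmonic mean. `answers` maps a dialectal word to
    the proposed standard form (absent when the system gives no
    answer); `gold` maps it to the correct standard form."""
    correct = sum(1 for w, a in answers.items() if gold.get(w) == a)
    p = correct / len(answers)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy data: two answers given (one right, one wrong), three test words.
p, r, f1 = precision_recall_f1(
    {"guziek": "guztiek", "emaiten": "emoten"},
    {"guziek": "guztiek", "emaiten": "ematen", "erran": "esan"})
# p = 0.5, r = 1/3, f1 = 0.4
```

This also makes the trade-off mentioned above concrete: raising the number of requested answers N tends to raise recall (more test words get some valid answer) at the expense of precision.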

Table 4: Results obtained in the final test with the three WFSTs (P, R and F1 for WFST1, WFST2 and WFST3).

7 Conclusions and future work

We have presented a number of experiments to solve a very concrete task: given a word in the Lapurdian dialect of Basque, produce the equivalent standard Basque word. As background knowledge, we have a complete standard Basque morphological analyzer and a limited parallel corpus of dialect and standard text. The approach has been based on the idea of extracting string-to-string transformation rules from the parallel corpus and applying these rules to unseen words. We have been able to improve on the results of a naive baseline using three methods that infer phonological rules from the information extracted from the corpus and apply them with finite-state transducers. When weights have been inferred, the results have improved further. The results using the noisy-channel model (implemented with the Phonetisaurus tool) together with the standard morphological analysis seem very promising.

In order to improve on these results, we plan to study the combination of the previous methods with others that infer dialectal paradigms and relations between lemmas and morphemes for the dialect and the standard. These inferred relations could be contrasted with the information of a larger corpus of the dialect, without using an additional parallel corpus.

Acknowledgments

We are indebted to Josef Novak for his helpful assistance with Phonetisaurus. This research has been partially funded by the Spanish Science and Innovation Ministry (Tacardi project, TIN C02-01) and by the Basque Government (Ber2tek, Etortek-IE12-333).

References

Alegria, I., Aranzabe, M., Ezeiza, N., Ezeiza, A., and Urizar, R. (2002). Using finite state technology in natural language processing of Basque. In LNCS: Implementation and Application of Automata, volume 2494. Springer.

Almeida, J. J., Santos, A., and Simões, A. (2010).
Bigorna: a toolkit for orthography migration challenges. In Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta.

Beesley, K. R. and Karttunen, L. (2002). Finite-State Morphology: Xerox Tools and Techniques. Studies in Natural Language Processing. Cambridge University Press.

Hulden, M. (2009). Foma: a finite-state compiler and library. In Proc. of the 12th Conference of the EACL, pages 29–32, Athens, Greece. ACL.

Hulden, M., Alegria, I., Etxeberria, I., and Maritxalar, M. (2011). Learning word-level dialectal variation as phonological replacement rules using a limited parallel corpus. In Proc. of the Dialects workshop at EMNLP.

Kestemont, M., Daelemans, W., and Pauw, G. D. (2010). Weigh your words: memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3).

Koskenniemi, K. (1991). A discovery procedure for two-level phonology. Computational Lexicology and Lexicography: A Special Issue Dedicated to Bernard Quemada.

Mann, G. S. and Yarowsky, D. (2001). Multipath translation lexicon induction via bridge languages. In Proc. of the Second Meeting of the NAACL, NAACL '01, pages 1–8. Association for Computational Linguistics.

Muggleton, S. and De Raedt, L. (1994). Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19.

Novak, J. R., Minematsu, N., and Hirose, K. (2012). WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proc. of the 10th FSMNLP.

Scherrer, Y. (2007). Adaptive string distance measures for bilingual dialect lexicon induction. In Proc. of the 45th Annual Meeting of the ACL, ACL '07. ACL.

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

Fénix: a flexible information exchange data model for natural language processing

Fénix: un modelo de datos flexible para el intercambio de información en procesamiento del lenguaje natural

José M. Gómez, David Tomás, Paloma Moreda
Depto. de Lenguajes y Sistemas Informáticos - Universidad de Alicante
Carretera San Vicente del Raspeig s/n, Alicante (Spain)

Resumen: En este artículo se describe Fénix, un modelo de datos para el intercambio de información entre aplicaciones en el campo del Procesamiento del Lenguaje Natural. El formato propuesto está pensado para ser lo suficientemente flexible como para dar cobertura a estructuras de datos, tanto presentes como futuras, empleadas en el campo de la Lingüística Computacional. La arquitectura Fénix está dividida en cuatro capas: conceptual, lógica, persistencia y física. Esta división proporciona una interfaz sencilla para abstraer a los usuarios de los detalles de implementación de bajo nivel, como los lenguajes de programación o el almacenamiento de datos empleado, permitiéndoles centrarse en los conceptos y procesos a modelar. La arquitectura Fénix viene acompañada por un conjunto de librerías de programación para facilitar el acceso y manipulación de las estructuras creadas en este marco de trabajo. También mostraremos cómo se ha aplicado de manera exitosa esta arquitectura en diferentes proyectos de investigación.

Palabras clave: modelo de datos, herramientas de PLN, integración de recursos, intercambio de información

Abstract: In this paper we describe Fénix, a data model for exchanging information between Natural Language Processing applications. The format proposed is intended to be flexible enough to cover both current and future data structures employed in the field of Computational Linguistics. The Fénix architecture is divided into four separate layers: conceptual, logical, persistence and physical.
This division provides a simple interface to abstract the users from low-level implementation details, such as the programming languages and data storage employed, allowing them to focus on the concepts and processes to be modelled. The Fénix architecture is accompanied by a set of programming libraries to facilitate the access and manipulation of the structures created in this framework. We will also show how this architecture has already been successfully applied in different research projects.

Keywords: data model, NLP tools, resource integration, information exchange

1 Introduction

Any research work should be motivated by the idea of sharing knowledge, tools and resources that can be employed by other researchers to jointly improve their area of expertise. The research carried out in the field of Natural Language Processing (NLP) relies heavily on resources and tools previously developed by other community members.* For instance, a text classification system may depend on the output generated by morphological tools (e.g., part-of-speech taggers), syntactic tools (e.g., shallow parsers), and semantic tools (e.g., named entity recognizers). If this system followed a machine learning-based approach, it could also require as input a corpus to train and validate the system.

* This research has been partially funded by the Spanish Ministry of Economy and Competitiveness under project LegoLangUAge (Técnicas de Deconstrucción en las Tecnologías del Lenguaje Humano, TIN ).

At some point in the process of developing almost any NLP application, every researcher faces the problem of integrating different tools and resources into their frameworks. In these situations, researchers and developers usually have to make a significant effort to adapt and integrate their products with previously existing ones, since different people employ different input and output formats. Moreover, this effort has to be repeated any time a component is replaced by a different one. In a worst-case scenario where n different inputs for m different tools are available, a total of n × m conversions between formats must be implemented in order to process them all. This problem could be mitigated by establishing a common information exchange format, adapting the n different inputs to this new format. We could also adapt the m different tools to process this common format, reducing the need for conversions to just n + m different possibilities.

The problem of integrating tools and resources affects not only the cooperation between different research groups; intra-group collaboration also suffers from it (Moreno-Monteagudo and Suárez, 2005). Taking into account the amount of tools and resources available nowadays, it becomes increasingly necessary to develop frameworks that easily integrate heterogeneous sources of information to build more complex NLP systems. Moreover, standardizing inputs and outputs not only facilitates researchers' consumption of resources and tools, but also the dissemination of their work and its citation and reuse by other community members.

In this paper we present Fénix, an information exchange data model to facilitate the sharing of information between different NLP processes. The purpose of this model is to provide a standard to encode inputs and outputs for different process types in the field of computational linguistics (part-of-speech taggers, syntactic parsers, text classifiers, etc.), facilitating in this way the integration of different NLP resources and tools.
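The worst-case arithmetic above can be made concrete with a small sketch (the function name is ours, purely illustrative):

```python
def conversions_needed(n_formats, m_tools, shared_format=False):
    """Number of converters to implement: every input format to
    every tool directly (n * m), or, with a common interchange
    format, one adapter per input format plus one per tool
    (n + m)."""
    return n_formats + m_tools if shared_format else n_formats * m_tools

direct = conversions_needed(5, 4)                      # 20 pairwise converters
shared = conversions_needed(5, 4, shared_format=True)  # 9 adapters
```

With 5 formats and 4 tools, a shared format cuts the work from 20 converters to 9 adapters, and the saving grows quickly with n and m.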
Although this paper focuses on the application of Fénix in the area of NLP, the model is flexible enough to be applied in any process communication context. Our proposal tries to bring together the most relevant features included in previous models, covering the gaps in existing work. The most relevant feature of Fénix, which distinguishes it from other approaches, is its adaptability. The model proposed is not limited to a fixed set of predefined types, since new data types can be defined for new processes as necessary. This flexibility does not have an impact on the usability of the model, since Fénix provides a simple interface based on a four-layer architecture that abstracts the user from implementation details, such as data structures and storage.

The remainder of this article is organized as follows. The next section reviews the related work in the field of NLP process communication and integration. Section 3 describes all the components involved in the Fénix architecture. Section 4 provides details on implementation experiences already carried out with Fénix in different research projects. Finally, Section 5 summarises conclusions and future work.

2 Related Work

There are two main approaches in the existing research on the integration of NLP resources and tools: (i) projects that only define the data format used by processes to communicate with each other; (ii) projects that take into account data and tools in a single platform. Regarding data integration, we can highlight the Annotation Graph Toolkit (Maeda et al., 2001) and the Atlas architecture (Bird et al., 2000). Both systems propose a three-level architecture, comprising logical, physical, and application levels. The logical level implements a generalization of the annotation graph model presented by Bird and Liberman (2001). Although this logical level provides independence from the application and the physical storage, it does not allow separating the information into different layers.
Thus, every process has to load all the previous annotations to complete its task. Another relevant system is EMU (Cassidy and Harrington, 2001), a system intended for labelling, managing, and retrieving data from speech databases. Although EMU is portable to major computing platforms and provides integration of hierarchical and sequential labelling, its area of application is limited to speech data. With respect to data and tools integration, two systems have been widely employed by the NLP community: GATE (Cunningham et al., 2011) and UIMA (Ferrucci and Lally, 2004). The first one offers a framework and an environment for language engineering.

As a framework, it provides a set of software components that can be used, extended, and customised for specific needs. As a development environment, it facilitates adding new components. The process of integrating new components is straightforward in the case of Java. However, for other programming languages this process is more complicated, since each resource is treated as a Java class. On the other hand, UIMA is the result of the efforts carried out by IBM to create a common architecture and a robust software framework able to reuse and combine the results of its different working teams, accelerating the process of transferring the advances made in NLP into IBM's product platform. Although it provides a common framework for combining NLP components, these components are always limited to IBM products.

Thus, although some efforts have been made to develop integration platforms in the field of NLP, none of them is flexible and general enough to provide a definitive, easy, and adaptable information exchange model to the NLP community. In this sense, the proposal described in this paper provides a simple interface based on a four-layer architecture, comprising conceptual, logical, persistence, and physical levels. Previous layer-based models usually define three layers, jointly considering the physical and persistence layers. This distinction in our proposal allows the storage of information in different formats by just modifying the persistence layer (see Section 3). Another relevant feature of Fénix is the possibility of distributing the information across different sources. For instance, the result of a part-of-speech (POS) tagger could be stored in one file, whereas the original text could remain in a different file, with links between the initial tokens and the POS labels assigned.
In this way, unlike many previous approaches, it is not necessary to load all the information for every process, focusing only on the data necessary to accomplish a particular task. Fénix was originally conceived as part of the InTime architecture (Gómez, 2008), an integration platform for NLP tools and resources. In this platform, Fénix provides the data model that facilitates the information exchange between heterogeneous processes. As part of this architecture, a set of libraries is available for developers to create, access, and modify Fénix objects.

3 Fénix Architecture

Fénix is a data model for information exchange between computational linguistics processes. Due to the heterogeneity of systems and tasks in this research area, it is very difficult to define all the possible types of data structures that may be necessary in this field. That is why Fénix's philosophy is based on a logical model flexible enough to incorporate both current and possible future data structures. In order to achieve this goal, Fénix is divided into four separate layers: conceptual, logical, persistence, and physical. Figure 1 shows how the different layers are related in our model.

Figura 1: Fénix four-layer architecture.

The conceptual layer is at the top level and is used to define the conceptual objects, called object wrappers in Fénix. These objects provide the input and output public interfaces in order to abstract the logical layer and its structure from the end user. For instance, if we add a text string into Fénix, the end user will use a wrapper interface to interact with it. In this case, a Fénix object is created, which represents an instance of the text concept in the model. The interfaces of this object will have the necessary public methods, for instance gettext() and settext(), to read text from or store it into Fénix text objects. Each type of wrapper has its own interface and involves different model concepts.
For example, an input text, a classification result, tokens from a given text, or a search result will be considered different concepts, and thus different wrapper types in the Fénix model.
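A minimal sketch of what such a wrapper might look like, assuming a plain-dictionary stand-in for the logical layer. The paper only names the gettext()/settext() methods; the class name, the dictionary layout and everything else here are hypothetical, not the real Fénix library API.

```python
class PlainTextWrapper:
    """Conceptual-layer sketch: the wrapper hides the logical-layer
    structure behind public accessors, so end users never touch the
    underlying unit representation directly."""

    def __init__(self, unit):
        self._unit = unit  # logical-layer representation, hidden from callers

    def gettext(self):
        return self._unit["items"]["text"]["value"]

    def settext(self, value):
        self._unit["items"]["text"]["value"] = value

w = PlainTextWrapper({"id": "input_question", "type": "plain_text",
                      "items": {"text": {"value": "Who is the president of Spain?"}}})
w.settext("Who wrote Don Quixote?")
# w.gettext() now returns the updated string
```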

The conceptual layer is built on the logical layer, which defines complex information elements and their structure. The logical layer consists of information elements called unit. These elements are indivisible and represent the result of a process, where a process can generate more than one information unit. Each unit represents a type of data that could be simple (e.g., a string) or complex (e.g., the result of a text classification or an information search), containing a type that reveals its structure and what information is included. That is, unit elements of the same type have the same structure. For example, the unit type plain text could store a text string, an optional source, and the start and final position of the text (the relative position from the beginning of the document where the string was located). The source is a reference to a related set of unit elements, indicating from which unit elements the information was obtained. For instance, we could obtain a plain text element without stopwords from the original text which included these terms.

The persistence and physical layers are at the lowest level. The physical layer defines how the Fénix model is implemented and which programming languages can be employed to process the model. The persistence layer, in turn, defines how to import and export each Fénix object and in which formats the data can be persisted. In fact, different objects can be stored in different formats, which also makes it possible to distribute the information between disk and memory to optimize its use in several tasks. Moreover, the user can decide which objects are finally stored and which are not. Figure 2 shows the structure of the different modules of the Fénix model and its components. For clarity, the physical layer is not shown in this scheme, but it is the basis on which the entire structure of the model is implemented.
A unit is composed of one or more complex information structures called item. An item represents a part of the information contained in a unit, but items become useful only when considered together with the other item elements of the unit. An XML representation of this model is shown as follows:

<fenix version="1.0.0">
  <unit id="unit_id" type="unit_type" [tool="tool_name"]>
    <item id="item_id_1" data_type="simple">
      <info id="id_1.1" data_type="info_type">value</info>
    </item>
    <item id="item_id_2" data_type="vector">
      <item id="0" data_type="item_type">...</item>
      <info id="1" data_type="info_type">value</info>
      <item id="2" data_type="item_type">...</item>
      <info id="3" data_type="info_type">value</info>
      ...
    </item>
    <item id="item_id_3" data_type="struct">
      <item id="id_3.1" data_type="item_type">...</item>
      <info id="id_3.2" data_type="info_type">value</info>
      <item id="id_3.3" data_type="item_type">...</item>
      <info id="id_3.4" data_type="info_type">value</info>
      ...
    </item>
    ...
  </unit>
  ...
</fenix>

The unit has an attribute tool that indicates from which tool the data of this unit was obtained. This attribute is optional and may be left unset if the tool is unknown or the unit represents the input data. All unit, item and info elements contain an identifier. Whereas the unit identifier must be unique (no two unit elements may have the same identifier), an item or info identifier must be unique only at the level of its container (a unit or another item). For example, the following code represents the output of a question answering system that returns three information units: the input question, the question language detected by the system, and the answers found.
<fenix version="1.0.0">
  <unit id="input_question" type="plain_text">
    <item id="text" data_type="simple">
      <info id="value" data_type="string">Who is the president of Spain?</info>
    </item>
  </unit>
  <unit id="question_lang" type="lang" tool="jirs">
    <item id="sources" data_type="struct">
      <info id="text" data_type="id">input_query</info>
    </item>
    <item id="lang" data_type="simple">
      <info id="value" data_type="string">en</info>
    </item>

Figura 2: Module structure in Fénix.

  </unit>
  <unit id="answers" type="answers" tool="jirs">
    <item id="sources" data_type="struct">
      <info id="question" data_type="id">input_question</info>
      <info id="lang" data_type="id">question_lang</info>
    </item>
    <item id="results" data_type="vector">
      <item id="0" data_type="struct">
        <info id="text" data_type="string">Mariano Rajoy</info>
        <info id="score" data_type="float">1.0</info>
      </item>
      <item id="1" data_type="struct">
        <info id="text" data_type="string">José Luis Rodríguez Zapatero</info>
        <info id="score" data_type="float">0.8</info>
      </item>
    </item>
  </unit>
</fenix>

As we can see in the previous example, all the identifiers (id) of a unit are different. Nevertheless, the item sources appears in both the question_lang and answers entries, whereas the item identifiers text and score occur in several subitems of results. Although an id may thus be repeated across different scopes, there cannot exist two item elements with the identifier sources as children of the unit answers, nor two info elements with the identifier text as children of the item answers.results.1.

The label unit can be assigned many different types, and it is open to new types of information to be incorporated in the future. Whenever a new unit type is created, its XML specification is added to the project Wiki. Some unit types already implemented are: plain text, for plain texts; categories, to store the different categories for a classification process; and classification, which relates a sample with its category in the Fénix model.
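A minimal reader for units like question_lang in the example above might look like this in Python. This is a sketch over the XML shown in the paper only; the real Fénix persistence libraries are not reproduced here and handle many more data types than simple items.

```python
import xml.etree.ElementTree as ET

FENIX = """<fenix version="1.0.0">
  <unit id="question_lang" type="lang" tool="jirs">
    <item id="lang" data_type="simple">
      <info id="value" data_type="string">en</info>
    </item>
  </unit>
</fenix>"""

def load_units(xml_text):
    """Index units by id, keeping their simple items as plain
    strings (vector and struct items are ignored in this sketch)."""
    units = {}
    for unit in ET.fromstring(xml_text).findall("unit"):
        items = {}
        for item in unit.findall("item"):
            info = item.find("info")
            if item.get("data_type") == "simple" and info is not None:
                items[item.get("id")] = info.text
        units[unit.get("id")] = {"type": unit.get("type"),
                                 "tool": unit.get("tool"),
                                 "items": items}
    return units

units = load_units(FENIX)
# units["question_lang"]["items"]["lang"] == "en"
```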
It should be noted that these are only a small sample of the established unit types, and the model is open to include new types, each with its own internal structure and wrappers. The item data type, however, can only be one of three different types: simple, vector,

and struct. An item of type simple contains only one info element; an item of type vector contains a sequence of item or info elements; finally, an item of type struct contains a complex structure formed by other item or info elements, which can be referenced by an identifier id. As shown in the previous example, an item may contain other item elements or basic pieces of information, info elements, the terminal nodes of the model. Therefore, information units may be composed of various combinations of item and info elements, subject to two limitations: info elements must have an item parent, and they must be the terminal nodes of the model, i.e., they must contain a basic data type and cannot include other item or info elements.

The info elements can only pertain to one of the following basic types:

- character: an individual character
- string: a sequence of characters
- integer: an integer value without decimal part
- float: a single-precision floating point number
- double: a double-precision floating point number
- date: date/time in different formats depending on locale
- object: a programming object
- id: a reference to another Fénix element

It is worth noting the object and id data types. Since Fénix, apart from being a model, is a framework for data exchange between different processes, we considered it necessary to allow the storage of programming objects in the info elements. For instance, we could store the database connection of a process as a Java object, passing it to another process instead of opening a new database connection every time.

The last basic type of Fénix information is id. All Fénix elements (unit, item and info) have an identifier and can be referenced by an info element of type id. These identifiers are hierarchical, unit being the top-level element that can be referenced, and info the lowest one. Therefore, references of type id are formed by concatenating the identifiers of all parent nodes down to the element you want to reference, separated by dots.
Consider the following example:

<unit id="search_result" type="snippets" [tool="tool_name"]>
  <item id="sources" data_type="struct">
    <info id="query" data_type="id">input_query</info>
  </item>
  <item id="results" data_type="vector">
    <item id="0" data_type="struct">
      <info id="url" data_type="string">url_1</info>
      <info id="title" data_type="string">title_1</info>
      <info id="snippet" data_type="string">snippet_1</info>
      <info id="score" data_type="float">score_1</info>
    </item>
    <item id="1" data_type="struct">
      <info id="url" data_type="string">url_2</info>
      <info id="title" data_type="string">title_2</info>
      <info id="snippet" data_type="string">snippet_2</info>
      <info id="score" data_type="float">score_2</info>
    </item>
    ...
  </item>
</unit>

The unit type snippets is employed for storing the results of a search engine. If we wanted to reference the URL of the second snippet, the identifier would be search_result.results.1.url, where search_result is the information unit identifier, results is the identifier of the item which contains the snippet list, 1 is the index of the item inside the result vector results, and url is the info element with the information to retrieve. In the previous example, an item with the identifier sources is also present. This special item can occur in any information unit and is employed to record from which information unit or units the current one was obtained. Following the previous example, thanks to this item we know that the search result has been obtained from a query whose text is in the unit with the identifier input_query. Thus, if needed, we can trace the information back to its source and retrieve intermediate results that would otherwise be lost. For example, if we model a POS tagger, the result of this tool would be a list of values corresponding to each POS tag from the initial text; if we want to display the final result (the POS tags) together with the original terms, we can use these backward references.
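As an illustration of how such dotted references can be resolved, here is a minimal Python sketch (ours, not part of the Fénix libraries); it walks an XML tree, matching each dot-separated component against the id attributes of nested elements:

```python
import xml.etree.ElementTree as ET

# A trimmed version of the snippets unit shown above.
FENIX_XML = """\
<unit id="search_result" type="snippets">
  <item id="results" data_type="vector">
    <item id="0" data_type="struct">
      <info id="url" data_type="string">url_1</info>
    </item>
    <item id="1" data_type="struct">
      <info id="url" data_type="string">url_2</info>
    </item>
  </item>
</unit>"""

def resolve(root, dotted_id):
    """Resolve a Fenix-style dotted reference such as
    'search_result.results.1.url' against an element tree."""
    parts = dotted_id.split(".")
    if root.get("id") != parts[0]:
        raise KeyError(dotted_id)
    node = root
    for part in parts[1:]:
        for child in node:
            if child.get("id") == part:
                node = child
                break
        else:
            raise KeyError(dotted_id)
    return node.text

root = ET.fromstring(FENIX_XML)
print(resolve(root, "search_result.results.1.url"))  # url_2
```

The same descent works for any nesting depth, since every unit, item and info carries an id attribute.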

Fénix: a flexible information exchange data model for natural language processing

4 Applications

Fénix has already been successfully applied to several projects developed in our research group. The two main applications where this model was used are Java Process Manager (JPM) and InTime.

JPM is a development framework for creating processes in the area of NLP. It is focused on developing modular and customizable tools, so that researchers can easily test different modules for the same NLP task just by changing the system parameters. Moreover, JPM allows integrating both native and external processes, regardless of the programming language or the operating system employed. A JPM application defines its behaviour through a configuration file that indicates which processes are executed, in which order, and under which conditions. One of the main advantages of this framework is the possibility of changing the tool's process architecture by just modifying the configuration file. For example, we could convert a pipeline into a client/server architecture, parallel processing, distributed processing, or a combination of them. One of the most relevant problems of JPM, before the inclusion of the Fénix model, was how to share data between processes. Thanks to Fénix, the configuration file structure of JPM was notably simplified, allowing simpler and more independent processes.

On the other hand, InTime is a distributed platform for the integration and exchange of tools and resources based on P2P technology. The goal of InTime is to provide researchers with a simple shared platform to discover and use tools developed by other researchers. Employing Fénix was essential in order to provide a shared space for data exchange between processes.
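The idea of defining a tool's behaviour in a configuration file that lists which processes run and in which order can be sketched as follows. This is a hypothetical miniature, not the actual JPM configuration format; the process names and the dictionary-based data exchange are our own simplification:

```python
# Hypothetical sketch of configuration-driven process chaining in the
# spirit of JPM; the real JPM configuration format is not shown here.
def tokenize(data):
    data["tokens"] = data["text"].split()
    return data

def count(data):
    data["n_tokens"] = len(data["tokens"])
    return data

# Registry of available processes.
PROCESSES = {"tokenize": tokenize, "count": count}

# The "configuration file": which processes run, and in which order.
CONFIG = ["tokenize", "count"]

def run_pipeline(config, data):
    """Execute the configured processes in sequence over shared data."""
    for name in config:
        data = PROCESSES[name](data)
    return data

result = run_pipeline(CONFIG, {"text": "natural language processing"})
print(result["n_tokens"])  # 3
```

Reordering or replacing entries in CONFIG changes the architecture without touching the processes themselves, which is the property the paper highlights.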
Other applications in which Fénix has been successfully employed are: GPLSI Dossier, an application to classify news based on customers' criteria; GPLSI Classifier, a text classifier based on different NLP processing tools (tokenization, lemmatization, n-gram extraction, etc.); MONEI, a meta search engine for business opportunities in foreign markets; Pyramid, an Internet crawler able to process hundreds of thousands of web pages per day; and Social Observer, an application which monitors tweets and scores them according to their sentiment polarity. Finally, Fénix is currently being employed in the LegoLangUAge project as the basis to build up the basic information units called L-Bricks. These units define the data structures, their relations with other L-Bricks, and the ontology of the system. Fénix was chosen among other data models due to its flexibility and its coverage of all the needs of the LegoLangUAge project. Regarding the application of Fénix to these projects, building Fénix objects was just a matter of building a wrapper based on the basic structures described before: units, items and infos. Objects created in this way were shared by means of Subversion in a SourceForge repository, becoming immediately available to any user. We developed templates for NetBeans providing the basic wrapper structure and methods to work with it. A wrapper can be developed in less than an hour by someone with reasonable knowledge of the model. Once created, the further use of the wrapper by other researchers is straightforward.

5 Conclusions and Future Work

In this paper we have presented Fénix, a data model designed for information exchange between NLP processes. The proposed data model is flexible and scalable, intended to provide a generic data representation relying on a reduced set of basic tags to cover a wide range of NLP tools and corpora.
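A wrapper of the kind described above can be illustrated with a short Python sketch; the class and method names are our own invention (the actual JPM templates are Java-based NetBeans templates):

```python
import xml.etree.ElementTree as ET

class SnippetsUnit:
    """Hypothetical wrapper for a Fenix 'snippets' unit; the API is
    illustrative, not the actual JPM template interface."""

    def __init__(self, unit_id):
        self.root = ET.Element("unit", id=unit_id, type="snippets")
        self.results = ET.SubElement(self.root, "item",
                                     id="results", data_type="vector")

    def add_snippet(self, url, title, snippet, score):
        # Each snippet becomes a struct item whose id is its vector index.
        idx = str(len(self.results))
        item = ET.SubElement(self.results, "item",
                             id=idx, data_type="struct")
        for name, value, dtype in (("url", url, "string"),
                                   ("title", title, "string"),
                                   ("snippet", snippet, "string"),
                                   ("score", str(score), "float")):
            info = ET.SubElement(item, "info", id=name, data_type=dtype)
            info.text = value

unit = SnippetsUnit("search_result")
unit.add_snippet("http://example.org", "Example", "An example page.", 0.9)
print(unit.root.find("./item/item/info[@id='url']").text)  # http://example.org
```

The wrapper hides the XML layout entirely, which is what makes such wrappers quick to develop and reuse.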
The Fénix architecture is accompanied by a set of programming libraries that facilitate the access and manipulation of the structures created in this framework. We have also presented a set of research projects and tools where this architecture has already been successfully applied. The application of Fénix allowed simplifying the integration and communication between processes in all the contexts described. As future work, we plan to continue applying the model as the core of information exchange in our current and future developments, with a particular focus on setting the basis in LegoLangUAge to build up the basic information units called L-Bricks. We also plan to release new libraries in different programming languages to make Fénix more accessible to the NLP research community.

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

Collecting and POS-tagging a lexical resource of Japanese biomedical terms from a corpus

Recogida y etiquetado morfológico de un lexicón de términos biomédicos en japonés a partir de corpus

Carlos Herrero Zorita, Leonardo Campillos Llanos, Antonio Moreno Sandoval
Laboratorio de Lingüística Informática, Departamento de Lingüística
Facultad de Filosofía y Letras, Universidad Autónoma de Madrid
c\ Francisco Tomás y Valiente, 1. Campus de Cantoblanco. Madrid
{carlos.herrero, leonardo.campillos,

Resumen: El artículo resume el proceso de recopilación de un lexicón de términos biomédicos en japonés etiquetados morfológicamente. En primer lugar se han considerado para esta tarea las características morfosintácticas del japonés, así como el origen y formación de los términos médicos en esta lengua. Posteriormente la lista se ha recopilado utilizando el corpus japonés MultiMedica, las etiquetas especiales de un etiquetador morfológico y varios diccionarios médicos especializados. Para el siguiente proceso de etiquetado se han considerado tres etiquetadores japoneses (ChaSen, Mecab, Juman), de los cuales se ha escogido este último. Una vez etiquetado, se ha corregido el problema de la sobresegmentación de los términos japoneses y se han simplificado las etiquetas para el propósito de nuestra tarea. Este recurso es la base para la creación de un extractor de términos médicos en japonés.

Palabras clave: terminología médica, japonés, recurso léxico, análisis morfológico.

Abstract: The following paper explains the methodology followed for the creation of a morphologically tagged medical lexicon in Japanese. In order to build this medical resource we have taken into account the morphosyntactic characteristics of the language as well as the origins and formation of the medical terms.
Following this, we compiled a list using the Japanese MultiMedica corpus, special tags from a POS tagger, and several specialised medical dictionaries. After considering three different taggers (ChaSen, Mecab, Juman), we finally chose Juman for the tagging of the lexicon. The problem of the oversegmentation of the language was then corrected and the tags were normalised. This resource is the base component for the creation of a medical term extractor.

Keywords: medical terminology, Japanese, lexical resource, POS tagging.

1 Introduction

Natural language processing tasks for domain-specific texts (e.g. biomedicine) rely upon comprehensive lexical resources. The Unified Medical Language System (UMLS) Specialist lexicon and Metathesaurus (Donnelly, 2006) are the major resources available for English (Bodenreider, 2006, provides further references). Despite the lack of thesauri for other languages, multilingual lexical databases such as EuroWordNet have also been applied in the field of medical terminology (Vivaldi and Rodríguez, 2002). In the following pages we present a morphologically tagged Japanese lexicon of the medical domain. This list of terms will be used for developing a Japanese automatic term recognition (ATR) system. The article explains the methodology followed towards the creation of the lexicon. For this purpose, we took two steps: firstly, we translated 548 Graeco-Latin medical affixes into Japanese; and secondly, we compiled the medical lexicon. The list of terms was compiled

using three resources: a medical corpus, Japanese morphological analysers, and specialised dictionaries. The lexicon was morphologically tagged, taking into account the morphosyntactic challenges that this language entails. The paper is organised as follows. Section 2 describes the MultiMedica project and corpus. Section 3 provides the theoretical background, including the formation of Japanese medical terms and the origins of medical terms in Japanese and in Western languages. Finally, Section 4 explains the steps taken towards the creation of the lexicon.

2 Description of the MultiMedica project and corpus

The data of this work is based on the Japanese texts from the MultiMedica corpus. This collection was compiled by the Computational Linguistics Laboratory at the Autonomous University of Madrid (LLI-UAM), as part of the MultiMedica project (Martínez et al., 2011). It is a specialised comparable corpus formed by biomedical texts written in Spanish, Arabic, and Japanese. The corpus assembles 51,476 documents and more than seven and a half million words in the three languages (Moreno-Sandoval and Campillos-Llanos, 2013). Documents were gathered from professional books and journals written by medical doctors, as well as from articles drafted by health professionals and edited by journalists. Thus, the corpus collects both technical and informative articles, and the texts cover most medical specialties.

2.1 The Japanese corpus

The Japanese corpus is made up of abstracts from medical journals on different specialties (e.g. Oriental Medicine, Obstetrics, and Gynecology). The total corpus size is 1,131,304 Japanese characters (kanji and kana) (Table 1). MultiMedica is a coordinated project involving the Universidad Carlos III de Madrid, the Universidad Politécnica de Madrid, and the Universidad Autónoma de Madrid.
It aims at research on natural language processing for the biomedical domain. Further information is available at:

2.2 The query interface

A further step of the project at the LLI-UAM was the development of a query interface that allows the user to query the corpus and build concordances. This tool, which is still in beta phase, will allow multiple search options, including the distinction between form and lemma, beginnings and endings of words, and morphological category, as well as other functionalities such as frequency extraction and collocations. Figure 1 shows an example of the result of a query for the kanji for 'liver' in the Japanese corpus. The next section will provide a brief description of previous work related to this matter, as well as the origins of the medical terms in Spanish, English, and Japanese.

Japanese corpus                                      Texts   Characters
Kampo Medicine (Oriental medicine in Japan)                  ,757
Kansenshogaku Zasshi (Infectious diseases journal)           ,879
Kanzo (Liver diseases journal)                       1,      ,674
ORLTokyo (Japanese otolaryngology)                           ,705
Sanfujinka no shinpo (Advances in obstetrics)                ,289

Table 1: Description of the Japanese corpus

3. Theoretical background

3.1. Previous work

Japanese proves to be a challenging language for natural language processing tasks, due to its morphosyntactic characteristics. One of the most noticeable problems is so-called oversegmentation (Hisamitsu and Nitta, 1996). Morphological taggers have problems tokenizing Japanese words because characters are not separated by blank spaces and because of the agglutinative nature of the language. This is especially problematic in formal or specialised discourse.

Figure 1: The search interface

This problem has been widely studied in works concerning compound word analysis and recognition of unknown terms (Nagata, 1999; Masaaki, 1999; Han et al., 2002; Kudo, 2007; Murawaki and Kurohashi, 2010, among others), with different approaches towards the development of automatic term extractors (Nakagawa and Mori, 2002, and Oh et al., 2000). As Murawaki and Kurohashi (2010: 832) explain, dictionaries are indispensable for Japanese morphological analysis because not only part-of-speech (POS) tagging is required, but also a process of segmentation. In our project, we have collected a Japanese medical lexicon that has been morphologically tagged; following this, the oversegmentation has been manually corrected. The next section reflects on Japanese medical terminology and on the challenges that arose when we processed it, in comparison to other languages.

3.2. Formation and types of terms

Medical terms in Western languages have their origins in Ancient Greece, in the Hippocratic Corpus. These terms were later adapted in Rome by Galen, whose medical practice would dominate medical knowledge until the beginning of the modern era (Longrigg, 2002: 29-39). For this reason, even though medical practices have changed today, the language of medical technicalities still has its origin in Ancient Greek and Latin. Terms are, therefore, formed by the addition of Graeco-Latin affixes. For example, gastritis ('inflammation of the lining of the stomach') is constructed with the root gastr- (from Latin gastro-, 'stomach', which originally evolved from Greek gastro-) and the Graeco-Latin suffix -itis ('diseases characterized by inflammation'). In Japanese, on the other hand, the picture is quite different.
From the early beginnings of Japanese culture, Chinese medicine was the major form of medical practice in Japan. Hence, medical terms were written with Chinese characters, adapted over the years into the Japanese kanji (Izumi and Isozumi, 2001: 91). However, the vast majority of the medical terms employed today were borrowed from Western languages. During the Sakoku era, the first medical terms from the West arrived through the medicine books traded with Dutch merchants (Irwin, 2011: 37). These words finally took official root in the language in the 19th century, when Japan opened up and the Meiji government adopted the German medical educational system (Irwin, 2011: 51). The initial loanwords were introduced by means of two different processes: on the one hand, (1) translation and coining into Sino-Japanese compounds using kanji and, on the other, (2) transcription into the katakana alphabet. Japanese combines three writing systems: kanji (ideograms of Chinese origin), hiragana (a Japanese syllabary), and katakana (also a syllabary, mainly used for transcribing foreign words); a fourth, non-Japanese alphabet, romaji, uses Latin characters. We will find, therefore, the following types of terms in our corpus:

- Japanese kanji characters, used for both Chinese Medicine and Western Medicine terms, e.g. ran-kan-gan (literally 'egg-tube-cancer'), 'fallopian tube cancer'.
- Transcriptions into katakana, e.g. a-ki-ne-ji-a, 'akinesia'.
- Terms using both kanji and katakana, e.g. chihatsu-sei/ji-su-ki-ne-ji-a (literally 'behind schedule-ness'/dyskinesia), 'tardive dyskinesia'.
- Borrowings, e.g. DNA.

(The examples include the literal translation, the reading in romaji, and the translation in English. In the case of katakana words, the literal and English translations overlap, since they are phonological transcriptions.)

Since Japanese is an agglutinative language, we can assume that the majority of terms written in kanji will be formed by composition using free morphemes. This process is very different from affixation in English and Spanish.

4. Methodology and results

4.1. Resources

For the development of this project, we used several tools for processing the Japanese language. First of all, we used the Juman morphological analyzer, developed at the Kurohashi Lab of Kyoto University; we also considered the ChaSen and Mecab taggers. As we will see in Section 4.4, we selected Juman due to the extensive morphological information it provides for each word, such as specialised tags that automatically recognise medical and anatomy terms. Secondly, we used two medical dictionaries: the Online Life Science Dictionary, which belongs to the Life Science Project developed at Kyoto University, and the Japanese-English-Chinese Dictionary (Asakura Shoten, 1994). Our work can be divided into two steps: (1) the translation and analysis of medical Graeco-Latin affixes from English to Japanese, and (2) the creation of a dictionary of Japanese medical terms.
Both stages are conditioned by the agglutinative nature of the Japanese language and by the fact that there are no blank spaces between words.

4.2. Translation of medical affixes

Using medical affixes for recognising medical terms has yielded a high level of precision (Estopà et al., 2000; Moreno-Sandoval et al., 2013). We took this approach for Japanese: the starting point was a list of 467 Graeco-Latin medical affixes collected by the LLI-UAM. Each of them was translated into Japanese using the online Japanese-English medical dictionary Online Life Science Dictionary, which allows the user to search for beginnings and endings of words. We discarded those that did not appear in the corpus. Afterwards, we tagged them using the Juman morphological analyser. We observed that this step would not achieve the same results as in Spanish. First of all, Japanese medical terms are predominantly formed by composition, adding free morphemes, rather than by affixation (Herrero-Zorita, 2013) (see Figure 2). Free morphemes, morphemes that can stand alone as independent words, are differentiated from prefixes and suffixes (bound morphemes), which appear as part of a larger word. Only 7.30% are affixes (3.47% prefixes and 3.83% suffixes), and these include affixes that do not belong exclusively to the medical domain. Secondly, these free morphemes do not mark a word as a medical term. Whereas a word containing cardio- in Spanish will refer to a term related to the heart, the translation of this prefix into Japanese results in a free morpheme meaning 'heart, mind'. Although this morpheme is equally used in medical terms, e.g. 'cardiopathy', it is also used in compounds that do not necessarily belong to the medical domain, for example 'adoration'.
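The over-generation of free morphemes can be illustrated with a toy substring check; the kanji 心 ('heart, mind') and the example words 心電図 ('electrocardiogram') and 安心 ('peace of mind') are our own illustrative choices, since the paper's original kanji examples are not reproduced here:

```python
# Sketch of naive morpheme-based flagging; the kanji examples are
# illustrative choices of ours, not taken from the paper.
MEDICAL_MORPHEMES = {"心"}  # 'heart, mind', a translation of cardio-

def flag_as_medical(term):
    """Flag a term as medical if it contains a translated morpheme."""
    return any(m in term for m in MEDICAL_MORPHEMES)

print(flag_as_medical("心電図"))  # 'electrocardiogram': True
print(flag_as_medical("安心"))    # 'peace of mind': also True, a false positive
```

The non-medical word is flagged just as readily as the medical one, which is precisely why affix translation alone could not replicate the Spanish results.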

Figure 2: Japanese morphemes resulting from translated Graeco-Latin affixes, according to Juman (free morphemes: 92.70%; prefixes: 3.47%; suffixes: 3.83%)

4.3. Compilation of the lexicon

The solution was to create a lexicon of Japanese medical terms and assign each of them a grammatical category. We compiled by hand a list of 31,458 terms taken from the two specialised dictionaries previously mentioned. Following this, we completed this listing with words extracted from the corpus, automatically recognised by means of the specialised tags included in Juman. Figure 3 represents an example of the list (the translation has been provided in this paper for the sake of clarity). Table 2 shows the distribution of the terms according to the writing system. We can see that the usage of kanji is predominant.

Figure 3: Sample of the Japanese lexicon (glosses: 'prediabetic state', 'psychotic state', 'preneoplastic condition', 'precancerous condition')

Writing System      Terms    %
Kanji               23,
Kanji + Katakana     6,
Katakana             2,
Borrowings
Hiragana

Table 2: Distribution of the terms

4.4. Tagging process

The list was then tagged, since we needed additional linguistic information from the terms. There are three widespread Japanese taggers that we considered for this task: Juman, ChaSen, and Mecab. The main problem in this step was oversegmentation: medical terms formed by two or more kanji that do not appear in common dictionaries are split into recognisable morphemes. The degree of segmentation depends on the tagger. For example, the term 'liver biopsy' is divided as follows: Juman splits it into two terms, 'liver' and 'biopsy'; ChaSen and Mecab split it into three words, 'liver', 'raw', and 'examination'. Also, each tagger provides different degrees of linguistic information.

To choose the appropriate tagger, we carried out three comparisons between the three taggers, taking into account the problem of oversegmentation and the morphological information provided. First, we tagged the MultiMedica corpus using each program; a word list was obtained, and we looked up the terms in our lexicon (Table 3). Secondly, we tagged the lexicon and observed how many words were generated by each tagger after segmentation (Figure 4). Thirdly, we observed the degree of information given by each of them, taking as an example the word 'school' (Table 4).

         Types in word list   Words found in lexicon   %
ChaSen   10,020               2,
Juman    10,819               2,
Mecab    11,575               2,

Table 3: Terms from the corpus in the lexicon

Figure 4: Words obtained after tagging the lexicon of 31,458 terms (Mecab: 91,537; Juman: 94,031; ChaSen: 96,289)
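The writing-system classes of Table 2 can be approximated automatically from Unicode code-point ranges; the following sketch (ours, with simplified ranges and illustrative terms) shows the idea:

```python
def script_of(ch):
    """Map a character to its writing system via Unicode block
    (ranges simplified: only the main blocks are covered)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"       # CJK Unified Ideographs
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    return "other"           # Latin letters, digits, symbols

def classify_term(term):
    """Bucket a term into one of the writing-system classes of Table 2."""
    scripts = {script_of(ch) for ch in term}
    if scripts == {"kanji"}:
        return "Kanji"
    if scripts == {"katakana"}:
        return "Katakana"
    if scripts == {"kanji", "katakana"}:
        return "Kanji + Katakana"
    if scripts == {"hiragana"}:
        return "Hiragana"
    return "Borrowing/other"

print(classify_term("肝生検"))      # Kanji
print(classify_term("アキネジア"))  # Katakana
print(classify_term("DNA"))        # Borrowing/other
```

The example terms ('liver biopsy', 'akinesia', DNA) are our own choices to exercise each branch.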

ChaSen: word, reading, lemma, tag, subtag
Mecab: word, tag, subtag, lemma, reading, reading variation
Juman: word, reading, lemma, tag, subtag, reading variation, domain

Table 4: Information given by each tagger

The three taggers retrieve similar results regarding the segmentation of words. However, Juman provides a wider range of morphological information, including the specialised tags that indicate the domain of the words. For this reason, we chose it for the tagging of the lexicon. After the tagging and segmentation were completed by Juman, we re-joined the morphemes, recreating the complete terms: 63,079 morphemes, 66.72% of the total (94,031), had been erroneously split and were corrected (see Section 4.5). Then, we assigned to each term the category given to the rightmost morpheme, and finally we translated the category. Since Japanese is a right-headed language (Miyaoka and Tamaoka, 2005: 46), the head situated at the rightmost position determines the category of the complete compound. Figure 5 shows an example with the word 'pituitary gland'. Through this procedure, we created a dictionary of morphologically tagged Japanese terms. This list includes long terms such as 'Autocrine Motility Factor receptor'. These types of terms would have been segmented and not recognised automatically by any of the three morphological analysers (Figure 6).
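The re-joining and head-assignment procedure just described can be sketched as follows; the (surface, POS) pairs stand for simplified tagger output, and the kanji for 'liver biopsy' (肝生検, split by Juman into 肝 'liver' + 生検 'biopsy') are our reconstruction of the paper's example:

```python
# Sketch of re-joining oversegmented morphemes and assigning the POS of
# the rightmost morpheme (Japanese compounds are right-headed).
def rejoin(morphemes):
    """morphemes: list of (surface, pos) pairs produced by a tagger
    for ONE lexicon term that was split up."""
    surface = "".join(s for s, _ in morphemes)
    head_pos = morphemes[-1][1]  # right-hand head rule
    return surface, head_pos

# 'liver biopsy' split by Juman into 'liver' + 'biopsy' (both nouns):
print(rejoin([("肝", "N"), ("生検", "N")]))  # ('肝生検', 'N')
```

The right-hand head rule makes the category assignment trivial once the split morphemes of a term are known.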
Figure 6: Sample of the Japanese lexicon (including translation): 'Episcleritis' N; 'Isoelectric point' N; 'Protozoa' N; 'Pseudoaneurysm' N; 'Inoperable' N ADJ; c-met 'C-Met protein' N; 'Cell volume' N; 'Histochemical' ADJ; 'Gene transfer' N; 'Cytoplast' N; 'Suppository' N; 'To implant' V; 'Virus integration' N; 12 'Twelfth cranial nerve' N

Figure 5: Correction and tagging of the segmentation

Notes: sahenmeishi is a type of noun that can be attached to the auxiliary suru ('to do') to form a verb; for example, from the noun benkyou ('study') we can form the verb benkyou-suru ('to study'). The term 'Autocrine Motility Factor receptor' is formed by jikobunbi ('autocrine') + saibouundou ('cell mobility') + shigekiinshin ('stimulating factor') + juyoutai ('receptor').

4.5. Dealing with oversegmentation

Following this, the compilation of such a lexicon allowed us to correct the medical terms that were oversegmented after the tagging of the MultiMedica corpus. For this purpose, we first

looked for the terms from the lexicon that appear in the corpus (6,811 types; see Table 5). We then tagged the corpus with Juman. Lastly, we followed the same process as with the lexicon: we corrected the segmented terms and assigned them their POS tag. Table 5 shows the results of this operation:

                          Types    % of the total corpus   % of the terms extracted
MultiMedica (tagged)      28,
Lexicon terms in corpus    6,811
Corrections                5,

Table 5: Results of correcting oversegmentation in the MultiMedica corpus

We can observe the overall importance of the oversegmentation problem: the tagger split more than 66% of the morphemes of the lexicon (Section 4.4); this led to a correction of around 84% of the terms extracted, 20.37% of the total corpus. In other words, the reliability of the current taggers for Automatic Term Recognition is very low. Both outcomes should be taken into account when processing complex lexical units in non-segmenting languages such as Japanese or Chinese.

5. Conclusions and future work

In this paper we have presented a morphologically tagged lexicon of Japanese medical terms. First, we have explored the origins and formation of medical terms; secondly, we have presented the problems of morphological segmentation; and finally, we have explained the process of compiling the lexicon. From our experience, it seems imperative to take into account the morphosyntactic characteristics of Japanese when performing a natural language processing task, especially when dealing with automatic tagging. The agglutinative nature of the language and the lack of white spaces between words are the main problems for these types of tasks. In order to compile the lexicon we have used two medical dictionaries and the special tags of the Juman tagger.
After the tagging process, we have overcome the oversegmentation problem by manually joining together the separated morphemes, and we have translated the tags to a universal codification. This lexicon will not only serve as a lexical resource and a reliable source of information, but will also become the foundation for an automatic medical term extractor.

Acknowledgements

This research has been funded by MINECO (under grant TIN C03-03) and by the Madrid Regional Government (grant MA2VICMR).

Bibliography

Bodenreider, O. 2006. Lexical, Terminological and Ontological Resources for Biological Text Mining. In S. Ananiadou and J. McNaught (eds.), Text mining for biology and biomedicine. Boston: Artech House.

Donnelly, K. 2006. SNOMED-CT: The Advanced Terminology and Coding System for eHealth. In L. Bos et al. (eds.), Medical and Care Compunetics, 3.

Estopà, R., J. Vivaldi, and Mª. T. Cabré. 2000. Use of Greek and Latin forms for term detection. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, Greece.

Han, D., T. Ito, and T. Furugoori. 2002. A Deterministic Method for Structural Analysis of Compound Words in Japanese. In Language, Information and Computation: Proceedings of the 16th Pacific Asia Conference. Jeju, Korea.

Herrero-Zorita, C. 2013. An initial approach on medical term formation in Japanese through the usage of corpora. In Proceedings of the 7th Corpus Linguistics Conference 2013, Lancaster University, Lancaster, United Kingdom, July.

Hisamitsu, T., and Y. Nitta. 1996. Analysis of Japanese compound nouns by direct text scanning. In Proceedings of the 16th Conference on Computational Linguistics, 1. Stroudsburg, PA, USA.

Irwin, M. 2011. Loanwords in Japanese. Amsterdam: John Benjamins Publishing.

Izumi, Y., and K. Isozumi. 2001. Modern Japanese medical history and the European influence. The Keio Journal of Medicine, 50(2).

Japanese-English-Chinese dictionary. 1994. Asakura Shoten.

Kudo, M. 2007. A lexical semantic study of four-character Sino-Japanese compounds and its application to machine translation. PhD Thesis, Dept. of Linguistics, Simon Fraser University.

Longrigg, J. 2002. Medicine in the Classical World. In I. Loudon (ed.), Western Medicine. Oxford: Oxford University Press.

Martínez, P., J.C. González-Cristóbal, and A. Moreno-Sandoval. 2011. MULTIMEDICA: Extracción de información multilingüe en Sanidad y su aplicación a documentación divulgativa y científica. Procesamiento del Lenguaje Natural, 47. Retrieved from ex.php/pln/article/view/1003

Masaaki, N. 1999. A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context. In Proceedings of ACL.

Miyaoka, Y., and K. Tamaoka. 2005. Investigation of the Right-hand Head Rule Applied to Japanese Affixes. Glottometrics, 10.

Moreno-Sandoval, A., and L. Campillos-Llanos. 2013. Design and Annotation of MultiMedica: A Multilingual Text Corpus of the Biomedical Domain. Procedia - Social and Behavioral Sciences, 95(25).

Moreno-Sandoval, A., L. Campillos-Llanos, A. González-Martínez, and J. M. Guirao. 2013. An affix-based method for automatic term recognition from a medical corpus of Spanish. In Proceedings of the 7th Corpus Linguistics Conference 2013, Lancaster University, Lancaster, United Kingdom, July.

Murawaki, Y., and S. Kurohashi. 2010. Online Japanese Unknown Morpheme Detection using Orthographic Variation. In LREC. European Language Resources Association.

Nakagawa, H., and T. Mori. 2002. A simple but powerful automatic term extraction method. In Proceedings of COMPUTERM 2002: Second International Workshop on Computational Terminology.

Oh, J-H., J. Lee, K-S. Lee, and K-S. Choi. 2000. Japanese term extraction using dictionary hierarchy and machine translation system. Special Issue of Terminology, 6(2).

Online Life Science Dictionary. Available at: ja/service/weblsd/index.html.

Vivaldi, J., and H. Rodríguez. 2002. Medical Term Extraction using EWN ontology. In Proceedings of Terminology and Knowledge Engineering 2002 (TKE 02). Nancy.

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

TASS 2013: A Second Step in Reputation Analysis in Spanish
TASS 2013: Un Segundo Paso en Análisis de Reputación en Español

Julio Villena-Román, Janine García-Morera
Daedalus, S.A., Av. de la Albufera, Madrid, Spain

Sara Lana-Serrano, José Carlos González-Cristóbal
Universidad Politécnica de Madrid, E.T.S.I. Telecomunicación, Ciudad Universitaria s/n, Madrid, Spain

Resumen: TASS 2013 es la segunda edición del taller de evaluación experimental en el congreso anual de la SEPLN dedicado al análisis de reputación en español. El principal objetivo es fomentar la investigación en técnicas y algoritmos avanzados para realizar análisis de sentimientos y clasificación automática de opiniones extraídas de mensajes cortos en medios sociales en español. Este artículo describe en profundidad, en comparación con la edición anterior, las tareas propuestas este año, el contenido, formato y las estadísticas principales de los corpus generados, los participantes y los diferentes enfoques planteados, así como los resultados generales obtenidos y las lecciones aprendidas en estos dos años.
Palabras clave: TASS 2013, análisis de reputación, análisis de sentimientos, clasificación automática de texto, medios sociales, español.

Abstract: TASS 2013 is the second edition of the experimental evaluation workshop within the SEPLN annual conference, focused on reputation analysis in the Spanish language. The main objective is to foster research on advanced algorithms and techniques for performing sentiment analysis and automatic text categorization on opinions extracted from short social media messages in Spanish. This paper fully describes the proposed tasks; the contents, format and main figures of the generated corpus; the participant groups and their different approaches; and, finally, the overall results achieved and the lessons learned in these two years.
Keywords: TASS 2013, reputation analysis, sentiment analysis, text categorization, social media, Spanish.

1 Introduction

TASS is an experimental evaluation workshop on reputation analysis focused on the Spanish language, organized as a satellite event of the SEPLN Conference. After a successful first edition in 2012 (Villena-Román et al., 2013), TASS was held on September 20th, 2013 at Universidad Complutense de Madrid, Spain. The long-term objective of TASS is to foster research in the field of reputation analysis, i.e., the process of tracking, investigating and reporting an entity's actions and other entities' opinions about those actions, in the Spanish language. As a first approach, reputation analysis has at least two technological aspects: sentiment analysis and text classification.

Sentiment analysis is the application of natural language processing and text analytics to identify and extract subjective information from texts. It is a major technological challenge, and the task is so hard that even humans often disagree on the sentiment of a given text, as issues that one individual finds acceptable or relevant may not seem so to others. And the shorter the text (for instance, Twitter messages or short comments in Facebook), the harder the task becomes. On the other hand, automatic text classification (or categorization) is used to guess the topic of a text among a predefined set of categories, so as to be able to assign the reputation level along different axes or
ISSN 1135-5948. Sociedad Española para el Procesamiento del Lenguaje Natural

points of view of analysis. Text classification techniques, although studied for a long time, still need more research effort to build complex models with many categories with less workload and to increase the precision and recall of the results. In addition, these models should deal with the specific text features of social media messages (such as spelling mistakes, abbreviations, etc.).

Within this context, the aim of TASS is to provide a forum for discussion where the latest research work in these fields can be discussed by the scientific and business communities. The setup is based on a series of challenge tasks intended to provide a benchmark for comparing different approaches. In addition, with the creation and open release of the fully tagged corpus, the aim is to provide a common reference dataset for the research community.

The rest of the paper is organized as follows. Section 2 describes the corpus provided to participants and used for the challenge tasks. The third section describes the different tasks proposed this edition. Section 4 describes the participants, and the overall results are presented in Section 5. The last section draws some conclusions and future directions.

2 Corpus

Experiments were based on two corpora. After the workshop, both were made freely available to the community for research purposes. The only requirement was to send a request stating one's affiliation and a brief description of the research objectives, and to include a proper citation in publications.

2.1 General corpus

The general corpus, the same one used in 2012, contains Twitter messages written in Spanish by about 150 well-known personalities and celebrities of the world of politics, economy, communication, mass media and culture, between November 2011 and March 2012. Each message includes its ID (tweetid), the creation date (date) and the user ID (user).
According to the Twitter API Terms of Service², text contents and user information had to be removed for the corpus distribution. The general corpus was divided into two sets: training (about 10%) and test (90%). The training set was released so that participants could train and validate their models. The test corpus was provided without any tagging and was used to evaluate the results provided by the different systems. Table 1 shows a summary of the training and test sets.

    Attribute       Value
    Tweets
    Tweets (train)   (11%)
    Tweets (test)    (89%)
    Topics          10
    Users           154
    Date start      T00:03:32
    Date end        T23:47:55

    Table 1: General corpus statistics

Each message in both the training and test sets was tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels were defined: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+), plus one additional no-sentiment tag (NONE). In addition, the level of agreement of the sentiment expressed within the text was also included, to distinguish whether a neutral sentiment comes from neutral keywords (AGREEMENT) or the text contains positive and negative sentiments at the same time (DISAGREEMENT). Moreover, the polarity at entity level, i.e., the polarity values related to the entities mentioned in the text, was also included where applicable. These values were similarly divided into 5 levels and include the level of agreement for each entity. On the other hand, a set of topics was selected based on the thematic areas covered by the corpus, such as politics, literature or entertainment. Each message in both the training and test sets was assigned to one or several of these topics. The list of selected topics is shown later in Table 7.

² https://dev.twitter.com/terms/api-terms
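Concretely, a tagged corpus entry could look roughly like the following sketch. The tweetid, date and user fields are named in the text above; the sentiments/topics element names and the sample values are assumptions for illustration only, not the exact released schema.

```xml
<!-- Illustrative only: element layout beyond tweetid/user/date is assumed -->
<tweet>
  <tweetid>142389495503925248</tweetid>
  <user>some_user</user>
  <date>2011-12-02T00:03:32</date>
  <sentiments>
    <!-- global polarity: one of P+, P, NEU, N, N+, NONE -->
    <polarity><value>P+</value><type>AGREEMENT</type></polarity>
  </sentiments>
  <topics>
    <topic>política</topic>
  </topics>
</tweet>
```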
All tagging was carried out semi-automatically: a baseline machine learning model was first run (Villena-Román et al., 2011) and then all tags were manually checked by two human experts. For polarity at entity level, due to the high volume of data to check, this tagging was done only for the training set. Figure 1 shows the information of two sample tweets. The first tweet is only tagged

with the global polarity (P+) and the agreement level (AGREEMENT), as it contains no mentions of any entity, but the second one is tagged with the global polarity (P), the agreement level (AGREEMENT) and the polarity associated to each of the entities that appear in the text (UPyD and Foro Asturias, both tagged as P).

Figure 1: Sample tweets (General corpus)

2.2 Politics corpus

The Politics corpus, new in this edition, contains 2,500 tweets, gathered³ during the electoral campaign of the 2011 General Elections in Spain from Twitter messages mentioning any of the four main national-level political parties: PP, PSOE, IU and UPyD. Similarly to the General corpus, the global polarity and the polarity at entity level for those four entities were manually tagged for all messages. However, in this case, due to the lack of time and the high amount of work that the tagging required, only 3 levels were used: positive (P), neutral (NEU), negative (N), plus the additional no-sentiment tag (NONE). The format was the same as for the General corpus, but the entity element includes a source attribute indicating the political party the entity refers to. The following figure shows the information of one sample tweet. The global polarity is N with AGREEMENT, and the polarity at entity level for the entity whose source is PP is also N with AGREEMENT.

Figure 2: Sample tweet (Politics corpus)

³ This corpus was completely built by E. Martínez-Cámara (SINAI group, Universidad de Jaén), member of the organization of TASS.

3 Tasks

This year four tasks were proposed, extending the two tasks offered in TASS 2012 and covering different aspects of sentiment analysis and text classification.

3.1 Task 1: Sentiment Analysis at Global Level

This task consisted in performing automatic sentiment analysis to determine the global polarity of each message in the test set of the General corpus.
Participants were provided with the training set of the General corpus so that they could train and validate their models. There were two different evaluation criteria: i) fine-grained polarity using 5 levels, and ii) coarse-grained polarity with just 3 levels. The standard metrics of precision, recall and F-measure, calculated over the test set, were used to evaluate and compare the different systems.

3.2 Task 2: Topic Classification

The challenge of this task was to automatically identify the topic of each message in the test set of the General corpus. Participants could use the training set of the General corpus to train and validate their models.

3.3 Task 3: Sentiment Analysis at Entity Level

This task was similar to Task 1, but sentiment polarity (using 3 levels) had to be determined

at entity level for each message in the Politics corpus. In this case, the polarity at entity level included in the training set of the General corpus could be used by participants to train and validate their models (converting from fine-grained to coarse-grained polarity). Entities were tagged in the corpus to make participants focus on sentiment analysis rather than on entity recognition. The difficulty of the task arises from the fact that messages can contain more than one sentence, with more than one entity per sentence, so more advanced text processing techniques are needed.

3.4 Task 4: Political Tendency Identification

This task moves one step forward towards reputation analysis: the objective is to estimate the political tendency of each user in the test set of the General corpus, with four possible values: LEFT, RIGHT, CENTRE and UNDEFINED. Participants could use whatever strategy they decided, but a first approach could be to aggregate the results of the previous tasks by author and topic.

4 Participants

31 groups registered (compared to 15 groups last year) and finally 14 groups (9 last year) sent their submissions. The list of active groups is shown in Table 2, including the tasks in which they participated.

    Group           Tasks
    CITIUS-Cilenis  X X
    DLSI-UA         X
    Elhuyar         X
    ETH-Zurich      X X X X
    FHC25-IMDEA     X
    ITA             X
    JRC             X
    LYS             X X X
    SINAI-EMML      X
    SINAI-CESA      X X X X
    Tecnalia-UNED   X
    UNED-JRM        X X
    UNED-LSI        X X
    UPV             X X X X

    Table 2: Participant groups

Along with the experiments, all participants were invited to submit a paper describing their experiments and discussing the results with the audience in the workshop session. These papers should follow the usual SEPLN template and could be written in Spanish or English.
Papers were reviewed by the program committee and included in the workshop proceedings (Díaz Esteban, Alegría y Villena Román, 2013).

In these two years, the trend has been to adopt a supervised machine learning approach to sentiment analysis, mainly using Weka (Hall et al., 2009), with text processing often based on Freeling (Padró and Stanilovsky, 2012). For instance, the CITIUS-Cilenis runs achieved good performance using a Naive Bayes binary classifier to distinguish between just two sharp polarity categories (positive and negative), with experimentally set thresholds for detecting the fine-grained polarity values. Another supervised approach, based on SVM, was used by Elhuyar, including linguistic knowledge-based processing with Freeling and the tagging of polarity words, emoticons, negation and spelling errors. Similarly, UPV used an SVM approach (based on the libsvm library for Weka) and submitted runs for all tasks that often rank among the top results. In addition, as the type of language used in social networks (non-grammatical phrases, lack or misuse of punctuation symbols, specific terminology, etc.) is not covered by the standard publicly available tools, they made specific adaptations to improve tokenization; both Freeling and Tweetmotif⁴, adapted to Spanish, were used. JRC also adapted a supervised approach based on different feature combinations, originally designed for English, to Spanish, using several in-house dictionaries and machine-translated data. UNED-JRM also dealt with both Task 1 and Task 2 as purely classification tasks, developing one classifier used indifferently for both tasks, with similar results. Tecnalia-UNED also relied on advanced linguistic processing (again based on Freeling) to deal with complex issues such as negation detection and emphasizer treatment (aiming at distinguishing the range of polarity levels). LYS presented the best-performing approach in the topic classification task.
In addition to an ad-hoc normalization process, POS tagging and dependency parsing algorithms are applied, and psychological resources are used to exploit the psychometric properties of human language (Vilares, Alonso and Gómez-Rodríguez, 2013).

⁴ https://github.com/brendano/tweetmotif
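The "binary classifier plus thresholds" idea attributed above to CITIUS-Cilenis can be sketched as follows. This is a minimal illustration; the function name and the threshold values are invented, not those actually tuned by the group.

```python
def coarse_to_fine(p_pos, t_strong=0.85, t_weak=0.6):
    """Map a binary positive-class probability to the 5 TASS polarity levels.

    Hypothetical sketch: a two-class (positive/negative) classifier outputs
    p_pos, and experimentally set thresholds carve it into fine-grained labels.
    """
    if p_pos >= t_strong:
        return "P+"          # confidently positive
    if p_pos >= t_weak:
        return "P"           # mildly positive
    if p_pos > 1 - t_weak:
        return "NEU"         # the classifier is undecided
    if p_pos > 1 - t_strong:
        return "N"           # mildly negative
    return "N+"              # confidently negative
```

With the default thresholds, a score of 0.9 maps to P+ and a score of 0.5 to NEU; in a real system the thresholds would be tuned on the training set.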

Quite differently, the SINAI-EMML group opted for a completely unsupervised strategy based on the combination of three linguistic resources: SentiWordNet, Q-WordNet and iSOL. The polarity value is calculated as the normalized sum of the differences between the positive and negative values of each term. Sentiment lexicons are also present in most systems. For instance, the contribution from DLSI-UA consisted of two different graph-based approaches: a modified version of a ranking algorithm (RA-SR) using bigrams, used in Task 2 of the SemEval 2013 competition, and a new proposal using a skip-gram scorer. Both approaches create sentiment lexicons able to retain the context of the terms, and employ machine learning techniques to detect the polarity of a text. All their runs appear among the top 10 best results, and their combination reaches the first position. Another graph-based approach, for topic classification, was presented by FHC25-IMDEA. They used a technique based on graph similarity to classify Twitter messages. Their assumption is that any text can be represented as a graph: for a given text, their system places the terms (actually the stems) in the vertices of a graph and creates weighted links among them. Their hypothesis is that graphs belonging to texts of the same topic usually form distinctive structures (i.e., a topic graph); thus, a metric is used to calculate the similarity between the graph of the text to classify and the different topic graphs. Other interesting approaches are based on Information Retrieval (IR) techniques. For instance, SINAI-CESA propose a solution using Latent Semantic Analysis.
Training data is taken from the continuous stream of posts from Twitter, capturing those that are likely to include affective expressions and generating a corpus of "feelings" labeled according to their polarity, without using any training data from controlled corpora, so as to avoid domain-related limitations. Similarly, UNED-LSI adopt an IR approach where the classes are modeled according to the textual information of the tweets belonging to each class and used as queries (Castellanos, Cigarrán and García-Serrano, 2012). The ITA group made some experiments with the Non-Axiomatic Reasoning System⁶, a general-purpose reasoning system, as a tool to dynamically discover content words and phrases carrying opinion. Their idea is to use a seed dictionary to look for words of similar polarity. Last but not least, ETH-Zurich present an interesting study of political discourse and emotional expression, analyzing the political position of four major parties through their Twitter activity and revealing that Twitter political discourse depends on subjective perception and resembles the political space of Spain.

⁶ https://sites.google.com/site/narswang/

5 Results

Participants were expected to submit one or several runs for one or several of the tasks. Results had to be submitted in a plain text file with the following format:

    id \t output \t confidence

where:
- id is the tweet ID for Tasks 1 and 2, the combination of tweet ID and entity for Task 3, and the user ID for Task 4;
- output is the expected output of each task (polarity, topic, political tendency);
- confidence is a number in the range [0, 1] indicating the confidence assigned by the system (not currently used for evaluation).

After the submission deadline, runs were collected by the organization and the evaluation results were made available to the participants so that they could prepare their reports.
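The submission format and the per-label precision/recall/F1 evaluation described above can be sketched as follows. This is a minimal illustration, not the organizers' actual PHP evaluation script.

```python
from collections import defaultdict

def parse_run(lines):
    """Parse submission lines in the 'id<TAB>output<TAB>confidence' format."""
    run = {}
    for line in lines:
        ident, output, confidence = line.rstrip("\n").split("\t")
        run[ident] = (output, float(confidence))
    return run

def evaluate(run, gold):
    """Per-label precision, recall and F1 of a run against a gold standard."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for ident, label in gold.items():
        predicted = run.get(ident, (None, 0.0))[0]
        if predicted == label:
            tp[label] += 1
        else:
            fn[label] += 1
            if predicted is not None:
                fp[predicted] += 1
    scores = {}
    for label in sorted(set(gold.values())):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```

Averaging the per-label triples then gives the overall figures reported in the result tables.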
Results included a spreadsheet with the overall evaluation figures for each task, as well as detailed results per experiment for all five evaluations (as explained before, Task 1 was evaluated using both 5-level and 3-level setups), the confusion matrix with all labels to allow error analysis, and, finally, the gold standard for the task itself. The PHP script used for the evaluation of each submission was also included for convenience.

5.1 Task 1: Sentiment Analysis at Global Level

56 runs (10 of them specific to the 3-level evaluation) were submitted by 13 different groups. Results for the best-ranked experiment from each group are listed in the tables below. All tables show the precision (P), recall (R) and F1 value achieved in each experiment. Table 3 considers 5 polarity levels. Precision values range from 61.6% to 12.6%, and the average value is 43.3% for all metrics.

    Run Id                            P   R   F1
    DLSI-UA-pol-dlsiua3-3-5l
    Elhuyar-TASS2013_run
    UPV_ELiRF_task1_run
    CITIUS-task1_
    lys_global_sentiment_task_6c
    JRC-tassTrain-base-DICT-5way
    ITA_ResultadosAnalisisOpiniónAlg
    LSI_UNED_2_TASK1_RUN_
    UNED-JRM-task1-run
    TECNALIA-UNED
    ETH-task1-Warriner
    sinai_emml_task1_6classes
    sinai_cesa-task1_raw

    Table 3: Results for Task 1 with 5 levels

Table 4 gives the results considering classification with only 3 levels. In this case precision values improve, as expected, since the task is easier: precision now ranges from 68.6% to 23.0%, and the average value for all metrics is 53.0%.

    Run Id                            P   R   F1
    Elhuyar-TASS2013_run
    UPV_ELiRF_task1_run
    CITIUS-task1_
    DLSI-UA-pol-dlsiua3-3-5l
    lys_global_sentiment_task_6c
    JRC-tassTrain-base-DICT-3way
    ITA_ResultadosAnalisisOpiniónAlg
    TECNALIA-UNED
    UNED-JRM-task1-run
    LSI_UNED_2_TASK1_RUN_
    ETH-task1-Warriner
    sinai_emml_task1_3classes
    sinai_cesa-task1_raw

    Table 4: Results for Task 1 with 3 levels

Initially, a gold standard was generated by pooling all submissions with a voting scheme, followed by an extensive human review of the ambiguous decisions. However, as some groups had submitted many runs and others only a few, some concern arose about a possible bias. To avoid any systematic problem, the gold standard creation should have been repeated, or at least carefully evaluated for correctness. Due to the summer holidays and the lack of human resources for the task, the gold standard of TASS 2012, which was not subject to this bias as the number of submissions was balanced, was finally used to evaluate the submissions. The distribution of labels in both the training and test corpora is shown in Table 5. Obviously, the distribution is not evenly balanced across the two corpora, i.e., the gold standard may not be well built.
As a consequence, a system able to correctly classify P+ and NONE with high precision (together these account for about 70% of the tweets in the test corpus), even if it is not so good at classifying the other labels, may achieve better results on the test corpus than on the training corpus, as actually reported by some participants (CITIUS-Cilenis and Elhuyar). Obviously, this has to be taken into account in future initiatives.

    Label   Frequency (Train)   Frequency (Test)
    P+                          34.12%
    P       4.12%               2.45%
    NEU     8.45%               2.15%
    N       16.91%              18.56%
    N+                          7.5%
    NONE    23.58%              35.22%

    Table 5: Sentiment distribution

This is, for example, the case of the CITIUS-task1_1 run, which achieves better results than lys_global_sentiment_task_6c but is less balanced (Table 6).

    Label   CITIUS   LYS   DLSI   Elhuyar
    P+
    P
    NEU
    N
    N+
    NONE
    all

    Table 6: Precision per sentiment label

Another interesting comparison is the top-ranked run, DLSI-UA-pol-dlsiua3-3-5l, vs. the second-ranked, TASS2013_Elhuyar_run1. Results from Elhuyar are quite balanced and comparable to the LYS run, but they rank higher because they achieve greater precision for all labels but N+ and NEU. In turn, results

from DLSI are better than the Elhuyar run because their system performs better for P+ and NONE, which are the most frequent labels. This issue must be studied for eventual future editions.

5.2 Task 2: Topic Classification

This task was evaluated as a single-label classification. The most restrictive criterion was applied: a success is counted only when all the test labels have been returned. As in Task 1, the gold standard finally considered was the one used in TASS 2012. The distribution of topics in both the training and test corpora is shown in Table 7. The total count is greater than the number of tweets, as several topics could be assigned per tweet.

    Topic          Frequency (Train)   Frequency (Test)
    Politics       (33%)               (43%)
    Other          (24%)               (40%)
    Entertainment  (17%)               (8%)
    Economy        942 (10%)           (3%)
    Music          566 (6%)            (2%)
    Soccer         252 (3%)            823 (1%)
    Films          245 (3%)            596 (1%)
    Technology     217 (2%)            287 (0%)
    Sports         113 (1%)            135 (0%)
    Literature     103 (1%)            93 (0%)

    Table 7: Topic distribution

20 experiments were submitted in all. Table 8 shows the results for this task. The average values are 62.4% precision, 44.4% recall and 49.6% F1. Precision ranges from 80.4% to 16.1%. As in Task 1, different submissions from the same group usually obtain similar values. No approach (learning-, graph- or IR-based) clearly stands out above the others.

    Run Id                         P   R   F1
    lys_topic_task_with_user_info
    LSI_UNED_2_TASK2_RUN_
    UPV_ELiRF_task2_run
    ETH-task
    FHC25-IMDEAults_PR_GD_TT
    FHC25-IMDEAults_PR_TT
    UNED-JRM-task2-run
    sinai_cesa-task2_normalized

    Table 8: Results for Task 2

Some participants, such as FHC25-IMDEA, pointed out that, as shown in Table 7, the distribution is quite balanced between the two corpora but not across topics. This may cause the trained systems to be biased towards the most frequent topics (Politics and Other).
Systems that are optimized for those categories, even at the cost of low performance on the less frequent topics, will seem to achieve a better overall result than a more balanced system.

5.3 Task 3: Sentiment Analysis at Entity Level

The evaluation was made over the Politics corpus, which was tagged manually, so the gold standard was created without pooling. Finally, 6 runs were submitted for this task. Results are shown in Table 9. Average precision is 37.2%, recall is 36.5% and F1 is 36.9%. These figures are much lower than in Task 1: this task is harder, and systems have not yet reached an adequate level of development, as learning-based approaches are not able to represent enough knowledge about the semantic content of the text.

    Run Id                           P   R   F1
    CITIUS-task3_CITIUS.txt
    UPV_ELiRF_task3_run0.txt
    sinai_cesa-task3_normalized.tsv
    ETH-task3.txt

    Table 9: Results for Task 3

5.4 Task 4: Political Tendency Identification

The gold standard was built manually by reviewing each user's political tendency, as defined by himself/herself, or assigning UNDEFINED if not stated or unknown. 11 runs were submitted. Results are shown in Table 10.

    Run Id                             P   R   F1
    ETH-Task4-Crowdsource.txt [MANUAL]
    UPV_ELiRF_task4_run1.txt
    sinai_cesa-task4_nound_raw.tsv
    lys_political_tendency_task_model

    Table 10: Results for Task 4

Average values for precision, recall and F1 are 57.7%, 51.7% and 54.1%, respectively. The run from ETH is based on a manual assignment of political tendency to each user, made through crowdsourcing, so it was expected to achieve the best result against the gold standard, as indeed it does.

6 Conclusions and Future Work

TASS is the first workshop on reputation analysis specifically focused on Spanish. This second edition of TASS has been even more successful than the first one: the number of participants increased to 31 registered groups (15 groups last year), of which 14 groups (9 last year) sent submissions. The number of participants and the quality of their work have met and gone beyond all our expectations. A more detailed analysis of the results is still necessary and is in our short-term roadmap. However, the reports from participants and the developed corpora are already valuable resources, helpful for other research groups approaching these tasks. Furthermore, the reuse of the General corpus over these two years makes it possible to analyze the evolution of the field and provides a benchmark for future research. The TASS 2012 corpus has been downloaded by more than 50 research groups, 20 of them from outside Spain. We hope to reach a similar impact with the new corpus. Some ideas for future editions gathered during the workshop involve solving the uneven distribution of the corpus, the inclusion of text normalization issues, the development of new corpora with different varieties of Spanish, and tasks related to irony detection, mixed sentiments (disagreement within the text), subjectivity and the speaker's point of view (first person vs. eyewitness vs. hearsay witness).
Acknowledgements

This work has been supported by several Spanish R&D projects: Ciudad2020: Hacia un nuevo modelo de ciudad inteligente sostenible (INNPRONTA IPT), MA2VICMR: Improving the access, analysis and visibility of the multilingual and multimedia information in web for the Region of Madrid (S2009/TIC-1542) and MULTIMEDICA: Multilingual Information Extraction in Health domain and application to scientific and informative documents (TIN C03-01).

References

Castellanos, A., J. Cigarrán, y A. García-Serrano. 2012. Generación de un corpus de usuarios basado en divergencias del lenguaje. II Congreso Español de Recuperación de Información. Valencia, June 2012.

Díaz Esteban, A., I. Alegría, y J. Villena Román (eds.). 2013. Actas del XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural. IV Congreso Español de Informática, September 2013, Madrid, Spain.

Hall, M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

Padró, L. and E. Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey.

Vilares, D., M.A. Alonso, y C. Gómez-Rodríguez. 2013. Una aproximación supervisada para la minería de opiniones sobre tuits en español en base a conocimiento lingüístico. Revista de Procesamiento del Lenguaje Natural, 51.

Villena-Román, J., S. Collada-Pérez, S. Lana-Serrano, and J.C. González-Cristóbal. 2011. Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization. In Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference (FLAIRS-11), May 18-20, 2011, Palm Beach, Florida, USA. AAAI Press.

Villena-Román, J., S. Lana-Serrano, E. Martínez-Cámara, and J.C. González-Cristóbal. 2013. TASS - Workshop on Sentiment Analysis at SEPLN. Revista de Procesamiento del Lenguaje Natural, 50, pp. 37-44.

Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models
Normalización léxica de tweets en español con componentes basados en reglas y modelos de lenguaje

Pablo Ruiz, Montse Cuadros and Thierry Etchegoyhen
Vicomtech-IK4
Mikeletegi Pasealekua 57, Parque Tecnológico de Gipuzkoa, Donostia/San Sebastián

Abstract: This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system is an improvement on the tool we submitted to the Tweet-Norm 2013 shared task, and its results on the task's test corpus are above average. Additionally, we provide a study of the impact on tweet normalization of the different components of the system: rule-based, edit-distance-based and statistical.
Keywords: Spanish microtext, lexical normalization, Twitter, edit distance, language model

Resumen: Este artículo presenta un sistema para la normalización de tweets en español, que usa reglas de preproceso, un modelo de distancias de edición adecuado al dominio y modelos de lenguaje para seleccionar candidatos de corrección según el contexto. Se trata de un sistema mejorado basado en el que presentamos en la tarea compartida Tweet-Norm 2013. El sistema obtiene resultados superiores a la media en el corpus de test de la tarea. Presentamos además un estudio del impacto en la normalización de los diferentes componentes del sistema: basados en reglas, en distancia de edición, y estadísticos.
Palabras clave: microtexto, español, castellano, normalización léxica, Twitter, distancia de edición, modelo de lenguaje

1 Introduction

Studies on the lexical normalization of Spanish microtext are scarce, e.g. Armenta et al. 2003, which predates Twitter and focuses on SMS.
Newer studies are (Pinto et al., 2012) and (Oliva et al., 2013), which also focus on SMS. Other recent studies are (Mosquera et al., 2012), which discusses the normalization of Spanish user-generated content in general, and (Gómez Hidalgo et al., 2013), which presents a detailed microtext tokenization method that can be employed for normalization. A larger body of literature exists for English microtext normalization (see Eisenstein, 2013 for a review). Some approaches rely on large amounts of labelled training data, e.g. (Beaufort et al., 2010) and (Kaufmann and Kalita, 2010), which examine SMS normalization. However, such resources are not available for Spanish. An approach that performs normalization of English tweets without the need for annotated data is (Han and Baldwin, 2011).

As an initiative to explore the application of different microtext normalization approaches, and to help overcome the lack of resources and tools for this task in Spanish, SEPLN 2013 hosted the Tweet-Norm Workshop (Alegría et al. 2013a). The system for Spanish tweet normalization presented in this study comprises data resources to model the domain, as well as analysis modules. It is an improvement on the tool we submitted (Ruiz et al., 2013) to the Tweet-Norm 2013 shared task. The paper is organized as follows: the system's architecture and components are presented in Section 2, the resources employed in Section 3, and settings and evaluation of results in Section 4. Conclusions and future work are discussed in Section 5.

Figure 1: System Architecture

2 Architecture and components

The system's architecture and components are shown in Figure 1 and explained in the following.

2.1 Rule-Based preprocessing

The preprocessing module was rule-based, relying on 110 hand-crafted mappings between patterns that match out-of-vocabulary (OOV) items and a correction for the expressions matched by the patterns. The mappings were implemented as case-insensitive regular expressions. The first set of mappings (46 rules) was used to identify abbreviations, and expand them if needed. A second set was used to resegment tokens commonly written together in microtext (21 rules). The final set of mappings (43 rules) detected emoticons and delengthened OOV items with repeated characters, besides mapping OOVs to onomatopoeias from the DRAE (the Spanish Academy dictionary). Repeated letters were reduced to a single letter, unless a word with a repeated letter was found in Aspell's Spanish inflected-form dictionary (v1.11.3, obtained with "aspell -l es dump master | aspell -l es expand"). E.g. vinoo was preprocessed to vino, but creeeen was reduced to creen. These regex-based mappings were based on the most common errors in a corpus of 1 million tweets crawled by ourselves and spellchecked with Hunspell (v1.3.2). Microtext expressions such as RT (retweet) or HT (hat tip) were considered in-vocabulary.

2.2 Correction-candidate generation

The correction candidates generated were validated against a dictionary for in-vocabulary (IV) items, and against entity lists.

2.2.1 Dictionary candidates

The base form (Base_ED) to generate candidates from was either the original OOV or the preprocessed form of the OOV. Prior to candidate generation, Base_ED was lowercased if all of its characters were in uppercase and it had a length of more than three characters. Candidates were generated for Base_ED using two methods: minimum edit distance and regular expressions.
With both methods, the candidates that were not found in Aspell's dictionary were rejected and did not proceed to further steps in the normalization workflow.
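The generate-and-validate flow just described can be sketched as follows. The lexicon and the unweighted one-edit generator below are toy stand-ins for Aspell's dictionary and the system's weighted edit model; names are ours, not the system's.

```python
import string

# Toy alphabet and lexicon; the real system validates against
# Aspell's full Spanish inflected-form dictionary.
ALPHABET = string.ascii_lowercase + "áéíóúñü"
LEXICON = {"quiero", "vino", "creen", "parte"}

def one_edit_candidates(base):
    """All strings one insertion, deletion or substitution away from base."""
    splits = [(base[:i], base[i:]) for i in range(len(base) + 1)]
    cands = {l + r[1:] for l, r in splits if r}                         # deletions
    cands |= {l + c + r for l, r in splits for c in ALPHABET}           # insertions
    cands |= {l + c + r[1:] for l, r in splits if r for c in ALPHABET}  # substitutions
    cands.discard(base)
    return cands

def iv_candidates(base, lexicon=LEXICON):
    """Reject every candidate not found in the IV dictionary."""
    return sorted(c for c in one_edit_candidates(base) if c in lexicon)

print(iv_candidates("qiero"))  # ['quiero']
```

Generating edits blindly and filtering against the dictionary keeps the generator simple; the weighted model described next then ranks the survivors.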

Using minimum edit distance (Damerau, 1964), up to two case-insensitive character edits (insertions, deletions or substitutions) were performed on the edit-base form Base_ED. The cost of each edit operation was not uniform: edits that result in correcting a common error were given a lesser cost than edits that correct uncommon errors. This method is context-insensitive: the cost of an edit operation did not take into account the characters adjacent to those undergoing the edit, or the position in the word of the characters being edited (word-initial, word-final, etc.). However, context sensitivity is useful in candidate generation and candidate scoring, since the frequency of certain errors depends on context; e.g., d-deletion is more frequent in participle endings -ado, -ido than elsewhere. To add context sensitivity at character level to the model, we generated candidates via regexes that repair common errors. A custom distance-scoring scheme was created for these regex-based candidates. If both the edit-distance and the regex-based method returned the same candidate, and the distance scores differed, the smaller score was chosen for the candidate.

2.2.2 Entity Candidates

For each OOV, a caps-initial variant and a variant with all characters in uppercase were generated, and looked up in entity lists. The OOV itself was also looked up. Matches were stored as entity candidates.

2.3 Candidate selection

The goal of candidate selection is to choose a single correction for each OOV, among the set containing the candidates created in the preprocessing and candidate-generation steps, as well as the original form of the OOV itself. The original OOV is one of the forms to consider: it is part of the normalization workflow to decide whether to keep the unmodified OOV as the normalized form, or to propose an edited variant.
The output of the candidate selection method is a single candidate, C_Nopos, which stands for "final candidate pending postprocessing". The terminology used in the description of the algorithm (below) is the following:

Trusted Candidates: candidates from the Abbreviations or Resegmentation mappings in the preprocessing step.

Untrusted Candidates: candidates obtained with the methods in a through c below.
a. DelenCand: obtained in preprocessing with Delengthening rules.
b. DistCands: candidates, along with their distance to their Base_ED form, generated with either context-sensitive or context-insensitive character edits (see section 2.2.1).
c. EntCand: a candidate from entity-detection heuristics (section 2.2.2).

LMCands: when more than one untrusted candidate exists for an OOV, LMCands is the subset of the OOV's candidates which is ultimately assessed against the language model, in order to choose an optimal candidate for the OOV.

Accented Variant: for this algorithm, a string S1 is an accented variant of a string S2 if they match in a case- and accent-insensitive manner: mía is an accented variant of Mia, as is mañana of Manana.

In essence, the algorithm first selects a subset of the correction candidates for each OOV in the tweet. Then, if more than one candidate exists for some OOV in the tweet, a language model (LM) scores candidate combinations at tweet level, assessing best fit. The algorithm is presented below, and explanations and examples follow it. The operations in A through C below take place for each OOV in the tweet.

A. Initial Filtering

1. Filter the DistCands set in two steps:
1.1. Candidates at a distance higher than 1.5 (configurable threshold) from their Base_ED are filtered out.
1.2. Among the remaining candidates in DistCands, all of the candidates at the smallest distance present in the set are retained. E.g. if candidates at distance 0.5 and 0.8 exist, candidates at distance 0.8 are filtered out.

B. Trusted Candidates

2. If a correction candidate was obtained in preprocessing via Abbreviation mappings (see Section 2.1), it is selected as C_Nopos (the final candidate pending postprocessing).
3. If a correction candidate was obtained in preprocessing via Resegmentation mappings, it is selected as C_Nopos.

C. Untrusted Candidates

4. If a correction candidate of type EntCand exists, add it to the LMCands set.
4.1. If, among the candidates in DistCands, accented variants exist for an EntCand candidate, add them to LMCands.
5. If a correction candidate was obtained in preprocessing via Delengthening regexes, and the candidate is IV, add it to the LMCands set.
5.1. If, among the candidates in DistCands, accented variants exist for the Delengthening candidate, add them to the LMCands set.
6. If no candidate has been selected so far (i.e. no trusted candidates exist, and the LMCands set is empty), add the content of the DistCands set (already filtered in step 1) to LMCands.
7. If LMCands is empty, select the original OOV form as C_Nopos.

After steps 1 to 7 have applied for each OOV in the tweet, candidates are assessed at tweet level.

D. Tweet-Level Scoring

Once each OOV in the tweet has been resolved into a trusted candidate, an LMCands set, or the original OOV form as a default, the following procedure applies, at tweet level.

1. If each OOV in the tweet has one candidate only, that candidate is chosen and moves to postprocessing.
2. Otherwise, tweet alternatives are created from each combination of candidates in the different OOVs' LMCands sets, and scored against the language model. The combination of candidates which maximizes the log probability of the whole tweet alternative containing them is chosen, and moves to postprocessing.

In the initial filtering stage, step 1.1 eliminates candidates whose edit distance from their Base_ED is too high for them to be likely corrections.
Step 1.2 is similar in the sense that it narrows down the candidates to another k-best subset in terms of distance. Accuracy on both development and test sets improved significantly with both steps included in the workflow.

Trusted candidates result from matches against mappings and rules created by a human domain expert, for unambiguous cases. They can thus be reliably promoted to C_Nopos status. Unlike the previous case, untrusted candidates represent ambiguous cases, and forms that have been generated through automatic means. Better accuracy is obtained when statistical methods and string comparison metrics are employed to assess their validity.

Entity candidates (EntCand) are added to the LMCands set when available. Additionally, since accent omission is a very frequent error, we also consider accented variants of EntCand. E.g. for EntCand Rio, accented variant río is considered.

IV candidates output by Delengthening regexes may also require disambiguation. For instance, the correct variant of the form si, obtained from delengthening OOV siii, could be si or sí, depending on context. Thus, accented variants for such IV items are added to LMCands. For EntCand and Delengthening candidates, it is the language model's task to decide between accented and unaccented variants. For DistCand candidates, the language model disambiguates among the k-best candidates in terms of distance score.

2.4 Postprocessing (Recasing)

Once the above processes have applied, the case of the selected candidate may still be incorrect; this can happen when the case of the original OOV was incorrect and was not corrected earlier in the workflow (e.g. a tweet-initial OOV starting with lowercase). A candidate may also have undergone decasing via regex application or candidate-set generation, which were deployed in a case-insensitive manner. For these reasons, a postprocessing step was performed, whereby the selected candidate was uppercased if one of the following four conditions applied.

1. If it was in tweet-initial position.
2. If it was the second token in the tweet, and the first token was a mention or hashtag (#topic).
3. If the previous token was a sentence delimiter.
4. In all other positions, the first character of the selected candidate was uppercased if the original OOV's first character was in uppercase.

3 Resources

In-vocabulary (IV) items were determined using the Aspell dictionary (v1.11.3). Entity lists were obtained from the JRC Names database. A list of named entities manually annotated in the Spanish subset of the SAVAS corpus (Del Pozo et al., to appear) was also used. The Spanish subset of SAVAS consists of 200 hours of Spanish news broadcasts. It contains entities from current events, often discussed on Twitter. Normalization does not require entity classification or linking, but merely identifying whether a token belongs to an entity or not. Accordingly, in our entity lists multiword entities were split into their tokens. Tokens for which a lowercase variant exists in Aspell's dictionary were filtered out.

For measuring candidate distance, a cost matrix for character edits was created. Additionally, a custom distance-scoring scheme was devised for candidates obtained with regular expressions at the candidate-generation stage (see Section 2.2.1).

For the edit-cost matrix, costs were domain-specific, estimated by surveying the frequency of character substitutions in Spanish tweets. For instance, editing k as q (as in one of the editing steps needed to correct the frequent error kiero as quiero) was assigned a lesser cost than uncommon edits. Costs were also inspired by (Ramírez and López, 2006), who found that 51.5% of spelling errors in Spanish were accent omissions. Accordingly, a cost model was created where replacing a non-accented character with its accented variant cost less than other substitutions. Table 1 provides example costs.
Using the table, editing alli to allí costs 0.5; editing kiero to quiero (one k → q substitution plus one u insertion, both from the "k, null → q, u" row) costs 1.5. (Notes: the sentence delimiters considered in the recasing step of Section 2.4 were . ! ? ". The JRC Names database mentioned in this section is available at optima.jrc.it/data/entities.gzip.)

Error               Correction          Cost (each)
a, e, i, o, u, n    á, é, í, ó, ú, ñ    0.5
k, null             q, u                0.75
p, a, z             m, u, k             1

Table 1: Edit Costs

Besides the edit-cost matrix, a set of regular expressions was created, to model context-sensitive corrections (for errors that are very frequent in specific contexts only, like d-dropping in participles), and for corrections involving one-to-many character edits. A custom scoring scheme was created to assess distance for these corrections. The goal of the custom scoring was for regex-based corrections to receive smaller costs than edit distance would assign to them. For instance, consider correcting parxe as parche. Using regexes, this was modeled as a single x → ch one-to-many character edit, with a cost of 0.5, rather than two one-to-one character edits x → c and ø → h, which would lead to a higher correction cost. Thus, editing parxe into parche (which repairs a very common error in the domain) costs 0.5, less than editing parxe into a less likely correction like parte, with a cost of 1. In the way just described, the custom scoring scheme was designed to favour corrections that are likely in the domain. Table 2 shows some of the corrections modeled via regexes, and their costs. Note that corrections for some spelling-pronunciations (i.e. correcting p as pe, or k as ca) were also modeled with regexes.

Error               Correction          Cost (each)
ki, x, wa, ni       qui, ch, gua, ñ     0.5
ao$                 ado                 0.5
p, t, k             pe, te, ca          0.5

Table 2: Context-Sensitive and One-to-Many Character Edit Costs

In terms of language models, we created a 5-gram case-sensitive language model with KenLM (kheafield.com/code/kenlm/; Heafield, 2011), using an unk token.
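A minimal sketch of weighted edit distance with substitution costs in the spirit of Table 1 (accent repairs at 0.5). The cost dictionary is a small illustrative subset, and the real system also weights insertions, deletions and one-to-many regex edits, which are not reproduced here.

```python
def weighted_edit_distance(src, tgt, sub_costs, indel=1.0):
    """Weighted Levenshtein distance: substitutions take their cost
    from a sparse matrix (unknown pairs cost 1.0), while insertions
    and deletions cost `indel`."""
    n, m = len(src), len(tgt)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = src[i - 1], tgt[j - 1]
            sub = 0.0 if a == b else sub_costs.get((a, b), 1.0)
            d[i][j] = min(d[i - 1][j] + indel,    # delete a
                          d[i][j - 1] + indel,    # insert b
                          d[i - 1][j - 1] + sub)  # substitute a -> b
    return d[n][m]

# Accent repairs at 0.5, as in Table 1 (illustrative subset).
COSTS = {("a", "á"): 0.5, ("e", "é"): 0.5, ("i", "í"): 0.5,
         ("o", "ó"): 0.5, ("u", "ú"): 0.5, ("n", "ñ"): 0.5,
         ("k", "q"): 0.75}

print(weighted_edit_distance("alli", "allí", COSTS))  # 0.5
```

With a uniform matrix the accent repair would cost 1.0 like any other substitution; the domain-adapted costs are what let likely corrections such as allí outrank unlikely ones at the filtering threshold.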
The model was based on the OpenSubs Spanish corpus, available at the OPUS repository (Tiedemann, 2009), pruned to 31 million subtitles, merged with 1 million tweets containing IV tokens only, collected by

ourselves according to the procedure described below.

The tweets in the corpus were prepared as follows: tweets with language value "es" and European time zones were collected in the spring. Only tweets for which Hunspell (v1.3.2) detected no errors were accepted. In order to decrease false positives, Hunspell dictionaries were enriched with entity lists. Tweet tokenization largely treated emoticons, URLs and repeated punctuation as single tokens. For tweets with at least 70% token overlap with other tweets, only one exemplar was accepted.

The choice to use subtitles was motivated by our experiments for the Tweet-Norm workshop, which showed that language models trained on subtitles achieved accuracy similar to language models trained on tweets containing IV-only items.

4 Results and evaluation

Accuracy was 71.15% on the Tweet-Norm shared-task test corpus (564 tweets and 662 annotated OOVs). For reference, average accuracy on the task, based on the scores obtained by the 13 participating systems, was 56.16%, the range being 33.5% to 78.1%. The improvement that each module achieves over the baseline is provided in Table 3, in terms of accuracy and increase in percentage points (ptp). The baseline (19.78%) is the score attained when accepting all OOV forms as correct.

The results in Table 3 support the conclusion that both the rule-based preprocessing and the edit-distance based candidate generation were useful. Applied in isolation, rule-based preprocessing achieved gains of 17.8 ptp over the baseline, and edit distance in isolation also obtained an improvement over the baseline. The results also support the conclusion that the candidate filtering procedure and the language model managed to disambiguate among candidates successfully, achieving further gains over the baseline both without and with postprocessing.
As regards the edit-distance component, a relevant result is that using a cost model that reflects common errors in the domain and integrates context-sensitive edits obtains better results than using a cost model where all edit costs are uniform. For instance, as Table 3 shows, the edit-distance module alone, with costs adapted to the domain, achieves an improvement of 13.6 ptp over the baseline, and the gain increases when context-sensitive corrections are added. (Regarding the edit-distance-only results in Table 3: several candidates may exist at the same distance. Distance being the only factor in these results, a random choice between candidates at the same distance was avoided by ranking candidates with their distance score weighted at 90% and their language-model unigram logprob weighted at 10%, for all distance models.)

Modules                                                    ACCU (%)   GAINS (ptp)
Baseline                                                   19.78
Rule-Based Preprocessing Only
  Abbreviations + Resegmentations
  Abbreviations + Resegmentations + Delengthening
Edit Distance Only
  Generic Levenshtein                                                  9.67
  Domain-Adapted Levenshtein (Context-Insensitive)                     13.6
  Domain-Adapted Levenshtein + Context-Sensitive Distance
Entities Only
All + Language Model
  No Postprocessing (recasing)
  With Postprocessing (recasing)                           71.15

Table 3: Normalization accuracy for each module in isolation and after LM application

However, if we use a generic distance model where all edits have a cost of 1, the improvement over the baseline is 9.67 ptp only: 5.14 ptp below the cost model adapted to the domain. The finding that context-sensitive corrections improve accuracy agrees with results by Hulden and Francom (2013).

Regarding candidate selection, one of the difficulties in applying language models to correct microtext is the abundance of other OOVs in the context of the OOV undergoing normalization in each case. Our previous normalization system compared each OOV's local context with the language model. In cases where other OOVs were part of a given

OOV's local context, the candidates' scores were limited to unigram probabilities and backoff, which could decrease accuracy. The current language model implementation, which considers all possible candidate combinations in order to compute the best fit against the LM of the tweet's entire word sequence, is more successful at normalizing cases of adjacent OOVs than an LM workflow based on local context, as the following example illustrates.

Consider a tweet (from the test set) containing the sequence nainonainonahh me atozigah con tuh comentarioh los besoohh virtualeh. Local-context lookup in the LM corrects tuh comentarioh as tu comentarios (without number agreement). This is expectable, since p(tu) > p(tus) in the model (−2.66 vs. −3.45), and p(comentarios) > p(comentario) (−5.77 vs. −6.02). Since OOVs tuh and comentarioh are surrounded by other OOVs, a local-context lookup will not benefit from contextual information, and will be restricted to a unigram probability. By contrast, the current LM workflow, which considers all possible candidate combinations and assesses the complete tweet against the LM, successfully normalizes the sequence as tus comentarios, since it is able to find the higher probability for the sequence respecting agreement: −6.47, versus a lower log probability for the sequence with broken agreement. The LM also successfully disambiguated accented variants, such as si vs. sí.

Another salient result is that the simple postprocessing module, which deploys four recasing rules to capitalize the final candidate depending on sentence position and on the original OOV's case, yields an improvement of 7.26 ptp compared to results without postprocessing. This agrees with findings by Alegría et al. (2013b), whose recasing rules are a subset of ours, and who report notable gains from applying recasing rules.
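The tweet-level search over candidate combinations can be sketched as follows. The toy scorer stands in for the actual KenLM model (its −6.47 value for the agreeing sequence comes from the example above; the other score is arbitrary), and function names are ours.

```python
from itertools import product

def best_combination(oov_cands, lm_logprob):
    """Build a tweet alternative for every combination of per-OOV
    candidates and keep the one whose whole word sequence obtains
    the highest language-model log probability."""
    keys = list(oov_cands)
    best, best_score = None, float("-inf")
    for combo in product(*(oov_cands[k] for k in keys)):
        score = lm_logprob(list(combo))
        if score > best_score:
            best, best_score = dict(zip(keys, combo)), score
    return best

# Toy scorer standing in for KenLM: it rewards the determiner-noun
# sequence that respects number agreement.
def toy_lm(tokens):
    return -6.47 if tokens == ["tus", "comentarios"] else -8.0

cands = {"tuh": ["tu", "tus"], "comentarioh": ["comentario", "comentarios"]}
print(best_combination(cands, toy_lm))
# {'tuh': 'tus', 'comentarioh': 'comentarios'}
```

Because the product of candidate sets is scored as one sequence, adjacent OOVs constrain each other, which is exactly what the local-context lookup could not provide.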
Regarding the small gain obtained when activating the entity heuristics, note that about half the entity OOVs in the corpus are already correct in the baseline. For the remaining entities, precision was acceptable: 75% in both sets. However, recall was weak: 41% in the development set and 52% in the test set. For these reasons, entity detection yielded a smaller gain over the baseline than other modules.

Finally, the system's upper bound (the proportion of correct candidates generated, even if they were not selected as final) was 84.54%, similar to the upper bound of 85.47% reported by Ageno et al. (2013) for the same corpus. Some of the OOVs for which no correct proposal was generated were entities. In some other cases, preprocessing rules that would map the original OOV to a viable candidate were missing.

5 Conclusions and future work

We presented a system for the normalization of Spanish tweets. The system uses rules to expand abbreviations, resegment tokens and delengthen OOVs into forms closer to IV tokens. Candidates are generated based on weighted edit distance. The edit-cost model was adapted to the domain: costs were estimated taking into account common errors in tweets. Besides context-insensitive edits, distance scoring included some context-sensitive rules, reflecting the likelihood that an edit would lead to a correction in a given context. The domain-adapted cost model was shown to be more accurate than a generic-domain unweighted edit-distance model. Candidates were also proposed based on entity lists.

To disambiguate between candidates, the entire word sequence of tweet alternatives containing all possible correction-candidate combinations (among k-best candidates) was checked for best fit against a language model. This global, tweet-level LM lookup method was more successful at normalizing sequences of adjacent OOVs than a lookup method that exploits an OOV's local context only.
Regarding future work, the current resegmentation rules were hand-crafted, and a statistical workflow (e.g. Alegría et al., 2013b) would be an improvement. Also, our entity-detection heuristics should be improved for recall. In terms of candidate selection, we used the language model to disambiguate candidates at the smallest distance available in the candidate set, as better accuracy was obtained that way. Extending the scope of LM disambiguation beyond k-best candidates, while also improving accuracy, is a topic for future research. Finally, only a small proportion of the current LM training corpus consisted of tweets. It would be relevant to verify whether results improve with an LM trained on a large in-vocabulary corpus of tweets, with the language model reflecting domain-specific textual characteristics more closely.

6 References

Ageno, A., P. R. Comas, L. Padró, and J. Turmo. 2013. The TALP-UPC approach to Tweet-Norm 2013. Proceedings of the Tweet Normalization Workshop at SEPLN 2013.

Alegría, I., N. Aranberri, V. Fresno, P. Gamallo, L. Padró, I. San Vicente, J. Turmo, and A. Zubiaga. 2013a. Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español. Proceedings of the Tweet Normalization Workshop at SEPLN 2013.

Alegría, I., I. Etxeberria, and G. Labaka. 2013b. Una cascada de transductores simples para normalizar tweets. Proceedings of the Tweet Normalization Workshop at SEPLN 2013.

Armenta, A., G. Escalada, J. M. Garrido, and M. A. Rodríguez. 2003. Desarrollo de un corrector ortográfico para aplicaciones de conversión texto-voz. Procesamiento del Lenguaje Natural, 31.

Beaufort, R., S. Roekhaut, L. A. Cougnon, and C. Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

Damerau, F. 1964. A technique for computer correction of spelling errors. Communications of the ACM, 7(3).

Del Pozo, A., C. Aliprandi, A. Álvarez, C. Mendes, J. P. Neto, S. Paulo, N. Piccinini, and M. Rafaelli. To appear. SAVAS: Collecting, Annotating and Sharing Audiovisual Language Resources for Automatic Subtitling. To appear in Proceedings of LREC.

Eisenstein, J. 2013. What to do about bad language on the internet. Proceedings of NAACL-HLT 2013.

Han, B. and T. Baldwin. 2011. Lexical normalisation of short text messages: makn sens a #twitter. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA.

Heafield, K. 2011. KenLM: Faster and Smaller Language Model Queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, UK.

Hulden, M. and J. Francom. 2013. Weighted and unweighted transducers for tweet normalization.
Proceedings of the Tweet Normalization Workshop at SEPLN 2013.

Kaufmann, J. and J. Kalita. 2010. Syntactic normalization of twitter messages. International Conference on Natural Language Processing, Kharagpur, India.

Gómez Hidalgo, J. M., A. A. Caurcel Díaz, and Y. Iñiguez del Rio. 2013. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamática, 5(1):31-39.

Mosquera, A., E. Lloret, and P. Moreda. 2012. Towards facilitating the accessibility of web 2.0 texts through text normalization. Proceedings of the LREC Workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA), pp. 9-14, Istanbul, Turkey.

Oliva, J., J. I. Serrano, M. D. Del Castillo, and A. Iglesias. 2013. A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1).

Pinto, D., D. Vilariño Ayala, Y. Alemán, H. Gómez, N. Loya, and H. Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In P. Sojka, A. Horák, I. Kopeček, and K. Pala (eds.), Text, Speech and Dialogue, LNCS Vol. 7499. Springer.

Ramírez, F. and E. López. 2006. Spelling Error Patterns in Spanish for Word Processing Applications. Proceedings of LREC 2006.

Ruiz, P., M. Cuadros, and T. Etchegoyhen. 2013. Lexical Normalization of Spanish Tweets with Preprocessing Rules, Domain-Specific Edit Distances, and Language Models. Proceedings of the Tweet Normalization Workshop at SEPLN 2013.

Tiedemann, J. 2009. News from OPUS. A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K. Bontcheva (eds.), Recent Advances in Natural Language Processing, Vol. V. John Benjamins, Amsterdam.

From constituents to syntax-oriented dependencies

De constituyentes a dependencias de base sintáctica

Benjamin Kolz, Toni Badia, Roser Saurí
Universitat Pompeu Fabra, c. Roc Boronat 138, Barcelona
{benjamin.kolz, toni.badia,

Resumen: El presente artículo describe el proceso automático de construir un corpus de dependencias basado en la estructura de constituyentes de Ancora. El corpus Ancora ya tiene una capa de información de dependencias sintácticas, pero la nueva anotación aplica criterios puramente sintácticos y ofrece de este modo un nuevo recurso a la comunidad investigadora en el campo del procesamiento del lenguaje. El artículo detalla el proceso de reanotación del corpus, los criterios lingüísticos empleados y los resultados que se han obtenido.

Palabras clave: análisis de dependencias, etiquetario de funciones sintácticas, anotación de corpus, conversión de constituyentes a dependencias

Abstract: This paper describes the automatic process of building a dependency-annotated corpus based on Ancora constituent structures. The Ancora corpus already has a dependency structure information layer, but the new annotated data applies a purely syntactic orientation and thereby offers a new resource to the linguistic research community. The paper details the process of reannotating the corpus, the linguistic criteria used and the results obtained.

Keywords: dependency parsing, syntactic function tagset, corpus annotation, conversion from constituents to dependencies

1 Introduction

Syntax information, which is crucial in many NLP tools, can be represented by means of constituent structures or dependency relations.
While each of these formalisms has its advantages and disadvantages, and there is an ongoing debate on their preferred uses, it is worth noting that dependency-based representations can also vary depending on the linguistic criteria they are based upon (Kübler, McDonald and Nivre, 2009:5-6): from purely syntactically oriented to semantically motivated. Most current approaches to dependency functions within NLP embrace an (at least partially) semantic orientation, most notably the Stanford parser (De Marneffe and Manning, 2012) and, in the case of Spanish, the Ancora corpus (Taulé, Martí and Recasens, 2008) and any parser trained on it. By contrast, the current article focuses on the automatic creation of a corpus of dependency relations for Spanish based on purely syntactic criteria.

The paper is structured as follows. The next section motivates this project, section 3 reviews related work, section 4 presents the corpus on which the experiment was run, section 5 discusses the linguistic criteria applied, and the automatic annotation process is detailed in section 6. Finally, results are presented in section 7. The article ends with some final considerations and a look into future work (section 8).

2 Motivation

Dependency relations can be grounded on different criteria: from purely syntactic to semantically oriented. Take for example the noun phrase el resto de los chicos ("the rest of the boys"). A syntactic view will consider resto as its head, whereas a semantic approach will take chicos as the main element. The same tension between syntactic and semantic heads can be found in other constructions throughout the language, e.g., verbal periphrases, modification relations, etc.

Choosing a specific dependency analysis depends on the future use of the data. For instance, semantic-oriented trees may be preferable for certain information extraction tasks. By contrast, a purely syntactic analysis offers a neutral ground for any task. However, in many cases there are no corpus resources compliant with the specific approach that is needed. Then, one can either build the NLP tool based on the available data, or create a neutral, syntax-based resource so that future, more semantics-oriented and task-based dependency annotations can be generated. We chose this latter path, as in our opinion the linguistic criteria in the input to an NLP tool should be adequate to the tool, and not the other way around.

For our research goals we worked with the Ancora corpus (Taulé, Martí and Recasens, 2008), which is annotated with both constituent and dependency structures. However, dependency relations in Ancora are semantics-oriented, and we wanted a purely syntax-based annotation. Thus, we decided to build a further layer of dependency relations based on this other approach. Considering the large size of Ancora, we proceeded by automatic means from the layer of constituent structure. The process consists of two individual tasks: dependency relation annotation and, afterwards, syntactic function labeling.

3 Related Work

The conversion from constituent to dependency structures is not new. Magerman (1994) made use of a head-driven approach, which is still used and enhanced in newer works such as Collins (1999), Yamada and Matsumoto (2003) and Johansson and Nugues (2007). The approach has shown good results, but there is still ongoing research. As can be seen in such previous works, the resulting dependency tree structure depends highly on the focus of the annotation, which can apply either a syntactic or a semantic analysis.
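A head-driven conversion in the style Magerman (1994) introduced can be sketched as follows. The head table, the tuple encoding of trees and the example sentence are toy illustrations under our own assumptions, not the rules or data structures used for Ancora.

```python
# A hand-written head table picks the head child of each constituent;
# every non-head word then attaches to the head word of its parent.
HEAD_TABLE = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def head_word(tree):
    """Percolate the head word up from the head child; leaves are (POS, word)."""
    label, children = tree
    if isinstance(children, str):  # leaf
        return children
    wanted = HEAD_TABLE.get(label)
    head_child = next((c for c in children if c[0] == wanted), children[0])
    return head_word(head_child)

def dependencies(tree, deps=None):
    """Collect (head_word, dependent_word) arcs from a bracketed tree."""
    if deps is None:
        deps = []
    label, children = tree
    if isinstance(children, str):
        return deps
    h = head_word(tree)
    for child in children:
        cw = head_word(child)
        if cw != h:
            deps.append((h, cw))
        dependencies(child, deps)
    return deps

tree = ("S", [("NP", [("N", "ella")]),
              ("VP", [("V", "vive"),
                      ("PP", [("P", "en"), ("N", "York")])])])
print(dependencies(tree))
# [('vive', 'ella'), ('vive', 'en'), ('en', 'York')]
```

The table entry chosen for each label is precisely where syntactic or semantic criteria enter the conversion: changing a single row (e.g. which child heads a coordination or a periphrasis) changes the whole resulting tree.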
Johansson and Nugues (2007) mention the possibility of allowing multiple-headed dependency structures to overcome this dichotomy.

In the particular case of the Ancora corpus, it is worth noting that its dependency relations annotation was carried out automatically by a conversion from constituents (Civit, Martí and Bufí, 2006). Only a head table and a function table were written manually. In many constructions, implicit semantic criteria are assumed in the linguistic decisions informing the conversion. Along similar lines, Mille et al. (2009) present a reannotation of Ancora dependencies, already heading towards a more syntax-oriented approach. Their reannotation has been carried out semiautomatically and currently covers only a section of Ancora (100,892 out of 517,269 tokens). Their function tagset consists of 69 tags and is thus quite fine-grained for an automatic annotation. Given this, and the fact that the resulting annotation is not yet available for the whole corpus, we decided to create our own tagset and proceed with an automatic annotation of the whole corpus.

4 Corpus

For our experiments, we used the Spanish part of Ancora (Taulé, Martí and Recasens, 2008), which contains 17,376 sentences split over 1,636 files, gathering a total count of 517,269 tokens. Ancora is annotated at different linguistic levels, including constituent structures and dependency relations. All sentences are tokenized, and tokens have information on their lemma and part of speech. Other annotation layers include:

- syntactic constituents and functions
- argument structure and thematic roles
- verb semantic classes
- denotative type of deverbal nouns
- WordNet synsets for nouns
- named entities
- coreference relations

5 Linguistic Criteria

This section details the linguistic criteria we adopted for grounding the dependency relations in our automatic annotation. First we focus on the structure of the dependency relations and then on their function labeling.
5.1 Dependency relations

The goal of this annotation is to obtain purely syntax-oriented dependency trees. Thus, our linguistic decisions are compliant with that goal.

Periphrastic verbs. In our annotation, auxiliary and modal verbs are the head of the structure, as shown below. In this and the following examples, the upper graph shows the Ancora treatment and the lower one our decision.[1]

[1] The head of the arrow leads to the dependent.

From constituents to syntax-oriented dependencies

(1) debía enviar los rollos (he had to send the reels)
(2) ella ha estado viendo la exposición (she has been seeing the exhibition)

Ancora applies here an approach based on semantic criteria, so that the head is the main verb, while the conjugated auxiliary verb is a dependent of the former. As the auxiliary verb agrees with the subject, we wanted subjects to depend on the auxiliary or modal (as marked by the agreement relation) and other complements on the main verb.

(3) debía enviar los rollos (he had to send the reels)
(4) ella ha estado viendo la exposición (she has been seeing the exhibition)

Complex nominal phrases. The treatment of complex nominal phrases like el resto de los chicos ('the rest of the boys') illustrates the differences between a semantic and a syntactic approach.

(5) una docena de los participantes (a dozen of the participants)

Coordinations. A coordination structure contains at least two elements which are coordinated by one or more conjunctions. Head candidates are one of the coordinated items or one of the conjunctions. Ancora takes the first coordinated element as head, while we decided to identify the conjunction as head.

(6) Juan y María (Juan and María)

In the case of coordinations with paired conjunctions (e.g., ni... ni, 'neither... nor'), we treated the last conjunction as the head of both the conjuncts and any former conjunction or comma.

(7) Ni ministro ni excelencia. (Neither minister nor excellency.)

Our approach has the advantage that all coordinated elements depend on the same node and can be found at the same level within the dependency tree.

Subordinating conjunctions. The conjunction is the head of the subordinated clause, in full accordance with the surface syntactic structure. By contrast, Ancora identifies the verb of the subordinated clause as head and takes the conjunction as its dependent.

(8) Amo Boston, aunque ahora vivo en York. (I love Boston, although I now live in York.)
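The coordination criterion above (the last conjunction heads both the conjuncts and any earlier conjunction or comma) can be sketched as follows. The head-map representation and the function name are ours, purely for illustration, not part of the authors' implementation:

```python
def attach_coordination(tokens, conj_positions):
    """Build a head map (token index -> head index) for a coordination:
    the last conjunction governs the conjuncts and any former conjunction."""
    head = {}
    last_conj = conj_positions[-1]
    for i in range(len(tokens)):
        if i != last_conj:
            head[i] = last_conj
    return head

# Example (7): "Ni ministro ni excelencia."
tokens = ["Ni", "ministro", "ni", "excelencia"]
heads = attach_coordination(tokens, conj_positions=[0, 2])
# -> {0: 2, 1: 2, 3: 2}: every element depends on the last "ni"
```

Under this scheme all coordinated elements hang from the same node, which is the advantage mentioned in the text.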
Relative clauses. The verb of the relative clause is also its head, while the relative pronoun is its dependent. This case has been treated differently from other subordinating structures, given the double role of the relative pronoun (as connector and as argument of the main predicate in the subordinated clause).

(9) una mirada que traspasaba el techo (a view which penetrated the roof)

Our analysis here corresponds to the treatment in Ancora.

Comparative structures. The comparative element (e.g., más below) depends on the adjective (correcta) and at the same time is the head of the embedded phrase (que la otra).

(10) una decisión más correcta que la otra (a more correct decision than the other)

Punctuation. Commas and full stops are treated as dependents of the higher constituent head. Brackets, quotation marks, etc. are treated as dependents of the head within their constituent range.

(11) Amo Boston, aunque ahora vivo en York. (I love Boston, although I now live in York.)

5.2 Function Tagset

The syntactic function tagset has to fulfill two requirements: it has to be as informative as possible, and it must be of reasonable size in order to guarantee a successful automatic annotation. The tagset used in Ancora has around 50 tags, thus being of reasonable size. However, it has the problem of mixing dependency relation tags with part-of-speech and constituent structure tags. Some examples:

- Dependency function tags: suj (subject), cd (direct object), ci (indirect object).
- Constituent structure tags: sn (nominal phrase), s.a (adjectival phrase).
- Part-of-speech tags: v, n.

The Stanford tagset (de Marneffe and Manning, 2012), on the other hand, seems adequate for both requirements. Its size of 53 tags is reasonable for an automatic annotation, and the individual tags are a good choice to represent dependency relation information. In addition, the tags are structured hierarchically, thus allowing underspecified tags when required. In our proposal, we adapted Stanford's tagset for Spanish (e.g., reflec, reflexive) and enhanced it with some tags already available in Ancora (e.g., te, textual element) in order to increase its informativeness. Our tagset is presented in Table 1. It contains 42 function tags (including underspecified ones), which makes it fully adequate for automatic annotation (section 6.2). In the table, indentation shows the hierarchical structure of the tagset, conveying that general tags like obj or mod include more specific subclasses. In the annotation, the goal is obviously to be as specific as possible, as this leads to more informative data. Therefore the generic tags like dep, comp, obj, mod and prep are not expected to be in common use, but reserved for cases where a more specific tag cannot be applied.
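This hierarchical fallback behaviour can be illustrated with a small sketch. The parent links below encode only a few tags from Table 1, following the Stanford-style hierarchy described in the text; they are an illustrative subset, not the full tagset:

```python
# Illustrative subset of the tagset hierarchy (child -> parent).
PARENT = {
    "dobj": "obj", "iobj": "obj", "pobj": "obj",
    "obj": "comp", "comp": "arg", "arg": "dep",
    "amod": "mod", "advmod": "mod", "det": "mod",
    "mod": "dep",
}

def generalize(tag):
    """Back off one level; unknown tags fall back to the generic 'dep'."""
    return PARENT.get(tag, "dep")

def ancestors(tag):
    """Full back-off chain from a specific tag up to 'dep'."""
    chain = [tag]
    while chain[-1] != "dep":
        chain.append(generalize(chain[-1]))
    return chain

# A specific tag like dobj can be underspecified step by step:
# dobj -> obj -> comp -> arg -> dep
```

This is exactly what makes generic tags like obj or mod usable as last resorts: a specific tag can always be traded for its more general ancestor.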
Tag      Full name
root     root
dep      dependent
arg      argument
comp     complement
attr     attributive
cpred    predicative complement
obj      object
cobj     complementizer object
dobj     direct object
iobj     indirect object
oobj     oblique object
pobj     object of a preposition
vobj     object of verb
crobj    object of comparative
subj     subject
nsubj    nominal subject
csubj    clausal subject
coord    coordination
conj     conjunct
agent    agent
reflec   reflexive ('se')
te       textual element
mod      modifier
abbrev   abbreviation modifier
amod     adjectival modifier
appos    appositional modifier
advcl    adverbial clause modifier
det      determiner
infmod   infinitival modifier
partmod  participial modifier
advmod   adverbial modifier
neg      negation modifier
rcmod    relative clause modifier
nn       noun compound modifier
tmod     temporal modifier
num      numeric modifier
prep     prepositional modifier
prepv    prep. mod. of a verb
prepn    prep. mod. of a noun
prepa    prep. mod. of adjective
poss     possession modifier
punct    punctuation

Table 1: Dependency function tagset

6 Automatic Dependency Annotation

6.1 Process

Our system takes the constituent structure layer in Ancora as input and builds the syntax-oriented dependency trees supported by linguistic rules. The core of the process is identifying the head of each constituent, along the lines of Magerman (1994) and subsequent work. The dependent nodes can then be pointed to the identified head. One single main rule selects the head in all clearly headed constituents in

the corpus. However, a remarkable number of constituent structures in Ancora are not clearly headed, because they are flat structures or conflate several nodes into one (e.g., the verbal group formed by the main verb and its auxiliaries or modals). To tackle these cases, a set of nine finer-grained rules is added (two for flat constructions and seven for divergences in head selection). Once the dependency structures are obtained, the syntactic function of each head-dependent pair is determined. The function labeling process is informed by data from two sources: the part-of-speech of both nodes in each pair, and the argument-structure function tags that had been manually annotated in the Ancora constituent structure layer (subject, direct and indirect object, oblique and textual element). Based on those two elements, rules can be established to automatically annotate the syntactic functions between head and dependent node.

6.2 Algorithm

The algorithm we applied is shown in Figure 1.

 1 function DEPENDENCY_ANNOTATION(parsed_text):
 2     for sentence in constituents:
 3         read_constituents_tree(sentence)
 4     for constituent in constituents_tree:
 5         identify_head_of_constituent(constituent)
 6         # uses a preference list for possible candidates
 7     for terminal_node in constituents_tree:
 8         walk_constituents_tree(terminal_node)
 9         # bottom-up
10         # walks tree until not head anymore and
11         # connects there as dependent to head
12     for terminal_node in constituents_tree:
13         label_functions()

Figure 1: Algorithm

The procedure takes the parsed text as input (line 1), analyzes it sentence by sentence (line 2) and generates its dependency structures. In particular, the program reads the constituent tree of each sentence (line 3) and identifies the head of each constituent (line 5). The procedure then walks bottom-up from terminal nodes through the constituent structure and connects them to their head (line 8).
Finally, each relation between dependent and head is labeled according to the function tagset presented in Table 1 (line 13).

6.3 Issues

The conversion from constituent structures to dependency structures is highly dependent on the input that comes from the constituents. Thus, inconsistencies in the constituent annotation may lead to problems when applying the general procedure. Furthermore, we encountered three specific issues:

- the grouping of several lexical items into a single token (e.g., la_mayoría_de, 'the majority of'), referred to in Ancora as multiwords;
- the depth of annotation in constituent trees (e.g., debía haberlo resuelto, 'should have solved it', as a flat structure);
- the presence of empty tokens signaling subject ellipsis.

Flat structures. Flat structures posed a problem for identifying heads and their dependents, as they often contain several constituent heads: the head of the constituent and another head of what should have been a lower constituent, as underlined in (12).

(12) S=conj S grup.verb sa sn sp

In this example we would expect a deeper analysis grouping grup.verb sa sn sp together under an S. We tackled this problem with specific rules which detect flat structures and insert an intermediate structure introducing the different heads and their corresponding dependents. This way they can be treated as well-formed constituents.

Multiwords. In Ancora these include complex prepositions or conjunctions, verb groups, complex determiners and proper names. They are challenging because many of them are treated sometimes compositionally and sometimes as a single token:

(13) a. ya_que b. ya que

For the moment, we have adapted our annotation to this multiword approach, but deconstructing them into individual tokens will be the next step in our project.

Empty elements. Another modification to the original Ancora annotation is the suppression of empty tokens which correspond to dropped subjects in Spanish.
As these items do not appear in the text, we decided not to include them in the dependency tree.
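The conversion procedure of section 6.2 can be sketched as a short runnable program. It assumes constituents encoded as (label, children) tuples with strings as terminals, and a single hypothetical category-preference list for head selection; the actual system uses one main rule plus nine finer-grained rules, so this is only an approximation of the idea:

```python
# Illustrative head preference list; the paper's rule set is richer.
HEAD_PREFERENCE = ["grup.verb", "verb", "sn", "noun"]

def head_child(children):
    """Pick the head among the children of a constituent."""
    for label in HEAD_PREFERENCE:
        for child in children:
            if isinstance(child, tuple) and child[0] == label:
                return child
    return children[0]

def convert(tree, deps):
    """Return the lexical head of `tree`, appending (dependent, head)
    arcs for every non-head child, bottom-up."""
    if isinstance(tree, str):          # terminal node
        return tree
    _, children = tree
    head = head_child(children)
    lexical_head = convert(head, deps)
    for child in children:
        if child is not head:
            deps.append((convert(child, deps), lexical_head))
    return lexical_head

# Toy sentence: "Juan duerme"
tree = ("S", [("sn", ["Juan"]), ("grup.verb", [("verb", ["duerme"])])])
deps = []
root = convert(tree, deps)
# root == "duerme", deps == [("Juan", "duerme")]
```

Each terminal thus ends up attached to the lexical head of the highest constituent it heads, which mirrors the bottom-up walk of lines 7-11 in Figure 1.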

7 Evaluation

7.1 Evaluation Corpus

The evaluation corpus was annotated manually for both dependency relations and syntactic functions. We annotated a total of 256 sentences, chosen partially at random; that is, we made sure that the selected files included all the linguistic phenomena described in section 5.1 above. The evaluation corpus contains a total of 6,160 tokens (out of the 517,269 tokens in Ancora, which corresponds to 1.5% of the whole corpus in terms of number of files). Figure 2 exemplifies the content and format of the evaluation corpus:

1#La#2#det
2#situación#10#nsubj
3#en#2#prepn
4#las#5#det
5#carreteras#6#coord
6#y#3#pobj
7#las#8#det
8#montañas#6#coord
9#se#10#reflec
10#normalizó#ROOT#root
11#en#10#prepv
12#todas#14#det
13#las#14#det
14#autonomías#11#pobj
15#afectadas#14#amod
16#.#10#punct

Figure 2: Evaluation corpus fragment

7.2 Results

The results obtained are highly satisfactory: the labeled attachment score (LAS) reached 0.85 and the unlabeled attachment score (UAS) 0.92; label accuracy (LA) is reported, together with the other measures, in Table 2.

Table 2: Results (Accuracy, Kappa, LAS, UAS, LA)

As syntactic function labels are likely to be incorrect if the corresponding node's head was not set correctly, we also calculated the label accuracy over the correctly identified attachments alone (also given in Table 2). The Kappa coefficient for agreement between coders was calculated in order to exclude the factor of agreement by chance. Of the two main ways of calculating Kappa, we followed Cohen (1960) because it is better suited for cases where categories have significantly different distributions. In this case the coders were a human annotator and our system. The kappa value of 0.88 for syntactic function labels is in the range of almost perfect agreement according to Landis and Koch (1977). Unfortunately, Civit, Martí and Bufí (2006) do not give results for their conversion from constituents to dependencies in their paper.
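The attachment scores and the kappa coefficient can be computed directly from the per-token format shown in Figure 2 (index#token#head#function). The sketch below assumes exactly that field layout and is an illustrative scorer, not the authors' evaluation code:

```python
def parse(lines):
    """Read Figure-2-style lines into (head, function) pairs."""
    return [tuple(line.split("#")[2:4]) for line in lines]

def attachment_scores(gold, pred):
    """LAS, UAS and LA over aligned token lists."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    la = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    return las, uas, la

def cohen_kappa(gold_labels, pred_labels):
    """Cohen (1960): chance-corrected agreement between two coders."""
    n = len(gold_labels)
    po = sum(g == p for g, p in zip(gold_labels, pred_labels)) / n
    cats = set(gold_labels) | set(pred_labels)
    pe = sum((gold_labels.count(c) / n) * (pred_labels.count(c) / n)
             for c in cats)
    return (po - pe) / (1 - pe)
```

A token counts for UAS when its head matches, for LA when its function label matches, and for LAS only when both do; kappa discounts the agreement expected from the marginal label distributions of the two coders.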
Their results would have been the best comparison for ours, as they are based on the same corpus, even if not tagged with the same function tagset.

7.3 Error Analysis

The error analysis splits into errors observed in the dependency relation identification task and errors in the labeling of the relations.

Dependency tree creation. Our data show that the system had problems with complex coordinated structures, for example citations which contain more than one sentence.

(14) He said: Sentence 1. Sentence 2

In addition, the rules which treated flat constituent structures were not always able to create the correct dependencies for deeper nodes.

Function labeling. The exact frequencies of agreement and disagreement between our manual annotation and the system's are presented in a confusion matrix (Table 3), which counts only the labels of correctly related dependencies. As the matrix shows, the system had problems with some coordination structures: 72 out of 348 cases showed an incorrect label. Problems came up especially in cases of complex structures, particularly with correlative conjunctions (like bien... bien, 'either... or'). In other cases the rules were too generic, such as the one for labeling the function attr. The system looks at the head lemma and sets attr if it is ser ('to be'). Cases were found in which the label was wrongly used in passive contexts like han sido absueltos ('they were absolved'). The confusion matrix shows that in 10 out of

64 cases the system wrongly identifies the function as attr instead of vobj. In this and similar cases, the rule needs to be written in a more specific way. Furthermore, the system does not include rules for the use of generic labels like obj. Thus it always assigns a specific label, and if this does not fit, it currently assigns the label dep. Some less frequently used labels like nn or abbrev could not be tested, as they did not appear in the evaluation corpus.

8 Final Considerations

The approach presented in this work proves to work in a satisfactory way, and the new annotation offers a further source of linguistic data for the research community. There is still work left, as we want to deconstruct Ancora multiwords into individual tokens and train a parser with the resulting data to work over unseen text. Our new annotation adds value to the original Ancora annotation, as dependency structures are now available according to two different points of view (semantic and now also syntactic) and can serve as a basis for further research. We plan to improve the results by adjusting some of the identified problems in the rules, testing the approach on corpora of different domains and making the data publicly available in the near future (accessible on

9 Bibliography

Carletta, J. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2).

Civit, M., Martí, M. A., and Bufí, N. 2006. Cat3LB and Cast3LB: From Constituents to Dependencies. In Proceedings of the 5th International Conference on Natural Language Processing (FinTAL), Turku, Finland. Springer Verlag LNAI.

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20.

Collins, M. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.

De Marneffe, M. and Manning, C. D. 2012. Stanford typed dependencies manual.
Technical report, Stanford University.

Johansson, R. and Nugues, P. 2007. Extended constituent-to-dependency conversion for English. In Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA).

Kübler, S., McDonald, R. and Nivre, J. 2009. Dependency Parsing. Morgan & Claypool.

Landis, J. R. and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1).

Magerman, D. 1994. Natural language parsing as statistical pattern recognition. Ph.D. thesis, Stanford University.

Mille, S., Burga, A., Vidal, V. and Wanner, L. 2009. Towards a Rich Dependency Annotation of Spanish Corpora. In Proceedings of SEPLN'09, San Sebastián.

Taulé, M., Martí, M. A., and Recasens, M. 2008. AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Proceedings of LREC 2008, Marrakech, Morocco. ELRA.

Yamada, H. and Matsumoto, Y. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies.

Table 3: Confusion matrix for functions in the evaluation corpus (only correct attachments)

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

Automatic prediction of emotions from text in Spanish for expressive speech synthesis in the chat domain

Predicción automática de emociones a partir de texto en español para síntesis de voz expresiva en el dominio del chat

Benjamin Kolz, Juan María Garrido, Yesika Laplaza
Universitat Pompeu Fabra
Roc Boronat 138, Barcelona
{benjamin.kolz,

Resumen: El presente artículo describe un módulo para predecir emociones en textos de chats en castellano que se usará en sistemas de conversión texto-habla para dominios específicos. Tanto el funcionamiento del sistema como los resultados de diferentes evaluaciones realizadas a través de dos corpora de mensajes reales de chat están descritos detalladamente. Los resultados parecen indicar que el rendimiento del sistema es similar a otros sistemas del estado del arte, pero para una tarea más compleja que la que realizan otros sistemas (identificación de emociones e intensidad emocional en el dominio del chat).

Palabras clave: procesamiento de texto, detección de emociones, texto a voz, habla expresiva

Abstract: This paper describes a module for the prediction of emotions in chat texts in Spanish, oriented to its use in specific-domain text-to-speech systems. A general overview of the system is given, and the results of some evaluations carried out with two corpora of real chat messages are described. These results seem to indicate that the system offers a performance similar to other systems described in the literature, but for a more complex task (identification of emotions and emotional intensity in the chat domain).

Keywords: text processing, emotion detection, text-to-speech, expressive speech

1 Introduction

The generation of synthetic expressive speech for specific domains is currently a key topic in the speech synthesis field.
It involves several research problems, such as the prediction of F0 and duration parameters, or the extraction of the necessary linguistic and paralinguistic information, such as emotions, from the input text. The automatic prediction of the underlying emotions associated with the production of the input text of a text-to-speech (TTS) system is not an easy task: in some cases, meaning (at word or sentence level) can help, but in many others there is no information in the utterance to establish whether it was produced expressing a given emotion: only context is able to provide some clues. For this reason, most existing specific-domain TTS systems accept tags in the input text to provide this information to the linguistic processing module. However, the use of TTS in specific applications, such as reading aloud chat messages, does not allow previous tagging of texts; in those cases, automatic detection of emotions seems to be an interesting challenge.

Emotional classification of texts is a task that has been extensively attempted for different purposes, such as information extraction, text classification, sentiment analysis, and also TTS applications (García and Alías, 2008, for example). More specifically, emotion detection in online informal text, such as blog, SMS, chat or social media texts, has also been attempted previously, especially in the field of sentiment analysis (Holzman and Pottenger, 2003; Thelwall et al., 2010; Paltoglou et al., 2013, among others). These works were mainly oriented to the classification of texts into

positive or negative categories, the location of the text in the valence/arousal space, or the identification of a limited set of emotions (generally the basic emotions inventory), and not to larger sets, which would be the task in domains such as chats. Several approaches and techniques have also been applied (for example, Emotional Keyword Spotting (EKS) or SO-PMI-IR, Semantic Orientation from Pointwise Mutual Information and Information Retrieval; Turney, 2002), mixing knowledge-based and machine learning solutions (Alm, Roth and Sproat, 2005, among many others). Most work on emotion classification using linguistic information is based on the use of emotional dictionaries, which provide lists of words associated with a given emotional label or parameter (valence or arousal). Some of these works use preexisting general emotional lists, such as ANEW (Affective Norms for English Words; Bradley and Lang, 1999) or WordNet-Affect (Strapparava and Valitutti, 2004) for English, or ANSW (Affective Norms for Spanish Words; Redondo et al., 2007) for Spanish, which result from a manual classification of generic words by experts.

This paper describes EmotionFinder, a module for the detection of emotions in Spanish chat texts which has been implemented in TexAFon, the Python-based linguistic processing module for TTS applications developed at GLiCom (Garrido et al., 2012b). It has been developed to detect the emotional labels most represented in an annotated corpus of chat texts in Spanish, which has been used as base ('training') material for this work. It uses lexicon-based techniques similar to the ones applied in previous works (Francisco, Hervás and Gervás, 2005, for example) to identify emotions, but also includes a set of knowledge-based heuristic rules derived from the analysis of the training corpus.
The emotional dictionary used in this case has been derived from a corpus of chat material, the same communicative situation in which the TTS system is expected to be used. In the following pages, a brief description of the base material used for this work is given, the system is described, and the results of several evaluations are presented.

2 Training corpus: an annotated database of chat messages in Spanish

The work presented here is based on the analysis of a set of 4,207 utterances of real chat messages in Spanish, annotated with emotional tags, which is called here the training corpus. This training corpus is a subset of a more general corpus of chat conversations collected for an ongoing project on expressive synthesis in the chat domain (45 generic chat conversations, without a specific topic, totaling 8,780 interventions). This general corpus was labeled with emotional tags by a single human annotator, using the inventory of emotions described in Garrido et al. (2012a), and then partially revised by two people different from the main annotator. The training corpus contains only the interventions representing the most frequent emotional labels found in the general corpus (16 out of 37, those showing a relative frequency above 1%). Table 1 presents the list of these 16 emotional tags and the number of appearances of each one in the training corpus.

This training corpus was used for three different tasks during the development of EmotionFinder:

- the definition of the set of 8 emotions currently detected by the module, which is a subset of the emotions included in the training corpus (16);
- the development of the emotional dictionary;
- the development of the heuristic rules.

Emotion            Number of appearances
Rejection          1185
Derision           547
Happiness          495
Interest           407
Anger              371
Affection          220
Disturbance        194
Surprise           122
Pride              124
Sadness            111
Negative surprise  95
Fun                90
Admiration         90
Resignation        64
Doubt              63
Disappointment     59

Table 1: List of the 16 emotion labels covered by the training corpus

3 EmotionFinder Overview

The current implementation of EmotionFinder is able to detect eight different emotions in the input text: admiration, affection, disappointment, interest, happiness, surprise, rejection and sadness. These labels are a subset of the most frequent emotions found in the training corpus. It works at sentence level: it tries to assign a single emotional label (or none, if the text is considered to be neutral) to the sentences detected by TexAFon in the input text. It assumes a previous step of lemmatization of the words making up the input sentence (both the emotional dictionary and the rules include only lemmatized words, to improve their generalization power), which is carried out by a separate module (Lemmatizer) that has also been integrated in TexAFon as part of this project.

The EmotionFinder module includes a set of functions, one per emotion, which combine searching for key words (taken from the emotional dictionary) and regular expressions with rule-based emotion inference. All these functions are applied to the input sentence one by one to check for possible cues related to the considered emotions. If a function detects one or several cues for the corresponding emotion in the input sentence, it adds the following information to the list of emotion candidates of the sentence:

- the label of the candidate emotion;
- a number indicating the predicted intensity of the emotion (1, 2 or 3);
- an associated weight indicating how reliable the cue is for the detection of that emotion.
If the function finds several different hits for the same emotion in one sentence, the final weight is the sum of all of them. The final intensity value corresponds to the highest intensity within the set of detected hits. So, for example, the output of the function corresponding to happiness for the sentence Estoy feliz y encantado con el plan would be ALEGRIA(3):70 (happiness with intensity level 3 and weight 70), which is the result of combining the information of two different cues detected in the sentence: ALEGRIA(2):40 and ALEGRIA(3):30. At the end of the process, the emotion label with the highest weight is selected as the sentence emotion. For example, in the case of the sample sentence Es un buen amigo, the final list of candidate emotions would be ADMIRACION(1):20 and ALEGRIA(1):40, and the final output label would be ALEGRIA(1), which is the one with the highest weight. Figure 1 illustrates the workflow of the emotion labeling procedure in EmotionFinder.
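The combination step just described (weights of all hits for an emotion are summed, the highest intensity is kept, and the heaviest emotion wins) can be sketched as follows; the cue tuples and the function name are ours, not the module's actual API:

```python
def combine_cues(cues):
    """Combine (emotion, intensity, weight) cues into the sentence label.
    Returns None for a sentence with no cues (i.e., 'neutral')."""
    totals, intensities = {}, {}
    for emotion, intensity, weight in cues:
        totals[emotion] = totals.get(emotion, 0) + weight      # sum weights
        intensities[emotion] = max(intensities.get(emotion, 0), intensity)
    if not totals:
        return None
    best = max(totals, key=totals.get)                         # heaviest wins
    return best, intensities[best], totals[best]

# The worked example from the text: ALEGRIA(2):40 + ALEGRIA(3):30
print(combine_cues([("ALEGRIA", 2, 40), ("ALEGRIA", 3, 30)]))
# -> ('ALEGRIA', 3, 70)
```

With the Es un buen amigo cues, ADMIRACION(1):20 and ALEGRIA(1):40, the same function returns ALEGRIA with intensity 1, matching the example above.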

Figure 1: EmotionFinder workflow

3.1 Emotional dictionary

In its current implementation, the emotional dictionary contains 454 entries (lemmatized isolated words and fixed expressions). Each entry contains: the lemma of the word or expression; its associated emotional label; a number (1, 2 or 3) expressing the intensity of the associated emotion; and the weight of the entry. These entries were chosen after a manual analysis of the utterances labeled with the 8 considered emotions in the training corpus, and were manually annotated with the associated emotion, intensity and weight information. Table 2 summarizes the contents of the dictionary, and Table 3 gives some sample entries.

Table 2: Summary of the contents of the emotional dictionary (lemmas, fixed expressions/collocations and total entries per emotion)

Entry           Intensity  Emotion     Weight
estupendo       2          admiration  50
excepcional     2          admiration  50
extraordinario  2          admiration  60
fascinar        3          admiration  70
fascinación     3          admiration  70
fenómeno        2          admiration  70
formidable      2          admiration  60
forrarse        2          admiration  60
fuerte          2          admiration  60
genial          3          admiration  60

Table 3: Sample entries of the emotional dictionary

3.2 Emotion prediction rules

Emotion prediction rules have been implemented in the emotion recognition procedures to incorporate into the detection process additional information related to the identification of emotions, such as negation, comparative forms or the use of specific punctuation marks, which is not derivable by detecting single lexical items, but which can be relevant for the prediction of emotions and their intensity.
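Before any rules apply, cues come from plain dictionary lookup. A minimal sketch over entries like those in Table 3 could look as follows; the three sample entries are taken from the table, while the function itself is illustrative and assumes the input has already been lemmatized:

```python
# A few real entries from Table 3 (lemma -> (intensity, emotion, weight)).
EMOTIONAL_DICTIONARY = {
    "estupendo": (2, "admiration", 50),
    "fascinar": (3, "admiration", 70),
    "genial": (3, "admiration", 60),
}

def emotion_cues(lemmas):
    """Return (emotion, intensity, weight) cues for every dictionary hit."""
    cues = []
    for lemma in lemmas:
        if lemma in EMOTIONAL_DICTIONARY:
            intensity, emotion, weight = EMOTIONAL_DICTIONARY[lemma]
            cues.append((emotion, intensity, weight))
    return cues

# A lemmatized "ser genial" yields one admiration cue of intensity 3.
```

The rules described next then adjust or override these raw cues, for instance flipping polarity under negation.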
For example, negation can make an emotion word fail to evoke an emotion, as in No me interesa (the emotion 'interest' is not evoked here), or even evoke a contrary emotion, as in No eres una buena persona (not 'admiration' but 'rejection'). Some rules to deal with polarity effects have therefore been included. These rules have been developed using negation identifiers in regular

expressions, which can already identify and treat correctly an important part of the cases where negation is involved. However, other cases in which a larger-scope linguistic analysis (at sentence level, for example) is needed cannot be correctly handled yet, because morphosyntactic analysis of the input sentence is not currently available.

4 Evaluation

The system was submitted to two different evaluations: the first one was carried out using a subset of the training corpus already described in section 2, and a second one with a smaller corpus of chat messages different from the training corpus (the 'evaluation corpus'). The evaluation data presented in García and Alías (2008) have been taken as reference: this work describes a system similar to the one presented here (also oriented to the identification of emotions in a TTS task), but considers a smaller set of emotion labels, all of them contained in the basic emotions inventory ('anger', 'happiness', 'fear', 'surprise' and 'sadness'), plus the 'neutral' label. Also, their evaluation was carried out on a different domain from the one chosen for this work: 250 headlines of English newspapers. The results of that evaluation are reproduced in Table 4.

Label      Precision
Neutral    0.84
Happiness  0.25
Anger      0.04
Surprise   0
Fear       0.28
Sadness    0

Table 4: Evaluation results of the system described in García and Alías (2008)

4.1 Training corpus evaluation

Procedure. The training evaluation corpus contained the subset of utterances labeled with the 8 implemented emotions (admiration, affection, disappointment, interest, happiness, surprise, rejection and sadness) within the training corpus, plus a set of 1,756 neutral sentences coming from the same general corpus.
The inclusion of this set of neutral sentences was motivated by two facts: first, the training corpus contains a large amount of neutral sentences, and it has been considered that their correct identification as neutral is as important as the recognition of the different considered emotions; second, the evaluation task described in García and Alías (2008) also included neutral sentences, so neutral sentences should also be considered in the evaluation of EmotionFinder in order to make the evaluation results comparable. In addition, the set of sentences labeled as rejection included in the evaluation corpus was reduced to 731, instead of the 1,185 of the original training corpus. The total number of evaluated sentences was thus 3,991, distributed as specified in Table 5.

Label           Number of sentences
Neutral         1756
Rejection       731
Happiness       495
Interest        407
Affection       220
Surprise        122
Sadness         111
Admiration      90
Disappointment  59
TOTAL           3991

Table 5: Contents of the training evaluation corpus

This corpus was processed with EmotionFinder to obtain a prediction of labels, which were then compared with the emotion labels of the human annotator of the training corpus. Precision and recall values were then calculated.

Results. Table 6 presents the results obtained with the training evaluation corpus. A mean precision of 0.54 was obtained, with a recall of 0.49, but strong differences among emotional labels can be observed. The best results are obtained in the case of the interest label (0.67), followed by the neutral label (0.65). The labels showing the worst results are disappointment (0.05) and surprise (0.04). These results can be considered acceptable, but it has to be taken into account that they have been obtained from the same corpus from which the data used to build the system were extracted.
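The per-label precision, recall and F1 figures reported in the evaluation tables can be derived from raw (gold, predicted) label pairs as sketched below; this is a generic scorer written for illustration, not the authors' evaluation code:

```python
def per_label_scores(pairs):
    """Compute (precision, recall, F1) per label from (gold, pred) pairs."""
    tp, fp, fn = {}, {}, {}
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] = tp.get(gold, 0) + 1
        else:
            fp[pred] = fp.get(pred, 0) + 1   # wrongly predicted label
            fn[gold] = fn.get(gold, 0) + 1   # missed gold label
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        t = tp.get(label, 0)
        denom_p = t + fp.get(label, 0)
        denom_r = t + fn.get(label, 0)
        p = t / denom_p if denom_p else 0.0
        r = t / denom_r if denom_r else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```

For each label, precision divides true positives by everything the system assigned that label to, and recall divides them by everything the annotator assigned it to; F1 is their harmonic mean.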

Benjamin Kolz, Juan María Garrido, Yesika Laplaza

Label | True positives | False positives | False negatives | Recall | Precision | F1
(rows for Neutral, Happiness, Admiration, Affection, Rejection, Surprise, Interest, Sadness, Disappointment and TOTAL; the numeric values were lost in extraction)

Table 6: Results obtained with the training evaluation corpus.

4.2 Evaluation corpus

Procedure

The evaluation corpus was collected to test the performance of the system on a set of data different from the one used for its development. It consisted of 609 sentences coming from the same source as the general corpus (real messages from chats in Spanish), but not included in the training corpus. This corpus was also annotated with emotional labels by a human annotator, different from the one who annotated the training corpus, using the same label inventory. The resulting annotation was partially revised by a second annotator, the one who labeled the training corpus, in order to check consistency in the use of the emotion labels. As in the case of the training corpus, this corpus included a large number of neutral sentences, which were also used in the evaluation task for the same reasons as in the previous evaluation. Table 7 shows the distribution of the sentences according to their label in this corpus.

Label           Number of sentences
Neutral         380
Happiness       11
Admiration      10
Affection       67
Rejection       54
Surprise        34
Interest        25
Sadness         22
Disappointment  6
TOTAL           609

Table 7: Contents of the evaluation corpus.

As before, the corpus was processed with EmotionFinder to obtain the prediction of labels, which were then compared with the labels added by the human annotator. Precision and recall values were again calculated.

Results

Table 8 presents the results obtained with the evaluation corpus. A mean precision of 0.6 was obtained, with a recall of 0.58, even better than the results obtained with the training corpus.
However, a closer look at the data shows that this value is mainly due to the very good results obtained for the neutral label (precision 0.81); the emotional labels show clearly lower values than in the previous evaluation, with two labels ("happiness" and "surprise") having a precision score of 0, and a maximum of 0.33 in the case of rejection. These results reveal that the dictionary and the rules depend strongly on the training corpus.

Label | True positives | False positives | False negatives | Recall | Precision | F1
(rows for Neutral, Happiness, Admiration, Affection, Rejection, Surprise, Interest, Sadness, Disappointment and TOTAL; the numeric values were lost in extraction)

Table 8: Results obtained with the evaluation corpus.

5 Discussion and conclusions

In this paper a new module for the prediction of emotions in chat text, oriented to the generation of emotional speech in the chat domain, has been presented. It uses a combination of lexical information (in the form of an emotional dictionary built specifically for the system from a reference corpus) and handcrafted expert rules to identify some of the most frequent emotions appearing in the emotional annotation of a corpus of chat messages, as well as the intensity of the emotion. Both aspects (detection of emotions beyond the inventory of basic emotions and detection of emotion intensity) are novel with respect to previous systems. The results obtained in the evaluations are encouraging: they are slightly better than those of the system chosen as reference, for a more complex identification task (nine emotional labels instead of the six labels of the reference system). Also, the system shows good performance in discriminating neutral from emotional sentences, an important task in the generation of synthetic expressive speech in a specific-domain situation, where neutral and emotional sentences appear mixed and must be handled appropriately. The observed differences between the results of the two evaluations (better scores in the emotion detection task with the training corpus than with the evaluation corpus) seem to indicate that the performance of the system is still quite dependent on the corpus used to develop the rules and the emotional dictionary.
Further research should be done to enlarge the dictionary and to improve the rules so as to cover phenomena not present in the training corpus used. The use of morphosyntactic information could also improve the performance of the current rules and allow the development of new ones.

6 References

Alm, C. O., Roth, D. and Sproat, R. Emotions from text: machine learning for text-based emotion prediction. Proceedings of HLT/EMNLP.

Bradley, M. and Lang, P. Affective Norms for English Words (ANEW): Stimuli, Instruction Manual and Affective Ratings. Technical Report C-1, Gainesville, FL, The Center for Research in Psychophysiology, University of Florida.

Francisco, V., Hervás, R. and Gervás, P. Expresión de emociones en la síntesis de voz en contextos narrativos. Simposio de Computación Ubicua e Inteligencia Ambiental.

García, D. and Alías, F. Identificación de emociones a partir de texto usando desambiguación semántica. Procesamiento del Lenguaje Natural, 40.

Garrido, J. M., Laplaza, Y., Marquina, M., Pearman, A., Escalada, J. G., Rodríguez, M. A. and Armenta, A. 2012a. The I3MEDIA speech database: a trilingual annotated corpus for the analysis and synthesis of emotional speech. LREC 2012 Proceedings. Online: er.pdf, accessed on 13 November.

Garrido, J. M., Laplaza, Y., Marquina, M., Schoenfelder, C. and Rustullet, S. 2012b. "TexAFon: a multilingual text processing tool for text-to-speech applications". Proceedings of IberSpeech 2012, Madrid, Spain, November 21-23, 2012. Online: nlineproceedings, accessed on 13 November.

Holzman, L. and Pottenger, W. Classification Of Emotions in Internet Chat: An Application of Machine Learning Using Speech Phonemes. Technical Report LU-CSE, Lehigh University.

Paltoglou, G., Theunis, M., Kappas, A. and Thelwall, M. "Predicting Emotional Responses to Long Informal Text". IEEE Transactions on Affective Computing, 4, 1.

Redondo, J., Fraga, I., Padrón, I. and Comesaña, M. The Spanish adaptation of ANEW (Affective Norms for English Words). Behavior Research Methods, 39(3).

Strapparava, C. and Valitutti, A. Wordnet-affect: An affective extension of wordnet. Proceedings of the Fourth International Conference on Language Resources and Evaluation.

Thelwall, M., Buckley, K., Paltoglou, G., Cai, D. and Kappas, A. Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology, 61.

Turney, P. D. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 02). Philadelphia, Pennsylvania, USA.

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

Función de las secuencias narrativas en la clasificación de la polaridad de reviews

The function of narrative chains in the polarity classification of reviews

John Roberto, CLiC-UB, Gran Via 585, Bcn
Maria Salamó, Universidad de Barcelona, Gran Via 585, Bcn
M. Antònia Martí, CLiC-UB, Gran Via 585, Bcn

Resumen: Product reviews are a valuable source of information for understanding user preferences in content personalization systems. This article analyses the function of narrative sequences in computing the polarity of products. To this end, we applied an algorithm to extract the sentences that contain semantically related events, and carried out a series of experiments aimed at determining the impact that omitting those sentences may have on the polarity of the reviews. The results show that negative opinions about products tend to be expressed through narrative sequences, whereas positive opinions are independent of narration.

Palabras clave: Polarity analysis, user profiles, opinion mining
Keywords: Polarity analysis, user profiles, opinion mining

1 Introduction

Given the enormous amount of information available on the Internet, content personalization systems (e.g. recommender systems) are becoming an indispensable tool for coping with information overload. User Profiles (UP) are a central component of these systems. We define the UP as a structured representation of a user's attributes. These attributes can be categorized into two classes (Vildjiounaite et al., 2007):

Restrictions: personal information about the user, such as age, marital status or personality. This information is used to limit or restrict the products to be recommended.

Preferences: information about the user's tastes, needs and interests regarding a given product or service. Preferences may refer to the product as a whole ("I like the iPhone") or to some of its features ("I like its design").

Ratings and, more recently, reviews are the usual ways of obtaining UPs. In our analysis we consider only reviews, given the linguistic interest they raise by consisting entirely of natural language text. In previous work (Roberto, Salamó, and Martí, 2014), we determined that restrictions and preferences are expressed in more or less independent segments of reviews. Thus, following (Ricci and Wietsma, 2006), we define a review as a short, subjective text that: 1. recounts the personal experiences of a user in relation to a product (that is, the restrictions), 2. contains a more or less detailed description of the product's features (that is, the preferences about features), and 3. makes an overall assessment of the product (item-level preference).

Additionally, according to our model, each of these three types of segments is expressed through a different textual modality:

Experiences are expressed through narration: e.g. "my wife and I stayed for 15 days".

The description of the product's features is expressed through description: e.g. "the rooms are huge".

The assessment of the product is expressed through exhortation: e.g. "I recommend you spend a few days in this hotel".

The automatic extraction of these three types of segments is useful for enriching UPs and improving recommendation processes based on review analysis. In this article we analyse the function of narrative sequences in computing the polarity of products. To do so, we extract from the reviews the sentences that recount the user's experiences, adapting the narrative schema model of Chambers and Jurafsky (2008b). We then carry out several experiments aimed at determining the impact that omitting those sentences may have on computing the polarity of the reviews. Our goal is to show that not all the components of a review contribute equally to polarity detection.

The article is structured as follows. In Section 2 we present an adaptation of Chambers and Jurafsky's algorithm for detecting events and narrative sequences. In Section 3 we present the experiments and results. In Section 4 we briefly review work related to the detection of narrative sequences.
Finally, in Section 5 we present the conclusions and future work.

2 Detection of events and narrative sequences

For the purposes of our work, we define a review (R) as a text composed of a set of narrative (O_N), descriptive (O_D) and exhortative (O_E) sentences, that is, R = {O_N + O_D + O_E} = {o_1, o_2, o_3, ..., o_n}. Additionally, a narrative sequence (O_en) is the subset of the sentences in O_N that recount semantically and temporally related events: O_en ⊆ O_N. Each event (e) is a tuple formed by the verb and its arguments: e = ⟨v, arg⟩ where arg ∈ {subj, obj, prep}. A concrete case of O_en can be seen in Example (1):

(1) "My dad bought me a Saturn for my graduation in 1992 before all the marketing hype (how embarrassing to be constantly asked if I went to Tennessee!). Shortly thereafter the problems started...."

To bring out the semantic relation between the events of a review, we rely on the Narrative Coherence Assumption of Chambers and Jurafsky (2008b). According to these authors, verbs that share coreferent arguments are semantically related by virtue of the narrative structure of the discourse. For example, in fragment (1) the verbs "bought" and "went" have coreferent pronouns ("me" and "I"), so a semantic relation is established between the two verbs: ⟨bought, X object⟩ and ⟨went, X subject⟩. Following the graphical notation used by Chambers and Jurafsky:

My dad bought me -> buy (bought X)
I went to Tennessee -> go (X went)

Figure 1: Semantic relation between events (basic model).

In Figure 1 the shaded circles represent the coreferent element X (in one case X is the sentence object, in the other the subject).
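The Narrative Coherence Assumption just described can be illustrated with a toy sketch: events are stored as (verb, role, entity) tuples, and two verbs are linked whenever they share a coreferent entity. The tuples below encode part of Example (1); the code itself is illustrative, not the authors' implementation:

```python
from collections import defaultdict
from itertools import combinations

# events as (verb, role, entity) tuples; entity ids would come from
# coreference resolution ("me" and "I" both resolve to X)
events = [
    ("buy",   "obj",  "X"),         # "My dad bought me ..."
    ("go",    "subj", "X"),         # "... I went to Tennessee"
    ("start", "subj", "problems"),  # "the problems started"
]

# group events by entity: verbs sharing a coreferent argument are linked
by_entity = defaultdict(list)
for verb, role, ent in events:
    by_entity[ent].append((verb, role))

links = [pair for evs in by_entity.values() for pair in combinations(evs, 2)]
print(links)  # [(('buy', 'obj'), ('go', 'subj'))]
```

Only "buy" and "go" get linked, since "problems" appears with a single verb; this mirrors the basic model of Figure 1.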

The white circles are the entities that have no coreferential link: "my dad" and "Tennessee". The problem with this representation is that it captures only 2 of the 6 events underlined in Example (1). To include the remaining 4 events we must consider, besides the verbs, deverbal nominalizations ("graduation" and "hype"), temporal expressions ("before" and "shortly thereafter"), and coordinating and subordinating connectives ("if"). The result of the new analysis can be seen in Figure 2.

[Figure 2: Semantic relation between events (extended model). Nodes: buy (bought X), graduate (X graduation), hype (hype), ask (asked), go (X went) and start (started), linked through "before", "if" and "thereafter".]

As can be observed, the deverbal noun "graduation" is incorporated directly into the narration (it shares the argument X) thanks to the possessive pronoun "my". The events "hype" and "ask", in turn, are incorporated indirectly (they share no arguments) through a temporal expression ("before") and a subordinating connective ("if"). Finally, we include the verb "start" because of the proximity of its sentence to the previous ones and because it is headed by an adverbial phrase ("shortly thereafter") that expresses temporal order.

Algorithm 1 describes the procedure we use to extract from the reviews the sentences that make up the narrative sequences according to the model shown in Figure 2. The input to the algorithm is the review (R), specifically the sentences that compose it. Through a call to the Stanford parser we obtain the syntactic dependencies (line 8) and resolve the coreferences of each sentence (line 9). Next, we identify the nominalizations using a list of deverbal nouns that we extracted from NomBank (line 10).
We apply a similar procedure to identify the temporal expressions, using TimeBank (Pustejovsky et al., 2003) and a list of adverbs, adjectives and prepositions extracted from the same corpus ("after", "immediately", "follows", "meanwhile", etc.) (line 11). In lines 12 to 16 we select the head-modifier pairs and, if the head is a verb or a nominalization and the modifier is a coreferent argument or a temporal expression, we add the head to the list of events (line 14). Since a narrative sequence is composed of at least two events, only a sentence containing two or more events (line 17) becomes part of the narrative sequence (line 18). Additionally, whenever possible, we capture references to the two following sentences (lines 19 to 24), and if either of those two sentences contains a temporal expression, we also include it in the narrative sequence (lines 26 and 34). Finally, the algorithm returns the set of selected sentences (O_en). In Annex A we present Example (1) fully worked out and processed automatically. We wish to make clear that, although in Figure 2 we include the event "ask" as part of the model, the current implementation of the algorithm does not capture that event (see Table 4). A pending task for future work is to look for alternatives for handling such isolated events.

3 Experiments and results

The experiments presented below aim to determine the relation between the use of narrative sequences and the polarity of reviews.

3.1 The data

In this work we used the corpus of opinions in English of (Cruz, 2012). The corpus consists of 2547 documents, of which 972 are opinions about cars, 587 about headphones and 988 about hotels.
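The detection procedure of Algorithm 1 described above can be sketched in Python. This sketch assumes the per-sentence analyses (dependency pairs, coreferent mentions, temporal expressions) are precomputed, whereas the paper obtains them from the Stanford parser, NomBank and TimeBank; all names and the data layout are illustrative:

```python
def narrative_sentences(sentences, analyses):
    """Select the sentences that form the narrative sequence (O_en).

    analyses[i] is a tuple (deps, crefs, time) for sentence i:
    deps  - list of (head, modifier, head_is_verb_or_nominalization)
    crefs - set of coreferent mentions in the sentence
    time  - set of temporal expressions in the sentence
    """
    selected = set()
    for i in range(len(sentences)):
        deps, crefs, time = analyses[i]
        # event heads: verb/deverbal-noun heads whose modifier is a
        # coreferent mention or a temporal expression (lines 12-16)
        events = {h for h, m, is_event in deps
                  if is_event and (m in crefs or m in time)}
        if len(events) >= 2:          # at least two events (line 17)
            selected.add(i)
            # look ahead two sentences and keep those that contain a
            # temporal expression (lines 19-34)
            for j in (i + 1, i + 2):
                if j < len(sentences) and analyses[j][2]:
                    selected.add(j)
    return [sentences[i] for i in sorted(selected)]

# toy run over a 3-sentence "review"
sentences = ["s0", "s1", "s2"]
analyses = [
    ([("bought", "me", True), ("went", "I", True)], {"me", "I"}, set()),
    ([("started", "problems", True)], set(), {"thereafter"}),
    ([], set(), set()),
]
print(narrative_sentences(sentences, analyses))  # ['s0', 's1']
```

Here the first sentence qualifies on its own (two events with coreferent arguments), and the second is pulled in by the look-ahead because it contains a temporal expression, matching the treatment of "Shortly thereafter the problems started" in Example (1).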
Each opinion has associated with it, in a separate file in XML format, the score of the review (line 2 in the following fragment) and the words that express an opinion about

Footnote 3: We consider only nouns that derive directly from verbs.

Algorithm 1: Automatic detection of narrative sequences

Require: R = {o_1, o_2, o_3, ..., o_n}
1: e_list <- {}            // list of events
2: O_en <- {}              // narrative sequence
3: primero <- false        // auxiliary boolean variable
4: segundo <- false        // auxiliary boolean variable
5: o_x <- {}               // auxiliary sentence
6: o_y <- {}               // auxiliary sentence
7: while o_i in R do
8:   deps = {par_k : 1 <= k <= n_k} where par_k = <head, modifier>   // obtain the dependencies
9:   crefs = {c_m : 1 <= m <= n_m} where c_m = word                  // resolve the coreferences
10:  nomin = {n_s : 1 <= s <= n_s} where n_s = word                  // identify the nominalizations
11:  time = {t_t : 1 <= t <= n_t} where t_t = word                   // identify the temporal expressions
12:  for all par = <head, modifier> in deps do
13:    if ((head = verb or head in nomin) and (modifier in crefs or modifier in time)) then
14:      e_list <- e_list + {head}   // store the event
15:    end if
16:  end for
17:  if |e_list| >= 2 then
18:    O_en <- O_en + {o_i}   // add the sentence to the narrative sequence
19:    if (i + 1) <= n then
20:      o_x <- o_{i+1}
21:    end if
22:    if (i + 2) <= n then
23:      o_y <- o_{i+2}
24:    end if
25:  end if
26:  while t_t in time and (o_x or o_y is non-empty) do
27:    if t_t in o_x then
28:      O_en <- O_en + {o_x}   // add the sentence to the narrative sequence
29:      o_x <- {}
30:    end if
31:    if t_t in o_y then
32:      O_en <- O_en + {o_y}   // add the sentence to the narrative sequence
33:      o_y <- {}
34:    end if
35:  end while
36: end while
Ensure: O_en

the product or some of its features (lines 7 to 10):

...
<review id="5" item="Amerisuites Busch Gardens" rating="2">
...
<sentence id="1">
Stains(1) on(2) carpet(3) ,(4) dirty(5) pool(6) ,(7) bad(8) elevator(9) /(10) housekeeper(11) setup(12) .(13)
<opinion polarity="-" feature="swimming pool" featWords="6" opWords="5" />
<opinion polarity="-" feature="elevator" featWords="9" opWords="8" />
</sentence>

3.2 Configurations

The polarity analysis of the reviews was carried out considering six different configurations, five of them based on suppressing different fragments of the text:

R_ref: the reference configuration. The whole review is used to compute the polarity.

ctr_40: polarity analysis using 40% of the review, selected at random.

ctr_30: polarity analysis using only 30% of the review.

The ctr_40 and ctr_30 configurations are controls (ctr), since the text is fragmented at random. We chose 40% and 30% of the texts as control configurations because, as we will see below, these percentages are close to the one obtained with the configuration we intend to evaluate (des_ext).

nar_ext: polarity analysis using the narrative sequences (O_en). This configuration uses the extended model to detect the narrative sequences (see Figure 2).

des_bas: polarity analysis using the NON-narrative segments, that is, the

Domain | pol. | R_ref | ctr_40 | ctr_30 | nar_ext | des_bas | des_ext
(precision and yield (ren) per domain — cars, hotels, headphones — and polarity, with text length in % and in words in the last row; the numeric values were lost in extraction)

Table 1: Results of the polarity computation on the reviews.

part of the review that remains after removing the narrative sequences (O_en). For convenience we will refer to the non-narrative segments as descriptive segments (des = O_D + O_E). In this configuration, the removal of the narrative sequences was carried out according to the basic model (see Figure 1).

des_ext: the target configuration, that is, the one that directly informs us about the impact that omitting the narrative sequences has on the polarity computation. des_ext is based on the extended model of Figure 2.

We stress that both des_bas and des_ext evaluate the polarity of the reviews after suppressing their narrative segments.

For the polarity analysis we used the Semantic Orientation CALculator (SO-CAL) (Taboada et al., 2011). SO-CAL uses dictionaries of words annotated with their semantic orientation on a scale ranging from 5 for the most positive terms ("exquisite") to -5 for the most negative ones ("horrific"). SO-CAL also incorporates polarity modifiers such as intensifiers ("most excellent") and negation ("not good").

3.3 Results

In order to determine the impact that omitting the narrative sequences has on computing the polarity of the reviews, we carried out 36 evaluations of Algorithm 1. The 36 evaluations consider 3 domains (cars, hotels and headphones), 2 polarities (positive + and negative -) and 6 configurations (see Section 3.2). The results can be seen in Table 1, which shows the precision levels obtained when predicting the polarity of the reviews.
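The dictionary-plus-shifters scheme described above for SO-CAL can be illustrated with a toy scorer. The word values and the sign-flip treatment of negation below are illustrative only: SO-CAL's actual dictionaries are far larger, and it handles negation by shifting the score rather than by flipping its sign.

```python
# toy SO-CAL-style scorer: word polarities on a -5..5 scale, with simple
# intensifier and negation handling (all values here are illustrative)
LEXICON = {"exquisite": 5, "excellent": 4, "good": 3, "bad": -3, "horrific": -5}
INTENSIFIERS = {"most": 1.5, "very": 1.25}
NEGATORS = {"not", "no", "never"}

def polarity(tokens):
    score, mult, negate = 0.0, 1.0, False
    for tok in tokens:
        tok = tok.lower()
        if tok in NEGATORS:
            negate = True
        elif tok in INTENSIFIERS:
            mult *= INTENSIFIERS[tok]
        elif tok in LEXICON:
            val = LEXICON[tok] * mult
            score += -val if negate else val
            mult, negate = 1.0, False  # shifters apply to the next sentiment word
    return score

print(polarity("not good".split()))        # -3.0
print(polarity("most excellent".split()))  # 6.0
```

Summing such word-level scores over a review (or over the segments kept by each configuration) yields the document-level polarity being evaluated here.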
We also evaluate the yield (ren), that is, the ratio between the precision and the amount of text needed to obtain that precision (ren = precision / text length). The amount of text used in each configuration is described in the last row of Table 1, both as percentages (%) and as number of words. The main observations we can draw from this analysis are the following:

R_ref obtained the best precisions in the polarity computation, but its yield is very low, since it requires the whole text of the reviews (more than one million words) to reach those values. The highest precision under this configuration is 95.7% (hotels), yet its yield never exceeds 1.0.

ctr_40 and ctr_30 represent the opposite case to R_ref: although their yield exceeds 2.0, their precision levels are very low compared with the other configurations (around 72%).

nar_ext is the least effective configuration: it has the lowest precision levels of all the configurations and its yield is modest. For example, the precision for the headphones domain with negative polarity is only 60.3%, with a yield of 0.9.

des_bas and des_ext, on the other hand, show the precision levels closest to R_ref for positive polarity while using only 39.4% of the review (see the shaded cells in Table 1). These results contrast with their low yield for negative polarity. The contrast we refer to is more evident in des_ext than in des_bas. For example, in the headphones domain the precision on positive cases is 95.5%, versus 67.2% on negative cases. Likewise, if we look at the average differences in yield (ren) between des_ext and the control configurations, which work with a number of words similar to that of des_ext, we see a notable divergence in yield due to the type of polarity: des_ext = 0.5 [4], ctr_30 = 0.2 and ctr_40 = 0.2.

The results therefore indicate that, by omitting the narrative sequences of the reviews, we are discarding information relevant to understanding negative opinions. From a linguistic point of view, this finding reveals that users tend to resort to narration to describe negative aspects of products, whereas positive assessments are independent of narration: "The management staff is the worst I've ever encountered... The employees at this hotel were the rudest bunch of people I have ran into in a while."

4 Related work

In this section we describe some of the most relevant work related to the detection of narrative sequences in natural language.
Chambers and Jurafsky have studied how to infer the order of the events appearing in narrative texts. In (Chambers and Jurafsky, 2008b; Chambers and Jurafsky, 2008a) the authors present the foundations of their model. In later work (Chambers and Jurafsky, 2009; Chambers and Jurafsky, 2010), they extend the model to include the different roles the protagonist of an event can play (e.g. criminal, suspect) and to enlarge the list of arguments used (subject, object, prep). They also introduce the concept of narrative schema, with which they seek to relate, in a single scenario, all the protagonists of each chain of narrative events.

Regneri, Koller and Pinkal (2010), in turn, apply unsupervised learning to detect the phrases that describe the same event ("sit down at the table", "take a seat", etc.) [5] and the order in which they usually appear in a narration (script). Their procedure is based on a matrix of semantically related phrases over which Multiple Sequence Alignment (MSA) is applied. On this matrix representation, a temporal graph is built which, through a clustering algorithm, determines the order in which the events are to be presented.

Hajishirzi et al. (2011) and Hajishirzi and Mueller (2012) analyse how to interpret narrative sentences through a symbolic representation of the events, states and entities they contain. Their approach draws on two types of knowledge: the description of the events and the most important entities of the domain. For the temporal description of the events (and states) they use a symbolic language characterized by the presence of conditions and consequences: event(x), condition(x), consequence(x).

Li, Lee-Urban, and Riedl (2012) and Li et al.

Footnote 4: des_ext: cars (0.5); hotels (0.7); headphones (0.5).
(2012) present a technique for identifying the events characteristic of a given situation and their most usual temporal arrangement. First, the authors obtain real examples of narrative sequences in different domains. Next, they group sentences that are semantically related (e.g. "the police detain the criminal", "the agents arrest the thief") by applying cosine similarity. Finally, they establish constraints between events (precedence, optionality, exclusion) through an analysis

Footnote 5: Although part of their work consists in detecting the different linguistic realizations of the same event, Regneri, Koller, and Pinkal (2010) make clear that it is not a paraphrase problem.

of their frequency and probability of occurrence.

5 Conclusions and future work

This article analyses the function of narrative sequences in computing the polarity of products. For this purpose, we adapted the narrative schema model of Chambers and Jurafsky (2008b) to extract from the reviews the sentences that recount events, and we analysed the polarity of the texts under different configurations. The results obtained indicate that the narrative textual modality tends to be used to assess products negatively, whereas positive assessments are independent of narration. We therefore conclude that omitting the narrative sequences only affects reviews with negative polarity. This knowledge is useful for understanding how users evaluate products in natural language.

Future work focuses on improving the performance of the narrative sequence detection algorithm by adding a preprocessing phase for the reviews that corrects typographical, spelling and, especially, sentence segmentation errors. Additionally, we plan to evaluate different resources for coreference resolution, since it is an important element for obtaining good results in narrative sequence detection. Finally, we believe it would be productive to restrict the type of nominalizations used for event identification, since the eventive function of some of them in the text is questionable.

Acknowledgements

This research was made possible by funding from projects TIN C02 and TIN CO2 of the Ministerio de Ciencia e Innovación, as well as by the Generalitat de Catalunya through an FI predoctoral grant (2010FI B 00521).

Bibliography

Chambers, N. and D. Jurafsky. 2008a.
Jointly combining implicit constraints improves temporal ordering. In Proc. of the Conference on Empirical Methods in NLP, Stroudsburg, USA.

Chambers, N. and D. Jurafsky. 2008b. Unsupervised learning of narrative event chains. In Proc. of ACL-08: HLT, Columbus, Ohio.

Chambers, N. and D. Jurafsky. Unsupervised learning of narrative schemas and their participants. In Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on NLP of the AFNLP, volume 2, Stroudsburg, USA.

Chambers, N. and D. Jurafsky. A database of narrative schemas.

Cruz, F. Extracción de opiniones sobre características: un enfoque práctico adaptable al dominio. Colección de monografías de la Sociedad Española para el Procesamiento del Lenguaje Natural. SEPLN.

Hajishirzi, H., J. Hockenmaier, T. Mueller, and E. Amir. Reasoning about robocup soccer narratives. In Fabio Gagliardi Cozman and Avi Pfeffer, editors, Proc. of the Conference on Uncertainty in AI.

Hajishirzi, H. and T. Mueller. Question answering in natural language narratives using symbolic probabilistic reasoning. In G. Michael Youngblood and Philip M. McCarthy, editors, Proc. of the 25th International Florida Artificial Intelligence Research Society Conference.

Li, B., D. Appling, S. Lee-Urban, and M. Riedl. Learning sociocultural knowledge via crowdsourced examples.

Li, B., S. Lee-Urban, and M. Riedl. Toward autonomous crowd-powered creation of interactive narratives. In Intelligent Narrative Technologies 5, Papers from the 2012 AIIDE Workshop.

Pustejovsky, J., P. Hanks, R. Saurí, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, and M. Lazo. The TimeBank corpus. In Proc. of Corpus Linguistics.

Regneri, M., A. Koller, and M. Pinkal. Learning script knowledge with web experiments. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, USA.

John Roberto, Maria Salamó Llorente, Maria Antònia Martí Antonín

Ricci, F. y R. Wietsma. Product reviews in travel decision making. En Proceedings of Information and Communication Technologies in Tourism, páginas .

Roberto, J., M. Salamó, y M. Martí. Genre-based stages classification for polarity analysis. Dialogue and Discourse (en proceso de revisión).

Taboada, M., J. Brooke, M. Tofiloski, K. Voll, y M. Stede. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2), junio.

Vildjiounaite, E., O. Kocsis, V. Kyllönen, y B. Kladis. Context-dependent user modelling for smart homes. En C. Conati, K. McCoy, y G. Paliouras, editores, User Modeling, volumen 4511 de Lecture Notes in Computer Science, páginas . Springer.

A Anexo 1: Identificación de las secuencias narrativas del Ejemplo (1)

o1: My dad bought me a Saturn for my graduation in 1992 before all the marketing hype (how embarrassing to be constantly asked if I went to Tennessee!).
o2: Shortly thereafter the problems started.
o3: With a little research online you can find plenty of evidence that Saturns have a history of excessive oil consumption.
o4: My 92 SL2 is a total lemon - blown head gasket, six alternators and a plethora of other problems with less than 60,000.
o5: The dealer has not been helpful - saying that a blown head gasket at 57,000 miles is not uncommon (well for Saturns maybe!).

Figura 3: La entrada al Algoritmo 1 la constituyen las oraciones del review (R), es decir, R = {o1, o2, o3, o4, o5}.

root(root-0, started-5)
advmod(thereafter-2, Shortly-1)
advmod(started-5, thereafter-2)
det(problems-4, the-3)
nsubj(started-5, problems-4)

Figura 4: Mediante el parser de Stanford se obtienen las dependencias con los pares núcleo-modificador (el ejemplo corresponde a o2). (Ver Algoritmo 1, línea 8.)

o p crefs_id crefs_ent
my I my I my saturn saturn sl lemon gasket plethora

Tabla 2: Listado de las entidades (crefs_ent) que comparten un nexo correferencial (crefs_id) en el review.
Se especifica la oración (o) a la que pertenecen y la posición que ocupan según el número de palabras en o. (Ver Algoritmo 1, línea 9.)

 o   p   nominalización   verbo
 1   9   graduation       graduate
 1  15   marketing        market
 1  16   hype             exaggerate
 3   4   research         research
 3  11   evidence         evidence
 3  16   history          record
 3  20   consumption      consume
 4  10   head             head
 5   2   dealer           deal
 5  12   head             head

Tabla 3: Listado de las nominalizaciones presentes en el review. En la tercera columna tenemos las nominalizaciones y en la cuarta los verbos de los que proceden según el NomBank. (Ver Algoritmo 1, línea 10.)

o p crefs_id time eventos
buy graduation
1 16 before hype
go
2 5 thereafter start

Tabla 4: Secuencia narrativa (O_en) obtenida mediante la aplicación del Algoritmo 1. En la primera y quinta columnas tenemos las oraciones y los eventos seleccionados por el algoritmo. En la tercera y cuarta columnas tenemos la correferencia (crefs_id) y/o la expresión temporal (time) que selecciona cada evento.
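El paso de selección que ilustran las Figuras 3 y 4 y las Tablas 2 a 4 puede esbozarse en pocas líneas. El siguiente fragmento es una simplificación hipotética, no el Algoritmo 1 real: las cadenas de correferencia, el léxico de nominalizaciones estilo NomBank, la lista de expresiones temporales y la marca "/V" de los verbos son sustitutos inventados de las entradas reales (parser de Stanford, resolución de correferencia y NomBank).

```python
# Datos de juguete inventados para la demostración.
corefs = {"my": "chain1", "i": "chain1", "saturn": "chain2"}
nominalizations = {"graduation": "graduate", "hype": "exaggerate"}
temporal = {"before", "thereafter", "in 1992"}

def events(sentence_tokens):
    # Eventos = verbos (marcados aquí toscamente con "/V") más las
    # nominalizaciones recogidas en el léxico derivado de NomBank.
    out = []
    for tok in sentence_tokens:
        if tok.endswith("/V"):
            out.append(tok[:-2])
        elif tok in nominalizations:
            out.append(nominalizations[tok])
    return out

def narrative_sequence(review):
    # Una oración entra en la secuencia narrativa si contiene algún evento
    # y está anclada por una cadena de correferencia o una expresión temporal.
    seq = []
    for i, tokens in enumerate(review, start=1):
        evs = events(tokens)
        anchored = any(t in corefs for t in tokens) or any(t in temporal for t in tokens)
        if evs and anchored:
            seq.append((i, evs))
    return seq

o1 = ["my", "dad", "bought/V", "me", "a", "saturn", "for", "my", "graduation", "in 1992"]
o2 = ["shortly", "thereafter", "the", "problems", "started/V"]
o3 = ["plenty", "of", "evidence", "that", "saturns", "consume", "oil"]
print(narrative_sequence([o1, o2, o3]))
```

Con estos datos de juguete, o1 y o2 quedan seleccionadas (eventos anclados por correferencia y por "thereafter", respectivamente) y o3 queda fuera por carecer de eventos reconocidos, imitando el comportamiento descrito en el anexo.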

Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

New experiments on speaker diarization for unsupervised speaking style voice building for speech synthesis

Nuevos experimentos en diarización de locutores para creación de voces para síntesis

Beatriz Martínez-González, José Manuel Pardo, J. D. Echeverry-Correa, J. M. Montero
Grupo de Tecnología del Habla, Universidad Politécnica de Madrid
Avenida Complutense s/n, Madrid
{beatrizmartinez,

Resumen: El uso universal de la síntesis de voz en diferentes aplicaciones requeriría un desarrollo sencillo de las nuevas voces con poca intervención manual. Teniendo en cuenta la cantidad de datos multimedia disponibles en Internet y en los medios de comunicación, un objetivo interesante es el desarrollo de herramientas y métodos para construir automáticamente voces multiestilo a partir de ellos. En un trabajo anterior se esbozó una metodología para la construcción de este tipo de herramientas y se presentaron experimentos preliminares con una base de datos multiestilo. En este artículo investigamos más a fondo esta tarea y proponemos varias mejoras basadas en la selección del número apropiado de hablantes iniciales, el uso o no de filtros de reducción de ruido, el uso de la F0 y el uso de un algoritmo de detección de música. Hemos demostrado que el mejor sistema, que usa un algoritmo de detección de música, disminuye el error de precisión un 22,36% relativo para el conjunto de desarrollo y un 39,64% relativo para el conjunto de prueba en comparación con el sistema base, sin degradar el factor de mérito. La precisión media para el conjunto de prueba es del 90,62%, variando desde el 76,18% para los reportajes hasta el 99,93% para los informes meteorológicos.

Palabras clave: síntesis de voz expresiva, diarización de locutores, estilos de habla, creación de voces

Abstract: Universal use of speech synthesis in different applications would require an easy development of new voices with little manual intervention.
Considering the amount of multimedia data available on the internet and in the media, one interesting goal is to develop tools and methods to automatically build multi-style voices from them. In a previous paper a methodology for constructing such tools was sketched, and preliminary experiments with a multi-style database were presented. In this paper we further investigate this approach and propose several improvements to it based on the selection of the appropriate number of initial speakers, the use or not of noise reduction filters, the use of the F0 feature and the use of a music detection algorithm. We have demonstrated that the best system, using a music detection algorithm, decreases the precision error by 22.36% relative for the development set and 39.64% relative for the test set compared to the baseline, without degrading the merit factor. The average precision for the test set is 90.62%, ranging from 76.18% for reportages to 99.93% for meteorology reports.

Keywords: expressive speech synthesis, speaker diarization, speaking styles, voice building

1 Introduction

Universal use of speech synthesis in different applications would require an easy development of new voices with little manual intervention. One of the goals of the Simple4all Project (Clark and King, 2012) is to create the most portable speech synthesis system possible: one that could be automatically (or with limited manual supervision) applied to many domains and tasks. In order to use speech collected from the media or from media sharing sites, speech synthesis systems must be robust to the variation of

the acoustic and environmental conditions. The system must be able to robustly cope with noisy ASR-processed corpora and with challenging data such as interviews, debates, home recordings, political speeches, etc.

The use of diarization techniques for speaker turn segmentation will allow the system to create homogeneous voices from heterogeneous recordings, because the number of speakers would be automatically estimated in a fully unsupervised way, and language-independent diarization techniques could automatically provide the temporal labels of the turns of a certain speaker (Anguera et al., 2012; Pardo et al., 2012).

In a previous paper (Lorenzo-Trueba et al., 2012) a methodology for constructing such tools was sketched, and preliminary experiments with a multi-style database were presented. In this paper we further investigate this approach and propose several improvements to it based on the selection of the appropriate number of initial speakers, the use or not of noise reduction filters, the use of the F0 feature and the use of a music detection algorithm. A speaker diarization system is used but, in contrast to the traditional objective of optimizing speaker segmentation and identification, our goal is to create pure clusters (speakers) that can be used to synthesize style-voices.

Expressive speech synthesis is a sub-field of speech synthesis that has been drawing a lot of attention lately, as until recently little effort was paid to increasing the adequacy of the produced voices to the task they were intended for. In (Lorenzo-Trueba et al., 2013) a work on synthesizing expressive voices by adapting average voices to the desired style is presented. They also mention the necessity of increasing the available training data for each style.
In this work we aim to develop a system able to extract, from recordings of different styles, pure clusters (speakers) suitable for voice synthesis. Therefore, we accept losing some speech segments as long as the clusters generated are purer (speech from only one speaker).

2 Database

The evaluation presented in this paper is carried out using the C-ORAL-ROM (Moreno-Sandoval et al., 2005) database. This corpus is a multi-language and multi-style database covering a wide spectrum of formal and informal speaking styles, in public and private situations. All the languages included are Romance (French, Italian, Portuguese and Spanish), with styles ranging from formal to informal, extracted either from the media or from private spontaneous natural speaking. In this paper, the Spanish formal media styles have been analysed: news broadcasts, sports, meteorological reports, reportages, talk shows, scientific press and interviews. These data have been extracted from media broadcasts of different stations, and they present a great deal of variability in the recording environments and a high number of speakers (more than 200). This results in some speakers uttering only a few short sentences, making them almost irrelevant from a statistical parametric point of view. The number of speakers per session is variable (between 1 and 28 speakers). Table 1 summarizes average characteristics of the considered sessions for each speaking style.

The manual transcriptions of these sessions are speaker turns in which the speaker is specified, but each segment also includes noises, silences or music (everything from the end of the previous speaker to the beginning of the next). To refine these references to include speech-only segments we have force-aligned the speech with the text also provided in the transcriptions, using acoustic models trained from the Spanish partition of TC-STAR EPPS (European Parliament Plenary Sessions) and PARL (Spanish Parliament Plenary Sessions).
Although the forced alignment helped greatly with this task, it was not free from errors, and we had to correct some labels manually.

Table 1: Features of the speaking style sessions in the C-ORAL-ROM database (number of sessions, speakers per session and time per session, in minutes, for interviews, meteorology, news, reportage, scientific press, sports and talk shows).

To evaluate the implemented methods this database has been split into two: the development set and the test set. Both sets are composed of sessions from all the styles evaluated. Around a third of the database has been reserved for test

experiments. The development set is composed of 27 sessions that sum up minutes, and the test set is composed of 14 sessions that sum up minutes.

3 Diarization system

In previous work (Lorenzo-Trueba et al., 2012) we used a simplified version of the speaker diarization system described in Pardo et al. (2012). Instead of using three input features (MFCC, time delay of arrival (TDOA) and F0) we only used MFCCs, and we did not apply any noise filtering to the recordings. Although our usual diarization system also relies on TDOAs (Martínez-González et al., 2012), in this case we cannot use the delay features, as there is only one channel per session.

In Figure 1 we show the modules of the system. Except for the music detection module, all of them were included in the UPM diarization system of Pardo et al. (2012). In dotted lines, a music detection module is represented, whose influence on the diarization system will be evaluated in this paper. The segments detected as music by this module are discarded from the speech segments detected by the VAD module and, therefore, will not be assigned to any speaker.

The Wiener filter aims to reduce the background noise in the recording. Although for the Multiple Distant Microphone (MDM) task the application of this filter has proved to be positive (Wooters and Huijbregts, 2007), experiments with our database render different results, which will be presented in the following sections. The audio signal is then processed by the MFCC estimation module, where MFCC vectors of 19 components [mfcc] are calculated every 10 ms with a window of 30 ms. The audio signal is also processed by the Voice Activity Detector (VAD) module, which is a hybrid energy-based detector and model-based decoder. The F0 module extracts the F0 feature and adds it to the clustering module as a new stream (Pardo et al., 2012).
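As a concrete illustration of the frame rates involved, the sketch below (plain NumPy, not the actual UPM front end) slices a signal into the 30 ms analysis windows with a 10 ms hop over which the 19-dimensional MFCC vectors would be computed; the cepstral computation itself is omitted.

```python
import numpy as np

def frame_signal(x, sr, win_s=0.030, hop_s=0.010):
    """Slice a mono signal into overlapping analysis frames.

    The paper computes 19-component MFCC vectors every 10 ms over a
    30 ms window; this helper only illustrates that framing step.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    n = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

sr = 16000
frames = frame_signal(np.zeros(sr), sr)   # 1 s of (silent) audio
# 30 ms at 16 kHz -> 480-sample windows, 10 ms hop -> 160 samples
print(frames.shape)
```

At a 16 kHz sampling rate (an assumption; the corpus rate is not stated here), one second of audio yields 98 overlapping frames of 480 samples each.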
The following module is the segmentation and agglomerative clustering process, which consists of an initialization part and an iterative segmentation and merging process. The initialization process segments the speech into K blocks (equivalent to an initial hypothesis of K speakers or clusters), uniformly distributed. Every cluster is modelled using a Gaussian mixture model (GMM), initially containing a number of components that has to be specified (we use 5 for the [mfcc] stream and 1 for the [F0] stream). After the initial segmentation, a set of training and re-segmenting steps is carried out using EM training and Viterbi decoding. Then the merging step takes place. When a merging takes place, the segmentation and clustering steps are repeated until a stopping criterion is reached. More information about the baseline system can be found in Pardo, Anguera and Wooters (2007).

Figure 1: Block diagram of the system (input, Wiener filter, MFCC estimation, F0 estimation, VAD, music detection, and segmentation and agglomerative clustering of speech regions).

4 Experiments

In this section we present new developments to the system presented in Lorenzo-Trueba et al. (2012). Different from what was presented previously is the fact that the speech/non-speech transcriptions have been corrected by hand and that the database has been divided into development and test sets. The diarization score of the baseline system for the development set is included in the first row of Table 2. However, since our goal is to increase the precision of the clusters, we have also calculated the precision and recall, and we have included in the last column a merit factor which weights the precision by two thirds and the recall by one third. All those values are presented in Table 2.

4.1 Initial number of speakers

The original UPM diarization system begins by segmenting the recording into 16 clusters and merging them, reducing their number in each iteration.
As each cluster corresponds to a hypothetical speaker, the system will never recognize more than these 16 initial speakers.
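The agglomerative process behind these cluster counts (uniform initialization into K blocks followed by likelihood-driven merging, Section 3) can be sketched as follows. This is a heavily simplified, hypothetical stand-in: single diagonal Gaussians replace the 5-component GMMs, a BIC-style penalized likelihood replaces the system's actual merge and stopping criteria, and the EM training and Viterbi re-segmentation steps are omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_loglik(X, mu, var):
    # Per-frame log-likelihood under a single diagonal Gaussian.
    return -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)

def fit(X):
    # Maximum-likelihood diagonal-Gaussian "cluster model".
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def merge_score(Xa, Xb):
    # BIC-style score: likelihood lost by modelling both clusters jointly,
    # offset by the complexity of the extra model; merge while positive.
    Xab = np.vstack([Xa, Xb])
    joint = gauss_loglik(Xab, *fit(Xab)).sum()
    sep = gauss_loglik(Xa, *fit(Xa)).sum() + gauss_loglik(Xb, *fit(Xb)).sum()
    penalty = 0.5 * (2 * Xab.shape[1]) * np.log(len(Xab))  # params of one model
    return joint - sep + penalty

def agglomerate(X, k_init=16):
    # 1) uniform initialization: K blocks = K hypothetical speakers.
    clusters = [c for c in np.array_split(X, k_init) if len(c) > 1]
    # 2) iteratively merge the best-scoring pair until no merge helps.
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [merge_score(clusters[i], clusters[j]) for i, j in pairs]
        best = int(np.argmax(scores))
        if scores[best] <= 0:
            break
        i, j = pairs[best]
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

# Two synthetic "speakers": 19-dim frames from well-separated Gaussians.
X = np.vstack([rng.normal(0, 1, (400, 19)), rng.normal(8, 1, (400, 19))])
print(len(agglomerate(X)))
```

On this toy data the 16 initial blocks collapse to the two underlying "speakers": same-speaker merges lose little likelihood and are paid for by the penalty term, while a cross-speaker merge loses far more than the penalty recovers, so the loop stops.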

Figure 2: DER with and without applying the noise filter, using MFCC and F0 features, for F0 weights from 0 to 0.3. MFCC_weight = 1 - F0_weight.

Figure 3: Precision with and without noise reduction, using MFCC and F0 features, for F0 weights from 0 to 0.3. MFCC_weight = 1 - F0_weight.

Some sessions have more than these 16 speakers, and thus the system will never find all of them. In our previous experiments long sessions were split so that no more than 9 speakers were present in a recording. In this work no sessions have been split, so we decided to carry out some experiments beginning with 32 clusters. The best result (in precision) across different F0 weights, using noise reduction (see next section) and beginning with 32 participants, is shown in Table 2, second row. We noticed that even for some of the sessions with a higher number of participants the results are worse than using 16 clusters (third row of Table 2). It turns out that most of the participants in the recordings talked for only a few seconds, and these participants are hardly recognized by the system.

4.2 Noise reduction and F0

In our previous paper, we used only MFCC features to perform the diarization, without noise reduction. In this work we wanted to explore the effects of also applying noise filtering and the F0 features included in Pardo et al. (2012). To combine the MFCCs with the F0 features the system needs a weight to be applied to each of these vectors. These weights are complementary, summing up to 1. In Figure 2, the DER obtained for the development set when initially applying a noise filter (Wiener) or not is presented across the weight factor used for the F0 stream.
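The complementary stream weighting can be sketched at the log-likelihood level; the exact combination used by the UPM system is described in Pardo et al. (2012), so the function below is only an illustrative assumption, with invented per-frame scores.

```python
import numpy as np

def combined_loglik(ll_mfcc, ll_f0, f0_weight=0.15):
    """Weighted combination of per-frame stream log-likelihoods.

    The two stream weights are complementary and sum to 1, as in the
    paper: MFCC_weight = 1 - F0_weight.
    """
    return (1.0 - f0_weight) * ll_mfcc + f0_weight * ll_f0

# Invented per-frame scores, just to show the weighting.
ll_mfcc = np.array([-50.0, -60.0])
ll_f0 = np.array([-5.0, -4.0])
print(combined_loglik(ll_mfcc, ll_f0))
```

With F0_weight = 0.15 the MFCC stream still dominates the combined score, which matches the small F0 weights explored in Figures 2 and 3.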
The diarization error is lowest when the noise filter is not used and the weight of the F0 vector is 0.15, keeping nearly the same merit factor (see Table 2, fourth row). Although this would be the working point in terms of DER, we mentioned previously that our target in this diarization task is not to minimize the diarization error rate (DER) but to maximize the purity of the clusters created, i.e. the precision. The results in precision across F0 weights are shown in Figure 3. In this case the best working point is not so clear. Numerically, the best precision value is obtained when the system applies noise reduction and an F0_weight of 0.05. However, this value is not so far from the best working point in the case of not applying noise reduction. In Table 2 we can see the results for both working points (third and fourth rows). If we analyse the merit factor, it is very similar for both systems, so we will consider both systems in the next experiments.

4.3 Music Detection

Many of the recordings from the media contain music as well as speech. The VAD module usually labels these segments as speech, and the diarization system then assigns them to one speaker, corrupting it. If we want to use the generated clusters to synthesize voices, we want to delete any segment that would corrupt our voices. Music and noises are among the events to avoid, as well as speech overlapped with either music or noises.

Table 2: Results for the development set. Relative precision error improvement is calculated over the baseline. K stands for the initial number of hypothetical speakers, K=16 if nothing is indicated. NR stands for the noise reduction algorithm and MR stands for the music recognition algorithm. (Columns: system, insertion penalty of the MR module, F0 weight, DER, precision, recall, merit factor and precision error improvement (%); rows: Baseline, Baseline+NR+F0+K, Baseline+NR+F0, Baseline+F0, Baseline+MR, Baseline+NR+F0+MR, Baseline+F0+MR and Baseline+NR+F0+MR.)

Table 3: Results for the test set. Relative precision error improvement is calculated over the baseline. NR stands for the noise reduction algorithm and MR stands for the music recognition algorithm. (Columns: system, insertion penalty of the MR module, F0 weight, DER, precision, recall, merit factor and precision error improvement (%); rows: Baseline, Baseline+NR+F0, Baseline+F0, Baseline+MR, Baseline+NR+F0+MR, Baseline+F0+MR and Baseline+NR+F0+MR.)

There are several previous works on speech and music segmentation. Many of them focus on the use of different features that would help in the discrimination between music and speech. This is the case of Izumitani, Mukai and Kashino (2008), Gallardo-Antolin and Montero (2010) or Panagiotakis and Tziritas (2005). Other works, like Lavner and Ruinskiy (2009), focused on system architectures to segment speech and music. In Gallardo and San-Segundo (2010) the UPM-UC3M system for the Albayzin 2010 evaluation on audio segmentation is presented. The best combination of features for the segmentation of music is MFCCs, CHROMA coefficients (see Bartsch and Wakefield (2001)) and entropy features (Misra et al., 2004). In this work we have applied this algorithm for the music segmentation. There are five classes recognized: speech, speech+noise, speech+music, music and others.
As our database is not labeled with these classes, we cannot train our own models for each of them, so, for the recognition, we used the same models that were trained in Gallardo and San-Segundo (2010). Once the segmentation is carried out, we only remove music and others segments from the speech segments detected by the VAD module (see diagram in Figure 1).

We carried out some experiments varying the insertion penalty in the music recognition system. The higher this term, the higher the number of segments labeled as music or others. Three kinds of experiments have been carried out applying the music detection module: applying only the music detection to the baseline system, applying it in combination with F0, and applying it in combination with F0 and the noise reduction module. For these experiments the F0 weight has been set to 0.05 when we apply noise reduction

and 0.15 when we do not (these were the two best systems in the previous section, respectively the third and fourth rows in Table 2). In Figure 4, the precision and recall for the three studied systems across different insertion penalty values is presented. These three systems reach the best precision values with an insertion penalty of 15 (using F0 and applying noise reduction and music detection) and 5 (for the two systems that do not use noise reduction). Higher values of this term allow more changes between classes, which means, in the end, more segments categorized as music. In fact, even if we lose more speech segments wrongly labelled as music, as long as we discard enough real music segments, the clusters generated with the remaining segments will be purer. Removing more segments, especially if they are likely to be music, could reduce the amount of speech recovered but, as long as the precision of the clusters increases and we still have enough data, the voices generated with these clusters should be more accurate. In fact, if we remove too much speech we are not only reducing the data available for voice building, but the models trained by the diarization system will be less accurate and, therefore, the final segmentation will have more errors.

The best numerical result (in precision) for this method is included in Table 2, sixth row (with noise reduction and an insertion penalty of 15). However, in the fifth and seventh rows, the best results for the two other systems with music detection are presented (no noise reduction, an insertion penalty term of 5, and use or not of the F0 features). We can see that even though the precision values are a bit lower, the merit factor of these two systems surpasses that of the system with the best precision value (in which we applied noise reduction). The noise reduction module apparently strongly affects the recall of the system.
This can be due to the high insertion penalty defined for the music detection module when noise reduction is also used. For comparison purposes we have included results with the baseline, noise reduction, F0 and music detection module when the insertion penalty is 5 (the same as the two systems without noise reduction). The precision result decreases while the recall increases, but not enough to reach the performance in merit factor of either of the other two systems where no noise reduction is applied. Our task implies maximizing precision, but we want to maintain a certain level of recall and, considering the variation in the merit factor, we cannot yet decide between these options. Experiments with the test set will show if one of them turns out clearly better.

Figure 4: Precision and recall versus insertion penalty of the music recognizer for the development database (precision and recall curves for baseline+NR+F0+MR, baseline+MR and baseline+F0+MR, with insertion penalties from -15 to 15). F0_weight=0.05 for the system with noise reduction and 0.15 for the system without it.

5 Results with the test set and discussion

In this section we will contrast the results of the development set with a new set, not used until now: the test set. The first modification tried over the development set was to increase the initial number of hypothetical speakers. This modification did not improve diarization even for the development set; thus, a test evaluation with a different set of sessions is not necessary. The second group of experiments was focused on optimizing the systems using or not F0 and a noise reduction Wiener filter. At this point it was not clear whether we should use noise filtering or not. Both systems delivered similar performance in precision and merit factor. Thus we decided to keep both systems in further experiments.
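For reference, the two summary quantities used throughout Tables 2 and 3 — the merit factor (precision weighted by two thirds, recall by one third) and the relative precision error improvement — can be computed as below; the helper names are ours, not the paper's.

```python
def merit_factor(precision, recall):
    # Weighted figure of merit from Section 4: precision counts 2/3, recall 1/3.
    return (2 * precision + recall) / 3

def rel_precision_error_improvement(p_base, p_sys):
    # Relative reduction of the precision *error* (1 - precision) vs the baseline.
    return ((1 - p_base) - (1 - p_sys)) / (1 - p_base)

# Test-set figures quoted in Section 5 (precision 90.62%, recall 84.38%).
print(round(merit_factor(90.62, 84.38), 2))
# Generic illustration: halving the error is a 50% relative improvement.
print(rel_precision_error_improvement(0.90, 0.95))
```

Applied to the best test-set system, this gives a merit factor of 88.54, above the 86.14 reported for the competing configurations, which is consistent with the discussion in this section.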
Finally, in the last experiments with the development set, we tried to take advantage of a music detection module. This module is applied alone and in combination with the two previous ones, adjusting the insertion penalty term for each one. The three of them achieved high relative precision error improvements (24.51%, 22.88%

and 22.36%). For comparison purposes, we also included the performance of the system with the best precision result but with an insertion penalty of 5 (eighth row of Table 2, 22.35% precision error improvement). However, these systems still had very similar precision, and the best one heavily degrades its recall and, consequently, its merit factor, so we decided to check all of them with the test set.

The experiments we have carried out with the test set to check our findings are included in Table 3. When there is no music reduction, the use of F0 decreases the precision error by 17.12% for the system with noise reduction, which is much more than the 3.99% achieved when no noise reduction is applied (second and third rows in Table 3). However, the use of noise reduction, as we have seen before, heavily reduces the recall of the system, and the merit factors of these two systems turn out very similar (86.14 vs 86.04).

When we include the music detection module, the system with noise reduction (fifth and seventh rows in Table 3) has the same problem we have been noticing. The recall is heavily reduced by the combination of noise reduction and music reduction, this time affecting the precision as well, which increases much less than in the two other systems with music reduction. The two systems without noise reduction clearly outperform the rest, because not only does the precision increase, but also the merit factor. In this case, the use of music reduction alone is slightly better than its combination with the F0 features.
The precision, in this case, reaches 90.62%, and the recall decreases to 84.38% (vs a precision of 90.22% and a recall of 84.01% for the system without noise reduction and with F0; and precisions of 87.76% and 88.95% and recalls of 80.81% and 78.92% for the system with noise reduction, F0 and insertion penalties of 15 and 5 respectively), and therefore the merit factor increases significantly. We obtain with this system a relative decrease of the precision error of 39.64% over the test set. We can also see that, for the test set, the use of the music reduction system decreases the DER value of the baseline by more than one point, which means that we are not discarding much clear speech, and the diarization system can model the speakers better.

Finally, in Table 4, the results obtained with the different styles of the test set are presented. The precision in speaker diarization ranges from 76.18% for reportages to 99.93% for meteorology recordings. The set of reportages is more difficult (it is the only one with precision below 90%) due to noise and the high number of different speakers that can participate (see Table 1). In future work new strategies should be drawn up in order to tackle this problem.

Table 4: Precision, recall and merit factor for the different styles in the test set (interviews, meteorology, news, reportages, scientific press, sports, talk shows and all styles together).

6 Conclusions

In this paper we have analysed the task of unsupervised diarization focused on obtaining pure speaker recordings in order to synthesize voices. With this purpose we have slightly modified the traditional task of diarization. We have focused on recovering pure speaker clusters, even if we have to discard many segments, or speakers, overlapped with other speakers or noises. For this objective we have defined a merit factor that weights the precision and the recall. We have studied the application of some modules from the UPM diarization system and the UPM music detection module.
We have proved that by using the music recognition module we can decrease the precision error by 22.36% for the development set and 39.64% for the test set, also improving the merit factor. The noise reduction module in combination with the music reduction module makes the system lose too many speech segments, reducing the recall, and thus the merit factor, making this combination undesirable. Results using F0 in combination with music detection were slightly better for the development set and slightly worse for the test set; therefore, we cannot prove its usefulness for this task.

7 Acknowledgements

The work leading to these results has received funding from the European Union under grant agreement n . It has also been supported by the TIMPANO (TIN C05-03), INAPRA (MICINN, DPI C02-02) and MA2VICMR (Comunidad Autónoma de Madrid, S2009/TIC-1542) projects.

References

Anguera, X., S. Bozonnet, N. W. D. Evans, C. Fredouille, G. Friedland, and O. Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, February.

Bartsch, M. A. and G. H. Wakefield. 2001. To catch a chorus: using chroma-based representations for audio thumbnailing. IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, pp .

Clark, R. and S. King. 2012, March. [Online]. Available:

Gallardo-Antolin, A. and J. M. Montero. 2010. Histogram Equalization-Based Features for Speech, Music, and Song Discrimination. IEEE Signal Processing Letters, vol. 17, no. 7, pp , July.

Gallardo, A. and R. San-Segundo. 2010. UPM-UC3M system for music and speech segmentation. Jornadas de Tecnología del Habla, FALA, November.

Izumitani, T., R. Mukai, and K. Kashino. 2008. A background music detection method based on robust feature extraction. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, pp .

Lavner, Y. and D. Ruinskiy. 2009. A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation. EURASIP Journal on Audio, Speech, and Music Processing.

Lorenzo-Trueba, J., B. Martínez, R. Barra-Chicote, V. López-Ludeña, J. Ferreiros, J. Yamagishi and J. M. Montero. 2012. Towards an Unsupervised Speaking Style Voice Building Framework: Multi-Style Speaker Diarization. InterSpeech 2012, Portland (Oregon).

Lorenzo-Trueba, J., R. Barra-Chicote, J. Yamagishi, O. Watts and J. M. Montero. 2013. Towards Speaking Style Transplantation in Speech Synthesis. In Proceedings of the 8th ISCA Speech Synthesis Workshop (SSW8), August 31st - September 2nd.
Martínez-González, B., J. M. Pardo, J. D. Echeverry-Correa, J. A. Vallejo-Pinto and R. Barra-Chicote. 2012. Selection of TDOA parameters for MDM speaker diarization. Interspeech 2012, Portland (Oregon).

Misra, H., S. Ikbal, H. Bourlard and H. Hermansky. 2004. Spectral entropy based feature for robust ASR. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 1, pp. I-193-6, May.

Moreno-Sandoval, A., G. De la Madrid, M. Alcántara, A. Gonzalez, J. M. Guirao and R. De la Torre. 2005. The Spanish corpus. In C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins Publishing Company, pp .

Panagiotakis, C. and G. Tziritas. 2005. A Speech/Music Discriminator Based on RMS and Zero-Crossings. IEEE Transactions on Multimedia, vol. 7, no. 1, February.

Pardo, J. M., X. Anguera, and C. Wooters. 2007. Speaker diarization for multiple distant microphone meetings using several sources of information. IEEE Transactions on Computers, vol. 56, no. 9, pp , September.

Pardo, J. M., R. Barra-Chicote, R. San-Segundo, R. de Cordoba, and B. Martinez-Gonzalez. 2012. Speaker diarization features: The UPM contribution to the RT09 evaluation. IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 2, pp .

Wooters, C. and M. Huijbregts. 2007. The ICSI RT07s Speaker Diarization System. In Proceedings of the Second International Workshop on Classification of Events, Activities, and Relationships (CLEAR 2007) and the Fifth Rich Transcription 2007 Meeting Recognition (RT 2007), Baltimore, Maryland, pp .

Tesis


Procesamiento del Lenguaje Natural, Revista nº 52, marzo de 2014

Diseño y generación semi-automática de patrones adaptables para el Reconocimiento de Entidades

Design and semi-automatic generation of adaptable patterns for Entity Recognition

Mónica Marrero
Universidad Carlos III de Madrid
Avenida de la Universidad 30, Leganés

Abstract: PhD thesis in Computer Science written by Mónica Marrero Llinares at Universidad Carlos III de Madrid under the supervision of Dr. Sonia Sánchez Cuadrado and Dr. Jorge Morato Lara. The defense took place on Tuesday, 21 May 2013, before a committee formed by Juan Lloréns Morillo (Universidad Carlos III de Madrid), Rafael Valencia García (University of Murcia) and Roberto Carniel (University of Udine). The thesis received the international mention and was unanimously awarded the grade of Excellent Cum Laude.

Keywords: Information Extraction, Named Entity Recognition, Automatic Pattern Generation

1 Introduction

Digital information is one of the main assets of any organization, and managing it well can determine success in many sectors.
This management involves not only handling the information generated within the organization, but also being able to exploit the information that surrounds us, which can yield competitive advantages: market studies, opinion analysis, security, technology watch, and so on. In this setting, Information Extraction becomes especially important: its goal is to retrieve directly the pieces of information we are interested in, rather than whole documents as in Information Retrieval, thereby reducing the amount of information we need to read.

One area of Information Extraction is Named Entity Recognition (NER), which aims to identify semantically relevant units in a text (e.g., persons, locations and dates in newswire text). It is an especially active research area in fields such as Biomedicine (Marrero et al., 2010a), with applications in other areas related to information management and natural language processing, such as semantic annotation, question answering systems, ontology population and opinion analysis.

The core of NER systems are the patterns capable of recognizing the entities of interest. Generating these patterns manually is complex, so it is common to use automatic generation methods that learn from annotated corpora. However, the effectiveness of these methods is frequently at odds with the effort that annotating the corpora demands from the user.

After an introduction to the area and its background in chapters 1 and 2, the thesis analyzes the tools, goals and evaluation of the area in chapters 3 and 4, and shows that NER is far from being a solved problem, contrary to what has been claimed in the literature. It then presents a method capable of generating patterns that adapt to the new entity types and domains demanded today, and that is more consistent with the reality of not having annotated corpora available. To that end, chapter 5 designs a representation model that supports patterns which are:

Flexible: they can incorporate different types of text attributes capable of describing the entities or their context.

Powerful: they can recognize different language structures.

Editable: readable and based on standards, so that small changes can be handled by editing the patterns instead of regenerating them, with the costs that regeneration entails.

Chapter 6 describes a semi-automatic method for generating these patterns that does not require annotated corpora as a starting point; instead, it guides the user through the annotation process. The effectiveness/annotation-cost trade-off achieved is evaluated in chapter 7, which shows better rates than the state of the art. Finally, chapters 8 and 9 present the conclusions and future work of the thesis.

2 Main Contributions

Named Entity Recognition has been evaluated in major international forums. The high effectiveness rates observed in two of these forums, the Message Understanding Conference (MUC) (Grishman and Sundheim, 1996) and the Conference on Computational Natural Language Learning (CoNLL) (Sang and Meulder, 2003), between the late 1990s and early 2000s might suggest that the task of recognizing entities in text is solved, and this has indeed been claimed in the literature (Cunningham, 2005). The thesis examines the evaluation of the area in depth and shows that it has stagnated on the recognition of typical entity types, for which annotated resources are usually available.
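The semi-automatic, user-guided generation described for chapter 6 can be pictured as a bootstrapping loop: candidate matches are proposed, the user validates them, and confirmed examples refine the pattern set. The sketch below is only illustrative (the regex generalization, the `oracle` callback and all names are assumptions, not the thesis's actual method):

```python
# Illustrative bootstrapping loop for semi-automatic pattern generation:
# instead of a pre-annotated corpus, a user "oracle" confirms or rejects
# candidate matches, and confirmed examples yield new patterns.
import re

def generalize(example):
    """Build a naive regex pattern from a confirmed entity string (digits -> \\d)."""
    return re.sub(r"\d", r"\\d", re.escape(example))

def bootstrap(corpus, seed, oracle, rounds=3):
    patterns, confirmed = {generalize(seed)}, {seed}
    for _ in range(rounds):
        candidates = {m for p in patterns for m in re.findall(p, corpus)}
        for cand in candidates - confirmed:
            if oracle(cand):                    # user validates the candidate
                confirmed.add(cand)
                patterns.add(generalize(cand))  # refine the pattern set
    return confirmed

corpus = "Trains leave at 08:15, 09:30 and 21:45; gate B2 opens at 07:50."
hours = bootstrap(corpus, "08:15", oracle=lambda s: ":" in s)
print(sorted(hours))  # ['07:50', '08:15', '09:30', '21:45']
```

From a single seed, the loop reaches all four time expressions while asking the user only about strings the current patterns already match, which is the cost reduction the thesis targets.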
The contributions of this thesis begin by showing that the validity of these forums is limited with respect to current needs. NER must move towards more dynamic evaluation forums and more practical systems, capable of adapting to the new semantics demanded and of easing the annotation of the resources needed for evaluation (Marrero et al., 2009; Marrero et al., 2013). The main contribution of this work is an alternative to the state-of-the-art generation methods that proves more adaptable to the recognition of new entities and more practical, since it does not require corpora to be annotated beforehand. This alternative consists of a pattern representation model and a method for generating those patterns. The results obtained show that it reduces the cost needed to reach the same effectiveness levels as similar methods.

2.1 Representation Model

In order to incorporate any language attribute that may help identify entities (e.g., part of speech, orthography, etc.), a pattern representation model that supports such attributes is needed. An analysis of existing models shows that none combines the requirements of power, flexibility and readability. A representation model based on Context-Free Grammars (CFG) is therefore designed, with the same power and readability but additionally allowing several attributes to be used at once. To this end, the new model introduces functions associated with the non-terminal symbols. These functions represent conditions (attribute-value pairs) on the grammar's input strings (Figure 1). This new model is named Information Extraction Grammar (IEG). The IEG model can represent regular and context-free languages.

Figure 1: The entities on the left can be represented with the IEG pattern on the right. The non-terminal C has an associated function stating that its value must be lower than 25, and the non-terminal D has an associated function stating that the Semantics attribute of the recognized strings must be id-time.
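The attribute conditions of Figure 1 can be sketched minimally as below. The token attributes, helper names and sequence-only matching are illustrative assumptions; the actual IEG formalism supports full context-free productions:

```python
# Minimal sketch of an IEG-style pattern: non-terminals carry attribute-value
# conditions that are evaluated on the input tokens (names are illustrative).

def match_seq(rule, tokens):
    """True if the tokens satisfy the rule's conditions, one per token, in order."""
    return len(tokens) == len(rule) and all(cond(tok) for cond, tok in zip(rule, tokens))

# Non-terminal C: the token's numeric value must be lower than 25 (as in Figure 1).
C = lambda tok: tok.get("value") is not None and tok["value"] < 25
# Non-terminal D: the token's Semantics attribute must be 'id-time'.
D = lambda tok: tok.get("semantics") == "id-time"

# Production S -> C D, e.g. matching a time expression such as "23 horas".
S = [C, D]

tokens_ok  = [{"surface": "23", "value": 23}, {"surface": "horas", "semantics": "id-time"}]
tokens_bad = [{"surface": "30", "value": 30}, {"surface": "horas", "semantics": "id-time"}]
print(match_seq(S, tokens_ok))   # True
print(match_seq(S, tokens_bad))  # False
```

Because each non-terminal is just a predicate over token attributes, new attribute types (orthography, part of speech, gazetteer membership) can be mixed freely in one pattern, which is the flexibility requirement the model is designed around.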


FICHA MEMORIA DOCENTE Curso Académico 2006/ 07 FICHA Curso Académico 2006/ 07 / CODE 3104 COURSE NAME/TITLE Informatics DEGREE Agricultural, forestry, engineering and food technology TYPE Optative ORIENTATION All ESTUDIES PROGRAM 1999 CYCLE 1 COURSE

Más detalles

Art Studio. Did you know...?

Art Studio. Did you know...? Art Studio Did you know...? Did you know...? In our Art Studio, we encourage children to use the materials in any way they wish. We provide ideas that they may use to begin work but do not expect copies

Más detalles

Matemáticas Muestra Cuadernillo de Examen

Matemáticas Muestra Cuadernillo de Examen Matemáticas Muestra Cuadernillo de Examen Papel-Lápiz Formato Estudiante Español Versión, Grados 3-5 Mathematics Sample Test Booklet Paper-Pencil Format Student Spanish Version, Grades 3 5 Este cuadernillo

Más detalles

ESTUDIO, PLANIFICACIÓN Y GESTIÓN DE LA IMPLEMENTACIÓN DE UN SISTEMA BIG DATA PARA LA MONITORIZACIÓN EXTREMO A EXTREMO DE SERVICIOS DE CLIENTE

ESTUDIO, PLANIFICACIÓN Y GESTIÓN DE LA IMPLEMENTACIÓN DE UN SISTEMA BIG DATA PARA LA MONITORIZACIÓN EXTREMO A EXTREMO DE SERVICIOS DE CLIENTE ESTUDIO, PLANIFICACIÓN Y GESTIÓN DE LA IMPLEMENTACIÓN DE UN SISTEMA BIG DATA PARA LA MONITORIZACIÓN EXTREMO A EXTREMO DE SERVICIOS DE CLIENTE Autor: Giménez González, José Manuel. Director: Romero Orobio,

Más detalles

KEY ENGLISH TEST (K.E.T.)

KEY ENGLISH TEST (K.E.T.) KEY ENGLISH TEST (K.E.T.) El examen KET for Schools corresponde al primer examen que rinden nuestros alumnos de Sexto Básico, de la serie denominada Cambridge Main Suite, la cual incluye posteriormente

Más detalles

GUIDE FOR PARENT TEACHER CONFERENCES

GUIDE FOR PARENT TEACHER CONFERENCES GUIDE FOR PARENT TEACHER CONFERENCES A parent-teacher conference is a chance for you and your child s teacher to talk. You can talk about how your child is learning at home and at school. This list will

Más detalles

http://mvision.madrid.org

http://mvision.madrid.org Apoyando el desarrollo de carrera de investigadores en imagen biomédica Supporting career development of researchers in biomedical imaging QUÉ ES M+VISION? WHAT IS M+VISION? M+VISION es un programa creado

Más detalles

Learning Masters. Fluent: States of Matter

Learning Masters. Fluent: States of Matter Learning Masters Fluent: States of Matter What I Learned List the three most important things you learned in this theme. Tell why you listed each one. 1. 2. 3. 22 States of Matter Learning Masters How

Más detalles

Passaic County Technical Institute 45 Reinhardt Road Wayne, New Jersey 07470

Passaic County Technical Institute 45 Reinhardt Road Wayne, New Jersey 07470 Note: Instructions in Spanish immediately follow instructions in English (Instrucciones en español inmediatamente siguen las instrucciónes en Inglés) Passaic County Technical Institute 45 Reinhardt Road

Más detalles

DESARROLLO DE UN PROGRAMA DE CONTABILIDAD FINANCIERA Autor: Rodríguez Díez, Guillermo. Director: Fernández García, Mercedes.

DESARROLLO DE UN PROGRAMA DE CONTABILIDAD FINANCIERA Autor: Rodríguez Díez, Guillermo. Director: Fernández García, Mercedes. DESARROLLO DE UN PROGRAMA DE CONTABILIDAD FINANCIERA Autor: Rodríguez Díez, Guillermo. Director: Fernández García, Mercedes. RESUMEN DEL PROYECTO En este proyecto se ha desarrollado una aplicación de contabilidad

Más detalles

IMPLANTACIÓN DE UNA SOLUCIÓN PLM QUE GARANTICE LAS CLAVES Y PRINCIPIOS RECOGIDOS POR EL SISTEMA DE GESTIÓN DE LA CALIDAD SIX SIGMA

IMPLANTACIÓN DE UNA SOLUCIÓN PLM QUE GARANTICE LAS CLAVES Y PRINCIPIOS RECOGIDOS POR EL SISTEMA DE GESTIÓN DE LA CALIDAD SIX SIGMA IMPLANTACIÓN DE UNA SOLUCIÓN PLM QUE GARANTICE LAS CLAVES Y PRINCIPIOS RECOGIDOS POR EL SISTEMA DE GESTIÓN DE LA CALIDAD SIX SIGMA Autor: Prats Sánchez, Juan. Director: Díaz Carrillo, Gerardo. Entidad

Más detalles

HERRAMIENTA PARA LA OPTIMIZACIÓN DEL PORFOLIO DE PRODUCTOS DE LAS REDES DE VENTAS DE UN LABORATORIO FARMACÉUTICO

HERRAMIENTA PARA LA OPTIMIZACIÓN DEL PORFOLIO DE PRODUCTOS DE LAS REDES DE VENTAS DE UN LABORATORIO FARMACÉUTICO HERRAMIENTA PARA LA OPTIMIZACIÓN DEL PORFOLIO DE PRODUCTOS DE LAS REDES DE VENTAS DE UN LABORATORIO FARMACÉUTICO Autor: Tárano Pastor, Ramón. Director: Moreno Alonso, Pablo. Director: Ruiz del Palacio,

Más detalles

Esta fase termina presentando el producto diseñado para cumplir todas estas necesidades.

Esta fase termina presentando el producto diseñado para cumplir todas estas necesidades. Resumen Autor: Directores: Alfonso Villegas García de Zúñiga Eduardo García Sánchez Objetivo El objetivo de este proyecto es estudiar la creación e implantación de un módulo de habitabilidad portátil.

Más detalles

Learning Masters. Fluent: Wind, Water, and Sunlight

Learning Masters. Fluent: Wind, Water, and Sunlight Learning Masters Fluent: Wind, Water, and Sunlight What I Learned List the three most important things you learned in this theme. Tell why you listed each one. 1. 2. 3. 22 Wind, Water, and Sunlight Learning

Más detalles

Learning Masters. Fluent: Animal Habitats

Learning Masters. Fluent: Animal Habitats Learning Masters Fluent: Animal Habitats What I Learned List the three most important things you learned in this theme. Tell why you listed each one. 1. 2. 3. 22 Animal Habitats Learning Masters How I

Más detalles

THE BILINGUAL CLASSROOM: CONTENT AND LANGUAGE INTEGRATED LEARNING

THE BILINGUAL CLASSROOM: CONTENT AND LANGUAGE INTEGRATED LEARNING THE BILINGUAL CLASSROOM: CONTENT AND LANGUAGE INTEGRATED LEARNING Curso de: Carolina Fernández del Pino Vidal Nº Horas 110 h. /11 créditos (0,5000 puntos) Matricula AFILIADOS A ANPE Y U.P. COMILLAS NO

Más detalles

UNIVERSIDAD PONTIFICIA COMILLAS ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA (ICAI) INGENIERO INDUSTRIAL RESUMEN. Resumen

UNIVERSIDAD PONTIFICIA COMILLAS ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA (ICAI) INGENIERO INDUSTRIAL RESUMEN. Resumen RESUMEN Resumen 1 RESUMEN El uso de túneles de viento ha ido proliferando al mismo ritmo que la aeronáutica y otras disciplinas relacionadas con la ingeniería lo han hecho a lo largo del s. XX. Este tipo

Más detalles

UNIVERSIDAD DE PUERTO RICO RECINTO DE RÍO PIEDRAS PLAN DE ASSESSMENT DEL APRENDIZAJE ESTUDIANTIL

UNIVERSIDAD DE PUERTO RICO RECINTO DE RÍO PIEDRAS PLAN DE ASSESSMENT DEL APRENDIZAJE ESTUDIANTIL 1 UNIVERSIDAD DE PUERTO RICO RECINTO DE RÍO PIEDRAS PLAN DE ASSESSMENT DEL APRENDIZAJE ESTUDIANTIL PARTE I - DOMINIOS DE LA MISIÓN DEL RECINTO Programa académico o Concentración: Medula en Administración

Más detalles

Pages: 171. Dr. Olga Torres Hostench. Chapters: 6

Pages: 171. Dr. Olga Torres Hostench. Chapters: 6 Pages: 171 Author: Dr. Olga Torres Hostench Chapters: 6 1 General description and objectives The aim of this course is to provide an in depth analysis and intensive practice the various computerbased technologies

Más detalles

manual de agua potable y saneamiento Most of the time, manual de agua potable y saneamiento is just instructions regarding how to install the system.

manual de agua potable y saneamiento Most of the time, manual de agua potable y saneamiento is just instructions regarding how to install the system. manual de agua potable y saneamiento Most of the time, manual de agua potable y saneamiento is just instructions regarding how to install the system. 2 manual de agua potable y saneamiento MANUAL DE AGUA

Más detalles

Presentación Estrategias de Búsquedas en ISI Web of Science. M. Gavilan Sistema de Bibliotecas UACh

Presentación Estrategias de Búsquedas en ISI Web of Science. M. Gavilan Sistema de Bibliotecas UACh Presentación Estrategias de Búsquedas en ISI Web of Science M. Gavilan Sistema de Bibliotecas UACh Bases de datos Qué es una base de datos BIBLIOGRÁFICA? es una colección organizada de registros. registro

Más detalles

Puede pagar facturas y gastos periódicos como el alquiler, el gas, la electricidad, el agua y el teléfono y también otros gastos del hogar.

Puede pagar facturas y gastos periódicos como el alquiler, el gas, la electricidad, el agua y el teléfono y también otros gastos del hogar. SPANISH Centrepay Qué es Centrepay? Centrepay es la manera sencilla de pagar sus facturas y gastos. Centrepay es un servicio de pago de facturas voluntario y gratuito para clientes de Centrelink. Utilice

Más detalles

The Home Language Survey (HLS) and Identification of Students

The Home Language Survey (HLS) and Identification of Students The Home Language Survey (HLS) and Identification of Students The Home Language Survey (HLS) is the document used to determine a student that speaks a language other than English. Identification of a language

Más detalles

TEACHER TOOLS: Teaching Kids Spanish Vocabulary. An Activity in 4 Steps

TEACHER TOOLS: Teaching Kids Spanish Vocabulary. An Activity in 4 Steps TEACHER TOOLS: Teaching Kids Spanish Vocabulary An Activity in 4 Steps Teaching Kids Spanish Vocabulary Lesson for Spanish Teachers Learning new vocabulary words in Spanish is an important element in the

Más detalles

Entrevista: el medio ambiente. A la caza de vocabulario: come se dice en español?

Entrevista: el medio ambiente. A la caza de vocabulario: come se dice en español? A la caza de vocabulario: come se dice en español? Entrevista: el medio ambiente 1. There are a lot of factories 2. The destruction of the ozone layer 3. In our city there is a lot of rubbish 4. Endangered

Más detalles

GENERAL INFORMATION Project Description

GENERAL INFORMATION Project Description RESULTADOS! GENERAL INFORMATION Project Description The campaign "Adopt a car " had as its main objective to position Autoplaza, the main automotive selling point of Chile, as a new car sales location

Más detalles

Diseño ergonómico o diseño centrado en el usuario?

Diseño ergonómico o diseño centrado en el usuario? Diseño ergonómico o diseño centrado en el usuario? Mercado Colin, Lucila Maestra en Diseño Industrial Posgrado en Diseño Industrial, UNAM lucila_mercadocolin@yahoo.com.mx RESUMEN En los últimos años el

Más detalles