Papers and Publications

Domain Control for Neural Machine Translation


Comments: Published in RANLP 2017
Subjects: Computation and Language (cs.CL)



arXiv:1612.06140
[pdf,
other]


Abstract: Machine translation systems are very sensitive to the domains they were trained on. Several domain adaptation techniques have been deeply studied. We propose a new technique for neural machine translation (NMT) that we call domain control, which is performed at runtime using a unique neural network covering multiple domains. The presented approach shows quality improvements when compared to dedicated domains translating on any of the covered domains and even on out-of-domain data. In addition, model parameters do not need to be re-estimated for each domain, making this effective for real use cases. Evaluation is carried out on English-to-French translation for two different testing scenarios. We first consider the case where an end-user performs translations on a known domain. Secondly, we consider the scenario where the domain is not known and is predicted at the sentence level before translating. Results show consistent accuracy improvements for both conditions.



Catherine Kobus,
Josep Crego,
Jean Senellart


[v2] Tue, 12 Sep 2017 12:01:40 GMT


SYSTRAN Pure Neural Machine Translation


Neural Machine Translation:


Each of us has experienced or heard of deep learning in day-to-day business applications. What are the fundamentals of this new technology, and what new opportunities does it offer?


Jan 31, 2017


Neural Machine Translation from Simplified Translations


Comments: Submitted to EACL 2017 short paper
Subjects: Computation and Language (cs.CL)



arXiv:1612.06139
[pdf,
ps,
other]


Abstract: Text simplification aims at reducing the lexical, grammatical and structural complexity of a text while keeping the same meaning. In the context of machine translation, we introduce the idea of simplified translations in order to boost the learning ability of deep neural translation models. We conduct preliminary experiments showing that translation complexity is actually reduced in a translation of a source bi-text compared to the target reference of the bi-text while using a neural machine translation (NMT) system learned on the exact same bi-text. Based on the knowledge distillation idea, we then train an NMT system using the simplified bi-text, and show that it outperforms the initial system that was built over the reference data set. Performance is further boosted when both reference and automatic translations are used to learn the network. We perform an elementary analysis of the translated corpus and report accuracy results of the proposed approach on English-to-French and English-to-German translation tasks.



Josep Crego,
Jean Senellart


[v1] Mon, 19 Dec 2016 11:50:58 GMT


Domain specialization: a post-training domain adaptation for Neural Machine Translation


Comments: Submitted to EACL 2017 short paper
Subjects: Computation and Language (cs.CL)



arXiv:1612.06141
[pdf,
other]


Abstract: Domain adaptation is a key feature in Machine Translation. It generally encompasses terminology, domain and style adaptation, especially for human post-editing workflows in Computer Assisted Translation (CAT). With Neural Machine Translation (NMT), we introduce a new notion of domain adaptation that we call "specialization", which shows promising results in both learning speed and adaptation accuracy. In this paper, we propose to explore this approach from several perspectives.



Christophe Servan,
Josep Crego,
Jean Senellart


[v1] Mon, 19 Dec 2016 11:52:08 GMT


SYSTRAN's Pure Neural Machine Translation Systems


Subjects: Computation and Language (cs.CL)



arXiv:1610.05540
[pdf,
ps,
other]


Abstract: Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing the roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations, and the training process of such systems is usually very long, often a few weeks, so the role of experimentation is critical and important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss evaluation methodology, present our first findings, and finally outline further work.
Our ultimate goal is to share our expertise to build competitive production systems for "generic" translation. We aim at contributing to set up a collaborative framework to speed-up adoption of the technology, foster further research efforts and enable the delivery and adoption to/by industry of use-case specific engines integrated in real production workflows. Mastering of the technology would allow us to build translation engines suited for particular needs, outperforming current simplest/uniform systems.



Josep Crego,
Jungi Kim,
Guillaume Klein,
Anabel Rebollo,
Kathy Yang,
Jean Senellart,
Egor Akhanov,
Patrice Brunelle,
Aurelien Coquard,
Yongchao Deng,
Satoshi Enoue,
Chiyo Geiss,
Joshua Johanson,
Ardas Khalsa,
Raoum Khiari,
Byeongil Ko,
Catherine Kobus,
Jean Lorieux,
Leidiana Martins,
Dang-Chuan Nguyen,
Alexandra Priori,
Thomas Riccardi,
Natalia Segal,
Christophe Servan,
Cyril Tiquet,
Bo Wang,
Jin Yang,
Dakun Zhang,
Jing Zhou,
Peter Zoldan


[v1] Tue, 18 Oct 2016 11:32:42 GMT


System Combination RWTH Aachen - SYSTRAN for the NTCIR-10 PatentMT Evaluation 2013


Abstract: This paper describes the joint submission by RWTH Aachen University and SYSTRAN in the Chinese-English Patent Machine Translation Task at the 10th NTCIR Workshop. We specify the statistical systems developed by RWTH Aachen University and the hybrid machine translation systems developed by SYSTRAN. We apply RWTH Aachen’s combination techniques to create consensus hypotheses from very different systems: phrase-based and hierarchical SMT, rule-based MT (RBMT) and MT with statistical post-editing (SPE). The system combination was ranked second in BLEU and second in the human adequacy evaluation in this competition.


Minwei Feng, Markus Freitag, Hermann Ney, Bianka Buschbeck, Jean Senellart, Jin Yang


June 18-21, 2013, Tokyo, Japan


SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT 2011


Abstract: This report describes SYSTRAN’s Chinese-English and English-Chinese machine translation systems that participated in the CWMT 2011 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we performed statistical post-editing with the provided bilingual and monolingual training corpora. In this report, we describe the technology behind the systems, the training data, and finally the evaluation results in the CWMT 2011 evaluation. Our primary Chinese-English system was ranked first in BLEU in the translation tasks.


Jin Yang, Satoshi Enoue, Jean Senellart


Proceedings of the 7th China Workshop on Machine Translation (CWMT), September 2011.


Convergence of Translation Memory and Statistical Machine Translation


Abstract: We present two methods that merge ideas from statistical machine translation (SMT) and translation memories (TM). We use a TM to retrieve matches for source segments, and replace the mismatched parts with instructions to an SMT system to fill in the gap. We show that for fuzzy matches of over 70%, one method outperforms both SMT and TM baselines.


Philipp Koehn, Jean Senellart


JEC, November 2010.
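
The abstract above refers to TM "fuzzy matches of over 70%" without defining the score. As a rough illustration only, the sketch below assumes a common token-level definition, one minus the edit distance divided by the longer segment length, and filters a toy memory at a 0.7 threshold. The function names and the scoring formula are assumptions made for this example, not the method used in the paper.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def fuzzy_match_score(source, tm_source):
    """Assumed score: 1 - edit_distance / length of the longer segment."""
    src, tm = source.split(), tm_source.split()
    longest = max(len(src), len(tm))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(src, tm) / longest


def best_tm_match(source, memory, threshold=0.7):
    """Return (score, tm_source, tm_target) for the best entry scoring
    above the threshold, or None when no entry qualifies."""
    best = None
    for tm_source, tm_target in memory:
        score = fuzzy_match_score(source, tm_source)
        if score > threshold and (best is None or score > best[0]):
            best = (score, tm_source, tm_target)
    return best
```

With this definition, a six-token segment that differs from a TM entry in one token scores 5/6, so it would count as a fuzzy match above 70%.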


Fast Approximate String Matching with Suffix Arrays and A* Parsing


Abstract: We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247 ms for a segment in a realistic scenario.


Philipp Koehn, Jean Senellart


AMTA, October 2010.
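
The abstract above mentions using suffix arrays to detect exact n-gram matches. The sketch below illustrates only that first step, under simplifying assumptions: a naive suffix array built by sorting token suffixes, and a binary search for all positions of an n-gram. The function names are hypothetical, the construction is far slower than a production one, and the A* filtering and parsing stages are not shown.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, sorted
    lexicographically by token sequence (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_ngram(tokens, suffix_array, ngram):
    """Return the sorted corpus positions where `ngram` (a list of tokens)
    occurs exactly, using binary search over the suffix array."""
    n = len(ngram)

    def prefix(rank):
        start = suffix_array[rank]
        return tokens[start:start + n]

    # Lower bound: first suffix whose n-token prefix is >= ngram.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < ngram:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-token prefix is > ngram.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= ngram:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

For example, after `corpus = "the cat sat on the mat the cat".split()` and `sa = build_suffix_array(corpus)`, calling `find_ngram(corpus, sa, ["the", "cat"])` returns the positions of both occurrences of that bigram.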


SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems


Abstract: This report describes both of SYSTRAN's Chinese-English and English-Chinese machine translation systems that participated in the CWMT2009 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we perform statistical post-editing with the provided bilingual and monolingual training corpora. In this report, we describe the technology behind the systems, the training data, and finally the evaluation results in the CWMT2009 evaluation. Our primary systems were top-ranked in the evaluation tasks.


Jin Yang, Satoshi Enoue, Jean Senellart, Tristan Croiset


November 2009, CWMT


Selective addition of corpus-extracted phrasal lexical rules to a rule-based machine translation system


Abstract: In this work, we show how an existing rule-based, general-purpose machine translation system may be improved and adapted automatically to a given domain, whenever parallel corpora are available. We perform this adaptation by extracting dictionary entries from the parallel data. From this initial set, the application of these rules is tested against the baseline performance. Rules are then pruned depending on sentence-level improvements and deteriorations, as evaluated by an automatic string-based metric. Experiments using the Europarl dataset show a 3% absolute improvement in BLEU over the original rule-based system.


Loïc Dugast, Jean Senellart, Philipp Koehn


MT Summit, August 2009.


Statistical Post Editing and Dictionary Extraction: SYSTRAN/Edinburgh submissions for ACL-WMT2009


Abstract: We describe here the two Systran/University of Edinburgh submissions for WMT2009. They involve a statistical post-editing model with a particular handling of named entities (English to French and German to English) and the extraction of phrasal rules (English to French).


Loïc Dugast, Jean Senellart, Philipp Koehn


March 2009.


SMT and SPE Machine Translation Systems for WMT'09


Abstract: This paper describes the development of several machine translation systems for the 2009 WMT shared task evaluation. We only consider the translation between French and English. We describe a statistical system based on the Moses decoder and a statistical post-editing system using SYSTRAN’s rule-based system. We also investigated techniques to automatically extract additional bilingual texts from comparable corpora.


Holger Schwenk, Sadaf Abdul Rauf, Loïc Barrault, Jean Senellart


March 2009.


First Steps towards a General Purpose French/English Statistical Machine Translation System


Abstract: This paper describes an initial version of a general purpose French/English statistical machine translation system. The main features of this system are the open-source Moses decoder, the integration of a bilingual dictionary and a continuous space target language model. We analyze the performance of this system on the test data of the WMT'08 evaluation.


Holger Schwenk, Jean-Baptiste Fouet, Jean Senellart


June 2008.


Can we Relearn an RBMT System?


Abstract: This paper describes SYSTRAN submissions for the shared task of the third Workshop on Statistical Machine Translation at ACL. Our main contribution consists in a French-English statistical model trained without the use of any human-translated parallel corpus. Instead, we translated a monolingual corpus with the SYSTRAN rule-based translation engine to produce the parallel corpus. The results are provided herein, along with a measure of error analysis.


Loïc Dugast, Jean Senellart, Philipp Koehn


June 2008.


SYSTRAN Translation Stylesheets: Machine Translation driven by XSLT


Abstract: XSL Transformation stylesheets are usually used to transform a document described in an XML formalism into another XML formalism, to modify an XML document, or to publish content stored into an XML document to a publishing format (XSL-FO, (X)HTML...). SYSTRAN Translation Stylesheets (STS) use XSLT to drive and control the machine translation of XML documents (native XML document formats or XML representations — such as XLIFF — of other kinds of document formats).


Pierre Senellart, Jean Senellart


September 2005


Intuitive Coding of the Arabic Lexicon


Abstract: SYSTRAN started the design and the development of Arabic, Farsi and Urdu to English machine translation systems in July 2002. This paper describes the methodology and implementation adopted for dictionary building and morphological analysis. SYSTRAN's IntuitiveCoding® technology (ICT) facilitates the creation, update, and maintenance of Arabic, Farsi and Urdu lexical entries, and is more modular and less costly. ICT for Arabic, Farsi, and Urdu requires the implementation of stem-based lexical entries, the authentic scripts for each language, a statistical Arabic stem-guesser, and separate declarative modules for internal and external morphology.


Ali Farghaly, Jean Senellart


MT Summit IX, September 22-26, 2003.


SYSTRAN New Generation: The XML Translation Workflow


Abstract: Customization of Machine Translation (MT) is a prerequisite for corporations to adopt the technology. It is therefore important but nonetheless challenging. Ongoing implementation proves that XML is an excellent exchange device between MT modules that efficiently enables interaction between the user and the processes to reach highly granulated structure-based customization. Accomplished through an innovative approach called the SYSTRAN Translation Stylesheet, this method is coherent with the current evolution of the “authoring process”. As a natural progression, the next stage in the customization process is the integration of MT in a multilingual tool kit designed for the "authoring process".


Jean Senellart, Christian Boitet, Laurent Romary


MT Summit IX, September 22-26, 2003.


SYSTRAN Review Manager


Abstract: The SYSTRAN Review Manager (SRM) is one of the components that comprise the SYSTRAN Linguistics Platform (SLP), a comprehensive enterprise solution for managing MT customization and localization projects. The SRM is a productivity tool used for the review, quality assessment and maintenance of linguistic resources combined with a SYSTRAN solution. The SRM is used in-house by SYSTRAN’s development team and is also licensed to corporate customers as it addresses leading linguistic challenges, such as terminology and homographs, which makes it a key component of the QA process. Extremely flexible, the SRM adapts to localization and MT customization projects from small to large-scale. Its Web-based interface and multi-user architecture enable a centralized and efficient work environment for local and geographically dispersed individual users and teams. Users segment a given corpus to fluidly review and evaluate translations, as well as identify the typology of errors. Corpus metrics, terminology extraction and detailed reporting capabilities facilitate prioritizing tasks, resulting in immediate focus on those issues that significantly impact MT quality. Data and statistics are tracked throughout the customization process and are always available for regression tests and overall project management. This environment is highly conducive to increased productivity and efficient QA in the MT customization effort.


Jean-Cédric Costa, Christiane Panissod


MT Summit IX, September 22-26, 2003.


SYSTRAN Intuitive Coding Technology


Abstract: Customizing a general-purpose MT system is an effective way to improve machine translation quality for specific usages. Building a user-specific dictionary is the first and most important step in the customization process. An intuitive dictionary-coding tool was developed and is now utilized to allow the user to build user dictionaries easily and intelligently. SYSTRAN's innovative and proprietary IntuitiveCoding® technology is the engine powering this tool. It comprises various components: massive linguistic resources, a morphological analyzer, a statistical guesser, a finite-state automaton, and a context-free grammar. Methodologically, IntuitiveCoding® is also a cross-application approach for high quality dictionary building in terminology import and exchange. This paper describes the various components and the issues involved in its implementation. An evaluation frame and utilization of the technology are also presented. Future plans for advancing this technology are outlined.


Jean Senellart, Jin Yang, Anabel Rebollo


MT Summit IX, September 22-26, 2003.


The SYSTRAN Linguistics Platform


Abstract: SYSTRAN's SLP (SYSTRAN Linguistics Platform) is a comprehensive enterprise solution for managing a full range of translation and localization project tasks. The SLP consists of the SYSTRAN machine translation (MT) technology, linguistic resources and tools for project management, corpus analysis and quality evaluation. The underlying platform that supports the SLP is the SYSTRAN WebServer, a client/server application that can be accessed transparently through most common software applications. It supports document formats including HTML, RTF, XML, and SGML. The SYSTRAN WebServer is hosted at the customer’s site and can be integrated with internal translation workflow systems. The SYSTRAN WebServer is a robust and high-volume platform that can support an unlimited number of users, and millions of translation jobs per day.


A software solution to manage multilingual corporate knowledge


October 2002.


SYSTRAN-Autodesk: Resource Alignment and Implicit Transfer


Abstract: In this article we present the concept of "implicit transfer" rules. We will show that they represent a valid compromise between huge direct transfer terminology lists and large sets of transfer rules, which are very complex to maintain. We present a concrete, real-life application of this concept in a customization project (TOLEDO project) concerning the automatic translation of Autodesk (ADSK) support pages. In this application, the alignment is moreover combined with a graph representation substituting linear dictionaries. We show how the concept could be extended to increase coverage of traditional translation dictionaries as well as to extract terminology from large existing multilingual corpora. We also introduce the concept of "alignment dictionary" which seems promising in its ability to extend the pragmatic limits of multilingual dictionary management.


Jean Senellart, Mirko Plitt, Christophe Bailly, Françoise Cardoso


MT Summit 8, September 18-22, 2001.


New Generation SYSTRAN Translation System


Abstract: In this paper, we present the design of the new generation Systran translation systems, currently utilized in the development of English-Hungarian, English-Polish, English-Arabic, French-Arabic, Hungarian-French and Polish-French language pairs. The new design, based on the traditional Systran machine translation expertise and the existing linguistic resources, addresses the following aspects: efficiency, modularity, declarativity, reusability, and maintainability. Technically, the new systems rely on intensive use of state-of-the-art finite automaton and formal grammar implementation. The finite automata provide the essential lookup facilities and the natural capacity of factorizing intuitive linguistic sets. Linguistically, we have introduced a full monolingual description of linguistic information and the concept of implicit transfer. Finally, we present some by-products that are directly derived from the new architecture: intuitive coding tools, spell checker and syntactic tagger.


Jean Senellart, Péter Dienes, Tamás Váradi


MT Summit 8, September 18-22, 2001.

