Leveraging SNOMED CT terms and relations for machine translation of clinical texts from Basque to Spanish

We present a method for machine translation of clinical texts that does not require bilingual clinical texts. Instead, it leverages the rich terminology and structure of the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT), which is considered the most comprehensive multilingual clinical health care terminology collection in the world. We evaluate our method on Basque to Spanish translation, comparing performance with and without clinical domain resources. To leverage domain-specific knowledge, we add to the training corpus bilingual lexical resources previously used for the automatic translation of SNOMED CT into Basque, as well as artificial sentences created from the relations specified in SNOMED CT. Furthermore, we use available Electronic Health Records in Spanish for backtranslation and copying. To assess our proposal, we use Recurrent Neural Network and Transformer architectures, and we try diverse techniques for backtranslation, using not only Neural Machine Translation but also Rule-Based and Statistical Machine Translation systems. We observe large and consistent improvements ranging from 10 to 15 BLEU points, and obtain the best automatic evaluation results when the Transformer is used for both the general architecture and the backtranslation system.
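As a rough illustration of how artificial sentences might be built from terminology relations, the sketch below pairs a bilingual term lexicon with one sentence pattern per relation type. Everything in it is an assumption made for illustration: the `Relation` type, the concept IDs, the lexicon entries, the `finding_site` relation name and the sentence patterns are hypothetical placeholders, and the patterns have not been linguistically validated. It is not the procedure described in the paper.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A (source concept, relation type, target concept) triple."""
    source_id: str
    relation_type: str
    target_id: str


# Hypothetical bilingual lexicon keyed by concept ID: (Basque term, Spanish term).
TERMS = {
    "C1": ("pneumonia", "neumonía"),
    "C2": ("birika", "pulmón"),
}

# Hypothetical sentence patterns per relation type: (Basque, Spanish).
# These are placeholders, not validated sentences in either language.
TEMPLATES = {
    "finding_site": ("{a} {b} atalean kokatzen da", "{a} se localiza en {b}"),
}


def make_pairs(relations, terms, templates):
    """Build (Basque, Spanish) sentence pairs from relation triples.

    Triples are skipped when the relation type has no pattern or when
    either concept is missing from the lexicon.
    """
    pairs = []
    for rel in relations:
        if rel.relation_type not in templates:
            continue
        if rel.source_id not in terms or rel.target_id not in terms:
            continue
        eu_a, es_a = terms[rel.source_id]
        eu_b, es_b = terms[rel.target_id]
        eu_tpl, es_tpl = templates[rel.relation_type]
        pairs.append((eu_tpl.format(a=eu_a, b=eu_b),
                      es_tpl.format(a=es_a, b=es_b)))
    return pairs


if __name__ == "__main__":
    relations = [Relation("C1", "finding_site", "C2")]
    for eu, es in make_pairs(relations, TERMS, TEMPLATES):
        print(eu, "|||", es)
```

Pairs produced this way could be appended to a general-domain parallel corpus before training. A real system would also need to handle term inflection, which this sketch ignores.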

Xabier Soto, Olatz Perez de Viñaspre, Maite Oronoz, Gorka Labaka

Article reference:

Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation
