XML and translation technologies

  By: Sofiane Madani

Nearly every aspect of life today depends on data. Our personal details are held as computerized data that is transferred, displayed, parsed, stored, searched, indexed and translated. In translation and localization, XML is arguably the most important underlying technology. It plays a central role in the design of the tools that help translators work consistently and effectively, and the translation industry, reshaped by these tools, now depends heavily on them.

XML serves as a general-purpose storage format for almost any type of data. In translation, it is used both as the original storage format of content submitted for translation and as a temporary format during the translation or localization process. It is also well suited to exchanging data with databases and to preparing content for translation and localization. For this reason, a number of XML-based vocabularies and standards have been created. The first standard we will present is XLIFF (XML Localization Interchange File Format), an intermediary format for exchanging data during translation and localization. In simpler terms, this standard, supported by most providers of translation and localization tools, stores the text extracted for translation in an XML document. At the end of the process, the translated text in the XLIFF document is converted back to its original format.
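The extract, translate and merge-back cycle described above can be sketched in a few lines of Python. This is only an illustration, assuming the XLIFF 1.2 vocabulary (file, body, trans-unit, source, target); the function names, the sample strings and the file name ui.txt are hypothetical, and real tools also handle inline markup, notes and the skeleton of the original file.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2; other XLIFF versions use a different one.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)


def q(tag):
    """Return the namespace-qualified form of an XLIFF element name."""
    return f"{{{XLIFF_NS}}}{tag}"


def build_xliff(segments, source_lang, target_lang, original_name):
    """Wrap extracted text segments in a minimal XLIFF document."""
    xliff = ET.Element(q("xliff"), {"version": "1.2"})
    file_el = ET.SubElement(xliff, q("file"), {
        "original": original_name,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, q("body"))
    for number, text in enumerate(segments, start=1):
        unit = ET.SubElement(body, q("trans-unit"), {"id": str(number)})
        ET.SubElement(unit, q("source")).text = text
    return xliff


def add_translations(xliff, translations):
    """Attach a <target> to each trans-unit, looked up by unit id."""
    for unit in xliff.iter(q("trans-unit")):
        ET.SubElement(unit, q("target")).text = translations[unit.get("id")]


def merge_back(xliff):
    """Collect the translated segments in document order."""
    return [unit.find(q("target")).text for unit in xliff.iter(q("trans-unit"))]


if __name__ == "__main__":
    doc = build_xliff(["Hello", "Goodbye"], "en", "fr", "ui.txt")
    add_translations(doc, {"1": "Bonjour", "2": "Au revoir"})
    print(ET.tostring(doc, encoding="unicode"))
    print(merge_back(doc))
```

Keying the translations by trans-unit id, rather than by position, is what lets the translated text be placed back at the right spot in the original format.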

Another important standard for translators is TMX (Translation Memory eXchange). This format is used to exchange translation memory databases as XML files, with a specific vocabulary for describing translation units. The goal of TMX is to let translation memory users move their resources between the various CAT tools on the market. The approach is to define a standard XML vocabulary for storing translation units that every CAT tool can export and then import into its own proprietary format. In practice, the import-export process is not perfect, because each tool segments text in its own way. A few translation units are usually rejected on import because the tools do not apply the same segmentation rules, even though a standard for exchanging segmentation rules (SRX) exists.
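As a rough illustration of what importing a TMX file involves, the sketch below reads translation units and pairs up two languages. It assumes the TMX 1.4 element names (tu, tuv, seg) and the xml:lang attribute; the sample content and the function name read_tmx are made up for the example. Units are skipped here simply because a language is missing, which is a much cruder test than the segmentation checks a real CAT tool applies.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: one complete unit and one without a French variant.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated segment.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(tmx_text, source_lang, target_lang):
    """Return (pairs, rejected): usable segment pairs and the skipped-unit count."""
    root = ET.fromstring(tmx_text)
    pairs, rejected = [], 0
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also keeps text that sits inside inline elements.
                segments[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
        else:
            rejected += 1
    return pairs, rejected


if __name__ == "__main__":
    pairs, rejected = read_tmx(SAMPLE_TMX, "en", "fr")
    print(pairs, rejected)
```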

Terminology exchange is also important for translation professionals, and the TBX (TermBase eXchange) standard was created for this purpose. Although far less widely used than TMX, it is very useful for interchange between tools. It aims to help machine translation systems exchange lexicons and to let CAT tools leverage human-oriented glossaries.

For corpus processing, XCES (XML Corpus Encoding Standard) is a format used to encode texts and make corpora searchable. Linguistic items are annotated and indexed, either automatically or manually, which makes it possible, for example, to classify tokens by part of speech.

It is worth noting that LISA (Localization Industry Standards Association), the organisation that led much of the work on exchange standards such as TMX and TBX, has been declared insolvent. ETSI (European Telecommunications Standards Institute) has been designated as its successor. LISA’s website still offers the standards documents for download.

