{"id":461,"date":"2021-09-08T15:06:04","date_gmt":"2021-09-08T15:06:04","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2021\/09\/08\/building-a-language-agnostic-neural-machine-translation-system\/"},"modified":"2021-09-08T15:06:04","modified_gmt":"2021-09-08T15:06:04","slug":"building-a-language-agnostic-neural-machine-translation-system","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/09\/08\/building-a-language-agnostic-neural-machine-translation-system\/","title":{"rendered":"Building a Language-Agnostic Neural Machine Translation System"},"content":{"rendered":"<h3>Why Machine Translation<\/h3>\n<p>At Salesforce, our goal in introducing machine translation was to increase scalability and better serve our international customers. Advantages include:<\/p>\n<ul>\n<li>Innovating and acquiring know-how internally<\/li>\n<li>Reducing translation time by enhancing translators\u2019 productivity<\/li>\n<li>Increasing content freshness by publishing more frequent\u00a0updates<\/li>\n<li>Reinvesting savings into high-value content and\u00a0products<\/li>\n<\/ul>\n<p>When we explored commercially available solutions four years ago, we quickly realized that they offered neither data privacy nor the customizability needed to support XML tags and our terminology, and the specificity of our technical domain was challenging for them. Driven by <a href=\"https:\/\/www.logos.t.u-tokyo.ac.jp\/~hassy\/\">Kazuma Hashimoto<\/a>, Lead Research Scientist at Salesforce Research, and <a href=\"https:\/\/www.linkedin.com\/in\/raffaellabuschiazzo\/\">Raffaella Buschiazzo<\/a>, Director of Localization at R&amp;D Localization, the team built a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Neural_machine_translation\">Neural Machine Translation (NMT)<\/a> system for the Salesforce domain, with a language-agnostic architecture and a separate model for each language. 
The system translates whole XML files from English into 16 languages: French, German, Japanese, Spanish, Mexican Spanish, Brazilian Portuguese, Italian, Korean, Russian, Simplified Chinese, Traditional Chinese, Swedish, Danish, Finnish, Norwegian, and\u00a0Dutch.<\/p>\n<p>Our primary target for machine translation (MT) was the <a href=\"https:\/\/help.salesforce.com\/s\/\">Salesforce online help<\/a> content, which had been localized by professional translators for 20 years. Salesforce international customers use our online help to resolve issues. If they can\u2019t do so through the localized help because translation quality is poor, they escalate to in-country tech support. For this reason, high-quality machine translation is expected but hard to achieve, because new release content comprises new feature\/product terminology not included in our <a href=\"https:\/\/en.wikipedia.org\/wiki\/Translation_memory\">translation memories<\/a> and general training\u00a0corpora.<\/p>\n<h3>Challenges<\/h3>\n<p>Challenges that we faced while building our system\u00a0include:<\/p>\n<p><strong>Tag handling<\/strong>. We built our end-to-end system based on our publicly available dataset, and we filed a patent (<a href=\"https:\/\/patents.google.com\/patent\/US10963652B2\/\">US10963652B2\u200a\u2014\u200aStructured text translation<\/a>).<\/p>\n<p><strong>Getting the \u201cright\u201d mix<\/strong> of customized training datasets and more general ones. We benchmarked our MT output against commercially available systems and observed that, while our model scored higher for sentences with content specific to Salesforce, the others provided better results for more generic content. That\u2019s why we decided to switch from using our Salesforce-only data to fine-tuning publicly pre-trained models: mBART, XLM-R,\u00a0etc.<\/p>\n<p><strong>Quality tracking<\/strong>. When we started, we used BiLingual Evaluation Understudy (BLEU) scores and conducted human evaluations at each iteration. 
Translators evaluated 500 machine-translated strings using a 1\/2\/3 categorization (1: translation is ready for publication; 2: translation is useful but needs human post-editing; 3: translation is useless) and offered their overall feedback after post-editing 100K new words + 300K edits per major release. But this did not give us enough data to understand how we were doing. That\u2019s why in 2020 we started calculating the edit distance on every post-edited segment, using an algorithm based on the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Damerau%E2%80%93Levenshtein_distance\">Damerau\u2013Levenshtein edit distance<\/a>. It counts the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters. Averaged across languages over five releases, 36.03% of machine-translated segments needed no edits at all, which is a solid result considering that some of our languages, such as Finnish and Japanese, are still challenging for MT.<\/p>\n<p><strong>MT API<\/strong>. When our model started producing good-quality output, we knew it was time to integrate the MT API into our localization pipeline. This required security work, testing, and additional compute capacity.<\/p>\n<h3>Technical Overview<\/h3>\n<p>Our goal was to translate rich-formatted (XML-tagged) text, so we investigated how neural machine translation models could handle such tagged\u00a0text.<\/p>\n<p>Here are two example pages from a PDF produced by our English-to-Japanese system. We can see that the original styles are preserved because our MT model can handle XML-tagged text.<\/p>\n<p>To build our system, we first extracted training data from the Salesforce online help, and we released the <a href=\"https:\/\/github.com\/salesforce\/localization-xml-mt\">dataset<\/a> for research purposes. 
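The edit-distance tracking described under Quality tracking above can be illustrated with a minimal sketch. This implements the restricted (optimal string alignment) variant of the Damerau\u2013Levenshtein distance; the function names are illustrative, not those of our production tooling:

```python
def dl_distance(a: str, b: str) -> int:
    # Restricted Damerau-Levenshtein (optimal string alignment) distance:
    # the minimum number of single-character insertions, deletions,
    # substitutions, or adjacent transpositions needed to turn a into b.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

def needed_no_edits(mt_output: str, post_edited: str) -> bool:
    # A segment counts as publishable as-is when the translator changed nothing.
    return dl_distance(mt_output, post_edited) == 0
```

Segments whose distance to the post-edited version is zero are the ones counted in the 36.03% figure above; dividing the distance by the segment length gives a normalized edit rate for cross-language comparison.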
At its core, our model is a Transformer encoder-decoder, where the input is XML-tagged text in English, and the output is XML-tagged text in another language.<\/p>\n<p>To explicitly handle XML-specific tokens, we use an XML-tag-aware tokenizer and train a seq2seq model. Our model also has a copy mechanism that allows it to directly copy special tokens, named entities, product names, URLs, email addresses, and so on. This copy mechanism is later used to align the positions of the XML tags between the two languages.<\/p>\n<p>The system has two phases: training and translation. At training time, we use the training data from the latest release, and we also incorporate release notes for the target release. Release notes describe the new products and features in a particular release, so the model can learn how to translate the new terminology in its new context.<br \/>This is similar to providing a new grocery list, except that release notes present each item with surrounding context. At translation time, we first extract new English sentences that have little overlap with our translation memory, remove metadata from the XML tags, and run our trained model. We skip MT for sentences with a close translation-memory match, because human translators can edit that match directly. In the end, we recover the metadata for each XML tag by using our copy mechanism for tag position alignment. The translated sentences are verified and post-edited, if necessary, by our professional translators, and then finally published.<\/p>\n<p>Here is an example of how our system\u2019s pipeline\u00a0works.<\/p>\n<p><strong>(1) Overview<\/strong><\/p>\n<p>An English sentence with markup tags is fed into our system, and a sentence in a target language (here, Japanese) is output. 
Those tags are represented as placeholders with\u00a0IDs.<\/p>\n<p><strong>(2) Input preprocessing<\/strong><\/p>\n<p>Our system first preprocesses the input sentence by replacing the placeholders with simplified tags (i.e., &lt;ph&gt;\u2026&lt;\/ph&gt; tags). For the sake of simplicity, we do not include any metadata or attributes inside these tags, which makes it easier for our MT model to translate text containing the\u00a0tags.<\/p>\n<p><strong>(3) Translation by our\u00a0model<\/strong><\/p>\n<p>We then run our MT model to translate the English sentence with the simplified tags. Here we can see a Japanese sentence (with the tags) output by our\u00a0model.<\/p>\n<p><strong>(4) Tag alignment<\/strong><\/p>\n<p>To convert the translation result to the original format with the placeholder tags, we need to know which source-side (i.e., English-side) tags correspond to which target-side (i.e., Japanese-side) tags. We leverage the copy mechanism in our model to find a one-to-one alignment between the source-side and target-side tags. The details of the copy mechanism are described in <a href=\"https:\/\/arxiv.org\/abs\/2006.13425\">our paper<\/a>. The alignment matrix allows us to find the best alignment, as shown in the figure. In this example, tags with the same color correspond to each\u00a0other.<\/p>\n<p><strong>(5) Output postprocessing<\/strong><\/p>\n<p>Finally, we replace the simplified tags with the original placeholders by following the tag alignment result, so the translated sentence is output in the same\u00a0format as the input.<\/p>\n<p>In 2020 we succeeded in implementing Salesforce NMT as a standard localization process for our online help. Now 100% of our help is machine-translated and post-edited by our professional translators for all 16 languages we support. 
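As a rough sketch of steps (2) and (5) above, here is how placeholder simplification and restoration might look. The [ph_N] placeholder syntax and the helper names are hypothetical simplifications (each tag is treated as a standalone token for brevity), and the alignment list stands in for the copy mechanism's output:

```python
def simplify(src: str):
    # Step (2): replace ID-carrying placeholders such as [ph_3] with bare
    # <ph> tags, recording the IDs in order. The placeholder syntax here
    # is illustrative; the real pipeline stores richer tag metadata.
    out, ids = [], []
    i = 0
    while i < len(src):
        if src.startswith('[ph_', i):
            end = src.index(']', i)
            ids.append(int(src[i + 4:end]))
            out.append('<ph>')
            i = end + 1
        else:
            out.append(src[i])
            i += 1
    return ''.join(out), ids

def restore(translated: str, ids, alignment):
    # Step (5): rewrite each target-side <ph> with the ID of the source tag
    # it aligns to. alignment[k] = source-tag index for the k-th target tag,
    # i.e. the one-to-one alignment found via the copy mechanism.
    parts = translated.split('<ph>')
    out = [parts[0]]
    for k in range(len(parts) - 1):
        out.append('[ph_' + str(ids[alignment[k]]) + ']')
        out.append(parts[k + 1])
    return ''.join(out)
```

For example, simplify('Click [ph_1]Save[ph_2] now') yields the model input 'Click &lt;ph&gt;Save&lt;ph&gt; now' plus the ID list [1, 2], and restore puts those IDs back into the translated sentence in the order dictated by the alignment, even when the target language reorders the tags.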
We developed a plugin to track MT quality systematically, trained our translators on MT post-editing best practices, and reduced training time for the MT models from 1 day to 2\u20133 hours per language. We achieved this reduction with <a href=\"https:\/\/arxiv.org\/abs\/1612.00796\">a continual learning method<\/a>: instead of retraining the models from scratch at every release, we quickly fine-tune our previous models on only the new data (such as release notes and new content added in the latest release).<\/p>\n<h3>Future Applications<\/h3>\n<p>In the near future, we envision extending MT to <a href=\"https:\/\/help.salesforce.com\/articleView?id=sf.faq_getstart_what_languages_does.htm&amp;type=5\">34 languages<\/a> and machine translating UI software strings and other content (e.g., knowledge articles, developer guides, and so on). Having an in-house MT system also means that we could potentially use it for customer-facing products such as the Salesforce case feed, Experience Cloud, Slack apps, etc. And maybe one day we could make the Salesforce MT API available to our customers as an out-of-the-box or trainable product. There is so much content out there that people need to understand, but not enough time and resources to translate it all by hand! 
That\u2019s where the Salesforce NMT system can take center stage.<\/p>\n<h3>Acknowledgements<\/h3>\n<p>This program wouldn\u2019t have been possible without the leadership and support, over the years, of Caiming Xiong, Managing Director of Salesforce Research, Yingbo Zhou, Director of Salesforce Research, and Teresa Marshall, VP, Globalization and Localization.<\/p>\n<h3>Additional Resources<\/h3>\n<ul>\n<li><a href=\"https:\/\/aclanthology.org\/2020.amta-user.20\/\">Our AMTA 2020 presentation \u201cBuilding Salesforce Neural Machine Translation System\u201d<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2006.13425\">Our WMT 2019 paper \u201cA High-Quality Multilingual Dataset for Structured Documentation Translation\u201d<\/a><\/li>\n<\/ul>\n<p><a href=\"https:\/\/engineering.salesforce.com\/building-a-language-agnostic-neural-machine-translation-system-22ea7db97edc\">Building a Language-Agnostic Neural Machine Translation System<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>","protected":false},"excerpt":{"rendered":"<p>Why Machine Translation At Salesforce, our goal in introducing machine translation was to increase scalability and better serve our international customers. 
Advantages include: Innovating and acquiring know-how internallyReducing translation time by enhancing translators\u2019 productivityIncreasing content freshness by publishing more frequent\u00a0updatesReinvesting savings into high-value content and\u00a0products When we explored commercially available solutions four years ago, we&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/09\/08\/building-a-language-agnostic-neural-machine-translation-system\/\">Continue reading <span class=\"screen-reader-text\">Building a Language-Agnostic Neural Machine Translation System<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-461","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":575,"url":"https:\/\/fde.cat\/index.php\/2022\/05\/09\/language-packs-metas-mobile-localization-solution\/","url_meta":{"origin":461,"position":0},"title":"Language packs: Meta\u2019s mobile localization solution","date":"May 9, 2022","format":false,"excerpt":"More than 3 billion people around the world rely on our services each month. On mobile, around 57 percent of people on Facebook for Android and 49 percent of those on Facebook for iOS use the app in a language other than English. 
Delivering the best experience for these people,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":700,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/11\/3-ways-salesforce-takes-ai-research-to-the-next-level\/","url_meta":{"origin":461,"position":1},"title":"3 Ways Salesforce Takes AI Research to the Next Level","date":"April 11, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. Meet Shelby Heinecke, a research manager for the Salesforce AI team. Shelby leads her diverse team on a variety of projects, ranging from identity resolution to recommendation systems to conversational\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":229,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","url_meta":{"origin":461,"position":2},"title":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning","date":"February 2, 2021","format":false,"excerpt":"Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. 
In this post I will share some unique challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":317,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/a-deep-dive-on-text-classification-at-salesforce\/","url_meta":{"origin":461,"position":3},"title":"A Deep Dive on Text Classification at Salesforce","date":"August 31, 2021","format":false,"excerpt":"published on Towards Data\u00a0SciencePutting from a Sand Trap (Image by\u00a0Author)We\u2019re excited to announce that Noah Burbank, a Principal Data Scientist in Sales Cloud, has recently published a deep dive into text classification at Salesforce on Towards Data Science. The article, How to choose the right model for text classification in\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":719,"url":"https:\/\/fde.cat\/index.php\/2023\/05\/23\/automation-at-scale-migrating-200000-machines-from-centos-7-to-rhel-9\/","url_meta":{"origin":461,"position":4},"title":"Automation at Scale: Migrating 200,000 Machines from CentOS 7 to RHEL 9","date":"May 23, 2023","format":false,"excerpt":"When a legacy operating system (OS) approaches its end-of-support date, some organizations will upgrade their OS as fast as possible. Others may kick the can down the road, delaying any headaches they might encounter during the upgrade process. 
Six years ago, Salesforce Engineering put the pedal to the metal, migrating\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":705,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/18\/ai-based-identity-resolution-the-key-for-linking-diverse-customer-data\/","url_meta":{"origin":461,"position":5},"title":"AI-based Identity Resolution: The Key for Linking Diverse Customer Data","date":"April 18, 2023","format":false,"excerpt":"Companies want a comprehensive view of their customers, enabling them to solve business and marketing challenges, such as personalization, segmentation, and targeting \u2014 but they face an uphill battle as they are drowning in data. For example, many companies cannot match the identity of a customer who visits their website\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/461","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=461"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/461\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=461"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=461"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=461"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}