Weighting Machine Translation Quality with Cost of Correction

Machine translation (MT) was one of the topics yesterday as I enjoyed lunch with Ben Cornelius and Ray Flournoy of Adobe Systems. Of course, a key criterion for evaluating these tools is the quality of the translated output. Most people would say it is the most important criterion. Speed, cost, and integration with other tools are also significant.

However, evaluation of any tool should take the intended application into account. Some applications can use MT output directly. Many more require review and editing of the output before publication. Manual review and editing introduce labor costs and delays, which can be significant.

Therefore, when evaluating MT tools, we should look at the cost of operating the tool plus the cost of post-editing. Clearly, the optimum is 100% quality with no post-edits required. But this is not usually the case…

Not all editing tasks are the same. Some edits are quick and inexpensive to make. Others are labor-intensive. The entries that require editing may be obvious and therefore easy to find. Others may be subtle and require more intensive scrutiny to identify.

The post-editing needed by machine translation output will follow a pattern that varies with the MT engine (and its rules, or training, etc.). (Human authors also have a writing pattern requiring particular classes of edits.) Typographic and terminology substitution errors may be easy to address. Some grammar and style errors may be more costly. Consistency, flow and the relationship among sequences of sentences may be harder yet.

This suggests an interesting criterion for evaluating tools: the joint editing productivity and total operational cost of using the MT tool. An MT product that generates text needing edits that are both easy to find and easy to fix could have a very low total cost. Another tool producing higher quality linguistic output might still be less productive if post-editing is difficult.

A good metric for MT tools would assign each class of error a weight proportional to the cost of fixing it. A document could have 100 typos and be much cheaper to ready for publication than a document with only a few consistency or contextual errors that require thought and consideration to address.
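For instance, here is a minimal sketch of such a metric. The error classes and per-class weights below are invented for illustration (say, minutes of editing time), not values from any published standard:

    # Weight each error class by its estimated correction cost.
    # All classes and values are invented examples.
    EDIT_WEIGHTS = {
        "typo": 0.5,          # quick find-and-replace
        "terminology": 1.0,   # substitute the correct term
        "grammar": 2.0,       # rework the sentence
        "consistency": 8.0,   # requires reading the surrounding text
    }

    def post_edit_cost(error_counts, weights=EDIT_WEIGHTS):
        """Total estimated editing effort for a document."""
        return sum(weights[cls] * n for cls, n in error_counts.items())

    print(post_edit_cost({"typo": 100}))       # 50.0 -- 100 typos
    print(post_edit_cost({"consistency": 10})) # 80.0 -- 10 consistency errors

With these (made-up) weights, the document with 100 typos really does come out cheaper than the one with ten consistency errors.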

This metric would also help with process configuration. For example, if I have to produce both Mexican and Iberian Spanish translations from English source material, I have several options.

If “>-MT->” represents a machine translation step, “>-PE->” a post-edit step, “mx” and “es” Mexican and Iberian Spanish, and “mx2”/“es2” their post-edited results:

Option A (simple MT, then PE):
    en >-MT-> mx
    en >-MT-> es
    mx >-PE-> mx2
    es >-PE-> es2

Option B (MT mx to es):
    en >-MT-> mx
    mx >-MT-> es
    mx >-PE-> mx2
    es >-PE-> es2

Option C (post-edit mx into es):
    en >-MT-> mx
    mx >-PE-> mx2
    mx2 >-PE-> es
    es >-PE-> es2

Option D (MT es to mx):
    en >-MT-> es
    es >-MT-> mx
    es >-PE-> es2
    mx >-PE-> mx2

Option E (post-edit es into mx):
    en >-MT-> es
    es >-PE-> es2
    es2 >-PE-> mx
    mx >-PE-> mx2

The scenario that is most effective is the one requiring the least editing. This may not correlate with unweighted measurements of each machine translator’s linguistic quality.
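To make the comparison concrete, here is a rough sketch that models each option as a sequence of steps and totals a per-step cost. The costs are invented placeholders; real values would come from measured MT operating cost and weighted post-editing effort:

    OPTIONS = {
        "A": ["en >-MT-> mx", "en >-MT-> es", "mx >-PE-> mx2", "es >-PE-> es2"],
        "B": ["en >-MT-> mx", "mx >-MT-> es", "mx >-PE-> mx2", "es >-PE-> es2"],
        "C": ["en >-MT-> mx", "mx >-PE-> mx2", "mx2 >-PE-> es", "es >-PE-> es2"],
        "D": ["en >-MT-> es", "es >-MT-> mx", "es >-PE-> es2", "mx >-PE-> mx2"],
        "E": ["en >-MT-> es", "es >-PE-> es2", "es2 >-PE-> mx", "mx >-PE-> mx2"],
    }

    # Invented flat costs: MT runs cheap, post-edit passes expensive.
    # In practice each step would get its own weighted cost; for example,
    # post-editing mx2 into es may be cheap because the dialects are close.
    def step_cost(step):
        return 1.0 if ">-MT->" in step else 5.0

    for name, steps in sorted(OPTIONS.items()):
        print(name, sum(step_cost(s) for s in steps))

Under these flat costs the extra post-edit passes in options C and E make them the most expensive; different per-step weights could easily reverse that ranking.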

When I mentioned this, Ben recalled a demo by ProMT that the three of us attended recently. ProMT machine translation has a nice feature for managing placeholders used to represent program variables.

Here is an example sentence with two placeholders, each represented by an identifier in curly brackets.
“The file {0} contains {1} words.”

The filename and word count would be substituted at run-time for {0} and {1} respectively.
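For instance, Python’s str.format uses exactly this positional placeholder syntax, so the run-time substitution looks like this (the filename and count are made-up values):

    # Run-time substitution of the positional placeholders shown above.
    template = "The file {0} contains {1} words."
    print(template.format("report.txt", 1024))
    # -> The file report.txt contains 1024 words.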

Many machine translation tools segment the text in between the placeholders rather than treating the placeholders as part of the syntax of the sentence. As a result, the placeholders are not properly placed in the translated output. The problem is exacerbated by tools that convert markup tags to placeholders.

For example, according to Alex Yanishevsky of ProMT, Idiom WorldServer converts:
<i>My name</i> <u> is </u> <b>Alex</b>
into:
{1}My name{2} {3} is {4} {5}Alex{6}

The resulting translation gives:
{1}{2}Меня зовут Alex{6}{3}{4}{5}

Even if you don’t read Russian (“Меня зовут Alex” means “My name is Alex”), you can see that “Alex” should retain its placeholders as “{5}Alex{6}”.

Post-editors must remove the misplaced placeholders and reinsert them at the correct locations in the text. This would be a significant cost consideration for either software or markup localization.
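Segments needing this repair could at least be detected automatically. Here is a minimal sketch, assuming the {n} placeholder convention above; the function is mine, not a feature of any tool mentioned here:

    import re

    # Flag segments whose placeholders were dropped, invented, or reordered.
    # A reorder is not always an error (translation changes word order),
    # so reordered segments are only flagged for human review.
    PLACEHOLDER = re.compile(r"\{\d+\}")

    def placeholder_report(source, target):
        src = PLACEHOLDER.findall(source)
        tgt = PLACEHOLDER.findall(target)
        if sorted(src) != sorted(tgt):
            return "missing or extra placeholders"
        if src != tgt:
            return "placeholders reordered; review placement"
        return "ok"

    src = "{1}My name{2} {3} is {4} {5}Alex{6}"
    out = "{1}{2}Меня зовут Alex{6}{3}{4}{5}"
    print(placeholder_report(src, out))  # placeholders reordered; review placement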

ProMT treats the placeholders as part of the sentence, resulting in better placement in the output. This simplifies post-editing and improves productivity.

(I am not commenting on ProMT translation quality. For this scenario their output significantly reduces post-editing cost.)

Ideally, machine translation would deliver 100% quality. When quality is less than 100%, evaluating the combination of machine translation and post-editing effort is more useful than selecting tools or configuring workflows based on quality metrics alone. Higher quality might be irrelevant if it is more challenging for the human post-editor to correct the text.



12 Responses to “Weighting Machine Translation Quality with Cost of Correction”

  1. Kirti Says:

    Asia Online has attempted to address this issue with a cost calculator that allows users to specify the costs associated with correcting different kinds of errors. For example, the errors at the top of this table require much more effort to correct than the ones at the bottom:

    Segment Classification / Definition

    No Sense: The text is not understandable to any degree. It must be completely re-translated.
    Mistranslation: The text is understandable, but translated poorly. It needs to be completely re-translated.
    Grammar/Syntax: The grammar, or the relation of words (their reference to other words, or their dependence according to the sense), is incorrect and needs to be adjusted.
    Literal Translation: The translation is overly literal.
    Terminology: The terminology is incorrect; a word is translated with a different meaning than the one intended.
    Unknown Words: One or more words were unknown to the translation engine.
    Addition: The text has additional words that need to be deleted.
    Omission: The text is missing words that need to be inserted.
    Spelling: The text has spelling errors.
    Punctuation: The punctuation is incorrect.
    Capitalization: The capitalization of some words is incorrect.
    Correct Translation: The sentence is correct and does not need any editing. Ready to publish as is.

    You can use the calculator at http://www.asiaonline.net/TranslationCostCalculator.aspx

    This will allow a user to get some sense of what value, if any, can be gained from using an MT system. It can be used with any MT output sample.

  2. i18nguy Says:

    Kirti, thanks for this. I added the bolding to make the list more readable; I hope that is OK with you. Readers should follow the link to learn more.

    The tool looks like a good start, although it is still primitive.
    Adding the placeholders problem to the list would be good.

    Also, recognizing that some problems are not fixed by translators or post-editors would be good. I have in mind that terminology and unknown words might be fixed by updating glossaries, adding to the MT corpus, etc.

    Have you been tracking the results of these calculations? How do the MT tools compare when ranked against a post-edit weighted list?

  3. Kirti Says:

    There could be several ways to handle the variables you illustrate above.

    Pre- or post-processing around the actual translation engine is the way we think would work best. If variables are set up in a systematic way, this should not be too difficult to handle through process workflow design.

    We have just introduced the calculator and are starting to publicize it. Our intention is to share the kinds of inputs we are seeing on average, so that the defaults become more meaningful. As we get more feedback, we will attempt to add new features to make the calculator more useful to users.

    I am not sure what you mean by a post-edit weighted list.
    I think the engines with the most customization effort will tend to do better on these calculations. As we all know, using baseline free MT engines is not very meaningful, since in most cases the cleanup effort is substantial.

    Asia Online suggests that users focus on developing the engine and raising the quality of the output significantly before putting it into production. Also, as these engines continue to evolve in quality, measurements are really only snapshots of the system at one point in time.

  4. Ray Flournoy Says:

    The concept of editing costs will come into play more and more as MT becomes more viable in the localization world.

    One of the LSPs that we have talked with about post-editing put forward a vague classification of post-editing quality, with rates to go along with the quality tiers. Post-editing to a human-quality level was the most expensive, and they also offered a cheaper alternative billed as fixing only the “most serious errors.”

    There was no explanation of what is or is not classified as a serious error, and the rates for that reduced quality level were so bargain-basement that I wondered if they were correcting anything at all. But all of the LSPs seem to be making up pricing as they go along, so I can’t blame them too much.

  5. Alex Yanishevsky Says:

    As far as popularization and commercial acceptance go, MT is following the trajectory of its cousin, TM. Early TM systems were primarily concerned with preserving and reusing segments. As they matured, users expected greater sophistication, like proper handling of tags and error-free conversions to lesser-used file formats. MT software providers are now held to the same expectations as MT goes more “mainstream” and proves itself a viable part of the globalization content chain. Consequently, MT engines not only need to be able to handle tags, but also to “turn on a dime” and handle tags in various semantic settings.

    Example 1 – with and without placeholder support in Idiom

    Source
    Check NativeApplication.supportsSystemTrayIcon to determine whether system tray icons are supported on the current system.

    Converted by Idiom as

    Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether {3}system tray icons{4} are supported on the {5}current system{6}.

    will be sent to MT as

    {1}Check NativeApplication.supportsSystemTrayIcon to determine whether system tray icons are supported on the current system.{2}{3}{4}{5}{6}

    Without placeholders (plain text)
    {1}Проверьте NativeApplication.supportsSystemTrayIcon, чтобы задать, поддержаны ли значки подноса системы на существующей системе. {2}{3}{4}{5}{6}

    Naturally, this will require a good amount of post-editing.

    With placeholders supported
    {1}Проверьте NativeApplication.supportsSystemTrayIcon,{2} чтобы задать, поддержаны ли {3}значки подноса системы{4} на {5}существующей системе{6}.

    Example 2 – with placeholders in Idiom and recognizing the tag as an adjective
    Source
    Your {1}AccountDetails/CountryName{2} Account Is Confirmed

    With placeholders supported
    Votre compte {1}AccountDetails/CountryName{2} est confirmée.

  6. Ben Cornelius Says:

    This continues to be an interesting conversation, Tex.

    I think what has been missing is not categorization of linguistic quality. Technical quality and operating overhead may have as much or more impact and effort cost than issues with linguistic quality. To echo Tex, “The scenario that is most effective is the one requiring the least editing. This may not correlate with unweighted measurements of each machine translator’s linguistic quality.”

    For many of us, MT is one aspect of the overall process, or we have multiple scenarios in which we may use MT, such as within the context of a human translation process, a publishing or build process, or simply providing unedited results to the user, similar to MS Word’s translate feature or any number of translation portals.

    In some of these cases, linguistic quality is not the primary consideration. As Ray states, the LSPs seem to recognize there is a spectrum and (as always) provide a spectrum of pricing and services to address different goals defined by the market opportunities they perceive.

    Thanks Kirti for making the effort to help the industry with the cost calculator. As a tool for measuring the impact of linguistic quality, it is clearly useful. As a measure of total ROI, integration with existing models or a generic MTM model, we also need to consider interoperability with some of the standard elements of a TM system and/or the experience and productivity of the linguists involved. In my heavily tagged world, a “plain text process” needs to be supplemented by much more – and results viewed in waters that are still a bit muddy.

    Consider a situation where all segments contain placeholders requiring post-editing intervention, and repeated segments cannot be edited once at the parent. 1000 segments to post-edit can be less productive and cost-efficient than 400 fuzzy matches to edit.

    Interoperability with terminology, metadata, placeholders, repetition information, etc., and gaining a measure of MT confidence from the engine that can be evaluated vs. a fuzzy match (and displayable to the linguist) all factor significantly into such an environment. Without these considerations, a free engine with great linguistic quality can surprise us with increased costs and timelines that show human translation could have been more effective.

  7. i18nguy Says:

    Ben, Kirti,

    Perhaps we should also be looking for a specialized post-editing tool that is very efficient at correcting the problems left behind in MT-generated output.

    An editor that would manipulate placeholders easily, perhaps automatically updating some of the placeholder numbering for you (e.g. if you change one placeholder to be #2, it knows to make the other one #1) would help.
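    As a rough sketch of the renumbering idea, using the {n} convention from the post (this is hypothetical, not a description of any existing tool):

        import re

        # Rough sketch: after the editor moves placeholders around, rewrite
        # them as {1}, {2}, ... in order of appearance. Renumbering purely by
        # position is an assumption; paired open/close tags would need to
        # keep their pairing, which this does not handle.
        def renumber(segment):
            counter = iter(range(1, 1000))
            return re.sub(r"\{\d+\}", lambda m: "{%d}" % next(counter), segment)

        print(renumber("{2}Alex{5} est mon nom"))  # -> {1}Alex{2} est mon nom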

    An editor that would let you highlight a phrase and let you choose among alternative gender, tense, endings, etc. would be nice to have.
    Perhaps not that hard to implement either.

    Is anyone making post-editing tools like this?

  8. Ray Flournoy Says:

    I know that Spoken Translation (http://www.spokentranslation.com/) was looking at methods for disambiguating and correcting text for translation, but the examples I’ve seen from them mostly dealt with lexical choice. I know they had more ambitious goals, but I don’t know how far their tools have progressed.

    There is a cute demo on their site showing their translation system, although only a small bit of the demo is about disambiguation and correction.

    Yahoo! Japan licenses its MT from Cross Language (not the Belgian company; a different Cross Language), and it lets you highlight words in the text and see which tokens on the other side correspond to the highlighted text. It’s only useful if you have at least minimal knowledge of the second language, but this sort of information (aligning words and phrases between source and target strings) would be a first step toward allowing a user to trace the origin of translation errors and correct them on the source side.

    You can test that here: http://honyaku.yahoo.co.jp/.

  9. » Blog Archive » More news from the industry Says:

    […] we strongly recommend this blog post on weighting machine translation quality with cost of […]

  10. Japanese Translator Says:

    Well, translation machines do pretty good work for languages that look similar, like the Latin languages among themselves for instance, and in these cases post-editing by a proofreader might prove sufficient. But when it comes to more distant ones, say Arabic or Japanese, you should definitely have a professional translator do the job and save yourself a lot of trouble.

  11. i18nGuy Says:

    It is true that automating translation between languages that are not strongly related, with completely different heritages, is much more difficult. As a general guideline, we can agree that the more distinct the grammars, the more likely a professional translator will be required.

    However, I do not agree that a professional is necessarily required. I would look first at the purpose (application) and audience for the output, the quality of the MT engine, and the quality of the input content.

    Some companies have focused intensively on specific language pairs (such as English-Japanese) and do very well with those.

  12. japanesetranslationservices Says:

    I really enjoyed this post. It proved to be very helpful to me and I am sure to all the commentators here! Keep writing. Thanks…
    Japanese Translation
