Archive for the ‘Machine Translation (MT)’ Category

Machine Translation Silicon Valley Meeting Group

2009/07/23

Ben Cornelius and Ray Flournoy asked whether there is interest in a regular meeting group in Silicon Valley for Machine Translation.

Ben and I host another meeting group, for Globalization Management Systems, which has been very beneficial for members. Through invited speakers, the group has learned a lot about what is available on the market, what is forthcoming, and how other users think about purchasing, deploying, and integrating these tools, as well as their real-world experiences.

Should there be a meeting group for Machine Translation topics?
What do you think the goals of this group should be? Who should (or would) attend? (e.g. what are typical roles of attendees? MT developers, users, admins, etc.)

We discussed a number of our interests and possibilities. Without revealing our own personal preferences and agendas, I list a few potential topics below.

You can respond here, but I have set up a Yahoo Group that you can add yourself to if interested. If there is enough interest we can work on logistics and have at least an initial meeting or two.
The group is at:

groups.yahoo.com/group/MachineTranslationInterest.

  • Is anyone interested in hosting a meeting or two to get us started?
  • Is anyone interested in helping to organize a meeting schedule for the first few meetings? (I personally don’t have the time.)
  • Please also comment on how often you are interested and/or willing to meet. Monthly? Quarterly?


Here is a potential mission statement:

This group is for people in Silicon Valley to meet to discuss machine translation topics.
Members are interested in the design, application, integration, use, testing, selection, optimization, etc. of machine translation tools, and in sharing real-world experiences.

Meetings are webcast so anyone can participate. We encourage attending in person for networking and to have interactive and high quality discussions.

Possible discussion topics, led by members and invited speakers, vendors, etc.

  • Networking, Camaraderie
  • Learn about tools on the market
  • Selecting an MT tool, writing an RFP, testing, etc.
  • MT applications, integration
  • Optimizing processes using MT
  • Learn what others are doing with MT
  • Designing MT engines, Language issues, research topics
  • Vision for MT industry
  • Competitive solutions to MT (crowdsourcing, etc.)
  • MT data sources (TAUS, corpus, etc.)
  • Hosted MT solutions, MT cloud computing

Feel free to suggest others


Weighting Machine Translation Quality with Cost of Correction

2009/07/17

Machine translation (MT) was one topic yesterday as I enjoyed lunch with Ben Cornelius and Ray Flournoy of Adobe Systems. Of course, a key criterion for evaluating these tools is the quality of the translated output. Most people would say it is the most important criterion. Speed, cost, and integration with other tools are also significant.

However, evaluation of any tool should take the intended application into account. Some applications can use MT output directly. Many more require review and editing of the output before publication. Manual review and editing introduce labor costs and delays, which can be significant.

Therefore, when evaluating MT tools, we should look at the cost of operating the tool plus the cost of post-editing. Clearly, the optimum is 100% quality with no post-edits required. But this is not usually the case…

Not all editing tasks are the same. Some edits are easy to make and low cost to fix. Others are labor intensive. The entries that require editing may be obvious and therefore easy to find. Some may be subtle and require more intensive scrutiny to identify.

The post-editing needed by machine translation output will follow a pattern that varies with the MT engine (and its rules, or training, etc.). (Human authors also have a writing pattern requiring particular classes of edits.) Typographic and terminology substitution errors may be easy to address. Some grammar and style errors may be more costly. Consistency, flow and the relationship among sequences of sentences may be harder yet.

This suggests an interesting criterion for evaluating tools: the combined editing productivity and total operational cost of using the MT tool. An MT product that generates text needing edits that are both easy to find and easy to fix could have a very low total cost. Another tool producing higher-quality linguistic output might still be less productive if post-editing is difficult.

A good metric for MT tools would be to assign a weight proportional to the cost of fixing a problem to each class of error. A document could have 100 typos and be much cheaper to ready for publication than a document with only a few consistency or contextual errors that required thought and consideration to address.
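As a rough illustration of this weighting idea, here is a minimal Python sketch. The error classes and the per-error costs (in minutes) are hypothetical values chosen only to show the arithmetic; real weights would have to come from measuring your own post-editors.

```python
# Hypothetical cost (in minutes) to find and fix ONE error of each class.
# These numbers are illustrative assumptions, not measurements.
COST_PER_ERROR = {
    "typo": 0.25,
    "terminology": 0.5,
    "grammar": 2.0,
    "consistency": 10.0,
}


def weighted_edit_cost(error_counts, weights=COST_PER_ERROR):
    """Sum, over error classes, of (number of errors) x (cost to fix one)."""
    return sum(weights[kind] * count for kind, count in error_counts.items())


# A document with many cheap errors versus one with a few expensive errors.
doc_with_typos = {"typo": 100}
doc_with_context_errors = {"consistency": 4}

print(weighted_edit_cost(doc_with_typos))           # 25.0
print(weighted_edit_cost(doc_with_context_errors))  # 40.0
```

With these assumed weights, the document with 100 typos is cheaper to ready for publication than the one with four consistency errors, even though a raw error count would rank them the other way.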

This metric would also help with process configuration. For example, if I have to produce both Mexican and Iberian Spanish translations from English source material, I have several options.

If “>-MT->” represents a machine translation step, and “>-PE->” represents a post-edit step:

Option                | Step 1                      | Step 2
A Simple MT, then PE  | en >-MT-> mx; en >-MT-> es  | mx >-PE-> mx2; es >-PE-> es2
B mx to es            | en >-MT-> mx; mx >-MT-> es  | mx >-PE-> mx2; es >-PE-> es2
C mx post-edit to es  | en >-MT-> mx; mx >-PE-> mx2 | mx2 >-PE-> es; es >-PE-> es2
D es to mx            | en >-MT-> es; es >-MT-> mx  | es >-PE-> es2; mx >-PE-> mx2
E es post-edit to mx  | en >-MT-> es; es >-PE-> es2 | es2 >-PE-> mx; mx >-PE-> mx2

The scenario that is most effective is the one requiring the least editing. This may not correlate with unweighted measurements of each machine translator’s linguistic quality.

When I mentioned this, Ben recalled a demo by ProMT that the three of us attended recently. ProMT machine translation has a nice feature for managing placeholders used to represent program variables.

Here is an example sentence with two placeholders represented by an identifier in curly brackets.
“The file {0} contains {1} words.”

The filename and word count would be substituted at run-time for {0} and {1} respectively.
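To make the run-time substitution concrete, here is a small sketch using Python's built-in `str.format`, which happens to use the same `{0}`/`{1}` syntax. The filename and word count are made-up sample values, and the reordered template is an English stand-in for what a translated string might need to do.

```python
template = "The file {0} contains {1} words."

# Sample values; in a real program these would come from the application.
message = template.format("report.txt", 1250)
print(message)  # The file report.txt contains 1250 words.

# Because placeholders are numbered, a localized template is free to
# reorder them to suit the target language's word order.
reordered = "{1} words are in the file {0}."
print(reordered.format("report.txt", 1250))  # 1250 words are in the file report.txt.
```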

Many machine translation tools segment the text in between the placeholders rather than treating the placeholders as part of the syntax of the sentence. Therefore the placeholders are not properly addressed in the translated output. The problem is exacerbated by tools that convert markup tags to placeholders.

For example, according to Alex Yaneshevsky of ProMT, Idiom WorldServer converts:
<i>My name</i> <u> is </u> <b>Alex</b>
to:
{1}My name{2} {3} is {4} {5}Alex{6}

The resulting translation gives:
{1}{2}Меня зовут Alex{6}{3}{4}{5}

Even if you don’t read Russian, you can see that “Alex” should retain placeholders as “{5}Alex{6}”.

Post-editors must remove the original placeholders from where they are positioned in the text and insert placeholders into the correct locations. This would be a significant cost consideration for either software or markup localization.
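Misplacements like these can at least be flagged automatically before a post-editor sees the segment. The sketch below is a simple heuristic of my own, not a feature of ProMT or WorldServer. It assumes that placeholders come in open/close pairs (1, 2), (3, 4), (5, 6), as in the example above, and it reports any pair that is missing, reversed, or no longer wraps any text. A legitimate translation can sometimes leave a pair empty, so treat the result as a hint for review, not as proof of an error.

```python
import re

PLACEHOLDER = re.compile(r"\{(\d+)\}")


def misplaced_pairs(segment, pairs):
    """Return the (open, close) placeholder pairs that look misplaced."""
    # Map each placeholder number to its match (last occurrence wins).
    positions = {int(m.group(1)): m for m in PLACEHOLDER.finditer(segment)}
    problems = []
    for open_id, close_id in pairs:
        open_m = positions.get(open_id)
        close_m = positions.get(close_id)
        if open_m is None or close_m is None:
            problems.append((open_id, close_id))  # a placeholder was lost
        elif close_m.start() < open_m.end():
            problems.append((open_id, close_id))  # close appears before open
        elif not segment[open_m.end():close_m.start()].strip():
            problems.append((open_id, close_id))  # the pair wraps no text
    return problems


# Assumed pairing, taken from the tag conversion in the example above.
pairs = [(1, 2), (3, 4), (5, 6)]
source = "{1}My name{2} {3} is {4} {5}Alex{6}"
target = "{1}{2}Меня зовут Alex{6}{3}{4}{5}"

print(misplaced_pairs(source, pairs))  # []
print(misplaced_pairs(target, pairs))  # [(1, 2), (3, 4), (5, 6)]
```

Counting flagged pairs per segment would be one inexpensive input to the weighted cost metric discussed earlier.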

ProMT treats the placeholders as part of the sentence, resulting in better placement in the output. This simplifies post-editing and improves productivity.

(I am not commenting on ProMT translation quality. For this scenario their output significantly reduces post-editing cost.)

Ideally, machine translation would deliver 100% quality. However, if the quality is less than 100%, then evaluating the combination of machine translation and post-editing effort is a more useful measure than selecting tools or configuring workflow based on quality metrics alone. Higher quality might be irrelevant if the text is more challenging for the human post-editor to correct.