Recently, I've been talking at AMTA and GlobalSake about how our technology at ContentQuo has helped scale human quality evaluation of Machine Translation output, and a long-standing thought has crossed my mind again:
Machine Translation models are, from a certain perspective, very similar to translation vendors.
Here are some reasons:
Since they are so similar, why wouldn't we apply some of the well-established (Translation) Vendor Management methods, approaches, and processes also to our Machine Translation Management practice?
Here's one possible blueprint for this:
VM: All Vendor Managers know that before you can get any translation done on time and at the right cost, you must have a wide enough pool of vendors you can reach out to at any time -- not just for the translations you need done today, but also for the work your team might need a month or two from now.
For commercially popular language pairs, you often have a huge choice of possible vendors to draw upon, and it's easy to find them online -- e.g. on job boards like ProZ or marketplaces like Smartcat. For less popular languages, there are fewer vendors available and they get harder and harder to source. Different vendors also have different specializations -- e.g. many choose to focus on specific domains/subject matters and/or certain types of services. As an enterprise, you need a strong mix of many of those in order to satisfy all your global content production needs -- which means you, or your LSP, work with many vendors at a time.
MT: The landscape of Machine Translation has been evolving at lightning speed since Neural MT went to market in 2015. In 2022 there are over 40 distinct companies offering Machine Translation models in more than 100,000 (yes, one hundred thousand!) language pairs in total. That's a lot of potential (Machine) translation vendors!
In traditional VM, you would rarely be able to source all the languages and/or content types from just one vendor without compromising on price, quality, or reliability. The same applies to Machine Translation VM -- when you are at the earliest stage of your MT implementation, considering only the most well-known names such as Google and DeepL means that you leave 43 other potential vendors on the table without even exploring what benefits they could offer!
Insight: When just starting a new MT implementation, consider the entire range of MT engine providers on the market! The choice in 2022 is tremendous, and you are almost guaranteed to miss out on both quality and cost savings if you only look at Google.
VM: No sane Vendor Manager would make a decision to onboard -- or not onboard -- a new vendor without a resume check. But they equally won't make that decision based ONLY on a look at the resume. After all, education is not a replacement for experience, and some of the best translators out there have never been formally educated as such.
The reverse is also true: just being a bilingual subject matter expert does not make you a great professional translator, no matter what your relatives or friends say. You have to figure out how well a candidate actually works, what kind of quality they can deliver, and which subject areas or content types they could be your #1 vendor for (as opposed to #17).
Commonly, this is done via test translations (paid or free), which are then carefully and objectively evaluated by senior linguists before a "yay" or "nay" decision is taken by the Vendor Manager (typically while also considering other factors such as rates, communication, etc.). Savvy VM teams apply advanced quality evaluation protocols such as Error Annotation (e.g. with an MQM-based error typology) for those tests, while in less mature teams and/or lower-stake situations, more informal evaluation methods (e.g. holistic evaluation of the entire test) might be used.
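For illustration, here's a minimal sketch of how an MQM-style weighted score could be rolled up from annotated errors. The error categories, severity weights, and per-1,000-words normalization below are simplified assumptions, not the full MQM scoring model.

```python
# Minimal sketch of an MQM-style weighted error score for a test translation.
# Severity weights and the per-1,000-words normalization are simplified
# assumptions, not the official MQM scoring model.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors, evaluated_word_count, per_words=1000):
    """Return a penalty score normalized per `per_words` words (lower is better)."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return penalty / evaluated_word_count * per_words

# Example: a 450-word test translation with three annotated errors
annotated_errors = [
    {"category": "Accuracy/Mistranslation", "severity": "major"},
    {"category": "Fluency/Grammar", "severity": "minor"},
    {"category": "Terminology/Wrong term", "severity": "minor"},
]
print(round(mqm_style_score(annotated_errors, evaluated_word_count=450), 2))  # 15.56
```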
MT: After sourcing enough candidate Machine Translation providers, teams typically face the challenge of screening & testing them in a smart & efficient way in order to shortlist the best-performing engines for the next stage. Since there is no CV for MT engine providers (yet), your best bet for speeding up screening is to look at publicly available reports such as Intento's State of MT -- they spend hundreds of thousands of $$$ every year on high-level quality evaluations of the most popular stock (generic) MT models across many language pairs and domains, using sample content. With this kind of report, you can easily shortlist the most promising 10 out of 40 engine vendors based on your language pairs and subject matter alone.
However, none of these reports will ever apply to YOUR organization's content -- your marketing content is (hopefully) NOT the same as another company's marketing content, and your support content might not be that similar to their support content either! So, just like any sane Vendor Manager, you should never skip the Testing step, no matter how good the MT engines' CVs look. Thankfully, there are highly cost-effective ways to run quality tests on many pieces of MT output from different engines -- automatic metrics are frequently used at this stage, while lightweight human evaluation methods such as Adequacy-Fluency (a rating-scale method) can complement them well and give you more assurance that you're heading in the right direction without breaking the bank.
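As a rough illustration, here's a minimal sketch of how such an automatic-metric comparison might look, assuming the open-source sacrebleu library is installed; the engine names and sentences are placeholders, not real evaluation data.

```python
# Minimal sketch: rank candidate MT engines on a sample of YOUR content with
# an automatic metric (chrF via the open-source sacrebleu library).
# Engine names and sentences are placeholders.
from sacrebleu.metrics import CHRF

references = [
    "Please restart the device before contacting support.",
    "Your subscription renews automatically every month.",
]

engine_outputs = {
    "engine_A": [
        "Please restart the device before you contact support.",
        "Your subscription is renewed automatically each month.",
    ],
    "engine_B": [
        "Restart device before contacting the support.",
        "Your subscription renews automatic every month.",
    ],
}

chrf = CHRF()
scores = {
    engine: chrf.corpus_score(outputs, [references]).score
    for engine, outputs in engine_outputs.items()
}

# Higher chrF is better -- but use this only to shortlist engines for human
# evaluation (e.g. Adequacy-Fluency ratings), not as a final verdict.
for engine, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{engine}: chrF = {score:.1f}")
```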
Insight: Take time to properly screen AND quality-test all the potential MT vendors just like VMs do -- don't blindly trust their reputation or generic market analysis reports! If you don't have enough in-house expertise, engage an MT implementation company and/or specialised Linguistic Quality Evaluation tools to help you.
VM: After a candidate vendor has successfully passed screening and testing, the most crucial part begins: you need to help your new partner start performing effectively on your translations. Practice shows that even the most senior translators need some training when they start working on a new company's content or in a somewhat different domain.
Smart VMs invest in their vendors especially heavily during this stage by making sure they receive clear, detailed, objective, and actionable feedback on their translations frequently, and also have a chance to talk back & object if, for example, a reviewer oversteps their authority. Only at the end of this stage, after gathering enough well-structured data, can a VM reliably conclude that a vendor is a great fit for her organisation & deserves to remain in her vendor pool.
MT: Similarly to traditional vendors, MT vendors can significantly increase the quality of their output after proper training. While the methods of training machine translators do (somewhat) differ from how we train human translators, one thing remains constant: the need to meticulously evaluate how the machine's output quality changes after training. Since with Neural MT it's so hard to predict what will happen after putting in more training data, cleaning the data, or applying terminology, there is only one way to find out: roll up your sleeves and assess the output.
At this stage, you typically also want to use a combination of automatic metrics (these are fast & free) and human evaluation (more detailed and objective methods like Error Annotation) -- either to specifically inspect post-training differences in MT output, to run full blind quality tests, or to combine approaches by pairing post-editing lab tests with annotations. Only at the end of this stage can an MT manager arrive at a balanced & objective decision on which mix of MT vendors deserves to be onboarded and deployed for actual use.
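Here's a minimal sketch of one way to inspect post-training differences: score the baseline and the trained engine segment by segment with an automatic metric and route the worst regressions to human evaluators. It assumes sacrebleu is installed; the segments are placeholders.

```python
# Minimal sketch: compare a stock (baseline) engine with its trained version
# segment by segment, and send the biggest regressions to human evaluators
# for error annotation. Assumes sacrebleu; all segments are placeholders.
from sacrebleu.metrics import CHRF

chrf = CHRF()

references = [
    "Open the settings menu and select your preferred language.",
    "The warranty does not cover accidental damage.",
]
baseline_outputs = [
    "Open the settings menu and choose your preferred language.",
    "The guarantee does not cover accidental damages.",
]
trained_outputs = [
    "Open the settings menu and select your preferred language.",
    "The warranty does not covering accidental damage.",
]

deltas = []
for ref, base, trained in zip(references, baseline_outputs, trained_outputs):
    delta = chrf.sentence_score(trained, [ref]).score - chrf.sentence_score(base, [ref]).score
    deltas.append((delta, ref, base, trained))

# Segments where training HURT quality the most go to human evaluators first.
for delta, ref, base, trained in sorted(deltas)[:20]:
    print(f"{delta:+.1f}\nREF: {ref}\nOLD: {base}\nNEW: {trained}\n")
```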
Insight: Training on your in-domain bilingual data has a great impact on MT quality, so make sure to train your screened and tested MT engines to the best of your ability! Only then should you perform final, detailed assessments of the output quality so you can make the best possible decision for your MT vendor pool.
VM: Junior vendor managers tend to think that after onboarding & initial training, their job is done -- let's just move on to the next one. Real VM experts, on the other hand, know that onboarding is only the beginning of a long buyer-vendor partnership! Without a continuous, solid process for regularly (and ideally, randomly) inspecting how your well-onboarded, well-trained suppliers are ACTUALLY performing on real-life translation assignments, it's all too easy to get complacent and run your team (or perhaps even your entire company) into big trouble: a vendor that looked brilliant after the onboarding stage can underperform spectacularly under pressure without proper supervision.
"Trust but verify" is the best motto for this: after all, to change is human, and not all change is for the better. Many circumstances impact translator vendor performance, so all mature translation organizations have perfected this process of regular quality assessments and structured feedback. Not only does it help you keep quality risk at bay -- it also helps you retain your best vendors if you do it well (after all, not getting enough useful & actionable feedback on your work is often thought of as one of the most frustrating parts of the translation profession), but also regularly prune your vendor pool by getting rid of chronic underperformers.
MT: Same as with human vendors, machine translation vendors change -- probably even more often than human ones do! Even if you do not retrain your own engines (which you should absolutely do regularly, unless you're leveraging the emerging "Adaptive MT" technologies), and even if you only use stock engines, MT vendors constantly work to acquire more data and retrain their baseline models! They also roll out algorithm updates that could, in severe cases, break their engine's output and wreak havoc on YOUR translated content (even if all other content is seemingly OK).
Same as with human vendors, the old "Trust but verify" adage applies, but with 10x the importance: because of the sheer volume your MT engines process on a regular basis, a tiny change in MT engine output has the potential to cause catastrophic consequences for your company. Here, regular quality assessments act as insurance -- while they of course cannot prevent those mistakes from happening, at least you stay in control of any changes, can determine how major or minor the quality differences are, and can take proper action to either capitalize on new capabilities (if quality improves), roll back your training, or even switch to a backup MT vendor (if quality drops sharply)!
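As one possible shape for this "insurance", here's a minimal sketch of a recurring regression check against a fixed probe set; the 2-point chrF alert threshold and the engine data are assumptions you'd tune for your own content, and any alert should trigger human evaluation rather than automatic action.

```python
# Minimal sketch of a recurring MT quality regression check: score a fixed
# "probe" test set against stored references and compare with the last known
# baseline score. The 2-point chrF threshold is an arbitrary assumption --
# tune it per language pair and always confirm alerts with human evaluation.
from sacrebleu.metrics import CHRF

ALERT_THRESHOLD = 2.0  # chrF points (assumption)

def check_engine(engine_name, probe_outputs, probe_references, baseline_score):
    score = CHRF().corpus_score(probe_outputs, [probe_references]).score
    delta = score - baseline_score
    if delta <= -ALERT_THRESHOLD:
        print(f"[ALERT] {engine_name}: chrF dropped by {abs(delta):.1f} -- "
              f"trigger human evaluation; consider rollback or a backup engine")
    elif delta >= ALERT_THRESHOLD:
        print(f"[INFO] {engine_name}: chrF improved by {delta:.1f} -- "
              f"candidate for wider deployment after human review")
    else:
        print(f"[OK] {engine_name}: chrF {score:.1f} (baseline {baseline_score:.1f})")
    return score

# Example call with placeholder data and a made-up baseline
check_engine(
    "engine_A",
    probe_outputs=["Your subscription renews automatically every month."],
    probe_references=["Your subscription renews automatically every month."],
    baseline_score=85.0,
)
```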
Insight: Without regular quality assessments of your MT engine output, you risk missing situations where engine quality either degrades sharply (and you have to fix it ASAP) or improves noticeably (and you can apply the engine in more situations and for more content types). Be proactive -- don't wait for bad feedback from users or customers!
Any other approaches that teams implementing Machine Translation could learn from Vendor Managers?