Today’s blog post will try to provide some insight into building a custom MT environment with Systran Training Server 7. Are you ready? Here goes:
Background: French company Systran was the first MT vendor in the world and has been around since the 1960s. They were the biggest and best for a long time until the big boys decided to jump into the arena. Yahoo Babelfish is based on Systran technology, Google used it for a while, and Systran was bundled into Mac OS X for years. Relying on outdated rule-based technology, Systran lost its primacy when companies such as Google and Microsoft started to give away statistical MT for free.
Hybrid MT Training: With all of this new competition, Systran was in a bind. Their answer was the introduction of Enterprise Server 7, a hybrid RbMT and SMT system. Without getting too technical (which I can’t anyway, since I am not that knowledgeable about the internal workings of the system), you start with the Systran baseline, which is pretty good to begin with. You then enhance the quality of the MT engine by training it statistically with your own text corpora. You can also enhance the MT engine by adding new rules and dictionary entries (see a blog post I wrote last year about this). The result is a Translation Model which is applied to the MT engine.
System Description: Here are some system highlights, accompanied by some selected screenshots:
The corpus management module enables you to upload and maintain your training corpora. It allows you to create virtual files by joining several TMs. It can also automatically split your corpus into 3 parts: a training corpus, a testing corpus and a tuning corpus. It supports TMX, TXT, XML, XLIFF and some other file formats. We found that the TMX import works well; you may need to experiment with the various filters at the beginning.
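To make the three-way split concrete, here is a minimal Python sketch of the general idea. It is not how Systran does it internally (I have no visibility into that); the function names `read_tmx_pairs` and `split_corpus`, the split sizes and the sample TMX are all my own assumptions for illustration.

```python
import random
import xml.etree.ElementTree as ET

# xml:lang lives in the XML namespace; some older TMX files use a plain "lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Language codes are matched exactly (case-insensitively), so "en" will
    not match "en-US" in this simplified sketch.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def split_corpus(pairs, tune_size, test_size, seed=0):
    """Shuffle the pairs reproducibly and cut them into train, tune and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    tune = shuffled[:tune_size]
    test = shuffled[tune_size:tune_size + test_size]
    train = shuffled[tune_size + test_size:]
    return train, tune, test


# Hypothetical five-unit English-French TM, built inline so the sketch is self-contained.
SAMPLE_UNITS = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
    ("Good evening", "Bonsoir"),
    ("See you soon", "À bientôt"),
    ("Good night", "Bonne nuit"),
]
SAMPLE_TMX = "<tmx version=\"1.4\"><header/><body>" + "".join(
    f"<tu><tuv xml:lang=\"en\"><seg>{en}</seg></tuv>"
    f"<tuv xml:lang=\"fr\"><seg>{fr}</seg></tuv></tu>"
    for en, fr in SAMPLE_UNITS
) + "</body></tmx>"

pairs = read_tmx_pairs(SAMPLE_TMX, "en", "fr")
train, tune, test = split_corpus(pairs, tune_size=1, test_size=1)
```

The important property is that the three sets do not overlap: if test segments leak into the training set, the scores you get later will look better than the engine really is.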
The Training Manager
Baseline evaluation. Before you train your MT, you can do a baseline evaluation to see how good your training corpus is. After the baseline evaluation is done, the system displays scores using various scoring methods (including the BLEU score). In our experience, running a baseline evaluation takes nearly as long as the training process itself. So it may be better to go straight to the training process and evaluate the BLEU score (and the MT itself).
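For readers who have not met BLEU before, here is a rough Python sketch of what the score measures: how many word sequences (n-grams) of the MT output also appear in a reference translation, with a penalty for output that is too short. This is a simplified textbook version with one reference per segment, whitespace tokenization and no smoothing; the function names are mine, and the numbers Systran reports may be computed differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0..1 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, a single n-gram order with zero matches gives zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


mt_output = ["the cat sat on the mat"]
reference = ["the cat sat on a mat"]
score = corpus_bleu(mt_output, reference)
```

A perfect match scores 1.0 and output that shares no words with the reference scores 0.0. The score only tells you how close the MT is to your reference translations, which is why I suggest reading the MT itself as well.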
Hybrid Training: this is the process that trains the baseline RbMT system with your training corpus.
Statistical Training: this is a pure SMT training process which lets you build new language pairs even when there is no Systran RbMT baseline. Think of it as Moses with a high-level GUI. I can’t comment on the value of this training mode as we did not use it.
Resource Extraction: Systran Training Manager can create a user dictionary from your training corpus. This is an important piece of the Hybrid training puzzle: not only can you train the baseline with your corpus, you can also automatically create dictionaries that will further enhance the quality of the engine.
Dictionary Validation: once you have created your user dictionaries, you can run an evaluation which will validate the correct entries and apply them to the translation model.
Document Alignment: this one is pretty neat! Think of it as Trados WinAlign but with fewer headaches. All you need to do is upload the corpora and Training Manager does the rest. You can export and download the TM created in the alignment process as a TMX file and import it into a larger TM.
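If you want to post-process aligned segments yourself before importing them into a larger TM, a TMX file is easy to produce with standard tools. The following Python sketch writes sentence pairs as a minimal TMX 1.4 document and drops exact duplicates along the way. The function name `pairs_to_tmx` and the header values are placeholders of my own, not anything Training Manager emits.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) pairs as a minimal TMX 1.4 string.

    Exact duplicate pairs are written only once.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical placeholder values
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    seen = set()
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


aligned = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
    ("Hello", "Bonjour"),  # duplicate, written only once
]
tmx_text = pairs_to_tmx(aligned, "en", "fr")
```

Deduplicating before the import keeps the larger TM from filling up with repeated units, which matters if you later use that TM as training data.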
Systran provides a robust, powerful solution that will enable you to create a custom MT workspace. The hybrid training works well and provides good results. As with all Systran products, the UI is very pleasing to the eye, intuitive and easy to use. Documentation is skimpy though. I never understood why a User Guide needs to have screenshots which you can see anyway (if you own the software), while offering little explanation of what to do with the screens or how to use the software.
This system is not for the faint-hearted. If you are expecting an out-of-the-box solution which will magically create killer MT engines in a short time, you will be sorely disappointed. You will need to have lots of patience, an abundance of resources and a dedicated person to run the training server; preferably someone with training in computational linguistics.
The system is not cheap. The software costs plenty. And you will need a powerful, dedicated server with at least 12 GB of RAM to run it. That alone can run to thousands of dollars a year.
The system is temperamental. Some tasks work fine while others don’t. Fortunately, Systran has a great support team who are nice to work with and can solve any of your problems.
Is Systran Training Server 7 a good solution for your organization? Only you can decide. Choosing the path to take in your MT journeys is a long-term decision and one not to be taken lightly.