
slimane-msb/Translation-Selection

 
 


This project presents a pipeline for identifying the optimal translation among a set of candidate translations. The pipeline leverages n-gram language model features for translation selection and uses quality estimation (QE) as the metric for picking the best candidate, relying on QuEst++, a QE tool developed by Lucia Specia's team at the University of Sheffield. QuEst++ predicts a translation score by extracting features from both the source and target sentences. To obtain an efficient machine learning setup for the pipeline, the learning module of QuEst++ is used to train models and tune their hyperparameters \cite{qe}. Selecting the best translation from multiple candidates in this way is expected to improve the overall quality of the translation process.

Introduction

This method has demonstrated its effectiveness in various natural language processing tasks, including post-editing, which involves reviewing and correcting machine-translated text to produce an accurate, high-quality translation that conveys the intended meaning. It is particularly useful for selecting the best translation from multiple candidates generated by different MT systems, or by the same system. Because of the probabilistic nature of natural language processing, ranking the outputs of different MT systems is challenging, and quality estimation (QE) has emerged as the most suitable tool for this task compared to other metrics. In recent years, several institutions, including the Sheffield NLP research group, LIMSI Paris-Saclay, IRIS Sorbonne, and several open-source projects, have developed QE tools. In this project we focus on QuEst++, the QE tool developed by Lucia Specia's team at the University of Sheffield. Pearson's correlation analysis indicates that QE scores correlate better with human annotations than other metrics do. Additionally, the differences are statistically significant at the 99.8% confidence level, as determined by bootstrap resampling and a paired t-test.

$$\displaystyle r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{\left[ n\sum x^2 - (\sum x)^2 \right] \left[ n \sum y^2 - (\sum y)^2 \right]}}$$

where x is the score assigned by a translation expert, y is the predicted output of QE, and n is the number of scored sentence pairs.
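
As a quick illustration, Pearson's r can be computed directly from this formula. The sketch below uses hypothetical expert and QE scores, not data from this project:

import math

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

expert = [4.0, 2.5, 5.0, 3.0]   # hypothetical human scores
qe_pred = [3.8, 2.9, 4.7, 3.2]  # hypothetical QE predictions
print(pearson_r(expert, qe_pred))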

Sentence level

Quality Estimation (QE) can be applied at different levels, including word level, sentence level, and document level; each level corresponds to the unit that is assigned a score. In word-level QE, each translated word is assigned a score, while in sentence-level QE each sentence is ranked, for instance to determine which sentences require more effort or time to correct. In this paper, we focus on sentence-level QE in machine translation (MT). Sentence level is considered the most effective level because of the importance of context in language. For example, DeepL translates the title of a French article by François Yvon, "Cet article présente un survol de l'état de l'art en traitement automatique des langues", into English as "This article presents an overview of the state of the art in natural language processing". Although "natural" is not the literal translation of "automatique", it is selected because it fits the meaning and context of the whole sentence, which is precisely the kind of information sentence-level context provides. Document-level QE, on the other hand, can be useful for summarizing and for checking that a summary or paraphrase retains the meaning of the document; sentence-level QE is still needed first to obtain a good translation before the meaning is verified in that final step.

QuEst++ framework

QuEst++ is an open-source pipelined Translation Quality Estimation (TQE) tool and a new release of QuEst. It consists of two independent modules: a Feature Extractor Module, developed in Java, and a Machine Learning Module, developed in Python.

Features

The Feature Extractor Module is a crucial component in the quality estimation of machine translation. It processes the raw input data and extracts a set of features that are relevant and meaningful for the TQE task. These features are used as inputs to the TQE model, which makes predictions about the quality of the machine translation output. The quality of the feature extraction module plays a significant role in the performance of the TQE model. Therefore, it is essential to design and tune this component carefully to ensure that the features accurately reflect the quality of the translation.

The features extracted by the Feature Extractor Module include:

  • Number of tokens in the source and target sentences, and their ratio

  • Language model probability of the source and target sentences

  • Ratio of punctuation symbols in the source and target sentences

  • Ratio of the percentage of numbers in the source and target sentences

  • Content/non-content words and nouns/verbs/etc. in the source and target sentences

  • Proportion of dependency relations between (aligned) constituents in the source and target sentences

  • Difference in depth of the syntactic trees of the source and target sentences.

These features are specified in the Feature Configuration File.

Input and Output files

QuEst++ takes as input a file containing one sentence per line and returns as output a file containing the values of each feature, separated by tabs. Each row corresponds to a sentence and each column to a feature, as in the sample below.

16.0	22.0	4.125	1.1	107.875	2.049204	0.0	0.625	0.066666
18.0	18.0	4.0	1.0	306.3889	2.5749147	0.0	0.722222
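
A minimal sketch of consuming this output in Python, assuming a hypothetical path output/features.tsv:

# Read Quest++ feature output: one sentence per row, tab-separated values
rows = []
with open("output/features.tsv") as f:
    for line in f:
        values = [float(v) for v in line.strip().split("\t")]
        rows.append(values)
print(len(rows), "sentences,", len(rows[0]), "features in the first row")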

N-gram features

"NGram" (or "N-gram") is a statistical language model that is widely used in natural language processing (NLP) and computational linguistics. An N-gram is a contiguous sequence of N items (usually words or letters) from a given text or corpus, that is used to predict the likelihood of a word given its preceding N-1 words

N-gram features are frequently employed in Quality Estimation (QE) to capture the local context of words in a sentence. This involves extracting n-gram statistics from the source and target sentences; these features represent the local context of words and provide valuable information about the language model probability and fluency of the sentences. N-gram features were initially extracted with SRILM, but KenLM is now used in its place because it computes the probabilities of the source and target sentences faster and more memory-efficiently. KenLM is discussed in detail in a later section.
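
A minimal sketch of querying a KenLM model from Python via its kenlm bindings (pip install kenlm); the model path target.arpa is a placeholder, not a file shipped with this project:

import kenlm

model = kenlm.Model("target.arpa")  # hypothetical path to a trained model
sentence = "this article presents an overview"
# score() returns the total log10 probability of the sentence
print(model.score(sentence, bos=True, eos=True))
print(model.perplexity(sentence))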

Build and run

ant "-Dplatforms.JDK_1.8.home=/usr/lib/jvm/java-8-<<version>>"

java -cp QuEst++.jar shef.mt.SentenceLevelFeatureExtractor -tok -case true -lang 
english spanish -input input/source.sent-level.en input/target.sent-level.es 
-config config/config.sentence-level.properties
  • "-cp QuEst++.jar": specifies the classpath where the jar file is located

  • "shef.mt.SentenceLevelFeatureExtractor": the main class to be executed

  • "-tok": flag to specify that token-level features should be extracted

  • "-case true": flag to specify that case features should be extracted

  • "-lang english spanish": flag to specify the languages of the input files as English and Spanish

  • "-input input/sourceinput/target": specifies the input files as source.sent-level.en and target.sent-level.es under the "input" directory

  • "-config config/config.sentence-level.properties": specifies the location of the configuration file, "config.sentence-level.properties", under the "config" directory.

Machine Learning module

The mission of the machine learning module in QuEst++ is to provide support for various scikit-learn algorithms to improve the system’s ability to learn and make predictions. The module is implemented in Python and is located in the learning package of the QuEst++ main package.

python src/learn_model.py config/model.cfg

Optimization of parameters

To optimize the parameters of the machine learning algorithms used for quality estimation (QE) of machine translation (MT) output, cross-validation was employed: the training set is divided into five sub-samples, and models are trained and validated across the folds. If no parameters are specified, the code builds an estimator with default values. This procedure produced an effective model that accurately predicted the quality of the machine translation output.
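
A minimal sketch of this 5-fold cross-validation, written against scikit-learn directly rather than through learn_model.py; the feature and score paths are hypothetical:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# hypothetical paths: tab-separated features and one gold score per line
X = np.loadtxt("output/features.tsv")
y = np.loadtxt("output/scores.txt")

# try a few regularization strengths with 5-fold cross-validation
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)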

KenLM language model

KenLM is an open-source language modeling library that is widely used for natural language processing (NLP) tasks. It is designed for building, querying, and scoring n-gram language models quickly and efficiently, and it implements Kneser-Ney smoothing, a widely used method for smoothing n-gram probabilities in language models to reduce the impact of data sparsity. A bigram model, for instance, approximates the probability of a word sequence as

$$P(w_{1:n}) \approx \prod_{k=1}^{n} p(w_k|w_{k-1})$$

Probability is important in translations because it can help identify the most likely or probable translation for a given text or phrase. Machine translation systems, for example, often use probabilistic models to determine the most likely translation based on the probability of certain words or phrases appearing in a particular language. This can help improve the accuracy and quality of translations, particularly in cases where there are multiple possible translations or when dealing with ambiguous language. Additionally, understanding probabilities can also help human translators make more informed decisions when making translation choices.
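
A minimal sketch of the bigram approximation above, with hypothetical conditional probabilities rather than values from a trained model:

# toy bigram table of p(w_k | w_{k-1}); probabilities are hypothetical
bigram_prob = {
    ("<s>", "this"): 0.2,
    ("this", "article"): 0.1,
    ("article", "presents"): 0.05,
}

def sentence_prob(words, probs):
    p = 1.0
    # multiply p(w_k | w_{k-1}) over the sentence, starting from <s>
    for prev, cur in zip(["<s>"] + words, words):
        p *= probs.get((prev, cur), 1e-6)  # small floor for unseen bigrams
    return p

print(sentence_prob(["this", "article", "presents"], bigram_prob))  # 0.2 * 0.1 * 0.05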

Sampling sentences from a language model

Sampling sentences from a language model is a process of generating new sentences or text by randomly selecting words based on the probabilities predicted by the language model.
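
A minimal sketch of this sampling process for a toy bigram model; the probability table is hypothetical:

import random

# for each previous word, candidate next words and their probabilities
next_word = {
    "<s>": (["the", "a"], [0.6, 0.4]),
    "the": (["cat", "dog"], [0.5, 0.5]),
    "a": (["cat", "dog"], [0.3, 0.7]),
    "cat": (["</s>"], [1.0]),
    "dog": (["</s>"], [1.0]),
}

word, sentence = "<s>", []
while word != "</s>":
    candidates, weights = next_word[word]
    word = random.choices(candidates, weights=weights)[0]
    if word != "</s>":
        sentence.append(word)
print(" ".join(sentence))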

Generalization and Zeros

To build a language model for MT systems, we need a training corpus. If, however, any word in the test set has probability 0, the probability of the entire test set is 0; as a result, we use smoothing or discounting to avoid the zero-probability problem.

Unknown Words

To handle out-of-vocabulary words, we can implement the unknown-word method, in which rare words in the training data are replaced by <UNK> based on their frequency. This technique, however, may give the language model a lower perplexity (a measure of how well the model predicts the next word in a sequence) simply because it works with a smaller vocabulary and assigns a higher probability to unknown words. In other words, the model may appear to perform better by limiting the range of possible words and assuming that unknown words are likely to occur.
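
A minimal sketch of this replacement, using a hypothetical frequency threshold:

from collections import Counter

def replace_rare(sentences, min_count=2):
    # count word frequencies, then replace words below the threshold with <UNK>
    counts = Counter(w for s in sentences for w in s.split())
    return [" ".join(w if counts[w] >= min_count else "<UNK>" for w in s.split())
            for s in sentences]

corpus = ["the cat sat", "the dog sat", "a zebra ran"]
print(replace_rare(corpus))
# ['the <UNK> sat', 'the <UNK> sat', '<UNK> <UNK> <UNK>']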

Kneser-Ney Smoothing

Kneser-Ney Smoothing is a language modeling technique that aims to improve the accuracy of language models by adjusting the probability estimates of words based on their frequency of occurrence in the training data.

The main idea behind Kneser-Ney smoothing is to estimate the probability of a word not only from its raw counts in a given context but also from how many distinct contexts the word continues. Specifically, the technique subtracts a fixed discount d from the count of each observed n-gram (sequence of n words) and redistributes the discounted probability mass, via the weight λ, to this continuation probability:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(C(w_{i-1}w_i)-d,\,0\right)}{C(w_{i-1})} + \lambda(w_{i-1})\,P_{\text{CONTINUATION}}(w_i)$$
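
A minimal sketch of interpolated Kneser-Ney for bigrams, following the formula above with a fixed discount d on a toy corpus (not the project's data):

from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # counts of words as bigram histories
d = 0.75                         # hypothetical fixed discount

def p_continuation(w):
    # distinct words preceding w, over all distinct bigram types
    return sum(1 for (a, b) in bigrams if b == w) / len(bigrams)

def p_kn(w, prev):
    discounted = max(bigrams[(prev, w)] - d, 0) / unigrams[prev]
    # lambda redistributes the mass discounted from bigrams starting with prev
    lam = (d / unigrams[prev]) * sum(1 for (a, b) in bigrams if a == prev)
    return discounted + lam * p_continuation(w)

print(p_kn("sat", "cat"), p_kn("sat", "dog"))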

ARPA

ARPA files are plain-text files used to store back-off n-gram language models in a compact and efficient way. They are widely used in natural language processing applications and are compatible with many popular language modeling toolkits, including KenLM and SRILM.

The \data\ and \end\ markers indicate the beginning and end of the ARPA file, respectively. The lines immediately after the \data\ marker give the number of n-grams of each order (unigrams, bigrams, and so on). Each n-gram entry then lists its log10 probability, the n-gram itself, and an optional back-off weight, as in the bigram excerpt below:

-2.133817	Now look	-0.1207087
-1.864134	Now many
-2.811212	Now materialistic
-2.811212	Now only

KenLM requires the corpus of text to be in a specific format. Each line of the corpus should contain a single sentence, with words separated by spaces. The corpus should also be preprocessed to remove any special characters, punctuation, and other noise.
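
A minimal sketch of such preprocessing, with hypothetical file names; the model itself would then be built with KenLM's lmplz tool (e.g. lmplz -o 3 < corpus.clean.txt > model.arpa):

import re

with open("corpus.txt") as src, open("corpus.clean.txt", "w") as dst:
    for line in src:
        # keep only word characters and spaces, then collapse repeated whitespace
        clean = re.sub(r"[^\w\s]", " ", line.lower())
        dst.write(" ".join(clean.split()) + "\n")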

WMT datasets

Training data: A large set of up to five alternative machine translations produced by different MT systems for each source sentence and ranked for quality by humans. This is the outcome of the manual evaluation of the translation task from WMT09-WMT12. It includes two language pairs: German-English and English-Spanish, with 7,098 and 3,117 source sentences and up to five ranked translations, respectively.
Test data: A new set of up to five alternative machine translations per source sentence. Note that there will be some overlap between the MT systems used in the training data and the test data, but not all systems will be the same.

Evaluation for each language pair will be performed against human rankings of pairs of alternative translations, using the overall Kendall's tau correlation (a weighted average) as the metric.
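
As an illustration, Kendall's tau between a hypothetical human ranking and a system ranking of five candidate translations can be computed with scipy:

from scipy.stats import kendalltau

human_rank = [1, 2, 3, 4, 5]   # hypothetical human ranking
system_rank = [2, 1, 3, 4, 5]  # hypothetical system ranking
tau, p_value = kendalltau(human_rank, system_rank)
print(tau)  # 1.0 means identical orderings, -1.0 fully reversed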

Results

To evaluate the performance of the different machine learning algorithms and feature sets implemented for quality estimation of machine translation systems, the Root Mean Square Error (RMSE) metric was used. Unlike other error metrics, RMSE places more emphasis on large errors because it squares the difference between the predicted and true values, thereby magnifying the effect of larger errors. This matters because, for an MT system, many sentences with small errors are preferable to occasional meaningless translations. Thus, RMSE is an appropriate choice for evaluating the effectiveness of different approaches to quality estimation in MT.

$$rmse = \sqrt{\frac{1}{n} \times \sum \left( y_{pred} - y_{true} \right) ^2}$$

where n is the number of sentences, $y_{pred}$ is the predicted score, and $y_{true}$ is the real score. The difference $(y_{pred} - y_{true})$ is squared to ensure that the error is always positive and that larger errors are given more weight. The average is taken over all the data points in the dataset, and the square root is taken to obtain the final RMSE value.
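
A minimal sketch of this computation, with hypothetical predicted and true scores:

import math

def rmse(y_pred, y_true):
    # mean squared difference between predictions and gold scores, then square root
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

print(rmse([3.8, 2.9, 4.7], [4.0, 2.5, 5.0]))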

Based on the results in the tables below, the Ridge algorithm was found to be the most effective for selecting the best translation; it was optimized and tuned through the QuEst++ learning module. It is worth noting, however, that the choice of ML algorithm depends on the size and genre of the data sets and on the corpus of text from which the features are extracted. QuEst++ offers an efficient way to test various scikit-learn models and, as a result, to obtain better final scores for translations.

ML algorithm    Ridge    LassoLars    SVR
RMSE            0.850    0.989        0.898

Table 1: RMSE per ML algorithm.

Features    with n-gram probability    without n-gram probability
RMSE        0.898                      1.671

Table 2: Importance of the n-gram probability features.

Conclusion

In this study, we have introduced an effective pipeline for translation selection and demonstrated the effectiveness of the QuEst++ learning module in performing additional feature extraction tasks. Specifically, we explored the relationship between the probability of a sentence and its length, and found that adjusting the sentence probability for length did not affect the RMSE. Moreover, QuEst++ can be used not only for translation selection but also for system combination, a technique that combines the outputs of multiple machine translation systems to improve accuracy and reliability. However, QuEst++ has limitations in capturing long-term dependencies and handling out-of-vocabulary words. To address these limitations, in the next paper we will compare QuEst++ with TransQuest, a Python library that fine-tunes state-of-the-art transformer-based models for quality estimation, sequence classification, question answering, and sequence tagging. TransQuest outperforms open-source quality estimation frameworks such as OpenKiwi and DeepQuest and is built on top of the Hugging Face Transformers library.

Acknowledgments

I would like to express my sincere gratitude to my supervisor, François Yvon, for his invaluable guidance, encouragement, and support throughout the course of this project. I would also like to extend my thanks to Mark Evrard, my machine learning teacher, and to Francois Lande for their insights and contributions. This work was completed during a school internship managed by Sylvain Conchon, to whom I am also grateful. Without their combined efforts, this project would not have been possible.

References

Lucia Specia, Kashif Shah, and Trevor Cohn. QuEst - A translation quality estimation framework.

Lucia Specia, Nicola Cancedda, and Marc Dymetman. Estimating the Sentence-Level Quality of Machine Translation Systems.

Lucia Specia, Gustavo Henrique Paetzold, and Carolina Scarton. Multi-level Translation Quality Prediction with QuEst++.

Kashif Shah, Eleftherios Avramidis, Ergun Biçici, and Lucia Specia. QuEst - Design, Implementation and Extensions.

Ergun Biçici and Lucia Specia. QuEst for High Quality Machine Translation.

François Yvon. Le modèle Transformer : un « couteau suisse » pour le traitement automatique des langues.

Taweh Beysolow. Applied Natural Language Processing with Python.

Lucia Specia, Dhwaj Raj, and Marco Turchi. Machine translation evaluation versus quality estimation.

Julia Ive, Frédéric Blain, and Lucia Specia. deepQuest: A Framework for Neural-based Quality Estimation.

Daniel Jurafsky and James H. Martin. N-gram Language Models.

Chetna Khanna. Byte-Pair Encoding: Subword-based tokenization algorithm.

Hui Zhang and David Chiang. Kneser-Ney Smoothing on Expected Counts.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukas Burget, and Jan "Honza" Cernocky. RNNLM - Recurrent Neural Network Language Modeling Toolkit.

EMNLP 2022: Seventh Conference on Machine Translation (WMT22).

Tharindu Ranasinghe, Constantin Orasan, and Ruslan Mitkov. TransQuest: Translation Quality Estimation with Cross-lingual Transformers.

Tharindu Ranasinghe, Constantin Orasan, and Ruslan Mitkov. TransQuest: Translation Quality Estimation with Cross-lingual Transformers (documentation).
