LASER embeddings v2


LASERembedding is a pip package that makes the new LASER models from Facebook Research's NLLB (No Language Left Behind) project easy to install and ready to use.

The original version has many extra dependencies and is hard to install.

CURRENT VERSION:

  • We now provide updated LASER models which support over 200 languages. Please see here for more details, including how to download the models and perform inference.

In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.

We also have some evidence that the encoder can generalize to languages that were not seen during training, provided they belong to a language family covered by the training languages.

A detailed description of how the multilingual sentence embeddings are trained can be found in [10], together with an experimental evaluation.

Installation

  • Download the encoders from Amazon S3, e.g. bash download_models.sh (a consolidated sketch of these steps is shown after this list)
    • This downloads all the LASER files to the /models folder
  • Download the third-party software with bash ./install_external_tools.sh
  • Download the data used in the example tasks (see the description of each task)
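
A minimal sketch of the steps above, run back to back from a fresh checkout. The clone URL is assumed from the Tekst-ai/LASERembedding repository name; the script names are the ones listed above:

```bash
# Clone the package repository (URL assumed from the repository name).
git clone https://github.com/Tekst-ai/LASERembedding.git
cd LASERembedding

# Download all LASER encoder files from Amazon S3 into the /models folder.
bash download_models.sh

# Download the required third-party software.
bash ./install_external_tools.sh
```

The task-specific data can then be fetched as described in each task's directory.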

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

The original LASER model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

LASER3

Updated LASER models, referred to as LASER3, supplement the above list with support for 147 languages. The full list of supported languages can be seen here.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL Workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web.

[9] Paul-Ambroise Duquenne, Hongyu Gong and Holger Schwenk, Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, NeurIPS 2021, pages 15748-15761.

[10] Kevin Heffernan, Onur Celebi and Holger Schwenk, Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages.