
GNU gettext is a widely used collection of tools for writing internationalized programs.

The GNU gettext system dates from the early 90s and is very mature -- many people are familiar with it, and in particular, translators all around the world know how to use the gettext tools to translate programs. GNU gettext's 'libintl' itself is written in C and thus is quite portable.

This library was written in an attempt to improve on some sore spots in the GNU gettext system without losing the benefits -- especially, to make cross-platform distribution a bit simpler, and to bypass libintl itself in order to avoid some bugs that, so far as I know, cannot otherwise be fixed.

In particular, this library was written with programs that target both native platforms and emscripten / asm.js in mind. When using more exotic cross-compiler toolchains, a simple header-only library becomes a lot more attractive, because it keeps your build system simple.

To fully explain what this library is and what it does, it's helpful to give a broad overview of GNU gettext itself first.

Gettext

When writing non-i18n programs, if a programmer needs to display a message to a user, they can easily do this by putting the message in a string literal in the source code and printing it, assigning it to a dialog, whatever. Programmers are taught to do this even in their very first hello world program.

#include <iostream>

int main() {
  std::cout << "Hello World!" << std::endl;
}

When writing localized programs, this code is broken -- it displays a message to the user which has not been translated(!). The code needs to be fixed (according to GNU gettext) by marking the string for translation.

#include <iostream>

int main() {
  std::cout << _("Hello World!") << std::endl;  // _ is defined below, via the preprocessor
}

Why? For one, the code must be modified so that, before proceeding to std::cout, the string passes through a translation catalog. However, besides this, the program also must have at its disposal a catalog of the actual translations, for any language it might be used in. This catalog needs to be updated and maintained -- it cannot be made once and set in stone, or made only at the 'end of development' of the project. Programmers often need to create new messages that the program can display, and they should not have to fill out a ticket or write someone an email for each new string. An automatic system is needed to detect changes in the code that change the messages that the translators must target.

Gettext is designed so that as much as possible, programmers can use the same workflow they use when writing non-i18n programs, and the process of creating and maintaining the translation catalogues can proceed transparently to them.

In the GNU gettext system, this requires many small, and individually very simple, changes to the 'usual' development process, and ultimately to the build system.

  1. Translatable strings are marked

Programmers annotate string literals corresponding to messages which should be translated using a 'translation marker'. Traditionally, these string literals are prefixed with the _ symbol in the source code.

_("Hello World!")

In C / C++, the _ symbol is then defined using the C preprocessor to be a macro or function which calls the gettext function from libintl. This function returns a const char * so from the point of view of the type system, _("Hello World!") is similar to "Hello World!", but it will be translated at run-time based on the current locale.
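For illustration, here is a minimal sketch of the conventional setup. (The domain name "hello" and the locale directory are placeholders; a real program would substitute its own.)

#include <libintl.h>
#include <clocale>
#include <iostream>

#define _(str) gettext(str)  // the conventional translation marker

int main() {
  std::setlocale(LC_ALL, "");                    // pick up the user's locale from the environment
  bindtextdomain("hello", "/usr/share/locale");  // where libintl should look for hello.mo
  textdomain("hello");                           // make "hello" the default catalog
  std::cout << _("Hello World!") << std::endl;   // translated at run-time, if a catalog is found
}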

  2. Marked strings are extracted

A script is created to extract marked strings automatically from the source code.

The list is stored in a po-template file, marked with the .pot extension. This is the list of 'needed' strings.

Scripts like this already exist for a wide variety of languages. Additionally, a very mature tool xgettext exists which can parse the source code of a large number of languages and extract marked strings, and it can also be reconfigured easily to recognize different / additional markers.
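For our running example, the extracted entry in the .pot file looks roughly like this (the source file reference is illustrative, and real .pot files also begin with a metadata header entry):

#: src/main.cpp:10
msgid "Hello World!"
msgstr ""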

  3. Translators go to work

Translation teams for each language then translate each of the strings. They produce a po file, marked with the .po extension. There are some logistical problems associated with merging the work of various translators, and GNU provides many utilities for this. The details are out of scope here.
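For instance, a translated entry in a French .po file might look like this (the translation shown is purely illustrative):

#: src/main.cpp:10
msgid "Hello World!"
msgstr "Bonjour le monde !"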

  4. Translations are loaded

Finally, for the program to actually use the translations when it runs, the po files for each language need to be compiled into a binary format compatible with libintl. GNU gettext specifies that translations are stored on disk in an optimized hash-table structure, using an architecture-dependent format (i.e. it depends on endianness and such). This format was fixed in the 90s and cannot change without breaking compatibility with libintl and the many other tools that read mo files. Files in this format are marked with the .mo extension (machine-object).

Since the .mo files are dependent on the target architecture, they cannot be committed (or at least, it would be unusual to commit them) into your version control system. By contrast, the po files, which are UTF-8 and human readable, can be and usually are.

To obtain the mo files then, everyone who builds your project needs to download or compile the special GNU gettext 'msgfmt' tool which converts po files to mo files.

Usually, msgfmt is then integrated into the build system for the project, so that all of the translations are compiled each time you compile the program.

The GNU gettext manual documents this workflow in detail: https://www.gnu.org/software/gettext/manual/

Drawbacks

Step 4 is the place where, IMHO, the train goes a bit off the tracks.

The decision to introduce a binary format may have made sense when computers were much slower and had much less working memory. GNU gettext's libintl is used by a wide variety of programs, some of which, like sed, are typically meant to start and run to completion in much less than a second. Using a binary format for translation catalogues ensures that the overhead of localizing sed will be minimal.

However, for programs that run on modern hardware, and run for longer than a few seconds, the overhead of parsing a po-file is typically negligible.

And what costs do we pay for using mo files?

For starters:

  • Catalogues must be compiled by anyone who wants to compile or distribute the program (with translations). This requires everyone to obtain the gettext tools, and the build system must find them and run them.
  • If a translator is working on translations and wants to see how the strings appear in your program, they now must understand how to work with mo files and how to use these tools at the command line in order to actually get the strings into the program. In practice, many translators have a hard time with this, and instead rely on the translation manager / central organization to take care of actually updating the po / mo files.
  • If there is a bug / a string fails to translate, an opaque binary format may now obscure the reasons. When using libintl, libintl itself expects the LC_MESSAGES directory tree to be arranged in a particular way, and gives the program no error if it cannot find translations. This leads GNU to give advice like "if it doesn't work then use strace over your entire program" in their FAQ. With spirit-po, the library doesn't interact with the filesystem: if the translator asks "why aren't my strings appearing, isn't it reading my translation file?", you know the answer, because you wrote the code that talks to the filesystem (as sketched below).
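Since spirit-po only consumes the text of a po file, the file I/O stays in your hands. Here is a minimal sketch of such loading code, using only the standard library (the resulting string would then be handed to spirit-po's catalog constructor):

#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Read the whole po file ourselves: any failure is ours to detect and report,
// rather than being silently swallowed inside the i18n library.
std::string read_po_file(const std::string & path) {
  std::ifstream ifs(path, std::ios::binary);
  if (!ifs) {
    throw std::runtime_error("could not open translation file: " + path);
  }
  return std::string{std::istreambuf_iterator<char>(ifs),
                     std::istreambuf_iterator<char>()};
}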

Difficulties with user-generated translations

However, even greater problems arise when we want to use the system to allow users to generate new translated content. An obvious use-case is, players want to create mod content for a game. The game code / content itself is written using i18n techniques and so is translated to many languages, but what about the mod content? It may happen that mod content can be translated into several languages by the modders, but then the i18n system used by the main game needs to be capable of incorporating their translations. As a different example, suppose that we are writing a program that curates an electronic art gallery. Users want to be able to upload images or movies, together with a short caption, and additionally, they want to be able to provide translations for these captions in various languages, and to hire their own translators to do this.

These users have big headaches caused by the mo system. Even if they obtain translations from translators, it's not enough, because they now have to distribute mo files for every targeted architecture, for every targeted language. Because, in the gettext system, msgfmt is generally run at compile time, content that is not available at compile time cannot be translated -- or some kind of ad-hoc mo distribution network must be created to handle new translation content after compilation has occurred...

These complexities are compounded by a design decision within libintl, which is that only a single mo file is active at any one time when using libintl.

In the libintl documentation, the term "textdomain" is used to describe a group of translatable strings. "Textdomain" is a concept orthogonal to locale -- when a single string is translated, three things must be known:

  • the string itself
  • the locale
  • the textdomain

Different text domains correspond to completely separate dictionaries of strings to be translated, and at run-time, they correspond to distinct .mo files.

Ostensibly, text domains are a feature that would be convenient if the same string needs to be translated in different ways, even in the same language, in some different context. (But gettext also provides context markers for that...) Or, they might be convenient for organizational purposes -- to split up one massive .pot file into many smaller and more manageable .pot files, by putting strings in different text domains.

However, more often the text domains introduce needless complexity and burden the programmers. The text domain is essentially a global variable that the programmer manipulates using functions like this:

http://linux.die.net/man/3/textdomain
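A short sketch of how that global state behaves in practice (the domain names here are placeholders):

#include <libintl.h>
#include <cstdio>

int main() {
  textdomain("game-core");                            // set the process-wide default domain
  std::printf("%s\n", gettext("Attack"));             // looked up in game-core's catalog
  std::printf("%s\n", dgettext("my-mod", "Attack"));  // explicitly query a different domain
  // Note: neither call reports whether the lookup actually succeeded --
  // if the string is not found, the input is returned unchanged.
}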

There is no way to 'merge' two text domains in memory -- libintl isn't built with that in mind; its job is to zip through these pre-built search structures rapidly. libintl is not designed with the idea that the programmer wants to know explicitly if a string was found in a particular catalog. Instead, the gettext function is supposed to take a string, return a translated form if it is found in the catalog, and return the input otherwise.

In some programs, it's more natural if instead, translations can be spread over many .mo files and then merged into one master database which can be queried easily. In spirit-po this is much simpler and easier to arrange directly, while in libintl it's very poorly supported.
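To illustrate the idea (this is not spirit-po's actual interface -- the types and names here are hypothetical), merging catalogs amounts to something like this:

#include <string>
#include <unordered_map>

// A toy 'master database' of singular translations, keyed by msgid.
using toy_catalog = std::unordered_map<std::string, std::string>;

// Merge `overlay` into `base`; entries merged later win on conflicts.
void merge_catalogs(toy_catalog & base, const toy_catalog & overlay) {
  for (const auto & entry : overlay) {
    base[entry.first] = entry.second;
  }
}

// Unlike libintl's gettext, a lookup here can report explicitly
// whether the string was found in the merged catalog.
const std::string * lookup(const toy_catalog & cat, const std::string & msgid) {
  auto it = cat.find(msgid);
  return (it != cat.end()) ? &it->second : nullptr;
}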

Difficulties with libintl itself

A final mundane, but quite painful problem with .mo files:

By far the most common library to actually read the .mo files is libintl.

However, libintl, being quite old in design, opens the catalog files itself using the standard C file-reading routines, and provides no mechanism to load files from a stream or to use custom file-loading routines.

This is very bad, because on some systems the standard C file routines don't work right for various reasons. In particular, when compiling using mingw, sometimes the resulting programs won't be able to open files whose paths contain non-ASCII UTF-8 characters.

Because libintl provides no facilities to override its choice of file-reading facilities, in the way that other libraries like SDL or lua do, there's no way for you to work around problems like this.

These problems can be worked around by using boost::locale instead of libintl, and imbuing the iostream to properly handle utf8 as has been illustrated many times on the interwebs, but if one would prefer a header-only library then spirit-po may be more attractive.

(If you are targeting emscripten, for instance, you may not be thrilled about adding a dependency on compiled boost.)

Conclusion

There is no formal specification of the PO format; instead, the related parts of the Gettext manual serve as its working definition.

cf. the Pology documentation.

spirit-po is a pretty simple library with a small scope. The only thing that would make it laborious to 'roll your own' is that we are reading a data format with no formal spec. To compensate for this, spirit-po has a pretty rigorous unit-testing and validation system, which validates the whole interface against libintl, using a wide range of publicly available po files from many different languages and FOSS projects.

spirit-po is intended as an alternative to boost::locale and libintl for applications where an extremely simple and light-weight implementation is desired -- where not using mo files is a big plus, and where you don't particularly want a library that itself talks directly to the OS or the filesystem. spirit-po is the kind of library which does only what it says on the tin, in a self-contained way.

Thus spirit-po is not an all-in-one i18n solution, but it gets you a big part of the way there, and it's very small, transparent, and configurable.