Localizing “Papers, Please” (2014) | Better HN

143 comments

71 comments · 14 top-level

teej · 9y ago · 12 in thread

For those who aren't aware: Papers, Please is a video game about paperwork. You play a border agent in a fictional Eastern Bloc country, checking passports, visas, and work permits. It's surprising and tense and incredibly good. It's currently on sale for $3.99 on Steam, and I highly recommend it.

Paperwork, and the inhumanity of being a cog in a machine, with the machine not forcing you to do evil but rather making you very good at self-generating evil out of a desperate desire to survive.

Another game with a similar take on moral choices is This War of Mine. Both are fairly serious works of literature; do not play either with the expectation that you will feel good about yourself or the human condition afterwards.

chii · 9y ago

Most games marketed to young adults or teens are power fantasies. After all, those are what sell well, because everyone has power fantasies and those games fulfill them.

But occasionally a good game comes out that makes you think about the world and evokes other kinds of emotions, like Papers, Please.

enraged_camel · 9y ago

I bought This War of Mine yesterday, but haven't dared touch it. I don't want to get depressed playing it.

Will I get depressed?

This War of Mine isn't nearly as good or as important as Papers Please, though.

sondr3 · 9y ago

It's also on iPads (maybe Android tablets as well, don't have one) and is one of the best ports I've ever tried. I might even prefer it on the iPad simply because of being able to multi touch. It is a really great game, I recommend trying it simply for the experience.

ClassyJacket · 9y ago

It should be noted that the iOS version was censored by Apple since the original contained nudity in the form of body scans.

mercer · 9y ago

It's the kind of game that makes me believe in games as a distinct art form.

soup10 · 9y ago

it's a bore, but maybe politically interesting for some

teej · 9y ago

I would call the gameplay mundane, but that's the point. Completing a tedious task with high accuracy and throughput while the complexity and stakes ramp up. I found it to be a very entertaining challenge.

obstinate · 9y ago

In the same sense as Tetris is -- that is, unchallenging at slow pace, extremely difficult to do quickly.

That's like, your opinion. I didn't find it boring at all.

It is far from boring.

unsigner · 9y ago · 11 in thread

Don't ever use the original string as key in the localization table. That will force you to translate "high" difficulty the same as "high" resolution, for example.

When translating a fixed set of messages, you translate the entire phrase, not individual words. So the keys would be "high difficulty" and "high resolution".

This might not be possible, e.g. if the UI says "Difficulty: high/low" and the separate UI elements are fed separate strings.

And even the same phrase may get a different translation at a different place, depending on context.

roel_v · 9y ago

What GP is saying (I think - because your point is so obvious to anyone who has done a few hours of l10n that I didn't think it a reasonable interpretation) is that you shouldn't use those phrases as the key in your message table. So don't use 'high_difficulty' as the key to look things up. Doing so is tempting because it makes development easier: you see the sort-of correct strings in your code or level editor, and usually the keys are in the language of the development team, so it lets you forget about the issue most of the time (as opposed to having to look up the value when you use keys that are not connected to their string value).

I got this wrong several times (of course) when I did my first few l10n projects. The first time, I used the strings as keys and ran into the problem described above. The time after that, I went the 'pure' way and used GUIDs as keys, which was a pain in the ass to use and caused the wrong messages to show up in some places a few times. It also made the translators hate me a lot. After that I went what I'd describe as the 'pragmatic' way. Every time you encounter a string message, you make up a string identifier that sort of describes the message. So 'High importance!' would be e.g. HIGH_IMPORTANCE. But a multi-sentence message might be 'INTRODUCTION_PARA'. If the id already exists, and there is no obvious alternative, you just call it HIGH_IMPORTANCE_2 - in other words, you don't think about the key too much, you just use quick and dirty keys, you don't change them EVER, and you make sure you have good tools to prevent clashes, even across 'module' boundaries (where 'module' can be 'source files', 'shared libraries' or 'projects that use the same string resources').
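A minimal Python sketch of that 'pragmatic' scheme, for illustration only. The `StringTable` class and its method names are made up here and are not from any particular toolkit; the point is just that keys are descriptive, never reused, and checked for clashes across modules.

```python
class StringTable:
    """Hypothetical message table keyed by stable, descriptive IDs."""

    def __init__(self):
        self._entries = {}  # key -> (source_text, owning_module)

    def register(self, key, text, module):
        # Refuse duplicate keys, even when they come from another module.
        if key in self._entries:
            _, owner = self._entries[key]
            raise KeyError(f"{key!r} already registered by {owner!r}")
        self._entries[key] = (text, module)

    def unique_key(self, base):
        # Quick-and-dirty fallback: HIGH_IMPORTANCE -> HIGH_IMPORTANCE_2, ...
        if base not in self._entries:
            return base
        n = 2
        while f"{base}_{n}" in self._entries:
            n += 1
        return f"{base}_{n}"


table = StringTable()
table.register("HIGH_IMPORTANCE", "High importance!", module="mail")
second_key = table.unique_key("HIGH_IMPORTANCE")
table.register(second_key, "High importance", module="tasks")
```

Here `second_key` comes out as `HIGH_IMPORTANCE_2`, and registering `HIGH_IMPORTANCE` a second time from any module raises instead of silently overwriting the first entry.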

You also put formatting strings in the messages, and in a way that makes the order configurable by the translator (e.g. using boost::format and not sprintf). You also provide the translator with a UI that shows them the message key, the 'original' message and the translation in various languages that already exists. And you provide a way for the developer/designer to attach notes to each message, where necessary.

Finally you adapt your messages to be as 'neutral' or easy to translate as possible; how to do this is something you learn with experience. And have to test test test and write special cases where necessary, like where you absolutely need things like 'first' or things where you have weird capitalization rules and stuff like that.

I never liked gettext. First, there is the licence, which rules it out for many projects. Secondly, many of the functionalities are overkill and (while 'pure' from an engineering pov) cause more work than they save (the cases I described above as 'just implement something custom in code'). Third, the tools suck. None of the editors I ever tried were really comfortable to use. They must have gotten better in the last 10 years or so, which was the last time I looked, but usually they're 'open source user experience' quality - which is fine for developer tools, but not to be used by non-tech users (which most people who end up doing the translations are, because let's face it, translation is usually an afterthought and a low-respect task).

sanqui · 9y ago

Windows had this problem in Czech, where you could find "volume", as in logical drive, translated as audio volume.

iOS still has this problem in French on its keyboard, where "return" (for line return) is translated as "retour"... which literally means "return" but in the context of a keyboard, means backspace (on physical keyboards the backspace key sometimes even has "retour" written on it).

I've seen the same thing with Polish, on Linux. I don't remember which desktop environment it was in. Instead of "wolumin" or such, there was "głośność".
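The "volume" mix-ups above are what you get when the English string is the lookup key. A small Python sketch of the difference, using the Polish words from the previous comment; the key names `AUDIO_VOLUME` and `DRIVE_VOLUME` are invented for the example.

```python
# Keyed by the English source string: both meanings share one slot.
by_source_text = {"Volume": "głośność"}  # fine for audio, wrong for drives

# Keyed by a context-specific ID: each meaning gets its own translation.
by_context_id = {
    "AUDIO_VOLUME": "głośność",
    "DRIVE_VOLUME": "wolumin",
}


def translate(key):
    """Look up a translation by its context-specific ID."""
    return by_context_id[key]
```

With the second table, the disk manager asks for `DRIVE_VOLUME` and the mixer asks for `AUDIO_VOLUME`, so the translator can give each its own word.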

mattmanser · 9y ago

I understand why he did it, having just done l10n myself for a client, our application is now harder to work with, and harder to just even find things, for example if you search for a phrase you can see on the screen, you get taken to the string file, instead of the string location. It's just an extra step, but it's a little mental drain that you wish you didn't have to deal with. You can't even scan the html anymore very easily as it's filled with all these hard-to-quickly-parse ids instead of human readable text.

Another big problem with using existing strings for ids is that if you notice a typo or tweak the phrasing of a particular string, boom, there goes your placeholder.

watwut · 9y ago

Or you can do it the way front-end does and create a 'localizeText(id, default_translation)' function which you will use to print texts. Best of the two worlds.
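Roughly what that could look like, sketched in Python rather than front-end code. The `TRANSLATIONS` table, the message ids and the German entry are all made up for illustration.

```python
# Hypothetical per-locale tables; only one string has been translated so far.
TRANSLATIONS = {
    "de": {"menu.start": "Spiel starten"},
}


def localize_text(message_id, default_text, locale="de"):
    """Return the translation for message_id, or default_text if missing."""
    return TRANSLATIONS.get(locale, {}).get(message_id, default_text)


start_label = localize_text("menu.start", "Start game")
quit_label = localize_text("menu.quit", "Quit")  # falls back to the default
```

The default text stays visible and searchable at the call site, while the id keeps the lookup stable if someone later fixes a typo in the English wording.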

Another reason is that grammar differs even where English stays the same. "Old" is the same word in English whether you write old woman, old man, or old child, but in languages with gender distinctions it will differ; the same goes for singular vs. plural.

So while in English you need only one string for singular and plural across genders, in other languages it needs to be written 5 different ways, and that is before we get into grammatical cases, which also don't exist in English. Where English uses "old" 20 times, another language may need 20 different variations.

many companies where it's the default language of products English or Chinese have later big issues to localize content properly, especially considering Chinese being even more primitive language than English. good luck trying to explain all these distinctions to Chinese developers

The keys are for a specific context, so that makes collisions unlikely. You'll notice the example context "turned-over-ezic-docs" has only three strings in it. Of course, if he ever did have a key collision, it would be easy to solve. At worst, he'd have to replace the English strings with random, unique strings and make English a translation just like any other. He decided he'd pay that toll only if he needed to. And he didn't need to.

eropple · 9y ago · 11 in thread

I would recommend against one's own XML format and doubly against CSV/some homegrown delimited format. Instead, consider something like Excel 2003 XML (one of the easier ones), OpenDocument (also pretty easy in many languages), or Office OpenXML (easy in .NET, a bit harder elsewhere) to store your translation data.

Potfiles are another option, but the tooling is pretty clunky and, in games in particular, people don't seem particularly attuned to their use. And they're not great for editing, though they might be for storage--when dealing with tabular stuff, it just makes a lot of sense to use tools that present a tabular interface. It makes life a lot easier.

Just do what the industry does and use XLIFF[0]

Professional translators will already use compatible editors and for occasional translators there are open source ones available.

[0] https://en.m.wikipedia.org/wiki/XLIFF

Blah, how did I forget XLIFF? Unlike a lot of OASIS specs I think it seems pretty reasonable and alright, but I've never actually seen it implemented anywhere (games or general startup-y web stuff). I'm glad you brought it up, though, because it did slip my mind. =)

microcolonel · 9y ago

If it's literally a table of strings, why on earth would anyone use ODF/OOXML? CSV is perfectly fine for editing in any functioning spreadsheet software, works reasonably well in version control (especially since a given commit won't touch multiple columns). In his case, he's using the XML format which Haxe will parse into compile-time-checked references right in his source code; sounds like a great reason to use this standardized(but Haxe-specific) XML format.

mikesickler · 9y ago

Yeah, I don't see any reason not to use that XML format. CSV can be problematic because it doesn't declare its own encoding and doesn't play as nice with translation tools as something with named fields, such as XML or JSON.

et1337 · 9y ago

CSV has no formatting. Who wants to resize columns and set up text wrapping every time they open the file? You could save it as .xlsx and then export to CSV, but that's another step, and it's not hard to parse simple spreadsheets in XML. Worth it in my book, because it enables fan translators to contribute easily since everyone has Excel.

kccqzy · 9y ago

Can you explain why you oppose CSV? And by the way CSV is a standardized format with its own RFC; it is definitely not home grown. There are mature parsers being able to handle commas and quotes and linebreaks and other special characters in CSVs.
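For what it's worth, here is a short round trip through Python's standard `csv` module, with commas, quotes and line breaks inside fields. The sample rows are invented, and this only shows one parser agreeing with itself, not that other tools will read the same file the same way.

```python
import csv
import io

rows = [
    ["key", "en", "fr"],
    ["GREETING", 'He said "hi", then left.', "Il a dit salut, puis il est parti."],
    ["MULTILINE", "Line one\nLine two", "Ligne un\nLigne deux"],
]

# Write the table; fields containing commas, quotes or newlines get quoted.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

# Read it back from the same text.
buffer.seek(0)
parsed = list(csv.reader(buffer))
```

After this, `parsed` equals `rows`, embedded newline and doubled quotes included.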

> And by the way CSV is a standardized format with its own RFC; it is definitely not home grown.

That same RFC explicitly notes that CSV is entirely ad-hoc and homegrown, and that it (the RFC) is an attempt to clean up the existing mess.

It boils down to this: Excel is the de facto standard for translators and localizers I've worked with, and so tooling that works with that is a smarter bet than well-actuallying them about how trash that standard tool is at my favorite niche case. It's about people, not your code.

Beyond that, hand-editing a CSV file (because eventually you have to do that) with those special characters in it is a huge pain. It remains better than JSON, etc., because being record-based is automatically better than not for this stuff, but it's not a good option. I'm well aware that CSV is theoretically standardized; I have written standard-compliant parsers and writers. (It is awful.) And then, once I had painstakingly written that writer to spec, the next guy's--no, not Excel--trashed my data because CSV has a spec that nobody cares about.

That's adorable! CSV is notoriously loose, has a bajillion edge cases, differs wildly on region, and is not even really a format at all. Basically nobody follows RFC 4180, nor does anyone care about its existence.

The other day I had to fix a bug in our CSV importer; it turns out that when you install Excel, it changes the mimetype of CSV files from 'text/csv' to 'application/vnd.ms-excel' system wide. Wow! These sorts of shenanigans are never ending in the futile endeavour to support CSV.

douche · 9y ago

That seems like some serious overkill, unless you are relying on your translators to produce production assets.

We have used JSON, XML, INI-style files, or CSVs, which, as i18n goes, has been pretty easy.

If you're not relying on your translators to produce production assets, then you will be converting them from a format they are comfortable with (which tends towards Excel, if not a web app of some flavor) into whatever you're using. Unless you have a particularly rough memory budget, you will almost certainly be saving yourself time and effort by cutting out a step.

cbanek · 9y ago · 7 in thread

One other interesting problem with localization involves the use of printf. Even if you're looking up strings based on IDs in another file (which is a good pattern), sometimes you'll need to move things around based on language. For example, if you're doing right to left languages, you might put the number before, or after the string, and the other way for left to right languages. So like ("%d %s" vs "%s %d").

The way that we got around this was adding another level of indirection, and putting printf format strings also as localized data.

The format strings have to be localized data in any case, because they usually contain literal text, not just placeholders. The real problem here is that you need to change the order of arguments in a printf call - if the string changes from %s%d to %d%s, the order of arguments in the call must change, as well.

If you're on POSIX, you can use positional arguments for that:

   printf("%1$d %2$s", d, s);
   printf("%2$s %1$d", d, s);

Because it's not standard C, VC++ does not support it directly in printf, but it offers _printf_p with such support, and you can always #define printf _printf_p.
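The same idea sketched in Python, where `str.format` takes indexed placeholders. The `FORMATS` table and its keys are invented for the example; in practice each locale's format string would come from the translation data.

```python
# Format strings live in the translation data, so each locale can reorder them.
FORMATS = {
    "count_then_name": "{0} {1}",
    "name_then_count": "{1} {0}",
}


def render(format_key, count, name):
    # The call site always passes arguments in the same order.
    return FORMATS[format_key].format(count, name)


a = render("count_then_name", 3, "files")
b = render("name_then_count", 3, "files")
```

Here `a` is `"3 files"` and `b` is `"files 3"`, with no change to the calling code.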

jwilk · 9y ago

Or you can use GNU Gettext, which provides featureful replacements for printf() functions.

> Even if you're looking up strings based on IDs in another file (which is a good pattern) […] putting printf format strings also as localized data.

AFAIK localising "formatting literals" is the more normal method: it avoids redundancies, as you don't need two different systems (ids and format strings), and provides more flexibility with respect to e.g. cardinalities. Most ID-based systems bundle formatting support as well; if you're using an ID-based system you basically shouldn't call the language's string formatting functions.

Furthermore translating literal sections individually (without formatting context) will often yield an incorrect result as the entire phrase needs to be shuffled around, or words need to be inflected, or a literal translation suitable for "standalone" expressions does not work for the entire phrase.

More granular is generally worse for translations.

> AFAIK localising "formatting literals" is the more normal method, it avoids redundancies as you don't need two different systems

I never understood why people think this is a good idea. The exact same sequence of letters in an English phrase, which you would like to use instead of IDs, can mean two different things in two different places - and those two different meanings could have different forms in other languages. Denormalizing the translation database like that seems semantically incorrect (and strikes me as programmer laziness).

I agree that in general, more granular is worse for translations - there's too much risk your split will pierce the contextual whole that's required for some translations.

The Unicode CLDR has a whole database of formatted strings for each locale, for more or less the reason you describe. Formatting dates or big numbers (12345 -> 12.3k) is impossible to achieve without a generic formatting language.

Pluralization is another nightmare of its own. Look into how Russian and similar languages pluralize. It has to do with the value of the number modulo 10, similar to English ordinals.

CLDR: http://cldr.unicode.org/
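To make the Russian case concrete, here is a small Python sketch of the integer plural categories as I understand them from CLDR (one / few / many). The function names and the `FORMS` table are made up for the example, and it deliberately ignores fractions.

```python
def russian_plural_category(n):
    """Return 'one', 'few', or 'many' for a non-negative integer n."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


# Forms of the word for "file" in each category.
FORMS = {"one": "файл", "few": "файла", "many": "файлов"}


def count_files(n):
    return f"{n} {FORMS[russian_plural_category(n)]}"
```

So 1 and 21 take the "one" form, 2 and 22 take "few", and 5, 11 and 112 take "many", which is why a simple singular/plural switch is not enough.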

wingerlang · 9y ago

I went to a presentation by a company that dealt with translation, he mentioned this issue and his recommendation was to simply not try to be smart with it and have separate strings where pluralisation is done properly in each one.

jwilk · 9y ago

It's the language grammar that may require particular order of formatting directives.

LTR vs RTL is about rendering text and unrelated to this.

rasmafazi · 9y ago · 7 in thread

Sometimes you just have to bite the bullet. For interesting subjects, which always have global reach, the virtual conversations are conducted in English. There is also a place for vernacular -- it is part of people's cultural identity -- but not in a formal knowledge setting. English is a bit like Latin used to be: the language of knowledge, technology, and business. If the subject has global reach, you will miss out on the interesting bits of knowledge, simply because you are trying to do it in vernacular. Doing anything in vernacular, will just lock you up in a small and uninteresting national silo. Nothing of any interest is national. But yes, I use vernacular. I also speak it with my kids, but I don't read it -- unless it is poetry or literature -- and I don't use it in software or in business.

marvin · 9y ago

There are dozens of highly-functioning economies in the world that barely use English at all. I just got back from a holiday in France, which has its own fighter jets, nuclear weapons and aircraft carriers, yet only speaks English in jobs that are directly related to foreign communications. A random person you meet on the street will be very unlikely to understand you if you speak English to them.

It's a pipe dream to expect an average French person to enjoy art in its native language, and as a businessperson you will limit your market by doing this.

Granted, the English-speaking world is currently the world leader in technological capability and economic power, but it's quite myopic to assume that this makes everyone else irrelevant.

When most people play games, the goal is to relax. While I will concede that English has largely become the de facto language of trade, business, and technology, and I'm not going to go into hand-wringing because of that fact, demanding that people partake of their leisure in a language that's uncomfortable to them is several steps too far.

> unless it is poetry or literature

...So, art? Like games?

microcolonel · 9y ago

Sometimes I get i18n fatigue too. I think the world would be a better place if everyone's languages fit in ASCII.

That said, the cat's kinda out of the bag. UTF-8 is at least well-done, and the algorithms are widely available. I study Japanese and have started studying Russian and Chinese; I think maybe the best way to convince people to learn English is to walk the walk. Who knows, maybe everything will go very wrong again before we get a chance to standardize.

I'm also working on an engineered language with a test suite/corpus maintained alongside the language. Maybe in the ashes of the old new world there'll be room for something like this.

English doesn't even fit in ASCII.

To write it properly, we need left- and right-facing single and double quotes, diaereses and accents for words like naïve, façade and café, en- and em-dashes and the ellipsis.

Longer documents will require symbols like † and ‡, bullets and §. The currency symbols £, €, ¢ and ₹ are used by countries where English is an official language.
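Easy enough to check in Python. The sample string is made up, but it only uses the kinds of characters listed above.

```python
# Curly quotes, a diaeresis, an accent, a pound sign and an ellipsis.
text = "“naïve” café, £5…"

try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# UTF-8 handles the same string and round-trips it unchanged.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

`fits_in_ascii` ends up `False`, while the UTF-8 round trip gives back the original string.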

douche · 9y ago

Some days, I dream of a world where the ancient Chinese were exposed to alphabetic scripts and decided that was a good idea, instead of sticking with characters.

nine_k · 9y ago

One correction: consider Chinese. It's increasingly important, though less international.

tschwimmer · 9y ago · 4 in thread

Awesome article. I'm always impressed by the distance people will go for their passion. Lucas talks about ultimately having to hand draw Cyrillic versions of _each_ of the game's ten fonts. Very cool!

How do you know it's passion when he provided more language options only after his PAID game became successful? I think there is a different word for it in English.

The level of effort depicted is more than you'd expect for a cynical cash-in.

watwut · 9y ago

I agree it is not passion in either case. I think it is called taking the job seriously and responsibly.

raverbashing · 9y ago

When he's still doing games when a regular job would make him more money

haikuginger · 9y ago · 2 in thread

This article makes me unreasonably glad to be working in a framework (Django) with good i18n tooling and few special needs re: textual images.

raverbashing · 9y ago

Django solves the pluralizing issues (with some limitations) but it won't (can't) solve gender issues in translation.

See https://docs.djangoproject.com/en/dev/ref/templates/builtins...

It doesn't help when the plural is 1/2/many or something different (example: Arabic/Icelandic, etc) http://docs.translatehouse.org/projects/localization-guide/e...

haikuginger · 9y ago

Django provides ungettext for the pluralizing problem, and pgettext, if implemented conscientiously, for the gender problem.
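As a rough illustration of those two calls, here is a sketch using Python's standard `gettext` module rather than Django itself. `NullTranslations` stands in for a loaded catalog, so everything falls back to the English msgids; the message texts and context names are invented, and `pgettext` on translation objects needs Python 3.8 or newer.

```python
import gettext

# Stand-in for a real catalog: returns the msgids unchanged.
catalog = gettext.NullTranslations()


def describe_inbox(n):
    # ngettext picks a form based on n; a real catalog can define more than
    # two forms for languages that need them.
    template = catalog.ngettext("%(n)d message", "%(n)d messages", n)
    return template % {"n": n}


# pgettext attaches a context, so identical English strings can be
# translated differently (which is where gendered variants could go).
month = catalog.pgettext("month name", "May")
verb = catalog.pgettext("modal verb", "May")
```

With a real catalog loaded, `month` and `verb` could come back as different words even though the English source is the same.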

paines · 9y ago · 2 in thread

Is the Steam version localized and available in German?

The second part ("less technical stuff") specifically notes that they used Steam's private branches to beta-test the localisation and that the supported languages are Italian, Japanese, Spanish, French, German, Russian and Brazilian Portuguese.

So yes and yes.

paines · 9y ago

Thank you. Didn't see that there is a 2nd part!

surgi · 9y ago · 1 in thread

Loosely related to the title: Why not create a complete modular version not only localised, but also tied to individual country's flows and processes? So it could serve as an education material. (mind:blown)

breakingcups · 9y ago

The game is about a fictional country though. Why should the localized versions be about a real country?

If you haven't seen the trailer, it's worth watching.[1]

Glory to Arstotzka!

[1] https://www.youtube.com/watch?v=_QP5X6fcukM

mproud · 9y ago

This should be amended as (2014).

Localizing well has a lot of complexity - gender, cardinal, ordinal, etc. rules, and then how to combine them with locale-specific special cases (e.g. in Spanish, a 15-year-old birthday girl is a quinceañera)

I am attempting to solve this with a small library that offers full CLDR coverage and a special expression language.

See https://www.lokalized.com

Currently for Java 8 but am porting to JS and Python (probably Swift after those)

jdonaldson · 9y ago

Haxe really shines at converting compile-time assets into static types. The other related trick is to use json as a config object, and access the fully typed equivalent as a static instance within your code. It's also possible to do this with database queries.

I realize other languages provide support for this, but in my experience with Haxe it's way easier to implement something custom. The macro translation layer for manipulating the AST is flexible and speedy, and the compiler is wired directly into autocompletion requests. There's very little impedance between my fingertips and the desired outcome.

mattmanser · 9y ago

Having just done some l10n for a client, the thing that annoys me is how even the most powerful editors, such as VS, have such awful tools for l10n. .Net's actual i18n support is pretty good overall, but the editor support is bad.

I literally had to build my own. With 2,500 different strings for a total of 10,000 words I wouldn't even consider our application even that big, it must be a nightmare in bigger projects. We haven't even done the sales site yet because the product's being upsold through a partner.

We came up with our own id naming system, then created an xlsx/resx importer/exporter that uploaded to GSheets to allow us to share files with translators. The ids and comments fields allowed us to add extra metadata, to split the strings into logical sections and sheets, and to order them properly. We could also add links to the page each section of translations appears on, so the translator could see the context. This additionally allowed us to highlight any lines a translator had missed when we re-imported the file, and let translators add their own questions/comments, etc. Also, as we were using sendwithus, we used the importer/exporter to import pot files from them to keep everything in one place.

Then to support those tools, I created a tool to search for phrases used before, find out the ordering from the meta data, quickly copy ids of strings we want to re-use, see missing spreadsheet tabs.

Programmatically, we had to add support for automatically translating enums into strings (think project status for example), add l10n to our audit logs so customers could see their audits in the correct language and we'd see them in English, modify how .Net did l10n of dates because their built in one is really odd with en-GB which is where we are based (shortdate is Jan 01 2025 in en-US but inexplicably 01 January 2025 in en-GB and all sorts of other oddities).

Then we used a modified version of pseudoizer (thanks John Robbins + Scott Hanselman![1]) to allow us to easily see untranslated strings while we went through the whole site without having a finished translation (we used ja-JP instead of Polish to really see the differences in date strings, currency, etc.). We ended up modifying it because it goes a bit mental with adding !!!! for things like tabs.
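For anyone who hasn't seen pseudo-localization, the idea fits in a few lines of Python. This is only a toy illustration of the technique, not the Pseudoizer tool mentioned above, and the character mapping is an arbitrary choice.

```python
# Map plain vowels to accented ones so untouched strings stand out on screen.
ACCENTED = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")


def pseudolocalize(text):
    """Accent the vowels and wrap the string in brackets."""
    return "[" + text.translate(ACCENTED) + "]"


label = pseudolocalize("Save file")
```

Any string that shows up without accents and brackets never went through the string table. A real tool would also need to leave format placeholders and markup alone, which this sketch does not attempt.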

Probably spent a week on those tools, but boy was it worth it.

I've not tried IntelliJ's l10n support, maybe it's better, but VS's is very lacklustre.

[1]https://www.hanselman.com/blog/GlobalizationInternationaliza...
