Archive for the ‘I18n’ Category

Internationalization Puzzler Resolved

April 30th, 2009

A few weeks ago I posted on an I18n problem with IBM WebSphere that was causing corrupted characters to display. In short, WebSphere had told the browser to ignore the stated page encoding (UTF-8) and to display the page as if encoded for Latin-1. Not the Jedi way.

Our engineers had to get this escalated to tier 3 with IBM support. This seemed ridiculous to me, because we can’t have been the only WebSphere site trying to display Spanish and Portuguese, and other people must have complained about such a silly problem, but it took tier 3 to get us a solution, and that’s all that matters.

The short answer: we needed to change all of our top-level JSPs to explicitly set the response encoding to UTF-8 (response.setContentType("text/html; charset=UTF-8");). Once the engineers had done that, the container finally returned a consistent result in UTF-8. It’s still a bit confusing why the UTF-8 encoding was returned on some pages and not on others, but it all seems to work now, so we happily closed this case with IBM.

I append the entire response from IBM, simply so that it will live in one more place on the Web for future searches.

Two points before diving into your server1-non-working trace:

I. How WebSphere application server set Default Response Encoding:

If autoResponseEncoding is true:
1. Check request locale, set it if exists
2. Get encoding from request.getCharacterEncoding()
3. set the encoding according to the above locale
4. set to default ISO-8859-1

finally setContentType() with the charset set to above encoding.

Please note that autoResponseEncoding is independent from
autoRequestEncoding or client.encoding.override.

II. From Servlet Spec:
1. setContentType(): only work if response has not been committed (i.e
before getWriter)
2. Dispatch Include (SRV.8.3)
Any attempt to set headers or call any method that affects the
headers of the response will be ignored.

==============================================

Now this is your server1-non-working’s analysis:

1. autoResponseEncoding set the encoding according to step (I.3) above:
from the locale (en_us)
thus the encoding is ISO-8859-1

setContentType type –> text/html; charset=ISO-8859-1

2. The request to /home.wfl will result in a dispatch forward to
[/WEB-INF/jsp/home.jsp] which in turn including other resources:

[/WEB-INF/jsp/home.jsp]

setContentType type –> text/html

+include /WEB-INF/jsp/include/header.jsp
++including /WEB-INF/jsp/include/includes.jsp]
++including /WEB-INF/jsp/include/syncstatus.jsp]

(there are several attempts to call the setContentType() within the
including JSP but got ignored … though I do not see any attempt to set
charset to UTF-8)

3. getWriter() with the encoding that found/set in step 1: ISO-8859-1.

=================================

You might want to check the top level JSP (i.e. home.jsp in this case)
and setContentType accordingly and before the response is committed.
Do not set it in the including resources as it will be ignored.
————————

Thank you for using IBM products and support.
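
The two servlet-spec points in IBM's reply can be illustrated with a toy model. This is only a sketch in Python, not the servlet API: the class `ToyResponse`, its methods, and the `render` helper are hypothetical names invented for illustration, and the ISO-8859-1 default stands in for whatever the container picked in step 1.

```python
class ToyResponse:
    """Toy stand-in for a servlet response; not the real servlet API."""

    def __init__(self, default_content_type="text/html; charset=ISO-8859-1"):
        # Assumption: the container has already chosen a default, as in step 1.
        self.content_type = default_content_type
        self.committed = False
        self.in_include = False

    def set_content_type(self, value):
        # Header changes are ignored after commit or from inside an include.
        if self.committed or self.in_include:
            return False
        self.content_type = value
        return True

    def get_writer(self):
        # In this toy model, obtaining the writer locks in the encoding.
        self.committed = True
        return self.content_type


def render(top_level_sets_utf8):
    """Simulate one request and return the content type that was locked in."""
    response = ToyResponse()
    if top_level_sets_utf8:
        # The fix: the top-level page sets the charset before anything else.
        response.set_content_type("text/html; charset=UTF-8")
    # An included page also tries to set the header; the attempt is ignored.
    response.in_include = True
    response.set_content_type("text/html; charset=UTF-8")
    response.in_include = False
    return response.get_writer()


print(render(top_level_sets_utf8=False))  # text/html; charset=ISO-8859-1
print(render(top_level_sets_utf8=True))   # text/html; charset=UTF-8
```

In the model, only the call made by the top-level page before the writer is obtained has any effect, which is the behavior IBM described.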

John White I18n, Web localization, internationalization testing

Why I Pseudo-translate, and Why You Should, Too

April 16th, 2009

I always regret having to pseudo-translate, because it involves an extra step, annoys the engineers, and generally slows things down. BUT, whenever I don’t pseudo-translate, I regret it even more.

Earlier posts discuss the process in greater detail. I do it to flush out ugly problems while time is cheap, rather than later, when it gets expensive. On most projects, I won’t start translation until after pseudo-translation and testing.

In short, pseudo-translation and internationalization testing help you find all of the incorrect assumptions the developers had when they created the software, for example:

  • “English is the only language this app will have to process.” This assumption becomes obvious when multibyte characters from Asian languages – or even just accented characters from Latin-based languages – are corrupted. To clean this up properly, you need to examine not only the code that renders text on screen, but also the code that processes characters entered by users.
  • “Dates should always be formatted mm/dd/yy, and the period is the decimal separator.” All modern operating systems and browsers now handle this for you, so don’t bother writing it yourself. Use the common resources in the OS or browser.
  • “Many of our error messages begin with the same five words, so we’ll append the rest of the message at run time to keep our code trimmed down.” This is sure to disrupt sentence structure, especially in Asian languages. The engineers will have to fix it and put up with less-trim code, but the localization process will appreciate the improvement.
  • “We’ll never localize this message. I’ll leave it in the source code.” Pseudo-translation and testing finds these instances immediately, and the engineer needs to comb the string out of the code and into the translatable resources.
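
The run-time concatenation problem from the list above can be sketched in Python. The catalog, the message ID, and the `xx` pseudo-locale are all hypothetical, and the `xx` string is a pseudo-translation rather than a real one. The point is only that a whole-sentence template lets each language put the variable part wherever its grammar needs it, while concatenation fixes the word order in code.

```python
# Hypothetical message catalog keyed by locale, then by message ID.
CATALOG = {
    "en": {"ERR_OPEN": "The application could not open the file {name}."},
    # Pseudo-locale: the placeholder moves to the front of the sentence.
    "xx": {"ERR_OPEN": "¿¡{name}: thé fïlé cöüld nöt bé öpénéd ßÎ"},
}


def message(locale, key, **params):
    """Look up a whole-sentence template and fill in its placeholders."""
    return CATALOG[locale][key].format(**params)


def concatenated(prefix, rest):
    """The fragile approach: the code decides the word order."""
    return prefix + " " + rest


print(message("en", "ERR_OPEN", name="report.txt"))
print(message("xx", "ERR_OPEN", name="report.txt"))
```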

So take the extra step of pseudo-translating, especially if you’re starting a software or Web project that involves an architecture in which you’ve never localized before: C# and Ajax come to mind, since many engineering teams are still cutting their teeth on these, after years of C++, HTML and Java.

Remember: Internationalization is only going to be of interest to engineers once – the first time – so you should get as much out of it as you can that first time.

John White I18n, internationalization testing, pseudo translation

Internationalization Puzzler: Page Encoding

April 3rd, 2009

For a Web localization project, we’ve pseudo-translated the Java-based site, which is running on IBM WebSphere.

To pseudo-translate, we padded all of the strings with leading ¿¡ÃÉ and trailing ßÎÕÜ (target languages this round are Latin-1). Chars are UTF-8 encoded and all pages are generated with metatag charset=utf-8.
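
The padding step can be sketched in a few lines of Python. This is not the tool we used; the function names and the sample string IDs are made up for illustration, and only the leading and trailing characters come from the project.

```python
LEADING = "¿¡ÃÉ"
TRAILING = "ßÎÕÜ"


def pad(text):
    """Wrap a UI string in the leading and trailing marker characters."""
    return f"{LEADING}{text}{TRAILING}"


def pad_all(strings):
    """Pad every value in a mapping of string ID to text."""
    return {key: pad(value) for key, value in strings.items()}


# Hypothetical string IDs and values.
padded = pad_all({"home.title": "Welcome", "home.logout": "Log out"})
print(padded["home.title"])  # ¿¡ÃÉWelcomeßÎÕÜ

# Encode explicitly as UTF-8 when writing the padded strings back out;
# each of the eight marker characters takes two bytes in UTF-8.
encoded = padded["home.title"].encode("utf-8")
```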

As WebSphere sends the pages back, many of them look fine; e.g.:

[Screenshot: good_chars]

However, many of the pages display the characters as corrupted:

[Screenshot: bad_chars]

Oddly, the browser reports that these bad pages are encoded for Western European (ISO), in spite of the fact that the charset in the page source shows UTF-8. If you switch the browser to display the page as UTF-8, the characters show up properly.

It appears that WebSphere is telling the browser, “I know what’s best. Ignore the UTF-8 in the charset and handle this page as ISO,” and the browser obliges.
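
The corruption itself is easy to reproduce outside the browser. The Python sketch below uses a short sample string of my own choosing: it encodes the text as UTF-8, the way the pages are generated, and then decodes the bytes as ISO-8859-1, which is what the browser does when it is told the page is Western European (ISO).

```python
original = "¿¡é"

# What the server sends: the text encoded as UTF-8 bytes.
payload = original.encode("utf-8")

# What the browser does when told the page is ISO-8859-1: every byte
# becomes one character, so each accented character turns into two.
misread = payload.decode("iso-8859-1")
print(misread)  # Â¿Â¡Ã©

# Switching the browser to UTF-8 decodes the same bytes correctly.
assert payload.decode("utf-8") == original
```

The bytes on the wire are the same in both cases; only the label attached to them differs.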

Even more maddeningly, this does not happen on all pages, but only some pages in the site. All pages in the site (so I’m told) are created identically.

Happens with both Firefox and IE. The engineers have experimented with Tomcat, which does not act up like this, but we need to make WebSphere work.

Have you ever seen this? Any ideas on what could be tricking the browser?

John White I18n, Web localization, internationalization testing, pseudo translation

Localizing Code Snippets – Part II

August 21st, 2008

Last week I posted on the dilemma of how to localize Code Snippets, the selected pieces of your documentation that you shoehorn into XML files so that Visual Studio can present them in tool-tip-like fashion to the user while s/he is writing code that depends on your documentation.

My goal was to ensure that the process of grabbing these bits of documentation (mostly one-sentence descriptions and usage tips) was internationalized, so that we could run it on translated documentation and save money. This has proved more difficult than anticipated.

Here is the lesson: If you think it’s hard to get internal support for internationalizing your company’s revenue-generating products, just try to get support for internationalizing the myriad hacks, scripts, macros and shortcuts your developers use to create those products.

In this client’s case, it makes more sense to translate the documentation, then re-use that translation memory on all of the Code Snippet files derived from the documentation. It will cost more money (mostly for translation engineering and QA, rather than for new translation) in the short run, but less headache and delay in the long run. Not to mention fewer battles I need to fight.

Discretion is the better part of localization valor.

John White I18n, documentation internationalization, documentation localization, internationalization, localization engineering, pseudo translation

Localizing Code Snippets

August 14th, 2008

“Why would I localize code snippets?” you ask. (Go ahead; ask.)

Everybody knows you don’t translate snippets of code. Even if you found a translator brave enough to take on something like int IBACKLIGHT_GetBacklightInfo(IBacklight *p, AEEBacklightInfo * pBacklightInfo), the compiler would just laugh and spit out error messages.

However, if you’re a developer (say, of Windows applications) working in an integrated development environment (say, Microsoft Visual Studio), you may want to refer very quickly to the correct syntax and description of a feature without searching for it in the reference manual. The Code Snippet enhancement to Visual Studio makes this possible with a small popup box that contains thumbnail documentation on the particular interface the developer wants to use. It’s similar in concept and appearance to the “What’s This?” contextual help offered by right-clicking on options in many Windows applications.

How does the thumbnail documentation get in there? It’s a tortuous path, but the enhancement pulls text from XML-formatted .snippet files. You can fill the .snippet files with the information yourself, or you can populate them from your main documentation source using Perl scripts and XSL transformation. So while you’re not really translating code snippets, you’re translating Code Snippets.

And therein lies the problem.

One of our clients is implementing Code Snippets, but the Perl scripts and XSL transformation scripts they’re using to extract the documentation don’t support Unicode. I found this out because I pseudo-translated some of the source documentation and ran the scripts on it. Much of the text didn’t survive to the .snippet files, so we’re on a quest to find the offending portions of the scripts and suggest internationalization changes.

We’ve determined that the translated documentation in the Code Snippets will display properly in Visual Studio; the perilous part of the journey is the process of extracting the desired subset of documentation and pouring it into the .snippet files. Don’t expect that your developers will automatically enable the code for this; you’ll probably have to politely persist to have it done right.
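
A round-trip check along these lines is one way to catch the problem early. The sketch below is in Python, not the client's Perl and XSL pipeline, and the element names (`doc`, `member`, `summary`, `CodeSnippets`, `CodeSnippet`, `Description`) are made up for illustration; they are not the real documentation format or the real .snippet schema.

```python
import xml.etree.ElementTree as ET

PSEUDO_TEXT = "¿¡Séléct Dévïcé ßÎ"

# Hypothetical documentation source with a pseudo-translated description.
DOC_SOURCE = (
    '<doc><member name="ExampleFunction">'
    f"<summary>{PSEUDO_TEXT}</summary>"
    "</member></doc>"
)


def extract_snippets(doc_xml):
    """Copy each member's summary into a snippet-like XML document."""
    root = ET.fromstring(doc_xml)
    out = ET.Element("CodeSnippets")
    for member in root.iter("member"):
        snippet = ET.SubElement(out, "CodeSnippet", name=member.get("name"))
        ET.SubElement(snippet, "Description").text = member.findtext("summary")
    # Serialize explicitly as UTF-8 so non-ASCII text survives intact.
    return ET.tostring(out, encoding="utf-8")


snippet_bytes = extract_snippets(DOC_SOURCE)

# Round trip: parse the output and confirm the description is unchanged.
description = ET.fromstring(snippet_bytes).find("CodeSnippet/Description").text
assert description == PSEUDO_TEXT
```

If the assertion fails after a change to the extraction step, the step is dropping or mangling characters before they reach the .snippet files.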

Alternatives:

  • Wait until all of your documentation has been translated, then translate the .snippet files. It’s more time-consuming and it will cost you more, but working this far downstream may be easier than getting your developers to clean up their scripts.
  • Make your Japanese developers tolerate English documentation in the Code Snippets.

Neither one is really the Jedi way. Work with your developers on this.

John White I18n, documentation internationalization, documentation localization, internationalization, localization engineering, pseudo translation

Giant Localization Leap Backwards

June 19th, 2008

“All of the strings are embedded in the code.”

There was a time when I welcomed – or at least was not very much surprised by – sentences like this one. They came from engineers in response to my questions about the readiness of their software strings to be localized. Strings embedded in code, of course, are more or less inaccessible to localization techniques, since nobody wants to hand off an entire code base to a translator, and no translator wants to wade through an entire code base trying to find strings to translate.

So, when one of my client’s engineers said it to me yesterday in reference to an application in a larger product we plan to localize, I briefly welcomed it. It means more work.

But then I realized that combing all of the strings out of the code and into separate, accessible files will require a great deal of time and effort (not mine). Engineers don’t usually enjoy working on this kind of task, so it will fall to the bottom of the priority stack, and the product manager won’t go to bat for it, and so this particular application will stick out like a sore thumb as a non-localized component in an otherwise localized product suite.

“Is there a phased approach we could take to enabling this app for localization?” the engineer asked.

I appreciated his attempt to save the game, but a partially localized product is rather ugly. We could enable and translate the menu and dialog strings for this release, and go back for the error messages in the next release, but the mongrel product is not very appealing to users in the meantime.

This is disappointing, because we’ve made such long localization strides elsewhere in the product suite, and dealing with this newly acquired app feels like such a giant leap backwards. I guess I’ll work up some estimates on the time required to enable the application, then make my case to the product manager and development lead to generate some interest and start the process from the beginning.

Isn’t that why we localization project managers and international product managers were sent here?

What do you do in your company when engineers tell you that all the strings are embedded in the code?

John White I18n, hard-coded strings, international product manager, internationalization, localization manager, new to localization, product manager, string localization

How to pseudo-translate, Part II

March 16th, 2007

You only speak one language, so maybe you’ll never be a translator, but you have a chance as a pseudo-translator.

Pseudo-translation is the process of replacing or adding characters to your software strings to try and break the software, or at least uncover strings that are still embedded in the code and need to be externalized for proper localization. (Part I of this post describes why anybody would want to do such a thing.) Pseudo-translation is a big piece of internationalization (I18n), which you should undertake before you bother handing anything off to the translators.

Here’s an example of a few strings from a C resource file, with their respective pseudo-translations:

IDS_TITLE_OPEN_SKIN “Select Device”
IDS_TITLE_OPEN_SKIN “??S?l?ct D?v?c???”

IDS_MY_FOLDER “Directory:”
IDS_MY_FOLDER “??D?r?ct?r?:??”

IDS_MY_OPEN “&Open”
IDS_MY_OPEN “??&Op?n?”

IDS_WINDOW_NOT_ENOUGH_MEM
“Windows has not enough memory. You may lower the heap size specified in the configuration file.”
IDS_WINDOW_NOT_ENOUGH_MEM
“??W?nd?ws h?s n?t ?n??gh m?m?r?. Y?? m?? l?w?r th? h??p s?z? sp?c?f??d ?n th? c?nf?g?r?t??n f?l?.????????????????????”

IDS_TARGET_INITIALIZATION_FAILED
“Failed to load or initialize the target.”
IDS_TARGET_INITIALIZATION_FAILED
“??F??l?d t? l??d ?r ?n?t??l?z? th? t?rg?t.????????”

In these strings, Japanese characters have been pushed in to replace the vowels in all English words. The goal of using Ja characters is to ensure that, when compiled, the strings will look and behave as they should under Windows Japanese; it’s important to pseudo-translate with the right result in mind.

Some observations:

  1. Each string begins with Ja characters, since that will be the case in the real Japanese translation, and it’s a situation worth testing.
  2. Each string contains enough English characters to allow the tester to “gist” the string from the context. This is helpful because pseudo-translation can often destroy the meaning of the string.
  3. Each string has a ratio of swell, with trailing characters adding 20% to the length of the string. This helps flush out fields and controls in which strings will be truncated.

Okapi Rainbow is an excellent (if somewhat inscrutable) text-manipulation utility for just this purpose. When run on all of the string files in the development project, the result is a set of resources which, when recompiled, will run as a pseudo-translated binary. With a testbench running the appropriate operating system, a tester can get a good idea of the I18n work in store for the developers.

Rare is the product that passes pseudo-translation testing on the first try, either because of strings left behind in the code, resizing issues, string truncation, buffer overflows, or just plain bad luck.

Even if your code isn’t perfect, though, look on the bright side: You’re now a pseudo-translator.

John White I18n, hard-coded strings, internationalization testing, localization utilities, okapi rainbow, pseudo translation

How to pseudo-translate, Part I

March 6th, 2007

Before you localize your software product, wouldn’t you like to have an idea of what’s going to break as a result?

If you’ve written it in English, it will surprise and alarm you to learn that that’s no assurance that it will work when the user interface (UI) is in Chinese or Arabic or maybe even Spanish. The most conspicuous vulnerabilities are:

  • text swell, in which “prompt” becomes “Eingabeaufforderung” in German, for example, and the 40 pixels of width you’ve reserved in the English UI show only a small part of the German;
  • corrupted characters, which will show up in the UI as question marks or little black boxes because characters such as à, ü, ¿, ß, Ø and ??? aren’t in the code page or encoding under which your software is compiled;
  • illegible or invalid names of files and paths, which occur when installing your software on an operating system that will handle more kinds of characters than your product will;
  • crashes, which occur when your software mishandles the strange characters so badly that the program just giggles briefly and then dies;
  • ethnocentric business logic, which leads to ridiculous results when users select unanticipated countries or currencies;
  • hard-coded anything, whether currency symbols, standards of measurement (metric vs. English) or UI strings.

In the past, localization efforts have become stranded on these beaches late in the voyage, after the text has been translated and the binaries rebuilt. It needn’t be that way.

Internationalization testing is the process of pushing alien characters and situations down your software’s throat to see what breaks. The more complex the software, the more complex the testing, such that there are companies that specialize in internationalization as much as if not more than localization.

It’s not rocket science, but it doesn’t happen on its own, either. And, you don’t want your customers worldwide doing any more of your internationalization testing than absolutely necessary, because they really don’t appreciate buying the product and then testing it.

The process requires some cooperation between Engineering and QA, which should already be in place for the domestic product and can easily be extended to the international products as well. An upcoming post will explain some of the tools and techniques for proper internationalization testing.

John White I18n, UI localization, hard-coded strings, internationalization, internationalization testing, localization engineering, pseudo translation, resource file localization

Internationalization and the smart installer

August 25th, 2006

Have we been thankful enough for InstallShield? I think it’s a royal headache for the release engineers that have to get used to it, but it’s a dream for a localization project manager:

  • InstallShield does most of the hard work. Most of the strings are already translated into more languages than most companies know what to do with.
  • Customized strings live in a single, text-based value.shl file, which the release engineers peel off and hand me for translation.
  • By default it creates language-specific branches in source control, which prevents, say, your Russian release from getting pasted in as a mere revision to your original English release.

The value.shl file is very simple, and ours changes so infrequently that it’s easiest for me to update it myself (version numbers, copyright dates, URLs), without need to hand it off for translation.

Of course, it did drive the release engineers batty in the early days, especially when I wandered in asking for 3 Asian and 2 Western installers every few months. The hard part for them is seeing far enough down the road to build a maintainable structure in source control. It never occurred to them to start out with branches labeled /en/ or /0009-English/ because they never foresaw the need for other languages, so they painted themselves into corners but didn’t realize it until Chinese came along one day.

People in this industry write about introducing worldwide consciousness to the overall mindset of the organization, and evangelizing the gospel of localization; that’s the 50,000 foot-/15,240 meter-level. Must be nice. I spend most of my time crawling in a trench in source control somewhere, trying to soften periods into decimal separators without getting flamed.

John White I18n, installshield, internationalization, localizing installer

Fixing that small internationalization gaffe

August 23rd, 2006

The engineers resolved the internationalization problem. Sort of.

They’ve modified the logic so that it no longer depends on the hardcoded presence of “&Tools” to pull the resources in correctly from two separate DLLs. However, it still looks for the literal “&Edit” in each DLL. If it doesn’t find it, the submenu items do not show up. I know, because I broke it again with a random pseudo-translation pass that rendered “&Edit” as “&ßéüdßéüt” in one resource file and “&ßéüñdßéüñt” in the other.

“Well, what do you expect?” asked the developer, when I explained this to him. “Get your pseudo-act together and you won’t find problems like this.”

I granted him that it was very unlikely that “&Edit” would be translated differently in two places – well, it could happen, but it should not happen – but that was not the point. It’s just not good programming practice to depend on string literals like that, whether localization engineering is a concern or not. “Why don’t you make the dependency on the string ID instead? Localization will never go near that.”

“Submit a ticket on it and we’ll see for next time,” he replied. “I’ve got other dragons to slay right now.”

So, I filed the request and the enhancement is in the great cosmic wash of the engineering team’s Issue Review system.
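
The difference between the two kinds of dependency can be sketched in Python. The resource tables, the ID value, and the function names below are hypothetical and are not taken from the product; only the two pseudo-translated captions come from my test pass.

```python
# Hypothetical string ID shared by the resource tables of both DLLs.
IDS_MENU_EDIT = 101

RESOURCES_A = {IDS_MENU_EDIT: "&ßéüdßéüt"}
RESOURCES_B = {IDS_MENU_EDIT: "&ßéüñdßéüñt"}


def find_by_caption(resources, caption):
    """Fragile: breaks as soon as the caption is translated."""
    for string_id, text in resources.items():
        if text == caption:
            return string_id
    return None


def find_by_id(resources, string_id):
    """Robust: the ID stays the same whatever the caption says."""
    return string_id if string_id in resources else None


for resources in (RESOURCES_A, RESOURCES_B):
    # Looking for the literal "&Edit" fails in both pseudo-translated tables.
    assert find_by_caption(resources, "&Edit") is None
    # Looking for the ID succeeds in both.
    assert find_by_id(resources, IDS_MENU_EDIT) == IDS_MENU_EDIT
```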

John White I18n, internationalization, pseudo translation