Internationalization Puzzler: Page Encoding

For a Web localization project, we’ve pseudo-translated a Java-based site that runs on IBM WebSphere.

To pseudo-translate, we padded every string with a leading ¿¡ÃÉ and a trailing ßÎÕÜ (the target languages this round are all covered by Latin-1). The characters are UTF-8 encoded, and every page is generated with a meta tag declaring charset=utf-8.
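As a rough illustration of the padding step, here is a minimal Python sketch. It is not the tooling used on the project (the site itself is Java); the function names and the sample resource keys are made up for the example, and only the padding characters come from the description above.

```python
# Padding characters quoted in the post; everything else here is hypothetical.
PREFIX = "¿¡ÃÉ"
SUFFIX = "ßÎÕÜ"


def pseudo_translate(text):
    """Wrap one UI string in accented padding so encoding damage is easy to spot."""
    return PREFIX + text + SUFFIX


def pseudo_translate_bundle(strings):
    """Apply the padding to every value of a key -> string mapping."""
    return {key: pseudo_translate(value) for key, value in strings.items()}


# Hypothetical resource keys, for illustration only.
bundle = pseudo_translate_bundle({"button.save": "Save", "button.cancel": "Cancel"})
print(bundle["button.save"])
```

Every padding character falls inside Latin-1, so the same strings can be represented in either ISO-8859-1 or UTF-8, which is exactly why a mismatch between the two shows up so visibly.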

As WebSphere serves the pages, many of them look fine, e.g.:

[Image: good_chars]

However, many of the pages display the characters as corrupted:

[Image: bad_chars]

Oddly, the browser reports that these bad pages are encoded as Western European (ISO), even though the charset in the page source says UTF-8. If you switch the browser to display the page as UTF-8, the characters show up properly.
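The corruption is what you would expect when UTF-8 bytes are read as ISO-8859-1. A small Python sketch reproduces it; the sample string is invented, and Python's strict ISO-8859-1 codec is only an approximation of a browser's "Western European (ISO)" setting, which many browsers treat as windows-1252.

```python
# Hypothetical sample string using the padding characters from the post.
padded = "¿¡ÃÉ Search ßÎÕÜ"

# What the server sends: the string encoded as UTF-8.
utf8_bytes = padded.encode("utf-8")

# What the browser shows if it is told the bytes are ISO-8859-1:
# each accented character becomes two characters, e.g. "¿" turns into "Â¿".
misread = utf8_bytes.decode("iso-8859-1")

# Switching the browser to UTF-8 amounts to decoding the same bytes correctly.
recovered = utf8_bytes.decode("utf-8")

print(repr(misread))
print(recovered)
```

Because the bytes themselves are intact, the damage is purely one of interpretation, which matches the observation that switching the browser's encoding fixes the display.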

It appears that WebSphere is telling the browser, “I know what’s best. Ignore the UTF-8 in the charset and handle this page as ISO,” and the browser obliges.

Even more maddeningly, this happens on only some pages in the site, not all of them, even though all pages (so I’m told) are created identically.

This happens with both Firefox and IE. The engineers have experimented with Tomcat, which does not act up like this, but we need to make WebSphere work.

Have you ever seen this? Any ideas on what could be tricking the browser?

  1. April 6th, 2009 at 14:19 | #1

    My first guess is that your HTTP Content-Type header is saying ISO-8859-1, even though your HTML “meta http-equiv” element is saying UTF-8. I’d suggest first using a tool to look at the HTTP headers you are actually receiving, then taking a look at the WebSphere settings that control what HTTP headers it sends.

    The W3C tutorial on character encodings at http://www.w3.org/International/O-charset has good information and links about what the HTTP and HTML should say.
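The check suggested in the comment above can be sketched in Python. This is only an illustration: the function names are hypothetical, and the sample header value and HTML are invented to match the commenter's guess, not captured from the actual site. The sketch relies on the rule that a charset in the HTTP Content-Type header takes precedence over one declared in a meta element.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv form with content="...; charset=...".
META_CHARSET = re.compile(r"<meta[^>]*charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE)


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def charset_from_meta(html):
    """Return the lowercased charset declared in the first matching meta element, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


def diagnose(content_type_header, html):
    """Compare the two declarations; the HTTP header wins when both are present."""
    header_cs = charset_from_content_type(content_type_header)
    meta_cs = charset_from_meta(html)
    return {
        "header": header_cs,
        "meta": meta_cs,
        "effective": header_cs or meta_cs,
        # Plain string comparison: aliases such as "utf8" vs "utf-8" are not normalized.
        "conflict": header_cs is not None and meta_cs is not None and header_cs != meta_cs,
    }


# Invented inputs that mirror the suspected situation.
page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'
print(diagnose("text/html; charset=ISO-8859-1", page))
print(diagnose("text/html", page))
```

Running the same comparison against the real response headers of a good page and a bad page should show whether the two differ in the charset the server sends.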
