Archive for the ‘documentation localization’ Category

Localizing DITA Projects

December 4th, 2008

Have you seen DITA projects land in your inbox yet? The full promise of XML is about to become your next headache.

If you don’t know what DITA is, here’s the thumbnail from the Open Toolkit’s User Guide:

“DITA (Darwin Information Typing Architecture) is an XML-based, end-to-end architecture for authoring, producing, and delivering information (often called content) as discrete, typed topics.”

In short, the source content you hand off for localization lives in XML files. If you get to the party soon enough, you can help your own cause by asking the authors to use specific XML tags in their authoring to make it easy for you to find text you need to translate and to ignore text you don’t need to translate. The authors will surely fall all over themselves to make you happy with this new technology, so take advantage of it while it’s still novel.

The problem with XML is that it’s ugly and nobody can use it as documentation in that format, so it needs to be transformed into HTML, PDF, CHM, XHTML, or some other gestalt that people will use. The DITA Open Toolkit is an open-source means for performing this transformation, using scripts and languages to shape the content.

Your problem as a localization professional is not in the XML; it’s in the transformation.

How do you know that the scripts your writers use for the source language (let’s say, English) will work when you have to run them on XML files translated into Korean or Hebrew or Russian? (Well, they will run; the question is whether the result is good or garbage.)

With a kit like the Open Toolkit, things run as advertised when used right out of the box. The open-source project even devotes a chapter of its user guide to “Localizing (translating) your DITA content,” and they are kind enough to provide pre-translated text like “Parent Topic,” “Previous,” “Next,” which you can hook with the xml:lang attribute. The tricky part lies in the customization.

One Tech Pubs team engaged a team of script programmers to customize the toolkit. They’ve introduced strings like “Copyright Statement” and “Enter keyword” and placed a “Last updated” datestamp on every page in the help project. They’ve also implemented a search function (gulp!) so users can locate content in the help files. There’s nothing wrong with this customization work, except that nobody was thinking of other languages while doing it. Now we’re sorting out the location of the custom strings, the way to get the toolkit to format dates according to locale, and how to convince the search function that characters can take up more than one byte.
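To make the multi-byte point concrete, here is a minimal Python sketch. It is not the team's search code (which isn't shown here); the Korean sample string and the four-byte cutoff are my own assumptions, chosen only to show what happens when a routine counts bytes where it should count characters.

```python
# Illustration only: why byte-oriented code breaks on translated text.
text = "검색"  # Korean for "search": two characters

utf8_bytes = text.encode("utf-8")

char_count = len(text)        # number of characters (code points)
byte_count = len(utf8_bytes)  # number of bytes in the UTF-8 encoding

# A routine that assumes one byte per character and cuts after 4 bytes
# slices the second character in half; decoding then yields U+FFFD,
# the Unicode replacement character, in place of the damaged one.
truncated = utf8_bytes[:4].decode("utf-8", errors="replace")

print(char_count, byte_count, truncated)
```

Any indexing, highlighting, or truncation logic in a customized search function has the same exposure if it was written and tested on English only.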

You will face the same problems. You’ll need to internationalize your writers’ customizations so that things work properly in your target language.

So when your writers tell you how much easier your life will be now that content is in XML, don’t forget to look a bit further down the road at what they’re using to transform that XML into something useful. That’s where you’ll put in the hours.

Wordcount Woes – Part 2

October 2nd, 2008

If you’re working client-side, how many words have you paid for that translators didn’t even need to touch?

I posted a couple of weeks ago on translatable words that vendors may miss in analyzing files. Alert reader arithmandar commented that slide decks can be even worse, if there is a lot of verbiage on the master slide that does not get easily captured (although Trados finds these words, according to him/her). Flash is another story altogether, and arithmandar’s recommendation is that a Flash engineer should probably perform the analysis.

The other side of the coin is also unpleasant, but for the other party: Clients can hand off vast expanses of words that nobody will translate, artificially inflating the wordcount and estimate.

  • Code samples – If your documentation contains examples of code used in your product (e.g., in an API reference), there is no point in having that included in the wordcount, because nobody translates code.
  • XML/HTML/DITA/Doxygen tags – I hope your vendor is parsing these files to ignore text (especially href text) in the tags. Otherwise, not only will you get back pages that won’t work worth a darn, but you’ll also be charged for the words.
  • Legal language – Some companies want their license agreements, trademark/copyright statements, and other legal pages left untranslated. (Usually these are American companies.)
  • Directives – Certain directives and warnings apply to certain countries only. The documentation for computer monitors and medical devices often contains a few pages of such directives, which appear in the language of the country requiring them. There is usually set language for these directives, so free translation is not appreciated; have your colleagues in Compliance obtain the language for you, paste it in yourself, and point it out to your vendor.
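On the tags item above, here is a rough Python sketch of the difference between counting everything separated by spaces and counting only the text between tags. The sample markup and the function name translatable_words are made up for illustration; real analysis tools use format-specific filters, and this is only a stand-in for the idea.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores tags and attribute values."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def translatable_words(markup):
    """Return the words a translator would actually see."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks).split()


# Hypothetical sample: the href value contains a space, so a naive
# whitespace count treats pieces of the link as billable words.
sample = '<p>See the <a href="install/setup guide.html">setup guide</a> first.</p>'

naive_count = len(sample.split())
tag_aware_count = len(translatable_words(sample))

print(naive_count, tag_aware_count)
```

The gap between the two numbers is small in one sentence, but it adds up across a few thousand pages of link-heavy HTML.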

Mind you, there are costs associated with finding and removing all of these words: Do you want to spend time extracting the words? Do you want to hire somebody to find and extract them? Will your savings offset those costs?

If the words to be ignored add up to enough money – as they often do for a couple of our clients – pull them all into a text file and send them to your vendor with instructions to align them against themselves for all languages in the translation memory database. That way, when the vendor analyzes your files, the untranslatable words will fall out at 100% matches.
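If your vendor's tool can import TMX (the translation memory exchange format), the "align them against themselves" step could be prepared along these lines. This is a sketch under assumptions: the function name, the header values, and the language codes are mine, and you should confirm with your vendor how their tool wants such entries delivered.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def self_aligned_tmx(segments, source_lang, target_langs):
    """Build a TMX document in which every target segment equals its source."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "self-align-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain text",
        "adminlang": "en-US",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for segment in segments:
        tu = ET.SubElement(body, "tu")
        # One variant per language, all carrying identical text.
        for lang in [source_lang, *target_langs]:
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = segment
    return ET.tostring(tmx, encoding="unicode")


# Hypothetical untranslatable strings pulled from the documentation.
tmx_text = self_aligned_tmx(
    ["IBACKLIGHT_GetBacklightInfo", "AEEBacklightInfo"],
    "en-US",
    ["ja-JP", "ko-KR"],
)
print(tmx_text)
```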

Do you have ideas on how to handle such text?

Wordcount Woes – Part 1

September 11th, 2008

Do you spend much time fretting about wordcount?

My hunch is that translators worry about it more than agencies do, because it’s often the only metric by which translators earn their daily bread. Agencies have project management, layout, graphics, consulting, rush charges and other metrics to observe, but most translators have one line-item on their invoices: wordcount.

I suppose that we all live and die by it because everybody’s calculations get down to wordcount – either source or target text – sooner or later, but no two tools define words the same way, so wordcount can vary considerably.
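Here is a small Python sketch of how two perfectly reasonable definitions of "word" disagree on the same sentence. Neither function reproduces any particular TM tool; both definitions and the sample sentence are my own, for illustration.

```python
import re


def count_whitespace_tokens(text):
    """A word is anything between spaces, punctuation included."""
    return len(text.split())


def count_alphanumeric_runs(text):
    """A word is an unbroken run of letters, digits, or underscores."""
    return len(re.findall(r"\w+", text))


sample = "Re-save the file (Ctrl+S) - see page 12."

by_whitespace = count_whitespace_tokens(sample)
by_runs = count_alphanumeric_runs(sample)

print(by_whitespace, by_runs)
```

The first definition counts the lone hyphen as a word and treats "Re-save" and "(Ctrl+S)" as one word each; the second drops the hyphen and splits both of the others in two. Multiply that disagreement by a 3500-page manual and the estimates drift apart.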

Still, the bigger issue with wordcount is “wordcount leakage.” If you’re working vendor-side, how many times have you quoted on a project, then realized that you had overlooked a chunk of text?

  • Graphics are the biggest culprit. The document contains charts and diagrams that require translation, but TM tools don’t find those words. Many vendors wisely exclude such text from wordcount and cover it in an hourly or per-graphic charge. (Nobody can ever find the source files for the graphics so that you can localize them properly, but that’s a whole other talk show.)
  • Bookmarked text is also slippery. It appears as text (sentences, paragraphs) in one place, and is referred into other places in the document. True, you only translate it in one place, but you need to deal with it – layout, formatting, page flow – in other places as well.
  • Conditional text, a favorite of FrameMaker professionals, can also cause you trouble. If you don’t calculate wordcount with the conditions set to expose all of the text, you may miss some of it. The author should arrange for this before handoff.
  • Embedded documents (spreadsheets, word processing, HTML, presentations) are very sneaky. We just saw this the other day with an MS Word document that contained several embedded spreadsheets visible only as 1cm square icons on the page; double-clicking the icons opened up the embedded files. The TM tools don’t see those words, but the client certainly would have if they had come back untranslated. Fortunately, we caught this in time.

The Moral: Two pairs of eyes should review every file before the TM analysis, NOT one pair of eyes and a TM software package.

Localizing Code Snippets – Part II

August 21st, 2008

Last week I posted on the dilemma of how to localize Code Snippets, the selected pieces of your documentation that you shoehorn into XML files so that Visual Studio can present them in tool-tip-like fashion to the user while s/he is writing code that depends on your documentation.

My goal was to ensure that the process of grabbing these bits of documentation (mostly one-sentence descriptions and usage tips) was internationalized, so that we could run it on translated documentation and save money. This has proved more difficult than anticipated.

Here is the lesson: If you think it’s hard to get internal support for internationalizing your company’s revenue-generating products, just try to get support for internationalizing the myriad hacks, scripts, macros and shortcuts your developers use to create those products.

In this client’s case, it makes more sense to translate the documentation, then re-use that translation memory on all of the Code Snippet files derived from the documentation. It will cost more money (mostly for translation engineering and QA, rather than for new translation) in the short run, but less headache and delay in the long run. Not to mention fewer battles I need to fight.

Discretion is the better part of localization valor.

Localizing Code Snippets

August 14th, 2008

“Why would I localize code snippets?” you ask. (Go ahead; ask.)

Everybody knows you don’t translate snippets of code. Even if you found a translator brave enough to take on something like int IBACKLIGHT_GetBacklightInfo(IBacklight *p, AEEBacklightInfo * pBacklightInfo), the compiler would just laugh and spit out error messages.

However, if you’re a developer (say, of Windows applications) working in an integrated development environment (say, Microsoft Visual Studio), you may want to refer very quickly to the correct syntax and description of a feature without searching for it in the reference manual. The Code Snippet enhancement to Visual Studio makes this possible with a small popup box that contains thumbnail documentation on the particular interface the developer wants to use. It’s similar in concept and appearance to the “What’s This?” contextual help offered by right-clicking on options in many Windows applications.

How does the thumbnail documentation get in there? It’s a tortuous path, but the enhancement pulls text from XML-formatted .snippet files. You can fill the .snippet files with the information yourself, or you can populate them from your main documentation source using Perl scripts and XSL transformation. So while you’re not really translating code snippets, you’re translating Code Snippets.

And therein lies the problem.

One of our clients is implementing Code Snippets, but the Perl scripts and XSL transformation scripts they’re using to extract the documentation don’t support Unicode. I found this out because I pseudo-translated some of the source documentation and ran the scripts on it. Much of the text didn’t survive to the .snippet files, so we’re on a quest to find the offending portions of the scripts and suggest internationalization changes.
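For readers who haven't pseudo-translated anything, the idea can be sketched in a few lines of Python. The client's scripts are Perl and XSL and are not reproduced here; the function name, the marker characters, and the simulated "ASCII-only step" are all assumptions of mine, meant to show how non-ASCII text can vanish without any error message.

```python
# Map plain vowels to accented ones so the text stays readable
# but no longer fits in ASCII.
ACCENTED = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")


def pseudo_translate(text):
    """Wrap the text in Japanese markers and accent its vowels."""
    return "[日本" + text.translate(ACCENTED) + "語]"


source = "Returns the backlight info."
pseudo = pseudo_translate(source)

# Simulated pipeline step that quietly assumes ASCII: every character
# it cannot represent is dropped without raising an error.
ascii_only = pseudo.encode("ascii", errors="ignore").decode("ascii")

# A Unicode-safe step returns exactly what it was given.
utf8_round_trip = pseudo.encode("utf-8").decode("utf-8")

print(pseudo)
print(ascii_only)
print(utf8_round_trip == pseudo)
```

Comparing what went in with what came out is the whole test: if the markers and accents are missing at the far end, some step in between is not Unicode-clean.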

We’ve determined that the translated documentation in the Code Snippets will display properly in Visual Studio; the perilous part of the journey is the process of extracting the desired subset of documentation and pouring it into the .snippet files. Don’t expect that your developers will automatically enable the code for this; you’ll probably have to politely persist to have it done right.

Alternatives:

  • Wait until all of your documentation has been translated, then translate the .snippet files. It’s more time-consuming and it will cost you more, but working this far downstream may be easier than getting your developers to clean up their scripts.
  • Make your Japanese developers tolerate English documentation in the Code Snippets.

Neither one is really the Jedi way. Work with your developers on this.

Getting your Documentation Ready for Localization

July 10th, 2008

Have you had to prepare your documentation for localization yet? My experience is that in almost all companies, writers have far too many other oppressive concerns gnawing at them to think about writing for localization.

A few days ago an industry colleague sent me a message asking, “Do you have experience making recommendations for how documentation can be authored for localization? I am looking to make our doc process more efficient to reduce costs.”

I replied that, given his stature and tenure in the industry, there was not likely anything I could suggest that he hadn’t already considered. Nevertheless, I sent him a list of ideas, in increasing order of difficulty:

  1. Make sure all the writers’ computers are plugged in. (A bit of ironic humor I could not resist.)
  2. Is it easy to get from the authoring tool(s) into TM, and back out into publishable format? This is my current headache with an API reference manual we localize for one client: moving from source language to the translator tools and back to target format is a colossal chore. If you have similar problems, devote some cycles at the format layer, even if it means writing an interface between your content management system and the translation tool.
  3. There are “authoring memory” tools that can suggest and re-use already-translated source text, so that writers don’t say nearly the same thing multiple times and incur unnecessary TM penalties. Sajan has one, and SDLX contains one as well. I’ve never used either one, but I can imagine that success with the tools would require somebody with the documentation-familiarity of a technical writer and the global consciousness of a localization manager. Like you.
  4. I’ve presented on localization to a variety of audiences, and have consistently found tech writers to be the most interested in it, vastly more so than developers. When you show writers how the TM tools work, tell them how they can save money and re-use content, and let them know that you care about the impact of their work on international products, they will smell the coffee and engage. This takes a bit of evangelism, but it’s worth it if the writers change their own practices.
  5. Convert everything to XML. Although Renato and Don of Common Sense Advisory joke that that will fix any L10n problem, it’s nonetheless a good, long-term direction in which to move. It’s easier to re-use text, and easier to mark text that should/should not be translated. That will save you money.
  6. Start a program of controlled language authoring (dumbing down the sentences, always writing in a structure that machine translation will recognize, etc.). I guess that GM and Caterpillar are poster children for this kind of thing, but it puts the writers (and you, in the bargain) through the change of life, which is why I mention it last.

What about you? Have you faced this in your organization? How have you made document localization easier for the company, without driving your writers crazy?

If you liked this post, have a look at Getting Writers to Care about Localized Documents.

Localizing Robohelp Files – The Basics

May 29th, 2008

We get a lot of search engine queries like “localize Robohelp file” and “translate help project.” I’m pretty sure that most of them come from technical writers who have used Robohelp to create help projects (Compiled HTML Help Format), and who have suddenly received the assignment to get the projects localized.

The short answer
Find a localization company who can demonstrate to your satisfaction that it has done this before, and hand off the entire English version of your project – .hpj, .hhc, .hhk, .htm/.html and, of course, the .chm. Then go back to your regularly scheduled crisis. You should give the final version a quick smoke test before releasing it, for your own edification as well as to see whether anything is conspicuously missing or wrong.

The medium answer
Maybe you don’t have the inclination or budget to have this done professionally, and you want to localize the CHM in house. Or perhaps you’re the in-country partner of a company whose product needs localizing, and you’ve convinced yourself that it cannot be that much harder than translating a text file, so why not try it?

You’re partially right: it’s not impossible. In fact, it’s even possible to decompile all of the HTML pages out of the binary CHM and start work from there. But your best bet is to obtain the entire help project mentioned above and then use translation memory software to simplify the process. Once you’ve finished translating, you’ll need to compile the localized CHM using Robohelp or another help-authoring product (even hhc.exe).

The long answer
This is the medium answer with a bit more detail and several warnings.

  • There may be a way to translate inside the compiled help file, but I wouldn’t trust it. Fundamentally, it’s necessary to translate all of the HTML pages, then recompile the CHM; thus, it requires translation talent and some light engineering talent. If you don’t have either one, then stop and go back to The Short Answer.
  • hhc.exe is the Microsoft HTML Help compiler. It’s part of the HTML Help Workshop freely available from Microsoft; it does not come with Windows itself. This workshop is not an authoring environment like Robohelp, but it offers the engineering muscle to create a CHM once you have created all of the HTML content. If you have to localize a CHM without recourse to the original project, you can decompile all of the HTML pages out of the CHM, either from the HTML Help Workshop menus or with the hh.exe viewer that ships with Windows (hh.exe -decompile <output folder> <file.chm>).
  • Robohelp combines an authoring environment for creating the HTML pages and the hooks to the HTML Help compiler. As such, it is the one-stop shopping solution for creating a CHM. However, it is known to introduce formatting and features that confuse the standard compiler, such that some Robohelp projects need to be compiled in Robohelp.
  • Robohelp was developed by BlueSky Software, which morphed into eHelp, which was acquired by Macromedia, which Adobe bought. Along the way it made some decisions about Asian languages that resulted in the need to compile Asian language projects with the Asian language version of Robohelp. This non-international approach was complicated by the fact that not all English versions of Robohelp were available for Asian languages. Perhaps Adobe has dealt with this by now, but if you’re still authoring in early versions, be prepared for your localization vendor to tell you that it needs to use an even earlier Asian-language version.
  • Because the hierarchical table of contents is not HTML, you may find that you need to assign to it a different encoding from that of the HTML pages for everything to show up properly in the localized CHM, especially in double-byte languages.
  • The main value in a CHM lies in the links from one page to another. In a complex project, these links can get quite long. Translators should stay away from them, and the best way to accomplish that is with translation memory software such as Déjà Vu, SDL Trados, across or Wordfast. These tools insulate tags and other untouchable elements from even novice translators.

We’ve marveled at how many search engine queries there are about localizing these projects, and we think that Robohelp and the other authoring environments have done a poor job explaining what’s involved.

If you liked this article, have a look at Localizing Robohelp Projects.

If it isn’t broken…break it!

May 22nd, 2008

What’s the most effective way to bump up your translation costs unnecessarily?

Probably by localizing something that nobody will ever want in a foreign language, of course. But nobody would ever approve an expense like that, so it wouldn’t have the opportunity to affect your translation costs.

There’s a much sneakier, more pernicious way of wasting translation money: Tinkering with the original text (for example, English).

Suppose you localized your product or documentation from 2002 through 2007. You’d have five years’ worth of translation memory (TM) economies and glossary entries going for you, with thousands of exactly matched words that incurred no translation cost from one version to the next. Then suppose that someone decided in 2008 to go in and “clean up” the original English text to make it more “readable” or “user-friendly.”

What do you think would happen the next time you handed off this content for TM analysis? Suddenly, non-matches would pop up where exact matches used to be. Among the causes:

  • Combining short sentences together
  • Breaking long sentences apart
  • Making stylistic changes to common terms (e.g., changing “phone” to “telephone” or “handset”)
  • Standardizing disparate terms (e.g., selecting one of “Proceed as follows,” “Perform the following steps,” “Following is the required procedure” and propagating throughout the documentation)
  • Typographical or grammatical corrections

You might tolerate these modifications in the interest of improving your product in all languages – not just English – but the sad truth is that you may find that they make no difference in the localized products. You’d pay for words that the translator did not need to touch. This is an unfortunate artifact of the way in which translation jobs are estimated, but the analysis software cannot predict that the changes will make no difference to the translation; only the translator sees that.

Note that re-organizing content should not cost you additional translation money; as long as the sentence is the same (i.e., an exact match), it doesn’t matter where it’s located in the product.
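To see why a small edit turns an exact match into a paid fuzzy match or non-match, here is a toy Python sketch. It uses the standard library's difflib as a stand-in for a TM tool's matching; commercial tools use their own algorithms and thresholds, and the two-sentence "memory" below is invented for the example.

```python
from difflib import SequenceMatcher

# A toy translation memory: only the source sentences matter here.
translation_memory = [
    "Proceed as follows.",
    "Press the power button on the phone.",
]


def best_match(sentence, memory):
    """Return (similarity, stored_sentence) for the closest stored sentence."""
    return max(
        (SequenceMatcher(None, sentence, stored).ratio(), stored)
        for stored in memory
    )


# Same sentence, wherever it now sits in the document: still an exact match.
unchanged = best_match("Proceed as follows.", translation_memory)

# "Standardized" wording: no longer an exact match.
restyled = best_match("Perform the following steps.", translation_memory)

# One term swapped ("phone" to "handset"): close, but not exact.
reworded = best_match("Press the power button on the handset.", translation_memory)

for score, stored in (unchanged, restyled, reworded):
    print(f"{score:.2f}  {stored}")
```

Only the first lookup scores 1.0. The other two are the kind of segment that lands in the paid columns of the analysis, even if the translator ends up leaving the translation exactly as it was.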

So, are you better off leaving errors and other undesirables in your original-language content? No. It would be a mistake to let concern for translation cost impede your product improvement effort, like having the tail wag the dog. Still, to the extent you can control it, you should try to avoid purely stylistic changes that make no difference in how your customers use your product. A good editor can make a hundred such changes per hour, not realizing the ramifications on translation costs.

If you learned something from this post, you might like to read Improved Docs through Localization or Getting the Writers to Care about Localized Documents.

Doxygen and localization

May 15th, 2008

Are you localizing any documentation projects that use Doxygen? It’s an open-source tool for documenting source code.

If your documentation set includes things like an API reference or extensive details in programming code, Doxygen allows you to embed tags in the original code or header files, then automatically create entire help systems organized around the tagged text. Doxygen does not compile anything, but takes the tagged bits of source files, turns them into HTML pages, then links them for viewing in a browser.

Like most tools, it’s a breath of fresh air when it works properly, but it can require a lot of re-plumbing and retrofitting.

As far as localization goes, it can be a life-saver. In theory, you can have the header files themselves localized, then run them through Doxygen as you would the original English files. Working this far upstream can be a big advantage.

Some months ago a client embarked on a conversion of a help system to Doxygen. While it was still in the proof-of-concept stage, we pseudo-translated some header files and tested the tool for global-readiness.

The good news is that the developers of Doxygen have enabled it for multiple languages. It encodes pages in UTF-8 (or other character sets), so translated text displays properly in the browser. It’s possible to set the OUTPUT_LANGUAGE parameter to your target language (e.g., Japanese, in our test scenario) so that the datestamp and other text supplied by Doxygen displays in Japanese, rather than in the default English.

There are some I18n problems with Doxygen, though.

  • Each header file page begins with “Collaboration diagram for” followed by the page title. When the page title contains double-byte characters, the Japanese characters for “Collaboration diagram for” are corrupted. It appears that Doxygen is not pushing UTF-8 characters for this phrase, though it pushes UTF-8 characters in other places.
  • Some hyperlinked words in body text will require translation. If so, it will be important to ensure that they are translated the same everywhere. Note, however, that Doxygen will not generate the necessary file if the hyperlink has double-byte characters in it (not even on a Japanese OS).
  • Doxygen allows for generation of the .hhc, .hhp and .hhk files needed for Compiled HTML Help (CHM). It can also be configured to execute hhc.exe and compile the project. However, Doxygen outputs the .hhc file in UTF-8 format, which is incompatible with the table of contents pane in the Help viewer. To fix this, open the .hhc in Notepad (preferably on a Japanese OS) and save it back out as Shift-JIS (“ANSI” in Japanese Notepad). Then recompile the CHM by invoking hhc.exe from the command line and the contents will show up properly.
  • Searches using single- or double-byte characters do not work in the resulting CHM.
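If you would rather not do the Notepad step on the .hhc by hand for every build, the same re-encoding can be scripted. Here is a minimal Python sketch, assuming that "ANSI" in Japanese Notepad corresponds to Windows code page 932 (Python's cp932 codec); the function names and file names are hypothetical.

```python
from pathlib import Path


def utf8_to_cp932(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as code page 932 (Windows Shift-JIS).

    "utf-8-sig" accepts input with or without a byte-order mark.
    Raises UnicodeEncodeError if a character has no cp932 equivalent.
    """
    text = data.decode("utf-8-sig")
    return text.encode("cp932")


def convert_file(source, target):
    """Read a UTF-8 file and write a cp932 copy, leaving the original intact."""
    Path(target).write_bytes(utf8_to_cp932(Path(source).read_bytes()))


# Example with hypothetical file names:
# convert_file("index.hhc", "index_sjis.hhc")
```

You would still recompile with hhc.exe afterwards, as described above; this only replaces the open-and-save-as step.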

These strike me as rather large, empty boxes on the checklist of global-readiness. Still, the source code is available, so if your organization has already started down the Doxygen path, you can clean up problems like these for your worldwide versions.

Interested in this topic? You might enjoy another article I’ve written called Localizing Robohelp Projects.

Putting More "Sim" in your "SimShip"

April 17th, 2008

How are you doing on your simultaneous shipment (“simship”)? This is a common term in the industry that refers to releasing your domestic and localized products at the same time. Is your organization getting closer to simship? It shouldn’t be getting further from it.

What measures have you put in place to reduce your time to market for localized versions? It’s never easy to pry finalized content from writers and engineers in time to have it translated, but that’s the dragon that most of us have to slay, so we focus on it a lot. How can we peel off content and get the translation process started sooner?

In the same way that eating lightly 5 times a day keeps you from getting really hungry and eating voraciously 3 times a day, we’ve found that handing off smaller bits of content even before they’re finished keeps us from having to panic when somebody calls for a localized version.

We manage projects for a client who has the advantages of lots of sub-releases (3.1.2, 3.1.3, 3.1.5) between main releases (3.1, 3.2), and few overseas customers who want the sub-releases. (They also have the disadvantage of lacking a content management system that would make this much easier.) Even if your situation is not an exact match, you’ll find that some principles apply anyway.

  • The biggest nut in the product is a 3500-page API reference guide in HTML. (Most products have a big, fat component that dwarfs all of the others.)
  • One month before each release, we assume that any new pages are about 95% final, so we hand them off for translation.
  • By the release date, we know whether we need to release a localized version of the entire product or not. If so, we proceed to hand off all of the rest of the product for translation, knowing that there will be some re-work of the new pages handed off a month before; if not, we hand off only the changed pages.
  • Thus, we almost always have pages from the API reference guide in translation. If we need them for a release, we have a lot of momentum already; if we don’t need them for a release, we put the translations into our back pocket and wait until it’s time for the next localized version.

This costs somewhat more than normal because of the inevitable re-translation that goes on, not to mention the hours spent refreshing the localization kit and preparing files for translators. But this cost is acceptably low compared to the look of anguish on the international product manager’s face when we have to say, “It will take about three months to finish the Korean version because of all of the change since we last localized it.”

We also need to assume that, sooner or later, there will be a request for the product in certain languages. If business conditions change and the new translations never see release, then the effort has been wasted for those languages, but that’s a normal business risk.