Perl, Unicode and i18N FAQ

Deprecated: The each() function is deprecated. This message will be suppressed on further calls in /home/zhenxiangba/zhenxiangba.com/public_html/phproxy-improved-master/index.php on line 456
Perl, Unicode and i18N FAQ

Perl, Unicode and i18N FAQ

Author/Maintainer	James, Silicon Valley Perl Mongers, ex-Netscape and Yahoo contractor. Email me if you would need Perl/i18N/web consulting.
Contributors/Reviewers	LW, GS, AK, PP, RT, TM, DH, TB, JH, MD
Copyright	1999-2001, James, released under the Perl Artistic Licence
Link	http://rf.net/~james/perli18n.html
Date	2002 02 18
Audience	Perl programmers and porters interested in Perl, Unicode and Internationalization. Fonts are not addressed in this document.
Disclaimer	If there's an error or omission, blame the author, not the reference.
Version	Draft 0.3.37- Please send me your comments!
Perl & Unicode Sightings	Simon Cozens has written a paper on Perl and Unicode James gave a talk based on this document to a diverse audience at the XVIII Unicode Conference in Hong Kong, May 2001. James presented portions of this document to a full house at the XVI Unicode Conference in Amsterdam, March 2000. Globalsight mentioned having a Perl wrapper around ICU, so should be encouraged to open-source it. Although picu is already alpha! Unicode: A Primer, by Tony Graham, published in Jan. 2000, has some good information on Perl and Unicode. Programming Perl, Third Edition, briefly covers Perl and Unicode in Chapter 15.
Programming Tools	This is what I use: SUSE 7.1 Linux (man ascii, iso_8859_1) Perl 5.6.1 (perldoc perllocale, perlunicode) ICU 1.8.1 (now X-licensed) and API reference List Archives yudit (indispensable Unicode text editor) xterm -u8 Unicode Standard 3.0 charts database RFCs at http://www.faqs.org

News

The Apache Conference 2001 in San Jose featured a talk by ASF member Eric Cholet on writing Internationalized Applications with Perl and Template::Toolkit. Basically, he used conditional logic in the template for each supported language. The Mason template developers later talked about supporting locales in their templates without conditional logic, over a few beers. Eric also emphasized the importance of offering accented and non-accented search widgets for European users. Apparently our non-English friends often type in search queries without bothering to accent the phrase.

Brian Stell has started the PICU (Perl Wrappers for ICU (International Components for Unicode)) project which is being hosted on SourceForge. Both of us attended the ICU Workshop Sept. 11-12, 2001 at IBM and Brian already has XS code working for a subset of ICU. I would like to thank IBM and the Unicode Committee for their contributions to Open Source regarding Unicode and ICU. You can see my O'Reilly Perl Conference 2001 talk on PICU slides here.

Introduction

Unicode is a 16-bit character set encoding (surrogates aside) and related semantics for simultaneously representing all modern written languages (and more). Unicode is the key technology for globalizing software, and has been implemented in Internet and database software.

With that power comes a price: Unicode is a complicated standard that requires skill and tools support to implement. This document was written to explain Unicode and international programming to two audiences, Perl porters (developers) and Perl users.

Q0. Do you have a checklist for internationalizing an application?
Q1. I think that I'm a clever programmer. What's so hard about internationalization?
Q2. Do you have a glossary of commonly used terms and acronyms?

Perl and locales, Unicode, porting, modules and CPAN

Q3: What locale support does Perl have?
Q4. What support does Perl have for Unicode?
Q5. How do operating systems implement Unicode and i18N?
Q6. I'm a Perl Porter. What should I know about i18N and C?
Q7. I'm a Perl Porter. What should I know about Perl and Unicode?
Q8. I'm a CPAN module author. What should I know about Perl and Unicode?
Q9. Do regular expressions work with locales?
Q10. Do regular expressions work with Unicode?
Q11. What are these CPAN Unicode modules for?
Q11b. What about i18N POD?
Q12. What is JPerl?

More General Unicode and Programming Information

Q13. Can I just do nothing and let my program be agnostic of character set?
Q14. Why and where should I use Unicode instead of native encodings?
Q15. What is Unicode normalization and why is it important?
Q16. How do I do auto-detection of Unicode streams?
Q17. Is Unicode big endian or little endian?
Q18. Is there an EBCDIC-safe transformation of Unicode?
Q19. Are there security implications in i18N?
Q20. Are there performance issues in i18N?
Q21. How do I localize strings in my program?
Q22. I do database programming with Perl. Can I use Unicode?
Q23. I do database programming with Perl. What are the i18N issues?
Q24. How do other programming languages implement Unicode and i18N?

Internationalized Web Programming

Q25. What support for Unicode do web browsers have?
Q26. How can I i18N my web pages and CGI programs?
Q27. How should I structure my web server directories for international content?
Q28. Can web servers automatically detect the language of the browser?
Q29. What format do I send strings to the translator?

Internationalized Email Programming

Q30. What are common encodings for email?

iDNS

Q30b. What is happening with internationalized DNS?

Timezones

Q30c. How can I manage timezones in Perl?

References

Q31. Any good references?

Perl Hacks

Q32. How do I convert US-ASCII to UTF-16 on Windows NT?
Q33. How do I transform the name of a character encoding to the MIME charset name?

Q0. Do you have a checklist for internationalizing an application?

Each application needs a different level of i18N support, so there's no formula.Often a single application will need varying levels of support for users and administrators. Generally companies want to do the minimal amount of work possible, without limiting future improvements, because of the cost of developing and testing changes.

Here is a basic checklist for an i18N design document for web applications:

What i18N experience do your developers have? Any RFC or Unicode gods?
Which client and server hardware platforms?
Which client and server operating systems?
Which application servers - database, web server, middleware, etc. and how will they be configured?
Which browsers need to be supported for the users? Which character sets are supported?
Which browsers need to be supported for the application administrators?
Which user locales need to be supported?
Which admin locales need to be supported?
What are you using for a message catalog?
Which input methods are needed?
What user and admin character sets are needed?
Describe the static content? How many pages? How often is it updated?
Which localized address, phone, currency, f/x, tax, calendar, holidays, timezone, casing and collation features are needed?
Who's going to do ongoing i18N maintenance, especially with email products?
Who's going to do the legal, financial and trademark paperwork for foreign domain names? Outside the US it's not as easy!
Do your developers know a foreign language so they can test the system?
Get somebody to review the resulting document, preferably an experienced i18N engineer.

Here is a basic checklist for a localization design document:

Who will manage the relationship with the translator?
Who will do your translation? Are they in-country or local?
What file format do they want?
How much context do they need? (screen shots, manuals)
Who will verify their work?
How often will content be updated?

If your development team does not have i18N experience, consider either hiring former Adobe, Apple or Netscape i18N engineers or using the services of GlobalSight, SimulTrans or Basis Technology.

Airplanes don't fly until the paperwork equals the weight of the aircraft. Same with i18N.

Q1. I think that I'm a clever programmer. What's so hard about internationalization?

A1. Internationalizing a product involves issues about program design, application language features, cultural practices, fonts and often legacy clients. Most programmers face a rude awakening when first internationalizing an application after a career of only ASCII. Little details often become big headaches.

For a typical tale of woe, see Richard Gillam's Adding internationalization support to the base standard for JavaScript: Lessons learned in internationalizing the ECMAScript standard

Q2. Do you have a glossary of commonly used terms and acronyms?

A2. Here is high-level glossary of terms. For more detailed Unicode definitions, consult the Unicode Standard 3.0 and ISO 10646-1:2000.

Glyph A glyph is a particular image which represents a character or part of a character.

Coded Character Set A mapping from a set of abstract characters to the set of non-negative integers. This range of integers need not be contiguous.

Locale A specific language, geographic location and character set (and sometimes script). An example is 'fr_FR.ISO-8859-1', although locale strings are seldom standard on across platforms at present. Often only language_country is specified, or even just language. In a client-server environment (like the web), 3 locales are usually considered - server, data, and client locales.

Internationalization (i18N) The technical aspects (character sets, date formats, sorting, number formatting, string resources) of supporting multiple locales in 1 product.

Localization (L10N) The practical aspects (language, custom, fashion, color, etc. ) of expressing an application in a particular locale. Roughly, i18N is considered an engineering process while L10N is considered a translation process.

Globalization (g11n) The cultural aspects of supporting multiple locales in a non-offensive and universally intuitive manner. Think Olympic or airport signage.

Canonicalization (c14n) The process of standardizing text according to well-defined rules.

Japanization The conversion of a product into the Japanese language and character set. There are 4 character sets used in Japanese computer text: hiragana, katakana, kanji and romaji (English alphabet.) Sometimes kanji characters have ruby aka furigana ("attached character") annotation above them to aid in irregular or difficult readings of personal or geographics names and for school children. Mojibake means "scrambled character" and is used to describe the unreadable appearance of electronic displays when the wrong character decoding is used. Because kanji may have multiple readings (meanings) depending on context, machine conversion to hiragana is unreliable. The kinsoku rule says that Japanese sentences are separated by periods that may not wrap to the beginning of a line.

CJKV Chinese, Japanese, Korean and Vietnamese are often considered together because they all use multi-byte encodings.

Unicode Unified Code for characters. Current version is 3.0. Most modern character sets have been incorporated already, and also many ancient ones (I have been informed that Indonesian Jawi is represented by Arabic and Extended Arabic codepoints. I need to doublecheck that since there is also a Indic script used in Java.) Unicode is a complex character set that is unlike ASCII in many ways, some of them being: A glyph may be composed from multiple codepoints in more than 1 ordering; national character sets may not consist of contiguous codepoints; symbols such as bullets, smiley faces and braille are included; binary sorting of Unicode character sequences is likely to be meaningless unless the sequences are normalized first.

UTF-32 32-bit Unicode 3.0 Transformation Format

UTF-16 16-bit Unicode 3.0 Transformation Format is a 16-bit encoding, with 16-bit surrogates for private characters and future use (Chinese characters, ancient languages, special symbols, etc.)

UTF-8 8-bit Unicode 3.0 Transformation Format is a variable width encoding form updated from UTF-2, often used with older C libraries and to save space with European text. UTF-8 may be 1 to 6 octets long, although usually 4 octets without surrogates. It is defined in RFC 2279

UTF-2 8-bit Unicode 1.1 Transformation Format is a variable width encoding form that was superseded by UTF-8. UTF-2 was used in Oracle 7.3.4

UTF-7 7-bit Unicode 2.0 Transformation Format is a variable width encoding form (used with older email that was not 8-bit clean and not MIME.) See RFC 2152

Unicode compliant Character encoding implementation that conforms to a particular version of the Unicode spec, for certain features. May only implement a subset if so documented. (For example, a Unicode compliant app might only support certain languages (typically Western European), or even allow only US-ASCII!)

Code value (codepoint) Unicode value for a character that is all or part of a glyph. The same codepoint may represent multiple glyphs, especially in Han unification (Chinese, Japanese, Korean.) Accents alone may have their own codepoint.

Pre-composed character A Unicode character consisting of one code value. Some accented characters, notably Western European, have their own codepoints.

Base character A Unicode character that does not graphically combine with preceding characters, and is not a control or format character.

Combining character A Unicode character that graphically combines with a preceding base character. Typically accents and diacritical marks.

Composed Character A Unicode character made of combined codepoints, usually non-spacing mark (accent) characters. Often the same accented glyph may consist of codepoints in different orderings, for example a character with accents above and below the character (like Thai.)

Combining character A character that normally appears after a base character, and is an accent or other diacritical mark that is added to the previous base character.

Compatibility character A character included in the Unicode standard that has been included for compatibility with a legacy encoding. Usually it looks similar enough to another non-compatibility character to be replaced with it when appropriate. An example is the set of Japanese half-width hiragana code values, which were included for round-trip compatibility with other character set encodings for use in smaller character cells, even though a Unicode application could achieve the same appearance with application-defined font rendering.

Normalization There are four functions performed on Unicode character sequences so that two sequences may be compared in a meaningful way. Normalization is necessary because decomposed characters may have accents in different orders before normalization, but be the same glyph. Normalization is especially important to perform when computer language identifiers, filenames, mail folder names, digital signatures and emitting XML or JavaScript are involved.

Collation Order Table and/or algorithm for sorting strings specific to a locale and usage (dictionary, phonebook, etc.) Unicode has not specified collation order at this time, but should in 3.0.

UCS ISO/IEC 10646 Universal Multiple-Octet Coded Character Set. Both UCS and Uncode standards now share identical code values. The major difference between UCS and Unicode is that UCS is mostly concerned with defining code values, while Unicode adds semantics to the code values.

UCS-2 16-bit Universal Character Set (no surrogate pairs)

UCS-4 31-bit Universal Character Set.

Character Property Unicode code values have default properties such as case, numeric value, directionality and mirrored as defined in the Unicode Character Database.

Combining Class A numeric value given to each combining Unicode character that determines with which other combining characters it typographically interacts.

Byte Order Mark (BOM) Unicode code value U+FEFF may optionally be prepended in serialized forms (files, streams) of Unicode characters. By default, files are assumed to be in network byte ordering (big-endian). BOM is discussed at greater length in the document.

Official Unicode FAQ
Unicode Technical Report # 17 - Character Encoding Model
W3C Character Model for the Web
Forms of Unicode, Mark Davis
A Unicode HOWTO with definitions
ISO 639-2/T: Language Codes for terminological use
RFC 1766: Tags for the Identification of Languages
Country Codes: ISO 3166, Microsoft, and Macintosh
ISO 639-1 and ISO 639-2: International Standards for Language Codes. ISO 15924: International Standard for names of scripts

Q3: What locale support does Perl have?

A3: Locale has been well-supported in Perl for OEM character sets since Perl 5.004, using the underlying C libraries as the foundation.

Locale is not well-supported for Unicode yet. Locale is still important in a Unicode world, contrary to common misunderstanding, for:

character collation (comparison and sorting)
currency and calendar display formatting
message catalog item selection
language-dependent glyph disambiguation. CJK Asian languages may share a single codepoint for multiple glyphs because of Han Unification.

There are many tedious details that both the operating system and the programmer have to cooperate on to make locale work.

perldoc perllocale is an excellent reference. It is important to read this document because it is not intuitive which operators are locale-sensitive.

The Perl Cookbook Sections 6.2 and 6.12 also discuss Perl regular expressions and locale.

A simple programming example from the pod is:

    require 5.004;

    use POSIX 'locale_h';
    use locale;

    $old_locale = setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
    # locale-specific code ...
    setlocale(LC_CTYPE, $old_locale);

Example by jhi: 
using locales in Perl is a two (well, three) step process:

        (1) use POSIX 'locale_h';
        (2) setlocale(LC_..., ...);
        (3) use locale;

The first one makes the LC_... constants visible.
The second one does the libc call.
The third one allows LC_CTYPE to modify your \w.

The following works for me in Solaris:

#!/usr/bin/perl -lw

use POSIX 'locale_h';
setlocale(LC_CTYPE, "fr_CA") or warn "uh oh... $!";
use locale;
print setlocale(LC_CTYPE);  # prints 'fr_CA'
my $test = "test" . chr(200);
print $test;
$test =~ s/(\w+)/[$1]/;
print $test;

Below is a fun test program. Watch how en_US changes character set depending on the previous locale.

use strict;
use diagnostics;

use locale;
use POSIX qw (locale_h);

my @lang = ('default','en_US', 'es_ES', 'fr_CA', 'C', 'en_us', 'POSIX');

foreach my $lang (@lang) {
   if ($lang eq 'default') {
      $lang = setlocale(LC_CTYPE);
   }
   else {
      setlocale(LC_CTYPE, $lang)
   }
   print "$lang:\n";
   print +(sort grep /\w/, map { chr() } 0..255), "\n";
   print "\n";
}

C:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

en_US:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

es_ES:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

fr_CA:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

C:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

en_us:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

POSIX:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

Q4. What support does Perl have for Unicode?

A4. Here there be dragons! As of now (Feb, 2002), my opinion is that there are 2 paths you can take until 5.8.0 is released this year:

with Perl 5.005_03 and CPAN Unicode modules, you can accomplish a lot by treating Unicode strings as regular binary strings within Perl
with Perl 5.61, you will be able to use internal Unicode support to get simple things to work in UTF-8 with Perl with some effort, and complicated things to work with much greater effort.
You can see 5.6 patch summaries here: ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/changes-latest.gz
Search for utf to see what was broken in the official release of 5.6.0

Chapter 15 of Programming Perl, Third Edition, describes Perl's Unicode support. 1000 copies were released in time for the O'Reilly Perl Conference in July, 2000 (I have one). Overall it's a good collection of facts regarding Perl and Unicode, although it could be much improved with locale information. My opinion is that locale mgmt. is everything in i18N.

The UTF-8 character set has been experimentally supported internally since Perl 5.005_50 (the development releases) when requested with use utf8;

This means that in 5.005_50 or later:

strings are stored as UTF-8 (and flagged as UTF-8)
length() returns character length, not byte length

Unicode is stored as UTF-8 in Perl for 3 reasons:

UTF-8 is compatible with ASCII
UTF-8 is more compact for Western European languages than UTF-16, because non-accented characters only require 1 byte in UTF-8.
UTF-8 saves conversion since data is often exchanged in this encoding.

Since Perl SVs (Scalar Values) store a length for every string, UTF-16 or UCS-4 could also be used to represent strings without affecting user code.

There was a Perl and Unicode BOF at the Perl Conference on Wed, July 19. Lots of porters were there, including GS, AK, JH and NI. GS repeated the outstanding issues (printf, normalization, io disciplines, etc.), James repeated his issues (normalization, locale mgmt.), and GS scrolled through a Unicode::Normal module. Nobody made any additional commitments, so the mood was kind of somber.

Some of the implementation issues GS et al. were working on in Nov 1999:

I/O channels and disciplines, and syntax for marking handles. Input from files or pipes would be mapped on the fly from the source encoding to Perl's internal encoding, either UTF or OEM
enhance open(), readpipe() to take additional arguments to specify flags
$^B and binmode() with flags for marking implicitly opened handles and for marking handles after opening them
dynamically scoped $^U hint, and lexically scoped "use unicode;" pragma for requesting widechar system APIs where available (similar to how $^W and "use warnings;" behave for warnings)
incorporate OEM => Unicode and Unicode => OEM translation functions for I/O channel translations (Windows provides these in its base library)
"use byte;" pragma to suppress bigger logical character-size (Done in 5.6)
warnings on utf8/byte mismatches to make it easier for OEM universes to migrate to Unicode
implement xsubpp WIDECHAR option for XSUBs to announce wide API awareness
automatically allow utf8 in identifiers
support for v5.5.64 style version numbers using a packed utf8 string (Done in 5.6)
rework utf8 docs to reflect changes
UTF-8 will be the internal encoding, not configurable

Perl 5.6 was the first stable release with some core support for UTF-8. It was available March 23, 2000. However there is still much work to be done. Read perldoc perlunicode and perldoc utf8 to see the capabilities and limitations.

From the ToDo-5.6 file:

Unicode support
    finish byte <-> utf8 and localencoding <-> utf8 conversions
    make substr($bytestr,0,0,$charstr) do the right conversion
    add Unicode::Map equivivalent to core
    add support for I/O disciplines
        - a way to specify disciplines when opening things:
            open(F, "<:crlf :utf16", $file)
        - a way to specify disciplines for an already opened handle:
            binmode(STDIN, ":slurp :raw")
        - a way to set default disciplines for all handle constructors:
            use open IN => ":any", OUT => ":utf8", SYS => ":utf16"
    eliminate need for "use utf8;"
    autoload byte.pm when byte:: is seen by the parser
    check uv_to_utf8() calls for buffer overflow
    (see also "Locales", "Regexen", and "Miscellaneous")

Locales
    deprecate traditional/legacy locales?
    How do locales work across packages?
    figure out how to support Unicode locales
        suggestion: integrate the IBM Classes for Unicode (ICU)
        http://oss.software.ibm.com/developerworks/opensource/icu/project/
        and check out also the Locale Converter:
        http://alphaworks.ibm.com/tech/localeconverter
    ICU is "portable, open-source Unicode library with:
    charset-independent locales (with multiple locales simultaneously
    supported in same thread; character conversions; formatting/parsing
    for numbers, currencies, date/time and messages; message catalogs
    (resources) ; transliteration, collation, normalization, and text
    boundaries (grapheme, word, line-break))".
    There is also 'iconv', either from XPG4 or GNU (glibc).
    iconv is about character set conversions.
    Either ICU or iconv would be valuable to get integrated
    into Perl, Configure already probes for libiconv and . 

Regexen
   a way to do full character set arithmetics: now one can do
        addition, negate a whole class, and negate certain subclasses
        (e.g. \D, [:^digit:]), but a more generic way to add/subtract/
        intersect characters/classes, like described in the Unicode technical
        report oRegular Expression Guidelines,
        http://www.unicode.org/unicode/reports/tr18/
        (amusingly, the TR notes that difference and intersection
         can be done using "Perl-style look-ahead")
        difference syntax?  maybe [[:alpha:][^abc]] meaning
        "all alphabetic expect a, b, and c"? or [[:alpha:]-[abc]]?
        (maybe bad, as we explicitly disallow such 'ranges')
        intersection syntax? maybe [[..]&[...]]?
   POSIX [=bar=] and [.zap.] would nice too but there's no API for them
        =bar= could be done with Unicode, though, see the Unicode TR #15 about
        normalization forms:
        http://www.unicode.org/unicode/reports/tr15/
        this is also a part of the Unicode 3.0:
        http://www.unicode.org/unicode/uni2book/u2.html
        executive summary: there are several different levels of 'equivalence'
   approximate matching

Miscellaneous
    Unicode collation? http://www.unicode.org/unicode/reports/tr10/

Q5. How do operating systems implement Unicode and i18N?

A5. Operating Systems

Linux 2.2

The kernel remains mostly agnostic of what character encoding is used in files and file names, as long as it is ASCII compatible. File content, pipe content, file names, environment variables, source code, etc. all can be in UTF-8. Linux (like Unix) does not provide any per-file or per-syscall tagging of character sets and instead the preferred system character set can be specified per process using LC_CTYPE. Users should aim at using only a single character set throughout their applications. This is today mostly the respective regional ISO 8859 variant and will in the future become UTF-8. Work is being done on making most applications usable with UTF-8 and there is hope that Linux will be able to switch over completely from ASCII and ISO 8859 to UTF-8 in only a few years. Full UTF-8 locale support will be available starting with glibc 2.2. UTF-8 support for xterm will be available with XFree86 4.0. There are no plans in the Linux/POSIX world to duplicate the entire API for 16-bit Unicode as it was done for Win32. UTF-8 will simply replace ASCII at most levels eventually in any inter-process communication. UCS-4 in the form of wchar_t might be used internally by a few applications for which UTF-8 is inconvenient to process.

locale -s on Red Hat 6.0 indicates there is locale support for 'ja_JP.EUC' but not S-JIS.

There is a nifty Gnome Panel applet called Character Picker that allows one to select accented characters and paste them into an app. (Hint: set focus to the applet and key the character you want help with. Also includes some greek and trademark symbols.) strace is your friend for debugging system calls on linux.

Li18NUX - Linux Internationalization Initiative
Bruno Haible's Linux Unicode HOWTO
Markus Kuhn's excellent UTF-8 and Unicode FAQ for Unix/Linux Universal Locales for Linux

KDE is based on Qt, a cross-platform C++ GUI application framework. The Qt site has i18N and Unicode descriptions.

BSD

The flavors of BSD are at a very early stage of i18N. Because the gettext library is GPL, BSD won't use it. Their message catalog, catopen, has not been widely used. There is not yet a i18N system installer. An effort is being made to develop a Unicode file system, but otherwise there is only native character encoding support. Most of the i18N developers are Asians adding support for their locale only. xim has implemented an X IME.

Sun Solaris 2.7

Solaris supports variable length Extended Unix Code (EUC) and fixed length 4-byte wide characters (wchar_t).

Solaris has good support for i18N as documented in the following manuals. Besides the usual C library support, Solaris supports string resource localization with the gettext() function. truss is your friend in debugging system calls on Solaris.

Solaris has locale support for UTF-8, offering several locales for Western European languages, Japanese and Korean. A typical locale string is "en_US.UTF-8".

References

SunSoft Solaris Porting Guide
SunSoft Developer's Guide to Internationalization
SunSoft Solaris International Developer's Guide
Sun i18N Guidelines for C and C++

HP/UX

See Solaris 2.7.

Windows NT 4.0

UTF-16 internally, with both Unicode (wide) API and simulated ASCII calls. Microsoft Surrogates Paper

Windows 95/98

Uses Windows native formats and codepages, not Unicode.

Windows CE

Unicode with only wide calls.

Mac OS

MacOS 9 uses maximally decomposed UTF-16 filenames, and stores the file type ('text' or 'utxt') in the file system table.

MacPerl FAQ describes differences between Mac and Unix character sets

Apple Developer Documentation

Mac Mach Unix

Unicode internally.

Q6. I'm a Perl Porter. What should I know about i18N and C?

A6. Lots of things to watch.

For 8-bit non-UTF-8 encodings:

use ctype character routines, not direct character comparison (islower(), isalpha(), etc.)
use constants from limits.h
use unsigned char's and avoid sign extension problems

For UTF-8:

routines like isALPHA_utf8(p) may be used

Here is a typical character processing loop:

while (s < send && isALNUM_utf8(s))
      s += UTF8SKIP(s);

Sun i18N Guidelines for C and C++

Q7. I'm a Perl Porter. What should I know about Perl and Unicode?

A7. Lots of things to watch.

Normalization is required before comparing identifiers, strings, digital signatures, filenames, etc. whenever there is a chance that character sequences may not be pre-normalized. My opinion is that since Perl is the language of choice for text processing, Perl should have complete (late) normalization support.
Locale must still be observed with internal Unicode character sequences for the same reasons locale is required for OEM character support: collation, font selection and message catalog item selection.
Byte Order Mark (BOM) is probably the most misunderstood feature of Unicode, because its usage is not well-defined at this point. K. Kent had this to say on the Unicode mailing list: In ISO 10646 view, there is no need for any "BOM", or "signature" as it is called in an informative annex to 10646, at all. UCS-2, UCS-4, and UTF-16, *when* serialised into bytes, all *must* be serialised in big-endian order. That would be the end of story if it weren't for that wretched annex (and Unicode...). The annex *allows* for the use "signatures" for all of the encoding forms (UCS-4, UCS-2, UTF-16, *and* UTF-8). It also says that the "signature" ("BOM" in Unicode terminology) *can* be used to correct an erroneous/ non-conforming byte order (but you are in no way required to; terminating with an error message is quite ok when detecting a non-big-endian order). The "signature" is not at all intended to be about byte order in 10646, it is only to give a strong hint about which encoding form is used (all big-endian). And then there is Unicode that allows for both big- and little-endian byte serialisation, as well as some applications that put a "signature", in full conformity with 10646, also on UTF-8 encoded files. Windows 2000 Notepad inserts one.
An attempt is being made by the Unicode committee to disambiguate U+FEFF, which now represents BOM and ZWNBSP, into just BOM. ZWNBSP will be U+2060.
XML has its own requirements for when a BOM is required or not, see elsewhere in this document for more details.
Unrecognized Unicode characters are to be passed through as is.
Unicode Technical Report #13 for details on New Line Processing
Consider reusing code from IBM Classes for Unicode (aka International Components for Unicode aka ICU), iconv or XML::Parser

GS has suggested that a good way for less experienced porters to contribute code to the Unicode porting effort is to overload built-in Perl operators and call an XS module to do the new functions in C. Then it is a simple matter to have a more experienced porter patch the Perl core with your new features.

Perl 5.7 perlunicode POD

Q8. I'm a CPAN module author. What should I know about Perl and Unicode?

A8. The Dec 99 plan is to hide the internal encoding from Perl programmers. The encoding can be invisible to user programmers because the string routines operate on character units without regard to the actual bits.

Obviously this is not true if the programmer does a use byte; and tries to do his own character conversion unless the original encoding is known beforehand.

i18N developers will always need some kind of access to the raw bytes in strings when troubleshooting character conversion problems (mojibake).

Here's an interesting email:

On Tue, Sep 12, 2000 at 12:24:50AM +0200, Gisle Aas wrote:
> Jarkko Hietaniemi  writes:
>   
> > Please take a look at the (very rough) first draft of Encode, an extension
> > for character encoding conversions for Perl 5:
> >
> >     http://www.iki.fi/jhi/Encode.tgz
> > 
> > Download, plop it into the Perl 5.7 source directory, unpack,
> > re-Configure, rebuild.  (Or, if you have a Perl 5.7 in your path, 
> > cd to ext/Encode, perl Makefile.PL, make).

UTF8::Hack
Test suites are valuable for checking your code.

A stress test file
Q9. Do regular expressions work with locales?

A9. Yes, see perldoc perlre.

In short, \w and \s (along with their converses \W and \S respectively) are locale-dependent. This is less useful than it appears because \w represents Perl identifier word characters [a-zA-Z0-9_] (in en locale), rather than cultural words.

Q10. Do regular expressions work with Unicode?

A10. There is some support, see perldoc perlre.

New in 5.6: \p{IsSpace} matches any Unicode character that possesses the IsSpace property. \P{IsSpace} would match not IsSpace. Most of the Unicode property tables are bundled with Perl5.6.0, with the exception of UniHan and NormalizerTest. The files are renamed to fit the 8.3 filename naming convention if necessary for portability reasons.

Temporarily, tr// has UC (utf-8 to char) and CU (char to utf-8) options.
This feature will likely be eliminated in favor of many to many mapping functions. Unicode can be used as a pivot for converting any charset to another, although not all characters have matches in another charset.

Here's some one-liners for Latin-1 to UTF-8 and vice versa:

    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;  
    s/([\xC0-\xDF])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

Q11. What are these CPAN Unicode modules for?

A11. The modules listed below will likely be obsoleted by internal C routines in Perl 6.0, although are still very useful for older versions of Perl.

You will need a C compiler to build most of them. Somebody has to post some ppm packages or most Windows people are out of luck until they install a compiler.

Jcode

Jcode is used to convert between EUC/S-JIS/Unicode using the Unicode EASTASIA character mapping tables. Also supported are tr and other operations. Both OO and procedural programming models are supported, along with compatibility with older jcode.pl.

use Jcode;

# the new method does charset autodetection on the supplied argument
my $euc = Jcode->new("string from a set of supported charsets");

my $sjis = $euc->sjis;
my $ucs2 = $euc->ucs2;
my $utf8 = $euc->utf8;

Dan Kogai wrote Jcode.

Unicode::Map

Map is used to map characters to and from UCS-2. The actual mappings available are defined in the REGISTRY file. Various code pages and tables are stored as files.

Here's a demo from the POD documentation. Pipe through od -c to see the null bytes.

use Unicode::Map;

    $Map = new Unicode::Map({ ID => "ISO-8859-1" });

    $_16bit = $Map -> to_unicode ("Hello world!");
    #  => $_16bit == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"
    print $_16bit,"\n";

    $_8bit = $Map -> from_unicode ($_16bit);
    #  => $_8bit == "Hello world!"
    print $_8bit,"\n";

    $Map = new Unicode::Map;

    $_16bit = $Map -> to_unicode ("ISO-8859-1", "Hello world!");
    #  => $_16bit == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"
    print $_16bit,"\n";

    $_8bit = $Map -> from_unicode ("ISO-8859-7", $_16bit);
    #  => $_8bit == "Hello world!"
    print $_8bit,"\n";

Martin Schwartz wrote Unicode::Map.
James will likely be the maintainer when new changes are proposed. One is adding a map file for EUC-JP.

Unicode::Map8

Map8 is used to map 8-bit character encodings to UCS2 (16-bit) and back. The programmer may build the translation table on the fly. Does not handle Unicode surrogate pairs as a single character.

    require Unicode::Map8;

    my $no_map = Unicode::Map8->new("ISO646-NO") || die;
    my $l1_map = Unicode::Map8->new("latin1")    || die;

    my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
    my $lstr = $l1_map->to8($ustr);
    print $lstr;

    print $no_map->tou("V}re norske tegn b|r {res\n")->utf8;

Gisle Aas wrote Unicode::Map8.

Unicode::String

This module does various mappings, including various Unicode Transformation Formats.

    use Unicode::String qw(utf8 latin1 utf16);

    $u = utf8("The Unicode Standard is a uniform ");
    $u .= utf8("encoding scheme for written characters and text");

    # convert to various external formats
    print "UCS-4: ",  $u->ucs4,   "\n"; # 4 byte characters
    print "UTF-16: ", $u->utf16,  "\n"; # 2 byte characters + surrogates
    print "UTF-8: ",  $u->utf8,   "\n"; # 1-4 byte characters
    print "UTF-7: ",  $u->utf7,   "\n"; # 7-bit clean format
    print "Latin1: ", $u->latin1, "\n"; # lossy
    print "Hex: ",    $u->hex,    "\n"; # a hexadecimal string

Gisle Aas wrote Unicode::String.

i18N::Collate

This module compares 8-bit scalar data according to the current locale. i18N::Collate has been has been deprecated since 5.003_06.

    use i18N::Collate;

    setlocale(LC_COLLATE, 'en_US');
    $s1 = new i18N::Collate "scalar_data_1";
    $s2 = new i18N::Collate "scalar_data_2";

    if ($s1 lt $s2) {
       print $$s1, " before ", $$s2;
    }

Here's a simple program that converts from native encodings to various UTFs using the above CPAN modules:

use CGI qw / :standard /;

use Unicode::Map;
use Unicode::String qw(utf8 latin1 utf16);

   # initialize Unicode::String
   Unicode::String->stringify_as("utf16");

   my $encoding = param('encoding') || ''; # like Shift-JIS - see Map's REGISTRY
   my $in = param('in') || ''; # input text in native encoding

   my $Map = new Unicode::Map({ ID => $encoding });

   my $map_out = $Map -> to_unicode($in);
   my $us = Unicode::String -> new($map_out);
   my $us_utf8 = $us -> utf8;
   my $us_utf16 = $us -> utf16;
   my $us_utf8_esc = CGI::escape($us_utf8);

   print header(),
         start_html(),
         "$encoding = $in, UTF-16 = $us, UTF-8 = $us_utf8, URL-escaped UTF-8 = $us_utf8_esc\n",
         end_html();

Newer Modules

PP emailed me about:

lib/Locale/Constants.pm         Locale::Codes
lib/Locale/Country.pm           Locale::Codes
lib/Locale/Currency.pm          Locale::Codes
lib/Locale/Language.pm          Locale::Codes
lib/Locale/Maketext.pm          Locale::Maketext

there is also a pod version of the TPJ article that discussed the
reasoning behind the Locale::Maketext (which tends to embed more code into
a lexicon in an attempt to deal with nuances of things like ordinal vs.
cardinal numbering between languages).  The article, by Sean M. Burke and
Jordan Lachler, is in:

lib/Locale/Maketext/TPJ13.pod   Locale::Maketext documentation article

So perhaps Locale::Maketext is worth mentioning in your document(?).

Another CPAN module possibly worth mentioning is Locale::Msgcat by
Christophe Wolfhugel.  Although it suffers from a couple of problems: the
h2xs output was not packaged all the well by Christophe (e.g. the
DESRIPTION mentions: "Perl extension for blah blah blah"), it is based on
numerically indexed MSG catalogs (and as you already point out numeric
indexing lexicons is difficult to program with and/or update and
maintain).  Nonetheless I think that XPG4 inspired msg catalogs, via
catopen(), catgets(), and catclose() will be around for quite a while
since there are built into so many C and C++ implementations (indeed are
required by XPG4, which is why there is a msgcat utility with gcc).
At any rate I think that it would be useful to mention Locale::Msgcat so
as to get it more widely ported and/or updated.

The Encode module that Jarkko and Nick Ing-Simmons wrote is now in
bleedperl and is already set up to deal with approximately 106 
coded character set encodings:

% grep 'Encoding tables' MANIFEST | wc -l
       106

But that result is actually an overcount since certain tables will be
counted twice in their *.enc and *.ucm forms, e.g.:

ext/Encode/Encode/koi8-r.enc    Encoding tables
ext/Encode/Encode/koi8-r.ucm    Encoding tables

SADAHIRO Tomoyuki

 Now Sort::UCA is available
   from http://homepage1.nifty.com/nomenclator/perl/indexE.htm

 NAME
 Sort::UCA - use UCA (Unicode Collation Algorithm)

 SYNOPSIS
   use Sort::UCA;
   #construct
   $uca = Sort::UCA->new(%tailoring);
   #sort
   @sorted = $uca->sort(@not_sorted);
   #compare
   $result = $uca->cmp($a, $b); # returns 1, 0, or -1.

 SEE ALSO
 http://www.unicode.org/unicode/reports/tr10/

Q11b. What about i18N POD?

PP says:

There have been recent queries on p5p regarding translations of fixed pod
sets (e.g. those that came with perl 5.6.1) to other languages.  I think
it is widely supported as a "good thing" but it may be unlikely that the
translated pods would be included in the perl tar ball in future versions
- so as to keep tar ball size down and to avoid maintenance problems with
languages that the perl tar ball maintainers are not qualified to keep up
to date.

That said I can offer this bit of advice: for widest distribution try
to restrict your pod to to the 7 bit ASCII char set.  I found out that
with the 8 bit ISO-8859-1 chars in perlebcdic.pod that there are some
*roff implementations that do not grok the Latin-1 Char set well (nroff on
locale C Solaris 2.7 being one, the GNU version of nroff on OS/390 being
another).  Note that the pod spec allows for html inspired E names
for the printable Latin-1 characters but little else.

Having said that I think that the question actually pertains not to wide
distribution of pod, rather to narrow distribution of pod.  E.g. using
scandanavian pod with an appropriate char set.  Given my experience with
the *roff implementations I guess I would recommend testing things out
with whatever pod2* tools you intend to use.  A limited example of this
would be writing pod for translation only to html: you might make more
liberal use of the L<> construct than you would if the pod needed to go
both to pod2man and pod2html.  Likewise if I knew that the Linux
implementation of pod2finnish could easily grok ISO- then I might
not care at all if the nroff on Solaris 2.7 does not handle the
ISO- char set all that well.

I think that maintaining a list of such cross-system incompatabilities
would be as daunting a job as, say, specifying which characters in
Mac-Roman do not map well to the latest rendition of Windows codepage
1252; which is to say a combinitorial explosion that would be difficult to
verify for accuracy.




Q12. What is JPerl?

A12. JPerl (Japanized Perl) may also be used for Japanese localization.

JPerl for MS-Windows
JPerl patch for Unix

Lunde's CJKV also has references to JPerl on pp. 412 and 444-446.

If I remember correctly, kipp would create a patch that would Japanize Perl
after each Perl release. Finally, JH rolled most of his changes into
core Perl.

There is also a Macintosh Version of JPerl


/>/usr/local/bin/perl -v

This is perl, version 5.004_04 built for i386-freebsd

Copyright 1987-1997, Larry Wall

Japanization patch 4 by Yasushi Saito, 1996

Modified by Hirofumi Watanabe, 1996-1998
jperl5.004_04-980303
LATIN version

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5.0 source kit.



Q13. Can I just do nothing and let my program be agnostic?

A13. Some legacy applications work without knowledge of which or how many native encodings are being used as long as the round-trip from client to server and back does not mangle the data. Generally this only works when no date or time display or casing or sorting are required and the client encoding can be assumed.

This would work better for European character sets which are usually single-byte in a single encoding than Japanese for example. Japanese has a multitude of multi-byte encodings, the most common being ISO-2022-JP, S-JIS, EUC-JP and many more.

Web applications adopt the common practice of storing data in Unicode and sending data to the browser in native encodings.  When multiple character encodings are required on a single web page, (like Japanese combined with non-English text), UTF-8 is used.


Q14. Why and where should I use Unicode instead of native encodings?

A14. The simple answer is that Unicode should be used internally for processing and databases now, and in the future sent to the client program.

The advantage of internal Unicode is the ability to simultaneously store multiple languages in a common format without creating separate databases for Korean, Japanese, Chinese, and English, for example.

The current advantage of sending and receiving native character sets to the user is legacy support for browsers or other older clients and operating systems.

Although Netscape and IE 4.0+ have basic Unicode support, use after testing.  (The MyNetscape feature of netscape.com uses UTF-8 when displaying languages that would normally require multiple character sets on the same web page, like Japanese and non-English languages.)

Browsers generally seem to have the most difficulty displaying UTF-8 in forms, alt text and JavaScript strings (older versions of Netscape browsers would not parse non-ASCII text correctly just before a double-quote).

Andreas Koenig's Perl Conference 3.0 HTML slides are in UTF-8


Q15. What is Unicode normalization and why is it important?

A15. Normalization is a transformation applied to a Unicode character sequence.  The transformation is performed according to the Unicode Standard and involves table lookups to do substitution and reordering of characters as defined in the Unicode Character Database.
Normalization is necessary because Unicode allows multiple representations of the same string with combining characters and precomposed characters. Normalized character sequences allow accurate binary comparison of strings. This is especially important for purposes such as digital signatures, variable identifiers, mailbox folder names and directory searches.

Early normalization is performing normalization as close as possible to the text source, before transmitting the text to another application. This must be done for XML, HTML, URIs, and other W3C standard systems.

Late normalization is performing normalization after receiving text from another process or network transfer.

Canonical equivalence of 2 character sequences is determined by applying operations defined in the Unicode standard, without changing the visual appearance of resulting text.

Compatibility equivalence of 2 character sequences is determined by applying operations defined in the Unicode standard, and may change the visual appearance of text (for example, input text with both half and full-width katakana would result in full-width only output).

There are four normal forms (D, C, KD and KC), documented in Unicode Standard Annex #15.

Briefly:

Form D - Canonical Decomposition
Form C - Canonical Decomposition followed by Canonical Composition (required in W3C, c14n)
Form KD - Compatibility Decomposition
Form KC - Compatibility Decompostion followed by Canonical Composition


Martin Dürst of the World Wide Web Consortium (W3) has written reference code in Perl for normalization, called Charlint - A Character Normalization Tool


Q16. How do I do auto-detection of Unicode streams?

A16. There is no mandatory signature, making this hard to do reliably. The optional Byte Order Mark (BOM) U+FFFE may help if present. UTF-7 and UTF-8 strings have properties that would make them identifiable most of the time unless there is substantial leading US-ASCII.

Without a signature you would need a moderate amount of text to do a reliable detection.  An example of an input source that is probably not long enough would be a search widget on a web page. Statistical analysis of octets is another unreliable method.

Frank Tang's site has some UTF-8 auto detection scripts

See Ken Lunde's CJKV book, Appendix W: Perl Code Examples, for numerous character encoding, detection, and regular expressions in perl.


Q17. Is Unicode big endian or little endian?

A17. Unicode characters are not considered to be endian when stored in memory.

However, UTF-16 and UTF-32 must be serialized when saved in a file, emitted as a stream, or transferred over the network. Network byte ordering (most significant byte first or big-endian) is preferred, but you would not likely see it on Intel hardware. Mozilla 5 supports both UTF-16-LE and UTF-16-BE, etc!


Q18. Is there an EBCDIC-safe transformation of Unicode?

A18. Yes. UTF-EBCDIC stands for EBCDIC-friendly Unicode (or UCS) Transformation Format.
Unicode Technical Report #16

PP says:
As of Perl 5.6.1 (and also in bleedperl developers versions 5.7.1++) the use utf8;
pragma on an EBCDIC machine will have the same effect with respect to
UTF-EBCDIC that use utf8; has with respect to UTF-8 on an ASCII based
machine.  Hence there is about as much support for UTF-EBCDIC in current
Perls as there is for UTF-8 on ASCII-based machines.


Where to Use UTF-EBCDIC?

UTF-EBCDIC is intended to be used inside EBCDIC systems or in closed networks where there is a dependency on EBCDIC hard-coding assumptions. It is not meant to be used for open interchange among heterogeneous platforms using different data encodings. Due to specific requirements for ASCII encoding for line endings in some Internet protocols, UTF-EBCDIC is unsuitable for use over the Internet using such protocols. UTF-8 or UTF-16 forms should be used in open interchange.

Peter Prymmer's Perl EBCDIC document


Q19. Are there security implications in i18N?

A19. Yes. Since locale affects Perl's understanding of strings, operations like character comparison may produce unexpected results when the locale is manipulated.  

The UTF-8 Corrigendum better defines UTF-8 by forbidding generation and interpretation of non-shortest forms, which will help disambiguate characters.

Perl taints strings affected by locale.

An example is using a regex with \w to check user input. \w may accept character codes for punctuation characters if the locale "says" they are alpha codes.

From the Sun man attributes page:

For applications to get full support of internationalization services,  dynamic  binding  has  to be applied.  Statically bound programs  will  only  get  support  for  C  and  POSIX locales.

CERT* Advisory CA-97.10 - Topic: Vulnerability in Natural Language Service


Some comments from Bruce Schneier


Q20. Are there performance issues in i18N?

A20. Yes. Locale-dependent collation is generally slower than binary comparisons.

Some of the reasons are:

While ASCII is single-byte, UTF-8 and UTF-16 are multi-byte encodings.
Often client data is accepted in native encodings and stored in Unicode, requiring conversion.
Unicode transformations require more storage space than native encodings generally, due to the tables required.


A contributor reports that when using regular expressions like /^([\w]*)([\W]*)/ in perl 5.6, performance has been measured to be similar to that in pre-5.6 on straight ASCII.


Q21. How do I localize strings in my program?

A21. The best way is to rip them out and save them in an external string resource file.  The program API should load strings by string_id (preferably a descriptive text literal rather than a cryptic numeric code like Windows resources) and lang_locale (lang_locale should be either default locale or explicit locale). An additional complication is that text may have parameters that are substituted in locale- dependent positioning (Perl does not have the printf reorder edit descriptors $1, etc. yet). Some programmers provide the default locale's string as an argument to the catalog API call. This may be ok where the program is small and the localizer is also a developer with access to the source code, but I do not recommend this for large projects or where strings are often updated. Leaving interface text in the source code is bad.

Another advantage of using string resource files is that translators can work on the files without touching your source code. This reduces cost, avoids inadvertant source mgmt. errors and limits write access to the repository.

There's 4 strategies for resource file formats:

ICU resource bundles

ICU locales are literals of the form lang_region_variant that inherit from a top-level locale known as root. Resource elements similarly inherit from root.
stored in Perl hash or array
in gettext-compatible external file
Phillip Vandry has written Locale::gettext (a perl wrapper around C routines), and James has written a CPAN module called Gettext [sic] that does a backquote call to gettext to do this.
in non-gettext-compatible external file or database. dbm and dbFile seem to be popular.

Mike Shoyher has written Locale::PGetText (which is pure perl and does not rely on the system locale, but does not follow the gettext API standard or file format and uses dbm). I have seen a program use dbFile.

Even more interesting would be using database tables for the string catalog.


Programming for Internationalization FAQ talks about gettext, by Michael K. Gschwind.


GNU gettext manual and 
GNU gettext download


SunDocs has information on gettext here.


Q22. I do database programming with Perl. Can I use Unicode?

A22. This is well-supported with Oracle 8 and Sybase. Usually your database must be installed with the correct character set. WS from GlobalSight recommends setting your data locale to whichever your office most commonly uses for reporting.

Oracle 8

Oracle has broad i18N support. It is important to read the correct documentation for your version of Oracle.

Likely for legacy reasons, Oracle servers must be configured with 2 character sets:

the default database character set must be single-byte or variable-length multi-byte and should be a superset of the national character set (explained below).  This character set affects almost everything - table names, most field names, PL/SQL symbols, etc. char(20) and varchar(20) always mean 20 bytes.
the national character set can be fixed-width or variable-with multi-byte. The only fields affected are NCHAR, NVARCHAR2, and NCLOB.  nchar(20) and nvarchar2(20) mean 20 *characters* if fixed-width, 20 *bytes* if variable-width.

UTF-8 is variable-width multi-byte and is a valid charset for both cases.

The server's national character set is set with NLS_LANG = AMERICAN_AMERICA.UTF8, for example.

Oracle National Language Support (NLS)

Oracle 8 National Language Support: An Oracle Technical White Paper 


Oracle iAS (Internet Application Server) is a very interesting product. It is bundled with Apache, and integrates Apache language content negotiation with Java local support for impressive locale mgmt. There is a demo called World-O-Books.

Oracle will likely add a UTF-16 encoding in 2001.

It has been reported that DBI does not work with UTF-8 stored in Oracle nvarchar2 fields.

Sybase

Configure with UTF-8. Sybase has an alpha version of UTF-16 with beta release in November, 2000 and full release in 2001. New column types unichar and unitext have been added, which may be optionally configured for surrogate use or not. They added UTF-16 to meet user requests for a wide Unicode datapath for Java and NT applications. 

Microsoft SQL Server

Michka Kaplan has this to say:


SQL Server supports the datatypes NTEXT, NCHAR, and NVARCHAR, all of which are of type UCS-2. When such a column indexed, then the index is Unicode. SQL Server 7.0 only supports one language collation at the server level.... this choice affects the actual ordering of all such indexes.


SQL Server 2000 supports a COLLATE keyword that allows you to specify a collation at the database or field level and thus choose a  different language for such columns/indexes if you like (I discuss practical details and implications of this feature in an upcoming article in the Visual Basic Programmer's Journal, tentatively scheduled for November). See also his book, Internationalization with Visual Basic.


In any case, you can certainly query and such field in either SQL 7.0 or in SQL 2000.

Informix

*** TBA ***

DB2

*** TBA ***

MySQL

MySQL supports many encodings, but not Unicode.  And these many encodings can only be used 1 per database. (So you could create a Korean-only or Japanese-only database.)

The MySQL manual lists Unicode as one of the "things that must done in the real near future."

By default, accented characters are collated to the same value. This can be overridden by/compensated by:


defining the column as binary
defining the column as non-unique
creating a new character table


A work-around would be to use the default ISO-8859-1 and have all your database routines convert to UTF-8 and translate to native encodings. You would lose the ability to sort automatically and would need to do your own conversion to native character encodings for presentation to the client.

Progress Software is considering combining ICU and MySQL. Progress has experience adding Unicode to their Progress database product.

Basis Technology has added Unicode support to MySQL using Rosette's Unicode library and may be involved with adding ICU support also.

PostgreSQL

There is an install option --enable-multibyte-=UNICODE for unicode support. The option --enable-unicode-conversion enables Unicode to legacy conversion tables. No further information so far on how well it works.


PostgreSQL Localization Documentation


Q23. I do database programming with Perl. What are the i18N issues?

A23. The usual issues must be dealy with, like collating (sorting) and conversion to native encodings for application input and output.

A new issue is ensuring your database columns are wide enough.  UTF-8 for example can triple the size of a Japanese column.

Also, if your fields were designed for English text, they will have to be up to 100% wider for other languages, like French.

So to be safe, I would quadruple originally English ASCII text columns for use with UTF-8 databases that will store CJKV characters.

While many people use HTML entity codes ("entification") to represent accented characters in static HTML, this aggravates collation so I recommend using actual character values in databases and content management systems instead of entity values. A post-processing filter can always be used if there is a need for entity codes. One such requirement would be for foreign language support in the older Mac browser versions. Also beware of user page creation tools that gratuitously emit smart quotes HTML output.


Q24. How do other programming languages implement Unicode and i18N?

A24. Programming Languages

Ken Lunde's CJKV, p. 410 also describes some features of C/C++, Java, Python and TCL.

Tony Graham's A Primer for Unicode also has a language survey chapter.

Unicode: APrimer  has a good comparison of programming languages

Java language

Java has what programmers generally acclaim as the best Unicode support. Since Java 1.1, i18N support is more or less based on ICU.

characters and strings - always Unicode UTF-16. Java 1.0 supports Unicode 1.1, and Java 1.1 supports Unicode 2.0.
character set conversion - Java 1.1 has methods to import/export native encodings or UTF-8
localized number formatting - yes
localized date and time - yes
localized calendar functions - yes 
collation - yes
localized string resources - yes, resource bundles are used to load string resources by locale and can even handle dynamic substitution of parameters.


Oddly, there is an interface module to Java called Java.pm, a Perl module for extracting information from Java class files. No functions to parse Java resource bundles though.

See Also

O'Reilly's Java In a Nutshell, Chapter 11

Sun's Java i18N Tutorial

IBM's Unicode site


C language

C provides the wchar_t and wchar_t * type for handling wide characters and strings. Since wchar_t varies in width by compiler, for portable programming you should define a macro of the appropriate width rather than use wchar_t directly.

ISO C Amendment 1
and ISO C 99 both added a rich set of new library
functions for handling wchar_t strings to the language. C can also
handle Unicode by using UTF-8 as the multi-byte encoding in the char *
type, such that upgrades from ASCII to UTF-8 can be implemented with
relatively minor changes in most existing software.

Regular C string functions like strcmp may not work with embedded null values.

Since most operating systems are substantially written in C, the C locale libraries are available to applications as well.

ICU (International Classes for Unicode) may be used to internationalize C/C++ programs.

Basis Technology's Rosette Library C/C++ is commonly used for adding Unicode support to custom programs. For a demo, try Lycos.

If you are porting old C/C++ source to Unicode on Unix or Windows, you may want to try OneRealm's Eliminator program for converting and auditing source code.

C++ language

See C language above.

ECMAscript (aka JavaScript, LiveScript)


strings - always UTF-16. There is no character type. No normalization is done internally, so it expects to receive Unicode Normalization Form C, which the W3C recommends.
localized number formatting - yes
localized date and time - yes
localized calendar calculations - primitive


Adding internationalization support to the base standard for JavaScript: Lessons learned in internationalizing the ECMAScript standard, Richard Gillam

Ada95 language

Ada95 was designed for Unicode support and the Ada95 standard library
features special ISO 10646-1 data types Wide_Character and Wide_String,
as well as numerous associated procedures and functions. The GNU Ada95
compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of
wide characters. This allows you to use UTF-8 in both source code and
application I/O. To activate it in the application, use "WCEM=8" in the
FORM string when opening a file, and use compiler option "-gnatW8" if
the source code is in UTF-8. See the GNAT and Ada95 reference manuals for details.

XML

XML requires parsers to support both UTF-8 and UTF-16. Optionally, XML parsers may support other character sets, such as ASCII.

A contributor writes:

"It's like so: if you're using UTF-16, you *must* have the BOM.  If 
you're using UTF-16-LE or BE, then even though these are really UTF-16, 
you can avoid using the BOM.  Some people over at the IETF are trying 
to *forbid* the BOM with the -BE and -LE versions, which essentially 
means that they shouldn't be used for XML, since XML is all about 
interoperability, and byte order just can't relied on outsde the 
firewall."

Extensible Markup Language (XML) 1.0 - Character Encoding in Entities

Extensible Markup Language (XML) 1.0 - Autodetection of Character Encodings (Non-Normative)

The CPAN module XML::Parser has its own character transcoding routines (multi-byte no less), which are documented in the source in Expat/expat/xmlparse/xmlparse.h, Expat/encoding.h and Expat/Expat.xs. There is an interesting description of the lack of standards in the S-JIS and EUC-JP character set encodings in Japanese_Encodings.msg.

XML::Parser uses wchar_t, which the Unicode Standard recommends against as wchar_t varies in width depending on the compiler. XML::Parser also engages in old-fashioned C character processing loops, which are inadvisable in i18N programming.

Python

Python i18N SIG

TCL

How to write a transformation (channel)


Q25. What support for Unicode do web browsers have?

A25. This varies depending on browser version.

Netscape 4.x and IE 4.x and above support UTF-8 and possibly other transformation formats reasonably well. Statistics I have seen show that 99% of page views are with 4.x and above. UTF-8 is fine for content pages.

Some problems that have been observed are:


in Windows Communicator 4.7, View Source on a UTF-8 encoded page while the ISO-8859-1 character set is selected will show hollow rectangles between characters.
in pre-4.06 Netscape Communicator, JavaScript strings can break if they end in a binary value that could be confused with a quote character
forms maybe tricky with UTF-8
the default Japanese fonts used with Navigator for Linux have been described as "unreadable".  Try TwinBridge for different fonts.
It has been reported that you lose the ability to get script-less auto-download of language support for ARA/HEB/THA/VIE/CHT/CHS/KOR/JPN for people using IE 5.0/5.01/5.5/etc.



Q26. How can I i18N my web pages and CGI programs?

A26. In a nutshell, by generating the correct headers and HTML meta tags.

HTML4 uses UCS as its document character set, but the document may be transferred with a different character encoding. That is why numeric entity codes index UCS code values rather than S-JIS or ISO-8859-1 or whatever character encoding was used to transfer the document.

Please seee Notes on Internationalization for details. To summarize, it recommends configuring your web server to send the correct Content-type character encoding response and not using META CHARSET tags, especially with ISO-8859-1 text. Some older browsers are known to ignore charsets that are not ISO-8859-1, so this may not be a perfect solution. 

When generating static pages, I would emit the correct META tag to help browsers of foreign readers default to the correct character encoding. (Do a view source on this page to see one.)

The correct meta tag is helpful even when the server could send the correct header, because some browser versions do not understand non ISO-8859-1 charset set header names, like ISO-8859-5 for example.
Otherwise older browsers may see the file as an unrecognizable encoding and treat it as a binary that must be saved to disk.
Although the language meta tag should appear as the first tag in the head block, it may not make that much difference where it is placed as some believe the entire head tag is processed as a unit.

To make your HTML render correctly, avoid font face tags in CJKV pages that users may not have. Otherwise mojibake or hollow squares will likely result (observed with Netscape browsers on both Windows and and Mac platforms.) Using a very general font tag like <font face="sans-serif,Arial,Helvetica,serif"> may work by providing enough defaults so some font will match. Also, the non-breaking space tag may cause visual problems when displaying Japanese text on the Mac in older Netscape browsers.

A work-around for displaying multiple native encodings on the scree simultaneously is to use multiple frames, one for each charset.

It has been reported that as of now (Sept. 12), UTF-8 is not usable as a general charset for public Internet use when Asian languages are displayed on Netscape 4.x because of issues of an available Unicode font and low likelihood that your audience can find and install a Unicode font.

HTML::Entities::{de,en}code_entities can be used to translate HTML entity codes from de to en for example.
W3C SGML Entities

When generating dynamic pages, emit the correct server response header:

    use CGI.pm qw /:standard/;
    print header("text/html; charset=iso-8859-1"),
       start_html();

If you want to generate the correct content type and language for creating static HTML pages with CGI.pm, you can modify the start_html subroutine in CGI.pm to 
emit the custom head tags first, then:
print start_html( -head => q{

});


Forms may be constructed with a hidden field with charset input and locale (language and country) fields.

IE5 and later will fill a field called _charset_.
<input type="hidden" name="_charset_">

IE4 and IE5 will submit characters that do not fit into the charset used for form submission as HTML numeric character references (&#12345;). This may need special decoding.

Forms may also be constructed have a hidden field with known sample text to be tested for mangling. This would be especially effective with Japanese, since multiple character sets are in common use, including ISO-2202-JP, Shift-JIS, and JIS.

RFC2070
recommends using the ACCEPT-CHARSET hint, like ACCEPT-CHARSET="utf-8" on both the FORM element and the FORM controls. There is substantial ambiguity in how the user, browser client, and server would interact with ACCEPT-CHARSET, so this method would need some experimentation.

With


FORM METHOD=GET ...  and
FORM METHOD=POST ENCTYPE="application/x-www-form-urlencoded" ...


the %HH codes represent octets but there is not standard for specifying the character encoding. Only ENCTYPE=multipart/formdata (sp?) provides the opportunity to send the encoding information. So far only lynx uses it.  

Altavista's international country selector page illustrates using .GIF images to render multiple character sets on the same non-Unicode encoded page.

The recommended character set encoding for Japanese web pages is Shift-JIS.

Fascinating email on browser charset issues


Q27. How should I structure my web server directories for international content?

A27. If you have more than a few pages for each locale, the usual practice is to have separate directories for each locale to make content management easier. This is especially effective for truly localized content like local news, or when some locales get more frequent updates than others.

Here's an example:

http://home.netscape.com/fr/index.html, or perhaps even better depending on your requirements:


http://home.netscape.com/fr_FR/index.html


Q28. Can web servers automatically detect the language of the browser and display the correct localized page?

A28. Yes. HTTP/1.1 defines the details of how content negotiation works, including language content.

WWW browsers send an Accept-Language request header specifying which languages are preferred for responses. This technique works fairly well, although some versions of Netscape Navigator send an improperly formatted request parameter. Also, switching language preferences in either Navigator or IE 4 doesn't always "take" without first deleting a language hint.

Few sites do content-negotiation on language, and interestingly enough I do not know of any major portals doing this. One site that does is Sun's documentation library at SunDocs. Debian.org does a very nice job of using Apache content negotiation wih languages and also has some nice help info too on Setting the Default Language.

Apache's Content Negotiation features will select the right page to return, whether HTML or image file. Annoyingly, the match logic is very literal, so a browser request for en-us will not match a server entry of en except as a last resort. Any other exact match will win over en, even if en-us was first preference.

There are 2 ways of doing content negotiation with Apache: type or variant maps and multiviews.

Variant Maps

In httpd.conf, disable Options Multiviews if configured and add

AddType type-var var
DirectoryIndex index.var


Then create an index.var file like this:

URI: start; vary="type,language"

URI: index.html.en
Content-type: text/html
Content-language: en-GB

URI: index.html.en
Content-type: text/html
Content-language: en-US

URI: index.html.en
Content-type: text/html
Content-language: en

URI: index.html.fr
Content-type: text/html
Content-language: fr-CA

URI: index.html.fr
Content-type: text/html
Content-language: fr

URI: index.html.es
Content-type: text/html
Content-language: es


Multiviews

The multiviews technique works like this. This does add extra server load, as each content directory must be scanned for the variant document names.

index.html is localized into variant documents such as:


index.html.en (or index.en.html)
index.html.fr
index.html.es
index.html (or index.html.html as a reader has recommended) symlinked to one of the above as the last resort.


Here's an example of http.conf directives for this:

     # in httpd.conf
     AddLanguage en .en
     AddLanguage fr .fr
     AddLanguage es .es

     #
     # LanguagePriority allows you to give precedence to some languages
     # in case of a tie during content negotiation.
     # Just list the languages in decreasing order of preference.
     #
     LanguagePriority en es fr de pt
     
     Options Multiviews
     
     # end of httpd.conf fragment


Starting in Apache 1.2, you may also create documents with multiple language extensions.

O'Reilly's Apache - The Definitive Guide, Chapter 6


ApacheWeek article on Language Negotiation

Another ApacheWeek article on Language Negotiation

Netscape Enterprise Web Server

Yes, the Enterprise International Server edition supports multiple languages with content directories. Also LDAP may be populated with UTF-8 data, and server side javascript and the search feature can have a default language specified.

When a request for this page is made by a browser with an accept language request header for: http://www.someplace.com/somepage.html the server translates that to the following URL: http://www.someplace.com/xx/somepage.html, where xx is a language code.

If not found, the default page is served.

The magnus.conf directives that control this are:

ClientLanguage (en|fr|de|etc.) for client error messages like "Not Found" or "Access Denied"
DefaultLanguage(en|fr|de|etc.) for other error messages
AcceptLanguage (on|off)



Netscape Enterprise Server Administrator's Guide for Unix

Microsoft IIS

*** TBA ***


Q29. What format do I send strings to the translator?

A29. Ask the translator for the recommended string file format
before sending strings out for translation.
They may have a specific file format in mind for use with
translation memory tools. Some use Microsoft Word and
prefer their import format for creating a table.

If there's no preference (or you just can't wait), here's an acceptable text format that is both human and machine readable:
project:
copyright:
date:
version:
charset:

id: MESSAGE_ID1
en: english text
xx: translated text
note: Notes.
-
id: MESSAGE_ID2
en: english text
xx: translated text
note: Notes.
-

The note field can be used to tell the translator hints about context and parts of speech (noun, verb, etc.) If the final charset will be Unicode, you should consistently use Unicode throughout the production process to avoid potential conversion errors. The delimiting mostly blank line aids readability for humans.

Here's a first pass at an XML string resource format that should be compatible with current translation tools. See http://www.opentag.com/ for more details.
<?xml version="1.0" encoding="UTF-8"?>
<strings>
   <version>1.0</version>
   <string id="test">
      <en>test</en>
      <fr>l'examen</fr>
      <note>noun</note>
   </string>
</strings>

Here's a parser:


use XML::Parser;
use XML::Simple;

my $file = './strings.xml';
my $xs1 = XML::Simple->new();

my $doc = $xs1->XMLin($file);

foreach my $key (keys (%{$doc->{string}})){
   print "string=$key\n";
   print "   en=", $doc->{string}->{$key}{en}, "\n";
   print "   fr=", $doc->{string}->{$key}{fr}, "\n";
   print "   note=", $doc->{string}->{$key}{note}, "\n";
}


If your tools have xml:lang support, you can try this:

<?xml version="1.0" encoding="UTF-8"?>
<strings>
   <version>1.0</version>
   <string id="test" xml:lang="en">test</string>
   <string id="test" xml:lang="fr">l'examen</string>
</strings>



Q30. What are common encodings for email?

A30. Both transfer and character set encodings must be considered.

SMTP is a 7-bit protocol that supports US-ASCII. To support 8-bit data over 7-bits, many transfer encodings were developed:


quoted-printable - quoted-printable encodes bytes greater than 127 as hex values. It is mostly human-readable for Western European languages.
base64 - base64 encodes 3 bytes as 4 bytes.
uuencode - similar to base64, but with control information. See man uuencode.
binhex - Macintosh format
xxencode - obsolete alternative to uuencode, very similar algorithm.

Mail may be mangled such that multiple transfer encodings are used simultaneously (most commonly observed in webmail products).
Also, MIME encoding is often used now.

Here are standard and common character set encodings used in email:

English: US-ASCII
Latin languages: ISO-8859-1 or Windows code pages
Non-latin languages: ISO-8859-x or Windows code pages
Japanese: Standard - ISO-2022-JP, Common in US - S-JIS, Common in US but not recommended - EUC-JP
Korean: Standard - ISO-2022-KR and EUC-KR - 
Korean mail encodings are described here
Simplified Chinese (Mainland PRC China, Singapore): GB2312
Traditional Chinese (Taiwan, Hong Kong): Big5, EUC-TW


MIME allows multiple attachments, each in a different character set encoding.

What complicates email replies is that previous text is quoted in the body of the email. The quoted text is in a specific character set encoding that may not be what your MTA or MUA prefer, so transcoding may need to be done. An example would be replying to email in EUC-JP when you prefer S-JIS.

Here are some comments from a Japanese email user:
If you write Japanese email, it's strongly recommended
to use ISO-2022-JP. The contemporary way to send japanese email is,

  * Mime-Version is 1.0,
  * Content-Type is Text/Plain; charset=iso-2022-jp,
  * Content-Transfer-Encoding is 7bit (not Base64 nor quoted-printable)
   (some people/mailer don't attach CTE header. I don't know
    which is better.)

In early days, Japanese use ISO-2022-JP without MIME header. But
now, it's not recommended.

You should not use Shift_JIS and EUC-JP, even though RFC permit
to use these encodings, because there are many mailers in japan
not to handle these encodings still now.
In (near?) future, we may use UTF-8, but we should not use it now.


UTF-8 is not commonly used yet, but here's a sample header:

MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailer: Becky! ver. 2.00 (beta 10)
Content-Type: text/plain; charset="UTF-8"


Internet Mail Consortium: Implementing Internationalization in Internet Mail

GNU UUDeview and UUEnview Tools

Decoding Internet Attachments - A Tutorial


Q30b. What is happening with internationalized DNS?

A30b. The goal was to have a standard for
hosting non-English domain names in local scripts by the end of 2000.
For example, if a user wants to use Chinese ideograms as a domain name, that will be supported by both DNS resolvers and servers.

This will be accomplished by internationalizing the domain names, but not DNS itself, by storing the Unicode name as an encoded ASCII sequence in DNS. An example of an encoded ASCII sequence from the RACE algorithm of Japanese text is bq--gbqxg.com

The complications with internationalized DNS are:

DNS is a critical Internet service that has to work reliably and uniquely
BIND is a legacy product that is only slowly upgraded by hostmasters (if at all)
some machines are still using DNS in ROM!
hostmasters want human-readable (and English, too) logs generated for trouble-shooting.


Right now care must be taken when implementing iDNS since most the of standards documents are still in draft.

IETF IDN WG

draft-ietf-idn-race-03.txt - RACE: Row-based ASCII Compatible Encoding for IDN

Perl Sample Script for draft-ietf-idn-race-02.txt   
Perl Module for draft-ietf-idn-race-02.txt 

draft-ietf-idn-lace-01.txt - LACE: Length-based ASCII Compatible Encoding for IDN

Convert::RACE and Convert::LACE Perl modules by Tatsuhiko Miyagawa


Internet Mail Consortium nameprep test tool

i-DNS.net International

Center for Internet Research (CIRC) iDNS (Obsolete)



Q30c. How do I manage timezones in Perl?

Adding timezone support to an application your first time is a challenge.

Timezone boundaries are in fact dictated by law and politics, not
geography. This means that a program needs a
table by date and political entity (city or county level) that is updated on
a monthly basis with GMT offsets to be of any use.

The generally accepted practice nowadays is to store the
users' timezone setting preference as Region/City.

The city level is a fine enough granularity to define timezones
accurately (so far). Of note is that city timezone policies are less
subject to change than country borders, especially in
politically fluid regions like Africa. Legacy alphabetic (short) identifiers
like 'PST' are deprecated, or translated from and to
Region/City as needed. Storing EST as a preference is
obviously wrong for half the year, and am ambiguous,
and storing a GMT offset is even worse since then you're
not sure what the timezone is at all really.

Perl support for timezones has strengthened recently.

The most appropriate module for new applications
is Perl Wrappers for ICU. It supports both Region/City and
legacy short identifiers.

For legacy alphabetic identifier only support, Date::Manip
and Time::Zone are available on CPAN.

Olson tz data (used in most Unix implementations) and other documents
ftp://elsie.nci.nih.gov/pub/

A site that has interesting maps and info on timezones:
http://www.worldtimezone.com/

More information:
http://www.twinsun.com/tz/tz-link.htm

Just about any Unix install program these days uses a Region/City picker.

The Netmax Firewall product is notable in being a mod_perl
application with very sophisticated timezone support based on
a graphical image world map picker for Region/City.

A30c. Date::Manip can do timezone conversions between Standard timezones, not Daylight Savings timezones.

(If you are doing client-server programming, and the server stores dates in a common timezone (like PST or GMT), and the server's clock is set to Standard or Daylight time, and all your users also obey Standard or Daylight time, then Date::Manip will give you the correct timezone conversions. Otherwise the time will be offset 1 hour.)

ICU/picu is probably your best bet to do more flexible timezone conversions.

A Summary of the International Standard Date and Time Notation

Military World Time Zone Map




Q31. Any good references?

A31. Here's a list:

perldoc perlunicode in perl 5.6+
perldoc perllocale
CJKV Information Processing, Ken Lunde, O'Reilly & Associates (Errata)
The Unicode Standard : Version 3.0, The Unicode Consortium
A Practical Guide to Software Localization, Bert Esselink, 1998
Vancouver Web Pages is a site with cool stuff and has applied material on localization, translation memory tools and project management.

How to make a Multilingual Web server 
http://vancouver-webpages.com/multilingual/howto.html
Multilingual Resources 
http://vancouver-webpages.com/multilingual/resources.shtml

IETF Draft - Internationalization of the Hypertext Markup Language
Jim Breen's Japanese Page
Ken Lunde's Home Page
W3C - Internationalization / Localization Non-western Character sets, Languages, and Writing Systems
W3C - Charlint - A Character Normalization Tool
Notes on Internationalization by by A.J.Flavell
View Perl5-Porters archives
Subscribe to Perl-Unicode mailing list
honyaku@onelist.com Japanese translation mailing list

Archive
nihongo-computing@msdi.co.jp Japanese computing list

Archive
The ASCII Consortium : Official Website (Humor)
ON THE STATUS OF THE LATIN LETTER ÞORN AND OF ITS SORTING ORDER (More Humor?)
Character Set Considered Harmful
Unicode: A Primer, Tony Graham (Chapter Index)
SimulTrans Book Club


Miscellaneous References

Ken Lunde has written a character set conversion C tool called jconv to convert Japanese JIS, SJIS and EUC encodings. Typical usage is
jconv -is -oe <sjis.txt >euc.txt

Some other transcoding utilities are Laux's Kakitori, Ichikawa's NKF (there is a NKF Perl wrapper also), Sato's QKC, Yano's RXJIS.

iconv can also be used (on Unix systems):

iconv -f UJIS -t ISO-2022-JP JIS.txt


Viewing and Inputting Japanese Text in Windows

Ruby and Perl Code

Communicator 4.7 Release Notes - International

Japanese Email Software

ISO C Standard

If you are looking for tools for chinese, you will find many things here
http://www.mandarintools.com/
and here for japanese :
 http://ftp.cc.monash.edu.au/pub/nihongo/00INDEX.html

This free editor for Windows (95/98/NT/CE) is able to handle many different
encodings for japanese including EUC-JP and UTF-8.
 http://www.physics.ucla.edu/~grosenth/jwpce.html
Unicode resources on the net.


IBM Classes for Unicode (ICU), ex-Taligent, available in C++ and Java

http://czyborra.com/unicode/ucoverage lists the coverage of an *-iso10646-1 BDF font or a Unicode mapping file by Roman csyborra

The 19 × 21 × 28 = 11'172 Hangul syllables are all that are needed for Modern Hangul.

#!/usr/local/bin/perl
foreach $choseong (split(/;/,
"G;GG;N;D;DD;L;M;B;BB;S;SS;;J;JJ;C;K;T;P;H"))
{
  foreach $jungseong (split(/;/,
  "A;AE;YA;YAE;EO;E;YEO;YE;O;WA;WAE;OE;YO;U;WEO;WE;WI;YU;EU;YI;I"))
  {
    foreach $jongseong (split(/;/,
    ";G;GG;GS;N;NJ;NH;D;L;LG;LM;LB;LS;LT;LP;LH;M;B;BS;S;SS;NG;J;C;K;T;P;H"))
    {
      printf("U+%X:%s\n", 0xAC00 + $i++,
             "HANGUL SYLLABLE $choseong$jungseong$jongseong");
} } }





Q32. How do I convert US-ASCII to UTF-16 on Windows NT?

A32. Here's Peter Prymmer's code to convert US-ASCII to UTF-16 on Windows NT:

$unicode_string = "\0" . join("\0",split(//,$ascii_string));



Q33. How do I transform the name of a character encoding to the MIME charset name?

A33. Here's code to transform name of a encoding to the MIME charset name:

#!/usr/bin/perl
# Read the name of a character encoding from stdin and transform it into
# the corresponding standardized MIME charset name, as registered on
# (or pipelined for) ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
# Markus Kuhn  -- 2000-05-24

while () {
    tr/a-z/A-Z/;
    if (/8859[-_\/](\d+)(:.*)?$/) {
	print "ISO-8859-$1\n";
    } elsif (/LATIN[-_\/]?([1-4])$/) {
	print "ISO-8859-$1\n";
    } elsif (/LATIN[-_\/]?CYRILLIC/) {
	print "ISO-8859-5\n";
    } elsif (/LATIN[-_\/]?ARABIC/) {
	print "ISO-8859-6\n";
    } elsif (/LATIN[-_\/]?GREEK/) {
	print "ISO-8859-7\n";
    } elsif (/LATIN[-_\/]?HEBREW/) {
	print "ISO-8859-8\n";
    } elsif (/LATIN[-_\/]?5$/) {
	print "ISO-8859-9\n";
    } elsif (/LATIN[-_\/]?6$/) {
	print "ISO-8859-10\n";
    } elsif (/LATIN[-_\/]?7$/) {
	print "ISO-8859-13\n";
    } elsif (/LATIN[-_\/]?8$/) {
	print "ISO-8859-14\n";
    } elsif (/LATIN[-_\/]?9$/) {
	print "ISO-8859-15\n";
    } elsif (/LATIN[-_\/]?10$/) {
	print "ISO-8859-16\n";
    } elsif (/UTF[-_\/]?8/ || /UTF$/) {
	print "UTF-8\n";
    } elsif (/(WINDOWS|WIN|CP|DOS|IBM|MSDOS)[-_\/]?(\d+)$/) {
	print "windows-$2\n";
    } elsif (/ASCII/ || /X3\.4/ || /[^\d]646[\.-_\/]?IRV/) {
	print "US-ASCII\n";
    } elsif (/2022[-_\/]?([A-Z\d-]+)(:.*)?$/) {
	print "ISO-2022-$1\n";
    } elsif (/SHIFT[-_\/]?JIS/) {
	print "Shift_JIS\n";
    } elsif (/BIG[-_\/]?5$/) {
	print "Big5\n";
    } else {
	print;
    }
}

Tony Graham's Unicode Slides

Concepts of C/UNIX Internationalization by Dave Johnson


Mike Brown's XML and Character encoding

Setting up Macintosh Web Browsers for Multilingual and Unicode Support

Japanisation FAQ for computers running Western Windows

XML Japanese Profile W3C Note 14 April 2000

Erik's Various Internet and Internationalization Links

email from spc on unicode list

Herman Ranes' Japanese and Esperanto page in UTF-8

Why Can't they Just Speak ? - 102 Languages in UTF-8

Alan Wood's Unicode Resources - Unicode and Multilingual Support in Web Browsers and HTML

I Can Eat Glass ... Project

Origin of euro

Unicode Mailing List Archives

Braille

p5pdigest article on Unicode?

Alan Wood's Unicode Resources: Windows Fonts that Support Unicode

IETF Internationalized Domain Names WG

Repertoires of characters used to write the indigenous languages of Europe: A CEN Workshop Agreement


} END {
A language is just a dialect with an army and a navy.


ActionMessage Email Marketing

Glyph	A glyph is a particular image which represents a character or part of a character.
Coded Character Set	A mapping from a set of abstract characters to the set of non-negative integers. This range of integers need not be contiguous.
Locale	A specific language, geographic location and character set (and sometimes script). An example is 'fr_FR.ISO-8859-1', although locale strings are seldom standard on across platforms at present. Often only language_country is specified, or even just language. In a client-server environment (like the web), 3 locales are usually considered - server, data, and client locales.
Internationalization (i18N)	The technical aspects (character sets, date formats, sorting, number formatting, string resources) of supporting multiple locales in 1 product.
Localization (L10N)	The practical aspects (language, custom, fashion, color, etc. ) of expressing an application in a particular locale. Roughly, i18N is considered an engineering process while L10N is considered a translation process.
Globalization (g11n)	The cultural aspects of supporting multiple locales in a non-offensive and universally intuitive manner. Think Olympic or airport signage.
Canonicalization (c14n)	The process of standardizing text according to well-defined rules.
Japanization	The conversion of a product into the Japanese language and character set. There are 4 character sets used in Japanese computer text: hiragana, katakana, kanji and romaji (English alphabet.) Sometimes kanji characters have ruby aka furigana ("attached character") annotation above them to aid in irregular or difficult readings of personal or geographics names and for school children. Mojibake means "scrambled character" and is used to describe the unreadable appearance of electronic displays when the wrong character decoding is used. Because kanji may have multiple readings (meanings) depending on context, machine conversion to hiragana is unreliable. The kinsoku rule says that Japanese sentences are separated by periods that may not wrap to the beginning of a line.
CJKV	Chinese, Japanese, Korean and Vietnamese are often considered together because they all use multi-byte encodings.
Unicode	Unified Code for characters. Current version is 3.0. Most modern character sets have been incorporated already, and also many ancient ones (I have been informed that Indonesian Jawi is represented by Arabic and Extended Arabic codepoints. I need to doublecheck that since there is also a Indic script used in Java.) Unicode is a complex character set that is unlike ASCII in many ways, some of them being: A glyph may be composed from multiple codepoints in more than 1 ordering; national character sets may not consist of contiguous codepoints; symbols such as bullets, smiley faces and braille are included; binary sorting of Unicode character sequences is likely to be meaningless unless the sequences are normalized first.
UTF-32	32-bit Unicode 3.0 Transformation Format
UTF-16	16-bit Unicode 3.0 Transformation Format is a 16-bit encoding, with 16-bit surrogates for private characters and future use (Chinese characters, ancient languages, special symbols, etc.)
UTF-8	8-bit Unicode 3.0 Transformation Format is a variable width encoding form updated from UTF-2, often used with older C libraries and to save space with European text. UTF-8 may be 1 to 6 octets long, although usually 4 octets without surrogates. It is defined in RFC 2279
UTF-2	8-bit Unicode 1.1 Transformation Format is a variable width encoding form that was superseded by UTF-8. UTF-2 was used in Oracle 7.3.4
UTF-7	7-bit Unicode 2.0 Transformation Format is a variable width encoding form (used with older email that was not 8-bit clean and not MIME.) See RFC 2152
Unicode compliant	Character encoding implementation that conforms to a particular version of the Unicode spec, for certain features. May only implement a subset if so documented. (For example, a Unicode compliant app might only support certain languages (typically Western European), or even allow only US-ASCII!)
Code value (codepoint)	Unicode value for a character that is all or part of a glyph. The same codepoint may represent multiple glyphs, especially in Han unification (Chinese, Japanese, Korean.) Accents alone may have their own codepoint.
Pre-composed character	A Unicode character consisting of one code value. Some accented characters, notably Western European, have their own codepoints.
Base character	A Unicode character that does not graphically combine with preceding characters, and is not a control or format character.
Combining character	A Unicode character that graphically combines with a preceding base character. Typically accents and diacritical marks.
Composed Character	A Unicode character made of combined codepoints, usually non-spacing mark (accent) characters. Often the same accented glyph may consist of codepoints in different orderings, for example a character with accents above and below the character (like Thai.)
Combining character	A character that normally appears after a base character, and is an accent or other diacritical mark that is added to the previous base character.
Compatibility character	A character included in the Unicode standard that has been included for compatibility with a legacy encoding. Usually it looks similar enough to another non-compatibility character to be replaced with it when appropriate. An example is the set of Japanese half-width hiragana code values, which were included for round-trip compatibility with other character set encodings for use in smaller character cells, even though a Unicode application could achieve the same appearance with application-defined font rendering.
Normalization	There are four functions performed on Unicode character sequences so that two sequences may be compared in a meaningful way. Normalization is necessary because decomposed characters may have accents in different orders before normalization, but be the same glyph. Normalization is especially important to perform when computer language identifiers, filenames, mail folder names, digital signatures and emitting XML or JavaScript are involved.
Collation Order	Table and/or algorithm for sorting strings specific to a locale and usage (dictionary, phonebook, etc.) Unicode has not specified collation order at this time, but should in 3.0.
UCS	ISO/IEC 10646 Universal Multiple-Octet Coded Character Set. Both UCS and Uncode standards now share identical code values. The major difference between UCS and Unicode is that UCS is mostly concerned with defining code values, while Unicode adds semantics to the code values.
UCS-2	16-bit Universal Character Set (no surrogate pairs)
UCS-4	31-bit Universal Character Set.
Character Property	Unicode code values have default properties such as case, numeric value, directionality and mirrored as defined in the Unicode Character Database.
Combining Class	A numeric value given to each combining Unicode character that determines with which other combining characters it typographically interacts.
Byte Order Mark (BOM)	Unicode code value U+FEFF may optionally be prepended in serialized forms (files, streams) of Unicode characters. By default, files are assumed to be in network byte ordering (big-endian). BOM is discussed at greater length in the document.