| Author/Maintainer | James, Silicon Valley Perl Mongers, ex-Netscape and Yahoo contractor. Email me if you need Perl/i18N/web consulting. |
| Contributors/Reviewers | LW, GS, AK, PP, RT, TM, DH, TB, JH, MD |
| Copyright | 1999-2001, James, released under the Perl Artistic Licence |
| Link | http://rf.net/~james/perli18n.html |
| Date | 2002 02 18 |
| Audience | Perl programmers and porters interested in Perl, Unicode and Internationalization. Fonts are not addressed in this document. |
| Disclaimer | If there's an error or omission, blame the author, not the reference. |
| Version | Draft 0.3.37 - Please send me your comments! |
| Perl & Unicode Sightings |
| Programming Tools |
This is what I use:
The Apache Conference 2001 in San Jose featured a talk by ASF member Eric Cholet on writing Internationalized Applications with Perl and Template::Toolkit. Basically, he used conditional logic in the template for each supported language. The Mason template developers later talked about supporting locales in their templates without conditional logic, over a few beers. Eric also emphasized the importance of offering accented and non-accented search widgets for European users. Apparently our non-English friends often type in search queries without bothering to accent the phrase.
Brian Stell has started the PICU (Perl Wrappers for ICU (International Components for Unicode)) project which is being hosted on SourceForge. Both of us attended the ICU Workshop Sept. 11-12, 2001 at IBM and Brian already has XS code working for a subset of ICU. I would like to thank IBM and the Unicode Committee for their contributions to Open Source regarding Unicode and ICU. You can see the slides from my O'Reilly Perl Conference 2001 talk on PICU here.
Unicode is a character set (16-bit at its core, surrogates aside) and a body of related semantics for simultaneously representing all modern written languages (and more). Unicode is the key technology for globalizing software, and has been implemented in Internet and database software.
With that power comes a price: Unicode is a complicated standard that requires skill and tools support to implement. This document was written to explain Unicode and international programming to two audiences, Perl porters (developers) and Perl users.
Q0. Do you have a checklist for internationalizing an application?
Q1. I think that I'm a clever programmer. What's so hard about internationalization?
Q2. Do you have a glossary of commonly used terms and acronyms?
Q3: What locale support does Perl have?
Q4. What support does Perl have for Unicode?
Q5. How do operating systems implement Unicode and i18N?
Q6. I'm a Perl Porter. What should I know about i18N and C?
Q7. I'm a Perl Porter. What should I know about Perl and Unicode?
Q8. I'm a CPAN module author. What should I know about Perl and Unicode?
Q9. Do regular expressions work with locales?
Q10. Do regular expressions work with Unicode?
Q11. What are these CPAN Unicode modules for?
Q11b. What about i18N POD?
Q12. What is JPerl?
Q13. Can I just do nothing and let my program be agnostic of character set?
Q14. Why and where should I use Unicode instead of native encodings?
Q15. What is Unicode normalization and why is it important?
Q16. How do I do auto-detection of Unicode streams?
Q17. Is Unicode big endian or little endian?
Q18. Is there an EBCDIC-safe transformation of Unicode?
Q19. Are there security implications in i18N?
Q20. Are there performance issues in i18N?
Q21. How do I localize strings in my program?
Q22. I do database programming with Perl. Can I use Unicode?
Q23. I do database programming with Perl. What are the i18N issues?
Q24. How do other programming languages implement Unicode and i18N?
Q25. What support for Unicode do web browsers have?
Q26. How can I i18N my web pages and CGI programs?
Q27. How should I structure my web server directories for international content?
Q28. Can web servers automatically detect the language of the browser?
Q29. What format do I send strings to the translator?
Q30. What are common encodings for email?
Q30b. What is happening with internationalized DNS?
Q30c. How can I manage timezones in Perl?
Q32. How do I convert US-ASCII to UTF-16 on Windows NT?
Q33. How do I transform the name of a character encoding to the MIME charset name?
Q0. Do you have a checklist for internationalizing an application?
Each application needs a different level of i18N support, so there's no formula. Often a single application will need varying levels of support for users and administrators. Generally companies want to do the minimal amount of work possible, without limiting future improvements, because of the cost of developing and testing changes.
Here is a basic checklist for an i18N design document for web applications:
Here is a basic checklist for a localization design document:
If your development team does not have i18N experience, consider either hiring former Adobe, Apple or Netscape i18N engineers or using the services of GlobalSight, SimulTrans or Basis Technology.
Airplanes don't fly until the paperwork equals the weight of the aircraft. Same with i18N.
Q1. I think that I'm a clever programmer. What's so hard about internationalization?
A1. Internationalizing a product involves issues about program design, application language features, cultural practices, fonts and often legacy clients. Most programmers face a rude awakening when first internationalizing an application after a career of only ASCII. Little details often become big headaches.
For a typical tale of woe, see Richard Gillam's Adding internationalization support to the base standard for JavaScript: Lessons learned in internationalizing the ECMAScript standard
Q2. Do you have a glossary of commonly used terms and acronyms?
A2. Here is a high-level glossary of terms. For more detailed Unicode definitions, consult the Unicode Standard 3.0 and ISO 10646-1:2000.
| Glyph | A glyph is a particular image which represents a character or part of a character. |
| Coded Character Set | A mapping from a set of abstract characters to the set of non-negative integers. This range of integers need not be contiguous. |
| Locale | A specific language, geographic location and character set (and sometimes script). An example is 'fr_FR.ISO-8859-1', although locale strings are seldom standardized across platforms at present. Often only language_country is specified, or even just language. In a client-server environment (like the web), 3 locales are usually considered - server, data, and client locales. |
| Internationalization (i18N) | The technical aspects (character sets, date formats, sorting, number formatting, string resources) of supporting multiple locales in 1 product. |
| Localization (L10N) | The practical aspects (language, custom, fashion, color, etc. ) of expressing an application in a particular locale. Roughly, i18N is considered an engineering process while L10N is considered a translation process. |
| Globalization (g11n) | The cultural aspects of supporting multiple locales in a non-offensive and universally intuitive manner. Think Olympic or airport signage. |
| Canonicalization (c14n) | The process of standardizing text according to well-defined rules. |
| Japanization | The conversion of a product into the Japanese language and character set. There are 4 scripts used in Japanese computer text: hiragana, katakana, kanji and romaji (the Latin alphabet). Sometimes kanji characters have ruby aka furigana ("attached character") annotation above them to aid in irregular or difficult readings of personal or geographic names and for school children. Mojibake means "scrambled character" and is used to describe the unreadable appearance of electronic displays when the wrong character decoding is used. Because kanji may have multiple readings (meanings) depending on context, machine conversion to hiragana is unreliable. The kinsoku rule governs line breaking: certain characters, such as the sentence-ending period, may not be wrapped to the beginning of a line. |
| CJKV | Chinese, Japanese, Korean and Vietnamese are often considered together because they all use multi-byte encodings. |
| Unicode | Unified Code for characters. Current version is 3.0. Most modern character sets have been incorporated already, and also many ancient ones (I have been informed that Indonesian Jawi is represented by Arabic and Extended Arabic codepoints. I need to double-check that since there is also an Indic script used in Java.) Unicode is a complex character set that is unlike ASCII in many ways, some of them being: A glyph may be composed from multiple codepoints in more than 1 ordering; national character sets may not consist of contiguous codepoints; symbols such as bullets, smiley faces and braille are included; binary sorting of Unicode character sequences is likely to be meaningless unless the sequences are normalized first. |
| UTF-32 | 32-bit Unicode 3.0 Transformation Format |
| UTF-16 | 16-bit Unicode 3.0 Transformation Format is a 16-bit encoding, with surrogate pairs for supplementary characters and future use (additional Chinese characters, ancient scripts, special symbols, etc.) |
| UTF-8 | 8-bit Unicode 3.0 Transformation Format is a variable width encoding form updated from UTF-2, often used with older C libraries and to save space with European text. A UTF-8 sequence may be 1 to 6 octets long as defined in RFC 2279, although characters in the Basic Multilingual Plane need at most 3 octets and characters represented by surrogate pairs in UTF-16 need 4. |
| UTF-2 | 8-bit Unicode 1.1 Transformation Format is a variable width encoding form that was superseded by UTF-8. UTF-2 was used in Oracle 7.3.4 |
| UTF-7 | 7-bit Unicode 2.0 Transformation Format is a variable width encoding form (used with older email that was not 8-bit clean and not MIME.) See RFC 2152 |
| Unicode compliant | Character encoding implementation that conforms to a particular version of the Unicode spec, for certain features. May only implement a subset if so documented. (For example, a Unicode compliant app might only support certain languages (typically Western European), or even allow only US-ASCII!) |
| Code value (codepoint) | Unicode value for a character that is all or part of a glyph. The same codepoint may represent multiple glyphs, especially in Han unification (Chinese, Japanese, Korean.) Accents alone may have their own codepoint. |
| Pre-composed character | A Unicode character consisting of one code value. Some accented characters, notably Western European, have their own codepoints. |
| Base character | A Unicode character that does not graphically combine with preceding characters, and is not a control or format character. |
| Combining character | A Unicode character that graphically combines with a preceding base character. Typically accents and diacritical marks. |
| Composed Character | A Unicode character made of combined codepoints, usually non-spacing mark (accent) characters. Often the same accented glyph may consist of codepoints in different orderings, for example a character with accents above and below the character (like Thai.) |
| Compatibility character | A character included in the Unicode standard that has been included for compatibility with a legacy encoding. Usually it looks similar enough to another non-compatibility character to be replaced with it when appropriate. An example is the set of Japanese half-width hiragana code values, which were included for round-trip compatibility with other character set encodings for use in smaller character cells, even though a Unicode application could achieve the same appearance with application-defined font rendering. |
| Normalization | There are four normalization forms that can be applied to Unicode character sequences so that two sequences may be compared in a meaningful way. Normalization is necessary because decomposed characters may have accents in different orders before normalization, yet represent the same glyph. Normalization is especially important when handling computer language identifiers, filenames, mail folder names and digital signatures, and when emitting XML or JavaScript. |
| Collation Order | Table and/or algorithm for sorting strings specific to a locale and usage (dictionary, phonebook, etc.) Unicode specifies a default, locale-tailorable collation order in the Unicode Collation Algorithm (Unicode Technical Report #10). |
| UCS | ISO/IEC 10646 Universal Multiple-Octet Coded Character Set. Both UCS and Unicode standards now share identical code values. The major difference between UCS and Unicode is that UCS is mostly concerned with defining code values, while Unicode adds semantics to the code values. |
| UCS-2 | 16-bit Universal Character Set (no surrogate pairs) |
| UCS-4 | 31-bit Universal Character Set. |
| Character Property | Unicode code values have default properties such as case, numeric value, directionality and mirrored as defined in the Unicode Character Database. |
| Combining Class | A numeric value given to each combining Unicode character that determines with which other combining characters it typographically interacts. |
| Byte Order Mark (BOM) | Unicode code value U+FEFF may optionally be prepended in serialized forms (files, streams) of Unicode characters. By default, files are assumed to be in network byte ordering (big-endian). BOM is discussed at greater length in the document. |
Official Unicode FAQ
Unicode Technical Report # 17 - Character Encoding Model
W3C Character Model for the Web
Forms of Unicode, Mark Davis
A Unicode HOWTO with definitions
ISO 639-2/T: Language Codes for terminological use
RFC 1766: Tags for the Identification of Languages
Country Codes: ISO 3166, Microsoft, and Macintosh
ISO 639-1 and ISO 639-2: International Standards for Language Codes. ISO 15924: International Standard for names of scripts
Q3: What locale support does Perl have?
A3: Locale has been well-supported in Perl for OEM character sets since Perl 5.004, using the underlying C libraries as the foundation.
Locale is not well-supported for Unicode yet. Locale is still important in a Unicode world, contrary to common misunderstanding, for:
There are many tedious details that both the operating system and the programmer have to cooperate on to make locale work.
perldoc perllocale is an excellent reference. It is important to read this document because it is not intuitive which operators are locale-sensitive.
The Perl Cookbook Sections 6.2 and 6.12 also discuss Perl regular expressions and locale.
A simple programming example from the pod is:
require 5.004;
use POSIX 'locale_h';
use locale;
# query and save the current locale before changing it
my $old_locale = setlocale(LC_CTYPE);
setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
# locale-specific code ...
# restore the original locale
setlocale(LC_CTYPE, $old_locale);
Example by jhi:
using locales in Perl is a two (well, three) step process:
(1) use POSIX 'locale_h';
(2) setlocale(LC_..., ...);
(3) use locale;
The first one makes the LC_... constants visible.
The second one does the libc call.
The third one allows LC_CTYPE to modify your \w.
The following works for me in Solaris:
#!/usr/bin/perl -lw
use POSIX 'locale_h';
setlocale(LC_CTYPE, "fr_CA") or warn "uh oh... $!";
use locale;
print setlocale(LC_CTYPE); # prints 'fr_CA'
my $test = "test" . chr(200);
print $test;
$test =~ s/(\w+)/[$1]/;
print $test;
Below is a fun test program. Watch how en_US changes character set depending on the previous locale.
use strict;
use diagnostics;
use locale;
use POSIX qw (locale_h);
my @lang = ('default','en_US', 'es_ES', 'fr_CA', 'C', 'en_us', 'POSIX');
foreach my $lang (@lang) {
if ($lang eq 'default') {
$lang = setlocale(LC_CTYPE);
}
else {
setlocale(LC_CTYPE, $lang)
}
print "$lang:\n";
print +(sort grep /\w/, map { chr() } 0..255), "\n";
print "\n";
}
C:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
en_US:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
es_ES:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
fr_CA:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
C:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
en_us:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
POSIX:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
Q4. What support does Perl have for Unicode?
A4. Here there be dragons! As of now (Feb, 2002), my opinion is that there are 2 paths you can take until 5.8.0 is released this year:
Chapter 15 of Programming Perl, Third Edition, describes Perl's Unicode support. 1000 copies were released in time for the O'Reilly Perl Conference in July, 2000 (I have one). Overall it's a good collection of facts regarding Perl and Unicode, although it could be much improved with locale information. My opinion is that locale mgmt. is everything in i18N.
The UTF-8 character set has been experimentally supported internally since Perl 5.005_50 (the development releases) when requested with use utf8;
This means that in 5.005_50 or later:
There was a Perl and Unicode BOF at the Perl Conference on Wed, July 19. Lots of porters were there, including GS, AK, JH and NI. GS repeated the outstanding issues (printf, normalization, io disciplines, etc.), James repeated his issues (normalization, locale mgmt.), and GS scrolled through a Unicode::Normal module. Nobody made any additional commitments, so the mood was kind of somber.
Some of the implementation issues GS et al. were working on in Nov 1999:
Perl 5.6 was the first stable release with some core support for UTF-8. It was available March 23, 2000. However there is still much work to be done. Read perldoc perlunicode and perldoc utf8 to see the capabilities and limitations.
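As a hedged sketch of what this buys you (exact behaviour shifted between 5.6 and 5.8, so treat the details as illustrative), the difference between character and byte semantics shows up in length():

use utf8;                      # still needed in 5.6 to request Unicode semantics
my $s = "caf\x{e9}";           # 4 characters; the e-acute occupies 2 bytes in UTF-8
print length($s), "\n";        # 4 -- length counts characters
{
    use bytes;                 # the 5.6 bytes pragma exposes the raw octets
    print length($s), "\n";    # 5 -- now length counts bytes
}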
From the ToDo-5.6 file:
Unicode support
finish byte <-> utf8 and localencoding <-> utf8 conversions
make substr($bytestr,0,0,$charstr) do the right conversion
add Unicode::Map equivalent to core
add support for I/O disciplines
- a way to specify disciplines when opening things:
open(F, "<:crlf :utf16", $file)
- a way to specify disciplines for an already opened handle:
binmode(STDIN, ":slurp :raw")
- a way to set default disciplines for all handle constructors:
use open IN => ":any", OUT => ":utf8", SYS => ":utf16"
eliminate need for "use utf8;"
autoload byte.pm when byte:: is seen by the parser
check uv_to_utf8() calls for buffer overflow
(see also "Locales", "Regexen", and "Miscellaneous")
Locales
deprecate traditional/legacy locales?
How do locales work across packages?
figure out how to support Unicode locales
suggestion: integrate the IBM Classes for Unicode (ICU)
http://oss.software.ibm.com/developerworks/opensource/icu/project/
and check out also the Locale Converter:
http://alphaworks.ibm.com/tech/localeconverter
ICU is "portable, open-source Unicode library with:
charset-independent locales (with multiple locales simultaneously
supported in same thread; character conversions; formatting/parsing
for numbers, currencies, date/time and messages; message catalogs
(resources) ; transliteration, collation, normalization, and text
boundaries (grapheme, word, line-break))".
There is also 'iconv', either from XPG4 or GNU (glibc).
iconv is about character set conversions.
Either ICU or iconv would be valuable to get integrated
into Perl, Configure already probes for libiconv and iconv.h.
Regexen
a way to do full character set arithmetics: now one can do
addition, negate a whole class, and negate certain subclasses
(e.g. \D, [:^digit:]), but a more generic way to add/subtract/
intersect characters/classes, like described in the Unicode technical
report on Regular Expression Guidelines,
http://www.unicode.org/unicode/reports/tr18/
(amusingly, the TR notes that difference and intersection
can be done using "Perl-style look-ahead")
difference syntax? maybe [[:alpha:][^abc]] meaning
"all alphabetic expect a, b, and c"? or [[:alpha:]-[abc]]?
(maybe bad, as we explicitly disallow such 'ranges')
intersection syntax? maybe [[..]&[...]]?
POSIX [=bar=] and [.zap.] would be nice too but there's no API for them
=bar= could be done with Unicode, though, see the Unicode TR #15 about
normalization forms:
http://www.unicode.org/unicode/reports/tr15/
this is also a part of the Unicode 3.0:
http://www.unicode.org/unicode/uni2book/u2.html
executive summary: there are several different levels of 'equivalence'
approximate matching
Miscellaneous
Unicode collation? http://www.unicode.org/unicode/reports/tr10/
Q5. How do operating systems implement Unicode and i18N?
A5. Operating Systems
Linux 2.2
The kernel remains mostly agnostic of what character encoding is used in files and file names, as long as it is ASCII compatible. File content, pipe content, file names, environment variables, source code, etc. all can be in UTF-8. Linux (like Unix) does not provide any per-file or per-syscall tagging of character sets and instead the preferred system character set can be specified per process using LC_CTYPE.
Users should aim at using only a single character set throughout their applications. This is today mostly the respective regional ISO 8859 variant and will in the future become UTF-8. Work is being done on making most applications usable with UTF-8 and there is hope that Linux will be able to switch over completely from ASCII and ISO 8859 to UTF-8 in only a few years. Full UTF-8 locale support will be available starting with glibc 2.2. UTF-8 support for xterm will be available with XFree86 4.0.
There are no plans in the Linux/POSIX world to duplicate the entire API for 16-bit Unicode as it was done for Win32. UTF-8 will simply replace ASCII at most levels eventually in any inter-process communication. UCS-4 in the form of wchar_t might be used internally by a few applications for which UTF-8 is inconvenient to process.
locale -a on Red Hat 6.0 indicates there is locale support for 'ja_JP.EUC' but not S-JIS.
There is a nifty Gnome Panel applet called Character Picker that allows one to select accented characters and paste them into an app. (Hint: set focus to the applet and key the character you want help with. Also includes some Greek and trademark symbols.) strace is your friend for debugging system calls on Linux.
Li18NUX - Linux Internationalization Initiative
Bruno Haible's Linux Unicode HOWTO
Markus Kuhn's excellent UTF-8 and Unicode FAQ for Unix/Linux
Universal Locales for Linux
KDE is based on Qt, a cross-platform C++ GUI application framework. The Qt site has i18N and Unicode descriptions.
BSD
The flavors of BSD are at a very early stage of i18N. Because the gettext library is GPL, BSD won't use it. Their message catalog, catopen, has not been widely used. There is not yet an i18N system installer. An effort is being made to develop a Unicode file system, but otherwise there is only native character encoding support. Most of the i18N developers are Asians adding support for their locale only. An X input method (IME) has been implemented via XIM.
Sun Solaris 2.7
Solaris supports variable length Extended Unix Code (EUC) and fixed length 4-byte wide characters (wchar_t).
Solaris has good support for i18N as documented in the following manuals. Besides the usual C library support, Solaris supports string resource localization with the gettext() function. truss is your friend in debugging system calls on Solaris.
Solaris has locale support for UTF-8, offering several locales for Western European languages, Japanese and Korean. A typical locale string is "en_US.UTF-8".
References
SunSoft Solaris Porting Guide
SunSoft Developer's Guide to Internationalization
SunSoft Solaris International Developer's Guide
Sun i18N Guidelines for C and C++
HP/UX
See Solaris 2.7.
Windows NT 4.0
UTF-16 internally, with both Unicode (wide) API and simulated ASCII calls. Microsoft Surrogates Paper
Windows 95/98
Uses Windows native formats and codepages, not Unicode.
Windows CE
Unicode with only wide calls.
Mac OS
MacOS 9 uses maximally decomposed UTF-16 filenames, and stores the file type ('text' or 'utxt') in the file system table.
MacPerl FAQ describes differences between Mac and Unix character sets
Mac Mach Unix
Unicode internally.
Q6. I'm a Perl Porter. What should I know about i18N and C?
A6. Lots of things to watch.
For 8-bit non-UTF-8 encodings:
For UTF-8:
Here is a typical character processing loop:
while (s < send && isALNUM_utf8(s))
s += UTF8SKIP(s);
Sun i18N Guidelines for C and C++
Q7. I'm a Perl Porter. What should I know about Perl and Unicode?
A7. Lots of things to watch.
GS has suggested that a good way for less experienced porters to contribute code to the Unicode porting effort is to overload built-in Perl operators and call an XS module to do the new functions in C. Then it is a simple matter to have a more experienced porter patch the Perl core with your new features.
Q8. I'm a CPAN module author. What should I know about Perl and Unicode?
A8. The Dec 99 plan is to hide the internal encoding from Perl programmers. The encoding can be invisible to user programmers because the string routines operate on character units without regard to the actual bits.
Obviously this is not true if the programmer does a use byte; and tries to do his own character conversion unless the original encoding is known beforehand.
i18N developers will always need some kind of access to the raw bytes in strings when troubleshooting character conversion problems (mojibake).
Here's an interesting email:
On Tue, Sep 12, 2000 at 12:24:50AM +0200, Gisle Aas wrote:
> Jarkko Hietaniemi writes:
> > Please take a look at the (very rough) first draft of Encode, an extension
> > for character encoding conversions for Perl 5:
> >
> >     http://www.iki.fi/jhi/Encode.tgz
> >
> > Download, plop it into the Perl 5.7 source directory, unpack,
> > re-Configure, rebuild. (Or, if you have a Perl 5.7 in your path,
> > cd to ext/Encode, perl Makefile.PL, make).
UTF8::Hack
Test suites are valuable for checking your code.
A stress test file
Q9. Do regular expressions work with locales?
A9. Yes, see perldoc perlre.
In short, \w and \s (along with their converses \W and \S respectively) are locale-dependent. This is less useful than it appears because \w represents Perl identifier word characters [a-zA-Z0-9_] (in en locale), rather than cultural words.
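As a hedged illustration (assuming the fr_FR.ISO8859-1 locale is installed on your system), \w under use locale will match accented letters that it would otherwise reject:

use POSIX 'locale_h';
use locale;

setlocale(LC_CTYPE, "fr_FR.ISO8859-1")
    or warn "locale fr_FR.ISO8859-1 not available";

my $word = "fa\xE7ade";                     # "facade" with a c-cedilla, in Latin-1
$word =~ /(\w+)/;
print "matched: $1\n";                      # captures the whole word only under this locale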
Q10. Do regular expressions work with Unicode?
A10. There is some support, see perldoc perlre.
New in 5.6: \p{IsSpace} matches any Unicode character that possesses the IsSpace property. \P{IsSpace} would match not IsSpace. Most of the Unicode property tables are bundled with Perl5.6.0, with the exception of UniHan and NormalizerTest. The files are renamed to fit the 8.3 filename naming convention if necessary for portability reasons.
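For example (a minimal sketch, assuming a 5.6 perl built with the bundled property tables):

use utf8;                                   # request Unicode semantics in 5.6
my $text = "foo\x{3000}bar";                # U+3000 IDEOGRAPHIC SPACE between the words
print "Unicode whitespace found\n" if $text =~ /\p{IsSpace}/;
print "but no ASCII whitespace\n" unless $text =~ /[ \t\r\n]/;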
Temporarily, tr// has UC (utf-8 to char) and CU (char to utf-8) options.
This feature will likely be eliminated in favor of many to many mapping functions. Unicode can be used as a pivot for converting any charset to another, although not all characters have matches in another charset.
Here are some one-liners for Latin-1 to UTF-8 and vice versa:
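# Latin-1 bytes to two-byte UTF-8 sequences: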
s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
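# two-byte UTF-8 sequences back to Latin-1 bytes: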
s/([\xC0-\xDF])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
Q11. What are these CPAN Unicode modules for?
A11. The modules listed below will likely be obsoleted by internal C routines in Perl 6.0, although they are still very useful for older versions of Perl.
You will need a C compiler to build most of them. Somebody has to post some ppm packages or most Windows people are out of luck until they install a compiler.
Jcode
Jcode is used to convert between EUC/S-JIS/Unicode using the Unicode EASTASIA character mapping tables. Also supported are tr and other operations. Both OO and procedural programming models are supported, along with compatibility with older jcode.pl.
use Jcode;
# the new method does charset autodetection on the supplied argument
my $euc = Jcode->new("string from a set of supported charsets");
my $sjis = $euc->sjis;
my $ucs2 = $euc->ucs2;
my $utf8 = $euc->utf8;
Dan Kogai wrote Jcode.
Unicode::Map
Map is used to map characters to and from UCS-2. The actual mappings available are defined in the REGISTRY file. Various code pages and tables are stored as files.
Here's a demo from the POD documentation. Pipe through od -c to see the null bytes.
use Unicode::Map;
$Map = new Unicode::Map({ ID => "ISO-8859-1" });
$_16bit = $Map -> to_unicode ("Hello world!");
# => $_16bit == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"
print $_16bit,"\n";
$_8bit = $Map -> from_unicode ($_16bit);
# => $_8bit == "Hello world!"
print $_8bit,"\n";
$Map = new Unicode::Map;
$_16bit = $Map -> to_unicode ("ISO-8859-1", "Hello world!");
# => $_16bit == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"
print $_16bit,"\n";
$_8bit = $Map -> from_unicode ("ISO-8859-7", $_16bit);
# => $_8bit == "Hello world!"
print $_8bit,"\n";
Martin Schwartz wrote Unicode::Map.
James will likely be the maintainer when new changes are proposed. One is adding a map file for EUC-JP.
Unicode::Map8
Map8 is used to map 8-bit character encodings to UCS2 (16-bit) and back. The programmer may build the translation table on the fly. Does not handle Unicode surrogate pairs as a single character.
require Unicode::Map8;
my $no_map = Unicode::Map8->new("ISO646-NO") || die;
my $l1_map = Unicode::Map8->new("latin1") || die;
my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
my $lstr = $l1_map->to8($ustr);
print $lstr;
print $no_map->tou("V}re norske tegn b|r {res\n")->utf8;
Gisle Aas wrote Unicode::Map8.
Unicode::String
This module does various mappings, including various Unicode Transformation Formats.
use Unicode::String qw(utf8 latin1 utf16);
$u = utf8("The Unicode Standard is a uniform ");
$u .= utf8("encoding scheme for written characters and text");
# convert to various external formats
print "UCS-4: ", $u->ucs4, "\n"; # 4 byte characters
print "UTF-16: ", $u->utf16, "\n"; # 2 byte characters + surrogates
print "UTF-8: ", $u->utf8, "\n"; # 1-4 byte characters
print "UTF-7: ", $u->utf7, "\n"; # 7-bit clean format
print "Latin1: ", $u->latin1, "\n"; # lossy
print "Hex: ", $u->hex, "\n"; # a hexadecimal string
Gisle Aas wrote Unicode::String.
I18N::Collate
This module compares 8-bit scalar data according to the current locale. I18N::Collate has been deprecated since 5.003_06.
use I18N::Collate;
setlocale(LC_COLLATE, 'en_US');
$s1 = new I18N::Collate "scalar_data_1";
$s2 = new I18N::Collate "scalar_data_2";
if ($s1 lt $s2) {
print $$s1, " before ", $$s2;
}
Here's a simple program that converts from native encodings to various UTFs using the above CPAN modules:
use CGI qw / :standard /;
use Unicode::Map;
use Unicode::String qw(utf8 latin1 utf16);
# initialize Unicode::String
Unicode::String->stringify_as("utf16");
my $encoding = param('encoding') || ''; # like Shift-JIS - see Map's REGISTRY
my $in = param('in') || ''; # input text in native encoding
my $Map = new Unicode::Map({ ID => $encoding });
my $map_out = $Map -> to_unicode($in);
my $us = Unicode::String -> new($map_out);
my $us_utf8 = $us -> utf8;
my $us_utf16 = $us -> utf16;
my $us_utf8_esc = CGI::escape($us_utf8);
print header(),
start_html(),
"$encoding = $in, UTF-16 = $us, UTF-8 = $us_utf8, URL-escaped UTF-8 = $us_utf8_esc\n",
end_html();
Newer Modules
PP emailed me about:
lib/Locale/Constants.pm Locale::Codes
lib/Locale/Country.pm Locale::Codes
lib/Locale/Currency.pm Locale::Codes
lib/Locale/Language.pm Locale::Codes
lib/Locale/Maketext.pm Locale::Maketext
there is also a pod version of the TPJ article that discussed the
reasoning behind the Locale::Maketext (which tends to embed more code into
a lexicon in an attempt to deal with nuances of things like ordinal vs.
cardinal numbering between languages). The article, by Sean M. Burke and
Jordan Lachler, is in:
lib/Locale/Maketext/TPJ13.pod Locale::Maketext documentation article
So perhaps Locale::Maketext is worth mentioning in your document(?).
Another CPAN module possibly worth mentioning is Locale::Msgcat by
Christophe Wolfhugel. Although it suffers from a couple of problems: the
h2xs output was not packaged all that well by Christophe (e.g. the
DESCRIPTION mentions: "Perl extension for blah blah blah"), it is based on
numerically indexed MSG catalogs (and as you already point out numeric
indexing lexicons is difficult to program with and/or update and
maintain). Nonetheless I think that XPG4 inspired msg catalogs, via
catopen(), catgets(), and catclose() will be around for quite a while
since they are built into so many C and C++ implementations (indeed they are
required by XPG4, which is why there is a msgcat utility with gcc).
At any rate I think that it would be useful to mention Locale::Msgcat so
as to get it more widely ported and/or updated.
The Encode module that Jarkko and Nick Ing-Simmons wrote is now in
bleedperl and is already set up to deal with approximately 106
coded character set encodings:
% grep 'Encoding tables' MANIFEST | wc -l
106
But that result is actually an overcount since certain tables will be
counted twice in their *.enc and *.ucm forms, e.g.:
ext/Encode/Encode/koi8-r.enc Encoding tables
ext/Encode/Encode/koi8-r.ucm Encoding tables
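A hedged sketch of the Encode interface as it stood in bleedperl (the API was still settling, so treat the function names as provisional):

use Encode qw(from_to decode);

my $octets = "Fran\xE7ois";                  # Latin-1 bytes
from_to($octets, "iso-8859-1", "utf8");      # convert the octets in place to UTF-8
my $string = decode("utf8", $octets);        # turn UTF-8 octets into a character string
print length($string), " characters\n";      # 8 characters, even though there are 9 octets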
SADAHIRO Tomoyuki: Now Sort::UCA is available from http://homepage1.nifty.com/nomenclator/perl/indexE.htm

NAME
Sort::UCA - use UCA (Unicode Collation Algorithm)

SYNOPSIS
use Sort::UCA;

# construct
$uca = Sort::UCA->new(%tailoring);

# sort
@sorted = $uca->sort(@not_sorted);

# compare
$result = $uca->cmp($a, $b);  # returns 1, 0, or -1

SEE ALSO
http://www.unicode.org/unicode/reports/tr10/

Q11b. What about i18N POD?
PP says:
There have been recent queries on p5p regarding translations of fixed pod
sets (e.g. those that came with perl 5.6.1) to other languages. I think
it is widely supported as a "good thing" but it may be unlikely that the
translated pods would be included in the perl tar ball in future versions
- so as to keep tar ball size down and to avoid maintenance problems with
languages that the perl tar ball maintainers are not qualified to keep up
to date.
That said I can offer this bit of advice: for widest distribution try
to restrict your pod to the 7 bit ASCII char set. I found out that
with the 8 bit ISO-8859-1 chars in perlebcdic.pod that there are some
*roff implementations that do not grok the Latin-1 Char set well (nroff on
locale C Solaris 2.7 being one, the GNU version of nroff on OS/390 being
another). Note that the pod spec allows for HTML-inspired E<> escape names
for the printable Latin-1 characters but little else.
Having said that I think that the question actually pertains not to wide
distribution of pod, rather to narrow distribution of pod. E.g. using
Scandinavian pod with an appropriate char set. Given my experience with
the *roff implementations I guess I would recommend testing things out
with whatever pod2* tools you intend to use. A limited example of this
would be writing pod for translation only to html: you might make more
liberal use of the L<> construct than you would if the pod needed to go
both to pod2man and pod2html. Likewise if I knew that the Linux
implementation of pod2finnish could easily grok ISO-8859-1 then I might
not care at all if the nroff on Solaris 2.7 does not handle the
ISO-8859-1 char set all that well.
I think that maintaining a list of such cross-system incompatibilities
would be as daunting a job as, say, specifying which characters in
Mac-Roman do not map well to the latest rendition of Windows codepage
1252; which is to say a combinatorial explosion that would be difficult to
verify for accuracy.
Q12. What is JPerl?
A12. JPerl (Japanized Perl) may also be used for Japanese localization.
JPerl for MS-Windows JPerl patch for Unix
Lunde's CJKV also has references to JPerl on pp. 412 and 444-446.
If I remember correctly, kipp would create a patch that would Japanize Perl after each Perl release. Finally, JH rolled most of his changes into core Perl.
There is also a Macintosh Version of JPerl
> /usr/local/bin/perl -v

This is perl, version 5.004_04 built for i386-freebsd

Copyright 1987-1997, Larry Wall
Japanization patch 4 by Yasushi Saito, 1996
Modified by Hirofumi Watanabe, 1996-1998
jperl5.004_04-980303 LATIN version

Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5.0 source kit.
Q13. Can I just do nothing and let my program be agnostic?
A13. Some legacy applications work without knowledge of which or how many native encodings are being used, as long as the round-trip from client to server and back does not mangle the data. Generally this only works when no date or time display, casing, or sorting is required and the client encoding can be assumed.
This works better for European character sets, which are usually single-byte in a single encoding, than for Japanese, which has a multitude of multi-byte encodings, the most common being ISO-2022-JP, S-JIS and EUC-JP.
Web applications adopt the common practice of storing data in Unicode and sending data to the browser in native encodings. When multiple character encodings are required on a single web page, (like Japanese combined with non-English text), UTF-8 is used.
Q14. Why and where should I use Unicode instead of native encodings?
A14. The simple answer is that Unicode should be used internally for processing and databases now, and in the future sent to the client program.
The advantage of internal Unicode is the ability to simultaneously store multiple languages in a common format without creating separate databases for Korean, Japanese, Chinese, and English, for example.
The current advantage of sending and receiving native character sets to the user is legacy support for browsers or other older clients and operating systems.
Although Netscape and IE 4.0+ have basic Unicode support, use it only after testing. (The MyNetscape feature of netscape.com uses UTF-8 when displaying languages that would normally require multiple character sets on the same web page, like Japanese combined with non-English languages.)
Browsers generally seem to have the most difficulty displaying UTF-8 in forms, alt text and JavaScript strings (older versions of Netscape browsers would not parse non-ASCII text correctly just before a double-quote).
Andreas Koenig's Perl Conference 3.0 HTML slides are in UTF-8
Q15. What is Unicode normalization and why is it important?
A15. Normalization is a transformation applied to a Unicode character sequence. The transformation is performed according to the Unicode Standard and involves table lookups to do substitution and reordering of characters as defined in the Unicode Character Database. Normalization is necessary because Unicode allows multiple representations of the same string with combining characters and precomposed characters. Normalized character sequences allow accurate binary comparison of strings. This is especially important for purposes such as digital signatures, variable identifiers, mailbox folder names and directory searches.
Early normalization is performing normalization as close as possible to the text source, before transmitting the text to another application. This must be done for XML, HTML, URIs, and other W3C standard systems.
Late normalization is performing normalization after receiving text from another process or network transfer.
Canonical equivalence of 2 character sequences is determined by applying operations defined in the Unicode standard, without changing the visual appearance of resulting text.
Compatibility equivalence of 2 character sequences is determined by applying operations defined in the Unicode standard, and may change the visual appearance of text (for example, input text with both half and full-width katakana would result in full-width only output).
There are four normal forms (D, C, KD and KC), documented in Unicode Standard Annex #15.
Briefly:
Martin Dürst of the World Wide Web Consortium (W3) has written reference code in Perl for normalization, called Charlint - A Character Normalization Tool
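A minimal sketch, assuming the CPAN module Unicode::Normalize is installed, of why normalization matters for string comparison:

use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00E9}";        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{0301}";       # "e" followed by U+0301 COMBINING ACUTE ACCENT

print "raw:  ", ($precomposed eq $decomposed           ? "equal" : "different"), "\n";  # different
print "NFC:  ", (NFC($precomposed) eq NFC($decomposed) ? "equal" : "different"), "\n";  # equal
print "NFD:  ", (NFD($precomposed) eq NFD($decomposed) ? "equal" : "different"), "\n";  # equal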
Q16. How do I do auto-detection of Unicode streams?
A16. There is no mandatory signature, making this hard to do reliably. The optional Byte Order Mark (BOM) U+FEFF may help if present. UTF-7 and UTF-8 strings have properties that would make them identifiable most of the time unless there is substantial leading US-ASCII.
Without a signature you would need a moderate amount of text to do a reliable detection. An example of an input source that is probably not long enough would be a search widget on a web page. Statistical analysis of octets is another unreliable method.
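A heuristic check is still possible. The sketch below (a simplification that accepts a few technically invalid sequences, such as some overlong forms) tests whether a byte string could be well-formed UTF-8:

# true if the byte string parses as a sequence of UTF-8-shaped sequences
sub looks_like_utf8 {
    my $bytes = shift;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                  # US-ASCII
        | [\xC2-\xDF][\x80-\xBF]       # 2-byte sequence
        | [\xE0-\xEF][\x80-\xBF]{2}    # 3-byte sequence
        | [\xF0-\xF4][\x80-\xBF]{3}    # 4-byte sequence
    )*\z/x;
}

Random legacy text (S-JIS, ISO-8859-1) will occasionally pass this test, which is why a long sample or an explicit charset label is still preferable.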
Frank Tang's site has some UTF-8 auto detection scripts
See Ken Lunde's CJKV book, Appendix W: Perl Code Examples, for numerous character encoding, detection, and regular expressions in perl.
Q17. Is Unicode big endian or little endian?
A17. Unicode characters are not considered to be endian when stored in memory.
However, UTF-16 and UTF-32 must be serialized when saved in a file, emitted as a stream, or transferred over the network. Network byte ordering (most significant byte first or big-endian) is preferred, but you would not likely see it on Intel hardware. Mozilla 5 supports both UTF-16-LE and UTF-16-BE, etc!
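As a hedged sketch, serializing BMP code points as UTF-16 is just a pack of 16-bit values in the chosen byte order (supplementary characters would additionally need surrogate pairs):

my @codepoints = (0xFEFF, 0x0048, 0x0069, 0x263A);   # BOM, 'H', 'i', U+263A WHITE SMILING FACE
my $utf16be    = pack "n*", @codepoints;             # "n" = unsigned 16-bit, big-endian (network order)
my $utf16le    = pack "v*", @codepoints;             # "v" = unsigned 16-bit, little-endian (Intel order)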
Q18. Is there an EBCDIC-safe transformation of Unicode?
A18. Yes. UTF-EBCDIC stands for EBCDIC-friendly Unicode (or UCS) Transformation Format. Unicode Technical Report #16
PP says:
As of Perl 5.6.1 (and also in bleedperl developers versions 5.7.1++) the use utf8; pragma on an EBCDIC machine will have the same effect with respect to UTF-EBCDIC that use utf8; has with respect to UTF-8 on an ASCII based machine. Hence there is about as much support for UTF-EBCDIC in current Perls as there is for UTF-8 on ASCII-based machines.
Where to Use UTF-EBCDIC?
UTF-EBCDIC is intended to be used inside EBCDIC systems or in closed networks where there is a dependency on EBCDIC hard-coding assumptions. It is not meant to be used for open interchange among heterogeneous platforms using different data encodings. Due to specific requirements for ASCII encoding for line endings in some Internet protocols, UTF-EBCDIC is unsuitable for use over the Internet using such protocols. UTF-8 or UTF-16 forms should be used in open interchange.
Peter Prymmer's Perl EBCDIC document
Q19. Are there security implications in i18N?
A19. Yes. Since locale affects Perl's understanding of strings, operations like character comparison may produce unexpected results when the locale is manipulated.
The UTF-8 Corrigendum better defines UTF-8 by forbidding generation and interpretation of non-shortest forms, which will help disambiguate characters.
Perl taints strings affected by locale.
An example is using a regex with \w to check user input. \w may accept character codes for punctuation characters if the locale "says" they are alpha codes.
From the Sun man attributes page:
For applications to get full support of internationalization services, dynamic binding has to be applied. Statically bound programs will only get support for C and POSIX locales.
CERT* Advisory CA-97.10 - Topic: Vulnerability in Natural Language Service
Some comments from Bruce Schneier
Q20. Are there performance issues in i18N?
A20. Yes. Locale-dependent collation is generally slower than binary comparisons.
Some of the reasons are:
A contributor reports that when using regular expressions like /^([\w]*)([\W]*)/ in perl 5.6, performance has been measured to be similar to that in pre-5.6 on straight ASCII.
Q21. How do I localize strings in my program?
A21. The best way is to rip them out and save them in an external string resource file. The program API should load strings by string_id (preferably a descriptive text literal rather than a cryptic numeric code like Windows resources) and lang_locale (either a default locale or an explicit locale). An additional complication is that text may have parameters that are substituted in locale-dependent positions (Perl does not yet have printf's reordering edit descriptors, e.g. %1$s). Some programmers provide the default locale's string as an argument to the catalog API call. This may be OK where the program is small and the localizer is also a developer with access to the source code, but I do not recommend this for large projects or where strings are often updated. Leaving interface text in the source code is bad.
Another advantage of using string resource files is that translators can work on the files without touching your source code. This reduces cost, avoids inadvertent source mgmt. errors and limits write access to the repository.
There are 4 strategies for resource file formats:
Programming for Internationalization FAQ talks about gettext, by Michael K. Gschwind.
GNU gettext manual and
GNU gettext download
SunDocs has information on gettext here.
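For comparison with gettext, here is a minimal sketch of the lexicon approach using the CPAN module Locale::Maketext (see Q11); the package names and lexicon keys are invented for illustration:

package MyApp::L10N;
use base qw(Locale::Maketext);

package MyApp::L10N::en;
use base qw(MyApp::L10N);
our %Lexicon = ( 'files_found' => 'Found [quant,_1,file,files].' );

package MyApp::L10N::fr;
use base qw(MyApp::L10N);
our %Lexicon = ( 'files_found' => '[quant,_1,fichier,fichiers] trouvé(s).' );

package main;
my $lh = MyApp::L10N->get_handle('fr', 'en') or die "no lexicon found";
print $lh->maketext('files_found', 3), "\n";   # prints the French form for 3 files

The bracket notation ([quant,...]) lets the lexicon, rather than your program logic, decide locale-specific details such as plural forms.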
Q22. I do database programming with Perl. Can I use Unicode?
A22. This is well-supported with Oracle 8 and Sybase. Usually your database must be installed with the correct character set. WS from GlobalSight recommends setting your data locale to whichever your office most commonly uses for reporting.
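Before relying on a particular driver, it is worth running a hedged round-trip test. The sketch below assumes a working DSN, a table named customers with a name column, and that the client character set (for Oracle, the character set portion of NLS_LANG) matches the UTF-8 data:

use DBI;

# example connection values only; substitute your own
my ($dsn, $user, $pass) = ('dbi:Oracle:orcl', 'scott', 'tiger');

my $dbh  = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });
my $name = "Fran\xC3\xA7ois";                 # "Francois" with c-cedilla, as raw UTF-8 octets

$dbh->do("INSERT INTO customers (name) VALUES (?)", undef, $name);
my ($back) = $dbh->selectrow_array(
    "SELECT name FROM customers WHERE name = ?", undef, $name);

print "round trip ", (defined $back && $back eq $name ? "ok" : "mangled"), "\n";
$dbh->disconnect;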
Oracle 8
Oracle has broad i18N support. It is important to read the correct documentation for your version of Oracle.
Likely for legacy reasons, Oracle servers must be configured with 2 character sets:
Oracle National Language Support (NLS)
Oracle 8 National Language Support: An Oracle Technical White Paper
Oracle iAS (Internet Application Server) is a very interesting product. It is bundled with Apache, and integrates Apache language content negotiation with Java local support for impressive locale mgmt. There is a demo called World-O-Books.
Oracle will likely add a UTF-16 encoding in 2001.
It has been reported that DBI does not work with UTF-8 stored in Oracle nvarchar2 fields.
Sybase
Configure with UTF-8. Sybase has an alpha version of UTF-16 with beta release in November, 2000 and full release in 2001. New column types unichar and unitext have been added, which may be optionally configured for surrogate use or not. They added UTF-16 to meet user requests for a wide Unicode datapath for Java and NT applications.
Microsoft SQL Server
Michka Kaplan has this to say:
SQL Server supports the datatypes NTEXT, NCHAR, and NVARCHAR, all of which are stored as UCS-2. When such a column is indexed, the index is Unicode. SQL Server 7.0 only supports one language collation at the server level... this choice affects the actual ordering of all such indexes.
SQL Server 2000 supports a COLLATE keyword that allows you to specify a collation at the database or field level and thus choose a different language for such columns/indexes if you like (I discuss practical details and implications of this feature in an upcoming article in the Visual Basic Programmer's Journal, tentatively scheduled for November). See also his book, Internationalization with Visual Basic.
In any case, you can certainly query any such field in either SQL 7.0 or SQL 2000.
Informix
*** TBA ***
DB2
*** TBA ***
MySQL
MySQL supports many encodings, but not Unicode, and only one encoding can be used per database. (So you could create a Korean-only or Japanese-only database.)
The MySQL manual lists Unicode as one of the "things that must be done in the real near future."
By default, accented characters are collated to the same value. This can be overridden by/compensated by:
A work-around would be to use the default ISO-8859-1 and have all your database routines convert to UTF-8 and translate to native encodings. You would lose the ability to sort automatically and would need to do your own conversion to native character encodings for presentation to the client.
Progress Software is considering combining ICU and MySQL. Progress has experience adding Unicode to their Progress database product.
Basis Technology has added Unicode support to MySQL using Rosette's Unicode library and may be involved with adding ICU support also.
PostgreSQL
There is an install option --enable-multibyte=UNICODE for Unicode support. The option --enable-unicode-conversion enables Unicode to legacy conversion tables. No further information so far on how well it works.
PostgreSQL Localization Documentation
Q23. I do database programming with Perl. What are the i18N issues?
A23. The usual issues must be dealt with, like collating (sorting) and conversion to native encodings for application input and output.
A new issue is ensuring your database columns are wide enough. UTF-8 for example can triple the size of a Japanese column.
Also, if your fields were designed for English text, they will have to be up to 100% wider for other languages, like French.
So to be safe, I would quadruple originally English ASCII text columns for use with UTF-8 databases that will store CJKV characters.
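A quick way to see the expansion is to measure the same text in two encodings; a hedged sketch using Jcode (see Q11), with a two-kanji sample word:

use Jcode;

my $euc  = "\xB4\xC1\xBB\xFA";                 # the word "kanji" in EUC-JP: 2 bytes per character
my $utf8 = Jcode->new($euc, 'euc')->utf8;      # the same two characters in UTF-8: 3 bytes each

printf "EUC-JP: %d bytes, UTF-8: %d bytes\n", length($euc), length($utf8);   # 4 vs. 6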
While many people use HTML entity codes ("entification") to represent accented characters in static HTML, this aggravates collation so I recommend using actual character values in databases and content management systems instead of entity values. A post-processing filter can always be used if there is a need for entity codes. One such requirement would be for foreign language support in the older Mac browser versions. Also beware of user page creation tools that gratuitously emit smart quotes HTML output.
Q24. How do other programming languages implement Unicode and i18N?
A24. Programming Languages
Ken Lunde's CJKV, p. 410 also describes some features of C/C++, Java, Python and TCL.
Tony Graham's A Primer for Unicode also has a language survey chapter.
Unicode: A Primer has a good comparison of programming languages
Java language
Java has what programmers generally acclaim as the best Unicode support. Since Java 1.1, i18N support is more or less based on ICU.
Oddly, there is an interface module to Java called Java.pm, a Perl module for extracting information from Java class files. No functions to parse Java resource bundles though.
See Also
O'Reilly's Java In a Nutshell, Chapter 11
Sun's Java i18N Tutorial
IBM's Unicode site
C language
C provides the wchar_t and wchar_t * type for handling wide characters and strings. Since wchar_t varies in width by compiler, for portable programming you should define a macro of the appropriate width rather than use wchar_t directly. ISO C Amendment 1 and ISO C 99 both added a rich set of new library functions for handling wchar_t strings to the language. C can also handle Unicode by using UTF-8 as the multi-byte encoding in the char * type, such that upgrades from ASCII to UTF-8 can be implemented with relatively minor changes in most existing software.
Regular C string functions like strcmp may not work with embedded null values.
Since most operating systems are substantially written in C, the C locale libraries are available to applications as well.
ICU (International Classes for Unicode) may be used to internationalize C/C++ programs.
Basis Technology's Rosette Library C/C++ is commonly used for adding Unicode support to custom programs. For a demo, try Lycos.
If you are porting old C/C++ source to Unicode on Unix or Windows, you may want to try OneRealm's Eliminator program for converting and auditing source code.
C++ language
See C language above.
ECMAscript (aka JavaScript, LiveScript)
Ada95 language
Ada95 was designed for Unicode support and the Ada95 standard library features special ISO 10646-1 data types Wide_Character and Wide_String, as well as numerous associated procedures and functions. The GNU Ada95 compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of wide characters. This allows you to use UTF-8 in both source code and application I/O. To activate it in the application, use "WCEM=8" in the FORM string when opening a file, and use compiler option "-gnatW8" if the source code is in UTF-8. See the GNAT and Ada95 reference manuals for details.
XML
XML requires parsers to support both UTF-8 and UTF-16. Optionally, XML parsers may support other character sets, such as ASCII.
A contributor writes:
"It's like so: if you're using UTF-16, you *must* have the BOM. If
you're using UTF-16-LE or BE, then even though these are really UTF-16,
you can avoid using the BOM. Some people over at the IETF are trying
to *forbid* the BOM with the -BE and -LE versions, which essentially
means that they shouldn't be used for XML, since XML is all about
interoperability, and byte order just can't be relied on outside the
firewall."
Extensible Markup Language (XML) 1.0 - Character Encoding in Entities
Extensible Markup Language (XML) 1.0 - Autodetection of Character Encodings (Non-Normative)
The CPAN module XML::Parser has its own character transcoding routines (multi-byte no less), which are documented in the source in Expat/expat/xmlparse/xmlparse.h, Expat/encoding.h and Expat/Expat.xs. There is an interesting description of the lack of standards in the S-JIS and EUC-JP character set encodings in Japanese_Encodings.msg.
XML::Parser uses wchar_t, which the Unicode Standard recommends against as wchar_t varies in width depending on the compiler. XML::Parser also engages in old-fashioned C character processing loops, which are inadvisable in i18N programming.
Python
TCL
How to write a transformation (channel)
Q25. What support for Unicode do web browsers have?
A25. This varies depending on browser version.
Netscape 4.x and IE 4.x and above support UTF-8 and possibly other transformation formats reasonably well. Statistics I have seen show that 99% of page views are with 4.x and above. UTF-8 is fine for content pages.
Some problems that have been observed are:
Q26. How can I i18N my web pages and CGI programs?
A26. In a nutshell, by generating the correct headers and HTML meta tags.
HTML4 uses UCS as its document character set, but the document may be transferred with a different character encoding. That is why numeric entity codes index UCS code values rather than S-JIS or ISO-8859-1 or whatever character encoding was used to transfer the document.
Please see Notes on Internationalization for details. To summarize, it recommends configuring your web server to send the correct Content-type character encoding response and not using META CHARSET tags, especially with ISO-8859-1 text. Some older browsers are known to ignore charsets that are not ISO-8859-1, so this may not be a perfect solution.
When generating static pages, I would emit the correct META tag to help browsers of foreign readers default to the correct character encoding. (Do a view source on this page to see one.)
The correct meta tag is helpful even when the server could send the correct header, because some browser versions do not understand non-ISO-8859-1 charset header names, like ISO-8859-5 for example. Otherwise older browsers may see the file as an unrecognizable encoding and treat it as a binary that must be saved to disk. Although the language meta tag should appear as the first tag in the head block, it may not make that much difference where it is placed as some believe the entire head tag is processed as a unit.
To make your HTML render correctly, avoid font face tags in CJKV pages that name fonts users may not have. Otherwise mojibake or hollow squares will likely result (observed with Netscape browsers on both Windows and Mac platforms.) Using a very general font tag like <font face="sans-serif,Arial,Helvetica,serif"> may work by providing enough defaults so some font will match. Also, the non-breaking space tag may cause visual problems when displaying Japanese text on the Mac in older Netscape browsers.
A work-around for displaying multiple native encodings on the screen simultaneously is to use multiple frames, one for each charset.
It has been reported that as of now (Sept. 12), UTF-8 is not usable as a general charset for public Internet use when Asian languages are displayed on Netscape 4.x because of issues of an available Unicode font and low likelihood that your audience can find and install a Unicode font.
HTML::Entities provides decode_entities and encode_entities for converting between HTML entity codes and literal characters. W3C SGML Entities
When generating dynamic pages, emit the correct server response header:
use CGI qw(:standard);
print header("text/html; charset=iso-8859-1"),
start_html();
If you want to generate the correct content type and language for creating static HTML pages with CGI.pm, you can modify the start_html subroutine in CGI.pm to
emit the custom head tags first, then:
print start_html( -head => q{
});
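A hedged alternative that avoids touching CGI.pm itself is to pass the META tag through the -head argument using CGI.pm's HTML shortcuts (the charset here is just an example value):

use CGI qw(:standard);

print header(-type => 'text/html', -charset => 'utf-8'),
      start_html(
          -title => 'Bonjour',
          -head  => meta({ -http_equiv => 'Content-Type',
                           -content    => 'text/html; charset=utf-8' }),
      );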
Forms may be constructed with a hidden field with charset input and locale (language and country) fields.
IE5 and later will fill in a hidden field called _charset_:
<input type="hidden" name="_charset_">
IE4 and IE5 will submit characters that do not fit into the charset used for form submission as HTML numeric character references (e.g. &#12345;). This may need special decoding.
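A hedged sketch of such decoding (in 5.6, chr() on code points above 255 yields Unicode characters):

# convert HTML numeric character references back to characters
sub decode_numeric_refs {
    my $text = shift;
    $text =~ s/&#(\d+);/chr($1)/ge;                   # decimal form, e.g. &#12354;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex $1)/ge;  # hexadecimal form, e.g. &#x3042;
    return $text;
}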
Forms may also be constructed with a hidden field containing known sample text to be tested for mangling. This would be especially effective with Japanese, since multiple character sets are in common use, including ISO-2022-JP, Shift-JIS, and JIS.
RFC2070 recommends using the ACCEPT-CHARSET hint, like ACCEPT-CHARSET="utf-8" on both the FORM element and the FORM controls. There is substantial ambiguity in how the user, browser client, and server would interact with ACCEPT-CHARSET, so this method would need some experimentation.
With
FORM METHOD=GET ... or FORM METHOD=POST ENCTYPE="application/x-www-form-urlencoded" ...
the %HH codes represent octets, but there is no standard for specifying their character encoding. Only ENCTYPE="multipart/form-data" provides the opportunity to send the encoding information; so far only Lynx uses it.
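To make the octet issue concrete, here's a minimal sketch of decoding such a query string by hand; the resulting bytes are still in whatever charset the browser used (often the charset of the page containing the form), which you have to track or guess yourself. The sample data is hypothetical:
# $query might come from $ENV{QUERY_STRING} or a POST body.
my $query = 'name=%93%8C%8B%9E&lang=ja';   # hypothetical Shift-JIS octets in a name field

my %form;
foreach my $pair (split /[&;]/, $query) {
    my ($key, $value) = split /=/, $pair, 2;
    $value = '' unless defined $value;
    for ($key, $value) {
        tr/+/ /;                                  # '+' encodes a space
        s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;      # %HH escapes are raw octets
    }
    $form{$key} = $value;                         # bytes, in an unspecified charset
}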
Altavista's international country selector page illustrates using .GIF images to render multiple character sets on the same non-Unicode encoded page.
The recommended character set encoding for Japanese web pages is Shift-JIS.
Fascinating email on browser charset issues
Q27. How should I structure my web server directories for international content?
A27. If you have more than a few pages for each locale, the usual practice is to have separate directories for each locale to make content management easier. This is especially effective for truly localized content like local news, or when some locales get more frequent updates than others.
Here's an example:
http://home.netscape.com/fr/index.html, or perhaps even better depending on your requirements:
http://home.netscape.com/fr_FR/index.html
Q28. Can web servers automatically detect the language of the browser and display the correct localized page?
A28. Yes. HTTP/1.1 defines the details of how content negotiation works, including language content.
WWW browsers send an Accept-Language request header specifying which languages are preferred for responses. This technique works fairly well, although some versions of Netscape Navigator send an improperly formatted request parameter. Also, switching language preferences in either Navigator or IE 4 doesn't always "take" without first deleting a language hint.
Few sites do content negotiation on language, and interestingly enough I do not know of any major portals doing this. One site that does is Sun's documentation library at SunDocs. Debian.org does a very nice job of using Apache content negotiation with languages and also has some nice help on Setting the Default Language.
Apache's Content Negotiation features will select the right page to return, whether HTML or image file. Annoyingly, the match logic is very literal, so a browser request for en-us will not match a server entry of en except as a last resort. Any other exact match will win over en, even if en-us was first preference.
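If you do the negotiation yourself in a CGI or mod_perl handler, here's a minimal sketch (my own, not Apache's algorithm) of picking a language from Accept-Language with an en-us to en fallback:
my @available = qw(en fr es);                 # the languages you actually have

sub pick_language {
    my ($accept) = @_;                        # e.g. "en-us,fr;q=0.7,en;q=0.9"
    my @prefs;
    foreach my $item (split /\s*,\s*/, lc($accept || '')) {
        my ($tag, @params) = split /\s*;\s*/, $item;
        my ($q) = map { /^q=([\d.]+)/ ? $1 : () } @params;
        push @prefs, [ $tag, defined $q ? $q : 1 ];
    }
    foreach my $pref (sort { $b->[1] <=> $a->[1] } @prefs) {
        my $tag = $pref->[0];
        return $tag if grep { $_ eq $tag } @available;
        (my $primary = $tag) =~ s/-.*//;       # en-us falls back to en
        return $primary if grep { $_ eq $primary } @available;
    }
    return 'en';                               # site default
}

my $lang = pick_language($ENV{HTTP_ACCEPT_LANGUAGE});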
There are two ways of doing content negotiation with Apache: type maps (variant maps) and MultiViews.
Variant Maps
In httpd.conf, disable Options MultiViews if it is configured, and add:
AddHandler type-map var
DirectoryIndex index.var
Then create an index.var file like this:
URI: start; vary="type,language"

URI: index.html.en
Content-type: text/html
Content-language: en-GB

URI: index.html.en
Content-type: text/html
Content-language: en-US

URI: index.html.en
Content-type: text/html
Content-language: en

URI: index.html.fr
Content-type: text/html
Content-language: fr-CA

URI: index.html.fr
Content-type: text/html
Content-language: fr

URI: index.html.es
Content-type: text/html
Content-language: es
Multiviews
The MultiViews technique works like this: index.html is localized into variant documents such as index.html.en, index.html.fr, and index.html.es, and the server picks the variant that best matches the request. This does add extra server load, as each content directory must be scanned for the variant document names.
Here's an example of httpd.conf directives for this:
# in httpd.conf
AddLanguage en .en
AddLanguage fr .fr
AddLanguage es .es
#
# LanguagePriority allows you to give precedence to some languages
# in case of a tie during content negotiation.
# Just list the languages in decreasing order of preference.
#
LanguagePriority en es fr de pt
Options Multiviews
# end of httpd.conf fragment
Starting in Apache 1.2, you may also create documents with multiple language extensions.
O'Reilly's Apache - The Definitive Guide, Chapter 6
ApacheWeek article on Language Negotiation
Another ApacheWeek article on Language Negotiation
Netscape Enterprise Web Server
The Enterprise Server international edition supports multiple languages with per-language content directories. Also, LDAP may be populated with UTF-8 data, and server-side JavaScript and the search feature can have a default language specified.
When a browser requests http://www.someplace.com/somepage.html with an Accept-Language request header, the server translates that to the URL http://www.someplace.com/xx/somepage.html, where xx is a language code.
If that page is not found, the default page is served.
The magnus.conf directives that control this are:
ClientLanguage (en|fr|de|etc.): for client error messages like "Not Found" or "Access Denied"
DefaultLanguage (en|fr|de|etc.): for other error messages
AcceptLanguage (on|off)
Microsoft IIS
*** TBA ***
Q29. In what format do I send strings to the translator?
A29. Ask the translator for the recommended string file format before sending strings out for translation. They may have a specific file format in mind for use with translation memory tools. Some use Microsoft Word and prefer their import format for creating a table.
If there's no preference (or you just can't wait), here's an acceptable text format that is both human and machine readable:
project:
copyright:
date:
version:
charset:

id: MESSAGE_ID1
en: english text
xx: translated text
note: Notes.
-
id: MESSAGE_ID2
en: english text
xx: translated text
note: Notes.
-
The note field can be used to give the translator hints about context and parts of speech (noun, verb, etc.). If the final charset will be Unicode, you should consistently use Unicode throughout the production process to avoid potential conversion errors. The delimiting, mostly blank "-" line aids readability for humans.
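Here's a rough sketch of a parser for this plain-text format (my own sketch, assuming one key: value pair per line and a lone "-" line ending each message record):
use strict;

my (%header, @messages, %msg);

while (my $line = <STDIN>) {
    chomp $line;
    if ($line =~ /^-\s*$/) {                   # a lone '-' ends one message record
        push @messages, { %msg } if %msg;
        %msg = ();
    }
    elsif ($line =~ /^(\w+):\s*(.*)$/) {
        my ($key, $value) = ($1, $2);
        if (!%msg && !@messages && $key ne 'id') {
            $header{$key} = $value;            # project:, charset:, and so on
        }
        else {
            $msg{$key} = $value;               # id:, en:, xx:, note:
        }
    }
}
push @messages, { %msg } if %msg;              # in case the final '-' is missing

printf "%d messages, charset %s\n",
       scalar @messages, $header{charset} || 'unspecified';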
Here's a first pass at an XML string resource format that should be compatible with current translation tools. See http://www.opentag.com/ for more details.
<?xml version="1.0" encoding="UTF-8"?>
<strings>
<version>1.0</version>
<string id="test">
<en>test</en>
<fr>l'examen</fr>
<note>noun</note>
</string>
</strings>
Here's a parser:
use strict;
use XML::Parser;   # backend parser used by XML::Simple
use XML::Simple;

my $file = './strings.xml';
my $xs1  = XML::Simple->new();

# ForceArray/KeyAttr make XML::Simple key the <string> elements by their
# id attribute even when the file contains only one of them.
my $doc = $xs1->XMLin($file,
                      ForceArray => ['string'],
                      KeyAttr    => { string => 'id' });

foreach my $key (keys %{ $doc->{string} }) {
    print "string=$key\n";
    print "  en=",   $doc->{string}{$key}{en},   "\n";
    print "  fr=",   $doc->{string}{$key}{fr},   "\n";
    print "  note=", $doc->{string}{$key}{note}, "\n";
}
If your tools have xml:lang support, you can try this:
<?xml version="1.0" encoding="UTF-8"?>
<strings>
<version>1.0</version>
<string id="test" xml:lang="en">test</string>
<string id="test" xml:lang="fr">l'examen</string>
</strings>
Q30. What are common encodings for email?
A30. Both transfer and character set encodings must be considered.
SMTP is a 7-bit protocol that supports US-ASCII. To carry 8-bit data over 7 bits, many transfer encodings were developed, such as uuencode, BinHex, base64, and quoted-printable.
Mail may be mangled such that multiple transfer encodings are used simultaneously (most commonly observed in webmail products).
Also, MIME encoding is often used now.
Here are standard and common character set encodings used in email:
MIME allows multiple attachments, each in a different character set encoding.
What complicates email replies is that previous text is quoted in the body of the email. The quoted text is in a specific character encoding that may not be the one your MTA or MUA prefers, so transcoding may need to be done. An example would be replying in Shift-JIS to email written in EUC-JP.
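If you need to do that transcoding in Perl, here's a minimal sketch using Jcode (assuming the quoted text arrives as EUC-JP bytes on stdin; on Perl 5.8 the core Encode module can do the same):
use Jcode;

# Convert a quoted EUC-JP body to Shift-JIS before inserting it into a reply.
my $euc_body  = do { local $/; <STDIN> };          # read the quoted text (EUC-JP bytes)
my $sjis_body = Jcode->new($euc_body, 'euc')->sjis;

# On Perl 5.8+ the core Encode module offers an equivalent:
#   use Encode qw(from_to);
#   from_to($body, 'euc-jp', 'shiftjis');
print $sjis_body;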
Here are some comments from a Japanese email user:
If you write Japanese email, it is strongly recommended that you use ISO-2022-JP. The contemporary way to send Japanese email is:
* MIME-Version is 1.0,
* Content-Type is Text/Plain; charset=iso-2022-jp,
* Content-Transfer-Encoding is 7bit (not Base64 or quoted-printable)
(Some people and mailers do not attach the CTE header; I don't know which is better.)
In the early days, Japanese email was sent as ISO-2022-JP without MIME headers, but that is no longer recommended.
You should not use Shift_JIS or EUC-JP, even though the RFCs permit these encodings, because many mailers in Japan still cannot handle them.
In the (near?) future we may use UTF-8, but we should not use it now.
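Putting that advice together, here's a rough sketch of assembling such a message with Jcode and MIME::Base64 (read_message_parts is a hypothetical helper, and the header layout is illustrative):
use Jcode;
use MIME::Base64 qw(encode_base64);

# $euc_subject and $euc_body are assumed to hold EUC-JP encoded text.
my ($euc_subject, $euc_body) = read_message_parts();   # hypothetical helper

# Body: convert to ISO-2022-JP (JIS), which is 7-bit safe.
my $body = Jcode->new($euc_body, 'euc')->jis;

# Subject: an RFC 2047 encoded-word, =?ISO-2022-JP?B?...?=
my $b64     = encode_base64(Jcode->new($euc_subject, 'euc')->jis, '');
my $subject = "=?ISO-2022-JP?B?$b64?=";

print <<"MESSAGE";
Mime-Version: 1.0
Subject: $subject
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit

$body
MESSAGE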
UTF-8 is not commonly used yet, but here's a sample header:
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailer: Becky! ver. 2.00 (beta 10)
Content-Type: text/plain; charset="UTF-8"
Internet Mail Consortium: Implementing Internationalization in Internet Mail
GNU UUDeview and UUEnview Tools
Decoding Internet Attachments - A Tutorial
Q30b. What is happening with internationalized DNS?
A30b. The goal was to have a standard for hosting non-English domain names in local scripts by the end of 2000. For example, if a user wants to use Chinese ideograms as a domain name, that will be supported by both DNS resolvers and servers.
This will be accomplished by internationalizing the domain names, not DNS itself, by storing the Unicode name as an encoded ASCII sequence in DNS. An example of the RACE algorithm's encoded ASCII form of some Japanese text is bq--gbqxg.com
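Here's a very rough sketch using the Convert::RACE module listed below; I believe it exports to_race and from_race and expects UTF-16 (network byte order) strings, but treat the exact interface as an assumption and check the module's documentation:
use Convert::RACE qw(to_race from_race);

# Assumption: the functions take and return UTF-16 (network byte order) strings.
my $utf16 = pack('n*', 0x65E5, 0x672C);     # the two characters of "Nihon" (Japan)
my $race  = to_race($utf16);                # something like "bq--..."
my $back  = from_race($race);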
The complications with internationalized DNS are:
Right now, care must be taken when implementing iDNS, since most of the standards documents are still in draft.
IETF IDN WG
draft-ietf-idn-race-03.txt - RACE: Row-based ASCII Compatible Encoding for IDN
Perl Sample Script for draft-ietf-idn-race-02.txt
Perl Module for draft-ietf-idn-race-02.txt
draft-ietf-idn-lace-01.txt - LACE: Length-based ASCII Compatible Encoding for IDN
Convert::RACE and Convert::LACE Perl modules by Tatsuhiko Miyagawa
Internet Mail Consortium nameprep test tool
i-DNS.net International
Center for Internet Research (CIRC) iDNS (Obsolete)
Q30c. How do I manage timezones in Perl?
A30c. Adding timezone support to an application for the first time is a challenge.
Timezone boundaries are in fact dictated by law and politics, not geography. This means that a program needs a table by date and political entity (city or county level) that is updated on a monthly basis with GMT offsets to be of any use.
The generally accepted practice nowadays is to store the users' timezone setting preference as Region/City.
The city level is a fine enough granularity to define timezones accurately (so far). Of note is that city timezone policies are less subject to change than country borders, especially in politically fluid regions like Africa. Legacy alphabetic (short) identifiers like 'PST' are deprecated, or are translated to and from Region/City as needed. Storing EST as a preference is obviously wrong for half the year and is ambiguous, and storing a GMT offset is even worse, since then you cannot really tell what the timezone is at all.
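On a Unix system with the Olson tz database installed, a stored Region/City preference can be applied per process through the TZ environment variable; a minimal sketch:
use POSIX qw(tzset strftime);

# Render the same moment for a user whose stored preference is America/Los_Angeles.
my $now = time();
$ENV{TZ} = 'America/Los_Angeles';
tzset();
print strftime("%Y-%m-%d %H:%M:%S %Z\n", localtime($now));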
Perl support for timezones has strengthened recently.
The most appropriate module for new applications is Perl Wrappers for ICU. It supports both Region/City and legacy short identifiers.
For legacy alphabetic identifier only support, Date::Manip and Time::Zone are available on CPAN.
Olson tz data (used in most Unix implementations) and other documents ftp://elsie.nci.nih.gov/pub/
A site that has interesting maps and info on timezones: http://www.worldtimezone.com/
More information: http://www.twinsun.com/tz/tz-link.htm
Just about any Unix install program these days uses a Region/City picker.
The Netmax Firewall product is notable in being a mod_perl application with very sophisticated timezone support based on a graphical image world map picker for Region/City.
Date::Manip can do timezone conversions between standard timezones, but not daylight-saving timezones.
(If you are doing client-server programming, and the server stores dates in a common timezone (like PST or GMT), and the server's clock is set to standard or daylight time, and all your users also observe standard or daylight time, then Date::Manip will give you the correct timezone conversions. Otherwise the time will be off by one hour.)
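Here's a sketch of such a conversion with Date::Manip's Date_ConvTZ (check your Date::Manip version's documentation for the exact signature):
use Date::Manip;

# Convert a date stored in GMT to US Pacific standard time.
my $date = ParseDate("2001-12-01 15:00:00");
my $pst  = Date_ConvTZ($date, "GMT", "PST");
print UnixDate($pst, "%Y-%m-%d %H:%M:%S"), "\n";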
ICU/picu is probably your best bet to do more flexible timezone conversions.
A Summary of the International Standard Date and Time Notation
Military World Time Zone Map
A31. Here's a list:
Miscellaneous References
Ken Lunde has written a character set conversion C tool called jconv to convert Japanese JIS, SJIS and EUC encodings. Typical usage is
jconv -is -oe <sjis.txt >euc.txt
Some other transcoding utilities are Laux's Kakitori, Ichikawa's NKF (there is a NKF Perl wrapper also), Sato's QKC, Yano's RXJIS.
iconv can also be used (on Unix systems):
iconv -f UJIS -t ISO-2022-JP
Viewing and Inputting Japanese Text in Windows
Q32. How do I convert US-ASCII to UTF-16 on Windows NT?
A32. Here's Peter Prymmer's code to convert US-ASCII to UTF-16 on Windows NT:
$unicode_string = "\0" . join("\0",split(//,$ascii_string));
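That one-liner builds the big-endian (UTF-16BE) form by putting a NUL byte in front of each ASCII byte. For comparison, here's a sketch of the same thing with pack, including the little-endian form that Windows APIs usually expect:
my $ascii_string = "Hello";

my $utf16be = pack("n*", unpack("C*", $ascii_string));   # 0x00 0x48 0x00 0x65 ...
my $utf16le = pack("v*", unpack("C*", $ascii_string));   # 0x48 0x00 0x65 0x00 ...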
Q33. How do I transform the name of a character encoding to the MIME charset name?
A33. Here's code to transform the name of a character encoding into the corresponding MIME charset name:
#!/usr/bin/perl
# Read the name of a character encoding from stdin and transform it into
# the corresponding standardized MIME charset name, as registered at (or in
# the pipeline for) ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
# Markus Kuhn
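Here's a rough, hypothetical sketch (not Kuhn's code) of the kind of normalization such a tool performs: lowercase the name, strip punctuation, and look the result up in an alias table built from the IANA registry:
use strict;

# A tiny, illustrative alias table; a real tool would build this from the
# IANA character-sets registry.
my %mime_name = (
    'usascii'     => 'US-ASCII',
    'ansix341968' => 'US-ASCII',
    'latin1'      => 'ISO-8859-1',
    'iso88591'    => 'ISO-8859-1',
    'sjis'        => 'Shift_JIS',
    'shiftjis'    => 'Shift_JIS',
    'eucjp'       => 'EUC-JP',
    'utf8'        => 'UTF-8',
);

while (my $name = <STDIN>) {
    chomp $name;
    (my $key = lc $name) =~ s/[^a-z0-9]//g;     # "Shift-JIS", "shift_jis" -> "shiftjis"
    print $mime_name{$key} || "unknown charset: $name", "\n";
}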
Ruby and Perl Code
Communicator 4.7 Release Notes - International
Japanese Email Software
ISO C Standard
If you are looking for tools for Chinese, you will find many things here:
http://www.mandarintools.com/
and here for Japanese:
http://ftp.cc.monash.edu.au/pub/nihongo/00INDEX.html
This free editor for Windows (95/98/NT/CE) is able to handle many different encodings for Japanese, including EUC-JP and UTF-8:
http://www.physics.ucla.edu/~grosenth/jwpce.html
Unicode resources on the net.
IBM Classes for Unicode (ICU), ex-Taligent, available in C++ and Java
http://czyborra.com/unicode/ucoverage lists the coverage of an *-iso10646-1 BDF font or a Unicode mapping file, by Roman Czyborra
The 19 × 21 × 28 = 11,172 Hangul syllables are all that are needed for Modern Hangul.
#!/usr/local/bin/perl
# Print the Unicode name of each of the 11,172 Hangul syllables (U+AC00..U+D7A3).
$i = 0;
foreach $choseong (split(/;/,
    "G;GG;N;D;DD;L;M;B;BB;S;SS;;J;JJ;C;K;T;P;H")) {
    foreach $jungseong (split(/;/,
        "A;AE;YA;YAE;EO;E;YEO;YE;O;WA;WAE;OE;YO;U;WEO;WE;WI;YU;EU;YI;I")) {
        foreach $jongseong (split(/;/,
            ";G;GG;GS;N;NJ;NH;D;L;LG;LM;LB;LS;LT;LP;LH;M;B;BS;S;SS;NG;J;C;K;T;P;H")) {
            printf("U+%X:%s\n", 0xAC00 + $i++,
                   "HANGUL SYLLABLE $choseong$jungseong$jongseong");
        }
    }
}
Tony Graham's Unicode Slides
Concepts of C/UNIX Internationalization by Dave Johnson
Mike Brown's XML and Character encoding
Setting up Macintosh Web Browsers for Multilingual and Unicode Support
Japanisation FAQ for computers running Western Windows
XML Japanese Profile W3C Note 14 April 2000
Erik's Various Internet and Internationalization Links
email from spc on unicode list
Herman Ranes' Japanese and Esperanto page in UTF-8
Why Can't they Just Speak ? - 102 Languages in UTF-8
Alan Wood's Unicode Resources - Unicode and Multilingual Support in Web Browsers and HTML
I Can Eat Glass ... Project
Origin of euro
Unicode Mailing List Archives
Braille
p5pdigest article on Unicode?
Alan Wood's Unicode Resources: Windows Fonts that Support Unicode
IETF Internationalized Domain Names WG
Repertoires of characters used to write the indigenous languages of Europe: A CEN Workshop Agreement
} END {
A language is just a dialect with an army and a navy.