Matz's Diary (2006-06-22)

2006-06-22 [Multi-year diary]

_ [Ruby] Unicode

A library for handling Unicode in Ruby, announced in [ruby-talk:197946].

Download it from <URL:ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2>.

Usage looks like this.

Unicode strings can be obtained by applying the + unary operator to native strings, e.g. +"Hello" (where the native string is encoded in the default encoding).

% irb -I. -runicode -Ku
irb(main):001:0> ustr = +"π is pi"
=> +"π is pi"

Native strings are obtained from Unicode strings by calling to_s, which accepts an optional argument to indicate the desired encoding.

irb(main):002:0> str = ustr.to_s
=> "π is pi"
irb(main):003:0> str.encoding
=> Unicode::Encoding::UTF8

Individual characters can be indexed from Unicode strings, returning a Unicode::Character object.

irb(main):004:0> ustr[0]
=> U+03C0 GREEK SMALL LETTER PI

Case conversion is handled as with native strings.

irb(main):005:0> ustr.upcase
=> +"Π IS PI"

Normalization is accomplished with the ~ unary operator.

irb(main):006:0> ustr = +"mí"
=> +"mí"
irb(main):007:0> ustr.to_a
=> [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
irb(main):008:0> (~ustr).each_char { |ch| p ch }
U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT
=> +"mí"

Really interesting.
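
For readers without the library at hand, the same walkthrough can be approximated with the String methods built into current Ruby. This is only a sketch, assuming Ruby 2.4 or later; it uses the core String API rather than the library's `+` and `~` operators, and the variable names are illustrative.

```ruby
# Assumes Ruby 2.4+ (Unicode-aware upcase; unicode_normalize needs 2.2+).
# Core String API, not the Unicode library quoted above.
ustr = "\u03C0 is pi"             # "π is pi" as a UTF-8 string

first = ustr[0]                   # indexing returns a one-character String
puts format("U+%04X", first.ord)  # code point of the first character
puts ustr.upcase                  # Unicode-aware case mapping

composed   = "m\u00ED"                         # "mí" with precomposed U+00ED
decomposed = composed.unicode_normalize(:nfd)  # canonical decomposition
decomposed.each_char { |ch| puts format("U+%04X", ch.ord) }
```

The decomposition step corresponds to what the `~` operator does in the session above: U+00ED is split into U+0069 followed by U+0301.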

_ [Ruby] auto conversion

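Ruby M17N is designed primarily to process multiple encodings with (as far as possible) no conversion. Setting aside the one-program-one-encoding case, as in C's locale model, I had been thinking that when multiple encodings are mixed, the data ultimately has to be converted to a unified internal character set (Universal Character Set - UCS) before processing.

Or rather, the truth is that I had not paid much attention to conversion. This is where Ruby differs from other languages (Perl, Python and so on), whose basic approach is conversion to Unicode.

That said, for practical use conversion has to happen somewhere, and I had assumed it would surely be done in IO.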
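
As an illustration of conversion at the IO boundary, here is a minimal sketch using the external:internal encoding pair that Ruby's IO accepts (assuming Ruby 2.1 or later). The helper name `read_as_utf8` and the sample data are hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: read a file stored in `external` encoding and hand
# the program UTF-8 strings, so conversion happens only at the IO boundary.
def read_as_utf8(path, external)
  File.read(path, encoding: "#{external}:UTF-8")
end

text = "\u65E5\u672C\u8A9E"        # "日本語" as UTF-8
result = nil

Tempfile.create("eucjp") do |f|
  f.binmode
  f.write(text.encode("EUC-JP"))   # store EUC-JP bytes on disk
  f.flush
  result = read_as_utf8(f.path, "EUC-JP")
end

puts result.encoding               # encoding of the string after reading
puts result == text                # compares against the UTF-8 original
```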

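However, an opinion strongly recommending automatic conversion (coercing) has come up. [ruby-talk:198475]

I have shied away from automatic conversion for reasons such as:

Two encodings are not necessarily convertible into each other
Conversion may silently lose information
When an error occurs, it tends to be hard to tell where the data came from

but this proposal is fairly concrete.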

#
# NOTES:
# a) String#recode!(new_encoding) replaces current
#    internal byte representation with new byte sequence,
#    that is recoded current. must raise IncompatibleCharError, if
#    can't convert char to destination encoding
# b) downgrading string from some stated encoding to "none"  tag must
#    be done only explicitly.
#    it is not an option for implicit conversion
# c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
#    set once and only once per application run.
#    Intent: we want all strings which aren't raw bytes to be in one
#    single predefined encoding,
#    so all operations on string must return string in conformant encoding.
#    Desired encoding is value of $APPLICATION_UNIVERSAL_ENCODING.
#    If $APPLICATION_UNIVERSAL_ENCODING is nil, we go in "democracy
#    mode", see below.
#
def coerce_encodings(str1, str2)
   enc1 = str1.encoding
   enc2 = str2.encoding

   # simple case, same encodings, will return fast in most cases
   return if enc1 == enc2

   # another simple but rare case, totally incompatible encodings, as
   # they represent incompatible charsets
   if fully_incompatible_charsets?(enc1, enc2)
        raise IncompatibleCharError, format("incompatible charsets %s and %s", enc1, enc2)
   end

   # uncertainty, handling "none" and preset encoding
   if enc1 == "none" || enc2 == "none"
        raise UnknownIntentEncodingError, format("can't implicitly coerce encodings %s and %s, use explicit conversion", enc1, enc2)
   end

   # Tyranny mode:
   # we want all strings which aren't raw bytes to be in one single
   # predefined encoding
   if $APPLICATION_UNIVERSAL_ENCODING
        str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
        str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
        return
   end

   # Democracy mode:
   # first try to perform non-loss conversion from one encoding to another:
   # 1) direct conversion, without loss, to another encoding, e.g. UTF8 + UTF16
   if exists_direct_non_loss_conversion?(enc1, enc2)
        if exists_direct_non_loss_conversion?(enc2, enc1)
        # performance hint if both available
           if str1.byte_length < str2.byte_length
                str1.recode!(enc2)
           else
                str2.recode!(enc1)
           end
        else
                str1.recode!(enc2)
        end
        return
   end
   if exists_direct_non_loss_conversion?(enc2, enc1)
        str2.recode!(enc1)
        return
   end

   # 2) non-loss conversion to superset
   # (I see no reason to raise exception on KOI8R + CP1251,
   # returning string in Unicode will be OK)
   if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
        str1.recode!(superset_encoding)
        str2.recode!(superset_encoding)
        return
   end

   # A case for incomplete compatibility:
   # Check if subset of enc1 is also subset of enc2,
   # so some strings in enc1 can be safely recoded to enc2,
   # e.g. two pure ASCII strings, whatever ASCII-compatible encoding
   # they have
   if exists_partial_loss_conversion?(enc1, enc2)
        if exists_partial_loss_conversion?(enc2, enc1)
           # performance hint if both available
           if str1.byte_length < str2.byte_length
                str1.recode!(enc2)
           else
                str2.recode!(enc1)
           end
        else
                str1.recode!(enc2)
        end
        return
   end

   # the last thing we can try
   str2.recode!(enc1)
end
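
The proposal depends on predicates such as `exists_direct_non_loss_conversion?` that the post does not define, so it cannot be run as quoted. The following is a rough sketch of the same idea, assuming Ruby 2.0 or later: instead of those predicates it simply attempts each conversion with `String#encode` and treats a failed conversion as "would lose information". The method name `coerce_encodings_sketch` and the `universal:` keyword are hypothetical, and unlike the proposal it returns new strings rather than recoding in place.

```ruby
# Hypothetical sketch, not the proposal's implementation.
def coerce_encodings_sketch(str1, str2, universal: nil)
  return [str1, str2] if str1.encoding == str2.encoding

  # "Tyranny mode": everything goes to one predefined encoding.
  return [str1.encode(universal), str2.encode(universal)] if universal

  # "Democracy mode": try each direction, then a Unicode superset.
  attempts = [
    -> { [str1.encode(str2.encoding), str2] },
    -> { [str1, str2.encode(str1.encoding)] },
    -> { [str1.encode(Encoding::UTF_8), str2.encode(Encoding::UTF_8)] }
  ]
  attempts.each do |attempt|
    begin
      return attempt.call
    rescue Encoding::UndefinedConversionError, Encoding::ConverterNotFoundError
      next # this direction would lose characters, try the next one
    end
  end
  raise ArgumentError, "cannot coerce #{str1.encoding} and #{str2.encoding}"
end

koi8  = "\u043F\u0440\u0438\u0432\u0435\u0442".encode("KOI8-R")  # "привет"
latin = "caf\u00E9".encode("ISO-8859-1")                         # "café"
a, b = coerce_encodings_sketch(koi8, latin)
puts a.encoding, b.encoding   # neither side fits the other, so both go to the superset
```

Trying conversions and rescuing is slower than consulting a table of known-compatible encodings, but it keeps the sketch short and never drops a character silently.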

Hmm, interesting (I keep saying that).

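It is true that the usual application models are mostly

One program, one encoding (though it can be switched)
One program, one internal encoding (probably Unicode)

so with that in mind, something along these lines may not be so bad. Still, having the contents of a string replaced without my noticing is a little scary.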
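
The concern about string contents changing underneath the program can be made concrete with a small sketch. It assumes Ruby 1.9 or later and uses the real `String#encode!` as a stand-in for the proposed `recode!`; the variable names are illustrative.

```ruby
# Sketch: in-place recoding is visible through every reference to the string.
original = "caf\u00E9".encode("ISO-8859-1")  # "café", 4 bytes in ISO-8859-1
alias_of_original = original                 # second reference, same object

original.encode!("UTF-8")                    # in-place, like the proposed recode!

puts alias_of_original.encoding              # the other reference changed too
puts alias_of_original.bytesize              # no longer the original 4 bytes
```

Code that stored the string earlier, for example to write its bytes to a socket, now sees a different byte sequence without having asked for it.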

Today's comments (1 total)

_ maeda (2006-06-25 04:28)

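Recently I had occasion to write a lot of Ruby programs that deal with multiple character encodings. Most of them simply process input text and write it to a file, but the input, the output, the log output, and the program's own encoding (KCODE) may each be different. (cp932, euc-jp, utf-8)

KCODE -> input encoding
input -> KCODE
KCODE -> messages
KCODE -> output
input -> output
I wrote conversion methods for these (using NKF) and picked among them as needed. But because I tried to use only input -> output wherever possible, I often forgot to convert when comparing input against constants or looking up a hash table with an input string as the key.

That said, Iconv, which lets you handle unconvertible characters yourself, was very convenient (I needed to turn Windows extension characters into utf-8 with as little loss as possible). Programming is certainly easier in Perl, but when it is given characters it cannot handle, it silently drops information and also prints unwanted error messages, so I could not use it.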
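
Both points in this comment can be reproduced in current Ruby. The sketch below assumes Ruby 1.9 or later and uses `String#encode` rather than NKF or Iconv; the table, the keys, and the fallback format are made-up examples.

```ruby
# Sketch: a forgotten conversion makes a hash lookup miss.
table = { "\u65E5\u672C\u8A9E" => :japanese }    # key "日本語" in UTF-8

input = "\u65E5\u672C\u8A9E".encode("EUC-JP")    # same text, EUC-JP bytes
puts table[input].inspect                        # lookup without conversion
puts table[input.encode("UTF-8")].inspect        # lookup after conversion

# Handling an unconvertible character yourself through the :fallback option.
fallback = ->(ch) { format("&#x%X;", ch.ord) }
converted = "\u03C0 = 3.14".encode("ISO-8859-1", fallback: fallback)
puts converted                                   # pi is replaced by the fallback text
```

The first lookup misses because the key and the input hold the same characters as different byte sequences under different encodings, which is the kind of mistake that is easy to make when most data flows straight from input to output.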
