
Crypto-Gram

July 15, 2000

by Bruce Schneier
Founder and CTO
Counterpane Internet Security, Inc.
schneier@counterpane.com
http://www.counterpane.com

A free monthly newsletter providing summaries, analyses, insights, and commentaries on computer security and cryptography.

Back issues are available at http://www.counterpane.com. To subscribe or unsubscribe, see below.

Copyright (c) 2000 by Counterpane Internet Security, Inc.




Security Risks of Unicode

Unicode is an international character set. Like ASCII, it provides a standard correspondence between the binary numbers that computers understand and the letters, digits, and punctuation that people understand. But unlike ASCII, it seeks to provide a code for every character in every language in the world. Doing this requires more than the 256 characters available in an 8-bit character set; by default Unicode uses 16-bit characters, and there are rules to extend even that.

I don't know if anyone has considered the security implications of this.

Remember all those input validation attacks that were based on replacing characters with alternate representations, or that explored alternative delimiters? For example, there was a hole in the IRIX Web server: if you could replace spaces with tabs you could fool the parser, and you could use hexadecimal escapes, strange quoting, and nulls to defeat input validation.
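A minimal sketch in Python of that class of attack, using a hypothetical filter rather than the actual IRIX code: the check blocks the obvious dangerous characters, but tabs, hexadecimal escapes, and embedded nulls only reveal themselves after a later decoding or parsing step.

import urllib.parse

def naive_is_safe(arg: str) -> bool:
    # Rejects only the "obvious" dangerous characters.
    return not any(c in arg for c in " ;|&<>")

attacks = [
    "foo\tbar",            # a tab instead of a space fools the downstream parser
    "foo%3Bls",            # ';' hidden behind a hex escape until later decoding
    "report.txt%00.html",  # hex-escaped NUL: C code later truncates at the NUL
]

for a in attacks:
    decoded = urllib.parse.unquote(a)  # decoding happens *after* validation
    print(repr(a), "passes filter:", naive_is_safe(a),
          "-> later seen as", repr(decoded))

All three payloads pass the filter unchanged; the dangerous character only appears once some helpful layer downstream decodes or truncates the string.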

The Unicode specification includes all sorts of complicated new escape sequences. There are encoding forms called UTF-8 and UTF-16, which allow several possible representations of the same character code; several different places where control characters pop through; a scheme for placing diacriticals and accents in separate combining characters (looking very much like an escape); and hundreds of brand-new punctuation characters and other nonalphabetic characters.
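A minimal Python sketch of two of these ambiguities, with a hypothetical byte-scanning filter: a precomposed letter and its base-letter-plus-combining-accent form compare unequal unless normalized, and a filter scanning bytes for '/' misses the "overlong" UTF-8 spelling of the same slash that lenient decoders of the era accepted.

import unicodedata

precomposed = "\u00e9"     # 'e with acute' as a single code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                                   # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)     # True

# The same ambiguity exists at the byte level: lenient UTF-8 decoders
# accepted "overlong" encodings, so a filter scanning bytes for 0x2F ('/')
# misses the same slash smuggled in as a two-byte sequence.
overlong_slash = b"\xc0\xaf"
print(b"/" in overlong_slash)                                      # False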

The philosophy behind the Unicode spec is to provide all possible useful characters for applications that are 8- or 16-bit clean. This is admirable, but it is nearly impossible to filter a Unicode character stream to decide what is "safe" in some application and what is not.

What happens when:

- We start attaching semantics to the new characters as delimiters, white space, etc.? With thousands of characters, and new characters being added all the time, it will be extremely difficult to categorize all the possible characters consistently, and where there is inconsistency, there tend to be security holes. (See the sketch after this list.)

- Somebody uses "modifier" characters in an unexpected way?

- Somebody uses UTF-8 or UTF-16 to encode a conventional character in a novel way to bypass validation checks?
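To make the first point concrete, here is a minimal sketch with a hypothetical filter: the validator strips the four ASCII whitespace characters it knows about, but Python's own str.split(), like many helpful upper layers, already treats NO-BREAK SPACE and a dozen other code points as white space, so the "cleaned" value still breaks into two tokens.

ASCII_WHITESPACE = " \t\r\n"

def strip_whitespace(value: str) -> str:
    # The validator's (too small) notion of white space.
    return "".join(c for c in value if c not in ASCII_WHITESPACE)

payload = "admin\u00a0--force"   # U+00A0 NO-BREAK SPACE, unknown to the filter
cleaned = strip_whitespace(payload)

print(repr(cleaned))    # 'admin\xa0--force' -- one token, as far as the filter knows
print(cleaned.split())  # ['admin', '--force'] -- the upper layer sees two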

With the ASCII character set, we could carefully study a small selection of characters, categorize them clearly, and make relatively straightforward decisions about the nature of each character. And even here, there have been mistakes (forgetting about tabs, multicharacter control-sequence snafus, etc.). Still, a careful designer can figure out a safe way to deal with any possible character that can come off an untrusted wire, by elimination if necessary.

With Unicode, we probably won't be able to get a consistent definition of what to accept, what is a delimiter under what circumstances, or how to handle arbitrary streams safely. It's just a matter of time before simple validators pass the data through, upper-layer software, trying to be helpful, attaches magic-character semantics to it, and we have a brand-new variety of security holes.

Unicode is just too complex to ever be secure.

Unicode:
<http://www.unicode.org/>

My thanks to Jeffrey Streifling, who provided much of the material for this article.