What is Unicode?
The Unicode standard is the universal character representation standard for text in computer processing. Unicode provides a consistent way of encoding multilingual plain text, which makes it easier to exchange text files internationally.
The Unicode standard defines code points (unique numbers) for characters used in the major written languages. This includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, and so on. In all, the Unicode standard provides codes for over 100,000 characters from alphabets, ideograph sets, and symbol collections, including classical and historical texts of many written languages. The characters can be represented in different encoding forms (UTF-8, UTF-16, and UTF-32).
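As an illustration of what a code point is, the following minimal Python sketch prints the number and the official name that the standard assigns to a few characters. It is not part of solidDB; the helper name describe is hypothetical, and the names come from the Unicode character database bundled with Python.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and Unicode name of one character (hypothetical helper)."""
    # ord() gives the code point as an integer; U+XXXX is the usual notation.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# A Latin letter, an accented letter, a currency sign, an arrow, a Han ideograph.
samples = ["A", "\u00e9", "\u20ac", "\u2192", "\u4e2d"]

for ch in samples:
    print(describe(ch))
```

Each character has exactly one code point, regardless of which encoding form is later used to store it.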
The Unicode standard is fully compatible with the International Standard ISO/IEC 10646; it contains all the same characters and code points as ISO/IEC 10646. This code-for-code identity is true for all encoded characters in the two standards, including the East Asian (Han) ideographic characters. The Unicode standard also provides additional information about the characters and their use. Any implementation that conforms to Unicode also conforms to ISO/IEC 10646.
Encoding forms
Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The Unicode standard defines not only the identity of each character and its numeric value (code point), but also how this value is represented in bits in each encoding form.
The encodings share the same character set, but the data size of each character differs.
UTF-32 is a fixed-width encoding scheme and always uses 4 bytes to encode a character.
UTF-16 uses 16-bit code units. Most commonly used characters are encoded as a single 16-bit unit. A reserved range of values (the surrogates, U+D800 to U+DFFF) serves as an extension mechanism that gives access to about a million additional characters, each encoded as a pair of 16-bit units.
UTF-8 transforms all Unicode characters into a variable-length encoding of one to four bytes. It has two advantages: the Unicode characters that correspond to the familiar ASCII set have the same byte values as in ASCII, and Unicode characters transformed into UTF-8 can be used with most software without extensive software rewrites.
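The size differences between the three encoding forms can be checked with a short Python sketch. This is only an illustration using Python's built-in codecs, not solidDB code; the function name encoded_sizes is hypothetical, and the big-endian codec variants are chosen so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),  # explicit byte order, no BOM
        "UTF-32": len(ch.encode("utf-32-be")),
    }

# ASCII letter, euro sign (U+20AC), and a character outside the 16-bit range (U+1F600).
for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# U+1F600 does not fit in one 16-bit unit, so UTF-16 stores it as a pair of units.
pair = "\U0001F600".encode("utf-16-be")
print(pair.hex())  # d83dde00: the two 16-bit units D83D and DE00
```

The ASCII letter takes 1 byte in UTF-8, 2 in UTF-16, and 4 in UTF-32, while the character above the 16-bit range takes 4 bytes in all three forms.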
The Unicode Consortium endorses the use of UTF-8 as a way of implementing the Unicode standard. Any Unicode character that is expressed in the 16-bit UTF-16 form can be converted to the UTF-8 form and back without loss of information.
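The lossless conversion between UTF-16 and UTF-8 can be demonstrated with a small round trip. The sketch below uses Python's standard codecs and assumes well-formed, little-endian UTF-16 input; the function names are hypothetical, and input containing unpaired surrogates would be rejected by the decoder rather than converted.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Convert little-endian UTF-16 bytes to UTF-8 bytes."""
    return data.decode("utf-16-le").encode("utf-8")

def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 bytes back to little-endian UTF-16 bytes."""
    return data.decode("utf-8").encode("utf-16-le")

# Sample text mixing ASCII, an arrow, Han ideographs, and a supplementary character.
text = "solidDB \u2192 \u4e2d\u6587 \U0001F600"
original = text.encode("utf-16-le")

as_utf8 = utf16_to_utf8(original)
restored = utf8_to_utf16(as_utf8)

# The round trip reproduces the original bytes exactly.
print(restored == original)
```

Both forms encode the same sequence of code points, which is why the conversion loses nothing in either direction.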