What is Unicode?

The Unicode Standard is the universal character representation standard for text in computer processing. Unicode provides a consistent way of encoding multilingual plain text, making it easier to exchange text files internationally.

The Unicode Standard defines code points (unique numbers) for characters used in the major languages written today. This includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, and so on. In all, the Unicode Standard provides codes for over 100,000 characters from the world’s alphabets, ideograph sets, and symbol collections, including classical and historical texts of many written languages. The characters can be represented in different encoding forms, such as UTF-8 and UTF-16.

The Unicode Standard is fully compatible with the International Standard ISO/IEC 10646; it contains all the same characters and code points as ISO/IEC 10646. This code-for-code identity is true for all encoded characters in the two standards, including the East Asian (Han) ideographic characters. The Unicode Standard also provides additional information about the characters and their use. Any implementation that conforms to Unicode also conforms to ISO/IEC 10646.

Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). These character encoding standards define not only the identity of each character and its numeric value (code position), but also how this value is represented in bits.
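As a rough illustration of how the three encoding forms differ in size, the following Python sketch encodes a few sample characters and counts the resulting bytes. The helper name encoded_sizes is hypothetical and the sample characters are chosen only for illustration; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        # The "-le" variants fix the byte order, so no byte order mark is emitted.
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


# Sample characters: Latin A, e with acute, euro sign, and an emoji
# from outside the Basic Multilingual Plane.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The code point of each character is the same in every form; only the number of bytes used to carry it changes.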

Starting from version 6.5, solidDB® can be configured to use the UTF-8 encoding for representing character data and UTF-16 encoding for wide character data. The database mode is controlled with the parameter General.InternalCharEncoding.

If a database is created in Unicode mode (General.InternalCharEncoding=UTF8), the following applies:

If a database is created in the partial Unicode mode (General.InternalCharEncoding=Raw), the following applies:

▪ Data in character column types is not encoded in any particular encoding; instead, the data is stored in byte strings, with the assumption that user applications are aware of this and handle the conversion as necessary.
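What "handling the conversion" might look like on the application side can be sketched in Python. This is only an assumption-laden illustration: no solidDB API is used, the byte strings stand in for values written to and read from a character column, and the helper names and the choice of Latin-1 as the application's encoding are hypothetical.

```python
# Hypothetical: the encoding this application has agreed to use for its data.
APP_ENCODING = "latin-1"


def encode_for_raw_column(text, encoding=APP_ENCODING):
    """Turn application text into the byte string that would be stored as-is."""
    return text.encode(encoding)


def decode_raw_column(raw, encoding=APP_ENCODING):
    """Interpret stored bytes using the encoding the application wrote them in."""
    return raw.decode(encoding)


stored = encode_for_raw_column("caf\u00e9")  # stand-in for a stored column value
print(decode_raw_column(stored))

# Reading the same bytes with a different encoding does not give the text back.
misread = stored.decode("utf-8", errors="replace")
print(misread == "caf\u00e9")
```

The point of the sketch is that the bytes carry no record of their encoding, so every application that reads the column has to apply the same convention as the one that wrote it.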

The UTF-8 and UTF-16 encodings are essentially ways of turning code points into the actual bits used in an implementation. Both encodings share the same character set, but the number of bytes used to store each character differs.

UTF-16 uses 16-bit code units. A reserved range of code unit values (the surrogates, U+D800 to U+DFFF) serves as an extension mechanism: a pair of 16-bit surrogates addresses each of roughly one million additional characters beyond the first 65,536 code points.
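The pairing mechanism can be sketched in a few lines of Python. The function name to_surrogate_pair is hypothetical; the arithmetic follows the UTF-16 definition, in which 0x10000 is subtracted from the code point and the remaining 20 bits are split into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits: high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits: low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # prints "D83D DE00"

# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark).
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```

Two 10-bit halves give 2^20 = 1,048,576 combinations, which is where the figure of about one million additional characters comes from.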

UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has two advantages: the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
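Both properties can be checked with a short Python sketch. The sample strings and the function name split_fields are hypothetical, and splitting on a comma merely stands in for "existing software" that works on bytes and knows nothing about Unicode.

```python
# ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "Hello, world!"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")


def split_fields(data):
    """Byte-oriented processing: split on the ASCII comma byte only."""
    return data.split(b",")


# Non-ASCII characters become multi-byte sequences whose bytes are all >= 0x80,
# so they can never be mistaken for the ASCII delimiter.
record = "caf\u00e9,na\u00efve".encode("utf-8")
fields = [field.decode("utf-8") for field in split_fields(record)]
print(fields)
```

Because no byte of a multi-byte UTF-8 sequence falls in the ASCII range, byte-level code that only looks for ASCII delimiters keeps working on UTF-8 input.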

The Unicode Consortium also endorses the use of UTF-8 as a way of implementing the Unicode standard. Any Unicode character expressed in the 16-bit UTF-16 form can be converted to the UTF-8 form and back without loss of information.
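A minimal Python sketch of the round trip, assuming well-formed UTF-16 input in little-endian byte order without a byte order mark. The helper names and the sample text are hypothetical.

```python
def utf16_to_utf8(data):
    """Re-encode little-endian UTF-16 bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


def utf8_to_utf16(data):
    """Re-encode UTF-8 bytes as little-endian UTF-16."""
    return data.decode("utf-8").encode("utf-16-le")


# Sample text mixing ASCII, accented Latin, Han characters, and a
# supplementary-plane character that needs a surrogate pair in UTF-16.
original = "na\u00efve \u6f22\u5b57 \U0001F600".encode("utf-16-le")

as_utf8 = utf16_to_utf8(original)
restored = utf8_to_utf16(as_utf8)
print(restored == original)
```

The round trip is lossless only for well-formed input; an unpaired surrogate in the UTF-16 data is not a valid character and causes the decode step to raise an error.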