Designing Unicode databases

This section describes how to set up solidDB® databases for use with Unicode.

Note: Unicode applications can be built on both Unicode and partial Unicode databases. However, the instructions in this section assume that Unicode support is based on the Unicode database mode.

The solidDB® database mode is controlled with the parameter General.InternalCharEncoding.

When InternalCharEncoding is set to UTF8 (Unicode database mode), the internal representation for character data types is UTF-8. Both character data types and wide character data types are converted between the solidDB® server and the application.

When InternalCharEncoding is set to Raw (partial Unicode database mode), the internal representation for character data types uses no particular encoding. Instead, the data is stored as byte strings on the assumption that user applications are aware of this and handle the conversion as necessary. Only wide character data types are converted between the solidDB® server and the application.
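The following minimal sketch illustrates why applications that share a Raw mode character column need to agree on an encoding. It uses plain Python byte strings as a stand-in for a stored column value; no solidDB API is involved, and the three "applications" are hypothetical.

```python
# Sketch only: a byte string stands in for a value stored in a Raw mode
# character column. No solidDB API is used here.

text = "caf\u00e9"  # "café"

# Hypothetical application A writes the text as UTF-8 bytes.
stored = text.encode("utf-8")

# Hypothetical application B reads the same bytes but assumes Latin-1.
misread = stored.decode("latin-1")

# Hypothetical application C agrees with A on the encoding.
recovered = stored.decode("utf-8")

print(stored)             # b'caf\xc3\xa9'
print(misread == text)    # False: the bytes were interpreted with the wrong encoding
print(recovered == text)  # True
```

Because the server does not convert the bytes in this mode, the misread value is returned without any error; only a shared convention between the applications prevents it.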

Databases created with solidDB® version 6.3 or earlier are of the partial Unicode type.

Important: The database mode must be defined when the database is created and it cannot be changed later.
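Because the mode cannot be changed after the database is created, it can be worth checking the configured value before the first startup. The sketch below is one possible pre-flight check, not a solidDB tool: it assumes the parameter is kept in an INI-style configuration file with a [General] section, and the sample text and the function name read_char_encoding are hypothetical. Only the parameter name and the values UTF8 and Raw come from this section.

```python
import configparser

# Hypothetical configuration fragment. The INI layout is an assumption;
# the parameter name and its two values are taken from the text above.
SAMPLE_INI = """
[General]
InternalCharEncoding=UTF8
"""

VALID_MODES = {"UTF8", "Raw"}


def read_char_encoding(ini_text):
    """Return the configured InternalCharEncoding value, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get("General", "InternalCharEncoding", fallback=None)


mode = read_char_encoding(SAMPLE_INI)
if mode not in VALID_MODES:
    raise ValueError(f"unexpected InternalCharEncoding value: {mode!r}")
print(mode)  # UTF8
```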

If a database already exists and its mode contradicts the value of the parameter, server startup fails and the following error message is written to solerr.out:

Parameter General.InternalCharEncoding contradicts the existing database mode

Both character and wide character data types can be used to store Unicode data in Unicode databases. If you expect mainly multi-byte data, you can save space by storing it in wide character columns. Although UTF-8 and UTF-16 encode the same character set, the number of bytes used for each character differs.

▪ Data in wide character column types (WCHAR/WVARCHAR/LONG WVARCHAR) is represented internally in UTF-16: each character takes two or four bytes.

▪ Data in character column types (CHAR/VARCHAR/LONG VARCHAR) is represented internally in UTF-8: each character takes one to four bytes.

The size depends on the code point:

– ASCII characters: one byte

– Cyrillic, Arabic, Hebrew, Latin-1 Supplement, and similar characters: two bytes

– Asian characters and the rest of the BMP (Basic Multilingual Plane): three bytes

– Characters outside the BMP (supplementary characters, which UTF-16 encodes as surrogate pairs): four bytes
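The per-character sizes listed above can be checked with a short script. This is a sketch that uses Python's built-in codecs only; the sample characters and the helper name encoded_sizes are chosen for illustration.

```python
# One sample character per size class (escapes keep the code points unambiguous).
samples = {
    "ASCII": "A",
    "Latin-1 Supplement": "\u00e9",   # e with acute accent
    "Cyrillic": "\u0416",             # Cyrillic capital letter ZHE
    "CJK (BMP)": "\u4e2d",            # CJK ideograph
    "Outside the BMP": "\U0001F600",  # emoji, a supplementary character
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used because plain "utf-16" prepends a 2-byte byte order mark.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label:20} U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16: {utf16}")
```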

For example, Asian languages are stored more efficiently in wide character data types (UTF-16), because most of their characters are in the BMP and take two bytes in UTF-16 instead of three in UTF-8. European languages are stored more efficiently in character data types (UTF-8), because their most common characters take one byte.
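To estimate which column family is more compact for your own data, you can compare the encoded size of representative sample text. The sketch below assumes that the encoded payload dominates storage cost and ignores any per-column overhead in the server; the function name compare_storage is hypothetical.

```python
def compare_storage(text):
    """Return (UTF-8 bytes, UTF-16 bytes, more compact column family)."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-be"))  # big-endian, no byte order mark
    if utf8 < utf16:
        family = "character"
    elif utf16 < utf8:
        family = "wide character"
    else:
        family = "either"
    return utf8, utf16, family


print(compare_storage("Unicode database"))    # English sample
print(compare_storage("Gr\u00f6\u00dfe"))     # German sample
print(compare_storage("\u65e5\u672c\u8a9e"))  # Japanese sample
```

Running the comparison over a realistic sample of your data, rather than a single string, gives a more reliable basis for choosing the column type.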

Wide character data also requires less processing, so using wide character data types may improve performance.

The Unicode data types are interoperable; because UTF-16 and UTF-8 share the same character set, there is no risk of data loss when using either data type. All string operations are possible between character and wide character data types with implicit type conversions.
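The lossless relationship between the two encodings can be demonstrated outside the database. The sketch below converts a mixed-script string from UTF-8 to UTF-16 and back with Python's codecs; it illustrates the encodings only and does not reproduce the server's implicit type conversions.

```python
# Mixed-script sample: Latin with umlaut, Cyrillic, CJK, and a supplementary character.
text = "Z\u00fcrich, \u041c\u043e\u0441\u043a\u0432\u0430, \u6771\u4eac, \U0001F600"

as_utf8 = text.encode("utf-8")

# UTF-8 -> UTF-16 -> UTF-8
as_utf16 = as_utf8.decode("utf-8").encode("utf-16-be")
round_trip = as_utf16.decode("utf-16-be").encode("utf-8")

print(round_trip == as_utf8)  # True: nothing is lost in either direction
```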

Before you can store Unicode data in a Unicode database, you must create tables that have Unicode data columns.

Character data columns are ordered by the binary values of their UTF-8 representation, and wide character data columns by the binary values of their UTF-16 representation (most significant byte first). If the binary order differs from what users of a national language expect, you need to provide a separate column that stores the correct ordering information.
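The sketch below shows how binary order can differ from what users expect, how the two binary orders can differ from each other, and one simple way to derive a value for a separate ordering column. The function simple_sort_key is hypothetical and deliberately naive (it strips accents and case); a real application would typically use a locale-aware collation library instead.

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair", "apple"]  # third word is "éclair"

# Binary order of the UTF-8 bytes: uppercase before lowercase, accented last.
utf8_order = sorted(words, key=lambda w: w.encode("utf-8"))


def simple_sort_key(word):
    """Hypothetical ordering key: remove accents and case differences."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


# Order by the derived key, as an ORDER BY on a separate key column would.
keyed_order = sorted(words, key=simple_sort_key)

# UTF-8 and big-endian UTF-16 binary orders can also disagree with each other:
# a BMP character above the surrogate range versus a supplementary character.
pair = ["\uff21", "\U0001F600"]
pair_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
pair_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))

print(utf8_order)               # ['Apple', 'apple', 'zebra', 'éclair']
print(keyed_order)              # ['Apple', 'apple', 'éclair', 'zebra']
print(pair_utf8 == pair_utf16)  # False
```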

You can name database entities such as tables, columns, and procedures with Unicode strings by enclosing the Unicode names in double quotation marks in all SQL statements.
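When an application builds SQL text itself, a small helper can apply the quoting consistently. The sketch below is an illustration with hypothetical names (quote_identifier and the sample table and column names). It assumes the standard SQL convention that a double quotation mark inside a delimited identifier is written twice; verify that convention against the solidDB SQL reference before relying on it.

```python
def quote_identifier(name):
    """Wrap a Unicode identifier in double quotation marks for use in SQL text."""
    if not name:
        raise ValueError("identifier must not be empty")
    # Assumption: embedded double quotation marks are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


table = quote_identifier("\u9867\u5ba2")   # hypothetical table name
column = quote_identifier("\u540d\u524d")  # hypothetical column name
statement = f"SELECT {column} FROM {table}"
print(statement)
```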

solidDB® tools can handle Unicode strings according to the default locale of the environment or according to a specified locale.

For more details, see Using solidDB® tools with Unicode.

User names and passwords can also be Unicode strings. However, to avoid access problems from different tools, the original database administrator account information must be given as pure ASCII strings.
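A provisioning script can enforce the pure ASCII rule for the administrator account before the database is created. The following sketch uses Python's built-in str.isascii(); the function name check_admin_credentials and the sample values are hypothetical.

```python
def check_admin_credentials(username, password):
    """Raise ValueError unless both values are non-empty, pure ASCII strings."""
    for label, value in (("user name", username), ("password", password)):
        if not value:
            raise ValueError(f"administrator {label} must not be empty")
        if not value.isascii():
            raise ValueError(f"administrator {label} must be pure ASCII")


check_admin_credentials("dba", "S3cret-pass")  # passes silently

try:
    check_admin_credentials("\u7ba1\u7406\u8005", "S3cret-pass")
except ValueError as err:
    print(err)  # administrator user name must be pure ASCII
```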