Designing Unicode databases
This topic describes how to set up solidDB databases for use with Unicode.
Note: Unicode applications can be built on both Unicode and partial Unicode databases. However, the instructions in this topic assume that the database mode is Unicode.
Creating Unicode databases
The solidDB database mode is controlled with the General.InternalCharEncoding parameter (see General section).
Unicode mode: General.InternalCharEncoding=UTF8
Both character data types and wide-character data types are converted between the solidDB server and the application.
Data in wide-character column types is represented internally in UTF-16.
Data in character column types is represented in UTF-8.
Partial Unicode mode: General.InternalCharEncoding=Raw
Data in wide-character column types is represented internally in UTF-16. Wide-character data types are converted between the solidDB server and the application.
Data in character column types is not encoded in any particular encoding; instead, the data is stored in byte strings, with the assumption that user applications are aware of this and handle the conversion as necessary.
The databases created with solidDB version 6.3 or earlier are of the partial Unicode type.
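The following Python sketch illustrates why applications must handle the conversion themselves in partial Unicode mode. It models a character column as an opaque byte string and does not call any solidDB API; the variable names and the choice of encodings are assumptions made for illustration only.

```python
# Sketch only: models a partial Unicode (Raw) character column as opaque bytes.
# No solidDB API is used; names and encodings here are illustrative assumptions.

text = "caf\u00e9"  # "café"

# The writing application picks an encoding; the bytes are stored as they are.
stored_bytes = text.encode("utf-8")

# A reader that agrees on the encoding recovers the original text.
same_encoding = stored_bytes.decode("utf-8")

# A reader that assumes a different encoding gets different characters.
other_encoding = stored_bytes.decode("latin-1")

print(same_encoding == text)   # True
print(other_encoding == text)  # False
```

In other words, every application that reads or writes such a column has to agree on the encoding, because the stored bytes carry no encoding information.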
Important: The database mode must be defined when the database is created and cannot be changed later. See Converting partial Unicode databases to Unicode.
If the database already exists in either mode and the database mode contradicts the value of the parameter, the server startup fails with the following error message in the solerr.out file:
Parameter General.InternalCharEncoding contradicts the existing database mode
Determining which data types to use in Unicode databases
Both character and wide-character data types can be used to store Unicode data in Unicode databases. Although UTF-8 and UTF-16 share the same character set, the number of bytes used for each character differs, so the choice of column type affects storage size. If you expect mainly data that needs 3 bytes per character in UTF-8, you can save space by storing it in wide-character column types.
Data in wide-character column types (WCHAR/WVARCHAR/LONGWVARCHAR) is represented internally in UTF-16: each character is represented in 2 or 4 bytes.
characters in the Basic Multilingual Plane (BMP): 2 bytes
characters outside the BMP (supplementary characters, stored as surrogate pairs): 4 bytes
Data in character column types (CHAR/VARCHAR/LONG VARCHAR) is represented in UTF-8: each character is represented in 1 to 4 bytes.
The size depends on the code point, for example:
ASCII characters: 1 byte
Latin-1 Supplement, Cyrillic, Arabic, Hebrew, and similar characters: 2 bytes
Asian characters and the rest of the BMP: 3 bytes
Characters outside the BMP (supplementary characters): 4 bytes
For example, Asian languages are stored more efficiently in wide-character data types (UTF-16) because most of their characters are in the BMP and take 2 bytes in UTF-16 instead of 3 bytes in UTF-8. European languages are stored more efficiently in character data types (UTF-8) because their most common characters take 1 byte.
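The size differences described above can be checked with a short Python sketch. It uses only Python's built-in codecs, not solidDB; encoded_sizes is a hypothetical helper, and the sample characters are chosen for illustration.

```python
# Sketch only: compares per-character storage size in UTF-8 and UTF-16.
# encoded_sizes is a hypothetical helper, not part of solidDB.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

samples = {
    "ASCII": "A",                   # U+0041
    "Cyrillic": "\u0416",           # U+0416
    "CJK (BMP)": "\u4e2d",          # U+4E2D
    "outside BMP": "\U0001F600",    # U+1F600
}

for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")

# Whole strings: CJK text is smaller in UTF-16, ASCII text in UTF-8.
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # (9, 6)
print(encoded_sizes("hello"))               # (5, 10)
```

Running a similar check on a representative sample of your own data is one way to estimate which column type is more compact for it.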
Wide-character data requires less processing and so using wide-character data types can improve performance.
The Unicode data types are interoperable; because UTF-16 and UTF-8 share the same character set, there is no risk of data loss when using either data type. All string operations are possible between character and wide-character data types with implicit type conversions.
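As a minimal illustration of why no data is lost, the following Python sketch converts a string from UTF-8 to UTF-16 and back. It shows only the encoding conversion with Python's built-in codecs; it does not reproduce the implicit type conversions that the server performs, and the sample string is an assumption.

```python
# Sketch only: text survives conversion between UTF-8 and UTF-16 unchanged.
# Uses Python's built-in codecs; no solidDB API is involved.

original = "\u00dcber \u4e2d\u6587 \U0001F600"  # Latin-1, CJK, and non-BMP characters

as_utf8 = original.encode("utf-8")
as_utf16 = as_utf8.decode("utf-8").encode("utf-16-be")
round_trip = as_utf16.decode("utf-16-be").encode("utf-8")

print(round_trip == as_utf8)  # True
```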
Creating columns for storing Unicode data
To store Unicode data in a Unicode database, create tables with character or wide-character columns, for example:
CREATE TABLE customer1 (c_id INTEGER, c_name VARCHAR,...)
CREATE TABLE customer2 (c_id INTEGER, c_name WVARCHAR,...)
Ordering data columns (collation)
Character data columns are ordered by the binary values of their UTF-8 representation, and wide-character data columns by the binary values of their UTF-16 representation (most significant byte first). If the binary order differs from what users of a national language expect, you must provide a separate column to store the correct ordering information.
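The following Python sketch shows how binary order can differ from the order a user expects, and how a value for a separate ordering column might be derived. The simple_sort_key function is a hypothetical stand-in that only strips accents and case; it is not a solidDB feature, and a real application would apply the collation rules of its target language.

```python
# Sketch only: binary ordering versus an application-supplied sort key.
# simple_sort_key is a hypothetical stand-in for a real collation rule.
import unicodedata

# "\u00c9mile" is "Émile" with a precomposed É (U+00C9).
names = ["Zoe", "adam", "\u00c9mile", "eve"]

# Binary order of the UTF-8 bytes, as used for character columns.
binary_order = sorted(names, key=lambda s: s.encode("utf-8"))
print(binary_order)  # ['Zoe', 'adam', 'eve', 'Émile']

def simple_sort_key(text):
    """Strip accents and case so the binary order of the key looks alphabetical."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()

# Each row pairs the name with the value for a separate ordering column.
rows = [(name, simple_sort_key(name)) for name in names]
keyed_order = [name for name, key in sorted(rows, key=lambda r: r[1].encode("utf-8"))]
print(keyed_order)  # ['adam', 'Émile', 'eve', 'Zoe']
```

Queries would then order by the extra column instead of by the name column itself.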
Using Unicode in database entity names
It is possible to name database entities such as tables, columns, and procedures with Unicode strings by enclosing the Unicode names in double quotation marks in all SQL statements.
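As a minimal sketch, an application could build such statements as follows. The quote_name helper and the table and column names are hypothetical; the helper rejects names that contain a double quotation mark instead of assuming an escaping rule.

```python
# Sketch only: builds SQL text with Unicode entity names in double quotation marks.
# quote_name is a hypothetical helper; the table and column names are examples.

def quote_name(name):
    """Enclose an entity name in double quotation marks."""
    if not name or '"' in name:
        # Escaping rules for embedded quotation marks are not covered here.
        raise ValueError("unsupported entity name")
    return f'"{name}"'

table = quote_name("\u9867\u5ba2")   # "顧客"
column = quote_name("\u540d\u524d")  # "名前"

create_stmt = f"CREATE TABLE {table} (c_id INTEGER, {column} WVARCHAR)"
select_stmt = f"SELECT {column} FROM {table}"

print(create_stmt)  # CREATE TABLE "顧客" (c_id INTEGER, "名前" WVARCHAR)
print(select_stmt)  # SELECT "名前" FROM "顧客"
```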
solidDB tools can handle Unicode strings according to the default locale of the environment or according to a specified locale. For more details, see Using solidDB tools with Unicode.
Using Unicode in user names and passwords
User names and passwords can be Unicode strings. However, to avoid access problems from different tools, the original database administrator credentials must use pure ASCII strings.
Using Unicode in file names
Unicode strings cannot be used in any file names.