Genus

View Original

Using Unicode to Support Multilingual Applications

The preferred way to store text is by using strings that support unicode, since this allows a mix of characters from any language. When using unicode strings there are some features that are useful to be aware of to avoid problems. Some of these features are also differences from non-unicode strings that should be considered, especially when converting an existing solution from non-unicode to unicode.

The Legacy

How to represent a character when storing it and processing it in a computer today is dictated by standards that have a history all the way back to the first computers, and actually even before. This article will not attempt to unravel the whole history, but a brief and somewhat simplified explanation of some terms are in order.

ASCII - American Standard Code for Information Interchange is the most common character encoding format for text data in computers. In standard ASCII-encoded data 7 bits are used, resulting in unique values for 128 alphabetic, numeric or special additional characters and control codes.

Extended ASCII - An additional bit is used to encode an additional 128 characters resulting in a total of 256 characters. The additional characters are not assigned to specific code, but can be used to store different characters according to a code page. This way several different sets of characters could be supported, like Japanese, Chinese, and Nordic characters, just not at the same time. The extended ASCII is also known as ANSI.

Code Page - A named set of characters and the code that is used to represent them. For example the Code Page "865 Nordic Languages" contains the mapping of the character "Æ" to the code 145 in the extended ASCII encoding.

All in all the scene is set with problems arising from the complexities of this legacy. Non-unicode character sets, such as ASCII or the various extended ASCII sets, with their limited range, is insufficient in today's global environment, where data input could literally be in any language or symbol set. With non-Unicode sets, any character outside their limited range needs conversion, often resulting in data loss or incorrect representations. 

What is Unicode?

Unicode is an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.

Unicode thereby solves the problems of language support in applications, by allowing any character in any language to be consistently encoded.

This sounds fantastic, but there are however, some features to be aware of. To encode all characters of all languages including all symbols etc., 4 bytes are required (32 bit). This is a quadrupling of the requirements ANSI/Extended ASCII has, In many cases this is a complete waste, as one byte would suffice. Unicode can be encoded in different ways, most notably UTF-8 and UTF-16, which addresses the optimization of memory usage .

UTF - Unicode Transformation Format is a character encoding format which is able to encode all of the possible character code points in Unicode.

The numeric value in the name of a UTF encoding denotes the number of bits used as "a unit", so UTF-8 - is a variable width encoding that uses one to four bytes to encode a character, while UTF-16 - is a variable width encoding that uses two or four bytes to encode a character.

Considerations

In SQL Server the main difference between the database data types VARCHAR and NVARCHAR is:

  • VARCHAR stores one character in one byte. A column defined as VARCHAR(10) reports a size of 10 bytes.

  • NVARCHAR uses UTF-16 and stores one character in two or four bytes depending on the character. A column defined as NVARCHAR(10) reports a size of 20 bytes.

In both definitions VARCHAR(10) / NVARCHAR(10) the numbers does actually NOT state the number of characters, but the string length in bytes for varchar and in byte-pairs for nvarchar. For varchar, this corresponds well with the number of characters in most cases. For nvarchar, it becomes apparent that there is a possible mismatch, as each character can require as much as four bytes, or two byte-pairs, depending on the character.

Therefore when using nvarchar one must consider if it is necessary to increase the size of the column in the database.

  • Will the property contain only text in western languages, or other languages with asian, greek, cyrillic characters etc., or a mix? For example a name property could contain any character, even if the application uses a single western language for its data.

  • Is the string size of the property set a bit tight, so it in many cases would be filled or nearly filled?

  • Is the column in a high volume table, like a transaction table? Should we even stick with varchar?

  • Are emojis probable in the content, for example in comment properties or similar?

  • Is it critical that all characters possible to input for a user can be stored?

  • Is it acceptable for user to be prompted with a warning message that the text is too long to be saved?

Remember, the size of an nvarchar column is defined by specifying a number of byte-pairs, and the byte size is thereby the double of that. So if you double the size of an nvarchar column, to ensure that all characters in any language can be stored, the result is a quadrupling in the size.

VARCHAR (100) --specifies 100 bytes, and the size of the column is 100 bytes.
NVARCHAR(200) --specifies 200 byte-pairs, and the size of the column is 400 bytes.

Recommendations

Most textual properties in the application should be defined using the data type Unicode String, and the Data Size property should be set to the number of characters the property can contain.

The corresponding datatype to use in the database should be NVARCHAR(n).The (n) should be set after a consideration on a case-by-case basis.

Converting from non-Unicode to Unicode data

Converting non-Unicode data (like VARCHAR) to Unicode data types (like NVARCHAR) in SQL Server is generally uncomplicated. 

For example, to change a column from VARCHAR to NVARCHAR, you can use the ALTER TABLE statement: 

ALTER TABLE YourTableName ALTER COLUMN YourColumnName NVARCHAR(255); 

This statement modifies the specified column to use the NVARCHAR data type, allowing it to store Unicode data. The size (255 in this example) can be adjusted based on your data requirements. 

Moreover, if you need to insert data into a Unicode column, and the source data is in a non-Unicode format, SQL Server implicitly converts the data type, if possible. In scenarios requiring explicit conversion, you can use the CAST or CONVERT function: 

INSERT INTO YourUnicodeTable (YourUnicodeColumn) 
SELECT CAST(YourNonUnicodeColumn AS NVARCHAR(255)) 
FROM YourNonUnicodeTable; 

Remember that the data type in Genus Studio must also be changed from Non-unicode String to Unicode String.

When Not to Use Unicode

There can be certain cases when unicode should not be used, These are cases when the text is strictly formatted or constrained by semantics or system specification. Examples can be phone numbers, account numbers, object identifiers like order number with a mix of alphanumeric and numeric characters, etc.

Assumptions in This Article

This article attempts to explain aspects of unicode in the context of a Genus application, and the aim is to make the use of unicode easy to achieve. There are, however, some simplifications in the article based on what is the recommendation for a high majority of cases.

  • SQL Server is used as basis for data types and examples.

  • The description of VARCHAR assumes that a single-byte character set collation is used, which is the most common case by far.

  • Examples uses VARCHAR and NVARCHAR, but the same considerations are valid for CHAR and NCHAR data types.

There are other possible configurations that would also work, in combinations with specific collations, other database types etc. To cover all these combinations and possibilities would defeat the purpose of this article, and is therefore left for anyone to explore in detail should the need arise.