Oracle character set basics

In this entry I try to explain some basic facts about character sets in Oracle Databases and the most current issues with them.

The databases can store different types of information, numbers, dates, characters, raw data (data with a format know by the owner but not necessary by the database) among others, but internally all of them are represented as numbers (basically due computers only work with numbers).

For the text data we can use different datatypes to store them (CHAR, VARCHAR2, NVARCHAR, CLOB) these datatypes defines how the data is stored (firxed lenth, variable length, in what character set, in large amounts…). But despite the datatype used all of them have in common the use of a “character set”.

As mentioned before the computers can only store numbers, then, a character set is a list of characters (or symbols, or pictures or whatever you want to call them) with a number to identify every one.

Examples of character sets

One of the most know character set is the ASCII. In a database ho uses US7ASCII character set if someone put one “A” character on them, the database stores a number 65 or a 97 if the letter to store is an “a”.

2000px-ascii_table-nocolor-svg

In a database with US7ASCII character set we can only store this 127 characters and no others.

Another common character set is the WE8MSWIN1252 (is the one used by Windows in the European country’s)

mswin1252

This character set uses numbers from 0 to 256 to represent the characters, the first 127 are the same as ASCII ones (same character,same number), and some of these numbers are not used (no have any character associated with them).

In a database using the WE8ISO8859P1 cannot exists any “character datatype” with a “129” stored on them, due the number 129 is not used in this character set to represent any character.

Another one, the WE8859P15, have less characters than WEMSWIN252 and the common ones are represented by different numbers.

iso8859-15

All these, US7ASCII, WEMSWIN1252, WE8ISO8859P15, etc are “fixed size” characters sets, this implies that every character have one byte in size (ASCII only need 7bits but uses really one byte) if you need to store “n” characters, you have to reserve “n” bytes of space.

To allow the databases to store more than 256 different characters we can use character sets ho uses more than one byte to store every character (called “multi byte”), some of them are  fixed length and some are variable length. One example of this are the AL32UTF8 how uses the “Unicode” codification standard and can represent more than 120.000 characters (in the last version).

In the variable length character sets, and due the characters can use different number of bytes, not is possible to know with advance exactly how many bytes must use the strings stored on the database (only can know the theoretical maximum). This character sets are used basically on databases ho need symbols from multiple languages due can store virtually any character.

What can go wrong?

Incorrect client character set configuration

Well, as mentioned early in a US7ASCII database cannot exists any character datatype with a number bigger than 127 stored on it no? …

Imagine you have a database with US7ASCII, and one application on a Windows machine (ho uses the MSWIN1252 character set) with the Oracle client installed.

You configure your Oracle client with the same character set as the database NLS_LANG=US7ASCII, the Windows operating system sends the key strokes to the Oracle client (codified as MSWIN125) and the Oracle client send them to the database.. and the database blindly believes the characters you send are all US7ASCII (due the client says they are) and stores them.

But my application works!!!!   Yes if the client retrieves the characters and interpret them as MSWIN252 all seems to work, until the day you connect with another application (with the correct character set configured), or uses any Java application (java always uses an internal Unicode character set to represent characters) or tries to export/import the data and some of your data appears to be rubbish.

Well, but if I configure my client correctly some characters disappear!!! (are all stored as quotation marks). Yes, this is the correct behaviour, if you send to the database a character not existent on the database character set, the database instead of the character stores a quotation mark “?” and the character is lost.

The rule are:

  • The clients must be configured always with the character set of the machines where it runs, despite the character set of the database.
  • You must use a character set capable of store all the necessary characters.

Note also de database can have any character set despite the one used on the server where the database runs.

Finally, change database from one character set to another is possible, but in general not an easy task, and must be done with caution and following the guidance of Oracle Support Notes.

Changing Or Choosing the Database Character Set ( NLS_CHARACTERSET ) (Doc ID 225912.1)

Moving data arround

Another thing to consider is to move data from a “non variable length” characterset to a variable lenght characterset. The data on the origin database must not fit on destination, it depends on the definition of the lengths of the datatypes.

For example on a multibyte character set database, a VARCHAR2(10 BYTE) is different than a VARCHAR2(10 CHAR), the first one is 10 bytes in size and the second one is 10 chars (as mentioned early multibyte character sets can use more than one byte per character).

Every database have a “default” interpretation for character datatypes sizes (in bytes of chars), you can view this default value and change it if necessary (if you change it only affect the tables created after the change). The parameter is found on the init or spfile and is called:

nls_length_semantics

and an be “BYTE” or “CHAR”,

Also you can create the row in the table with the size and the type (char or byte) despite the default you have.

Unfortunately some cases not have easy solution, if you have a “full” VARCHAR2(4000) on a fixed size character set database and try to move this data to a multibyte database you can catch this error:

ORA-12899: value too large for column <mycol>

 

And due the limitations on size for varchar2 type (have a maximum of 4000 bytes in length until 12.1) the issue in most cases cannot be solved increasing the size of the column.

Also is necessary to take on account this issues on our PL/SQL code, if you have a variable of one byte of length to store one character on the origin database this may not work on a multibyte character set destination database, due most of the characters need more than one byte.

Finally if you move from a multibyte character set database to a one with a less extent character set all the characters not present on destination character set are changed to quotation marks and lost and this cannot be avoided.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s