GB 18030: A mega-codepage

Exploring the history and structure of the new Chinese Unicode standard

Markus Scherer (markus.scherer@us.ibm.com)
Software Engineer, IBM Unicode Technology Group, IBM
February 2001

Originally published on http://www-106.ibm.com/developerworks/unicode/library/u-china.html?dwzone=unicode

This article briefly describes the important Chinese GB 18030-2000 standard and its implications for software for the Chinese market. GB 18030 presents adopters with some unusual challenges. They are explained here, along with suggestions for how to deal with them.
Contents:
Introduction
A brief history of major GB codepages
Structure
Challenges for implementations of GB 18030
Suggestions for dealing with these challenges
Algorithm for mapping contiguously-enumerated mappings between GB 18030 and Unicode
Conclusion and outlook
Resources
About the author

Introduction
GB 18030-2000 is a new Chinese standard that specifies an extended codepage and a mapping table to Unicode. GB 18030 was first published on March 17, 2000. After feedback from the worldwide software industry, the codepage was changed, and a new mapping table was released on November 30, 2000. The text of the standard is expected to be republished in March of 2001.

This codepage standard is important for the software industry because China has mandated that any software application that is released for the Chinese market after a certain date must support GB 18030. Initially, this date was specified as January 1, 2001. It has been changed to September 1, 2001.

A brief history of major GB codepages
A common base codepage standard for Chinese is GB 2312-1980. It encodes more than 6,000 frequently-used Chinese ideographs.

With the growing importance of Unicode and the parallel standard ISO 10646 (which was adopted by China as GB 13000), an extension of GB 2312-1980 was created. This extension was called GBK and encoded all 20,902 unified ideographs that are assigned in Unicode 2.1. GBK is not a formal standard, but a widely-implemented specification.

Unicode 3.0 added more than 6,000 ideographs, and the upcoming version 3.1 will add about 42,000 on top of that.

GB 18030 was created as an update of GBK for Unicode 3.0 with an extension that covers all of Unicode. It has the following general features:

Structure
GB 18030-2000 encodes characters in sequences of one, two, or four bytes. Valid byte sequences are as follows (byte values are hexadecimal):

(*) Note: At the time of this writing, it seems that the single byte 0x80 should be treated as valid but unassigned, while the single byte 0xff should be treated as illegal.

GB 18030 was created with GBK as a basis. The Unicode mapping table for GB 18030 starts with the same mappings for single-byte and double-byte sequences as the Unicode mapping table for GBK, except for a few dozen characters. These characters were not assigned in Unicode 2.1 and were mapped in the GBK mapping table to Unicode Private-Use code points. GB 18030 maps them to the newly-assigned code points in Unicode 3.0 for the corresponding characters. This keeps the GBK byte sequences the same for these characters, but the Unicode mapping table yields different results for them.

In addition, all Unicode code points that are not mapped by this updated GBK portion are mapped to four-byte sequences, which are new in GB 18030. They are simply enumerated beginning at the lowest such Unicode code point (U+0080) and at the lowest such four-byte sequence (GB+81308130). One such enumeration fills in the 40,000 or so Unicode BMP code points that were not covered by GBK (GB lead bytes 0x81..0x84). Another such enumeration covers the 1 million supplementary Unicode code points (GB lead bytes 0x90..0xe3).

One of the biggest changes with the re-released mapping table from November, compared to the initial one, is that all of the 40,000 mappings to BMP code points were changed. This is mainly (but not only) due to starting the BMP enumeration at U+0080 instead of U+0081.

The current Unicode mapping table in the XML format as described in Unicode Technical Report 22 is available on the ICU Web site (see Resources).

The current Unicode mapping table contains only round-trip mappings. The original mapping table contained fallback mappings for the GBK characters that were updated according to Unicode 3.0: Their old GBK Private-Use code points were mapped unidirectionally to the GB codes, while the round-trip mappings were changed (compared to GBK) to be from the GB codes to the new (Unicode 3.0) code points. In the new mapping table, the fallback mappings are removed, and the Private-Use code points instead map to new four-byte sequences with round-trip mappings.

Note: Like some GBK implementations, the original publication of GB 18030-2000 assigned the Euro currency symbol to the single byte 0x80. The updated mapping table from November leaves 0x80 unassigned and instead maps the two-byte sequence 0xa2e3 to U+20ac for the Euro symbol.

GB 18030 has 1.6 million valid byte sequences, but there are only 1.1 million code points in Unicode, so there are about 500,000 byte sequences in GB 18030 that are currently unassigned.
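The rounded figures above can be checked with a little arithmetic. The following C sketch counts the byte sequences of each length. The four-byte ranges (0x81..0xfe and 0x30..0x39, alternating) are the ones given by bMin and bMax in the range example later in this article. The single-byte range 0x00..0x7f and the two-byte trail ranges 0x40..0x7e and 0x80..0xfe are assumptions about the codepage layout that this text does not spell out, and all function names are illustrative only.

```c
#include <stdint.h>

/* Number of values in the inclusive byte range lo..hi. */
static uint32_t span(uint32_t lo, uint32_t hi) { return hi - lo + 1; }

/* Assumption: single bytes 0x00..0x7f (0x80 and 0xff are not counted). */
uint32_t gb18030_single_count(void) { return span(0x00, 0x7f); }

/* Assumption: lead byte 0x81..0xfe, trail byte 0x40..0x7e or 0x80..0xfe. */
uint32_t gb18030_double_count(void) {
    return span(0x81, 0xfe) * (span(0x40, 0x7e) + span(0x80, 0xfe));
}

/* Four bytes: 0x81..0xfe, 0x30..0x39, 0x81..0xfe, 0x30..0x39. */
uint32_t gb18030_quad_count(void) {
    return span(0x81, 0xfe) * span(0x30, 0x39)
         * span(0x81, 0xfe) * span(0x30, 0x39);
}

/* All valid byte sequences under the assumptions above. */
uint32_t gb18030_total_count(void) {
    return gb18030_single_count() + gb18030_double_count()
         + gb18030_quad_count();
}

/* Unicode code points U+0000..U+10FFFF. */
uint32_t unicode_code_point_count(void) { return 0x10FFFFu + 1u; }
```

Under these assumptions the total is 1,611,668 byte sequences against 1,114,112 code points, which leaves 497,556 sequences and agrees with the rounded numbers in the text.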

Challenges for implementations of GB 18030
GB 18030 has some unusual properties that present challenges for an implementation of a codepage converter as well as for in-process use:

Suggestions for dealing with these challenges
An implementation of GB 18030 needs to be able to determine the length of a byte sequence by examining not only the lead byte, but at least the second byte of a multi-byte sequence as well. This could be hard-coded for GB 18030, or could be done in a more general way with a state machine that represents the entire validity structure of this codepage. Such a state machine could be purely data-driven and would be useful for all multi-byte encodings. It provides a general approach for checking that any byte sequence is valid in a given codepage.
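The hard-coded variant can be sketched in a few lines of C. This is only an illustration, not the method of any particular converter: the function name and its return convention are made up for this example, the two-byte trail ranges (0x40..0x7e and 0x80..0xfe) are an assumption about the codepage layout, and the single byte 0x80 is accepted as a valid one-byte sequence in line with the note in the Structure section.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the length (1, 2, or 4) of the sequence
 * that starts at s, 0 if more bytes are needed to decide or to complete
 * the sequence, and -1 if the bytes cannot form a valid sequence.
 */
int gb18030_sequence_length(const uint8_t *s, size_t avail) {
    if (avail == 0) return 0;
    if (s[0] <= 0x80) return 1;      /* single byte; 0x80 valid but unassigned */
    if (s[0] == 0xff) return -1;     /* 0xff is never a lead byte */
    if (avail < 2) return 0;         /* the lead byte alone does not decide */

    /* The second byte distinguishes two-byte from four-byte sequences. */
    if ((s[1] >= 0x40 && s[1] <= 0x7e) || (s[1] >= 0x80 && s[1] <= 0xfe)) {
        return 2;                    /* assumed two-byte trail ranges */
    }
    if (s[1] < 0x30 || s[1] > 0x39) return -1;

    /* Four-byte sequence: check the third and fourth bytes as well. */
    if (avail < 4) return 0;
    if (s[2] < 0x81 || s[2] > 0xfe) return -1;
    if (s[3] < 0x30 || s[3] > 0x39) return -1;
    return 4;
}
```

A data-driven state machine would encode the same decisions in tables instead of in `if` statements, which is what makes it reusable for other multi-byte codepages.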

For full support of GB 18030, there are basically only two options because it is specified with a Unicode mapping table for all code points:

The sheer number of valid byte sequences, of Unicode code points covered, and of mappings defined between them makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size. Most likely, some initial implementations will not support GB 18030 fully, but only some subset of it.

A simple and effective way to handle the large number of defined mappings is to handle most of the four-byte sequences algorithmically. This is possible because the mappings between four-byte GB 18030 sequences and Unicode code points are a result of an enumeration process (see the Structure description above). Large portions of the mapping table contain entries that differ by exactly one position in both Unicode code points and byte sequences. It is possible to extract a small number of such contiguously-enumerated ranges mechanically (the algorithm section below shows how mappings within such ranges are computed). The result is that only the remaining mappings need to be stored in an actual mapping table, while the ranges are mapped by special code in a converter.

The XML mapping file mentioned above contains 13 such ranges, which cover all but 31,000 mappings. A table of that size is not unusual for mappings between Unicode and East Asian codepages. A converter using such a mapping table would first use the explicit mappings. When the result is "unassigned", it would then need to find a range that contains the input, and map algorithmically if such a range exists or otherwise treat the input as unassigned. (Of course, illegal sequences must be handled, as usual, according to the application.)
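The lookup order described here can be sketched as follows. Everything in this C example is hypothetical scaffolding: the GbEntry type, the GB_UNASSIGNED marker, the convention of packing the bytes of a sequence big-endian into one integer, and the callback type for the range code are all assumptions made for illustration. The only mappings used are the two that appear in this article (0xa2e3 for the Euro symbol and 90 30 81 30 for U+10000), and demo_range_mapper is a stand-in that knows a single sequence rather than a real range implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define GB_UNASSIGNED 0xFFFFFFFFu   /* hypothetical "no mapping" marker */

/* One explicit mapping: sequence bytes packed big-endian, and the code point. */
typedef struct { uint32_t gb; uint32_t unicode; } GbEntry;

/* Hypothetical hook for the code that maps four-byte sequences by range. */
typedef uint32_t (*GbRangeMapper)(uint32_t gb);

/* Binary search in an explicit table that is sorted by the gb field. */
static uint32_t lookup_explicit(const GbEntry *table, size_t n, uint32_t gb) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].gb == gb) return table[mid].unicode;
        if (table[mid].gb < gb) lo = mid + 1; else hi = mid;
    }
    return GB_UNASSIGNED;
}

/* Explicit mappings first; ranges only for unassigned four-byte input. */
uint32_t gb_to_unicode(const GbEntry *table, size_t n, GbRangeMapper ranges,
                       uint32_t gb, int length) {
    uint32_t u = lookup_explicit(table, n, gb);
    if (u == GB_UNASSIGNED && length == 4 && ranges != NULL) {
        u = ranges(gb);              /* may still be GB_UNASSIGNED */
    }
    return u;
}

/* Stand-in range code: knows only 90 30 81 30, which maps to U+10000. */
uint32_t demo_range_mapper(uint32_t gb) {
    return gb == 0x90308130u ? 0x10000u : GB_UNASSIGNED;
}
```

Packing the bytes into one integer works here because four-byte sequences start with a lead byte of at least 0x81, so their packed values cannot collide with those of one-byte and two-byte sequences.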

Handling the one range for the supplementary Unicode code points algorithmically eliminates all non-BMP Unicode code point mappings from the actual mapping table.

In principle, it is possible to handle all mappings involving four-byte sequences algorithmically by extracting all of them as contiguous ranges. Some of these will only contain a single mapping. Doing this would slow down the conversion for four-byte sequences but would allow the remaining mapping table to contain only mappings between single-byte and double-byte GB 18030 sequences and Unicode BMP code points. The remaining mapping table would contain only about 24,000 entries.

Algorithm for mapping contiguously-enumerated mappings between GB 18030 and Unicode
The following is an example of an algorithm for mapping between GB 18030 and Unicode within a contiguously-enumerated range of the mapping specification. Code snippets are pseudo-code. It is possible to implement this algorithm in a general way, storing the range information alongside the mapping table. Currently, however, GB 18030 is the only codepage where this algorithm is really useful, if not necessary.

Consider the following example for a range of enumerated mappings from the XML file (this range covers all supplementary Unicode code points):

    <range uFirst="10000" uLast="10FFFF"
           bFirst="90 30 81 30" bLast="E3 32 9A 35"
           bMin="81 30 81 30" bMax="FE 39 FE 39"/>

Note that all byte and code point values in the XML file are hexadecimal.

In order to handle GB 18030 four-byte sequences algorithmically, one needs to linearize them, i.e., generate a number for each four-byte sequence so that the difference between two such numbers is the same as the lexical difference between the byte sequences:

    // ordinal number of a four-byte sequence
    int linear(byte bytes[4]) {
        return ((bytes[0]*10+bytes[1])*126+bytes[2])*10+bytes[3];
    }

The factors 10 and 126 are the numbers of byte values in the byte positions according to bMin and bMax: 10 values 0x30..0x39 and 126 values 0x81..0xfe. The result of this function is an ordinal number that follows the lexical order of four-byte sequences.

Given a linear value for a byte sequence, the byte sequence itself can be calculated:

    byte[4] unLinear(int lin) {
        byte result[4];
        // zero-base the linear value by subtracting linear(bMin)
        lin-=linear({0x81, 0x30, 0x81, 0x30});
        result[3]=0x30+lin%10;  lin/=10;
        result[2]=0x81+lin%126; lin/=126;
        result[1]=0x30+lin%10;  lin/=10;
        result[0]=0x81+lin;
        return result;
    }

For each contiguously enumerated range, the following must be true: uLast-uFirst == linear(bLast)-linear(bFirst)
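This condition is easy to verify mechanically for each range that is read from the mapping file. The following C sketch restates the linear() pseudo-code with fixed-width integer types and checks the condition for one range. The function names gb_linear and gb_range_is_consistent are illustrative only, and the code assumes that the four bytes have already been validated.

```c
#include <stdint.h>

/* Ordinal number of a four-byte sequence, as in the linear() pseudo-code. */
uint32_t gb_linear(const uint8_t b[4]) {
    return ((b[0] * 10u + b[1]) * 126u + b[2]) * 10u + b[3];
}

/* Returns 1 if uLast-uFirst == linear(bLast)-linear(bFirst), else 0. */
int gb_range_is_consistent(uint32_t uFirst, uint32_t uLast,
                           const uint8_t bFirst[4], const uint8_t bLast[4]) {
    return (uLast - uFirst) == (gb_linear(bLast) - gb_linear(bFirst));
}
```

For the supplementary range shown above, both differences come out as 0xFFFFF, so the range passes the check.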

Mapping from a GB 18030 four-byte sequence to a Unicode code point:

    int mapToUnicode(byte bytes[4]) {
        int lin=linear(bytes);
        for each range {
            if(linear(bFirst)<=lin && lin<=linear(bLast)) {
                // range found
                return uFirst+(lin-linear(bFirst));
            }
        }
        // the byte sequence is not in any known range
        return error;
    }

Mapping from a Unicode code point to a GB 18030 four-byte sequence:

    byte[4] mapFromUnicode(int u) {
        for each range {
            if(uFirst<=u && u<=uLast) {
                // range found
                return unLinear(linear(bFirst)+(u-uFirst));
            }
        }
        // code point u is not in any known range
        return error;
    }
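As a compilable counterpart to the pseudo-code, the following C sketch puts both directions together. It is a minimal illustration under stated assumptions, not the ICU implementation: the GbRange type, the GB_NO_MAPPING marker, and the return conventions are invented for this example, the range table holds only the supplementary range quoted above, and the input bytes are assumed to form a structurally valid four-byte sequence.

```c
#include <stddef.h>
#include <stdint.h>

#define GB_NO_MAPPING 0xFFFFFFFFu    /* hypothetical error marker */

typedef struct {
    uint32_t uFirst, uLast;          /* Unicode code points, inclusive */
    uint8_t bFirst[4], bLast[4];     /* GB 18030 sequences, inclusive */
} GbRange;

/* Only the range quoted in the article: all supplementary code points. */
static const GbRange kRanges[] = {
    { 0x10000, 0x10FFFF, {0x90, 0x30, 0x81, 0x30}, {0xE3, 0x32, 0x9A, 0x35} },
};
static const size_t kRangeCount = sizeof kRanges / sizeof kRanges[0];

static uint32_t linear(const uint8_t b[4]) {
    return ((b[0] * 10u + b[1]) * 126u + b[2]) * 10u + b[3];
}

static void unLinear(uint32_t lin, uint8_t out[4]) {
    static const uint8_t bMin[4] = {0x81, 0x30, 0x81, 0x30};
    lin -= linear(bMin);             /* zero-base the linear value */
    out[3] = (uint8_t)(0x30 + lin % 10);  lin /= 10;
    out[2] = (uint8_t)(0x81 + lin % 126); lin /= 126;
    out[1] = (uint8_t)(0x30 + lin % 10);  lin /= 10;
    out[0] = (uint8_t)(0x81 + lin);
}

/* Returns the code point, or GB_NO_MAPPING if no range contains bytes. */
uint32_t mapToUnicode(const uint8_t bytes[4]) {
    uint32_t lin = linear(bytes);
    for (size_t i = 0; i < kRangeCount; ++i) {
        const GbRange *r = &kRanges[i];
        if (linear(r->bFirst) <= lin && lin <= linear(r->bLast)) {
            return r->uFirst + (lin - linear(r->bFirst));
        }
    }
    return GB_NO_MAPPING;
}

/* Writes the sequence to out and returns 1, or returns 0 if u is in no range. */
int mapFromUnicode(uint32_t u, uint8_t out[4]) {
    for (size_t i = 0; i < kRangeCount; ++i) {
        const GbRange *r = &kRanges[i];
        if (r->uFirst <= u && u <= r->uLast) {
            unLinear(linear(r->bFirst) + (u - r->uFirst), out);
            return 1;
        }
    }
    return 0;
}
```

A real converter would likely precompute linear(bFirst) and linear(bLast) once per range instead of recomputing them on every call. They are recomputed here only to stay close to the pseudo-code.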

An example implementation of the techniques and algorithms discussed here can be found in ICU's ucnvmbcs.c. (See the license.)

Conclusion and outlook
This article has explained the history and the structure of the new Chinese codepage standard GB 18030-2000, which must be implemented in future applications that are marketed for China. It has also discussed the standard's unusual features and challenges and has presented suggestions for dealing with them.

With the release of a mapping table by the Chinese standards agency and the adoption of this mapping table by the software industry, there is a rare chance for a consistent industry-wide implementation of a codepage standard.

The standard has been modified since its publication. A new mapping table was released in November of 2000, and the text of the standard is expected to be republished in March of 2001. The date after which newly-released software must support GB 18030 has been moved to September 1, 2001.

Resources


About the author

Markus Scherer is a software engineer and Unicode expert in IBM's Unicode Technology Group in Cupertino, California. He is currently leading the development of the C/C++ library of the International Components for Unicode (ICU), an open source Unicode library. Before that, he worked on IBM projects for Wireless and Mobile Computing, including GUIs, Translation, and Internationalization, in his native Germany and in North Carolina. Markus can be reached at markus.scherer@us.ibm.com.