Unicode – Characters (Part 6)




The ever-increasing number of additional characters meant that Unicode had to be expanded again and again. The set of characters is now divided into several levels (“planes”). The practical advantage is that each of the now 17 levels can hold a fixed number of additional characters (65,536 per level with 16-bit coding). In total, this arrangement yields significantly more possibilities for character encoding, namely 1,114,112.
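The arithmetic behind this figure can be checked in a few lines. The following is only a small sketch; the variable names are chosen here for illustration and do not come from the Unicode standard.

```python
# 17 levels (planes), each addressed with 16 bits.
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16  # 65,536 positions per level

total = PLANES * CODE_POINTS_PER_PLANE
highest = total - 1  # numbering starts at 0

print(total)         # 1114112
print(hex(highest))  # 0x10ffff
```

The highest position, 10FFFF in hexadecimal notation, is therefore the last one on level 16.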

The different levels of Unicode

So far, only levels 0 to 2 are important for the representation of non-European characters. Level 0 (“Basic Multilingual Plane [BMP]”) contains, for example, the Latin characters, as well as Greek, Hebrew, Arabic, Ethiopic, N’Ko, Tifinagh, Vai and Bamum (see overview). The first 256 characters of this level, which mainly cover the Latin characters and special characters, can still be represented with only 8 bits (1 byte) per character. For all higher-order characters, the binary numbers become longer as the “numbering” increases, and more bits are needed. The characters of this level are therefore usually encoded with 16 bits (2 bytes). With 16 bits, you can represent a maximum of 65,536 different characters, and level 0 has exactly that many positions.

Level 1 (“Supplementary Multilingual Plane [SMP]”) contains further characters of ancient or non-European writing systems, such as ancient Greek and Coptic numerals, Osmanya, Meroitic hieroglyphs, Old South Arabian, Egyptian hieroglyphs, additions to Bamum, Bassa Vah and Medefaidrin (see overview). At this level, even 16 bits are no longer sufficient for the ever longer binary numbers of the characters. In practice, one-to-one representations of these binary numbers are made via 32-bit encodings (4 bytes).
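The relationship between a character's number, its level and the bits it requires can be illustrated with a short sketch. The helper names `plane_of` and `bits_needed` are made up for this example; only `ord` and `int.bit_length` are standard Python.

```python
def plane_of(ch: str) -> int:
    """Level (plane) of a character: its number divided by 65,536."""
    return ord(ch) >> 16


def bits_needed(ch: str) -> int:
    """Number of binary digits needed to write the character's number."""
    return ord(ch).bit_length()


samples = [
    "A",           # U+0041, Latin
    "\u03a9",      # U+03A9, Greek capital omega
    "\u07ca",      # U+07CA, N'Ko letter A
    "\U00013000",  # U+13000, Egyptian hieroglyph
    "\U00020000",  # U+20000, first ideograph on level 2
]

for ch in samples:
    print(f"U+{ord(ch):04X}: level {plane_of(ch)}, {bits_needed(ch)} bits")
```

The first three samples lie on level 0 and fit into 16 bits, while the last two need 17 and 18 bits respectively.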

Level 2 (“Supplementary Ideographic Plane [SIP]”) contains mainly Chinese and Japanese characters. The remaining levels (3 to 16) of the 17 in total are either not yet assigned or contain, alongside a number of vacant positions, mainly tags, variation selectors and private use areas.

Further details on the different levels can be found in the description by the Unicode Consortium (Unicode 11.0.0, pp. 44-52).

Too many bits? – Unicode Transformation Formats (UTF)

If you wanted to represent all previously “collected” characters one-to-one in a single system without causing confusion, you would have to convert all existing data at every level uniformly to 32 bits. (Smaller numbers would then have to be padded with leading zeros so that they also reach 32 digits.) This, however, would create further problems. For one thing, it would often consume too much storage space; for another, older documents or websites that have not undergone such a conversion would not be compatible with the new form of representation.
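The cost of a uniform 32-bit representation can be made visible with a small sketch. It uses Python's built-in "utf-32-be" and "latin-1" codecs; the sample word is arbitrary and chosen only for illustration.

```python
# A single Latin letter padded to 32 bits: three zero bytes, then the value.
padded = "A".encode("utf-32-be")
print(padded.hex())  # 00000041

# The same short text stored with 8 bits and with 32 bits per character.
text = "Unicode"
size_8_bit = len(text.encode("latin-1"))
size_32_bit = len(text.encode("utf-32-be"))
print(size_8_bit, size_32_bit)  # 7 28
```

A text that consists only of Latin letters thus takes four times the space once every character is stored with 32 bits.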

The solution is therefore sought the other way round: one tries to shorten the long 32-bit representations by re-encoding them. There are several conversion options for this, the so-called “Unicode transformation formats (UTF)” (Unicode 11.0.0, p. 121). “UTF-32” keeps the full 32 bits per character one-to-one and is mainly used where storage space matters less than simple, fixed-width processing. “UTF-16” transforms the data into 16-bit units: characters of level 0 take one unit, while characters of the higher levels take two units (a “surrogate pair”). “UTF-8” (the usual format on the Internet) transforms all values, whether originally 32-bit or 16-bit, into sequences of 8-bit units; depending on the character, one to four such bytes are needed, and the first 128 characters (ASCII) take just one byte each. Such UTF values represent secondary encodings and should not be confused with the “original” 16- or 8-bit representations found on level 0 (Unicode 11.0.0, pp. 35-39).
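How much space the three formats need for the same character can be compared directly. The sketch below relies on Python's built-in codecs (the "-be" variants are used so that no byte order mark is added); the helper `encoded_lengths` and the sample characters are assumptions made for this illustration.

```python
def encoded_lengths(ch: str) -> tuple:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    codecs = ("utf-8", "utf-16-be", "utf-32-be")
    return tuple(len(ch.encode(codec)) for codec in codecs)


samples = {
    "A": "Latin capital A, level 0",
    "\u03a9": "Greek capital omega, level 0",
    "\u20ac": "euro sign, level 0",
    "\U00013000": "Egyptian hieroglyph, level 1",
}

for ch, label in samples.items():
    u8, u16, u32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X} ({label}): "
          f"UTF-8 {u8}, UTF-16 {u16}, UTF-32 {u32} bytes")
```

UTF-32 always needs 4 bytes. UTF-8 grows from 1 to 4 bytes as the character numbers increase, and UTF-16 needs 2 bytes on level 0 and 4 bytes above it.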