Unicode – Characters (Part 1)

If you work a lot with foreign languages, you may want to enter characters from other writing systems into your word processor, whether Arabic characters, special characters derived from Latin, African writing systems like Adlam or Bamum, Ethiopian characters, or Meroitic hieroglyphs. This is not always easy, since the software preinstalled on many devices is usually not set up for such requirements. The Unicode standard, developed since 1991 for the universal encoding of characters and symbols, offers helpful ways to display non-European characters digitally as well.

The diversity of writing systems and their digitization

Since 1963 (last updated in 1986), the “American Standard Code for Information Interchange” (ASCII) has been used in the US for texts based on the Latin script. It originally contained only 128 characters. ASCII was initially developed for English-speaking users, but there was soon a need for further special characters (including those for several European languages), and 8-bit extensions with 256 characters were therefore introduced from 1981.

Despite such possibilities of embedding diacritical and other special characters, ASCII was not designed for writing systems outside the European and Anglo-American realms. The system quickly reached its limits when it came to representing non-European writing systems because of its limited number of code positions. Some countries, such as Japan, instead developed their own computer encodings for their writing systems. The problem, however, was that such solutions were often incompatible with other systems.

Not least for this reason, the International Organization for Standardization (ISO) tried to establish internationally valid standards (“ISO standards”) for character encoding. The Unicode initiative launched a major international standardization effort to digitally encode writing systems, which is also aligned with the ISO standards. The latest Unicode version, 11.0.0 of 2018, for example, matches the specifications of Amendment 1 to ISO/IEC 10646:2017 (the fifth version of ISO/IEC 10646) and contains some characters of Amendment 2. In some respects, however, Unicode goes beyond the ISO specifications.

Unicode – An attempt at standardization

The Unicode consortium, founded in 1991, is made up primarily of well-known companies such as Microsoft, Apple, IBM, Google and Adobe Systems, but also of individuals and liaison members. Despite the commercial base of its main members, the consortium is “a non-profit Public Benefit Corporation and is not organized for the private gain of any person”; its sole purpose is to promote public, standardized character encodings that will in future enable the use of computers in all languages of the world:

This Corporation is a nonprofit Public Benefit Corporation and is not organized for the private gain of any person. It is organized under the Nonprofit Public Benefit Corporation Law for public and charitable purposes. This Corporation’s specific purpose shall be to enable people around the world to use computers in any language, by providing freely-available specifications and data to form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of this purpose is to standardize, maintain, educate and engage academic and scientific communities, and the general public about, make publicly available, promote, and disseminate to the public a standard character encoding that provides for an allocation for more than a million characters.

(Unicode consortium, Bylaws, Article 1 (1), 2015, p. 5)

The practices and principles of the Unicode consortium conform with these charitable goals (see also the consortium guidelines for cases of conflict of interest and its whistleblower policy).

Although the development of Unicode initially focused mainly on Latin writing systems, this policy soon changed. The standard was extended more and more to non-European languages and writing systems. This process is ongoing, as the long version “history” of Unicode shows. The latest Unicode version, 11.0.0, was released in June 2018. The goal of the constant evolution of the project is the global standardization of all old and new European and non-European characters and symbols into a single universal code, the “Unicode”.

The Unicode character set: Binary and decimal codes

Why do characters and symbols have to be encoded in the first place? The answer is that they must be turned into numerical codes so that the computer can read and process them at all. After all, computers can only process binary codes (i.e. those based on combinations of zeros and ones). How do you “tell” the computer the visual image of a glyph, with all its curves, strokes, or hooks, in the numerical form of zeros and ones?

It is much easier to convert numbers into binary codes than graphic images. If you “collect” all known characters within a single character set, and then consecutively “number” them all the way through, you get a numerical value for the “code position” (the so-called “code point”) of each character. This number is easy to convert back and forth into binary code. The computer can easily process these codes, and then unambiguously refer back to the desired character.
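The round trip from character to number to binary code and back can be tried out in a few lines of Python. This is only a minimal sketch: the helper names `to_binary` and `from_binary` are made up for this illustration, while `ord`, `chr`, `format` and `int` are Python built-ins.

```python
def to_binary(char: str) -> str:
    """Return the code point of a single character as a string of zeros and ones."""
    return format(ord(char), "b")


def from_binary(bits: str) -> str:
    """Turn a string of zeros and ones back into the character it stands for."""
    return chr(int(bits, 2))


for char in ["A", "z", "\u0628"]:  # "\u0628" is the Arabic letter beh
    code_point = ord(char)         # the decimal position of the character
    bits = to_binary(char)         # the same number written in binary
    print(char, code_point, bits, from_binary(bits))
# first line printed: A 65 1000001 A
```

Note that this shows only the bare number in binary; how that number is actually stored in a file depends on the encoding form used (such as UTF-8), which is not covered here.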

Since the decimal system (with the digits 0-9) is the most common in Europe and North America, all the “collected” characters of the world for which digital graphic representations or illustrations were already available were consecutively “numbered” in the decimal system in ascending order, starting from 0; the standard currently comprises over 137,000 characters. For each character with a specific code position, a digital graphic image serves as an example, which as an abstractly conceived “ideal image” is representative of numerous other drawn, painted, carved, printed or digitized representations of the same character. (The description here somewhat condenses the more complex structure of Unicode; graphic characters are often composed of a sequence of several individual characters or parts of characters, each of which may have its own position number, see Unicode 11.0.0, pp. 23, 29, 38.) The (example) images of the characters are listed in the various Unicode tables, and each of them is assigned the corresponding number of its position, its code point.
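The remark that one visible character can consist of several code points can be illustrated with the letter “é”. The following sketch uses Python's standard `unicodedata` module; the choice of “é” as the example is ours and is not taken from the Unicode text cited above.

```python
import unicodedata

precomposed = "\u00e9"  # "é" stored as one single code point
# NFD splits the letter into a base letter plus a combining accent.
decomposed = unicodedata.normalize("NFD", precomposed)

print([ord(c) for c in precomposed])  # [233]
print([ord(c) for c in decomposed])   # [101, 769]

# NFC puts the two parts back together into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Both variants look the same on screen, but the second consists of the plain letter “e” (101) followed by a combining acute accent (769).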

Unicode in the decimal system: The code points of the signs

The “numbering” of the positions of the individual characters in a fixed order thus yields decimal codes that work, in a way, like directory keys, by means of which one can find and unambiguously identify individual characters. Of course, the large number of collected characters does not make it easy to memorize their assigned decimal codes. It was therefore advisable, as far as possible, to arrange the characters pre-sorted in blocks for the various writing systems, such as “Latin”, “Greek” or “Arabic”, and even better, to place them within each block in an easily remembered order, such as the alphabetical order of the letters A to Z.

The Unicode position keys, or code points, do follow this logic. For example, in Unicode the positions from 0 to 31 are reserved for control characters (system commands). The following positions from 32 to 879 then contain various types of Latin (or Latin-derived) characters and special characters, such as diacritics or the International Phonetic Alphabet (IPA). The higher positions are then assigned to the non-European writing systems, each organized in contiguous “blocks”, such as “Hebrew” with the decimal codes 1424-1535, or “Ethiopic” with the decimal codes 4608-4991.

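Example: The basic Latin alphabet lies between positions 65 and 122, beginning with the capital letter “A” at position 65 and ending with the lowercase letter “z” at position 122. The uppercase letters occupy positions 65 to 90 and the lowercase letters positions 97 to 122; the six positions in between (91 to 96) hold punctuation marks. Uppercase letters are thus distinguished from lowercase letters; each has its own decimal code.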
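These positions are easy to check yourself. The following Python sketch uses only the built-ins `ord` and `chr` and the standard `string` module; the variable names are chosen for this illustration.

```python
import string

# Decimal code of every basic Latin uppercase and lowercase letter.
upper = {letter: ord(letter) for letter in string.ascii_uppercase}
lower = {letter: ord(letter) for letter in string.ascii_lowercase}

# The characters sitting between "Z" and "a".
between = "".join(chr(n) for n in range(91, 97))

print(upper["A"], upper["Z"])  # 65 90
print(lower["a"], lower["z"])  # 97 122
print(between)                 # [\]^_`
print(ord("a") - ord("A"))     # 32
```

Each lowercase letter sits exactly 32 positions after its uppercase counterpart, which makes the order easy to remember.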

African writing systems in the Unicode character code charts (detail). Unicode 11.0.0, 2018, **www.unicode.org**.

Although the various writing systems are mostly arranged in blocks, characters of the same writing system can occasionally appear in different sections or ranges of positions. This is because, in practice, Unicode was extended chronologically, that is, code positions depended on the time at which the characters were incorporated into the Unicode system. While in some writing systems block positions were prudently kept free for future additions (newly added characters are then sorted into their corresponding block; for the more recent practice, see Unicode 11.0.0, pp. 46-47), other blocks were already full and the positions following them already taken. In these cases, a new block was added for later additions, often called “supplement” or “extension” (for example, “Greek” with the decimal codes 880-1023, plus “Greek Extended” with the decimal codes 7936-8191). For some writing systems, there are several such supplementary blocks.

Example: The Arabic script is distributed in Unicode across several complementary blocks: “Arabic” with the decimal codes 1536-1791, “Arabic Supplement” with the decimal codes 1872-1919, “Arabic Extended-A” with the decimal codes 2208-2303, “Arabic Presentation Forms-A” with the decimal codes 64336-65023, “Arabic Presentation Forms-B” with the decimal codes 65136-65279, as well as across other blocks containing numerical and mathematical symbols.
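How such block ranges serve as a directory can be sketched with a small lookup table. The table below is only a selection, typed in by hand from the decimal ranges quoted in this article; it is not the official Unicode block list, and the names `BLOCKS` and `block_of` are invented for this illustration.

```python
from typing import Optional

# (first position, last position, block name), in decimal.
# A hand-picked selection of the ranges mentioned in this article.
BLOCKS = [
    (880, 1023, "Greek"),
    (1424, 1535, "Hebrew"),
    (1536, 1791, "Arabic"),
    (1872, 1919, "Arabic Supplement"),
    (2208, 2303, "Arabic Extended-A"),
    (4608, 4991, "Ethiopic"),
    (7936, 8191, "Greek Extended"),
    (64336, 65023, "Arabic Presentation Forms-A"),
]


def block_of(char: str) -> Optional[str]:
    """Return the block a character falls into, or None if it is not in the table."""
    code_point = ord(char)
    for first, last, name in BLOCKS:
        if first <= code_point <= last:
            return name
    return None


print(block_of("\u0628"))  # Arabic letter beh, position 1576 -> Arabic
print(block_of("\u1f00"))  # Greek alpha with breathing mark, position 7936 -> Greek Extended
print(block_of("A"))       # position 65 is not in this small table -> None
```

The second example shows the effect of later additions: a Greek letter with a diacritic lands far away from the main “Greek” block, in “Greek Extended”.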

If you know these decimal codes, you can enter the relevant characters on your own computer via the keyboard.

Continue to Unicode Characters Part 2.

VAD
