Page 42 - ARM 64 Bit Assembly Language

Table 1.9: UTF-8 encoding of the ISO/IEC 10646 code points.
UCS bits  First code point  Last code point  Bytes  Byte 1    Byte 2    Byte 3    Byte 4
7         U+0000            U+007F           1      0xxxxxxx
11        U+0080            U+07FF           2      110xxxxx  10xxxxxx
16        U+0800            U+FFFF           3      1110xxxx  10xxxxxx  10xxxxxx
21        U+10000           U+10FFFF         4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

for a Universal Character Set (UCS). This standard is known as the ISO/IEC 10646 standard.
The ISO/IEC 10646 standard defines the mapping of code points (numbers) to glyphs
(characters), but does not specify character collation or other language-dependent
properties. UCS code points are commonly written in the form U+XXXX, where XXXX is the
numerical code point in hexadecimal. For example, the code point for the ASCII DEL
character would be written as U+007F. Unicode extends the ISO/IEC standard and specifies
language-specific features.
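
The U+XXXX notation can be produced with an ordinary formatted print. The following is a
minimal sketch in C; the function name format_code_point is hypothetical and is used here
only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from the text): write cp into buf in U+XXXX
 * notation, using at least four hexadecimal digits. Returns the number of
 * characters the full string needs, as snprintf does. */
int format_code_point(uint32_t cp, char *buf, size_t size)
{
    assert(buf != NULL);
    return snprintf(buf, size, "U+%04lX", (unsigned long)cp);
}
```

With a sufficiently large buffer, passing 0x7F (the ASCII DEL character) yields the string
U+007F. Code points above U+FFFF simply use five or six hexadecimal digits.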

Originally, Unicode was designed as a 16-bit encoding. It was not fully backward-compatible
with ASCII, and could encode only 65,536 code points. Eventually, the Unicode character set
grew to encompass 1,112,064 code points, which requires 21 bits per character for a
straightforward binary encoding. By early 1992, it was clear that some clever and efficient
method for encoding character data was needed.
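
The bit counts quoted above can be checked with a short calculation. The sketch below
assumes a hypothetical helper, bits_needed, which finds the smallest number of bits that
gives every one of a given number of code points a distinct binary value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the text): smallest b such that 2^b >= count,
 * i.e. enough bits to assign each of `count` code points a distinct value. */
unsigned bits_needed(uint32_t count)
{
    assert(count > 0);
    unsigned bits = 0;
    /* 64-bit arithmetic so the shift cannot overflow for any 32-bit count. */
    while (((uint64_t)1 << bits) < count)
        bits++;
    return bits;
}
```

For 65,536 code points the function returns 16, and for 1,112,064 code points it returns
21, because 2^20 = 1,048,576 is too small while 2^21 = 2,097,152 is large enough.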

UTF-8 (UCS Transformation Format-8-bit) was proposed and accepted as a standard in 1993.
UTF-8 is a variable-width encoding that can represent every character in the Unicode
character set using between one and four bytes. It was designed to be backward compatible
with ASCII and to avoid the major issues of previous encodings. Code points in the Unicode
character set with lower numerical values tend to occur more frequently than code points
with higher numerical values. UTF-8 encodes frequently occurring code points with fewer
bytes than those which occur less frequently. For example, the first 128 characters of the
UTF-8 encoding are exactly the same as the ASCII characters, requiring only seven bits to
encode each ASCII character. Thus any valid ASCII text is also valid UTF-8 text. UTF-8 is
now the most common character encoding for the World Wide Web, and is the recommended
encoding for email messages.
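
The byte layouts in Table 1.9 translate directly into shift and mask operations. The
following C function is a minimal sketch of an encoder under stated assumptions: the name
utf8_encode and its interface are hypothetical, and for simplicity the function does not
reject the surrogate range U+D800 to U+DFFF, which a production encoder would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the text): encode one code point as UTF-8
 * using the bit layouts of Table 1.9. Writes up to four bytes to out and
 * returns the number of bytes written, or 0 if cp is above U+10FFFF.
 * Simplification: surrogates (U+D800..U+DFFF) are not rejected. */
size_t utf8_encode(uint32_t cp, uint8_t *out)
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* 7 bits:  0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 16 bits: 1110xxxx + 2 continuation bytes */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 21 bits: 11110xxx + 3 continuation bytes */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the range covered by Table 1.9 */
}
```

As a check, U+007F encodes to the single byte 7F, U+00E9 to the two bytes C3 A9, U+20AC to
the three bytes E2 82 AC, and U+1F600 to the four bytes F0 9F 98 80. Testing the ranges
from smallest to largest means each code point receives the shortest encoding that can
hold it.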

In November 2003, UTF-8 was restricted by RFC 3629 to end at code point 10FFFF₁₆. This
gives UTF-8 a range of 1,114,112 code point values (0 through 1,114,111), which is slightly
more than the 1,112,064 code points defined in the ISO/IEC 10646 standard. Table 1.9 shows
how ISO/IEC 10646 code points are mapped to a variable-length encoding in UTF-8. Note that
the encoding allows each byte in a stream of bytes to be placed in one of three distinct
categories as follows:
