Page 43 - ARM 64 Bit Assembly Language
1. If the most significant bit of a byte is zero, then it is a single-byte character, and is
completely ASCII-compatible.
2. If the two most significant bits in a byte are set to one, then the byte is the beginning of a
multi-byte character.
3. If the most significant bit is set to one, and the next bit is set to zero, then the byte is part
of a multi-byte character, but is not the first byte in that sequence.
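The three rules above can be sketched as a small classifier. This is only an illustration: the function name classify_byte and the labels it returns are assumptions made for this example, not part of any standard.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0..255) by its most significant bits."""
    if b & 0x80 == 0x00:       # 0xxxxxxx: rule 1, single-byte (ASCII) character
        return "single"
    if b & 0xC0 == 0xC0:       # 11xxxxxx: rule 2, first byte of a multi-byte character
        return "lead"
    return "continuation"      # 10xxxxxx: rule 3, a later byte of a multi-byte character


# 'a' is one byte (61); the pound sign is two bytes (C2 A3).
print([classify_byte(b) for b in "a£".encode("utf-8")])
# ['single', 'lead', 'continuation']
```

Only the top two bits are examined, so the test costs one or two mask-and-compare operations per byte.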
The UTF-8 encoding of the UCS characters has several important features:
Backwards compatible with ASCII: This allows the vast number of existing ASCII
documents to be interpreted as UTF-8 documents without any conversion.
Self synchronization: Because lead bytes and continuation bytes have distinct bit patterns,
it is possible to find the beginning of each character by looking only at the top two bits
of each byte. This can have important performance implications when performing searches
in text.
Encoding of code sequence length: The number of bytes in the sequence is indicated by the
pattern of bits in the first byte of the sequence. Thus, the beginning of the next character
can be found quickly. This feature can also have important performance implications
when performing searches in text.
Efficient code structure: UTF-8 efficiently encodes the UCS code points. The high-order
bits of the code point go in the lead byte. Lower-order bits are placed in continuation
bytes. The number of bytes in the encoding is the minimum required to hold all the
significant bits of the code point.
Easily extended to include new languages: This feature will be greatly appreciated when
we contact intelligent species from other star systems.
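The self-synchronization and sequence-length features can be illustrated with a short sketch. The helper names char_start and sequence_length are hypothetical. The lead-byte patterns for three-byte and four-byte sequences (1110xxxx and 11110xxx) are taken from the standard UTF-8 definition rather than from the text above, and the sketch assumes its input is well-formed UTF-8.

```python
def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the first byte of the character that contains it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i


def sequence_length(lead: int) -> int:
    """Read the number of bytes in a sequence from the bit pattern of its first byte."""
    if lead & 0x80 == 0x00:    # 0xxxxxxx
        return 1
    if lead & 0xE0 == 0xC0:    # 110xxxxx
        return 2
    if lead & 0xF0 == 0xE0:    # 1110xxxx
        return 3
    if lead & 0xF8 == 0xF0:    # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


data = "a€b".encode("utf-8")        # bytes: 61 E2 82 AC 62
start = char_start(data, 3)          # index 3 (AC) lies in the middle of the euro sign
print(start, sequence_length(data[start]))
# 1 3
```

Starting from an arbitrary byte offset, the search backs up at most three bytes to find a character boundary, and the next boundary follows from the lead byte alone without inspecting the continuation bytes.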
With UTF-8 encoding, the first 128 characters of the UCS are each encoded in a single byte.
The next 1,920 characters require two bytes to encode. The two-byte encoding covers almost
all Latin alphabets, and also Arabic, Armenian, Cyrillic, Coptic, Greek, Hebrew, Syriac, and
Tāna alphabets. It also includes combining diacritical marks, which are used in combination
with another character, such as á, ñ, and ö. Most of the Chinese, Japanese, and Korean (CJK)
characters are encoded using three bytes. Four bytes are needed for the less common CJK
characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
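The size classes described above follow directly from the code point ranges. The sketch below is a minimal illustration; the function name utf8_length is an assumption for this example, and the upper limit of 10FFFF₁₆ is the standard Unicode maximum rather than a figure stated in the text. Surrogate code points are not treated specially here.

```python
def utf8_length(cp: int) -> int:
    """Return the number of bytes UTF-8 needs for the code point cp."""
    if cp < 0x80:          # first 128 code points
        return 1
    if cp < 0x800:         # next 0x800 - 0x80 = 1,920 code points
        return 2
    if cp < 0x10000:       # rest of the 16-bit range, including most CJK characters
        return 3
    if cp <= 0x10FFFF:     # everything above FFFF, including emoji
        return 4
    raise ValueError("code point out of range")


for ch in ["A", "ñ", "中", "😀"]:
    print(f"U+{ord(ch):04X}", utf8_length(ord(ch)))
# U+0041 1
# U+00F1 2
# U+4E2D 3
# U+1F600 4
```

The results can be checked against Python's own encoder by comparing utf8_length(ord(ch)) with len(ch.encode("utf-8")).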
Consider the UTF-8 encoding for the British Pound symbol (£), which is UCS code point
U+00A3. Since the code point is greater than 7F₁₆, but less than 800₁₆, it will require two
bytes to encode. The encoding will be 110xxxxx 10xxxxxx, where the x characters are
replaced with the 11 least-significant bits of the code point, which are 00010100011. Thus, the
character £ is encoded in UTF-8 as 11000010 10100011 in binary, or C2 A3 in hexadecimal.
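The same arithmetic can be reproduced with a few shifts and masks. This is a sketch for the two-byte case only, and the function name encode_two_byte is hypothetical.

```python
def encode_two_byte(cp: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point does not use the two-byte form")
    lead = 0xC0 | (cp >> 6)      # 110xxxxx: upper 5 of the 11 significant bits
    cont = 0x80 | (cp & 0x3F)    # 10xxxxxx: lower 6 bits
    return bytes([lead, cont])


print(format(0x00A3, "011b"))                  # 00010100011
encoded = encode_two_byte(0x00A3)
print(" ".join(f"{b:08b}" for b in encoded))   # 11000010 10100011
print(encoded.hex(" ").upper())                # C2 A3
```

The result matches "£".encode("utf-8") from Python's built-in encoder.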
The UCS code point for the Euro symbol (€) is U+20AC. Since the code point is between
800₁₆ and FFFF₁₆, it will require three bytes to encode in UTF-8. The three-byte encoding