Page 43 - ARM 64 Bit Assembly Language

26 Chapter 1

                  1. If the most significant bit of a byte is zero, then it is a single-byte character, and is com-
                     pletely ASCII-compatible.
                  2. If the two most significant bits in a byte are set to one, then the byte is the beginning of a
                     multi-byte character.
                  3. If the most significant bit is set to one, and the next bit is set to zero, then the byte is part
                     of a multi-byte character, but is not the first byte in that sequence.
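
The three rules above can be sketched in C as a small classifier. This is only an illustration: the names `utf8_byte_kind` and `utf8_classify` are hypothetical, and the function applies just the three rules as stated, so it does not reject byte values that never occur in valid UTF-8.

```c
#include <stdint.h>

/* Hypothetical names, chosen for this illustration only. */
enum utf8_byte_kind {
    UTF8_SINGLE,        /* 0xxxxxxx: single-byte, ASCII-compatible character */
    UTF8_LEAD,          /* 11xxxxxx: first byte of a multi-byte character */
    UTF8_CONTINUATION   /* 10xxxxxx: later byte of a multi-byte character */
};

/* Classify one byte by looking only at its top two bits. */
enum utf8_byte_kind utf8_classify(uint8_t b)
{
    if ((b & 0x80) == 0x00)     /* rule 1: most significant bit is zero */
        return UTF8_SINGLE;
    if ((b & 0xC0) == 0xC0)     /* rule 2: two most significant bits are one */
        return UTF8_LEAD;
    return UTF8_CONTINUATION;   /* rule 3: top bits are 10 */
}
```

For example, `utf8_classify(0x41)` yields `UTF8_SINGLE`, while the two bytes C2 and A3 are classified as `UTF8_LEAD` and `UTF8_CONTINUATION`.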

                  The UTF-8 encoding of the UCS characters has several important features:
                  Backwards compatible with ASCII: This allows the vast number of existing ASCII docu-
                       ments to be interpreted as UTF-8 documents without any conversion.
                  Self-synchronization: Because lead bytes and continuation bytes use distinct bit patterns,
                       it is possible to find the beginning of each character by looking only at the top two
                       bits of each byte. This can have important performance implications when performing
                       searches in text.
                  Encoding of code sequence length: The number of bytes in the sequence is indicated by the
                       pattern of bits in the first byte of the sequence. Thus, the beginning of the next character
                       can be found quickly. This feature can also have important performance implications
                       when performing searches in text.
                  Efficient code structure: UTF-8 efficiently encodes the UCS code points. The high-order
                       bits of the code point go in the lead byte. Lower-order bits are placed in continuation
                       bytes. The number of bytes in the encoding is the minimum required to hold all the sig-
                       nificant bits of the code point.
                  Easily extended to include new languages: This feature will be greatly appreciated when
                       we contact intelligent species from other star systems.
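
Two of the features listed above, self-synchronization and the encoding of sequence length, can be sketched in C as follows. The function names are hypothetical. The three-byte and four-byte lead patterns (1110xxxx and 11110xxx) come from the general UTF-8 definition rather than from this passage, and the sketch assumes the buffer holds well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes in the sequence that begins with lead byte b.
   Returns 0 if b is a continuation byte or not a valid lead byte.
   (Hypothetical helper for illustration.) */
size_t utf8_seq_len(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Starting at index i, step backward over continuation bytes (10xxxxxx)
   until the first byte of the character containing s[i] is reached.
   Only the top two bits of each byte are examined. */
size_t utf8_char_start(const uint8_t *s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```

A search routine that lands in the middle of a character can call `utf8_char_start` to resynchronize, then add `utf8_seq_len` of the lead byte to reach the next character without decoding anything.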

                  With UTF-8 encoding, the first 128 characters of the UCS are each encoded in a single byte.
                  The next 1,920 characters require two bytes to encode. The two-byte encoding covers almost
                  all Latin alphabets, and also Arabic, Armenian, Cyrillic, Coptic, Greek, Hebrew, Syriac, and
                  Tāna alphabets. It also includes combining diacritical marks, which are used in combination
                  with another character, such as á, ñ, and ö. Most of the Chinese, Japanese, and Korean (CJK)
                  characters are encoded using three bytes. Four bytes are needed for the less common CJK
                  characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
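
The size ranges described above can be written as a short C function. This is a minimal sketch with a hypothetical name. The upper limit of 10FFFF₁₆ for four-byte encodings is taken from the Unicode code space rather than from this passage, and the function does not treat the surrogate range specially.

```c
#include <stdint.h>

/* Number of bytes needed to encode code point cp in UTF-8,
   or 0 if cp lies beyond the Unicode code space.
   (Hypothetical helper for illustration.) */
int utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)       return 1;  /* the first 128 characters */
    if (cp < 0x800)      return 2;  /* the next 1,920 characters */
    if (cp < 0x10000)    return 3;  /* most CJK characters */
    if (cp <= 0x10FFFF)  return 4;  /* less common CJK, historic scripts, emoji */
    return 0;
}
```

The count of 1,920 two-byte characters follows from the boundaries: 800₁₆ minus 80₁₆ is 2,048 minus 128.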

                  Consider the UTF-8 encoding for the British Pound symbol (£), which is UCS code point
                  U+00A3. Since the code point is greater than 7F₁₆ but less than 800₁₆, it will require two
                  bytes to encode. The encoding will be 110xxxxx 10xxxxxx, where the x characters are re-
                  placed with the 11 least-significant bits of the code point, which are 00010100011. Thus, the
                  character £ is encoded in UTF-8 as 11000010 10100011 in binary, or C2 A3 in hexadecimal.
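
The two-byte case worked through above can be sketched in C. The function name `utf8_encode2` and its return convention are assumptions made for this illustration, and it handles only code points that need exactly two bytes.

```c
#include <stdint.h>

/* Encode a code point in the range 0x80..0x7FF as 110xxxxx 10xxxxxx.
   Returns 0 on success, or -1 if cp does not need exactly two bytes.
   (Hypothetical helper for illustration.) */
int utf8_encode2(uint32_t cp, uint8_t out[2])
{
    if (cp < 0x80 || cp > 0x7FF)
        return -1;
    out[0] = (uint8_t)(0xC0 | (cp >> 6));    /* 110 + upper 5 of the 11 bits */
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));  /* 10  + lower 6 of the 11 bits */
    return 0;
}
```

Calling `utf8_encode2(0xA3, out)` stores C2 in `out[0]` and A3 in `out[1]`, matching the hand calculation for the £ symbol.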
                  The UCS code point for the Euro symbol (€) is U+20AC. Since the code point is between
                  800₁₆ and FFFF₁₆, it will require three bytes to encode in UTF-8. The three-byte encoding