Unicode Characters and UTF-8 Encoding
In this class, we discuss Unicode characters and the UTF-8 encoding scheme.
ASCII Characters
In our previous class, we discussed a string as a sequence of Unicode characters.
First, we look at ASCII characters. Then we move on to Unicode characters.
The ASCII character table from 0 to 127 is shown below.
Dec | Hex | Char | Name / Function | Dec | Hex | Char | Dec | Hex | Char | Dec | Hex | Char |
0 | 00 | NUL | Null | 32 | 20 | space | 64 | 40 | @ | 96 | 60 | ` |
1 | 01 | SOH | Start Of Heading | 33 | 21 | ! | 65 | 41 | A | 97 | 61 | a |
2 | 02 | STX | Start Of Text | 34 | 22 | " | 66 | 42 | B | 98 | 62 | b |
3 | 03 | ETX | End Of Text | 35 | 23 | # | 67 | 43 | C | 99 | 63 | c |
4 | 04 | EOT | End Of Transmission | 36 | 24 | $ | 68 | 44 | D | 100 | 64 | d |
5 | 05 | ENQ | Enquiry | 37 | 25 | % | 69 | 45 | E | 101 | 65 | e |
6 | 06 | ACK | Acknowledge | 38 | 26 | & | 70 | 46 | F | 102 | 66 | f |
7 | 07 | BEL | Bell | 39 | 27 | ' | 71 | 47 | G | 103 | 67 | g |
8 | 08 | BS | Backspace | 40 | 28 | ( | 72 | 48 | H | 104 | 68 | h |
9 | 09 | HT | Horizontal Tab | 41 | 29 | ) | 73 | 49 | I | 105 | 69 | i |
10 | 0A | LF | Line Feed | 42 | 2A | * | 74 | 4A | J | 106 | 6A | j |
11 | 0B | VT | Vertical Tab | 43 | 2B | + | 75 | 4B | K | 107 | 6B | k |
12 | 0C | FF | Form Feed | 44 | 2C | , | 76 | 4C | L | 108 | 6C | l |
13 | 0D | CR | Carriage Return | 45 | 2D | - | 77 | 4D | M | 109 | 6D | m |
14 | 0E | SO | Shift Out | 46 | 2E | . | 78 | 4E | N | 110 | 6E | n |
15 | 0F | SI | Shift In | 47 | 2F | / | 79 | 4F | O | 111 | 6F | o |
16 | 10 | DLE | Data Link Escape | 48 | 30 | 0 | 80 | 50 | P | 112 | 70 | p |
17 | 11 | DC1 | Device Control 1 | 49 | 31 | 1 | 81 | 51 | Q | 113 | 71 | q |
18 | 12 | DC2 | Device Control 2 | 50 | 32 | 2 | 82 | 52 | R | 114 | 72 | r |
19 | 13 | DC3 | Device Control 3 | 51 | 33 | 3 | 83 | 53 | S | 115 | 73 | s |
20 | 14 | DC4 | Device Control 4 | 52 | 34 | 4 | 84 | 54 | T | 116 | 74 | t |
21 | 15 | NAK | Negative Acknowledge | 53 | 35 | 5 | 85 | 55 | U | 117 | 75 | u |
22 | 16 | SYN | Synchronous Idle | 54 | 36 | 6 | 86 | 56 | V | 118 | 76 | v |
23 | 17 | ETB | End Transmit Block | 55 | 37 | 7 | 87 | 57 | W | 119 | 77 | w |
24 | 18 | CAN | Cancel | 56 | 38 | 8 | 88 | 58 | X | 120 | 78 | x |
25 | 19 | EM | End Of Medium | 57 | 39 | 9 | 89 | 59 | Y | 121 | 79 | y |
26 | 1A | SUB | Substitute | 58 | 3A | : | 90 | 5A | Z | 122 | 7A | z |
27 | 1B | ESC | Escape | 59 | 3B | ; | 91 | 5B | [ | 123 | 7B | { |
28 | 1C | FS | File Separator | 60 | 3C | < | 92 | 5C | \ | 124 | 7C | | |
29 | 1D | GS | Group Separator | 61 | 3D | = | 93 | 5D | ] | 125 | 7D | } |
30 | 1E | RS | Record Separator | 62 | 3E | > | 94 | 5E | ^ | 126 | 7E | ~ |
31 | 1F | US | Unit Separator | 63 | 3F | ? | 95 | 5F | _ | 127 | 7F | delete |
ASCII assigns a unique number to every character.
Each of a-z, A-Z, space, and so on has its own number.
For example, the character A has the value 65 in the ASCII table.
ASCII is a standard for representing characters. Why do we need that standard?
If everyone follows the same standard, it is easy to exchange information.
Example:
We want to send the text hello to another person.
The character h is converted to its ASCII number. The character e is converted to its ASCII number.
All characters are converted in this way. The person who receives the message also follows the ASCII standard.
So it is easy to work out which message has been received.
Computers don't understand characters; they store numbers. So A is stored as the number 65, which is 01000001 in binary.
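The conversion described in this example can be tried in Python with the built-in ord() and chr() functions. This is only a small sketch; the variable names are chosen for illustration.

```python
# Convert each character of a message to its ASCII number and back.
message = "hello"

# ord() gives the number assigned to a character.
codes = [ord(ch) for ch in message]
print(codes)                    # [104, 101, 108, 108, 111]

# The character A has the number 65; written with 8 binary digits it is 01000001.
print(ord("A"))                 # 65
print(format(ord("A"), "08b"))  # 01000001

# The receiver applies the same table in reverse with chr().
decoded = "".join(chr(code) for code in codes)
print(decoded)                  # hello
```

Because the sender and the receiver use the same table, the numbers map back to the original text.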
The last character is the delete character, with decimal value 127.
ASCII was later extended with the codes 128 to 255, which added accented Latin letters, currency signs, and other symbols.
Extended ASCII has a total of 256 different characters.
Extended ASCII is not a single standard. The table given below matches the Windows-1252 code page.
DEC | OCT | HEX | BIN | Symbol | HTML Number | HTML Name | Description |
---|---|---|---|---|---|---|---|
128 | 200 | 80 | 10000000 | € | € | € | Euro sign |
129 | 201 | 81 | 10000001 | ||||
130 | 202 | 82 | 10000010 | ‚ | ‚ | ‚ | Single low-9 quotation mark |
131 | 203 | 83 | 10000011 | ƒ | ƒ | ƒ | Latin small letter f with hook |
132 | 204 | 84 | 10000100 | „ | „ | „ | Double low-9 quotation mark |
133 | 205 | 85 | 10000101 | … | … | … | Horizontal ellipsis |
134 | 206 | 86 | 10000110 | † | † | † | Dagger |
135 | 207 | 87 | 10000111 | ‡ | ‡ | ‡ | Double dagger |
136 | 210 | 88 | 10001000 | ˆ | ˆ | ˆ | Modifier letter circumflex accent |
137 | 211 | 89 | 10001001 | ‰ | ‰ | ‰ | Per mille sign |
138 | 212 | 8A | 10001010 | Š | Š | Š | Latin capital letter S with caron |
139 | 213 | 8B | 10001011 | ‹ | ‹ | ‹ | Single left-pointing angle quotation |
140 | 214 | 8C | 10001100 | Œ | Œ | Œ | Latin capital ligature OE |
141 | 215 | 8D | 10001101 | ||||
142 | 216 | 8E | 10001110 | Ž | Ž | Ž | Latin capital letter Z with caron |
143 | 217 | 8F | 10001111 | ||||
144 | 220 | 90 | 10010000 | ||||
145 | 221 | 91 | 10010001 | ‘ | ‘ | ‘ | Left single quotation mark |
146 | 222 | 92 | 10010010 | ’ | ’ | ’ | Right single quotation mark |
147 | 223 | 93 | 10010011 | “ | “ | “ | Left double quotation mark |
148 | 224 | 94 | 10010100 | ” | ” | ” | Right double quotation mark |
149 | 225 | 95 | 10010101 | • | • | • | Bullet |
150 | 226 | 96 | 10010110 | – | – | – | En dash |
151 | 227 | 97 | 10010111 | — | — | — | Em dash |
152 | 230 | 98 | 10011000 | ˜ | ˜ | ˜ | Small tilde |
153 | 231 | 99 | 10011001 | ™ | ™ | ™ | Trade mark sign |
154 | 232 | 9A | 10011010 | š | š | š | Latin small letter s with caron |
155 | 233 | 9B | 10011011 | › | › | › | Single right-pointing angle quotation mark |
156 | 234 | 9C | 10011100 | œ | œ | œ | Latin small ligature oe |
157 | 235 | 9D | 10011101 | ||||
158 | 236 | 9E | 10011110 | ž | ž | ž | Latin small letter z with caron |
159 | 237 | 9F | 10011111 | Ÿ | Ÿ | Ÿ | Latin capital letter Y with diaeresis |
160 | 240 | A0 | 10100000 | | | | Non-breaking space |
161 | 241 | A1 | 10100001 | ¡ | ¡ | ¡ | Inverted exclamation mark |
162 | 242 | A2 | 10100010 | ¢ | ¢ | ¢ | Cent sign |
163 | 243 | A3 | 10100011 | £ | £ | £ | Pound sign |
164 | 244 | A4 | 10100100 | ¤ | ¤ | ¤ | Currency sign |
165 | 245 | A5 | 10100101 | ¥ | ¥ | ¥ | Yen sign |
166 | 246 | A6 | 10100110 | ¦ | ¦ | ¦ | Pipe, Broken vertical bar |
167 | 247 | A7 | 10100111 | § | § | § | Section sign |
168 | 250 | A8 | 10101000 | ¨ | ¨ | ¨ | Spacing diaeresis – umlaut |
169 | 251 | A9 | 10101001 | © | © | © | Copyright sign |
170 | 252 | AA | 10101010 | ª | ª | ª | Feminine ordinal indicator |
171 | 253 | AB | 10101011 | « | « | « | Left double angle quotes |
172 | 254 | AC | 10101100 | ¬ | ¬ | ¬ | Not sign |
173 | 255 | AD | 10101101 | | ­ | ­ | Soft hyphen |
174 | 256 | AE | 10101110 | ® | ® | ® | Registered trade mark sign |
175 | 257 | AF | 10101111 | ¯ | ¯ | ¯ | Spacing macron – overline |
176 | 260 | B0 | 10110000 | ° | ° | ° | Degree sign |
177 | 261 | B1 | 10110001 | ± | ± | ± | Plus-or-minus sign |
178 | 262 | B2 | 10110010 | ² | ² | ² | Superscript two – squared |
179 | 263 | B3 | 10110011 | ³ | ³ | ³ | Superscript three – cubed |
180 | 264 | B4 | 10110100 | ´ | ´ | ´ | Acute accent – spacing acute |
181 | 265 | B5 | 10110101 | µ | µ | µ | Micro sign |
182 | 266 | B6 | 10110110 | ¶ | ¶ | ¶ | Pilcrow sign – paragraph sign |
183 | 267 | B7 | 10110111 | · | · | · | Middle dot – Georgian comma |
184 | 270 | B8 | 10111000 | ¸ | ¸ | ¸ | Spacing cedilla |
185 | 271 | B9 | 10111001 | ¹ | ¹ | ¹ | Superscript one |
186 | 272 | BA | 10111010 | º | º | º | Masculine ordinal indicator |
187 | 273 | BB | 10111011 | » | » | » | Right double angle quotes |
188 | 274 | BC | 10111100 | ¼ | ¼ | ¼ | Fraction one quarter |
189 | 275 | BD | 10111101 | ½ | ½ | ½ | Fraction one half |
190 | 276 | BE | 10111110 | ¾ | ¾ | ¾ | Fraction three quarters |
191 | 277 | BF | 10111111 | ¿ | ¿ | ¿ | Inverted question mark |
192 | 300 | C0 | 11000000 | À | À | À | Latin capital letter A with grave |
193 | 301 | C1 | 11000001 | Á | Á | Á | Latin capital letter A with acute |
194 | 302 | C2 | 11000010 | Â | Â | Â | Latin capital letter A with circumflex |
195 | 303 | C3 | 11000011 | Ã | Ã | Ã | Latin capital letter A with tilde |
196 | 304 | C4 | 11000100 | Ä | Ä | Ä | Latin capital letter A with diaeresis |
197 | 305 | C5 | 11000101 | Å | Å | Å | Latin capital letter A with ring above |
198 | 306 | C6 | 11000110 | Æ | Æ | Æ | Latin capital letter AE |
199 | 307 | C7 | 11000111 | Ç | Ç | Ç | Latin capital letter C with cedilla |
200 | 310 | C8 | 11001000 | È | È | È | Latin capital letter E with grave |
201 | 311 | C9 | 11001001 | É | É | É | Latin capital letter E with acute |
202 | 312 | CA | 11001010 | Ê | Ê | Ê | Latin capital letter E with circumflex |
203 | 313 | CB | 11001011 | Ë | Ë | Ë | Latin capital letter E with diaeresis |
204 | 314 | CC | 11001100 | Ì | Ì | Ì | Latin capital letter I with grave |
205 | 315 | CD | 11001101 | Í | Í | Í | Latin capital letter I with acute |
206 | 316 | CE | 11001110 | Î | Î | Î | Latin capital letter I with circumflex |
207 | 317 | CF | 11001111 | Ï | Ï | Ï | Latin capital letter I with diaeresis |
208 | 320 | D0 | 11010000 | Ð | Ð | Ð | Latin capital letter ETH |
209 | 321 | D1 | 11010001 | Ñ | Ñ | Ñ | Latin capital letter N with tilde |
210 | 322 | D2 | 11010010 | Ò | Ò | Ò | Latin capital letter O with grave |
211 | 323 | D3 | 11010011 | Ó | Ó | Ó | Latin capital letter O with acute |
212 | 324 | D4 | 11010100 | Ô | Ô | Ô | Latin capital letter O with circumflex |
213 | 325 | D5 | 11010101 | Õ | Õ | Õ | Latin capital letter O with tilde |
214 | 326 | D6 | 11010110 | Ö | Ö | Ö | Latin capital letter O with diaeresis |
215 | 327 | D7 | 11010111 | × | × | × | Multiplication sign |
216 | 330 | D8 | 11011000 | Ø | Ø | Ø | Latin capital letter O with slash |
217 | 331 | D9 | 11011001 | Ù | Ù | Ù | Latin capital letter U with grave |
218 | 332 | DA | 11011010 | Ú | Ú | Ú | Latin capital letter U with acute |
219 | 333 | DB | 11011011 | Û | Û | Û | Latin capital letter U with circumflex |
220 | 334 | DC | 11011100 | Ü | Ü | Ü | Latin capital letter U with diaeresis |
221 | 335 | DD | 11011101 | Ý | Ý | Ý | Latin capital letter Y with acute |
222 | 336 | DE | 11011110 | Þ | Þ | Þ | Latin capital letter THORN |
223 | 337 | DF | 11011111 | ß | ß | ß | Latin small letter sharp s – ess-zed |
224 | 340 | E0 | 11100000 | à | à | à | Latin small letter a with grave |
225 | 341 | E1 | 11100001 | á | á | á | Latin small letter a with acute |
226 | 342 | E2 | 11100010 | â | â | â | Latin small letter a with circumflex |
227 | 343 | E3 | 11100011 | ã | ã | ã | Latin small letter a with tilde |
228 | 344 | E4 | 11100100 | ä | ä | ä | Latin small letter a with diaeresis |
229 | 345 | E5 | 11100101 | å | å | å | Latin small letter a with ring above |
230 | 346 | E6 | 11100110 | æ | æ | æ | Latin small letter ae |
231 | 347 | E7 | 11100111 | ç | ç | ç | Latin small letter c with cedilla |
232 | 350 | E8 | 11101000 | è | è | è | Latin small letter e with grave |
233 | 351 | E9 | 11101001 | é | é | é | Latin small letter e with acute |
234 | 352 | EA | 11101010 | ê | ê | ê | Latin small letter e with circumflex |
235 | 353 | EB | 11101011 | ë | ë | ë | Latin small letter e with diaeresis |
236 | 354 | EC | 11101100 | ì | ì | ì | Latin small letter i with grave |
237 | 355 | ED | 11101101 | í | í | í | Latin small letter i with acute |
238 | 356 | EE | 11101110 | î | î | î | Latin small letter i with circumflex |
239 | 357 | EF | 11101111 | ï | ï | ï | Latin small letter i with diaeresis |
240 | 360 | F0 | 11110000 | ð | ð | ð | Latin small letter eth |
241 | 361 | F1 | 11110001 | ñ | ñ | ñ | Latin small letter n with tilde |
242 | 362 | F2 | 11110010 | ò | ò | ò | Latin small letter o with grave |
243 | 363 | F3 | 11110011 | ó | ó | ó | Latin small letter o with acute |
244 | 364 | F4 | 11110100 | ô | ô | ô | Latin small letter o with circumflex |
245 | 365 | F5 | 11110101 | õ | õ | õ | Latin small letter o with tilde |
246 | 366 | F6 | 11110110 | ö | ö | ö | Latin small letter o with diaeresis |
247 | 367 | F7 | 11110111 | ÷ | ÷ | ÷ | Division sign |
248 | 370 | F8 | 11111000 | ø | ø | ø | Latin small letter o with slash |
249 | 371 | F9 | 11111001 | ù | ù | ù | Latin small letter u with grave |
250 | 372 | FA | 11111010 | ú | ú | ú | Latin small letter u with acute |
251 | 373 | FB | 11111011 | û | û | û | Latin small letter u with circumflex |
252 | 374 | FC | 11111100 | ü | ü | ü | Latin small letter u with diaeresis |
253 | 375 | FD | 11111101 | ý | ý | ý | Latin small letter y with acute |
254 | 376 | FE | 11111110 | þ | þ | þ | Latin small letter thorn |
255 | 377 | FF | 11111111 | ÿ | ÿ | ÿ | Latin small letter y with diaeresis |
To store 256 characters uniquely, we need 8 bits, because 2^8 = 256.
The original 128 ASCII characters need only 7 bits, while extended ASCII takes 8 bits (one byte) to store a character.
These characters cover the English alphabet plus some accented Latin letters and other symbols.
What about the remaining characters, such as Chinese, Japanese, and so on?
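The limit of a one-byte table can be checked in Python. The sketch below assumes the extended table above corresponds to the Windows-1252 code page, which Python exposes as the codec cp1252; the helper name fits_in_one_byte is made up for this example.

```python
# 256 distinct values need 8 bits, because 2 ** 8 == 256.
print((256 - 1).bit_length())   # 8


def fits_in_one_byte(ch, codec="cp1252"):
    """Return True if ch can be stored as a single byte in the given codec."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        # The character has no entry in the 256-character table.
        return False


print(fits_in_one_byte("A"))    # True
print(fits_in_one_byte("é"))    # True  (decimal 233 in the extended table)
print(fits_in_one_byte("中"))   # False (a Chinese character)
```

A Chinese character has no slot in a 256-entry table, which is the gap Unicode was designed to fill.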
Unicode Characters
Unicode takes all the characters used in the world into consideration.
As of the Unicode 11 standard, well over 100 thousand different characters are covered.
Unicode also assigns a unique number, called a code point, to every character.
The way these numbers are stored as bytes is called an encoding.
There are different encoding schemes, such as UTF-8 and UTF-16.
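The difference between the number assigned to a character and the bytes used to store it can be seen in a short Python sketch. The sample character is chosen only for illustration.

```python
ch = "中"  # a Chinese character

# The Unicode number (code point) of the character.
code_point = ord(ch)
print(f"U+{code_point:04X}")          # U+4E2D

# The same code point is stored as different bytes in different encodings.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, without a byte order mark
print(utf8_bytes.hex())               # e4b8ad
print(utf16_bytes.hex())              # 4e2d
```

The code point stays the same; only the byte representation changes with the encoding.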
UTF-8 Encoding
In this class, we discuss the UTF-8 encoding.
UTF stands for Unicode Transformation Format.
Python 3 uses UTF-8 by default for source files and for str.encode() and bytes.decode().
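Python's default can be confirmed with a few lines. This sketch assumes Python 3; the sample text is arbitrary.

```python
import sys

# Default encoding used by str.encode() and bytes.decode().
print(sys.getdefaultencoding())       # utf-8

text = "café"

# With no argument, encode() produces UTF-8 bytes.
data = text.encode()
print(data == text.encode("utf-8"))   # True
print(len(text), len(data))           # 4 5  (é takes two bytes)

# decode() with no argument reads the bytes back as UTF-8.
print(data.decode())                  # café
```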
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | | | |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The table above shows how UTF-8 lays out the bits of a code point.
The first 128 Unicode characters (0 to 127) are the same as the ASCII characters.
These ASCII characters take one byte of memory in UTF-8.
Next come the code points from U+0080 to U+07FF, which include accented Latin letters, Greek, Hebrew, Thaana, etc. The symbols from these scripts are given 16 bits of space.
So 2 bytes are used to store the symbols of the scripts mentioned above.
Most of the remaining scripts, such as Japanese and Chinese, take 3 bytes of memory per character.
Code points above U+FFFF, such as emoji and rarer or historic scripts, are given four bytes of memory space.
So UTF-8 uses a different amount of memory for different characters.
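The varying sizes can be checked by encoding one sample character from each row of the table. The sample characters below are chosen for illustration only.

```python
# One sample character from each UTF-8 length class.
samples = {
    "A": "ASCII letter",
    "é": "accented Latin letter",
    "\u05d0": "Hebrew letter alef",
    "中": "Chinese character",
    "😀": "emoji",
}

# Number of bytes each character takes in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X}  {label:22} {lengths[ch]} byte(s)  {ch.encode('utf-8').hex()}")
```

The ASCII letter takes one byte, the accented Latin and Hebrew letters take two, the Chinese character takes three, and the emoji takes four.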
We said a string is a sequence of characters.
Suppose a string holds an English letter and a Chinese character.
Each character takes a different amount of memory. How do we identify how many bytes a character takes?
From the table above, we observe that the most significant bit of an ASCII character is 0.
If the first bit is 0, the character takes one byte of memory.
For two-byte characters, such as accented Latin and Greek letters, the most significant bits of the first byte are 110.
For three-byte characters, such as Chinese and Japanese characters, the most significant bits of the first byte are 1110.
For the remaining four-byte symbols, the most significant bits are 11110.
Every following byte of a multi-byte character starts with 10.
With the help of the most significant bits of the first byte, we can identify the space taken by the character.
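The rule above can be written as a small Python sketch that reads the leading bits of the first byte. The function names utf8_sequence_length and split_characters are invented for this example and are not part of any library.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 character takes, given its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # Bytes of the form 10xxxxxx are continuation bytes, not first bytes.
    raise ValueError("not a valid UTF-8 first byte")


def split_characters(data):
    """Split UTF-8 encoded bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


data = "A中".encode("utf-8")
print(split_characters(data))   # [b'A', b'\xe4\xb8\xad']
```

This sketch only looks at the first byte of each character and does not validate the continuation bytes, so it is meant for well-formed input.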
Some symbols and their UTF-8 encodings are shown below.
Code point | Symbol | UTF-8 bytes (hex) | Name |
U+00A1 | ¡ | c2 a1 | INVERTED EXCLAMATION MARK |
U+00A2 | ¢ | c2 a2 | CENT SIGN |
U+00A3 | £ | c2 a3 | POUND SIGN |
U+00A4 | ¤ | c2 a4 | CURRENCY SIGN |
U+00A5 | ¥ | c2 a5 | YEN SIGN |
U+00A6 | ¦ | c2 a6 | BROKEN BAR |
U+00A7 | § | c2 a7 | SECTION SIGN |
U+00A8 | ¨ | c2 a8 | DIAERESIS |
U+00A9 | © | c2 a9 | COPYRIGHT SIGN |
U+00AA | ª | c2 aa | FEMININE ORDINAL INDICATOR |
U+00AB | « | c2 ab | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
U+00AC | ¬ | c2 ac | NOT SIGN |
U+00AD | | c2 ad | SOFT HYPHEN |
U+00AE | ® | c2 ae | REGISTERED SIGN |
U+00AF | ¯ | c2 af | MACRON |
U+00B0 | ° | c2 b0 | DEGREE SIGN |
U+00B1 | ± | c2 b1 | PLUS-MINUS SIGN |
U+00B2 | ² | c2 b2 | SUPERSCRIPT TWO |
U+00B3 | ³ | c2 b3 | SUPERSCRIPT THREE |
U+00B4 | ´ | c2 b4 | ACUTE ACCENT |
The yen sign has the code point U+00A5, and its UTF-8 encoding is the two bytes c2 a5, written in hexadecimal format.
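The two bytes of the yen sign can be derived by hand from the two-byte pattern 110xxxxx 10xxxxxx. The following is a sketch of that bit arithmetic in Python, compared against the built-in encoder.

```python
code_point = 0x00A5   # YEN SIGN, binary 10100101

# Two-byte form: 110xxxxx 10xxxxxx
byte1 = 0b11000000 | (code_point >> 6)        # upper bits go into the first byte
byte2 = 0b10000000 | (code_point & 0b111111)  # lower 6 bits go into the second byte

manual = bytes([byte1, byte2])
print(manual.hex())                     # c2a5
print(manual == "¥".encode("utf-8"))    # True
```

The hand-built bytes match Python's own UTF-8 encoder, which confirms the c2 a5 entry in the table.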