Unicode Characters and UTF-8 Encoding

In this class, we discuss Unicode characters and the UTF-8 encoding scheme.

ASCII Characters

In our previous class, we discussed a string as a sequence of Unicode characters.

First, we look at ASCII characters. Then we move on to Unicode characters.

The ASCII character table from 0 to 127 is shown below.

Dec  Hex  Char  Name / Function     Dec  Hex  Char   Dec  Hex  Char  Dec  Hex  Char
0    00   NUL   Null                32   20   space  64   40   @     96   60   `
1    01   SOH   Start Of Heading    33   21   !      65   41   A     97   61   a
2    02   STX   Start Of Text       34   22   "      66   42   B     98   62   b
3    03   ETX   End Of Text         35   23   #      67   43   C     99   63   c
4    04   EOT   End Of Transmit     36   24   $      68   44   D     100  64   d
5    05   ENQ   Enquiry             37   25   %      69   45   E     101  65   e
6    06   ACK   Acknowledge         38   26   &      70   46   F     102  66   f
7    07   BEL   Bell                39   27   '      71   47   G     103  67   g
8    08   BS    Backspace           40   28   (      72   48   H     104  68   h
9    09   HT    Horizontal Tab      41   29   )      73   49   I     105  69   i
10   0A   LF    Line Feed           42   2A   *      74   4A   J     106  6A   j
11   0B   VT    Vertical Tab        43   2B   +      75   4B   K     107  6B   k
12   0C   FF    Form Feed           44   2C   ,      76   4C   L     108  6C   l
13   0D   CR    Carriage Return     45   2D   -      77   4D   M     109  6D   m
14   0E   SO    Shift Out           46   2E   .      78   4E   N     110  6E   n
15   0F   SI    Shift In            47   2F   /      79   4F   O     111  6F   o
16   10   DLE   Data Line Escape    48   30   0      80   50   P     112  70   p
17   11   DC1   Device Control 1    49   31   1      81   51   Q     113  71   q
18   12   DC2   Device Control 2    50   32   2      82   52   R     114  72   r
19   13   DC3   Device Control 3    51   33   3      83   53   S     115  73   s
20   14   DC4   Device Control 4    52   34   4      84   54   T     116  74   t
21   15   NAK   Non Acknowledge     53   35   5      85   55   U     117  75   u
22   16   SYN   Synchronous Idle    54   36   6      86   56   V     118  76   v
23   17   ETB   End Transmit Block  55   37   7      87   57   W     119  77   w
24   18   CAN   Cancel              56   38   8      88   58   X     120  78   x
25   19   EM    End Of Medium       57   39   9      89   59   Y     121  79   y
26   1A   SUB   Substitute          58   3A   :      90   5A   Z     122  7A   z
27   1B   ESC   Escape              59   3B   ;      91   5B   [     123  7B   {
28   1C   FS    File Separator      60   3C   <      92   5C   \     124  7C   |
29   1D   GS    Group Separator     61   3D   =      93   5D   ]     125  7D   }
30   1E   RS    Record Separator    62   3E   >      94   5E   ^     126  7E   ~
31   1F   US    Unit Separator      63   3F   ?      95   5F   _     127  7F   delete
ASCII Table

ASCII assigns a unique number to every character.

Each of a-z, A-Z, the digits, the space, and so on has its own number.

Character A is given value 65 in the ASCII table.
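The lookup in the ASCII table can be tried directly in Python. This is a small sketch using the built-in functions ord() and chr(); the characters chosen are only examples.

```python
# ord() returns the number assigned to a character; chr() is the inverse.
print(ord("A"))   # 65
print(ord("a"))   # 97
print(ord(" "))   # 32
print(chr(65))    # A
```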

ASCII is a standard for representing characters. Why do we need such a standard?

If everyone follows the same standard, it is easy to exchange information.

Example:

We want to send the text hello to another person.

The character h is converted to its ASCII value, the character e is converted to its ASCII value, and so on.

Every character is converted to ASCII. The person who receives the message also follows the ASCII standard.

So the receiver can easily work out which message was sent.
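The exchange described in this example can be sketched in Python. The variable names message, codes, and received are made up for this illustration; the sender and receiver are simply two steps in the same script.

```python
message = "hello"

# Sender: convert each character to its ASCII number.
codes = list(message.encode("ascii"))
print(codes)  # [104, 101, 108, 108, 111]

# Receiver: apply the same standard to turn the numbers back into text.
received = bytes(codes).decode("ascii")
print(received)  # hello
```

Because both sides use the same table, the receiver gets back exactly the text that was sent.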

Computers do not understand characters; they store numbers. So A is stored as the value 65, which is 1000001 in binary.
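To see the bits that are actually stored, we can ask Python to print the value of A in binary. This is only an illustration; the name bits is chosen for this example.

```python
code = ord("A")
bits = format(code, "08b")   # eight binary digits, zero padded
print(code, bits)            # 65 01000001
```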

The last character is the delete character. Its decimal value is 127.

ASCII was later extended with the values 128 to 255, which add accented Latin letters, currency signs, and other symbols.

Extended ASCII has a total of 256 different characters.

The extended table is given below. Several extended sets exist; this one matches the Windows-1252 code page.

DEC  OCT  HEX  BIN       Symbol  HTML Number  HTML Name  Description
128  200  80   10000000  €       &#128;       &euro;     Euro sign
129  201  81   10000001
130  202  82   10000010  ‚       &#130;       &sbquo;    Single low-9 quotation mark
131  203  83   10000011  ƒ       &#131;       &fnof;     Latin small letter f with hook
132  204  84   10000100  „       &#132;       &bdquo;    Double low-9 quotation mark
133  205  85   10000101  …       &#133;       &hellip;   Horizontal ellipsis
134  206  86   10000110  †       &#134;       &dagger;   Dagger
135  207  87   10000111  ‡       &#135;       &Dagger;   Double dagger
136  210  88   10001000  ˆ       &#136;       &circ;     Modifier letter circumflex accent
137  211  89   10001001  ‰       &#137;       &permil;   Per mille sign
138  212  8A   10001010  Š       &#138;       &Scaron;   Latin capital letter S with caron
139  213  8B   10001011  ‹       &#139;       &lsaquo;   Single left-pointing angle quotation mark
140  214  8C   10001100  Œ       &#140;       &OElig;    Latin capital ligature OE
141  215  8D   10001101
142  216  8E   10001110  Ž       &#142;                  Latin capital letter Z with caron
143  217  8F   10001111
144  220  90   10010000
145  221  91   10010001  ‘       &#145;       &lsquo;    Left single quotation mark
146  222  92   10010010  ’       &#146;       &rsquo;    Right single quotation mark
147  223  93   10010011  “       &#147;       &ldquo;    Left double quotation mark
148  224  94   10010100  ”       &#148;       &rdquo;    Right double quotation mark
149  225  95   10010101  •       &#149;       &bull;     Bullet
150  226  96   10010110  –       &#150;       &ndash;    En dash
151  227  97   10010111  —       &#151;       &mdash;    Em dash
152  230  98   10011000  ˜       &#152;       &tilde;    Small tilde
153  231  99   10011001  ™       &#153;       &trade;    Trade mark sign
154  232  9A   10011010  š       &#154;       &scaron;   Latin small letter S with caron
155  233  9B   10011011  ›       &#155;       &rsaquo;   Single right-pointing angle quotation mark
156  234  9C   10011100  œ       &#156;       &oelig;    Latin small ligature oe
157  235  9D   10011101
158  236  9E   10011110  ž       &#158;                  Latin small letter z with caron
159  237  9F   10011111  Ÿ       &#159;       &Yuml;     Latin capital letter Y with diaeresis
160  240  A0   10100000          &#160;       &nbsp;     Non-breaking space
161  241  A1   10100001  ¡       &#161;       &iexcl;    Inverted exclamation mark
162  242  A2   10100010  ¢       &#162;       &cent;     Cent sign
163  243  A3   10100011  £       &#163;       &pound;    Pound sign
164  244  A4   10100100  ¤       &#164;       &curren;   Currency sign
165  245  A5   10100101  ¥       &#165;       &yen;      Yen sign
166  246  A6   10100110  ¦       &#166;       &brvbar;   Pipe, Broken vertical bar
167  247  A7   10100111  §       &#167;       &sect;     Section sign
168  250  A8   10101000  ¨       &#168;       &uml;      Spacing diaeresis – umlaut
169  251  A9   10101001  ©       &#169;       &copy;     Copyright sign
170  252  AA   10101010  ª       &#170;       &ordf;     Feminine ordinal indicator
171  253  AB   10101011  «       &#171;       &laquo;    Left double angle quotes
172  254  AC   10101100  ¬       &#172;       &not;      Not sign
173  255  AD   10101101          &#173;       &shy;      Soft hyphen
174  256  AE   10101110  ®       &#174;       &reg;      Registered trade mark sign
175  257  AF   10101111  ¯       &#175;       &macr;     Spacing macron – overline
176  260  B0   10110000  °       &#176;       &deg;      Degree sign
177  261  B1   10110001  ±       &#177;       &plusmn;   Plus-or-minus sign
178  262  B2   10110010  ²       &#178;       &sup2;     Superscript two – squared
179  263  B3   10110011  ³       &#179;       &sup3;     Superscript three – cubed
180  264  B4   10110100  ´       &#180;       &acute;    Acute accent – spacing acute
181  265  B5   10110101  µ       &#181;       &micro;    Micro sign
182  266  B6   10110110  ¶       &#182;       &para;     Pilcrow sign – paragraph sign
183  267  B7   10110111  ·       &#183;       &middot;   Middle dot – Georgian comma
184  270  B8   10111000  ¸       &#184;       &cedil;    Spacing cedilla
185  271  B9   10111001  ¹       &#185;       &sup1;     Superscript one
186  272  BA   10111010  º       &#186;       &ordm;     Masculine ordinal indicator
187  273  BB   10111011  »       &#187;       &raquo;    Right double angle quotes
188  274  BC   10111100  ¼       &#188;       &frac14;   Fraction one quarter
189  275  BD   10111101  ½       &#189;       &frac12;   Fraction one half
190  276  BE   10111110  ¾       &#190;       &frac34;   Fraction three quarters
191  277  BF   10111111  ¿       &#191;       &iquest;   Inverted question mark
192  300  C0   11000000  À       &#192;       &Agrave;   Latin capital letter A with grave
193  301  C1   11000001  Á       &#193;       &Aacute;   Latin capital letter A with acute
194  302  C2   11000010  Â       &#194;       &Acirc;    Latin capital letter A with circumflex
195  303  C3   11000011  Ã       &#195;       &Atilde;   Latin capital letter A with tilde
196  304  C4   11000100  Ä       &#196;       &Auml;     Latin capital letter A with diaeresis
197  305  C5   11000101  Å       &#197;       &Aring;    Latin capital letter A with ring above
198  306  C6   11000110  Æ       &#198;       &AElig;    Latin capital letter AE
199  307  C7   11000111  Ç       &#199;       &Ccedil;   Latin capital letter C with cedilla
200  310  C8   11001000  È       &#200;       &Egrave;   Latin capital letter E with grave
201  311  C9   11001001  É       &#201;       &Eacute;   Latin capital letter E with acute
202  312  CA   11001010  Ê       &#202;       &Ecirc;    Latin capital letter E with circumflex
203  313  CB   11001011  Ë       &#203;       &Euml;     Latin capital letter E with diaeresis
204  314  CC   11001100  Ì       &#204;       &Igrave;   Latin capital letter I with grave
205  315  CD   11001101  Í       &#205;       &Iacute;   Latin capital letter I with acute
206  316  CE   11001110  Î       &#206;       &Icirc;    Latin capital letter I with circumflex
207  317  CF   11001111  Ï       &#207;       &Iuml;     Latin capital letter I with diaeresis
208  320  D0   11010000  Ð       &#208;       &ETH;      Latin capital letter ETH
209  321  D1   11010001  Ñ       &#209;       &Ntilde;   Latin capital letter N with tilde
210  322  D2   11010010  Ò       &#210;       &Ograve;   Latin capital letter O with grave
211  323  D3   11010011  Ó       &#211;       &Oacute;   Latin capital letter O with acute
212  324  D4   11010100  Ô       &#212;       &Ocirc;    Latin capital letter O with circumflex
213  325  D5   11010101  Õ       &#213;       &Otilde;   Latin capital letter O with tilde
214  326  D6   11010110  Ö       &#214;       &Ouml;     Latin capital letter O with diaeresis
215  327  D7   11010111  ×       &#215;       &times;    Multiplication sign
216  330  D8   11011000  Ø       &#216;       &Oslash;   Latin capital letter O with slash
217  331  D9   11011001  Ù       &#217;       &Ugrave;   Latin capital letter U with grave
218  332  DA   11011010  Ú       &#218;       &Uacute;   Latin capital letter U with acute
219  333  DB   11011011  Û       &#219;       &Ucirc;    Latin capital letter U with circumflex
220  334  DC   11011100  Ü       &#220;       &Uuml;     Latin capital letter U with diaeresis
221  335  DD   11011101  Ý       &#221;       &Yacute;   Latin capital letter Y with acute
222  336  DE   11011110  Þ       &#222;       &THORN;    Latin capital letter THORN
223  337  DF   11011111  ß       &#223;       &szlig;    Latin small letter sharp s – ess-zed
224  340  E0   11100000  à       &#224;       &agrave;   Latin small letter a with grave
225  341  E1   11100001  á       &#225;       &aacute;   Latin small letter a with acute
226  342  E2   11100010  â       &#226;       &acirc;    Latin small letter a with circumflex
227  343  E3   11100011  ã       &#227;       &atilde;   Latin small letter a with tilde
228  344  E4   11100100  ä       &#228;       &auml;     Latin small letter a with diaeresis
229  345  E5   11100101  å       &#229;       &aring;    Latin small letter a with ring above
230  346  E6   11100110  æ       &#230;       &aelig;    Latin small letter ae
231  347  E7   11100111  ç       &#231;       &ccedil;   Latin small letter c with cedilla
232  350  E8   11101000  è       &#232;       &egrave;   Latin small letter e with grave
233  351  E9   11101001  é       &#233;       &eacute;   Latin small letter e with acute
234  352  EA   11101010  ê       &#234;       &ecirc;    Latin small letter e with circumflex
235  353  EB   11101011  ë       &#235;       &euml;     Latin small letter e with diaeresis
236  354  EC   11101100  ì       &#236;       &igrave;   Latin small letter i with grave
237  355  ED   11101101  í       &#237;       &iacute;   Latin small letter i with acute
238  356  EE   11101110  î       &#238;       &icirc;    Latin small letter i with circumflex
239  357  EF   11101111  ï       &#239;       &iuml;     Latin small letter i with diaeresis
240  360  F0   11110000  ð       &#240;       &eth;      Latin small letter eth
241  361  F1   11110001  ñ       &#241;       &ntilde;   Latin small letter n with tilde
242  362  F2   11110010  ò       &#242;       &ograve;   Latin small letter o with grave
243  363  F3   11110011  ó       &#243;       &oacute;   Latin small letter o with acute
244  364  F4   11110100  ô       &#244;       &ocirc;    Latin small letter o with circumflex
245  365  F5   11110101  õ       &#245;       &otilde;   Latin small letter o with tilde
246  366  F6   11110110  ö       &#246;       &ouml;     Latin small letter o with diaeresis
247  367  F7   11110111  ÷       &#247;       &divide;   Division sign
248  370  F8   11111000  ø       &#248;       &oslash;   Latin small letter o with slash
249  371  F9   11111001  ù       &#249;       &ugrave;   Latin small letter u with grave
250  372  FA   11111010  ú       &#250;       &uacute;   Latin small letter u with acute
251  373  FB   11111011  û       &#251;       &ucirc;    Latin small letter u with circumflex
252  374  FC   11111100  ü       &#252;       &uuml;     Latin small letter u with diaeresis
253  375  FD   11111101  ý       &#253;       &yacute;   Latin small letter y with acute
254  376  FE   11111110  þ       &#254;       &thorn;    Latin small letter thorn
255  377  FF   11111111  ÿ       &#255;       &yuml;     Latin small letter y with diaeresis
Extended ASCII Table

To store 256 different characters uniquely, we need 8 bits, because 2^8 = 256.

Plain ASCII fits in 7 bits, but extended ASCII takes 8 bits (one byte) to store a character.
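The bit counts can be checked with a little arithmetic. The sketch below also encodes one accented letter as a single byte; it assumes that the extended table above corresponds to Python's "cp1252" codec, which is an assumption of this example and not something the table itself states.

```python
print(2 ** 7)  # 128 values: enough for ASCII (0 to 127)
print(2 ** 8)  # 256 values: enough for the extended table (0 to 255)

# Assumption: the extended table matches Python's "cp1252" codec.
encoded = "\u00e9".encode("cp1252")   # Latin small letter e with acute
print(encoded, len(encoded))          # b'\xe9' 1
```

The byte value 0xE9 is 233 in decimal, the same number listed for this letter in the extended table.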

These characters cover the English alphabet, accented Latin letters, and some symbols.

What about the remaining characters, such as Chinese and Japanese?
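A quick experiment shows the limitation. The sketch below tries to encode one Chinese character with the ASCII codec; the character U+4E2D is just a sample picked for this illustration.

```python
chinese = "\u4e2d"  # a common Chinese character (CJK ideograph U+4E2D)

try:
    chinese.encode("ascii")
except UnicodeEncodeError as error:
    # ASCII has no number for this character, so encoding fails.
    print("cannot encode:", error.reason)  # cannot encode: ordinal not in range(128)
```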

Unicode Characters

Unicode takes all the characters present in the world into consideration.

Recent versions of the Unicode standard (version 11 and later) define well over 100,000 characters.

Unicode also assigns a unique number, called a code point, to every character.
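Python exposes these numbers through ord() and the standard unicodedata module. The three sample characters below are chosen only for illustration.

```python
import unicodedata

# A Latin letter, the yen sign, and a Chinese character.
for ch in ["A", "\u00a5", "\u4e2d"]:
    code_point = ord(ch)
    print(f"U+{code_point:04X}", code_point, unicodedata.name(ch))
# U+0041 65 LATIN CAPITAL LETTER A
# U+00A5 165 YEN SIGN
# U+4E2D 20013 CJK UNIFIED IDEOGRAPH-4E2D
```

Code points are usually written as U+ followed by the number in hexadecimal.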

These numbers must then be stored as bytes. The scheme that converts each number into bytes is called an encoding.

There are different encoding schemes, such as UTF-8 and UTF-16.
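The same character produces different bytes under different encodings. As a small sketch, the yen sign is encoded below with three codecs that ship with Python; the big-endian variants are used so that no byte order mark is added.

```python
yen = "\u00a5"  # YEN SIGN, code point U+00A5

print(yen.encode("utf-8").hex())      # c2a5
print(yen.encode("utf-16-be").hex())  # 00a5
print(yen.encode("utf-32-be").hex())  # 000000a5
```

The code point stays the same in every case. Only the byte layout changes.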

UTF-8 Encoding

In this class, we discuss UTF-8 encoding.

UTF stands for Unicode Transformation Format.

Python 3 uses UTF-8 as its default encoding.
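The default can be checked from Python itself. This sketch assumes Python 3, where sys.getdefaultencoding() reports the encoding used when str.encode() and bytes.decode() are called without arguments.

```python
import sys

print(sys.getdefaultencoding())   # utf-8

text = "\u00a5"
# With no argument, str.encode() and bytes.decode() use UTF-8.
assert text.encode() == text.encode("utf-8")
assert text.encode().decode() == text
```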

First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
UTF-8 Encoding

Look at the UTF-8 encoding table above.
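To make the table concrete, here is a minimal sketch of an encoder that fills in the x positions by hand. The function name utf8_encode is made up for this illustration, and the sketch skips some checks a real encoder performs (for example, it does not reject surrogate code points). In practice, str.encode("utf-8") does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the bit patterns from the table."""
    if code_point <= 0x7F:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point out of range")


# Compare the hand-made result with Python's own encoder.
for cp in [0x41, 0xA5, 0x4E2D, 0x1F600]:
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
print(utf8_encode(0xA5).hex())  # c2a5
```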

The first 128 Unicode characters (0 to 127) are the same as the ASCII characters.

These ASCII characters take one byte of memory in UTF-8 encoding.

Next come accented Latin letters, Greek, Hebrew, Thaana, etc. The symbols from these scripts, U+0080 to U+07FF, take two bytes.

Of those 16 bits, 5 are fixed marker bits and 11 hold the code point.

Most other scripts in use today, including Chinese and Japanese, take three bytes.

The remaining characters, from U+10000 upward, take four bytes. Emoji and many historic scripts fall in this range.

So UTF-8 uses a different amount of memory for different characters.
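The four sizes can be observed by encoding one sample character from each range. The characters and labels below are chosen for this illustration only.

```python
samples = {
    "Latin letter a": "a",
    "Yen sign": "\u00a5",
    "Chinese character U+4E2D": "\u4e2d",
    "Emoji U+1F600": "\U0001f600",
}
for label, ch in samples.items():
    print(label, len(ch.encode("utf-8")))
# Latin letter a 1
# Yen sign 2
# Chinese character U+4E2D 3
# Emoji U+1F600 4

text = "a\u4e2d"                    # one English letter and one Chinese character
print(len(text))                    # 2 characters
print(len(text.encode("utf-8")))    # 4 bytes (1 + 3)
```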

We said a string is a sequence of characters.

Suppose a string contains an English letter and a Chinese character.

Each character occupies a different number of bytes. How do we identify how many bytes a character takes?

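From the table above, we observe that the most significant bit of an ASCII character is 0.

If the first bit is 0, the character takes one byte of memory.

For Latin and Greek symbols, the most significant bits of the first byte are 110, so the character takes two bytes.

For Chinese and Japanese characters, the most significant bits are 1110, so the character takes three bytes.

For the remaining symbols, the most significant bits are 11110, so the character takes four bytes.

Every byte after the first starts with 10, so it cannot be mistaken for the start of a character.

With the help of the most significant bits of the first byte, we can identify the space taken by the character.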
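The rule can be sketched as a small function that looks only at the first byte of a character. The names sequence_length and split_characters are hypothetical helpers written for this example; the sketch assumes the input is valid UTF-8 and does not validate the bytes that follow the first one.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes a character uses, judged from its first byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("not a valid first byte")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character."""
    pieces = []
    index = 0
    while index < len(data):
        size = sequence_length(data[index])
        pieces.append(data[index:index + size])
        index += size
    return pieces


data = "a\u00a5\u4e2d".encode("utf-8")
print(split_characters(data))  # [b'a', b'\xc2\xa5', b'\xe4\xb8\xad']
```

A byte that starts with 10 raises ValueError here, because such a byte can only appear after the first byte of a character.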

Some symbols and their UTF-8 encodings are shown below.

Code point  Character  UTF-8 (hex)  Name
U+00A1      ¡          c2 a1        INVERTED EXCLAMATION MARK
U+00A2      ¢          c2 a2        CENT SIGN
U+00A3      £          c2 a3        POUND SIGN
U+00A4      ¤          c2 a4        CURRENCY SIGN
U+00A5      ¥          c2 a5        YEN SIGN
U+00A6      ¦          c2 a6        BROKEN BAR
U+00A7      §          c2 a7        SECTION SIGN
U+00A8      ¨          c2 a8        DIAERESIS
U+00A9      ©          c2 a9        COPYRIGHT SIGN
U+00AA      ª          c2 aa        FEMININE ORDINAL INDICATOR
U+00AB      «          c2 ab        LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00AC      ¬          c2 ac        NOT SIGN
U+00AD                 c2 ad        SOFT HYPHEN
U+00AE      ®          c2 ae        REGISTERED SIGN
U+00AF      ¯          c2 af        MACRON
U+00B0      °          c2 b0        DEGREE SIGN
U+00B1      ±          c2 b1        PLUS-MINUS SIGN
U+00B2      ²          c2 b2        SUPERSCRIPT TWO
U+00B3      ³          c2 b3        SUPERSCRIPT THREE
U+00B4      ´          c2 b4        ACUTE ACCENT
Sample Unicode Table

The yen sign has the code point U+00A5. Its UTF-8 encoding is the two bytes c2 a5, given in hexadecimal format.
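The relationship between the code point and the UTF-8 bytes of the yen sign can be worked out by hand. The sketch below removes the 110 and 10 marker bits from the two bytes and joins what is left; the variable names are chosen for this example.

```python
yen = "\u00a5"
print(f"U+{ord(yen):04X}")          # U+00A5  (the code point)
print(yen.encode("utf-8").hex())    # c2a5    (the UTF-8 bytes)

# Recover the code point from the two bytes by dropping the 110 and 10 prefixes.
first, second = 0xC2, 0xA5
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
print(hex(code_point))              # 0xa5
```

The second byte happens to equal the code point here, but that is a coincidence of this range and not a general rule.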

The complete list of UTF-8 codes is given here.