Unicode Characters and UTF-8 Encoding

In this class, we discuss Unicode characters and the UTF-8 encoding scheme.

ASCII Characters

In our previous class, we discussed a string as a sequence of Unicode characters.

First, we look at ASCII characters. Then we move on to Unicode characters.

The ASCII character table from 0 to 127 is shown below.

Dec	Hex	Char	Name / Function	Dec	Hex	Char	Dec	Hex	Char	Dec	Hex	Char
0	00	NUL	Null	32	20	space	64	40	@	96	60	`
1	01	SOH	Start Of Heading	33	21	!	65	41	A	97	61	a
2	02	STX	Start Of Text	34	22	"	66	42	B	98	62	b
3	03	ETX	End Of Text	35	23	#	67	43	C	99	63	c
4	04	EOT	End Of Transmission	36	24	$	68	44	D	100	64	d
5	05	ENQ	Enquiry	37	25	%	69	45	E	101	65	e
6	06	ACK	Acknowledge	38	26	&	70	46	F	102	66	f
7	07	BEL	Bell	39	27	'	71	47	G	103	67	g
8	08	BS	Backspace	40	28	(	72	48	H	104	68	h
9	09	HT	Horizontal Tab	41	29	)	73	49	I	105	69	i
10	0A	LF	Line Feed	42	2A	*	74	4A	J	106	6A	j
11	0B	VT	Vertical Tab	43	2B	+	75	4B	K	107	6B	k
12	0C	FF	Form Feed	44	2C	,	76	4C	L	108	6C	l
13	0D	CR	Carriage Return	45	2D	-	77	4D	M	109	6D	m
14	0E	SO	Shift Out	46	2E	.	78	4E	N	110	6E	n
15	0F	SI	Shift In	47	2F	/	79	4F	O	111	6F	o
16	10	DLE	Data Link Escape	48	30	0	80	50	P	112	70	p
17	11	DC1	Device Control 1	49	31	1	81	51	Q	113	71	q
18	12	DC2	Device Control 2	50	32	2	82	52	R	114	72	r
19	13	DC3	Device Control 3	51	33	3	83	53	S	115	73	s
20	14	DC4	Device Control 4	52	34	4	84	54	T	116	74	t
21	15	NAK	Negative Acknowledge	53	35	5	85	55	U	117	75	u
22	16	SYN	Synchronous Idle	54	36	6	86	56	V	118	76	v
23	17	ETB	End Of Transmission Block	55	37	7	87	57	W	119	77	w
24	18	CAN	Cancel	56	38	8	88	58	X	120	78	x
25	19	EM	End Of Medium	57	39	9	89	59	Y	121	79	y
26	1A	SUB	Substitute	58	3A	:	90	5A	Z	122	7A	z
27	1B	ESC	Escape	59	3B	;	91	5B	[	123	7B	{
28	1C	FS	File Separator	60	3C	<	92	5C	\	124	7C	|
29	1D	GS	Group Separator	61	3D	=	93	5D	]	125	7D	}
30	1E	RS	Record Separator	62	3E	>	94	5E	^	126	7E	~
31	1F	US	Unit Separator	63	3F	?	95	5F	_	127	7F	delete

ASCII Table

ASCII assigns a unique number to every character.

Each letter (a-z and A-Z), digit, punctuation mark, and the space has its own number.

Character A is given value 65 in the ASCII table.
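In Python, we can check this mapping with the built-in ord() and chr() functions. The snippet below is a small sketch; the variable names are only for illustration.

```python
# ord() returns the number assigned to a character.
code_for_a_upper = ord("A")

# chr() goes the other way: from a number back to its character.
char_for_97 = chr(97)

print(code_for_a_upper)  # 65
print(char_for_97)       # a
```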

ASCII is a standard for numbering characters. Why do we need such a standard?

If everyone follows the same standard, it is easy to exchange information.

Example:

Suppose we want to send the text hello to another person.

The character h is converted to its ASCII number, then the character e, and so on.

All the characters are converted to ASCII numbers. The receiver also follows the ASCII standard.

So the receiver can convert the numbers back to characters and understand the message.
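The exchange described above can be sketched in Python with str.encode() and bytes.decode(). This is only an illustration of the idea; the names message, sent_bytes, and received are made up for the example.

```python
message = "hello"

# Sender: convert every character to its ASCII number (one byte each).
sent_bytes = message.encode("ascii")
print(list(sent_bytes))  # [104, 101, 108, 108, 111]

# Receiver: apply the same standard to turn the numbers back into text.
received = sent_bytes.decode("ascii")
print(received)  # hello
```

Because both sides agree on ASCII, the receiver gets back exactly the text that was sent.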

Computers don't understand characters; they only store numbers. So A is stored as the number 65, which is 1000001 in binary.
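To see the binary form of a character's number, we can format it in Python. A small sketch, padding the result to 8 binary digits so that it looks like one byte:

```python
value = ord("A")

# "08b" means: binary digits, padded with zeros to a width of 8.
bits = format(value, "08b")

print(value, bits)  # 65 01000001
```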

The last character is the delete character. The decimal value is 127.

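ASCII was later extended from 128 to 256 characters by using the codes 128 to 255. The added characters are mostly accented Latin letters, punctuation marks, and other symbols.

There is no single extended ASCII standard. Several 8-bit code pages exist, and the table below matches the Windows-1252 code page.

An extended table has a total of 256 positions.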

The extended table is given below.

DEC	OCT	HEX	BIN	Symbol	HTML Name	Description
128	200	80	10000000	€	&euro;	Euro sign
129	201	81	10000001
130	202	82	10000010	‚	&sbquo;	Single low-9 quotation mark
131	203	83	10000011	ƒ	&fnof;	Latin small letter f with hook
132	204	84	10000100	„	&bdquo;	Double low-9 quotation mark
133	205	85	10000101	…	&hellip;	Horizontal ellipsis
134	206	86	10000110	†	&dagger;	Dagger
135	207	87	10000111	‡	&Dagger;	Double dagger
136	210	88	10001000	ˆ	&circ;	Modifier letter circumflex accent
137	211	89	10001001	‰	&permil;	Per mille sign
138	212	8A	10001010	Š	&Scaron;	Latin capital letter S with caron
139	213	8B	10001011	‹	&lsaquo;	Single left-pointing angle quotation mark
140	214	8C	10001100	Œ	&OElig;	Latin capital ligature OE
141	215	8D	10001101
142	216	8E	10001110	Ž		Latin capital letter Z with caron
143	217	8F	10001111
144	220	90	10010000
145	221	91	10010001	‘	&lsquo;	Left single quotation mark
146	222	92	10010010	’	&rsquo;	Right single quotation mark
147	223	93	10010011	“	&ldquo;	Left double quotation mark
148	224	94	10010100	”	&rdquo;	Right double quotation mark
149	225	95	10010101	•	&bull;	Bullet
150	226	96	10010110	–	&ndash;	En dash
151	227	97	10010111	—	&mdash;	Em dash
152	230	98	10011000	˜	&tilde;	Small tilde
153	231	99	10011001	™	&trade;	Trade mark sign
154	232	9A	10011010	š	&scaron;	Latin small letter s with caron
155	233	9B	10011011	›	&rsaquo;	Single right-pointing angle quotation mark
156	234	9C	10011100	œ	&oelig;	Latin small ligature oe
157	235	9D	10011101
158	236	9E	10011110	ž		Latin small letter z with caron
159	237	9F	10011111	Ÿ	&Yuml;	Latin capital letter Y with diaeresis
160	240	A0	10100000		&nbsp;	Non-breaking space
161	241	A1	10100001	¡	&iexcl;	Inverted exclamation mark
162	242	A2	10100010	¢	&cent;	Cent sign
163	243	A3	10100011	£	&pound;	Pound sign
164	244	A4	10100100	¤	&curren;	Currency sign
165	245	A5	10100101	¥	&yen;	Yen sign
166	246	A6	10100110	¦	&brvbar;	Broken vertical bar
167	247	A7	10100111	§	&sect;	Section sign
168	250	A8	10101000	¨	&uml;	Spacing diaeresis – umlaut
169	251	A9	10101001	©	&copy;	Copyright sign
170	252	AA	10101010	ª	&ordf;	Feminine ordinal indicator
171	253	AB	10101011	«	&laquo;	Left double angle quotes
172	254	AC	10101100	¬	&not;	Not sign
173	255	AD	10101101		&shy;	Soft hyphen
174	256	AE	10101110	®	&reg;	Registered trade mark sign
175	257	AF	10101111	¯	&macr;	Spacing macron – overline
176	260	B0	10110000	°	&deg;	Degree sign
177	261	B1	10110001	±	&plusmn;	Plus-or-minus sign
178	262	B2	10110010	²	&sup2;	Superscript two – squared
179	263	B3	10110011	³	&sup3;	Superscript three – cubed
180	264	B4	10110100	´	&acute;	Acute accent – spacing acute
181	265	B5	10110101	µ	&micro;	Micro sign
182	266	B6	10110110	¶	&para;	Pilcrow sign – paragraph sign
183	267	B7	10110111	·	&middot;	Middle dot – Georgian comma
184	270	B8	10111000	¸	&cedil;	Spacing cedilla
185	271	B9	10111001	¹	&sup1;	Superscript one
186	272	BA	10111010	º	&ordm;	Masculine ordinal indicator
187	273	BB	10111011	»	&raquo;	Right double angle quotes
188	274	BC	10111100	¼	&frac14;	Fraction one quarter
189	275	BD	10111101	½	&frac12;	Fraction one half
190	276	BE	10111110	¾	&frac34;	Fraction three quarters
191	277	BF	10111111	¿	&iquest;	Inverted question mark
192	300	C0	11000000	À	&Agrave;	Latin capital letter A with grave
193	301	C1	11000001	Á	&Aacute;	Latin capital letter A with acute
194	302	C2	11000010	Â	&Acirc;	Latin capital letter A with circumflex
195	303	C3	11000011	Ã	&Atilde;	Latin capital letter A with tilde
196	304	C4	11000100	Ä	&Auml;	Latin capital letter A with diaeresis
197	305	C5	11000101	Å	&Aring;	Latin capital letter A with ring above
198	306	C6	11000110	Æ	&AElig;	Latin capital letter AE
199	307	C7	11000111	Ç	&Ccedil;	Latin capital letter C with cedilla
200	310	C8	11001000	È	&Egrave;	Latin capital letter E with grave
201	311	C9	11001001	É	&Eacute;	Latin capital letter E with acute
202	312	CA	11001010	Ê	&Ecirc;	Latin capital letter E with circumflex
203	313	CB	11001011	Ë	&Euml;	Latin capital letter E with diaeresis
204	314	CC	11001100	Ì	&Igrave;	Latin capital letter I with grave
205	315	CD	11001101	Í	&Iacute;	Latin capital letter I with acute
206	316	CE	11001110	Î	&Icirc;	Latin capital letter I with circumflex
207	317	CF	11001111	Ï	&Iuml;	Latin capital letter I with diaeresis
208	320	D0	11010000	Ð	&ETH;	Latin capital letter ETH
209	321	D1	11010001	Ñ	&Ntilde;	Latin capital letter N with tilde
210	322	D2	11010010	Ò	&Ograve;	Latin capital letter O with grave
211	323	D3	11010011	Ó	&Oacute;	Latin capital letter O with acute
212	324	D4	11010100	Ô	&Ocirc;	Latin capital letter O with circumflex
213	325	D5	11010101	Õ	&Otilde;	Latin capital letter O with tilde
214	326	D6	11010110	Ö	&Ouml;	Latin capital letter O with diaeresis
215	327	D7	11010111	×	&times;	Multiplication sign
216	330	D8	11011000	Ø	&Oslash;	Latin capital letter O with slash
217	331	D9	11011001	Ù	&Ugrave;	Latin capital letter U with grave
218	332	DA	11011010	Ú	&Uacute;	Latin capital letter U with acute
219	333	DB	11011011	Û	&Ucirc;	Latin capital letter U with circumflex
220	334	DC	11011100	Ü	&Uuml;	Latin capital letter U with diaeresis
221	335	DD	11011101	Ý	&Yacute;	Latin capital letter Y with acute
222	336	DE	11011110	Þ	&THORN;	Latin capital letter THORN
223	337	DF	11011111	ß	&szlig;	Latin small letter sharp s – ess-zed
224	340	E0	11100000	à	&agrave;	Latin small letter a with grave
225	341	E1	11100001	á	&aacute;	Latin small letter a with acute
226	342	E2	11100010	â	&acirc;	Latin small letter a with circumflex
227	343	E3	11100011	ã	&atilde;	Latin small letter a with tilde
228	344	E4	11100100	ä	&auml;	Latin small letter a with diaeresis
229	345	E5	11100101	å	&aring;	Latin small letter a with ring above
230	346	E6	11100110	æ	&aelig;	Latin small letter ae
231	347	E7	11100111	ç	&ccedil;	Latin small letter c with cedilla
232	350	E8	11101000	è	&egrave;	Latin small letter e with grave
233	351	E9	11101001	é	&eacute;	Latin small letter e with acute
234	352	EA	11101010	ê	&ecirc;	Latin small letter e with circumflex
235	353	EB	11101011	ë	&euml;	Latin small letter e with diaeresis
236	354	EC	11101100	ì	&igrave;	Latin small letter i with grave
237	355	ED	11101101	í	&iacute;	Latin small letter i with acute
238	356	EE	11101110	î	&icirc;	Latin small letter i with circumflex
239	357	EF	11101111	ï	&iuml;	Latin small letter i with diaeresis
240	360	F0	11110000	ð	&eth;	Latin small letter eth
241	361	F1	11110001	ñ	&ntilde;	Latin small letter n with tilde
242	362	F2	11110010	ò	&ograve;	Latin small letter o with grave
243	363	F3	11110011	ó	&oacute;	Latin small letter o with acute
244	364	F4	11110100	ô	&ocirc;	Latin small letter o with circumflex
245	365	F5	11110101	õ	&otilde;	Latin small letter o with tilde
246	366	F6	11110110	ö	&ouml;	Latin small letter o with diaeresis
247	367	F7	11110111	÷	&divide;	Division sign
248	370	F8	11111000	ø	&oslash;	Latin small letter o with slash
249	371	F9	11111001	ù	&ugrave;	Latin small letter u with grave
250	372	FA	11111010	ú	&uacute;	Latin small letter u with acute
251	373	FB	11111011	û	&ucirc;	Latin small letter u with circumflex
252	374	FC	11111100	ü	&uuml;	Latin small letter u with diaeresis
253	375	FD	11111101	ý	&yacute;	Latin small letter y with acute
254	376	FE	11111110	þ	&thorn;	Latin small letter thorn
255	377	FF	11111111	ÿ	&yuml;	Latin small letter y with diaeresis

Extended ASCII Table
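The symbols in the table above match the Windows-1252 code page, which Python exposes as the "cp1252" codec. Treating the table as cp1252 is an assumption made for this sketch, since the lesson does not name a code page. Under that assumption, we can look up single byte values like this:

```python
# Assumption: the extended table corresponds to Python's "cp1252" codec.
euro = bytes([128]).decode("cp1252")
e_acute = bytes([233]).decode("cp1252")
print(euro, e_acute)  # € é

# Positions the table leaves blank (such as 129) are undefined in cp1252,
# so decoding them raises an error.
try:
    bytes([129]).decode("cp1252")
    position_129_undefined = False
except UnicodeDecodeError:
    position_129_undefined = True

print(position_129_undefined)  # True
```

Other 8-bit code pages assign different characters to the same numbers, which is one reason a single worldwide standard is needed.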

To give 256 characters unique numbers, we need 8 bits, because 2^8 = 256.

Standard ASCII (0 to 127) needs only 7 bits, but each character is normally stored in one 8-bit byte. Extended ASCII uses all 8 bits.
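The bit counts can be checked quickly in Python. This sketch uses int.bit_length(), which returns how many binary digits a number needs; the variable names are only for illustration.

```python
# 8 bits can hold 2**8 different values.
values_in_8_bits = 2 ** 8
print(values_in_8_bits)  # 256

# The largest extended code (255) needs all 8 bits.
bits_for_255 = (255).bit_length()
print(bits_for_255)  # 8

# The largest standard ASCII code (127) fits in 7 bits,
# although it is still stored in one 8-bit byte in practice.
bits_for_127 = (127).bit_length()
print(bits_for_127)  # 7
```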

These characters cover the English alphabet plus some accented Latin letters and other symbols.

What about the remaining characters, such as Chinese and Japanese?

Unicode Characters

Unicode aims to cover the characters of all the world's writing systems.

Recent versions of the Unicode standard (version 11.0 and later) define well over 100 thousand characters.

Like ASCII, Unicode assigns a unique number to every character. This number is called a code point and is written in hexadecimal with a U+ prefix, for example U+0041 for A.

The way code points are converted into bytes for storage or transmission is called an encoding.
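In Python, ord() returns the Unicode number of any character, not only of ASCII characters. A short sketch that prints the number in decimal and in the usual U+ hexadecimal notation (the sample characters are chosen only for illustration):

```python
samples = ["A", "\u00a5", "\u4e2d"]  # A, the yen sign, and a Chinese character

for ch in samples:
    code_point = ord(ch)
    print(ch, code_point, f"U+{code_point:04X}")

# A 65 U+0041
# ¥ 165 U+00A5
# 中 20013 U+4E2D
```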

There are several Unicode encodings, such as UTF-8, UTF-16, and UTF-32.
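To see that the same character gets different bytes under different encodings, we can encode one character several ways. This is a small sketch; the big-endian codec variants are used here only so that Python does not add a byte order mark to the output.

```python
yen = "\u00a5"  # the yen sign

utf8 = yen.encode("utf-8")
utf16 = yen.encode("utf-16-be")   # big-endian, no byte order mark
utf32 = yen.encode("utf-32-be")   # big-endian, no byte order mark

print(utf8.hex(), utf16.hex(), utf32.hex())  # c2a5 00a5 000000a5
```

The character is the same in all three cases; only the byte representation changes.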

UTF-8 Encoding

In this class, we discuss the UTF-8 encoding.

UTF stands for Unicode Transformation Format.

Python 3 uses UTF-8 by default, both for source files and for str.encode().
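We can confirm Python's default with a quick check. This sketch assumes Python 3; the variable names are only for illustration.

```python
import sys

# The default encoding used when converting between str and bytes.
default_encoding = sys.getdefaultencoding()
print(default_encoding)  # utf-8

# Calling encode() with no argument gives the same bytes as asking for UTF-8.
same_bytes = "\u00a5".encode() == "\u00a5".encode("utf-8")
print(same_bytes)  # True
```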

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

UTF-8 Encoding
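The byte layout in the table above can be followed by hand. The function below is only a teaching sketch, not how Python encodes strings internally; the name utf8_encode is made up for this example. It fills the x positions of each pattern with the bits of the code point.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the byte layout table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# One sample character from each row of the table.
for ch in ["A", "\u00a5", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex())

# U+0041 41
# U+00A5 c2a5
# U+4E2D e4b8ad
# U+1F600 f09f9880
```

The results agree with Python's built-in str.encode("utf-8") for the same characters.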

Look at the UTF-8 encoding table above.

The first 128 Unicode code points (0 to 127) are the same as the ASCII characters.

These ASCII characters take one byte of memory in UTF-8.

The next range, U+0080 to U+07FF, covers accented Latin letters, Greek, Cyrillic, Hebrew, Arabic, Thaana, etc. Each of these characters is given 16 bits of space.

That is, 2 bytes are used to store each character of the scripts mentioned above.

The rest of the range up to U+FFFF, which includes Chinese, Japanese, and most other scripts in current use, takes 3 bytes per character.

Characters above U+FFFF, such as emoji and rarely used historic scripts, take four bytes.

So UTF-8 is a variable-width encoding: different characters take different amounts of memory.
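We can measure these sizes in Python by encoding single characters and counting the bytes. A small sketch with one sample character from each size class (the sample characters are chosen only for illustration):

```python
# A, e with acute accent, a Chinese character, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)  # [1, 2, 3, 4]

# A string mixing them takes the sum of the individual sizes.
total_bytes = len("".join(samples).encode("utf-8"))
print(total_bytes)  # 10
```

Note that len() on the string itself counts characters, not bytes, so it would return 4 here.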

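We said a string is a sequence of characters.

Suppose a string contains an English letter and a Chinese character.

Each character takes a different amount of memory. How do we identify how many bytes a character takes?

From the above table, we observe that the most significant bit of an ASCII character is 0.

If the first bit is 0, the character takes one byte of memory.

For two-byte characters, such as accented Latin and Greek letters, the most significant bits of the first byte are 110.

For three-byte characters, such as Chinese and Japanese, the most significant bits of the first byte are 1110.

For the remaining four-byte characters, the most significant bits of the first byte are 11110.

Every following byte of a multi-byte character starts with 10, so it cannot be mistaken for a first byte.

With the help of the most significant bits of the first byte, we can identify the space taken by the character.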
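The check on the most significant bits can be sketched in Python as follows. The function name utf8_sequence_length is made up for this example, and the loop assumes the input is valid UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a character takes, judged by its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a valid first byte")


# An English letter, the yen sign, and a Chinese character.
data = "A\u00a5\u4e2d".encode("utf-8")

lengths = []
position = 0
while position < len(data):
    size = utf8_sequence_length(data[position])
    lengths.append(size)
    position += size  # jump to the first byte of the next character

print(lengths)  # [1, 2, 3]
```

A byte that starts with 10 is a continuation byte, so passing it as a first byte raises an error in this sketch.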

Some symbols and their UTF-8 encodings are shown below. Each row lists the code point, the character, its UTF-8 bytes in hexadecimal, and its name.

U+00A1	¡	c2 a1	INVERTED EXCLAMATION MARK
U+00A2	¢	c2 a2	CENT SIGN
U+00A3	£	c2 a3	POUND SIGN
U+00A4	¤	c2 a4	CURRENCY SIGN
U+00A5	¥	c2 a5	YEN SIGN
U+00A6	¦	c2 a6	BROKEN BAR
U+00A7	§	c2 a7	SECTION SIGN
U+00A8	¨	c2 a8	DIAERESIS
U+00A9	©	c2 a9	COPYRIGHT SIGN
U+00AA	ª	c2 aa	FEMININE ORDINAL INDICATOR
U+00AB	«	c2 ab	LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00AC	¬	c2 ac	NOT SIGN
U+00AD		c2 ad	SOFT HYPHEN
U+00AE	®	c2 ae	REGISTERED SIGN
U+00AF	¯	c2 af	MACRON
U+00B0	°	c2 b0	DEGREE SIGN
U+00B1	±	c2 b1	PLUS-MINUS SIGN
U+00B2	²	c2 b2	SUPERSCRIPT TWO
U+00B3	³	c2 b3	SUPERSCRIPT THREE
U+00B4	´	c2 b4	ACUTE ACCENT

Sample Unicode Table

The Unicode code point of the yen symbol is U+00A5. Its UTF-8 encoding is the two bytes c2 a5, written in hexadecimal.
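For the yen symbol, the code point and the UTF-8 bytes are two different values, and we can see both in Python. The sketch below also rebuilds the two bytes by hand from the two-byte pattern 110xxxxx 10xxxxxx; the variable names are only for illustration.

```python
yen = "\u00a5"

code_point = ord(yen)
utf8_bytes = yen.encode("utf-8")
print(f"U+{code_point:04X}", utf8_bytes.hex())  # U+00A5 c2a5

# By hand: 0xA5 is 10100101 in binary.
# The top bits (00010) go into 110xxxxx, the low six bits (100101) into 10xxxxxx.
first_byte = 0b11000000 | (code_point >> 6)
second_byte = 0b10000000 | (code_point & 0b111111)
print(hex(first_byte), hex(second_byte))  # 0xc2 0xa5
```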
