Programming Understanding of Lexical Analysis
In this class, we discuss the programming understanding of lexical analysis.
For Complete YouTube Video: Click Here
The reader should have prior knowledge of lexical analysis. Click Here.
The example used in this class is discussed in our previous classes.
if( i <= 50)
The above code line should be converted into tokens.
The below diagram shows how the code line is placed in memory.
In our code example, we do not have spaces in memory.
If spaces are present in the code, they also occupy memory, and during analysis we should eliminate them.
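A minimal sketch in C of this space elimination, assuming the line is stored in a character array named input with a reading position pos (both names are ours for illustration):

#include <stdio.h>

int main(void) {
    const char *input = "if( i <= 50)";   /* the line as placed in memory */
    int pos = 3;                          /* suppose we stopped at a space */
    while (input[pos] == ' ')             /* eliminate spaces during analysis */
        pos++;
    printf("next symbol: %c\n", input[pos]);   /* prints: next symbol: i */
    return 0;
}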
We read characters one by one from the start of the line.
For explanation purposes, we start reading from the less-than symbol (<) in memory.
The below diagram shows the logic to identify tokens from the source code.
We constructed finite automata to identify tokens.
One should have an understanding of finite automata; you can study them in our Formal Languages and Automata Theory course.
In our Theory of Computation course, we discussed how to write a program for a finite automaton.
The program for a finite automaton is written using if-else statements and functions.
When we see the input symbol <, we move to state 1.
After the < symbol, if we see the input =, we move to final state 2.
When we reach final state 2, we have seen the relational operator <=.
On reaching the final state, we return the token for the relational operator <=.
In our previous classes, we discussed the lookahead symbol.
Let us understand the lookahead symbol with an example.
After finding the less-than symbol, if we find a greater-than symbol, we move to state 3.
State 3 returns the not-equal-to token (<>).
After finding the less-than symbol, suppose we find any other symbol; then we move to final state 4.
State 4 is given with a * mark.
The meaning of * is to move one step back in the input memory.
State 4 returns the less-than token.
Suppose, after the less-than symbol, we find some other character.
We need to treat the less-than symbol alone as a token and move back one step in the input, so that the other character can be considered for the next lexeme.
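A minimal sketch in C of this relational-operator automaton, written with if-else statements as described above. The function name get_relop and the token names are our assumptions for illustration; since the sketch only advances the position when it consumes a symbol, the * retract of state 4 corresponds to simply not advancing:

#include <stdio.h>

const char *input;   /* the source line placed in memory */
int pos;             /* current reading position */

/* Recognizes <, <= and <> starting at input[pos]; returns the token name. */
const char *get_relop(void) {
    if (input[pos] == '<') {            /* state 0 -> state 1 */
        pos++;
        if (input[pos] == '=') {        /* state 1 -> final state 2 */
            pos++;
            return "RELOP_LE";          /* token for <= */
        }
        if (input[pos] == '>') {        /* state 1 -> final state 3 */
            pos++;
            return "RELOP_NE";          /* token for <> (not equal) */
        }
        return "RELOP_LT";              /* final state 4 (*): token is <; the
                                           other symbol is left in the input
                                           for the next lexeme (the retract) */
    }
    return NULL;                        /* not a relational operator */
}

int main(void) {
    input = "<= 50)";
    pos = 0;
    printf("%s\n", get_relop());        /* prints: RELOP_LE */
    return 0;
}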
Important:
After finding the symbol <= in the input string, we separate it as a token.
The next characters are again checked from state zero.
That is, we take the next character and start again from state 0 to identify the next token.
The finite automaton is written to identify all the tokens needed in our language.
The complete finite automaton is too big to show here.
We show the parts of the finite automaton that identify keywords and identifiers separately.
The below diagram shows the finite automata for identifying keywords and identifiers.
There are two ways to identify identifiers and keywords:
1) Maintain a separate table for keywords.
Both identifiers and keywords start with an alphabet, so a single automaton can recognize them.
If we find an alphabet, a dollar symbol, or an underscore, we move to state 10.
Identifiers and keywords can be of any length.
After the first character, if we read a digit, an alphabet, a dollar symbol, or an underscore, we stay in state 10.
If we find any other symbol not used in the identifier, we move to state 11.
State 11 has a star symbol. This means we move one step back in the input memory.
Let us understand with an example:
Suppose we take the input inti=10.
After reading input ‘i’, we move to state 10.
After reading n, t, and i, we stay in state 10.
After reading input ‘=’, we move to state 11.
State 11 will move one step back in input memory because the symbol ‘=’ should be considered in the next token.
‘inti’ should be considered an identifier.
How do we decide that ‘inti’ is an identifier and not a keyword?
We maintain a list of all the keywords in the language.
The word ‘inti’ is checked against the table. If it is found, it is a keyword; otherwise, it is an identifier. Since ‘inti’ is not in the keyword table, it is an identifier.
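A minimal sketch in C of method 1: one automaton (state 10 looping, state 11 with the * retract) recognizes the whole word, and a separate keyword table decides whether it is a keyword or an identifier. The table contents and the name scan_word are our assumptions for illustration:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

const char *keywords[] = { "if", "int", "else", "while", "return" };
const int NUM_KEYWORDS = 5;

const char *input;   /* the source line placed in memory */
int pos;             /* current reading position */

/* Scans a word that starts with an alphabet, $ or _, stays in state 10 on
   alphabets, digits, $ and _, and stops at any other symbol (state 11). */
const char *scan_word(char *lexeme) {
    int i = 0;
    char c = input[pos];
    if (!isalpha(c) && c != '$' && c != '_')
        return NULL;                             /* not a word at all */
    while (isalnum(c) || c == '$' || c == '_') { /* stay in state 10 */
        lexeme[i++] = c;
        c = input[++pos];
    }
    lexeme[i] = '\0';
    /* state 11: the stopping symbol (here '=') is not consumed, so it is
       left in the input for the next token (the * retract) */
    for (int k = 0; k < NUM_KEYWORDS; k++)
        if (strcmp(lexeme, keywords[k]) == 0)
            return "KEYWORD";
    return "IDENTIFIER";
}

int main(void) {
    char lexeme[64];
    input = "inti=10";
    pos = 0;
    const char *token = scan_word(lexeme);
    printf("%s -> %s\n", lexeme, token);   /* prints: inti -> IDENTIFIER */
    return 0;
}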
2) Write logic for all the keywords.
The below diagram shows the finite automata to identify int and if.
Similarly, write the finite automata for all the keywords.
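A minimal sketch in C of method 2, with the automaton states for if and int hard-coded as if-else logic. The check that the following character cannot continue an identifier plays the role of the final-state retract; the function name and token names are our assumptions:

#include <stdio.h>
#include <ctype.h>

/* Returns the keyword token if the text at s starts with "if" or "int"
   followed by a symbol that cannot continue an identifier. */
const char *match_keyword(const char *s) {
    if (s[0] == 'i') {
        if (s[1] == 'f' && !isalnum(s[2]) && s[2] != '_')
            return "IF";                      /* states: i -> f -> final */
        if (s[1] == 'n' && s[2] == 't' && !isalnum(s[3]) && s[3] != '_')
            return "INT";                     /* states: i -> n -> t -> final */
    }
    return NULL;                              /* not one of these keywords */
}

int main(void) {
    printf("%s\n", match_keyword("if( i <= 50)"));   /* prints: IF */
    printf("%s\n", match_keyword("int x;"));         /* prints: INT */
    return 0;
}

Note that match_keyword("inti=10") returns NULL, because the ‘i’ after "int" can continue an identifier; this is exactly why ‘inti’ is an identifier and not the keyword int.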
We need to identify constants.
The value 50 is a constant.
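A minimal sketch in C of an automaton for integer constants such as 50: it stays in the digit state while reading digits and stops (with a retract) at the first non-digit. The function name scan_number is our assumption:

#include <stdio.h>
#include <ctype.h>

/* Reads consecutive digits starting at s[*pos] and returns their value. */
int scan_number(const char *s, int *pos) {
    int value = 0;
    while (isdigit(s[*pos])) {        /* stay in the digit state */
        value = value * 10 + (s[*pos] - '0');
        (*pos)++;
    }
    /* the non-digit that stopped the loop is not consumed (the retract) */
    return value;
}

int main(void) {
    int pos = 0;
    printf("%d\n", scan_number("50)", &pos));   /* prints: 50 */
    return 0;
}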
The finite automaton for identifying floating-point numbers is shown in our Theory of Computation course.
We are not showing it here; it is out of scope for compiler design.
Finally, we write a program that identifies all the tokens in the source program.
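As a closing sketch, the pieces above can be combined into one loop that starts every token from state 0, eliminates spaces, and picks an automaton by the first symbol. It tokenizes our example line if( i <= 50); the token names and the tiny keyword table are our assumptions, not a complete scanner:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main(void) {
    const char *input = "if( i <= 50)";
    const char *keywords[] = { "if", "int" };
    int pos = 0;

    while (input[pos] != '\0') {
        char c = input[pos];
        if (c == ' ') { pos++; continue; }            /* eliminate spaces */
        if (isalpha(c) || c == '$' || c == '_') {     /* word: state 10 */
            char lex[32]; int i = 0;
            while (isalnum(input[pos]) || input[pos] == '$' || input[pos] == '_')
                lex[i++] = input[pos++];
            lex[i] = '\0';                            /* state 11: retract */
            int kw = 0;
            for (int k = 0; k < 2; k++)
                if (strcmp(lex, keywords[k]) == 0) kw = 1;
            printf("%-10s %s\n", kw ? "KEYWORD" : "IDENTIFIER", lex);
        } else if (isdigit(c)) {                      /* constant */
            char lex[32]; int i = 0;
            while (isdigit(input[pos])) lex[i++] = input[pos++];
            lex[i] = '\0';
            printf("%-10s %s\n", "CONSTANT", lex);
        } else if (c == '<') {                        /* relational operator */
            pos++;
            if (input[pos] == '=')      { pos++; printf("%-10s <=\n", "RELOP"); }
            else if (input[pos] == '>') { pos++; printf("%-10s <>\n", "RELOP"); }
            else printf("%-10s <\n", "RELOP");        /* retract: < alone */
        } else {                                      /* punctuation: ( ) etc. */
            printf("%-10s %c\n", "SYMBOL", c);
            pos++;
        }
    }
    return 0;
}

Running it prints the token stream: KEYWORD if, SYMBOL (, IDENTIFIER i, RELOP <=, CONSTANT 50, SYMBOL ).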