Understanding Lexical Analysis
In this class, we discuss lexical analysis.
The reader should have prior knowledge of the phases of a compiler.
We will understand the responsibilities of the lexical analysis phase.
We give the below program as input to the lexical analysis phase.
We have chosen the C programming language.
The source code is the input to the lexical analysis phase for any language.
The lexical analysis phase reads the source program character by character.
The first line has main followed by an opening and a closing parenthesis.
The lexical analysis phase separates these into the symbols main, ( and ).
The second line has an opening curly brace (flower bracket), which is separated as a symbol.
In the third line of the source program, we separate int, i, =, and so on: all the keywords, identifiers, symbols, and numbers.
Whenever the lexical analyzer identifies an identifier, it places the identifier's details in the symbol table.
We write the identifier as <id, 1>. Here id means identifier, and 1 means that the identifier is stored in the first row of the symbol table.
After the identifier and the equal symbol, we identify a number.
We write the number as <num, 20>.
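The separation described above can be sketched in a few lines of Python. This is only a minimal sketch under our own assumptions, not the program we develop in the next classes: the function name tokenize, the token names keyword and symbol, and the small keyword set are all hypothetical, and every character that is not a letter, digit, or space is treated as a one-character symbol.

```python
import string

KEYWORDS = {"int", "if", "else", "while", "return"}  # assumed subset of C keywords
LETTERS = string.ascii_letters + "_"
DIGITS = string.digits


def tokenize(source):
    """Split source into tokens and record identifiers in a symbol table."""
    tokens = []
    symbol_table = []  # row number of an identifier = its list index + 1
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch in " \t\r\n":
            pos += 1  # white space only separates tokens
        elif ch in LETTERS:
            start = pos
            while pos < len(source) and source[pos] in LETTERS + DIGITS:
                pos += 1
            word = source[start:pos]
            if word in KEYWORDS:
                tokens.append(("keyword", word))
            else:
                if word not in symbol_table:
                    symbol_table.append(word)
                tokens.append(("id", symbol_table.index(word) + 1))
        elif ch in DIGITS:
            start = pos
            while pos < len(source) and source[pos] in DIGITS:
                pos += 1
            tokens.append(("num", int(source[start:pos])))
        else:
            tokens.append(("symbol", ch))  # simplification: one character per symbol
            pos += 1
    return tokens, symbol_table


tokens, table = tokenize("int i = 20;")
print(tokens)  # [('keyword', 'int'), ('id', 1), ('symbol', '='), ('num', 20), ('symbol', ';')]
print(table)   # ['i']
```

The pair ('id', 1) plays the role of <id, 1>, and ('num', 20) plays the role of <num, 20>.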
In the lexical analysis program, we need to write logic to identify all these identifiers, numbers, symbols, etc.
We discuss the programmatic intuition for lexical analysis in the next classes.
Below we discuss some of the conditions to understand the programming of the lexical phase.
Responsibilities of Lexical Analyzer
1) Removal of white spaces and comments.
The above diagram shows the elimination of white spaces.
Not only single spaces but also tab spaces should be eliminated.
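One way to eliminate white spaces and comments is to skip over them before reading each token. The sketch below assumes C-style // and /* */ comments; the helper name skip_blanks is hypothetical, and an unterminated /* comment is simply treated as running to the end of the input.

```python
def skip_blanks(source, pos):
    """Return the index of the first character at or after pos that is
    neither white space nor part of a comment."""
    while pos < len(source):
        if source[pos] in " \t\r\n":
            pos += 1
        elif source.startswith("//", pos):
            # Line comment: skip up to and including the newline
            newline = source.find("\n", pos)
            pos = len(source) if newline == -1 else newline + 1
        elif source.startswith("/*", pos):
            # Block comment: skip past the closing */
            end = source.find("*/", pos + 2)
            pos = len(source) if end == -1 else end + 2
        else:
            break
    return pos


code = "  \t/* counter */ int i; // done"
start = skip_blanks(code, 0)
print(code[start:start + 3])  # int
```

Skipping between tokens, rather than deleting every space in the file up front, keeps int and i from merging into a single word.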
2) Counting the number of lines.
Counting the lines helps to identify the last line of the program, and it lets the compiler report the line number on which an error occurs.
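Line counting can be sketched as a counter that increases on every newline character. In the sketch below, the function name scan, the set LEGAL of accepted characters, and the wording of the message are our own assumptions for illustration.

```python
import string

# Assumed set of characters that our small language accepts
LEGAL = set(string.ascii_letters + string.digits + " \t\n(){};=<>+-*/\"")


def scan(source):
    """Return the number of the last line read and a list of messages
    for characters that are not in LEGAL."""
    line = 1
    errors = []
    for ch in source:
        if ch == "\n":
            line += 1  # every newline starts a new line
        elif ch not in LEGAL:
            errors.append(f"line {line}: unexpected character {ch!r}")
    return line, errors


program = "main()\n{\nint i = 20 @;\n}"
last_line, errors = scan(program)
print(last_line)  # 4
print(errors)     # ["line 3: unexpected character '@'"]
```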
3) Reading ahead
What does reading ahead mean?
Example:
Take the two operators > and >=.
Our example has the condition k >= 50.
After reading the greater-than symbol, the lexical analyzer should not immediately decide that it is the > operator.
It should look ahead one more symbol and only then decide whether it is the greater-than-or-equal operator.
In our language, we are reading ahead one symbol.
Depending on the language syntax, the lexical analyzer may have to read ahead two or even five symbols.
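Reading ahead one symbol can be sketched as follows. The function name read_relational_operator is hypothetical, and the sketch only handles the four operators >, >=, <, and <=.

```python
def read_relational_operator(source, pos):
    """Read >, >=, < or <= starting at pos. Return (operator, next_pos)."""
    ch = source[pos]
    if ch not in ("<", ">"):
        raise ValueError(f"no relational operator at position {pos}")
    # Read ahead: peek at the next character without consuming it yet
    lookahead = source[pos + 1] if pos + 1 < len(source) else ""
    if lookahead == "=":
        return ch + "=", pos + 2  # two-character operator
    return ch, pos + 1            # one-character operator


print(read_relational_operator("k >= 50", 2))  # ('>=', 4)
print(read_relational_operator("k > 50", 2))   # ('>', 3)
```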
4) Reading constants
20 is an integer constant.
20.2 is a floating-point constant.
The lexical analyzer needs to separate these constants.
Important: In the C language, there is a rule that variable names must not start with a digit.
If we removed this rule and allowed variable names to start with digits, the compiler would have difficulty separating variables from constants.
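Because a name can never start with a digit, seeing a digit is enough to know that a constant begins. The sketch below separates integer and floating-point constants; the function name read_number is hypothetical, and it is deliberately simplified: it does not handle exponents, suffixes, or forms such as 20. that real C accepts.

```python
import string

DIGITS = string.digits


def read_number(source, pos):
    """Read a constant starting at pos. Return (kind, value, next_pos)."""
    if source[pos] not in DIGITS:
        raise ValueError(f"no constant at position {pos}")
    start = pos
    while pos < len(source) and source[pos] in DIGITS:
        pos += 1
    # A '.' followed by a digit makes this a floating-point constant
    if pos + 1 < len(source) and source[pos] == "." and source[pos + 1] in DIGITS:
        pos += 1
        while pos < len(source) and source[pos] in DIGITS:
            pos += 1
        return "float", float(source[start:pos]), pos
    return "int", int(source[start:pos]), pos


print(read_number("20;", 0))    # ('int', 20, 2)
print(read_number("20.2;", 0))  # ('float', 20.2, 4)
```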
5) Keywords and identifiers.
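A common way to separate keywords from identifiers is to read the whole word first and then look it up in a table of keywords. The sketch below assumes a small, hypothetical keyword set and a hypothetical function name read_word.

```python
import string

KEYWORDS = {"int", "float", "if", "else", "while", "return"}  # assumed subset
FIRST_CHARS = string.ascii_letters + "_"
WORD_CHARS = FIRST_CHARS + string.digits


def read_word(source, pos):
    """Read a word starting at pos. Return (kind, word, next_pos)."""
    if source[pos] not in FIRST_CHARS:
        raise ValueError(f"no keyword or identifier at position {pos}")
    start = pos
    while pos < len(source) and source[pos] in WORD_CHARS:
        pos += 1
    word = source[start:pos]
    # Decide only after the whole word is read
    kind = "keyword" if word in KEYWORDS else "id"
    return kind, word, pos


print(read_word("int i", 0))         # ('keyword', 'int', 3)
print(read_word("interest = 5", 0))  # ('id', 'interest', 8)
```

Reading the whole word before the lookup is what keeps interest from being mistaken for the keyword int followed by erest.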
6) String literals should be separated.
In our example, “hello” in the print statement is a string literal.
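Separating a string literal can be sketched as reading everything between two double quotes. The function name read_string_literal is hypothetical, the printf call in the usage is an assumed form of the print statement, and escape sequences such as \" are not handled in this sketch.

```python
def read_string_literal(source, pos):
    """Read a double-quoted literal starting at pos. Return (text, next_pos)."""
    if source[pos] != '"':
        raise ValueError(f"no string literal at position {pos}")
    end = pos + 1
    while end < len(source) and source[end] != '"':
        end += 1
    if end >= len(source):
        raise ValueError("unterminated string literal")
    # Return the text without the quotes and the index just past the closing quote
    return source[pos + 1:end], end + 1


statement = 'printf("hello");'
print(read_string_literal(statement, 7))  # ('hello', 14)
```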
7) Operator symbols
>, >=, <, <=, and all the other operators.
The lexical phase needs to identify all the above in our programming language.