Read CSV File to Data Frame Pandas

In this class, we discuss how to read a CSV file into a pandas data frame.


Data Frame

The reader should have prior knowledge of Python.

The first and foremost thing to do in data science is to read the data and modify the data according to requirements.

To do that, Python has a separate library called pandas.

In the pandas library, there is a class called DataFrame that is used to hold and modify data.

The data is available in different formats: CSV files, Excel files, SQL databases, etc.

In this class, we understand how to read data from CSV files into data frames.

Reading data from other file types is simple once the reader understands the concept of the data frame.

A data frame is two-dimensional, size-mutable, and heterogeneous tabular data.

Size-mutable means we can add or remove rows and columns after the data frame is created. We discussed mutability in our class on mutable and immutable objects in Python.

Heterogeneous means that different types of data are allowed in the data frame. Each column can hold its own data type.

The data frame is tabular data, so it consists of rows and columns.
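The three properties above can be seen on a small hand-made data frame. This is only a sketch: the column names and values are made up for illustration and do not come from the CSV file used later in this class.

```python
import pandas as pd

# A small hand-made frame; the column names and values are illustrative only.
df = pd.DataFrame({
    "Sno": [1, 2],                # integers
    "length": [3.6, 1.8],         # floats
    "label": ["long", "short"],   # strings
})

print(df.dtypes)   # heterogeneous: each column keeps its own data type
print(df.shape)    # two-dimensional: (rows, columns)

# Size-mutable: a column can be added to the existing frame.
df["wait"] = [79, 54]
print(df.shape)
```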

There are different methods and attributes present in the class data frame. We discuss those in our coming classes.

There is a separate function called read_csv in the pandas library. The function is used to read the data from a CSV file.

The function read_csv has many parameters. The first parameter is the file path.

If we give the path of the CSV file as the first parameter, the function reads the file from that path and converts it to a data frame.

The object returned by the function read_csv is a data frame object.
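A minimal sketch of the file path parameter is given here. So that it runs anywhere, it first writes a tiny CSV file to a temporary folder. The file name sample.csv is hypothetical, and the three data rows are copied from the output shown later in this class.

```python
import os
import tempfile

import pandas as pd

# Stand-in data: a header line and three rows (not the full file).
csv_text = (
    "Sno,Eruption length (mins),Eruption wait (mins)\n"
    "1,3.600,79\n"
    "2,1.800,54\n"
    "3,3.333,74\n"
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")   # hypothetical file name
    with open(path, "w") as f:
        f.write(csv_text)
    df = pd.read_csv(path)   # first parameter: the path of the CSV file

print(type(df))   # the returned object is a data frame
print(df)
```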

CSV File

The example of the CSV file is shown in the image below.

In a CSV file, the data is stored as comma-separated values.

In the above CSV file, the first line represents column names. A comma separates each column.
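To make the layout concrete, here is a sketch of how such text is split into columns. The text below is an assumed, simplified stand-in for the file, built from the column names and rows visible in the output of this class. It is passed through io.StringIO, which lets read_csv read from a string instead of a file.

```python
import io

import pandas as pd

# Assumed stand-in for the file contents: line 1 holds the column names,
# and every following line holds one row of comma-separated values.
csv_text = """Sno,Eruption length (mins),Eruption wait (mins)
1,3.600,79
2,1.800,54
"""

df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))   # the entries of the first line
print(len(df))            # the number of data rows
```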

The file is named faithful.csv.

The program below shows how to read data from a CSV file and place it in the data frame.

import pandas as pd
df = pd.read_csv('faithful.csv')
# First line Taken as column names
# Index is created 
print(df)
#ten rows will be displayed

Output:
     Sno   "Eruption length (mins)"  Eruption wait (mins)
0      1                      3.600                    79
1      2                      1.800                    54
2      3                      3.333                    74
3      4                      2.283                    62
4      5                      4.533                    85
..   ...                        ...                   ...
267  268                      4.117                    81
268  269                      2.150                    46
269  270                      4.417                    90
270  271                      1.817                    46
271  272                      4.467                    74

[272 rows x 3 columns]

In the program, we defined a variable df. The variable df references the data frame object because the read_csv function returns a data frame object.

Note: The variable df can now use the methods and attributes of the DataFrame class.

The first line of the CSV file is taken as column names in the data frame.

The output of the program is given above.

A separate index is created, and it starts from zero.

When a data frame has more rows than the display.max_rows option allows (60 by default), only the first five lines and the last five lines of the data are displayed.
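The automatically created index can be inspected through the index attribute. The sketch below uses a three-row stand-in for the file, read from a string through io.StringIO, so the numbers are only illustrative.

```python
import io

import pandas as pd

# Three rows copied from the output above, as a stand-in for faithful.csv.
csv_text = """Sno,Eruption length (mins),Eruption wait (mins)
1,3.600,79
2,1.800,54
3,3.333,74
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.index)         # the index created by read_csv
print(list(df.index))   # row labels start from zero
```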

Display all the Data Frame Lines

To display all the lines in the data frame, we need to set a display option in pandas. The code is given below.

# to display all the rows
pd.set_option('display.max_rows', None)
df = pd.read_csv('faithful.csv')
print(df)

Output:
    Sno   "Eruption length (mins)"  Eruption wait (mins)
0      1                      3.600                    79
1      2                      1.800                    54
2      3                      3.333                    74
3      4                      2.283                    62
4      5                      4.533                    85
5      6                      2.883                    55
6      7                      4.700                    88
7      8                      3.600                    85
8      9                      1.950                    51
9     10                      4.350                    85
10    11                      1.833                    54
11    12                      3.917                    84

Only the first rows of the output are reproduced here.

The function set_option changes a pandas setting. Setting the option display.max_rows to None removes the limit on the number of rows displayed.
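set_option changes the setting for the rest of the program. If the change is needed for one print only, pandas also provides option_context, which restores the previous value afterwards. The sketch below assumes the default display settings, and uses 100 generated rows as a stand-in for a long CSV file.

```python
import pandas as pd

# 100 generated rows stand in for a long CSV file (illustrative data).
df = pd.DataFrame({"n": range(100)})

short_view = repr(df)   # default settings: a long frame is abbreviated

# Temporarily remove the row limit; the old value is restored on exit.
with pd.option_context("display.max_rows", None):
    full_view = repr(df)

# Compare how many text lines each view takes.
print(len(short_view.splitlines()), len(full_view.splitlines()))
```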

Giving own Column Names

The names parameter is used to give column names according to the user requirement. The code is given below.

df = pd.read_csv('faithful.csv',names=['x','y','z'])
# names take the column names
print(df)

Output:
       x                          y                     z
0    Sno   "Eruption length (mins)"  Eruption wait (mins)
1      1                      3.600                    79
2      2                      1.800                    54
3      3                      3.333                    74
4      4                      2.283                    62
5      5                      4.533                    85

names=['x','y','z']: we pass a list of column names. These names are used as the column names of the data frame.

The first line in the CSV file is read as data and shown in the output.
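The effect can be checked on a small stand-in for the file. In this sketch, the text is read from a string through io.StringIO, and the first row of the resulting data frame holds the original header line.

```python
import io

import pandas as pd

# Three rows copied from the output above, as a stand-in for faithful.csv.
csv_text = """Sno,Eruption length (mins),Eruption wait (mins)
1,3.600,79
2,1.800,54
3,3.333,74
"""

# Only names is given, so pandas treats every line as data.
df = pd.read_csv(io.StringIO(csv_text), names=["x", "y", "z"])
print(df.iloc[0].tolist())   # the old header line is now row 0
print(len(df))               # one more row than the file has data lines
```

Because the header text is mixed into each column, the numbers in those columns are read as text rather than as numbers.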

Removing First Line From CSV File

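Passing header=0 to the read_csv function tells pandas that line 0 of the file is the header. When names is also given, that header line is replaced by the given names and is not read as data.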

df = pd.read_csv('faithful.csv',names=['x','y','z'],header=0)
# header=0: line 0 is the header, so it is not read as data
print(df)

Output:
       x      y   z
0      1  3.600  79
1      2  1.800  54
2      3  3.333  74
3      4  2.283  62
4      5  4.533  85
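The same call can be tried on a small stand-in for the file, read from a string through io.StringIO. This sketch shows that the old header line is gone and that the columns are numeric again.

```python
import io

import pandas as pd

# Three rows copied from the output above, as a stand-in for faithful.csv.
csv_text = """Sno,Eruption length (mins),Eruption wait (mins)
1,3.600,79
2,1.800,54
3,3.333,74
"""

# header=0 marks line 0 as the header; names replaces it.
df = pd.read_csv(io.StringIO(csv_text), names=["x", "y", "z"], header=0)
print(df)
print(df.dtypes)   # the columns hold numbers, not text
```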

Taking Required columns

Suppose we want to take only a few columns from the CSV file. For this, we have the usecols option.

The usecols option takes a list of column numbers. The numbers start from zero.

In the example given below, we are considering the last two columns.

df = pd.read_csv('faithful.csv',names=['x','y','z'],header=0,usecols=[1,2])
# usecols keep the required columns
print(df)

Output:
         y   z
0    3.600  79
1    1.800  54
2    3.333  74
3    2.283  62
4    4.533  85
5    2.883  55
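Besides column numbers, usecols also accepts column names. The sketch below selects the same two columns both ways on a small stand-in for the file, read from a string through io.StringIO.

```python
import io

import pandas as pd

# Three rows copied from the output above, as a stand-in for faithful.csv.
csv_text = """Sno,Eruption length (mins),Eruption wait (mins)
1,3.600,79
2,1.800,54
3,3.333,74
"""

# Select the last two columns by position (numbers start from zero).
by_position = pd.read_csv(
    io.StringIO(csv_text), names=["x", "y", "z"], header=0, usecols=[1, 2]
)

# Select the same columns by the names given in the names parameter.
by_name = pd.read_csv(
    io.StringIO(csv_text), names=["x", "y", "z"], header=0, usecols=["y", "z"]
)

print(by_position)
print(by_name)
```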

Using column as Index in Data Frame

By default, read_csv creates a new index.

Suppose we want to consider one of the columns as an index in the data frame. We use the parameter index_col.

The program is given below, taking Sno as the index column.

df = pd.read_csv('faithful.csv',index_col='Sno')
# index_col uses the Sno column as the index
print(df)

Output:
      "Eruption length (mins)"  Eruption wait (mins)
Sno                                                 
1                        3.600                    79
2                        1.800                    54
3                        3.333                    74
4                        2.283                    62
5                        4.533                    85
6                        2.883                    55
7                        4.700                    88
8                        3.600                    85
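Once a column is used as the index, rows can be looked up by its values. The sketch below uses a small stand-in for the file, read from a string through io.StringIO, and picks one value with loc.

```python
import io

import pandas as pd

# Three rows copied from the output above, as a stand-in for faithful.csv.
csv_text = """Sno,Eruption length (mins),Eruption wait (mins)
1,3.600,79
2,1.800,54
3,3.333,74
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="Sno")
print(df.index.name)     # the index is now the Sno column
print(list(df.columns))  # Sno is no longer an ordinary column

# Look up the wait time of the row whose Sno is 2.
wait = df.loc[2, "Eruption wait (mins)"]
print(wait)
```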

Removing Last Rows

Suppose we need to remove the last five rows, i.e., the last five lines of the file should not be included in the data frame.

We use the parameter skipfooter. The example is shown below.

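df = pd.read_csv('faithful.csv',index_col='Sno',skipfooter=5,engine='python')
# skipfooter: number of lines to skip at the bottom of the file
# engine='python' is given because the default C engine does not support skipfooter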
print(df)
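The effect of skipfooter can be checked on a small stand-in for the file. This sketch builds five data rows from the output shown earlier, reads them from a string through io.StringIO, and skips the last two lines. The python engine is requested explicitly because the default C engine does not support skipfooter.

```python
import io

import pandas as pd

# Five rows copied from the output above, as a stand-in for faithful.csv.
lines = [
    "Sno,Eruption length (mins),Eruption wait (mins)",
    "1,3.600,79",
    "2,1.800,54",
    "3,3.333,74",
    "4,2.283,62",
    "5,4.533,85",
]
csv_text = "\n".join(lines)

# Skip the last two lines of the text while reading.
df = pd.read_csv(
    io.StringIO(csv_text), index_col="Sno", skipfooter=2, engine="python"
)
print(df)
print(len(df))   # rows left after skipping the footer lines
```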