Read CSV File to Data Frame Pandas
In this class, we discuss how to read a CSV file into a pandas data frame.
Data Frame
The reader should have prior knowledge of Python.
The first thing to do in data science is to read the data and modify it according to requirements.
To do that, we use a separate Python library called pandas.
In the pandas library, there is a class called DataFrame to hold and modify data.
The data is available in different formats: CSV files, Excel files, SQL databases, etc.
In this class, we understand how to read data from CSV files into data frames.
Reading data from other formats is simple once the reader understands the concept of the data frame.
A data frame is a two-dimensional, size-mutable, and heterogeneous tabular data structure.
Mutable means we can modify the data in place. We discussed this in our class on mutable and immutable objects in Python.
Heterogeneous means that different types of data are allowed in the data frame.
The data frame is tabular, so it consists of rows and columns.
The DataFrame class has many methods and attributes. We discuss those in our coming classes.
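To make the three properties concrete, here is a minimal sketch. The column names and values are made up for illustration and are not taken from any file used in this class.

```python
import pandas as pd

# Hypothetical sample data, typed in by hand for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],       # text column
    "length": [3.6, 1.8, 3.333],   # floating-point column
    "wait": [79, 54, 74],          # integer column
})

# Two-dimensional: shape is (number of rows, number of columns).
print(df.shape)

# Heterogeneous: each column keeps its own data type.
print(df.dtypes)

# Size-mutable: a new column can be added to the existing object.
df["wait_hours"] = df["wait"] / 60
print(df.shape)
```

The second call to shape reports one more column than the first, which is what size-mutable means in practice.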
There is a function called read_csv in the pandas library. The function is used to read the data from a CSV file.
The read_csv function has many parameters. The first parameter is the file path.
If we give the path of the CSV file as the first parameter, the function reads the file from that path and converts it to a data frame.
The object returned by read_csv is a data frame object.
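The sketch below checks the returned type without needing a file on disk. It assumes a small in-memory text standing in for the CSV file (read_csv also accepts file-like objects such as io.StringIO); the column names are simplified assumptions.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a CSV file on disk.
csv_text = "\n".join([
    "Sno,Eruption length (mins),Eruption wait (mins)",
    "1,3.600,79",
    "2,1.800,54",
    "3,3.333,74",
])

# The first argument can be a path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))             # the DataFrame class
print(df.columns.tolist())  # column names taken from the first line
```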
CSV File
An example of a CSV file is shown in the image below.
In a CSV file, the data is stored as comma-separated values.
In the above CSV file, the first line holds the column names. A comma separates each column.
The file is named faithful.csv.
The program below shows how to read data from a CSV file and place it in a data frame.
import pandas as pd
df = pd.read_csv('faithful.csv')
# The first line of the file is taken as the column names
# An index is created automatically
print(df)
# Ten rows are displayed: the first five and the last five
Output:
Sno "Eruption length (mins)" Eruption wait (mins)
0 1 3.600 79
1 2 1.800 54
2 3 3.333 74
3 4 2.283 62
4 5 4.533 85
.. ... ... ...
267 268 4.117 81
268 269 2.150 46
269 270 4.417 90
270 271 1.817 46
271 272 4.467 74
[272 rows x 3 columns]
In the program, we defined a variable df. The variable df references a data frame object because the read_csv function returns a data frame object.
Note: The df variable can now use the methods and attributes of the DataFrame class.
The first line of the CSV file is taken as the column names in the data frame.
The output of the program is given above.
A separate index is created, and it starts from zero.
By default, only the first five and the last five rows of a large data frame are displayed.
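If only the beginning or the end of the data is needed, the head and tail methods of the data frame can be used directly. The sketch below assumes 100 generated rows in place of faithful.csv; the column name value is hypothetical.

```python
import io
import pandas as pd

# Hypothetical data: 100 numbered rows generated in memory.
csv_text = "Sno,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 101))
df = pd.read_csv(io.StringIO(csv_text))

print(len(df))     # total number of rows
print(df.head())   # first five rows (five is the default)
print(df.tail(3))  # last three rows
```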
Display all the Data Frame Lines
To display all the rows of the data frame, we need to set a pandas display option. The code is given below.
# to display all the rows
pd.set_option('display.max_rows', None)
df = pd.read_csv('faithful.csv')
print(df)
Output:
Sno "Eruption length (mins)" Eruption wait (mins)
0 1 3.600 79
1 2 1.800 54
2 3 3.333 74
3 4 2.283 62
4 5 4.533 85
5 6 2.883 55
6 7 4.700 88
7 8 3.600 85
8 9 1.950 51
9 10 4.350 85
10 11 1.833 54
11 12 3.917 84
Not all the rows are shown here.
The set_option function changes a pandas option. Setting display.max_rows to None removes the limit on the number of rows displayed, so all the rows are printed.
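set_option changes the setting for the rest of the program. When the full display is needed only once, pandas also provides option_context, which restores the old value afterwards. A minimal sketch, again assuming generated data instead of faithful.csv:

```python
import io
import pandas as pd

# Hypothetical data: 100 numbered rows generated in memory.
csv_text = "Sno,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 101))
df = pd.read_csv(io.StringIO(csv_text))

# Remember the current limit so we can confirm it is restored later.
limit_before = pd.get_option("display.max_rows")

# The row limit is lifted only inside the with block.
with pd.option_context("display.max_rows", None):
    full_text = repr(df)  # the text that print(df) would show
    print(full_text)

# Outside the block the previous limit applies again.
limit_after = pd.get_option("display.max_rows")
```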
Giving Custom Column Names
The names parameter is used to give column names according to the user's requirements. The code is given below.
df = pd.read_csv('faithful.csv', names=['x', 'y', 'z'])
# names takes a list of column names
print(df)
Output:
x y z
0 Sno "Eruption length (mins)" Eruption wait (mins)
1 1 3.600 79
2 2 1.800 54
3 3 3.333 74
4 4 2.283 62
5 5 4.533 85
In names=['x', 'y', 'z'], we pass a list of column names. These names are given as the column names of the data frame.
Because names is given without header, the first line of the CSV file is read as data and shown in the output.
Removing First Line From CSV File
Passing header=0 to the read_csv function tells it that line 0 of the file is the header line, so that line is not read as data. The names given in names replace it.
df = pd.read_csv('faithful.csv', names=['x', 'y', 'z'], header=0)
# header=0 means line 0 is the header, so it is not read as data
print(df)
Output:
x y z
0 1 3.600 79
1 2 1.800 54
2 3 3.333 74
3 4 2.283 62
4 5 4.533 85
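The difference between names alone and names together with header=0 can be checked side by side. The sketch below assumes a three-row in-memory sample with simplified column names in place of faithful.csv.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file.
csv_text = "\n".join([
    "Sno,Eruption length (mins),Eruption wait (mins)",
    "1,3.600,79",
    "2,1.800,54",
    "3,3.333,74",
])

# names only: the original header line becomes the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=["x", "y", "z"])

# names with header=0: the original header line is replaced by the new names.
replaced = pd.read_csv(io.StringIO(csv_text), names=["x", "y", "z"], header=0)

print(kept.shape)       # one extra row
print(replaced.shape)
print(replaced.dtypes)  # numeric columns, since no text row is mixed in
```

When the header line is kept as data, every column contains text mixed with numbers, so pandas reads the columns as text. This is another reason to pass header=0 in this situation.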
Taking Required Columns
Suppose we want to take only a few columns from the CSV file. We have the usecols option.
The usecols option takes a list of column numbers. The numbers start from zero.
In the example given below, we are taking the last two columns.
df = pd.read_csv('faithful.csv', names=['x', 'y', 'z'], header=0, usecols=[1, 2])
# usecols keeps only the required columns
print(df)
Output:
y z
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85
5 2.883 55
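usecols also accepts column names instead of positions, which can be easier to read. A minimal sketch, assuming a small in-memory sample with simplified column names rather than faithful.csv:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file.
csv_text = "\n".join([
    "Sno,Eruption length (mins),Eruption wait (mins)",
    "1,3.600,79",
    "2,1.800,54",
    "3,3.333,74",
])

# Select columns by position (numbering starts from zero).
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2])

# Select the same columns by their names in the header line.
by_name = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Eruption length (mins)", "Eruption wait (mins)"],
)

print(by_name)
print(by_position.equals(by_name))  # both selections give the same data
```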
Using a Column as the Index in the Data Frame
By default, read_csv creates a new index.
Suppose we want to use one of the columns as the index of the data frame. We use the parameter index_col.
The program is given below, taking Sno as the index column.
df = pd.read_csv('faithful.csv', index_col='Sno')
# index_col makes the Sno column the index
print(df)
Output:
"Eruption length (mins)" Eruption wait (mins)
Sno
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
7 4.700 88
8 3.600 85
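Once a column is the index, rows can be looked up by its values with loc. The sketch below assumes a small in-memory sample with simplified column names in place of faithful.csv.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file.
csv_text = "\n".join([
    "Sno,Eruption length (mins),Eruption wait (mins)",
    "1,3.600,79",
    "2,1.800,54",
    "3,3.333,74",
])

df = pd.read_csv(io.StringIO(csv_text), index_col="Sno")

print(df.index.name)        # the index is now named Sno
print(df.columns.tolist())  # Sno is no longer an ordinary column
print(df.loc[2])            # the row whose Sno value is 2
```

Note that loc[2] looks up the label 2 in the Sno index, not the row at position 2.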
Removing Last Rows
Suppose we need to remove the last five rows, i.e., do not read the last five rows into the data frame.
We use the parameter skipfooter. It is not supported by the default C parser engine, so we also pass engine='python'. The example is shown below.
df = pd.read_csv('faithful.csv', index_col='Sno', skipfooter=5, engine='python')
# skipfooter is the number of lines to skip at the bottom of the file
print(df)
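The effect of skipfooter can be verified on a small sample. This sketch assumes five in-memory rows with simplified column names in place of faithful.csv and skips the last two. engine='python' is passed because skipfooter is not supported by the C engine.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file, with five data rows.
csv_text = "\n".join([
    "Sno,Eruption length (mins),Eruption wait (mins)",
    "1,3.600,79",
    "2,1.800,54",
    "3,3.333,74",
    "4,2.283,62",
    "5,4.533,85",
])

# Skip the last two lines of the file while reading.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Sno",
    skipfooter=2,
    engine="python",
)

print(df)       # rows with Sno 1, 2 and 3 remain
print(len(df))
```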