Data Frame Methods

In this class, we discuss data frame methods.

The reader should have prior knowledge of data frame attributes.

In our previous class, we discussed a few methods as the need arose.

Here we discuss a few more methods that are commonly used in data science.

The remaining methods are covered in later classes as the need arises.

agg Method

Let us take an example to understand the agg method.

import pandas as pd
df = pd.DataFrame([[1, 2, 3],[4, 5, 6],[7, 8, 9]],columns=['A', 'B', 'C'])
print(df)

Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

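# agg method
# DataFrame.agg(func=None, axis=0, *args, **kwargs)
# Function names are passed as strings; recent pandas versions warn when the builtins sum and min are passed directly.
print(df.agg(['sum', 'min']))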

Output:
      A   B   C
sum  12  15  18
min   1   2   3

The agg method aggregates the data using one or more operations over the specified axis.

The func parameter accepts a single function, a function name given as a string, a list of these, or a dictionary that maps column names to functions.

We can pass user-defined functions or built-in Python functions.

In our example, we passed the functions sum and min.

The agg method takes the data from each column and passes it as input to each function given in the func parameter.

The value returned by each function becomes one row of the output.

The output of the above example gives the sum of the elements in each column and the minimum element of each column.
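
The point about user-defined functions can be illustrated with a small sketch. The helper value_range below is a hypothetical name chosen for this illustration and is not part of pandas. It assumes that a single callable is passed to agg, in which case each column arrives in the function as a Series.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])

# value_range is a hypothetical helper, not a pandas function.
# It receives one column (a Series) at a time and returns a single value.
def value_range(column):
    return column.max() - column.min()

# Passing a single callable applies it to every column (axis=0 by default).
result = df.agg(value_range)
print(result)
```

Each column spans from its smallest to its largest value, so the result holds one number per column.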

We can mention different functions for different columns.

The example is given below.

# using agg on specified columns
print(df.agg({'A' : ['sum', 'min'], 'B' : ['mean', 'max']}))

Output:
         A    B
max    NaN  8.0
mean   NaN  5.0
min    1.0  NaN
sum   12.0  NaN

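In column A we applied the sum and min functions.

Column B is given the mean and max functions. A NaN in the output marks a function that was not applied to that column.

The axis parameter takes the values 0 and 1. With axis=0 the function is applied to each column. With axis=1 it is applied to each row.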

The examples are shown below.

# axis=0 applies the function to each column
print(df.agg('mean',axis=0))

Output:
A    4.0
B    5.0
C    6.0
dtype: float64

# axis=1 applies the function to each row
print(df.agg('mean',axis=1))

Output:
0    2.0
1    5.0
2    8.0
dtype: float64

all, any Methods

Let us take an example to understand the use of the all and any methods.

The example program is shown below.

# all and any methods
import pandas as pd
df = pd.DataFrame([[True, True],[True, True],[False, True]],columns=['A', 'B'])
print(df)

Output:
       A     B
0   True  True
1   True  True
2  False  True

print(df.all())

Output:
A    False
B     True
dtype: bool

print(df.any())

Output:
A    True
B    True
dtype: bool

print(df.all(axis=1))

Output:
0     True
1     True
2    False
dtype: bool

The all and any methods are typically applied to columns and rows that hold boolean values. Other values are judged by their truthiness, so zero counts as False.

In our example, we have taken two columns of type boolean.

The all method returns True if all the values are true.

In the same way, the any method returns True if at least one of the values is true.

We can apply them to columns or rows.

If axis is 0, they check each column.

With axis=1 they check each row. The default value is 0.
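
In practice the boolean values often come from a comparison rather than from data typed in by hand. The following is a minimal sketch under that assumption. The threshold 4 and the variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])

# A comparison produces a boolean data frame of the same shape.
mask = df > 4

cols_all = mask.all()        # per column: is every value greater than 4?
cols_any = mask.any()        # per column: is at least one value greater than 4?
rows_all = mask.all(axis=1)  # per row: is every value greater than 4?

print(cols_all)
print(cols_any)
print(rows_all)
```

No column is entirely above 4, every column has at least one value above 4, and only the last row is entirely above 4.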

append, apply, and drop methods

These methods were discussed in our previous classes.

drop_duplicates method

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

The list of parameters in the method is shown above.

Let us take examples to understand the parameters.

import pandas as pd
df = pd.DataFrame([[1, 2, 3],[4, 5, 6],[7, 8, 9],[7,8,9]],columns=['A', 'B', 'C'])
print(df)

Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
3  7  8  9

df.drop_duplicates(inplace=True)
print(df)

Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

import pandas as pd
df = pd.DataFrame([[1, 2, 3],[4, 5, 6],[7, 7, 9],[7,8,9]],columns=['A', 'B', 'C'])
print(df)
print("----------------")

print(df.drop_duplicates(subset=['A','B']))
print("--------------------")
print(df.drop_duplicates(subset=['A','C']))

Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  7  9
3  7  8  9
----------------
   A  B  C
0  1  2  3
1  4  5  6
2  7  7  9
3  7  8  9
--------------------
   A  B  C
0  1  2  3
1  4  5  6
2  7  7  9

The drop_duplicates method removes duplicate rows from the data.

In the first example above, the third and fourth rows (index 2 and 3) are duplicates.

The drop_duplicates method keeps the first occurrence and removes the later duplicate row.

The inplace=True parameter modifies the data frame in place. It does not create a new object.
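
The difference between the default behaviour and inplace=True can be sketched as follows. The small data frame and the variable names are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [7, 8, 9], [7, 8, 9]], columns=['A', 'B', 'C'])

# Default (inplace=False): a new data frame is returned, df is left untouched.
deduped = df.drop_duplicates()
rows_before = len(df)

# inplace=True: df itself is modified and the call returns None.
returned = df.drop_duplicates(inplace=True)
rows_after = len(df)

print(len(deduped), rows_before, returned, rows_after)
```

Because the in-place call returns None, its result should not be assigned back to df.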

The subset parameter takes a list of column names.

Rows are treated as duplicates when they match on the columns listed in the subset parameter.

In the second example, no two rows match on columns A and B, so nothing is removed.

On columns A and C, the rows at index 2 and 3 match, so the row at index 3 is removed.

The parameter keep='last' keeps the last occurrence and deletes the earlier duplicate rows.

The example is shown below.

import pandas as pd
df = pd.DataFrame([[1, 2, 3],[4, 5, 6],[7, 7, 9],[7,8,9]],columns=['A', 'B', 'C'])
print(df)
print("----------------")

print(df.drop_duplicates(subset=['A','C'],keep='last'))

Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  7  9
3  7  8  9
----------------
   A  B  C
0  1  2  3
1  4  5  6
3  7  8  9

The parameter ignore_index=True renumbers the index from 0 after the duplicates are deleted.

This option is available from version 1.0 of pandas onward.
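
A short sketch of the effect of ignore_index, reusing the data frame from the keep='last' example. It assumes pandas 1.0 or later, and the variable names are chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 7, 9], [7, 8, 9]], columns=['A', 'B', 'C'])

# Without ignore_index the surviving rows keep their original labels.
kept = df.drop_duplicates(subset=['A', 'C'], keep='last')

# With ignore_index=True the surviving rows are renumbered from 0.
relabeled = df.drop_duplicates(subset=['A', 'C'], keep='last', ignore_index=True)

print(list(kept.index))
print(list(relabeled.index))
```

The same rows survive in both cases. Only the index labels differ.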

value_counts Method

# value_counts method gives the count of each distinct element
import pandas as pd
df = pd.DataFrame([[1, 2, 3],[4, 5, 6],[7, 7, 9],[7,8,9]],columns=['A', 'B', 'C'])
print(df)
print("----------------")

print(df['A'].value_counts())

Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  7  9
3  7  8  9
----------------
7    2
1    1
4    1
Name: A, dtype: int64

This method is applied to pandas Series objects. The Series class was discussed in our previous lectures.

When we take a single column from the data frame, it is a Series object.

The value_counts method counts how many times each distinct element occurs.

In our example, we used column A. The distinct elements of this column are taken.

The count of each distinct element is given as output, sorted from the most frequent to the least frequent.

The value_counts method can also be applied directly to a data frame from version 1.1 of pandas onward.
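
When value_counts is called on a whole data frame, it counts distinct rows rather than distinct elements. The following is a minimal sketch, assuming a pandas version in which DataFrame.value_counts is available. The data frame is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [7, 8, 9]], columns=['A', 'B', 'C'])

# Each distinct combination of (A, B, C) becomes one entry of the result.
counts = df.value_counts()
print(counts)

# The row (7, 8, 9) occurs twice in df.
print(counts.loc[(7, 8, 9)])
```

The result is a Series whose index is built from the column values, with the most frequent row first.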

head and tail Methods

The head method returns the top rows of the data frame.

head(10) returns the top ten rows of the data frame.

The default value is 5.

In the same way, the tail method returns the last rows.

The example programs are given below.

# head and tail method
import pandas as pd
df = pd.DataFrame([[1, 2, 3],[4, 5, 6],[7, 7, 9],[7,8,9]],columns=['A', 'B', 'C'])
print(df)
print("----------------")

print(df.head(2))

Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  7  9
3  7  8  9
----------------
   A  B  C
0  1  2  3
1  4  5  6

print(df.tail(2))

Output:
   A  B  C
2  7  7  9
3  7  8  9
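
The example data frame has only four rows, so it cannot show the default of five. The sketch below uses a made-up ten-row data frame to show what head returns when no argument is given.

```python
import pandas as pd

# A hypothetical ten-row data frame, used only to show the default.
df = pd.DataFrame({'A': range(10)})

first_five = df.head()    # no argument: the first 5 rows
last_three = df.tail(3)   # the last 3 rows

print(first_five)
print(last_three)
```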