Pandas - Handling NaN or null entries

import pandas as pd

import numpy as np

iris = pd.read_csv("archive.ics.uci.edu/ml/machine-learning-dat..")

df = iris.copy()

df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']

print(df)

      sl   sw   pl   pw     flower_type
0    4.9  3.0  1.4  0.2     Iris-setosa
1    4.7  3.2  1.3  0.2     Iris-setosa
2    4.6  NaN  NaN  0.2     Iris-setosa
3    5.0  NaN  NaN  0.2     Iris-setosa
4    5.4  3.9  1.7  0.4     Iris-setosa
..   ...  ...  ...  ...             ...
144  6.7  3.0  5.2  2.3  Iris-virginica
145  6.3  2.5  5.0  1.9  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
148  5.9  3.0  5.1  1.8  Iris-virginica

149 rows × 5 columns

1) Method 1 - drop the entries themselves

df.dropna(inplace = True)

print(df)

      sl   sw   pl   pw     flower_type
0    4.9  3.0  1.4  0.2     Iris-setosa
1    4.7  3.2  1.3  0.2     Iris-setosa
4    5.4  3.9  1.7  0.4     Iris-setosa
5    4.6  3.4  1.4  0.3     Iris-setosa
6    5.0  3.4  1.5  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
144  6.7  3.0  5.2  2.3  Iris-virginica
145  6.3  2.5  5.0  1.9  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
148  5.9  3.0  5.1  1.8  Iris-virginica

147 rows × 5 columns
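
dropna() with no arguments removes every row that contains at least one NaN. The sketch below shows a few narrower variants using the `subset`, `how` and `thresh` parameters. The small frame is a made-up stand-in for the iris data (its values are assumptions for illustration, not the real dataset), and it returns new frames instead of using inplace so the original stays available.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the iris frame (illustrative values, not the real data)
df = pd.DataFrame({
    "sl": [4.9, 4.7, 4.6, 5.0],
    "sw": [3.0, 3.2, np.nan, np.nan],
    "pl": [1.4, np.nan, np.nan, 1.6],
    "pw": [0.2, 0.2, 0.2, 0.2],
})

# Default behaviour: drop any row that has at least one NaN
dropped_any = df.dropna()

# Only consider the 'sw' column when deciding which rows to drop
dropped_sw = df.dropna(subset=["sw"])

# Drop a row only when both 'sw' and 'pl' are NaN
dropped_all = df.dropna(subset=["sw", "pl"], how="all")

# Keep rows that have at least 3 non-NaN values
dropped_thresh = df.dropna(thresh=3)

print(len(dropped_any), len(dropped_sw), len(dropped_all), len(dropped_thresh))
```

Restricting the check to the columns that matter for the analysis usually keeps more rows than the default call.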

2) Method 2 - replace the NaN with a value

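# Assign the result back to the column: fillna(inplace=True) on a selected
# column may not update df in recent pandas versions
df['sw'] = df['sw'].fillna(df['sw'].mean())

df['pl'] = df['pl'].fillna(df['pl'].mean())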

print(df)

      sl        sw        pl   pw     flower_type
0    4.9  3.000000  1.400000  0.2     Iris-setosa
1    4.7  3.200000  1.300000  0.2     Iris-setosa
2    5.4  3.038621  3.837241  0.4     Iris-setosa
3    4.6  3.038621  3.837241  0.3     Iris-setosa
4    5.0  3.400000  1.500000  0.2     Iris-setosa
..   ...       ...       ...  ...             ...
142  6.7  3.000000  5.200000  2.3  Iris-virginica
143  6.3  2.500000  5.000000  1.9  Iris-virginica
144  6.5  3.000000  5.200000  2.0  Iris-virginica
145  6.2  3.400000  5.400000  2.3  Iris-virginica
146  5.9  3.000000  5.100000  1.8  Iris-virginica

147 rows × 5 columns

In this case, we are replacing NaN values with the column's mean.

Other options include filling with a fixed value, the most frequent value (the mode), or the mean of the same flower type.
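
Those alternatives can be sketched roughly as follows. The frame here is a small made-up example (the values are assumptions, not the iris data), and the fixed fill value of 0.0 is an arbitrary placeholder. The per-type mean uses groupby with transform, which is one common way to do it rather than the only one.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: one NaN per flower type (values are made up)
df = pd.DataFrame({
    "sw": [3.4, np.nan, 3.4, 2.5, np.nan, 3.5],
    "flower_type": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
                    "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

# Option A: a fixed value (0.0 is an arbitrary placeholder)
fixed = df["sw"].fillna(0.0)

# Option B: the most frequent value; mode() returns a Series, so take the first entry
most_common = df["sw"].fillna(df["sw"].mode()[0])

# Option C: the mean of the same flower type; transform("mean") returns a
# Series aligned with df, holding each row's group mean
type_mean = df.groupby("flower_type")["sw"].transform("mean")
by_type = df["sw"].fillna(type_mean)

print(pd.DataFrame({"fixed": fixed, "mode": most_common, "by_type": by_type}))
```

Filling by flower type tends to distort the data less than a single global mean when the types differ noticeably in the column being filled.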

Note: Method 2 is usually preferred, because in most cases the number of NaN entries is high and deleting every affected row would discard a large share of the data.
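
Before choosing between the two methods, it can help to measure how much data dropping would actually cost. The following is a minimal sketch on a made-up frame (the values are assumptions for illustration): it counts NaN entries per column and the fraction of rows that dropna() would remove.

```python
import numpy as np
import pandas as pd

# Illustrative frame (not the real iris data)
df = pd.DataFrame({
    "sl": [4.9, 4.7, 4.6, 5.0],
    "sw": [3.0, 3.2, np.nan, np.nan],
    "pl": [1.4, 1.3, np.nan, 1.6],
})

# Number of NaN entries in each column
nan_per_column = df.isna().sum()

# Boolean Series: True for rows that contain at least one NaN
rows_with_nan = df.isna().any(axis=1)

# Fraction of rows a plain dropna() would remove
fraction_lost = rows_with_nan.mean()

print(nan_per_column)
print(f"dropna() would remove {fraction_lost:.0%} of the rows")
```

If the fraction is small, dropping may be acceptable; if it is large, filling is usually the safer choice.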