Pandas - Handling NaN or null entries
import pandas as pd
import numpy as np
iris = pd.read_csv("archive.ics.uci.edu/ml/machine-learning-dat..")
df = iris.copy()
df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']
print(df)
 | sl | sw | pl | pw | flower_type |
0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
2 | 4.6 | NaN | NaN | 0.2 | Iris-setosa |
3 | 5.0 | NaN | NaN | 0.2 | Iris-setosa |
4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
... | ... | ... | ... | ... | ... |
144 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
145 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
146 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
147 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
148 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
149 rows × 5 columns
1) Method 1 - drop the entries themselves
df.dropna(inplace = True)
print(df)
 | sl | sw | pl | pw | flower_type |
0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
144 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
145 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
146 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
147 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
148 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
147 rows × 5 columns
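dropna() also has a few variants that are less drastic than dropping every row with any NaN. The sketch below is only an illustration: `toy` is a hypothetical five-row frame that borrows the article's column names, not the real iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the iris frame, with NaN in 'sw' and 'pl'.
toy = pd.DataFrame({
    "sl": [4.9, 4.7, 4.6, 5.0, 5.4],
    "sw": [3.0, 3.2, np.nan, np.nan, 3.9],
    "pl": [1.4, 1.3, np.nan, 1.5, 1.7],
    "pw": [0.2, 0.2, 0.2, 0.2, 0.4],
})

# Drop rows with a NaN in any column.
# Without inplace=True a new frame is returned and toy is left unchanged.
any_nan_dropped = toy.dropna()

# Drop rows only when 'pl' is missing; a NaN in 'sw' is tolerated.
pl_required = toy.dropna(subset=["pl"])

# Keep rows that have at least 3 non-NaN values.
mostly_complete = toy.dropna(thresh=3)

print(len(any_nan_dropped), len(pl_required), len(mostly_complete))
```

Returning a new frame instead of using inplace = True keeps the original around, which is handy when you want to try Method 2 on the same data afterwards.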
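2) Method 2 - replace the NaN with a value
# Apply this to a frame that still contains the NaN rows, not to the output of Method 1.
# Assigning the result back is safer than fillna(inplace=True) on df.sw, which may modify a temporary copy.
df['sw'] = df['sw'].fillna(df['sw'].mean())
df['pl'] = df['pl'].fillna(df['pl'].mean())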
print(df)
 | sl | sw | pl | pw | flower_type |
0 | 4.9 | 3.000000 | 1.400000 | 0.2 | Iris-setosa |
1 | 4.7 | 3.200000 | 1.300000 | 0.2 | Iris-setosa |
2 | 5.4 | 3.038621 | 3.837241 | 0.4 | Iris-setosa |
3 | 4.6 | 3.038621 | 3.837241 | 0.3 | Iris-setosa |
4 | 5.0 | 3.400000 | 1.500000 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
142 | 6.7 | 3.000000 | 5.200000 | 2.3 | Iris-virginica |
143 | 6.3 | 2.500000 | 5.000000 | 1.9 | Iris-virginica |
144 | 6.5 | 3.000000 | 5.200000 | 2.0 | Iris-virginica |
145 | 6.2 | 3.400000 | 5.400000 | 2.3 | Iris-virginica |
146 | 5.9 | 3.000000 | 5.100000 | 1.8 | Iris-virginica |
147 rows × 5 columns
Here, the NaN values are replaced with the column's mean.
Other options include a fixed value, the most frequent value (mode), or the mean of the same flower type.
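A minimal sketch of those three alternatives, assuming a small hypothetical frame `toy` with the article's `sw` and `flower_type` columns (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names come from the article.
toy = pd.DataFrame({
    "sw": [3.0, np.nan, 3.4, 2.5, np.nan, 3.0],
    "flower_type": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
})

# 1) Fixed value (0.0 is an arbitrary choice for this sketch).
fixed = toy["sw"].fillna(0.0)

# 2) Most frequent value. mode() can return several values, so take the first.
most_common = toy["sw"].fillna(toy["sw"].mode().iloc[0])

# 3) Mean of the same flower type: transform("mean") gives each row
#    its group's mean, aligned to the original index.
group_mean = toy.groupby("flower_type")["sw"].transform("mean")
by_type = toy["sw"].fillna(group_mean)

print(pd.DataFrame({"fixed": fixed, "mode": most_common, "by_type": by_type}))
```

The per-type mean is often the closest guess for a dataset like iris, because measurements differ noticeably between flower types.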
Note: Method 2 is usually preferred. When many rows contain NaN, dropping them can discard a large share of the data.
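One way to judge this before choosing a method is to measure how many rows would actually be dropped. The sketch below assumes a hypothetical frame `toy` and a hypothetical cut-off `MAX_LOSS`; neither comes from the article, and the right threshold depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the article's column names.
toy = pd.DataFrame({
    "sl": [4.9, 4.7, 4.6, 5.0, 5.4],
    "sw": [3.0, 3.2, np.nan, np.nan, 3.9],
    "pl": [1.4, 1.3, np.nan, 1.5, 1.7],
})

# Number of NaN entries in each column.
nan_per_column = toy.isna().sum()

# True for every row that dropna() would remove.
rows_with_nan = toy.isna().any(axis=1)

# Share of rows that would be lost (the mean of a boolean Series).
fraction_lost = rows_with_nan.mean()

# Hypothetical cut-off, an assumption for this sketch only.
MAX_LOSS = 0.05
strategy = "drop" if fraction_lost <= MAX_LOSS else "fill"

print(nan_per_column)
print(f"rows lost by dropna: {fraction_lost:.0%}, suggested strategy: {strategy}")
```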