Vishnu Vinay's Blog

Pandas - Handling NaN or null entries

import pandas as pd

import numpy as np

iris = pd.read_csv("archive.ics.uci.edu/ml/machine-learning-dat..")

df = iris.copy()

df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']

print(df)

	sl	sw	pl	pw	flower_type
0	4.9	3.0	1.4	0.2	Iris-setosa
1	4.7	3.2	1.3	0.2	Iris-setosa
2	4.6	NaN	NaN	0.2	Iris-setosa
3	5.0	NaN	NaN	0.2	Iris-setosa
4	5.4	3.9	1.7	0.4	Iris-setosa
...	...	...	...	...	...
144	6.7	3.0	5.2	2.3	Iris-virginica
145	6.3	2.5	5.0	1.9	Iris-virginica
146	6.5	3.0	5.2	2.0	Iris-virginica
147	6.2	3.4	5.4	2.3	Iris-virginica
148	5.9	3.0	5.1	1.8	Iris-virginica

149 rows × 5 columns
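Before choosing a method, it can help to check how many entries are missing in each column. The following is a minimal sketch on a small made-up frame that reuses the column names above; the values and the name `sample` are illustrative assumptions, not data from the iris file.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for this example.
sample = pd.DataFrame({
    'sl': [4.9, 4.7, 4.6, 5.0],
    'sw': [3.0, 3.2, np.nan, np.nan],
    'pl': [1.4, 1.3, np.nan, 1.5],
    'pw': [0.2, 0.2, 0.2, 0.2],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one NaN.
rows_with_nan = sample[sample.isna().any(axis=1)]
print(rows_with_nan)
```

Running the same two calls on `df` shows which columns need attention and how many rows would be affected.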

1) Method 1 - drop the entries themselves

df.dropna(inplace = True)

print(df)

	sl	sw	pl	pw	flower_type
0	4.9	3.0	1.4	0.2	Iris-setosa
1	4.7	3.2	1.3	0.2	Iris-setosa
4	5.4	3.9	1.7	0.4	Iris-setosa
5	4.6	3.4	1.4	0.3	Iris-setosa
6	5.0	3.4	1.5	0.2	Iris-setosa
...	...	...	...	...	...
144	6.7	3.0	5.2	2.3	Iris-virginica
145	6.3	2.5	5.0	1.9	Iris-virginica
146	6.5	3.0	5.2	2.0	Iris-virginica
147	6.2	3.4	5.4	2.3	Iris-virginica
148	5.9	3.0	5.1	1.8	Iris-virginica

147 rows × 5 columns
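`dropna` also accepts parameters that control which rows are removed. The sketch below uses a small made-up frame (the name `sample` and its values are assumptions for illustration) and returns new frames instead of modifying in place.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for this example.
sample = pd.DataFrame({
    'sl': [4.9, 4.7, 4.6, 5.0],
    'sw': [3.0, 3.2, np.nan, np.nan],
    'pl': [1.4, 1.3, np.nan, 1.5],
})

# Default behaviour: drop every row that has at least one NaN.
# Without inplace=True, the original frame is left untouched.
dropped_any = sample.dropna()

# Only consider the 'pl' column when deciding which rows to drop.
dropped_pl = sample.dropna(subset=['pl'])

# Keep only rows that have at least 2 non-NaN values.
dropped_thresh = sample.dropna(thresh=2)

# The surviving rows keep their original index labels, so gaps appear.
# reset_index(drop=True) renumbers them from 0.
renumbered = dropped_pl.reset_index(drop=True)

print(dropped_any)
print(dropped_pl)
print(dropped_thresh)
print(renumbered)
```

The index gaps are the same effect seen in the output above, where the labels jump from 1 to 4 after the rows with NaN are dropped.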

2) Method 2 - replace the NaN with a value

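# Assign the filled column back to df; calling fillna(inplace=True) on df.sw may not update df in recent pandas versions

df['sw'] = df['sw'].fillna(df['sw'].mean())

df['pl'] = df['pl'].fillna(df['pl'].mean())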

print(df)

	sl	sw	pl	pw	flower_type
0	4.9	3.000000	1.400000	0.2	Iris-setosa
1	4.7	3.200000	1.300000	0.2	Iris-setosa
2	5.4	3.038621	3.837241	0.4	Iris-setosa
3	4.6	3.038621	3.837241	0.3	Iris-setosa
4	5.0	3.400000	1.500000	0.2	Iris-setosa
...	...	...	...	...	...
142	6.7	3.000000	5.200000	2.3	Iris-virginica
143	6.3	2.500000	5.000000	1.9	Iris-virginica
144	6.5	3.000000	5.200000	2.0	Iris-virginica
145	6.2	3.400000	5.400000	2.3	Iris-virginica
146	5.9	3.000000	5.100000	1.8	Iris-virginica

147 rows × 5 columns

In this case, we are replacing the NaN values with the column's mean.

Other options include replacing with a fixed value, the most frequent value (the mode), the mean within each flower type, and so on.
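These alternatives can be sketched as follows. The frame `sample`, its values, and the fixed value `0.0` are assumptions chosen for illustration; only the column names come from the example above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for this example.
sample = pd.DataFrame({
    'sw': [3.0, np.nan, 3.4, 2.5, np.nan, 3.0],
    'flower_type': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa',
                    'Iris-virginica', 'Iris-virginica', 'Iris-virginica'],
})

# Option 1: a fixed value (0.0 is an arbitrary choice here).
by_fixed = sample['sw'].fillna(0.0)

# Option 2: the most frequent value. mode() returns a Series because
# there can be ties, so take the first entry.
most_common = sample['sw'].mode()[0]
by_mode = sample['sw'].fillna(most_common)

# Option 3: the mean within each flower type. transform('mean') returns
# a Series aligned with the original rows, holding each row's group mean.
group_mean = sample.groupby('flower_type')['sw'].transform('mean')
by_group = sample['sw'].fillna(group_mean)

print(by_fixed)
print(by_mode)
print(by_group)
```

The per-group mean is often a closer estimate than the overall column mean when the groups differ noticeably, as flower types tend to.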

Note: Method 2 is usually preferred, because when many entries are NaN, deleting every affected row can discard a large share of the data.
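One way to judge the trade-off is to measure how many rows dropping would cost. A minimal sketch, again on a made-up frame (the name `sample` and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for this example.
sample = pd.DataFrame({
    'sl': [4.9, 4.7, 4.6, 5.0, 5.4],
    'sw': [3.0, np.nan, np.nan, 3.6, 3.9],
    'pl': [1.4, 1.3, np.nan, np.nan, 1.7],
})

# Share of rows that would be lost by dropping every row with a NaN.
rows_before = len(sample)
rows_after_drop = len(sample.dropna())
fraction_lost = 1 - rows_after_drop / rows_before
print(f"dropna would remove {fraction_lost:.0%} of the rows")

# Filling keeps every row. Passing the Series of column means fills
# each column with its own mean.
filled = sample.fillna(sample.mean())
print(filled)
```

If the fraction lost is small, dropping may be acceptable; if it is large, filling keeps the rows at the cost of introducing estimated values.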
