Pandas

Pandas is a Python library used mainly for data manipulation. Its core table structure behaves much like an Excel spreadsheet.

1) Importing the pandas library

import pandas as pd

2) Loading the file for data manipulation

iris = pd.read_csv("archive.ics.uci.edu/ml/machine-learning-dat..")

Note: pandas has other reader functions such as read_html(), read_excel() and so on. In every case, we pass either the URL or the path of the file inside the parentheses. If the file is in your current working directory, the file name alone is enough. read_csv() loads the data file and returns an object.

In this case, iris is an object containing the entire iris dataset.

2.1) optional step

print(iris)

If we print the data, we can see that by default the first row is used as the column headings and only the remaining rows are treated as data. This can be changed, for example by passing header=None to read_csv().
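
The header behaviour described above can be sketched as follows. This is only an illustration under assumptions: the CSV text is made-up sample data held in memory (via io.StringIO) instead of the real file, and the column names passed to names= are hypothetical.

```python
import io
import pandas as pd

# Made-up sample data standing in for a real CSV file that has no header row.
csv_text = "5.1,3.5,setosa\n4.9,3.0,setosa\n6.3,3.3,virginica\n"

# Default: the first row is consumed as the column headings.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is data; columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: every row is data and we supply our own headings.
named = pd.read_csv(io.StringIO(csv_text), names=["length", "width", "species"])

print(with_default.shape)   # (2, 3): one row was used up as the heading
print(no_header.shape)      # (3, 3): all three rows kept as data
print(list(named.columns))  # ['length', 'width', 'species']
```

If the file really does start with a heading row, the default is what you want and no extra argument is needed.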

print(type(iris))

You can also check its type using type(); it will report DataFrame as the type.

A DataFrame is a table with labelled rows and columns.

Pandas basic operations:

1) making a copy of the original data frame

df = iris.copy()

Reason: if something goes wrong, we can easily revert, because the original data frame is still there with no changes. It works just like a backup.
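
A minimal sketch of why the copy matters, assuming a tiny hand-made frame in place of the real iris data (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical small frame standing in for the loaded iris data.
iris = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

df = iris.copy()   # independent copy of the data
alias = iris       # NOT a copy: just a second name for the same object

# Change the working copy only.
df.loc[0, "sepal_length"] = 99.0

print(iris.loc[0, "sepal_length"])  # 5.1, the original is untouched
print(df.loc[0, "sepal_length"])    # 99.0
print(alias is iris)                # True
```

Plain assignment (alias = iris) would not give this protection, since both names point to the same data frame.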

2) looking at a few entries to understand the data

df.head()

By default, it will return the first five rows.

If we want more or fewer rows, we pass a number as an argument to head().

E.g. df.head(3), df.head(10)

Similarly, df.tail() returns the last five rows by default and accepts a number in the same way.
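
As a quick illustration of head() and tail(), here is a sketch on a hypothetical 12-row frame (the "value" column is invented for the example):

```python
import pandas as pd

# Hypothetical frame with 12 rows, indexed 0..11.
df = pd.DataFrame({"value": range(12)})

first_five = df.head()    # rows 0..4
first_three = df.head(3)  # rows 0..2
last_five = df.tail()     # rows 7..11
last_two = df.tail(2)     # rows 10..11

print(len(first_five), len(first_three), len(last_five), len(last_two))  # 5 3 5 2
print(list(last_two.index))  # [10, 11]
```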

3) size of the data frame

print(df.shape)

Returns a tuple of the form (number of rows, number of columns).

4) checking which data types are present

print(df.dtypes)

Returns the type of data for each column.
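
The size and data type checks can be tried together on a small example. The frame below is hypothetical; its column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical frame: 4 rows, 3 columns of different kinds.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_count": [3, 3, 3, 3],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 4 3

# dtypes lists one data type per column.
print(df.dtypes)
```

The exact name shown for the text column depends on the pandas version, so it is worth printing dtypes on your own data rather than assuming it.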

5) changing column heading

df.columns = ['name_1', 'name_2', 'name_3', ..., 'name_n']

The number of new names must equal the number of existing columns; otherwise pandas raises a ValueError. If you want to check the existing column names, you can easily do it by writing:

print(df.columns) -> returns an Index object holding the column names (use list(df.columns) for a plain list)
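
A small sketch of renaming columns, including what happens when the number of names does not match. The frame and all the names used here are hypothetical.

```python
import pandas as pd

# Hypothetical frame loaded without headings, so columns are 0, 1, 2.
df = pd.DataFrame([[5.1, 3.5, "setosa"], [4.9, 3.0, "setosa"]])

# Three columns, three names: works.
df.columns = ["sepal_length", "sepal_width", "species"]
print(list(df.columns))  # ['sepal_length', 'sepal_width', 'species']

# Three columns, one name: pandas rejects the assignment.
mismatch_rejected = False
try:
    df.columns = ["only_one_name"]
except ValueError as err:
    mismatch_rejected = True
    print("Rejected:", err)

# To rename just some columns, rename() returns a new frame.
renamed = df.rename(columns={"species": "label"})
print(list(renamed.columns))  # ['sepal_length', 'sepal_width', 'label']
```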

6) getting summary statistics

df.describe()

Returns a table with, for each column, the count of valid (non-null) entries, the mean, the standard deviation, the minimum and maximum values, and the quartiles.

By default, only numeric columns are summarized when the data frame mixes numeric and text columns; the text columns are left out unless you pass include="all".
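
The following sketch shows describe() on a hypothetical frame with one numeric column (containing a missing value) and one text column. The data is invented for the example.

```python
import pandas as pd

# Hypothetical frame: one numeric column with a missing value, one text column.
df = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0, None],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Default: only the numeric column appears in the summary.
numeric_summary = df.describe()
print(list(numeric_summary.columns))                 # ['sepal_length']
print(numeric_summary.loc["count", "sepal_length"])  # 3.0, the null is not counted
print(numeric_summary.loc["mean", "sepal_length"])   # 6.0

# include="all" also summarizes the text column.
full_summary = df.describe(include="all")
print(list(full_summary.columns))  # ['sepal_length', 'species']
```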

7) accessing a particular column

df.column_name

df["column_name"]

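Both forms return only that column, as a Series, together with its index. The bracket form also works when the column name contains spaces or clashes with a DataFrame method name.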
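
To compare the two ways of accessing a column, here is a sketch on a hypothetical frame. The column names are chosen on purpose to show where the dot form stops working.

```python
import pandas as pd

# Hypothetical frame with three differently named columns.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal width": [3.5, 3.0],  # name containing a space
    "count": [1, 2],            # same name as the DataFrame.count method
})

# For a simple name, both forms give the same Series.
by_attribute = df.sepal_length
by_brackets = df["sepal_length"]
print(by_attribute.equals(by_brackets))  # True
print(type(by_brackets).__name__)        # Series

# These two columns are reachable only with brackets.
spaced = df["sepal width"]
counts = df["count"]        # df.count would refer to the method instead
print(callable(df.count))   # True
```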

8) finding null entries

df.isnull()

Returns a table of the same shape as the data frame, containing True (is null) or False (not null) for each entry

8.1) summarized format

df.isnull().sum()

Returns a Series with the count of null entries for each column
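
Both null checks can be seen on a hypothetical frame with a few missing values (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical frame with missing values in both columns.
df = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3],
    "species": ["setosa", None, None],
})

# One True/False per entry, same shape as df.
mask = df.isnull()
print(mask.shape)  # (3, 2)

# Summing counts the True values column by column.
null_counts = df.isnull().sum()
print(null_counts["sepal_length"])  # 1
print(null_counts["species"])       # 2
```

The sum works because True counts as 1 and False as 0.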

9) accessing a particular part of the data frame

df.iloc[row_start_number:row_end_number, column_start_number:column_end_number]

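Uses integer-position slicing to return the specified part. As with ordinary Python slices, the start position is included and the end position is excluded.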
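
A short sketch of iloc slicing on a hypothetical 5 x 4 frame, where each value is row * 10 + column so the selected part is easy to recognize:

```python
import pandas as pd

# Hypothetical 5 x 4 frame; the value at (row r, column c) is r * 10 + c.
df = pd.DataFrame(
    [[r * 10 + c for c in range(4)] for r in range(5)],
    columns=["a", "b", "c", "d"],
)

# Rows at positions 1 and 2, columns at positions 0 and 1 (ends excluded).
part = df.iloc[1:3, 0:2]

print(part.shape)            # (2, 2)
print(part.values.tolist())  # [[10, 11], [20, 21]]
print(list(part.columns))    # ['a', 'b']
```

Leaving out a bound takes everything on that side, so df.iloc[:, 0:2] returns all rows of the first two columns.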