Pandas
pandas is a Python library used mainly for data manipulation. Its tables behave much like a sheet in an Excel file.
1) Importing the pandas library
import pandas as pd
2) Loading the file for data manipulation
iris = pd.read_csv("archive.ics.uci.edu/ml/machine-learning-dat..")
Note: pandas has other reader functions such as read_html(), read_excel() and so on. In each case, the argument inside the parentheses is either the URL or the path to the file. If the file is in your current working directory, you can pass just the file name. The function loads the data file and returns an object.
In this case, iris is an object containing the whole iris dataset.
2.1) optional step
print(iris)
If we print the data, we can see that by default the first row is taken as the heading and only the remaining rows are treated as data. This can be changed with the header argument of read_csv(), for example header=None when the file has no heading row.
print(type(iris))
You can also check its type with type(), which shows that iris is a pandas DataFrame.
A DataFrame is a table with labelled rows and columns.
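As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of the real iris file. The sample rows and the column names are made up for illustration, and io.StringIO simply stands in for a file path or URL.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real file (values are made up).
csv_text = "5.1,3.5,setosa\n6.2,2.9,versicolor\n7.7,3.0,virginica\n"

# Default behaviour: the first row is used as the column headings.
with_default_header = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names= supplies the headings instead.
no_header = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["sepal_length", "sepal_width", "species"],
)

print(with_default_header.shape)  # one row was consumed as the heading
print(no_header.shape)            # all three rows are kept as data
print(type(no_header))
```

Comparing the two shapes makes it easy to spot when a heading row has swallowed a row of real data.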
Pandas basic operations:
1) making a copy of the original data frame
df = iris.copy()
Reason: if a later step goes wrong, we can easily go back, because the original data frame is left unchanged. It works just like a backup.
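A small sketch of why the copy acts as a backup. The two-row data frame here is a hypothetical stand-in for the loaded iris data.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data frame.
iris = pd.DataFrame(
    {"sepal_length": [5.1, 6.2], "species": ["setosa", "versicolor"]}
)

df = iris.copy()                    # independent copy (deep by default)
df.loc[0, "sepal_length"] = 0.0     # change the working copy only

print(iris.loc[0, "sepal_length"])  # the original value is untouched
print(df.loc[0, "sepal_length"])    # the copy holds the new value
```

Note that a plain assignment such as df = iris would not create a copy; both names would refer to the same data frame.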
2) looking at a few entries to get a feel for the data
df.head()
By default, it will return the first five rows.
If we want more or fewer rows, we pass a number as an argument to head().
E.g. df.head(3), df.head(10)
Similarly, df.tail() returns the last five rows by default and accepts a number in the same way.
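To see head() and tail() side by side, here is a short sketch on a hypothetical ten-row data frame (the column name value is made up).

```python
import pandas as pd

# Hypothetical data frame with ten rows, holding the values 0..9.
df = pd.DataFrame({"value": range(10)})

first_five = df.head()     # default: first 5 rows
first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```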
3) size of the data frame
print(df.shape)
Returns a tuple of the form (number of rows, number of columns).
4) checking which data types are present
print(df.dtypes)
Returns the type of data for each column.
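The following sketch shows shape and dtypes together on a small hypothetical data frame. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical data frame mixing floats, integers and text.
df = pd.DataFrame({
    "sepal_length": [5.1, 6.2, 7.7],
    "count": [1, 2, 3],
    "species": ["setosa", "versicolor", "virginica"],
})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(df.dtypes)            # one dtype per column
```

The exact dtype shown for the text column can differ between pandas versions, so it is worth checking the output on your own installation.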
5) changing column heading
df.columns = ['name_1', 'name_2', 'name_3', ..., 'name_n']
The number of new column names must equal the number of existing columns, otherwise pandas raises an error. If you want to check the existing column names, you can easily do it by writing:
print(df.columns) -> shows an Index object holding the column names; use list(df.columns) to get a plain list
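Here is a hedged sketch of renaming columns, including what happens when the number of names does not match. The data and the new names are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with three unnamed columns (labelled 0, 1, 2).
df = pd.DataFrame([[5.1, 3.5, "setosa"], [6.2, 2.9, "versicolor"]])

# One new name per existing column (df has 3 columns).
df.columns = ["sepal_length", "sepal_width", "species"]
print(list(df.columns))

# A list of the wrong length is rejected with a ValueError.
try:
    df.columns = ["only_one_name"]
except ValueError as err:
    print("ValueError:", err)
```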
6) getting summary statistics
df.describe()
Returns a table with one column per data column, containing the count of valid entries, mean, standard deviation, minimum and maximum values, quartiles etc.
By default, only numeric columns are summarized and text columns are skipped. Passing include='all' also summarizes the text columns.
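A minimal sketch of describe() on a hypothetical data frame with one numeric and one text column, showing the default behaviour and the include="all" variant.

```python
import pandas as pd

# Hypothetical data frame with one numeric and one text column.
df = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0],
    "species": ["setosa", "versicolor", "virginica"],
})

summary = df.describe()             # numeric columns only by default
print(summary)

full = df.describe(include="all")   # also summarizes the text column
print(list(full.columns))
```

Individual statistics can be read out of the result like any other table, for example summary.loc["mean", "sepal_length"].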
7) accessing a particular column
df.column_name
or
df["column_name"]
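Returns only that column, as a Series, together with its index. The dot form works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute; the bracket form always works.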
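The two access styles can be compared with a short sketch. The column names below are hypothetical, and one of them deliberately contains a space.

```python
import pandas as pd

# Hypothetical data frame; the second column name contains a space.
df = pd.DataFrame({
    "species": ["setosa", "versicolor"],
    "sepal length": [5.1, 6.2],
})

by_attribute = df.species      # works: the name is a valid Python identifier
by_brackets = df["species"]    # bracket access works for any column name
spaced = df["sepal length"]    # dot access is not possible for this name

print(type(by_brackets))
print(by_attribute.equals(by_brackets))
```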
8) finding null entries
df.isnull()
Returns a table of the same shape containing True (is null) or False (not null) for every entry.
8.1) summarized format
df.isnull().sum()
Returns a Series with the count of null entries for each column.
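As an illustration, the sketch below builds a hypothetical data frame with a few missing values and counts them per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with missing values in both columns.
df = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 7.7],
    "species": ["setosa", None, None],
})

mask = df.isnull()         # same shape as df, True where a value is missing
null_counts = mask.sum()   # True counts as 1, so this counts nulls per column

print(mask)
print(null_counts)
```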
9) accessing a particular part of the data frame
df.iloc[row_start_number:row_end_number, column_start_number:column_end_number]
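Uses integer position-based slicing to return the specified part. As with ordinary Python slices, positions start at 0, the start position is included and the end position is excluded.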
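A short sketch of iloc slicing on a hypothetical four-row, three-column data frame.

```python
import pandas as pd

# Hypothetical data frame with four rows and three columns.
df = pd.DataFrame({
    "a": [0, 1, 2, 3],
    "b": [10, 11, 12, 13],
    "c": [20, 21, 22, 23],
})

# Rows at positions 1 and 2, columns at positions 0 and 1.
part = df.iloc[1:3, 0:2]
print(part)
```

Either slice can be left open, so df.iloc[:, 0:2] keeps all rows and df.iloc[1:3, :] keeps all columns.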