Pandas in a Nutshell (Part-1)
Welcome to another Python tutorial. In this post, we cover:
- What is Pandas
- Installation of Pandas
- What is DataFrame?
- How to make DataFrames?
- Operations and Manipulations on DataFrame
This topic lays the foundation for data analysis with Python.
What is Pandas?
Pandas is a Python library built on top of NumPy. It is one of the primary pillars of data analytics in Python and provides high-level data structures.
The main data structures provided by Pandas are Series (one-dimensional) and DataFrame (two-dimensional). Older releases also included Panel, which has since been removed from the library. A NumPy array is homogeneous: every element shares one data type. A DataFrame overcomes this limitation by being heterogeneous: each column can hold a different data type.
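The difference between homogeneous and heterogeneous storage can be seen in a minimal sketch. The names and ages below are made-up sample values, not data from this tutorial.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed values are coerced
# to one common type (here, strings). Sample values are hypothetical.
mixed_array = np.array([["Asha", 31], ["Ravi", 45]])
print(mixed_array.dtype)

# A DataFrame keeps a separate dtype for each column,
# so the "age" column stays numeric.
mixed_frame = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [31, 45]})
print(mixed_frame.dtypes)
```

In the array, the ages become text; in the DataFrame, they remain integers that can be used in arithmetic.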
Installation of Pandas
Pandas is easy to install with the pip command, assuming you already have pip installed on your system.
Run pip install pandas to install it on your local machine.
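To confirm that the installation worked, a quick check is to import the library and print its version. The version number you see depends on what pip installed.

```python
import pandas as pd

# If this import succeeds, Pandas is installed.
# The printed version string varies from machine to machine.
print(pd.__version__)
```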
What is DataFrame?
A DataFrame is:
- A two-dimensional data structure whose columns can have the same or different types.
- Made up of three parts:
  - Data
  - Index
  - Columns
- Created from inputs such as:
  - Another DataFrame
  - NumPy arrays
  - Pandas Series
  - A CSV file
  - Dictionaries and lists
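As a rough illustration of these input types, the sketch below builds the same small table from a dictionary, from a list of lists, and from Pandas Series. The column names and numbers are placeholders chosen for the example.

```python
import pandas as pd

# From a dictionary: keys become column names.
from_dict = pd.DataFrame({"Col1": [1, 2], "Col2": [3, 4]})

# From a list of lists: each inner list is one row.
from_lists = pd.DataFrame([[1, 3], [2, 4]], columns=["Col1", "Col2"])

# From Pandas Series: each Series becomes one column.
from_series = pd.DataFrame({
    "Col1": pd.Series([1, 2]),
    "Col2": pd.Series([3, 4]),
})

print(from_dict)
print(from_lists)
print(from_series)
```

All three DataFrames hold the same values; only the way the input is supplied differs.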
How to Make DataFrame?
A DataFrame can be created from many kinds of input. We will start with a NumPy array and then look at other inputs.
import numpy as np
import pandas as pd
array_1 = np.array([np.arange(10, 15), np.arange(15, 20)])
dataframe_1 = pd.DataFrame(array_1)
print("Value of Array_1: \n", array_1)
print("\nValue of DataFrame_1: \n", dataframe_1)
Execute the program.
You will see the following output on your screen.
Value of Array_1:
[[10 11 12 13 14]
[15 16 17 18 19]]
Value of DataFrame_1:
    0   1   2   3   4
0  10  11  12  13  14
1  15  16  17  18  19
When we create a DataFrame from a NumPy array, Pandas automatically labels the rows and columns. The first row and the first column are labeled 0, and the labels increase by 1 until the end of the rows or columns.
A Pandas DataFrame closely resembles a Microsoft Excel spreadsheet, where rows and columns are labeled and each cell holds a value.
So far, Pandas has generated the index for us automatically. In practice, the default labels are rarely what we want. Instead, we usually define our own row and column labels.
Let’s do that.
import numpy as np
import pandas as pd
index = ['Row1', 'Row2', 'Row3']
columns = ['Col1', 'Col2', 'Col3', 'Col4']
array_1 = np.array([
    np.arange(11, 15),
    np.arange(15, 19),
    np.arange(19, 23)])
print("Value of Array_1: \n", array_1)
dataframe_1 = pd.DataFrame(data=array_1, index=index, columns=columns)
print("\nValue of DataFrame_1 is: \n", dataframe_1)
Here is the output when you execute the above program.
Value of Array_1:
[[11 12 13 14]
[15 16 17 18]
[19 20 21 22]]
Value of DataFrame_1 is:
      Col1  Col2  Col3  Col4
Row1    11    12    13    14
Row2    15    16    17    18
Row3    19    20    21    22
In the output, you can see the row and column names in the DataFrame.
Pay attention to the line pd.DataFrame(data=array_1, index=index, columns=columns).
In that line:
data -- contains the values of array_1.
index -- holds the row names.
columns -- holds the column names.
index and columns are optional parameters of DataFrame. If they are not provided, the labels start from 0.
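A short sketch of that default behaviour: when only one of the two parameters is supplied, Pandas numbers the other axis from 0. The array values and labels here are placeholders.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(2, 3)  # 2 rows, 3 columns

# Only columns given: the row labels default to 0, 1.
only_columns = pd.DataFrame(values, columns=["A", "B", "C"])
print(only_columns.index.tolist())   # [0, 1]

# Only index given: the column labels default to 0, 1, 2.
only_index = pd.DataFrame(values, index=["Row1", "Row2"])
print(only_index.columns.tolist())   # [0, 1, 2]
```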
Now that you have created the DataFrame, you may want to find its number of rows and columns.
In the above program, add print(dataframe_1.shape) to get the (rows, columns) pair, and print(len(dataframe_1.index)) to get the number of rows.
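The sketch below rebuilds the same 3 x 4 table in a compact way (using reshape, which is a shortcut not used in the program above) and shows what each call returns.

```python
import numpy as np
import pandas as pd

# Same values as the example above: 11..22 arranged as 3 rows x 4 columns.
frame = pd.DataFrame(
    np.arange(11, 23).reshape(3, 4),
    index=["Row1", "Row2", "Row3"],
    columns=["Col1", "Col2", "Col3", "Col4"],
)

print(frame.shape)        # (3, 4) -> (rows, columns)
print(len(frame.index))   # 3      -> number of rows
print(frame.size)         # 12     -> total number of elements
```

Note that len(frame.index) counts rows only; the size attribute gives the total number of cells.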
In the previous examples, we created DataFrames from NumPy arrays. In the real world, however, we rarely create a DataFrame from NumPy arrays directly. CSV or Excel files usually serve as the input.
Our next task is to create a DataFrame from a CSV file. We already have a sample CSV file uploaded to GitHub. You can download it from here.
Make sure you save the file in the same directory as your Python script.
The Customer.csv file contains the following customer information:
- Customer Name
- Customer id
- Date of birth
- Gender
- City
import pandas as pd
customer_csv_file = 'Customer.csv'
customer_dataframe = pd.read_csv(customer_csv_file, delimiter=';')
print(customer_dataframe)
We could also read the CSV with Python's built-in csv module, but we use Pandas' read_csv because it converts the CSV directly into a DataFrame. The first line imports Pandas, and pd.read_csv is its CSV reader. The delimiter=';' argument tells it that the fields in this file are separated by semicolons.
This CSV file has around 5147 rows, so we are not printing the output here. Make sure you execute the program before proceeding.
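If you do not have the file at hand, the same call can be tried on a small in-memory stand-in. The sketch below assumes the header layout shown in the output later in this post and reuses two rows from it; io.StringIO simply lets read_csv treat a string as a file.

```python
import io
import pandas as pd

# A tiny semicolon-separated stand-in for Customer.csv (assumed layout).
sample_text = (
    "Customer Name;Customer_ID;DOB;Gender;City\n"
    "rowap;419472292;1987-06-14;M;Bangalore\n"
    "zewov;493686428;1986-05-10;F;Bangalore\n"
)

# sep=';' is the same setting as delimiter=';'.
sample_frame = pd.read_csv(io.StringIO(sample_text), sep=";")

print(sample_frame.shape)             # (2, 5)
print(sample_frame.columns.tolist())
```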
Right now, we are printing every element of the DataFrame on screen. But what if we need to print a single row or a single element?
For example, to print the row at position 6, add print(customer_dataframe.iloc[6]) to the previous example.
Because positions start at 0, this displays the seventh row on the screen.
Similarly, print(customer_dataframe.iloc[6, 0]) displays the value in column position 0 of that row, i.e. 'bikad' in our case.
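Position-based selection with iloc is easier to follow on a tiny table. The sketch below uses made-up names and cities purely for illustration.

```python
import pandas as pd

# Hypothetical three-row table.
small = pd.DataFrame({
    "name": ["a", "b", "c"],
    "city": ["X", "Y", "Z"],
})

# One position: the whole row at position 1 (the second row), as a Series.
print(small.iloc[1])

# Row and column position together: a single value.
print(small.iloc[1, 0])   # b

# A slice of positions: the first two rows, as a DataFrame.
print(small.iloc[0:2])
```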
Operations and Manipulations on DataFrame
In this section, we extend the previous example and learn a few more manipulations on DataFrames.
import pandas as pd
customer_csv_file = 'Customer.csv'
customer_dataframe = pd.read_csv(customer_csv_file, delimiter=';')
# Print the row at position 6 (the seventh row, since positions start at 0)
print(customer_dataframe.iloc[6])
# Print the value at row position 6, column position 0
print(customer_dataframe.iloc[6, 0])
# Print the first 5 rows
print("\n Top 5 Rows are")
print(customer_dataframe.head(5))
# Print the last 5 rows
print("\n Last 5 rows are")
print(customer_dataframe.tail(5))
Take a look at the head() and tail() functions. head(n) returns the first n rows of the DataFrame, whereas tail(n) returns the last n rows.
If you are familiar with UNIX, you already know the head and tail commands. The head() and tail() functions in Pandas are their analogues.
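A minimal sketch of both functions on a ten-row table of placeholder numbers:

```python
import pandas as pd

# Hypothetical table with the values 0..9 in a single column.
numbers = pd.DataFrame({"value": range(10)})

print(numbers.head(3)["value"].tolist())   # [0, 1, 2]
print(numbers.tail(3)["value"].tolist())   # [7, 8, 9]

# With no argument, both functions return 5 rows.
print(len(numbers.head()))                 # 5
```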
So far, we have carried out only elementary operations on DataFrames. Let’s move on to grouping data.
Our CSV file contains Customer Name, Customer ID, Date of Birth, Gender, and City.
We want to identify the customer base in each city, which helps us understand our customer demographics.
By grouping the data by City, we can answer:
Which city has the largest customer base?
Which city has the fewest customers?
These answers show how customers are distributed across cities and, eventually, point to where we might want to expand our market share.
Let’s add the code to our Python script.
# Group the data by City
city_group = customer_dataframe.groupby('City')
count = 0
# Each iteration yields the city name and the rows belonging to that city
for city_name, group in city_group:
    count = count + 1
    print("City", count, city_name)
    print(group)
We create a groupby object in which the customer DataFrame is grouped (see groupby) by "City".
NOTE: City is one of the columns in the CSV file.
Execute the program. The results are too big to show in full, so only a portion of the output is copied here.
City 1 Bangalore
Customer Name Customer_ID DOB Gender City
3 rowap 419472292 1987-06-14 M Bangalore
32 zewov 493686428 1986-05-10 F Bangalore
48 sehob 174535964 1964-02-14 F Bangalore
75 xohiv 751472877 1997-12-03 F Bangalore
….
…..
[379 rows x 5 columns]
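Printing every group is useful for inspection, but the two questions about the largest and smallest customer base can be answered more directly by counting the rows per group. The sketch below is one possible approach on a made-up six-row table; the customer names and the cities other than Bangalore are hypothetical.

```python
import pandas as pd

# Hypothetical sample: 3 customers in Bangalore, 2 in Pune, 1 in Delhi.
customers = pd.DataFrame({
    "Customer Name": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "City": ["Bangalore", "Pune", "Bangalore", "Delhi", "Bangalore", "Pune"],
})

# size() returns the number of rows in each group.
city_counts = customers.groupby("City").size()
print(city_counts)

# idxmax / idxmin return the label (city) with the largest / smallest count.
print(city_counts.idxmax())   # Bangalore
print(city_counts.idxmin())   # Delhi
```

The same three lines would work on customer_dataframe once the real CSV file is loaded.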
We have already covered a lot about DataFrames in Pandas, but there is still a lot more to cover. That's enough for Part-1.
Stay tuned for Part-2 of the tutorial.
Test Your Knowledge
- Find the shape of the DataFrame
- Find the length of the DataFrame