- Pandas is an open-source Python library built on NumPy for working with datasets. Its name is derived from "panel data," a term for multidimensional structured datasets. It has become a core library in the Python data science stack, alongside giants like Matplotlib for data visualization and Scikit-learn for machine learning.
- It has functions for analyzing, cleaning, exploring, and manipulating data.
- Pandas can analyze large datasets and draw conclusions based on statistical methods.
- Pandas is used to clean messy datasets and make them readable and relevant.
- Relevant data is very important in data science.
- Pandas offers robust functionality for detecting, removing, or filling missing data, making data preprocessing easier.
Key Specialties of Pandas
Pandas offers a plethora of features that make it the go-to library for data manipulation and analysis. Some of its unique capabilities include:
High-Level Data Structures: Pandas provides two fundamental data structures, Series and DataFrame, which allow you to manipulate data intuitively and efficiently.
Flexible Data Handling: It supports a wide variety of data formats, including CSV, Excel, SQL, JSON, and more. This means you can easily load data from external sources, manipulate it, and save it back into the desired format.
Built-in Methods for Data Cleaning: Handling missing data is a common challenge for data scientists. Pandas provides numerous methods to handle missing values, either by filling, interpolating, or dropping them altogether.
Time Series Support: Pandas has specialized tools for working with time series data, allowing you to perform tasks like resampling, shifting, or calculating rolling averages easily.
Data Transformation: Operations like merging, grouping, and pivoting data can be done with just a few lines of code, allowing you to reshape your data as needed.
Integration with Other Libraries: Pandas works seamlessly with other popular Python libraries, making it an ideal choice for data manipulation before moving to more advanced tasks like machine learning.
Why Do Data Scientists and Analysts Use Pandas?
Data scientists and analysts deal with a wide range of data, from financial transactions and customer records to sensor readings and text data. To turn this raw information into actionable insights, they need a library that can handle large datasets with varying levels of complexity.
Here’s why Pandas is so widely used:
Ease of Use: Pandas simplifies complex data manipulation tasks into a few simple commands, making the process faster and less error-prone. It allows users to spend more time analyzing and less time wrangling data.
Efficiency: Working with large datasets can be computationally expensive. Pandas is optimized for performance, and it leverages NumPy under the hood to perform operations efficiently.
Versatility: Whether you’re cleaning messy data, conducting statistical analysis, or preparing data for machine learning, Pandas has the tools needed for a wide range of use cases.
Powerful Grouping and Aggregation: Grouping and summarizing data are common in tasks like calculating average sales, total revenue, or analyzing customer behavior. Pandas allows data scientists to easily group and aggregate data to gain deeper insights.
Rich Ecosystem: Data scientists often work in conjunction with libraries like Matplotlib, Seaborn (for visualizations), and Scikit-learn (for machine learning). Pandas integrates smoothly with these tools, ensuring a streamlined workflow from data collection to model building.
Understanding Pandas Data Structures:
The two main building blocks of Pandas are Series and DataFrame. Let’s break them down.
1. Series:
A Pandas Series is like a column in a table: a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). Think of a Series as a column in an Excel sheet or a SQL table. A Series is similar to a NumPy array but has the added benefit of labels (an index) that allow for more intuitive data manipulation.
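A minimal sketch of creating a Series with a custom index (the values and labels are invented for illustration):

import pandas as pd

# A Series holds one-dimensional labeled data
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Carol'], name='Age')
print(ages['Bob'])  # label-based lookup -> 30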
2. DataFrame:
A DataFrame is the two-dimensional structure that Pandas is most famous for. It's similar to an Excel spreadsheet or a SQL table, where data is organized in rows and columns. A DataFrame can hold different types of data (numeric, string, Boolean, etc.) and is designed for easy manipulation, indexing, and filtering.
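A minimal sketch of building a DataFrame from a dictionary; the Name/Age/City columns are invented for illustration and are reused in the dtype discussion below:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 35],
    'City': ['Delhi', 'Mumbai', 'Pune'],
})
print(df.dtypes)  # Name and City -> object, Age -> int64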
Datatypes: The data type of a column of a DataFrame (or of a Series) is called its dtype.
In the example above, Name and City have the object dtype and Age is of integer type.
One confusing thing about Pandas datatypes is this object type: if a column consists entirely of strings, it still shows the object dtype.
astype() is used to change a dtype.
If you change Age's dtype to float64 and then check again, Age is still int64; that's because the converted result was never assigned back to the DataFrame.
Assigning the result back to the DataFrame makes the change stick, as the sketch below shows.
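A sketch of both cases, using the hypothetical DataFrame above:

# astype() returns a converted copy but does not change df
df['Age'].astype('float64')
print(df['Age'].dtype)  # still int64

# Assigning the result back makes the change stick
df['Age'] = df['Age'].astype('float64')
print(df['Age'].dtype)  # float64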
Retrieving a Series'/DataFrame's information:
- df.shape: gives the number of rows and columns
- df.index: describes the index
- df.columns: describes the DataFrame's columns
- df.info(): gives information about the DataFrame
- df.count(): gives the number of non-NA values in each column
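On the hypothetical DataFrame from earlier, these look roughly like this:

print(df.shape)    # (3, 3) -> 3 rows, 3 columns
print(df.index)    # RangeIndex(start=0, stop=3, step=1)
print(df.columns)  # Index(['Name', 'Age', 'City'], dtype='object')
df.info()          # dtypes, non-null counts, memory usage
print(df.count())  # non-NA values per column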
Summary of a dataframe:
sum(), cumsum(), min(), max(), describe(), mean(), median() are used to get a summary of any column or of the whole DataFrame. describe(), mean(), and median() are elaborated on later in the article.
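A quick sketch on a numeric column of the hypothetical DataFrame:

print(df['Age'].sum())     # total of the column
print(df['Age'].cumsum())  # running total, row by row
print(df['Age'].min(), df['Age'].max())  # smallest and largest values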
Indexing, Selecting & Assigning:
When you have a dataset and need to select a specific portion of it, you use indexing and selection. The accessors loc[] and iloc[] are used most often, but at[] and iat[] exist as well.
loc (label-based access)
- What it does: Accesses data using row/column labels.
- How it works: You specify the row and column labels (e.g., index values, column names).
- Includes the last value in a slice range.
iloc (integer-based access)
- What it does: Accesses data using integer positions (similar to NumPy arrays).
- How it works: You specify the row and column by their numerical positions (starting from 0).
- Excludes the last value in a slice range.
at (label-based single-value access)
- What it does: Faster access to a single element using row and column labels.
- How it works: Use this when you only need a single value; it is faster than loc for scalar lookups.
iat (integer-based single-value access)
- What it does: Faster access to a single element using integer positions.
- How it works: Similar to at, but uses integer positions instead of labels.
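A minimal sketch contrasting the four accessors on the hypothetical DataFrame from earlier, re-indexed by Name:

people = df.set_index('Name')

# loc: label-based; slice endpoints are inclusive
print(people.loc['Alice':'Bob', 'Age'])

# iloc: position-based; the slice endpoint is exclusive
print(people.iloc[0:2, 0])

# at / iat: fast scalar lookups
print(people.at['Alice', 'Age'])
print(people.iat[0, 0])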
Reading and Writing Data: We can read data from CSV and Excel files and from databases with Pandas, and then we can write back to these formats too.
For this practice we're going to use US stock market data downloaded from Kaggle. You can download the dataset from here: https://www.kaggle.com/datasets/muhammadehsan02/us-stock-market-and-commodities-data-2020-2024
To read data from a CSV file: df = pd.read_csv('data.csv')
To write to a CSV file: df.to_csv('output.csv', index=False)  # index=False avoids writing row numbers
Similarly for Excel files:
# Reading from Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Writing to Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
For SQL databases:
import sqlite3

# Establish a connection to the SQL database
conn = sqlite3.connect('example.db')

# Write the DataFrame to a SQL table
df.to_sql('my_table', conn, if_exists='replace', index=False)

# Read data from the SQL table
df_sql = pd.read_sql('SELECT * FROM my_table', conn)
Manipulating the index: The index we see in Pandas isn't fixed; we can change it to whatever suits the data. set_index() does the job.
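For instance, with the stock dataset we might index by date (assuming the CSV has a 'Date' column; check df.columns for the actual name):

df = pd.read_csv('data.csv')  # hypothetical path to the Kaggle CSV
df = df.set_index('Date')     # 'Date' is an assumed column name
print(df.index[:3])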
Conditional Selection: To do interesting things with data we sometimes need conditions. For example, using the US stock data, we may need to find the days on which the Nasdaq-100's volume was higher than some particular volume.
In the dataset there's a column, "Nasdaq_100_Volume". Suppose we need the details of the days on which the volume was higher than the average volume.
Pandas also comes with a few built-in conditional selectors, such as isin() and isnull().
isin() lets you select data whose value "is in" a list of values.
isnull() and notnull() let us find values that are (or are not) empty.
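A sketch of all three ideas on the stock data ('Nasdaq_100_Volume' is the column named above; the isin() line uses a hypothetical 'Year' column):

# Boolean mask: days where volume exceeded its average
avg_vol = df['Nasdaq_100_Volume'].mean()
busy_days = df[df['Nasdaq_100_Volume'] > avg_vol]

# isin(): keep rows whose value appears in a list (hypothetical column)
# recent = df[df['Year'].isin([2023, 2024])]

# isnull() / notnull(): find missing and present values
missing = df[df['Nasdaq_100_Volume'].isnull()]
present = df[df['Nasdaq_100_Volume'].notnull()]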
Assigning Data: We can assign data in a DataFrame very easily, either as a constant value or as an iterable of values. Suppose we need to set a whole column to a particular value; the sketch below shows both cases.
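A minimal sketch (the column names are invented):

# Assign a constant to every row of a (new or existing) column
df['Source'] = 'kaggle'

# Assign an iterable; its length must match the number of rows
df['Row_Number'] = range(len(df))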
Summary functions: Pandas provides some summary functions that present the data in a useful, summarized way.
Here are a few: describe(), mean(), info(), unique(), value_counts().
describe(): The df.describe() function in Pandas provides a quick summary of descriptive statistics for the numerical columns in a DataFrame. It returns a table that shows various statistics for each column, helping you understand the distribution of your data.
Here’s a breakdown of what each row in the output represents:
- count: The number of non-null values in each column.
- mean: The average of the values in each column.
- std: The standard deviation, a measure of the spread of values around the mean.
- min: The minimum value in each column.
- 25% (1st Quartile): The 25th percentile (the value below which 25% of the data falls).
- 50% (Median): The 50th percentile or median (the middle value of the data).
- 75% (3rd Quartile): The 75th percentile (the value below which 75% of the data falls).
- max: The maximum value in each column.
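Calling it is a one-liner:

print(df.describe())               # numeric columns only, by default
print(df.describe(include='all'))  # include non-numeric columns too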
mean(): The mean() function in Pandas is used to calculate the arithmetic mean (average) of the numeric values across a Pandas DataFrame or Series. It computes the sum of the data and divides it by the number of elements.
From the dataset, we find that the average crude oil price in this timeframe is 69.8534:
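The exact column name depends on the CSV; assuming it is called 'Crude_Oil_Price' (check df.columns for the real name), the call would look like:

print(df['Crude_Oil_Price'].mean())  # ~69.8534 in the author's run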
info(): The df.info() function in Pandas is used to quickly get a summary of the DataFrame. This method provides essential information about the DataFrame's structure, including the number of rows and columns, data types, memory usage, and how many non-null (non-missing) values there are in each column.
isnull().sum(): The isnull().sum() combination in Pandas is commonly used to detect and count the number of missing (NaN or Null) values in each column of a DataFrame or Series.
In Nasdaq_100_Volume there are 7 null values. I deleted those rows from the source file so that we can perform and see the next set of operations.
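An alternative sketch that does the same cleanup in code instead of editing the file:

print(df.isnull().sum())  # null count per column

# Drop the rows that are missing the volume value
df = df.dropna(subset=['Nasdaq_100_Volume'])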
unique(): The unique() function in Pandas is used to find the unique values in a Pandas Series or a column in a DataFrame. It returns a NumPy array containing the unique values, which can be useful for data exploration, particularly when you want to identify the distinct elements in a dataset.
value_counts(): If we need to see the unique values in a dataset and how often each of them occurs, we use the value_counts() method.
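A sketch of both, on the 'City' column of the small hypothetical DataFrame from the beginning:

print(df['City'].unique())        # array of distinct values
print(df['City'].value_counts())  # distinct values with their frequencies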
Exploratory Data Analysis using Pandas:
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It helps in understanding the underlying patterns, spotting anomalies, testing hypotheses, and checking assumptions with the help of statistical graphics and data visualization techniques. EDA is essential before building machine learning models because it helps in cleaning and preparing the data effectively.
Since we have already learnt most of the Pandas concepts and techniques, let's now dive into how we can use Pandas in real-life analysis. Some concepts not discussed yet are elaborated under the points below.
Key steps in EDA:
- Data Collection: Gathering data from various sources.
- Data Cleaning: Handling missing values, removing duplicates, correcting errors.
- Data Profiling: Checking the structure of the dataset, data types, and column statistics.
- Univariate Analysis: Analyzing each variable separately, such as checking distribution (using histograms, box plots).
- Bivariate/Multivariate Analysis: Analyzing relationships between two or more variables (scatter plots, correlation matrices).
- Outlier Detection: Identifying and handling outliers that may affect the analysis.
- Data Transformation: Converting or scaling data for better analysis or modeling.
1. Data Loading
- pd.read_csv(): Reads a CSV file into a DataFrame.
- pd.read_excel(): Reads an Excel file into a DataFrame.
- pd.read_sql(): Reads data from a SQL database.
- df.head(): Displays the first few rows of the DataFrame (default is 5 rows).
2. Data Inspection
- df.info(): Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage.
- df.describe(): Generates descriptive statistics for numerical columns (mean, std, min, quartiles, max, etc.).
- df.shape: Returns the number of rows and columns in the DataFrame.
- df.columns: Lists all column names in the DataFrame.
3. Handling Missing Data
- df.isnull(): Detects missing values and returns a DataFrame of booleans.
- df.isnull().sum(): Counts the number of missing values per column.
- df.fillna(): Fills missing values with a specified value (mean, median, mode, or a custom value).
- df.dropna(): Drops rows or columns with missing values.
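A sketch that fills volume gaps with the column mean (keeping the column name from earlier):

df['Nasdaq_100_Volume'] = df['Nasdaq_100_Volume'].fillna(
    df['Nasdaq_100_Volume'].mean()
)
df = df.dropna()  # drop any rows that still contain missing values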
4. Data Selection & Indexing
- df['column_name']: Accesses a specific column.
- df.loc[]: Label-based access to rows and columns.
- df.iloc[]: Integer-position-based access to rows and columns.
- df.at[] / df.iat[]: Access a single value at a specific row/column.
- df.set_index(): Sets a column as the index of the DataFrame.
5. Filtering and Subsetting Data
- df[df['column_name'] == value]: Filters rows where a specific column equals a certain value.
- df[(df['column1'] > value1) & (df['column2'] < value2)]: Filters data based on multiple conditions.
6. Data Cleaning
- df.drop(): Removes rows or columns from the DataFrame.
- df.rename(): Renames columns or indices.
- df.duplicated(): Identifies duplicate rows.
- df.drop_duplicates(): Removes duplicate rows from the DataFrame.
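A sketch chaining a few of these (the new column name is invented):

df = df.rename(columns={'Nasdaq_100_Volume': 'nasdaq_volume'})  # hypothetical rename
print(df.duplicated().sum())  # count exact duplicate rows
df = df.drop_duplicates()     # keep only the first occurrence of each row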
7. Aggregation and Grouping
- df.groupby(): Groups the data by one or more columns and allows for aggregation.
- df.agg(): Performs multiple aggregate functions at once.
- df.mean(), df.sum(), df.min(), df.max(): Common aggregation functions.
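A sketch assuming the CSV has a 'Date' column from which we derive a year:

df['Year'] = pd.to_datetime(df['Date']).dt.year  # 'Date' is an assumed column

# Average and peak volume per year, in one pass
print(df.groupby('Year')['Nasdaq_100_Volume'].agg(['mean', 'max']))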
8. Sorting and Ranking
- df.sort_values(): Sorts the DataFrame by the values in one or more columns.
- df.rank(): Assigns ranks to data based on values.
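For example:

top10 = df.sort_values('Nasdaq_100_Volume', ascending=False).head(10)  # busiest days
df['vol_rank'] = df['Nasdaq_100_Volume'].rank(ascending=False)         # 1 = highest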
9. Merging and Joining DataFrames
- pd.concat(): Concatenates multiple DataFrames along rows or columns.
- pd.merge(): Merges two DataFrames based on common columns or indices (similar to a SQL JOIN).
- df.join(): Joins two DataFrames on their indices.
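A minimal sketch with two invented DataFrames sharing a key column:

left = pd.DataFrame({'ticker': ['NDX', 'SPX'], 'price': [15000.0, 4500.0]})
right = pd.DataFrame({'ticker': ['NDX', 'SPX'], 'volume': [1.2e9, 2.3e9]})

merged = pd.merge(left, right, on='ticker')  # SQL-style join on a column
stacked = pd.concat([left, left])            # stack rows of same-shaped frames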
10. Data Transformation
- df.apply(): Applies a function along rows or columns of a DataFrame.
- df.applymap(): Applies a function element-wise across the entire DataFrame (renamed to DataFrame.map() in Pandas 2.1+).
- df.transform(): Applies a function to transform a DataFrame's column(s).
- df.pivot(): Reshapes data based on column and index values.
- df.melt(): Converts a wide-format DataFrame to a long-format DataFrame.
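A sketch of apply() and melt() on the invented merged frame from the previous sketch:

# apply(): run a function per column (or per row with axis=1)
print(merged[['price', 'volume']].apply(lambda col: col.max() - col.min()))

# melt(): wide to long, one (ticker, variable, value) row per measurement
long_form = merged.melt(id_vars='ticker', value_vars=['price', 'volume'])
print(long_form)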
11. Handling Categorical Data
- pd.get_dummies(): Converts categorical variables into dummy/indicator variables (one-hot encoding).
- df.astype(): Converts data to different types (e.g., from strings to categories).
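A sketch on the invented 'ticker' column:

merged['ticker'] = merged['ticker'].astype('category')  # memory-friendly dtype
dummies = pd.get_dummies(merged['ticker'])              # one column per category
print(dummies)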
12. Basic Plotting
- df.plot(): Generates basic visualizations from the DataFrame.
- df.hist(): Creates a histogram for numerical columns.
- df.boxplot(): Creates a boxplot for visualizing the distribution of data.
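These wrap Matplotlib, so it must be installed; a sketch on the volume column:

import matplotlib.pyplot as plt

df['Nasdaq_100_Volume'].plot(title='Nasdaq-100 volume over time')
plt.show()

df['Nasdaq_100_Volume'].hist(bins=30)
plt.show()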
13. Statistical Analysis
- df.corr(): Calculates the correlation between columns.
- df.cov(): Calculates the covariance between columns.
- df.quantile(): Computes the quantiles of the data.
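For example, on the numeric columns:

print(df.corr(numeric_only=True))  # pairwise correlations (numeric_only needs a recent Pandas)
print(df.quantile([0.25, 0.5, 0.75], numeric_only=True))  # quartiles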
14. Handling Dates and Times
- pd.to_datetime(): Converts a column to datetime format.
- df.resample(): Resamples time series data at different frequencies (requires a datetime index).
- Series.dt: Accesses datetime properties for date manipulations (e.g., df['Date'].dt.year).
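A closing sketch, again assuming a 'Date' column in the stock CSV:

df['Date'] = pd.to_datetime(df['Date'])   # parse strings into datetimes
df['weekday'] = df['Date'].dt.day_name()  # .dt exposes datetime properties

# resample() needs a datetime index: monthly mean volume
# 'ME' = month-end frequency (use 'M' on older Pandas versions)
monthly = df.set_index('Date')['Nasdaq_100_Volume'].resample('ME').mean()
print(monthly.head())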