1) How would you implement a data pipeline in Python to ingest raw data, clean it, and export it as a report-ready dataset?
Data pipelines are essential for processing raw data into meaningful insights. They automate data ingestion, cleaning, transformation, and export, ensuring that the final dataset is ready for reporting or analysis. In this article, we will walk through the implementation of a basic data pipeline in Python.
Overview of the Pipeline:
The pipeline involves four main stages:
- Data Ingestion: Load raw data from various sources such as files, APIs, or databases.
- Data Cleaning: Preprocess the data to remove inconsistencies, handle missing values, and standardize formats.
- Data Transformation: Convert the cleaned data into a structured format suitable for reporting.
- Data Export: Save the final dataset in the desired format (e.g., CSV, Excel, or a database).

Below is the Python implementation of the pipeline:
First, import the required libraries and set up paths for storing the raw, cleaned, and report-ready datasets:
OS: Used for interacting with the operating system, such as managing files, directories, and environment variables.
Requests: Used for making HTTP requests to fetch data from APIs or download files from the web.
Pandas: Used for data manipulation, analysis, and handling structured data in tabular form.
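Here is a minimal setup sketch; the directory names are illustrative assumptions, not fixed requirements:

import os

import pandas as pd
import requests

# Folders for each stage of the pipeline (hypothetical layout)
RAW_DIR = "data/raw"
CLEAN_DIR = "data/clean"
REPORT_DIR = "data/report"

# Create the folders if they do not already exist
for path in (RAW_DIR, CLEAN_DIR, REPORT_DIR):
    os.makedirs(path, exist_ok=True)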
Step 1: Data Ingestion
This function loads data from different sources (CSV files, APIs, or databases):
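A possible sketch is below; the JSON response shape and the transactions table name are assumptions for illustration, so adapt them to your actual sources:

def ingest_data(source, source_type="csv"):
    """Load raw data from a CSV file, an API endpoint, or a SQLite database."""
    if source_type == "csv":
        return pd.read_csv(source)
    if source_type == "api":
        response = requests.get(source, timeout=30)
        response.raise_for_status()  # stop early on HTTP errors
        return pd.DataFrame(response.json())
    if source_type == "database":
        import sqlite3
        with sqlite3.connect(source) as conn:
            # The table name is an assumption for illustration
            return pd.read_sql("SELECT * FROM transactions", conn)
    raise ValueError(f"Unsupported source type: {source_type}")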
Step 2: Data Cleaning
This function removes duplicates, handles missing values, and standardizes formats:
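One way to sketch it; the date conversion is just an example of a column-specific rule and assumes a date column may exist:

def clean_data(df):
    """Remove duplicates and missing values, and standardize formats."""
    df = df.drop_duplicates()
    df = df.dropna()
    # Standardize column names: strip spaces, lowercase, underscores
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Example of a specific cleaning requirement: parse dates if present
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
    return df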
Step 3: Data Transformation
This function aggregates and structures the data for reporting:
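A sketch of the aggregation, assuming the cleaned data has category and amount columns:

def transform_data(df):
    """Aggregate transactions by category for reporting."""
    return (
        df.groupby("category")["amount"]
        .agg(total="sum", average="mean", count="count")
        .reset_index()
    )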
Step 4: Data Export
This function saves the final dataset to a file:
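A minimal export sketch that writes a CSV into the report directory defined earlier:

def export_data(df, filename="report_ready_data.csv"):
    """Save the final dataset as a CSV file and return its path."""
    output_path = os.path.join(REPORT_DIR, filename)
    df.to_csv(output_path, index=False)
    return output_path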
Main Pipeline Function
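A small driver ties the four steps together; the input file name is hypothetical:

def run_pipeline(source, source_type="csv"):
    """Run the full pipeline: ingest -> clean -> transform -> export."""
    raw_df = ingest_data(source, source_type)
    cleaned_df = clean_data(raw_df)
    report_df = transform_data(cleaned_df)
    output_path = export_data(report_df)
    print(f"Report-ready dataset saved to {output_path}")

if __name__ == "__main__":
    run_pipeline(os.path.join(RAW_DIR, "transactions.csv"))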
Explanation of Steps:
Data Ingestion:
- Use ingest_data to fetch raw data from a CSV file, API, or database.
- Example: Load a CSV file containing transaction data.
Data Cleaning:
- Remove rows with missing or duplicate values.
- Standardize column names and handle specific cleaning requirements (e.g., date conversion).
Data Transformation:
- Perform aggregations like calculating totals, averages, and counts.
- Example: Group transactions by category and compute metrics like total and average values.
Data Export:
- Save the cleaned and transformed data to a CSV file using export_data.
- Example: Save the report-ready dataset to report_ready_data.csv.
2) Write a Python script to merge multiple CSV files from a directory and perform data cleaning!
Why Merge and Clean CSV Files?
When working with data from multiple sources, you might encounter scenarios where:
- Data is split across multiple CSV files.
- Files contain duplicate rows or missing values.
- Column names have inconsistencies like extra spaces.
Merging and cleaning these files ensures a unified and high-quality dataset for analysis.
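Here is one way to sketch the script. It assumes the CSV files share a compatible schema; the directory and output file names are placeholders:

import glob
import os

import pandas as pd

def merge_and_clean(directory):
    """Merge every CSV file in a directory and apply basic cleaning."""
    csv_files = glob.glob(os.path.join(directory, "*.csv"))
    merged_df = pd.concat(
        (pd.read_csv(f) for f in csv_files), ignore_index=True
    )
    # Strip extra spaces from column names
    merged_df.columns = merged_df.columns.str.strip()
    # Remove exact duplicate rows
    merged_df = merged_df.drop_duplicates()
    # Drop rows where every column is null
    merged_df = merged_df.dropna(how="all")
    return merged_df

merged_df = merge_and_clean("data/csv_files")
merged_df.to_csv("merged_cleaned.csv", index=False)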
Here, I removed only rows where every column is null. Some rows may still have nulls in individual columns; to remove or fill those, use one of the methods below, depending on the type of data. If you’re unsure how to fill missing or null values, I suggest reading the 3rd question in this article: https://insightsbyapurba.com/top-100-questions-on-python-for-data-analysis-interviews/
# Numeric column: fill with the mean
merged_df['column_name'] = merged_df['column_name'].fillna(merged_df['column_name'].mean())
# Skewed numeric column: fill with the median
merged_df['column_name'] = merged_df['column_name'].fillna(merged_df['column_name'].median())
# Categorical column: fill with the mode (most frequent value)
merged_df['column_name'] = merged_df['column_name'].fillna(merged_df['column_name'].mode()[0])
3) Given a list of dictionaries, write a Python program to group the data by a specific key and calculate summary statistics for the grouped data
Here is a list of dictionaries:
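The sample below is hypothetical; the same pattern works for any records that share a common key. The helper groups records by the chosen key and computes count, total, average, min, and max of a numeric field:

from collections import defaultdict
from statistics import mean

data = [
    {"department": "Sales", "salary": 50000},
    {"department": "Sales", "salary": 60000},
    {"department": "IT", "salary": 70000},
    {"department": "IT", "salary": 80000},
    {"department": "HR", "salary": 45000},
]

def group_and_summarize(records, key, value):
    """Group records by `key` and summarize the numeric `value` field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {
        group: {
            "count": len(values),
            "total": sum(values),
            "average": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for group, values in groups.items()
    }

summary = group_and_summarize(data, "department", "salary")
for group, stats in summary.items():
    print(group, stats)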
Output (for the sample list above):
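Sales {'count': 2, 'total': 110000, 'average': 55000, 'min': 50000, 'max': 60000}
IT {'count': 2, 'total': 150000, 'average': 75000, 'min': 70000, 'max': 80000}
HR {'count': 1, 'total': 45000, 'average': 45000, 'min': 45000, 'max': 45000}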
4) Create a Python function to identify fraud patterns in a dataset of credit card transactions using statistical methods!
One effective approach to fraud detection is using Z-scores to identify outliers and analyzing user activity patterns to spot frequent or unusual transactions. Here’s how this solution works:
Outlier Detection Using Z-Scores:
- A Z-score measures how many standard deviations a transaction amount lies from the mean of all transactions: z = (amount - mean) / standard deviation.
- Transactions with a Z-score beyond a certain threshold (commonly set at 3) are considered outliers and flagged as potentially fraudulent. This method helps identify transactions with unusually high amounts compared to the user’s normal spending patterns.
Identifying Frequent Users:
- Fraudulent activity often involves an abnormally high frequency of transactions. By analyzing the number of unique transaction days per user, we can identify users who make an unusually high number of transactions in a short period.
- If a user’s transaction frequency exceeds a certain threshold (e.g., more than 2 standard deviations above the mean), they are flagged as a potential fraud risk.
Combining Outlier and Frequency Analysis:
- The final step combines the results of both the Z-score analysis and the user activity analysis. Transactions that are either outliers in terms of amount or belong to users with unusual transaction frequencies are flagged as fraud candidates.
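Here is a sketch that combines both checks. It assumes user_id and amount columns and, for simplicity, measures user activity as transaction counts rather than unique transaction days; the sample data at the end is hypothetical:

import pandas as pd

def detect_fraud(df, z_threshold=3, freq_threshold=2):
    """Flag transactions that are amount outliers or that belong to
    unusually active users."""
    df = df.copy()

    # Z-score of every transaction amount
    df["z_score"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    df["amount_outlier"] = df["z_score"].abs() > z_threshold

    # Flag users whose transaction count is far above the average
    user_counts = df["user_id"].value_counts()
    cutoff = user_counts.mean() + freq_threshold * user_counts.std()
    frequent_users = user_counts[user_counts > cutoff].index
    df["frequent_user"] = df["user_id"].isin(frequent_users)

    # A fraud candidate is an amount outlier OR a frequent-user transaction
    return df[df["amount_outlier"] | df["frequent_user"]]

# Hypothetical sample: eleven ordinary transactions and one very large one
transactions = pd.DataFrame({
    "user_id": ["u1"] * 5 + ["u2"] * 6 + ["u3"],
    "amount": [100] * 11 + [10000],
})
for row in detect_fraud(transactions).itertuples():
    print(row.user_id, row.amount)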
Here’s the output for the hypothetical sample above, where only the large 10000 transaction is flagged:
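u3 10000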