Top 100 Questions/Concepts on Python for Data Analytics Interviews (Part 1)


1) Write a Python program that reads a large CSV file in chunks and processes each chunk for data cleaning.

When working with a large dataset, loading the entire file at once may exhaust available memory. The solution is to read and process the data in smaller chunks. Typical cleaning steps include:

 1) Removing rows with missing or invalid data

 2) Stripping whitespace from strings

 3) Correcting data types (e.g., converting strings to datetime)

 

 

In Python, the Pandas library provides a convenient way to handle large CSV files using the chunksize argument of pandas.read_csv(). This reads the file in smaller pieces (called chunks), processes them, and either writes the cleaned data to a file or returns it.

By processing each chunk one at a time, you avoid loading the entire file into memory.

 

Here’s the Python code:

 

import pandas as pd

 

def clean_data(chunk):
    # Remove rows with any missing values
    chunk.dropna(inplace=True)

    # Strip leading/trailing whitespace from string (object) columns
    for col in chunk.select_dtypes(include='object').columns:
        chunk[col] = chunk[col].str.strip()
    return chunk
 
  • Purpose: This function takes a chunk of data (a subset of the CSV file) and performs the following cleaning tasks:
    • dropna(): Removes any rows with missing values in the chunk.
    • Whitespace Stripping: Iterates over all columns of type object (which usually represents strings) and removes leading and trailing whitespaces using the str.strip() method.
    • The cleaned chunk is then returned.
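The cleaning list above also mentions correcting data types (e.g., strings to datetime). clean_data does not do that yet; a minimal sketch of an extra step inside the function, assuming a hypothetical 'date' column, could look like this:

    # Hypothetical extra step for clean_data: convert a 'date' column
    # (if present) from strings to datetime, coercing unparsable values to NaT
    if 'date' in chunk.columns:
        chunk['date'] = pd.to_datetime(chunk['date'], errors='coerce')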
 
def process_csv_in_chunks(file_path, chunk_size=10000, output_file=None):

    cleaned_chunks = []
    for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
        cleaned_chunk = clean_data(chunk)

        if output_file:
            # Write the header only for the first chunk, then append without it
            cleaned_chunk.to_csv(output_file, mode='a', header=(i == 0), index=False)
        else:
            cleaned_chunks.append(cleaned_chunk)

 

    if not output_file:
        cleaned_data = pd.concat(cleaned_chunks, ignore_index=True)
        return cleaned_data

    print(f"Cleaned data has been saved to {output_file}")

 

process_csv_in_chunks('C:/Users/apurb/Downloads/python1/data.csv', chunk_size=5000, output_file='cleaned_data.csv')
 
 
Breakdown of the code:
  • file_path: The path to the input CSV file.
  • chunk_size: The number of rows to read at a time (default is set to 10,000).
  • output_file: Optional. If specified, the cleaned data will be saved to this file.
 

 

  • Functionality:
    • Initializes an empty list cleaned_chunks to store cleaned data chunks.
    • Uses pd.read_csv() to read the CSV file in specified chunks.
    • Each chunk is cleaned using the clean_data function.
    • If output_file is provided, each cleaned chunk is appended to that file. The header argument is set to ensure that the header row is only written for the first chunk.
    • If output_file is not provided, cleaned chunks are stored in a list, and at the end, they are concatenated into a single DataFrame using pd.concat(), which is then returned.
  • Finally, a message is printed indicating where the cleaned data has been saved.

 

In the CSV file, I removed five values from a column.

 

 

In the output file, those rows have been removed.

 

 

2) Write a Python function to merge two datasets based on a common key and perform an inner join.

import pandas as pd

def merge_datasets(df1, df2, key):
    # Inner join: keep only rows whose key appears in both DataFrames
    merged_df = pd.merge(df1, df2, on=key, how='inner')
    return merged_df

 

pd.merge(): This pandas function is used to merge DataFrames. The on parameter specifies the key column, and the how='inner' parameter ensures an inner join is performed.

The inner join returns only the rows where the key matches in both DataFrames. Rows with no match in either DataFrame are discarded.

Here, only IDs 9, 11, 12, and 14 are returned because we performed an inner join.
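The input tables appear as screenshots in the original post; here is a minimal sketch with hypothetical data, chosen so that the overlapping ids are 9, 11, 12, and 14:

import pandas as pd

# Hypothetical tables; only ids 9, 11, 12 and 14 appear in both
employees = pd.DataFrame({
    'id':   [9, 10, 11, 12, 13, 14],
    'name': ['Ana', 'Ben', 'Cara', 'Dev', 'Eli', 'Fay']
})
salaries = pd.DataFrame({
    'id':     [9, 11, 12, 14, 15],
    'salary': [52000, 61000, 58000, 70000, 45000]
})

merged = merge_datasets(employees, salaries, key='id')
print(merged)   # rows for ids 9, 11, 12 and 14 only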



3) How do you handle missing data in a dataset? What are the different strategies you would use to fill in the missing values?

Why It’s Necessary to Handle Missing Data:

 

Ensuring Accuracy and Consistency: Missing values can distort calculations, skew summary statistics, and lead to inaccurate insights. Proper handling ensures that the analysis reflects true trends and relationships in the data.

Preserving Data Integrity and Sample Size: Removing rows with missing values can reduce the dataset and introduce bias. Imputation helps retain valuable information and prevents biased results, maintaining the dataset’s representativeness.

Improving Interpretability and Visualizations: Missing data can complicate visualizations and make insights harder to understand. Addressing it ensures clear, consistent results that are easier for stakeholders to interpret and act upon.

Bias and Distortion: Missing data can lead to biased or incorrect analysis. For example, if many values are missing from a particular category, your conclusions may not accurately reflect the population.

Algorithm Performance: Many machine learning algorithms can’t handle missing data directly and will either fail or give incorrect results. Missing data must be addressed before modeling.

 

How to Handle Missing Data

There are several strategies to handle missing data, depending on the nature and extent of the missing values, as well as the importance of the data.

 

 

1. Removing (Dropping) Missing Data:

When the missing data is negligible or unimportant for the analysis.

 

import pandas as pd

 

# Drop rows with any missing values

df_cleaned = df.dropna()

 

# Drop columns with any missing values

df_cleaned = df.dropna(axis=1)

 

2. Imputation (Filling Missing Data):

 When dropping data would result in significant information loss.
We can replace missing values with a statistic such as the mean, median, mode, or a more sophisticated method.

a) Mean, Median, Mode Imputation:

 

1. Mean Imputation

When to use:

    • When the data is normally distributed (symmetric distribution).
    • For continuous numerical data (e.g., age, salary, temperature).
When not to use: Sensitive to outliers. If your data has extreme values, the mean might be skewed and not represent the true central value.
 
df['column'].fillna(df['column'].mean(), inplace=True)
 
2. Median Imputation

When to use:

    • When the data is skewed (asymmetric distribution).
    • For continuous numerical data that contains outliers.
    • More robust than the mean because the median is less affected by outliers. Useful when the data is skewed or contains extreme values.
  • When not to use: May not capture the central tendency well for normally distributed data.
df['column'].fillna(df['column'].median(), inplace=True)
 
In a dataset of house prices, where most homes are priced moderately, but a few luxury homes are extremely expensive, median imputation would be more appropriate because it is not influenced by the few high-priced homes.
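A quick illustration with hypothetical prices shows the difference:

import pandas as pd

# Hypothetical house prices with one extreme luxury outlier
prices = pd.Series([250_000, 260_000, 270_000, 280_000, 5_000_000])

print(prices.mean())    # 1212000.0 - pulled far up by the luxury home
print(prices.median())  # 270000.0  - still reflects a typical home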

 

 

 

3. Mode Imputation
  • When to use:
    • For categorical data (e.g., gender, country, product categories).
    • For discrete data with few unique values (e.g., number of children).
    • The mode represents the most frequent value in the data, so it works well for categorical variables or when the most common value is more meaningful than the average.
  • When not to use: Can be less effective for numerical data or when there isn’t a clear most frequent value.
df['column'].fillna(df['column'].mode()[0], inplace=True)

 

 In a dataset of product categories where most purchases are from a specific category, mode imputation would fill in missing values with the most common category.
 
 

Summary of When to Use:

  • Mean: When data is normally distributed and there are no significant outliers.
  • Median: When data is skewed or contains outliers.
  • Mode: For categorical data or when there is a clear most frequent value.
 
 
b) Forward/Backward Fill:
  • You can propagate the previous or next value in the column to fill missing values. This is often used for time-series data.
# Forward fill (fill with the previous value)
df.ffill(inplace=True)   # fillna(method='ffill') is deprecated in newer pandas
 
# Backward fill (fill with the next value)
df.bfill(inplace=True)
 
c) Interpolation:
  • For numerical or time-series data, interpolation can estimate the missing value based on the values before and after it.
# Interpolate missing values
df['column'].interpolate(method='linear', inplace=True)
 
 

3. Using Algorithms that Handle Missing Data

Some algorithms can inherently deal with missing data (e.g., XGBoost, K-Nearest Neighbors). These algorithms can automatically ignore missing values or impute them during the model-building process.
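For the K-Nearest Neighbors case, scikit-learn provides KNNImputer, which estimates each missing value from the most similar rows. Here is a minimal sketch, assuming scikit-learn is installed and using hypothetical columns:

import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric dataset containing missing values
df = pd.DataFrame({
    'age':    [25, 30, None, 40, 35],
    'salary': [50000, None, 65000, 80000, 72000]
})

# Each missing value is estimated from the 2 most similar rows (nearest neighbours)
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)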

 

4) How would you filter out rows in a dataset where a specific column's value falls below a threshold?

We need to keep only the employees whose salary is at least 50,000, reading from an employees Excel file.

 

import pandas as pd

 

# Load Excel file into DataFrame

df = pd.read_excel('employees.xlsx')

 

# Set the threshold

threshold = 50000

 

# Filter rows where 'salary' is greater than or equal to the threshold

filtered_df = df[df['salary'] >= threshold]

 

# Save the filtered DataFrame to a new Excel file (optional)

filtered_df.to_excel('filtered_employees.xlsx', index=False)

 

print(filtered_df)

 

5) How do you group data by multiple columns and calculate aggregate statistics (e.g., sum, mean)?

In Pandas, we can group data by multiple columns and calculate aggregate statistics like sum(), min(), count(), etc., using the groupby() function combined with aggregation functions.

Our data: a table with Department, Region, Employees, and Salary columns (shown as an image in the original post).
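Since the table itself is not reproduced here, the sketch below constructs a hypothetical DataFrame with those columns so that the snippets that follow can be run:

import pandas as pd

# Hypothetical values for the columns used in the examples below
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Region':     ['East',  'West',  'East', 'East', 'West'],
    'Employees':  [10, 8, 12, 5, 4],
    'Salary':     [55000, 60000, 75000, 70000, 50000]
})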

                                         

To group: grouped = df.groupby(['Department', 'Region'])

 Once the data is grouped, we can apply different aggregate functions to each group, like sum(), mean(), count() 
 
Sum():
sum_df = grouped.sum()
print(sum_df) 
Mean():
mean_df = grouped.mean()
print(mean_df)

                                           

We can also apply multiple aggregation functions at once using the agg() function. For example, you can calculate both the sum and the mean for each group:
 
agg_df = grouped.agg({
    'Employees': ['sum', 'mean'],
    'Salary': ['sum', 'mean']
})
print(agg_df)

 

6) How do you apply custom functions to rows or columns in a DataFrame using apply() in Pandas?

In Pandas, the apply() function is a powerful tool for applying custom functions to the rows or columns of a DataFrame. It applies a function along an axis of a DataFrame (or to each value of a Series), giving flexibility in data transformation and aggregation.

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)

  • axis=0 (default): The function will be applied to each column independently.
  • axis=1: The function will be applied to each row independently.
  • raw (Optional, default is False)
    • Controls whether the function receives a Series (default) or a NumPy array.
      • raw=False: The function receives a Series (more flexible but slower).
      • raw=True: The function receives a NumPy array (faster, but less flexible).
  • result_type (Optional, default is None)

    • Defines how the result should be shaped:
      • None: Pandas automatically infers the result type based on the function output.
      • 'expand': Expands the result into a DataFrame (useful when the function returns a list).
      • 'reduce': Forces a Series result.
      • 'broadcast': Broadcasts the result to the original shape of the DataFrame.
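Most of these parameters appear in the examples below; result_type is the least obvious, so here is a minimal sketch of result_type='expand' using hypothetical column names:

import pandas as pd

df = pd.DataFrame({'low': [1, 4], 'high': [10, 20]})

# The function returns a list per row; result_type='expand'
# spreads that list into separate columns instead of one column of lists
expanded = df.apply(
    lambda row: [row['high'] - row['low'], (row['high'] + row['low']) / 2],
    axis=1,
    result_type='expand'
)
expanded.columns = ['range', 'midpoint']
print(expanded)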

Use Case:

We have a DataFrame of salaries, and we want to increase each salary by 10%.


import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000, 60000, 70000]
})

# Define custom function to increase salary by 10%
def increase_salary(salary):
    return salary * 1.10

# Apply the function to the 'salary' column
df['updated_salary'] = df['salary'].apply(increase_salary)

print(df)

 
 
 

Applying a Custom Function to Rows


We can apply custom functions to rows by specifying axis=1.


# Define a custom function to categorize salary
def categorize_salary(row):
    if row['salary'] < 55000:
        return 'Low'
    elif row['salary'] < 65000:
        return 'Medium'
    else:
        return 'High'

# Apply the function to rows (axis=1)
df['salary_category'] = df.apply(categorize_salary, axis=1)

print(df)

 
 
 

Applying Lambda Functions


Instead of defining a full custom function, we can use lambda functions for quick operations. Let’s apply a lambda function to calculate a percentage of the salary.


# Apply a lambda function to the 'salary' column to calculate 5% of the salary
df['salary_5_percent'] = df['salary'].apply(lambda x: x * 0.05)

print(df)

 
 
 

Applying Functions to Multiple Columns


If our custom function needs to use multiple columns in its logic, we can apply it to rows and access multiple column values from each row.


Use Case:


Suppose we have a DataFrame with “price” and “quantity”, and we want to calculate the “total” for each row.


# Create DataFrame
df = pd.DataFrame({
    'price': [100, 150, 200],
    'quantity': [5, 3, 4]
})

# Define a custom function to calculate the total for each row
def calculate_total(row):
    return row['price'] * row['quantity']

# Apply the function to rows (axis=1)
df['total'] = df.apply(calculate_total, axis=1)

print(df)

 

apply() is useful for data transformations and aggregations when built-in functions don't provide sufficient flexibility.

