Top Python questions/ Concepts for Data Analytics (part-2)


Handling different types of files in Python:

In data analytics we deal with various types of data, so we should know which file formats are common, how to handle them, and how to analyse the data they contain. The most popular file formats are CSV, XLSX, JSON, and XML. In the next questions we will see how these formats differ from each other and how we can handle them using Python.

1) Write a Python function that reads a JSON file and extracts specific key-value pairs from it.

JSON (JavaScript Object Notation) is a lightweight data format used primarily to store and exchange structured data between web services, APIs, and applications. It uses a simple syntax of key-value pairs, arrays, and nested objects, making it easy to represent complex hierarchical data structures.

Advantages of JSON over CSV:

  1. Structured and Nested Data: Unlike CSV, which represents data in a flat tabular structure, JSON can handle nested or hierarchical data (e.g., objects within objects, lists, etc.). This makes JSON more flexible for representing complex real-world entities.
  2. Human-Readable and Machine-Processable: JSON’s structure is easy for humans to read and understand, while also being efficiently processed by machines.
  3. Used in APIs: JSON is commonly used for transmitting data between a client and a server (especially in APIs), making it a standard in web development and modern data exchange.
  4. Supports Multiple Data Types: JSON supports various data types, including strings, numbers, booleans, arrays, and objects, which makes it more versatile than CSV, which only handles text.

However, JSON isn’t ideal for all use cases. If your data is strictly tabular with no need for hierarchical representation, CSV may be simpler and more efficient.
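For concreteness, the function below expects a `data.json` file holding a list of objects. A minimal sketch of such a file, created from Python (the records and field names are made-up examples):

```python
import json

# Hypothetical records matching the keys extracted later ('name', 'age')
records = [
    {"name": "Alice", "age": 30, "city": "Paris"},
    {"name": "Bob", "age": 25, "city": "London"},
]

# Write them to data.json so the extraction function has something to read
with open("data.json", "w") as f:
    json.dump(records, f, indent=4)
```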

 

 

Python code:

import json

def extract_key_value_pairs(json_file, keys):
    """
    Reads a JSON file and extracts specified key-value pairs from it.

    Arguments:
        json_file (str): Path to the JSON file.
        keys (list): List of keys to extract from each JSON object.

    Returns:
        list: A list of dictionaries with the extracted key-value pairs.
    """
    extracted_data = []

    # Open and load the JSON file
    with open(json_file, 'r') as file:
        data = json.load(file)  # Loads the JSON into a Python list of dictionaries

    # Loop through the JSON data and extract the desired key-value pairs
    for item in data:
        extracted_item = {key: item.get(key, None) for key in keys}
        extracted_data.append(extracted_item)

    return extracted_data

keys_to_extract = ['name', 'age']  # Specify the keys you want to extract
extracted_data = extract_key_value_pairs('data.json', keys_to_extract)
print(extracted_data)

 

Output: a list of dictionaries, one per record in data.json, each containing only the requested keys.
How to Read JSON into a Pandas DataFrame

You can use the pd.read_json() method to load the JSON data directly into a DataFrame.

 

If the JSON data contains nested fields, we can use the pd.json_normalize() method to flatten the data.

 

import pandas as pd
import json

# Load JSON data
with open('data.json', 'r') as file:
    data = json.load(file)

# Flatten nested data
df = pd.json_normalize(data, sep='_')

print(df)

 
Output: a DataFrame with one column per (flattened) key.
Advantages of Using Pandas with JSON

  • Flexibility: Easy to manipulate, slice, and analyze the data.
  • EDA Capabilities: Perform operations like filtering, grouping, and aggregating with simple methods.
  • Integration: You can use Pandas DataFrames directly with other libraries like Matplotlib, Seaborn, or Scikit-learn for further analysis and visualization.
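As a self-contained sketch of the flattening step (the nested records here are made up for illustration):

```python
import pandas as pd

# Hypothetical nested records, similar to what an API might return
data = [
    {"name": "Alice", "address": {"city": "Paris", "zip": "75001"}},
    {"name": "Bob", "address": {"city": "London", "zip": "SW1"}},
]

# json_normalize() flattens nested keys into columns like 'address_city'
df = pd.json_normalize(data, sep="_")
print(sorted(df.columns))  # ['address_city', 'address_zip', 'name']
```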

2) How would you process an XML file in Python?

What is XML?

XML (Extensible Markup Language) is a markup language used to store and transport data. It is designed to be both human-readable and machine-readable, allowing structured data to be shared across different systems, applications, and platforms. Unlike HTML, which has predefined tags, XML allows users to define their own tags to describe the data. It’s widely used for data exchange over the internet, especially for web services.

<bookstore>

    <book>

        <title>Python Programming</title>

        <author>John Doe</author>

        <price>29.99</price>

    </book>

    <book>

        <title>Data Science with Python</title>

        <author>Jane Doe</author>

        <price>39.99</price>

    </book>

</bookstore>

 
  • The root element is <bookstore>, and each <book> contains child elements such as <title>, <author>, and <price>.
  • XML files can be expanded with custom tags like <book>  and <price> , making the language very flexible.    
 

Processing XML files in Python can be done using various libraries, each of which offers different functionalities depending on your needs. The most common libraries are:

  1. ElementTree (built-in library)
  2. lxml (external library)
  3. BeautifulSoup (for easier navigation)
Let's consider the below XML:

<employees>
    <employee id="1">
        <name>John Doe</name>
        <age>28</age>
        <department>Finance</department>
    </employee>
    <employee id="2">
        <name>Jane Smith</name>
        <age>32</age>
        <department>Engineering</department>
    </employee>
</employees>
 
Code to Process the XML:
import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse('data.xml')

# Get the root element of the XML
root = tree.getroot()

# Iterate over each employee element
for employee in root.findall('employee'):
    emp_id = employee.get('id')  # Get employee's id attribute
    name = employee.find('name').text  # Get employee's name
    age = employee.find('age').text    # Get employee's age
    department = employee.find('department').text  # Get employee's department

    print(f"ID: {emp_id}, Name: {name}, Age: {age}, Department: {department}")
 
 
Output will be:
ID: 1, Name: John Doe, Age: 28, Department: Finance
ID: 2, Name: Jane Smith, Age: 32, Department: Engineering
 
 

Extract and Modify XML Data

  • Get Elements: You can use root.findall() to extract specific nodes.
  • Access Attributes: Use .get() to extract attributes of an element.
  • Access Text: Use .text to get the text content inside a tag.
With BeautifulSoup:
from bs4 import BeautifulSoup

with open('data.xml', 'r') as file:
    content = file.read()

# Parse the XML
soup = BeautifulSoup(content, 'xml')

# Find all employee names
names = soup.find_all('name')

for name in names:
    print(name.text)

 

Traversing the XML Tree:

Once you have the root, you can navigate the XML structure:

  • root.findall(): Finds all elements matching a tag.
  • root.find(): Finds the first occurrence of an element matching a tag.
  • element.get(): Retrieves an attribute value from an element.
  • element.text: Retrieves the text within an element.
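These methods can be exercised without a file on disk by parsing the XML from a string, a minimal sketch using the employees XML above:

```python
import xml.etree.ElementTree as ET

xml_data = """
<employees>
    <employee id="1"><name>John Doe</name></employee>
    <employee id="2"><name>Jane Smith</name></employee>
</employees>
"""

# fromstring() returns the root element directly (no file needed)
root = ET.fromstring(xml_data)

# findall() on the root, get() for attributes, .text for element content
names = [(e.get("id"), e.find("name").text) for e in root.findall("employee")]
print(names)  # [('1', 'John Doe'), ('2', 'Jane Smith')]
```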

3) What are Python Decorators?

A Python decorator is a design pattern that allows you to modify or extend the behavior of functions or methods without changing their actual code. A decorator is a higher-order function that takes another function as an argument and returns a new function that usually adds some functionality before or after the original function executes.

In simpler terms, decorators allow you to “wrap” a function and add extra behavior to it dynamically, without modifying the function’s actual definition.

 

Example: Timing Decorator for Data Processing

Imagine you have a function that performs some heavy data processing, and you want to log how long it takes to run.

 

import time

def time_it(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"{func.__name__} took {end_time - start_time:.4f} seconds to execute.")
        return result
    return wrapper

 
Now, use this decorator on a function that processes some data, for example, calculating the mean of a large dataset:

 

import numpy as np

 

@time_it

def process_data():

    # Simulate a large data processing task

    data = np.random.rand(1000000)

    return np.mean(data)

 

# Call the decorated function

mean_value = process_data()
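One common refinement, not shown in the snippet above: wrapping the inner function with functools.wraps so the decorated function keeps its original name and docstring (the `add` function here is just an illustrative stand-in):

```python
import functools
import time

def time_it(func):
    @functools.wraps(func)  # preserves func.__name__, func.__doc__, etc.
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.4f} seconds")
        return result
    return wrapper

@time_it
def add(a, b):
    """Add two numbers."""
    return a + b

print(add(2, 3))     # prints the timing line, then 5
print(add.__name__)  # add (not wrapper)
```

Without functools.wraps, `add.__name__` would report "wrapper", which makes logs and debugging harder to read.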

 
 

4) What are *args, **kwargs in Python?

In the previous example we used *args and **kwargs; now let's see what these are!

1. *args: Variable Number of Positional Arguments

 *args allows you to pass a variable number of non-keyword arguments (i.e., positional arguments) to a function. Inside the function, args is a tuple that contains all the passed positional arguments.

 
def example_function(*args):
    for arg in args:
        print(arg)

# Calling the function with different numbers of arguments
example_function(1, 2, 3)
example_function('a', 'b', 'c', 'd')

In this example we passed three and then four arguments; the function handled both calls and printed 1, 2, 3 followed by 'a', 'b', 'c', 'd'.
 
 
2. **kwargs: Variable Number of Keyword Arguments

**kwargs allows you to pass a variable number of keyword arguments to a function. Inside the function, kwargs is a dictionary that holds all the keyword arguments and their values.

def example_function(**kwargs):
    for key, value in kwargs.items():
        print(f"{key}: {value}")

# Calling the function with keyword arguments
example_function(name='John', age=25, city='New York')

 
 
 

Why Use *args and **kwargs?

  1. Flexibility: They allow you to write functions that can handle varying numbers of arguments.
  2. Reusable code: They help create generic functions that can work with any number of inputs.
  3. Cleaner function signatures: Instead of defining a large number of parameters, *args and **kwargs simplify the function’s interface.
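A common pattern that combines both: a function that forwards whatever it receives to another function, the idiom decorators rely on (the function names here are made up for illustration):

```python
def describe(title, *args, **kwargs):
    # args collects extra positionals as a tuple, kwargs as a dict
    return f"{title}: args={args}, kwargs={kwargs}"

def forward(*args, **kwargs):
    # Pass everything through unchanged
    return describe(*args, **kwargs)

print(forward("report", 1, 2, user="ana"))
# report: args=(1, 2), kwargs={'user': 'ana'}
```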

5) How would you pivot and unpivot data in a dataframe?

In Pandas, pivoting and unpivoting data is commonly used for reshaping a DataFrame. Here’s a breakdown of both operations:

 

Pivoting reshapes a DataFrame from long to wide format: the unique values of one column become new column headers.

Syntax: DataFrame.pivot(index, columns, values)

 Unpivoting (also called melting) is the reverse of pivoting. It transforms columns into rows. This is often useful to normalize or transform a pivoted DataFrame back to a long format.
Syntax: pd.melt(DataFrame, id_vars, value_vars, var_name, value_name)
 
 
  • id_vars: Column(s) to keep fixed (won’t be unpivoted).
  • value_vars: Columns to unpivot.
  • var_name: Name for the new ‘variable’ column created from column headers.
  • value_name: Name for the ‘value’ column created from the data in the pivoted columns.   

Let's experiment on the below dataset (long format, with Date, City, and Temperature columns):
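Since the dataset image isn't reproduced here, a minimal stand-in with the same shape (the dates and temperatures are made up):

```python
import pandas as pd

# Hypothetical long-format data: one row per (Date, City) pair
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "City": ["New York", "Los Angeles", "New York", "Los Angeles"],
    "Temperature": [32, 68, 30, 70],
})
print(df)
```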

To pivot this data so that each city becomes a column and temperatures appear in respective rows:

pivot_df = df.pivot(index='Date', columns='City', values='Temperature')

print(pivot_df)

Output: the dates form the index, with one column per city.
To unpivot:
unpivot_df = pd.melt(pivot_df.reset_index(), id_vars='Date', value_vars=['Los Angeles', 'New York'],
                     var_name='City', value_name='Temperature')
print(unpivot_df)

 

6) Explain the difference between map(), apply(), and applymap() function in pandas

map() is used on a Series in pandas. It applies a function to each element of the Series, and is useful for element-wise operations such as mapping values based on a dictionary or applying a transformation to each element.

example:

import pandas as pd

 

# Sample Series

s = pd.Series([1, 2, 3, 4])

 

# Using map() to square each element

squared = s.map(lambda x: x ** 2)

print(squared)
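map() also accepts a dictionary, the value-based mapping mentioned above (the codes and names here are made up):

```python
import pandas as pd

s = pd.Series(["NY", "LA", "NY"])

# Map short codes to full names; values missing from the dict become NaN
full = s.map({"NY": "New York", "LA": "Los Angeles"})
print(full.tolist())  # ['New York', 'Los Angeles', 'New York']
```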

 
 
apply() is used on both Series and DataFrames. It lets you apply a function along either axis of a DataFrame (rows or columns), or element-wise on a Series.
example: 
s = pd.Series([1, 2, 3, 4])
 
# Using apply() to double each element
doubled = s.apply(lambda x: x * 2)
print(doubled)
 
applymap() is used only on DataFrames. It applies a function element-wise across the entire DataFrame. (Note: in pandas 2.1 and later, DataFrame.applymap() is deprecated in favor of DataFrame.map().)

example: 
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Using applymap() to square each element in the DataFrame
squared_df = df.applymap(lambda x: x ** 2)
print(squared_df)
 
