Handling different types of files in Python:
In data analytics we deal with many types of data, so we need to know which file formats exist, how to handle each one, and how to analyse the data they contain. The most popular file formats are CSV, XLSX, JSON, and XML. In the next questions we will see how these formats differ from each other and how to handle them in Python.
1) Write a Python function that reads a JSON file and extracts specific key-value pairs from it.
JSON (JavaScript Object Notation) is a lightweight data format used primarily to store and exchange structured data between web services, APIs, and applications. It uses a simple syntax of key-value pairs, arrays, and nested objects, making it easy to represent complex hierarchical data structures.
Advantages of JSON over CSV:
- Structured and Nested Data: Unlike CSV, which represents data in a flat tabular structure, JSON can handle nested or hierarchical data (e.g., objects within objects, lists, etc.). This makes JSON more flexible for representing complex real-world entities.
- Human-Readable and Machine-Processable: JSON’s structure is easy for humans to read and understand, while also being efficiently processed by machines.
- Used in APIs: JSON is commonly used for transmitting data between a client and a server (especially in APIs), making it a standard in web development and modern data exchange.
- Supports Multiple Data Types: JSON supports various data types, including strings, numbers, booleans, arrays, and objects, which makes it more versatile than CSV, which stores every value as plain text.
However, JSON isn’t ideal for all use cases. If your data is strictly tabular with no need for hierarchical representation, CSV may be simpler and more efficient.
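To make the contrast concrete, here is a minimal sketch using only the standard library. The record and its field names are made up for illustration. It shows that a JSON round trip keeps nesting and data types, while a CSV round trip needs a flat row and returns every value as a string.

```python
import csv
import io
import json

# Hypothetical record with a nested object, a list, and typed values.
record = {
    "name": "Alice",
    "age": 30,
    "active": True,
    "address": {"city": "Pune", "zip": "411001"},
    "skills": ["python", "sql"],
}

# JSON keeps the nesting and the types across a round trip.
restored = json.loads(json.dumps(record))
print(restored["address"]["city"], type(restored["age"]).__name__)

# CSV needs a flat row, and every value comes back as a string.
flat = {"name": record["name"], "age": record["age"], "active": record["active"]}
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(row)
```

With CSV, the caller has to convert "30" back to an integer and decide how to encode the nested address, which is the extra work the list above refers to.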
Python code:
import json

def extract_key_value_pairs(json_file, keys):
    """
    Reads a JSON file and extracts specified key-value pairs from it.

    Arguments:
        json_file (str): Path to the JSON file.
        keys (list): List of keys to extract from each JSON object.

    Returns:
        list: A list of dictionaries with the extracted key-value pairs.
    """
    extracted_data = []
    # Open and load the JSON file
    with open(json_file, 'r') as file:
        data = json.load(file)  # Loads the JSON into a Python list of dictionaries
    # Loop through the JSON data and extract the desired key-value pairs
    for item in data:
        extracted_item = {key: item.get(key, None) for key in keys}
        extracted_data.append(extracted_item)
    return extracted_data

keys_to_extract = ['name', 'age']  # Specify the keys you want to extract
extracted_data = extract_key_value_pairs('data.json', keys_to_extract)
print(extracted_data)
How to Read JSON into a Pandas DataFrame
You can use the pd.read_json() method to load the JSON data directly into a DataFrame.
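As a minimal sketch of pd.read_json(), the snippet below parses a small JSON array from an in-memory buffer rather than a file. The records are made up for illustration; passing a file path such as 'data.json' should work the same way for a flat array of objects.

```python
import io

import pandas as pd

# Made-up records standing in for the contents of a JSON file.
json_text = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'

# read_json accepts a path or a file-like object; StringIO wraps the string.
df = pd.read_json(io.StringIO(json_text))
print(df)
```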
If the JSON data contains nested fields, we can use the pd.json_normalize() method to flatten the data.
import pandas as pd
import json

# Load JSON data
with open('data.json', 'r') as file:
    data = json.load(file)

# Flatten nested data
df = pd.json_normalize(data, sep='_')
print(df)
Advantages of Using Pandas with JSON
- Flexibility: Easy to manipulate, slice, and analyze the data.
- EDA Capabilities: Perform operations like filtering, grouping, and aggregating with simple methods.
- Integration: You can use Pandas DataFrames directly with other libraries like Matplotlib, Seaborn, or Scikit-learn for further analysis and visualization.
2) How would you process an XML file in Python?
What is XML?
XML (Extensible Markup Language) is a markup language used to store and transport data. It is designed to be both human-readable and machine-readable, allowing structured data to be shared across different systems, applications, and platforms. Unlike HTML, which has predefined tags, XML allows users to define their own tags to describe the data. It’s widely used for data exchange over the internet, especially for web services.
<bookstore>
    <book>
        <title>Python Programming</title>
        <author>John Doe</author>
        <price>29.99</price>
    </book>
    <book>
        <title>Data Science with Python</title>
        <author>Jane Doe</author>
        <price>39.99</price>
    </book>
</bookstore>
- The root element is <bookstore>, and each <book> contains child elements such as <title>, <author>, and <price>.
- XML lets you define custom tags such as <book> and <price>, which makes the language very flexible.
Processing XML files in Python can be done using various libraries, each of which offers different functionalities depending on your needs. The most common libraries are:
- ElementTree (built-in library)
- lxml (external library)
- BeautifulSoup (for easier navigation)
Extract and Modify XML Data
- Get Elements: You can use root.findall() to extract specific nodes.
- Access Attributes: Use .get() to extract attributes of an element.
- Access Text: Use .text to get the text content inside a tag.
Traversing the XML Tree:
Once you have the root, you can navigate the XML structure:
- root.findall(): Finds all elements matching a tag.
- root.find(): Finds the first occurrence of an element matching a tag.
- element.get(): Retrieves an attribute value from an element.
- element.text: Retrieves the text within an element.
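The methods above can be sketched with the built-in ElementTree module on the bookstore sample. The category attribute is a hypothetical addition so that .get() has something to read, and the 10% discount is only an illustration of modifying a node.

```python
import xml.etree.ElementTree as ET

# Bookstore sample; the "category" attribute is hypothetical.
xml_text = """
<bookstore>
    <book category="programming">
        <title>Python Programming</title>
        <author>John Doe</author>
        <price>29.99</price>
    </book>
    <book category="data">
        <title>Data Science with Python</title>
        <author>Jane Doe</author>
        <price>39.99</price>
    </book>
</bookstore>
"""

# Parse the string and get the root element (<bookstore>).
root = ET.fromstring(xml_text.strip())

# Extract: findall() for nodes, get() for attributes, .text for content.
books = []
for book in root.findall("book"):
    books.append({
        "category": book.get("category"),
        "title": book.find("title").text,
        "price": float(book.find("price").text),
    })
print(books)

# Modify: apply a 10% discount to every <price> element in place.
for price in root.iter("price"):
    price.text = f"{float(price.text) * 0.9:.2f}"
print(ET.tostring(root, encoding="unicode"))
```

To read from a file instead of a string, ET.parse(path).getroot() returns the same kind of root element.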
3) What are Python Decorators?
A Python decorator is a design pattern that allows you to modify or extend the behavior of functions or methods without changing their actual code. A decorator is a higher-order function that takes another function as an argument and returns a new function that usually adds some functionality before or after the original function executes.
In simpler terms, decorators allow you to “wrap” a function and add extra behavior to it dynamically, without modifying the function’s actual definition.
Example: Timing Decorator for Data Processing
Imagine you have a function that performs some heavy data processing, and you want to log how long it takes to run.
import time

def time_it(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"{func.__name__} took {end_time - start_time:.4f} seconds to execute.")
        return result
    return wrapper

import numpy as np

@time_it
def process_data():
    # Simulate a large data processing task
    data = np.random.rand(1000000)
    return np.mean(data)

# Call the decorated function
mean_value = process_data()
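One optional refinement, shown here as a sketch rather than as part of the original example: functools.wraps copies the wrapped function's name and docstring onto the wrapper, and time.perf_counter() is a clock intended for measuring short durations. The sum_of_squares workload is a hypothetical stand-in that avoids the NumPy dependency.

```python
import functools
import time

def time_it(func):
    @functools.wraps(func)  # keep func's __name__ and __doc__ on the wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} seconds to execute.")
        return result
    return wrapper

@time_it
def sum_of_squares(n):
    """Hypothetical workload: sum the squares of 0..n-1."""
    return sum(i * i for i in range(n))

total = sum_of_squares(1000)
print(total)
print(sum_of_squares.__name__)  # the original name, not "wrapper"
```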
4) What are *args, **kwargs in Python?
In the previous example we used *args and **kwargs. Now let’s see what they are.
1. *args: Variable Number of Positional Arguments
*args allows you to pass a variable number of non-keyword arguments (i.e., positional arguments) to a function. Inside the function, args is a tuple that contains all the passed positional arguments.
2. **kwargs: Variable Number of Keyword Arguments
**kwargs allows you to pass a variable number of keyword arguments to a function. Inside the function, kwargs is a dictionary that holds all the keyword arguments and their values.
def example_function(**kwargs):
    for key, value in kwargs.items():
        print(f"{key}: {value}")

# Calling the function with keyword arguments
example_function(name='John', age=25, city='New York')
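A matching sketch for *args, and for using both together, might look like the following. The function names total and describe_call are hypothetical helpers chosen for illustration.

```python
def total(*args):
    """Add up any number of positional arguments (args is a tuple)."""
    return sum(args)

def describe_call(*args, **kwargs):
    """Return what was received, to show how the arguments are packed."""
    return args, kwargs

print(total(1, 2, 3))

# Positional arguments land in the tuple, keyword arguments in the dict.
received_args, received_kwargs = describe_call(1, 2, name="John")
print(received_args)
print(received_kwargs)

# The same symbols unpack a list or dict at the call site.
numbers = [4, 5, 6]
options = {"city": "New York"}
print(total(*numbers))
print(describe_call(*numbers, **options))
```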
Why Use *args and **kwargs?
- Flexibility: They allow you to write functions that can handle varying numbers of arguments.
- Reusable code: They help create generic functions that can work with any number of inputs.
- Cleaner function signatures: Instead of defining a large number of parameters, *args and **kwargs simplify the function’s interface.
5) How would you pivot and unpivot data in a dataframe?
In Pandas, pivoting and unpivoting data is commonly used for reshaping a DataFrame. Here’s a breakdown of both operations:
Pivot syntax: DataFrame.pivot(index, columns, values)
Unpivot syntax: DataFrame.melt(id_vars, value_vars, var_name, value_name)
- id_vars: Column(s) to keep fixed (won’t be unpivoted).
- value_vars: Columns to unpivot.
- var_name: Name for the new ‘variable’ column created from column headers.
- value_name: Name for the ‘value’ column created from the data in the unpivoted columns.
Let’s experiment on a dataset with Date, City, and Temperature columns, where each row holds one city’s temperature on one date.
To pivot this data so that each city becomes a column and temperatures appear in respective rows:
pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)
To unpivot:
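A minimal sketch of the unpivot step with DataFrame.melt(). The sample values and city names are made up; only the Date, City, and Temperature column names come from the pivot call above. The pivot is repeated so the snippet runs on its own.

```python
import pandas as pd

# Made-up long-format data: one row per (Date, City) pair.
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "Temperature": [15, 27, 16, 28],
})

# Pivot: each city becomes a column, indexed by Date.
pivot_df = df.pivot(index="Date", columns="City", values="Temperature")
print(pivot_df)

# Unpivot: turn Date back into a column, then melt the city columns
# into a City/Temperature pair of columns.
unpivot_df = pivot_df.reset_index().melt(
    id_vars="Date",
    value_vars=["Delhi", "Mumbai"],
    var_name="City",
    value_name="Temperature",
)
print(unpivot_df)
```

The melted frame holds the same values as the original df, though the row order may differ.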
6) Explain the difference between the map(), apply(), and applymap() functions in pandas
map() is used on a Series in pandas. It applies a function to each element of the Series, which is useful for element-wise operations such as mapping values through a dictionary or transforming each element.
Example:
import pandas as pd
# Sample Series
s = pd.Series([1, 2, 3, 4])
# Using map() to square each element
squared = s.map(lambda x: x ** 2)
print(squared)
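For comparison, here is a hedged sketch of the other two functions on a made-up DataFrame. apply() passes a whole column or row to the function, while applymap() works on every individual cell. In pandas 2.1 and later applymap() is deprecated in favour of DataFrame.map(), so the sketch picks whichever is available.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: element-wise on a single column.
doubled = df["a"].map(lambda x: x * 2)

# DataFrame.apply: the function receives a whole column (axis=0)
# or a whole row (axis=1) as a Series.
col_sums = df.apply(lambda col: col.sum(), axis=0)
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise on every cell: DataFrame.map in pandas >= 2.1,
# applymap in older versions.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
squared_cells = elementwise(lambda x: x ** 2)

print(doubled, col_sums, row_sums, squared_cells, sep="\n\n")
```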