Machine Learning 101, models decoded!

Machine learning (ML) is the backbone of modern data-driven decision-making, transforming raw data into actionable insights. But what exactly is machine learning? How does it work? What are ML models, and why should we build them? In this blog, we will dive deep into these questions, explore the process of building machine learning models, and examine a real-life example from the banking sector. We’ll also touch upon a project you can build to gain fundamental knowledge, providing a step-by-step explanation.

 

 

What is Machine Learning?

 

Machine Learning is a subset of artificial intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly programmed. It’s the process of creating algorithms that can improve their performance over time as they process more data. By using past data, machine learning models can identify patterns and predict future outcomes.

 

What’s a Model?

 

A model in machine learning is essentially a mathematical representation of the relationship between input data (features) and output data (target or predictions). The goal of any ML model is to learn from data, so it can generalize to unseen data and make accurate predictions. Models can be used to classify objects, make recommendations, predict future values, and much more.
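
To make this concrete, here is a minimal sketch of a model learning the relationship between a single input feature and a target, assuming scikit-learn is installed (the numbers are made up for illustration):

```python
from sklearn.linear_model import LinearRegression

# Toy data: one input feature (hours studied) and a target (test score)
X = [[1], [2], [3], [4], [5]]
y = [52, 58, 63, 70, 74]

model = LinearRegression()
model.fit(X, y)              # learn the feature-to-target relationship

print(model.predict([[6]]))  # generalize to an unseen input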

 

Why Build These Models?

 

Building machine learning models helps businesses and industries make informed, data-backed decisions. Models can automate routine tasks, forecast future events, and even uncover insights that humans might miss. From recommending what you should buy on e-commerce websites to predicting stock market trends, machine learning models are transforming industries.

 

Who Makes These Models?

 

These models are typically built by data scientists, machine learning engineers, and data engineers. Data scientists focus on statistical analysis, data cleaning, and model building, while ML engineers deploy and scale these models into production. The roles often overlap depending on the size and maturity of the company.

 

Are Machine Learning Models Apps?

 

Machine learning models themselves aren’t exactly apps, but they become useful once embedded into applications. In technical terms, a machine learning model is a mathematical algorithm that makes predictions or decisions based on input data. When these models are deployed into an environment where users or other systems can interact with them (via APIs, web apps, etc.), they become part of a larger app or service.

 

For example, later in this blog we will build a project using the Kaggle Boston House Price dataset, where a machine learning model (a regression model) predicts house prices based on various features (like crime rate, number of rooms, etc.). While the model is not an app itself, we can wrap it in a Flask web app to make it accessible via a browser or API. In that sense, the model becomes part of an application.

 

How Do Machine Learning Models Work?

 

Machine learning models work by learning patterns in data during a training phase. Here’s a breakdown of how they function:

  1. Training: The model learns from historical data (for banks: customers’ income, loan history, previous loan defaults). It identifies relationships between features (like income vs. loan amount) and the target variable (whether a customer is eligible for a loan or not).

  2. Input/Prediction: Once trained, the model can accept new input data and predict an output.

  3. Deployment: To make the model available to end users or systems, it is deployed in an environment (like a web app or an API), where it can be accessed for predictions.
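
Putting those three steps together, here is a minimal sketch of the train-then-predict cycle using the bank example above; the column names and values are invented for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1. Training: learn from (hypothetical) historical customer data
history = pd.DataFrame({
    "income":      [30000, 85000, 42000, 120000, 25000],
    "loan_amount": [10000, 20000, 15000, 50000, 12000],
    "eligible":    [0, 1, 1, 1, 0],  # 1 = loan approved, 0 = rejected
})
X, y = history[["income", "loan_amount"]], history["eligible"]
model = LogisticRegression().fit(X, y)

# 2. Input/Prediction: score a new applicant
applicant = pd.DataFrame({"income": [60000], "loan_amount": [18000]})
print(model.predict(applicant))  # [1] or [0]

# 3. Deployment would expose this predict() call behind an API or web app.
```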

 

The Importance of Statistics in Machine Learning:

Statistics play a crucial role in machine learning, helping to understand data distributions, relationships, and model performance. For example, a fundamental concept in model building is correlation, which measures the relationship between variables.

 

Example: Linear Regression

In Linear Regression, statistics are used to understand the relationship between independent variables (e.g., square footage, number of rooms) and the dependent variable (e.g., house price). By using least squares estimation, we minimize the error and create a best-fit line to predict future house prices.
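
As a quick illustration, the least squares fit and the correlation described above can be computed with NumPy (the square-footage and price numbers are invented):

```python
import numpy as np

# Hypothetical data: square footage vs. house price (in $1000s)
sqft  = np.array([850, 1200, 1500, 1800, 2100])
price = np.array([180, 240, 290, 340, 400])

# Least squares: the slope and intercept that minimize squared error
slope, intercept = np.polyfit(sqft, price, deg=1)

# Correlation between the two variables
r = np.corrcoef(sqft, price)[0, 1]

print(f"price = {slope:.3f} * sqft + {intercept:.1f} (r = {r:.3f})")
```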

 

Where Are Models Stored?

 

Machine learning models can be stored in various places depending on the use case, scale, and infrastructure:

  1. Local Storage: For small-scale or testing purposes, models can be stored on local machines as .pkl or .joblib files. These are serialized versions of the models that can be loaded into memory when needed (see the sketch after this list).

  2. Cloud Storage: For scalable and production-ready systems, models are typically stored in the cloud using services like AWS S3, Google Cloud Storage, or Azure Blob Storage. The model is retrieved from the cloud when needed by an application or service.

  3. Model Repositories: Larger organizations or MLOps environments use model registries like MLflow or Amazon SageMaker to store and manage different versions of models.

  4. Databases: Sometimes, models or their metadata are stored in databases like PostgreSQL or MongoDB. However, this is less common for the model itself and more for storing details about the model, such as metrics, versions, and performance data.
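
For point 1, here is a minimal serialization sketch with joblib; the file name is arbitrary:

```python
import joblib
from sklearn.linear_model import LinearRegression

# Train a tiny model so there is something to save
model = LinearRegression().fit([[1], [2], [3]], [10, 20, 30])

# Serialize the model to local storage
joblib.dump(model, "house_price_model.joblib")

# Later (e.g., inside a web app), load it back into memory
restored = joblib.load("house_price_model.joblib")
print(restored.predict([[4]]))  # about [40.]
```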

 

How Are Models Accessed?

 

Once a model is deployed, it can be accessed in several ways:

  1. API: The model is often deployed as part of a web service or an API. You send a request with input data, and the model processes this data and returns a prediction as the response. In our Boston House Price Prediction project, the model can be deployed on a platform like Heroku and accessed through an API.

    • Example: An API might accept features like the number of rooms and crime rate as input in JSON format. It then returns the predicted house price as the output (a minimal client sketch follows this list).

  2. Web Application: Models can be integrated into web applications. In the case of our project, we can build a Flask web app that allows users to input data (e.g., house features) through a form on a website. When they hit “submit,” the app sends the data to the model, and the predicted house price is displayed on the web page.

  3. Batch Processing: Models can also be accessed for bulk or batch predictions. For example, banks might use a model to assess the risk level of thousands of customers at once, processing the data in a large batch.

  4. Real-Time Prediction: In cases like fraud detection, models might be deployed in environments where they can make real-time predictions, immediately analyzing data and flagging suspicious transactions as they happen.
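
To give a feel for option 1, here is a client-side sketch; the URL and JSON field names are hypothetical and depend entirely on how the API was built:

```python
import requests

# Hypothetical endpoint of a deployed prediction API
url = "https://house-price-app.example.com/predict"

# Features sent as JSON; field names must match what the API expects
payload = {"rooms": 6, "crime_rate": 0.2}

response = requests.post(url, json=payload)
print(response.json())  # e.g. {"predicted_price": 24.7}
```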

Now let’s look at real-life use cases! Let’s find out how banks use data science and machine learning.
 

Data Science in the Banking Sector: A Deep Dive

 

Data and Use Cases in Banking

In the banking sector, machine learning is used for various applications such as fraud detection, loan approval, credit scoring, and customer segmentation. Let’s break down how this works in a detailed manner:

 

1. Types of Data Used in Banking

Banks handle vast amounts of data, which can be categorized into several types:

  • Customer Data: This includes personal information such as name, address, income, employment details, and credit history.

  • Transaction Data: Detailed records of customer transactions, such as withdrawals, deposits, transfers, and purchases.

  • Loan Data: Information on loans applied for and granted, including the amount, interest rates, tenure, and repayment behavior.

  • Behavioral Data: Data collected from user interactions, such as how often customers log in to mobile banking apps, their spending habits, and saving trends.

  • Third-Party Data: Data from external sources like credit bureaus, social media, or fintech partners to improve risk assessment or customer profiling.

Where Is This Data Stored?
  • Data Warehouses and Lakes: For massive storage and structured data, banks rely on data warehouses like Amazon Redshift or Google BigQuery. For unstructured and semi-structured data, data lakes such as Azure Data Lake or Amazon S3 are used.

  • MDM (Master Data Management): Banks also use MDM tools (like Informatica or SAP MDM) to ensure data consistency across departments.

  • Cloud Platforms: Banks often use cloud platforms like AWS, Microsoft Azure, or Google Cloud for scalability and security. In cloud environments, data is stored securely and access can be managed based on roles, ensuring compliance with data governance policies.

 

Teams Involved in ML Model Development in Banks:

Machine learning models in banking are developed through collaboration between various teams:

  • Data Engineers: Handle data ingestion, cleaning, and storage, ensuring that data pipelines are robust and scalable.

  • Data Scientists: Build models using statistical and machine learning techniques. They handle feature engineering, model selection, and evaluation.

  • ML Engineers: Focus on deploying models into production environments. They work on optimizing the models for performance and scalability.

  • DevOps Engineers: Responsible for managing the infrastructure, ensuring seamless deployment, and automating various parts of the model lifecycle (MLOps).

 
 

Common Machine Learning Models Used in Banking:

 

1. Fraud Detection Using Classification Models

Fraud detection is one of the most critical applications in banking. It uses classification models like Random Forests or XGBoost to detect fraudulent transactions by analyzing patterns and anomalies in transaction data. The model flags transactions as either “fraudulent” or “non-fraudulent.”

Example: A customer makes multiple large transactions within a short period, which deviates from their normal behavior. A classification model can flag this transaction for further review.
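
A minimal sketch of such a classifier, using invented transaction data:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical transactions: amount, transactions in the last hour,
# and a label (1 = fraudulent, 0 = legitimate)
data = pd.DataFrame({
    "amount":         [25, 3200, 40, 5100, 18, 4700],
    "txns_last_hour": [1, 6, 2, 8, 1, 7],
    "is_fraud":       [0, 1, 0, 1, 0, 1],
})
X, y = data[["amount", "txns_last_hour"]], data["is_fraud"]

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Flag a new transaction that deviates from normal behavior
suspicious = pd.DataFrame({"amount": [4900], "txns_last_hour": [9]})
print(clf.predict(suspicious))  # [1] -> flagged for review
```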

 

2. Credit Scoring Using Logistic Regression

Credit scoring models evaluate the likelihood of a customer defaulting on a loan. Logistic regression is commonly used for binary classification, where the outcome is either default or no default.

Example: Predict whether a customer with a certain income level, employment status, and credit history will default on a personal loan.
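
A minimal sketch, again with invented data; predict_proba gives the estimated probability of default rather than just a yes/no label:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical applicants: income, years employed, past defaults, outcome
data = pd.DataFrame({
    "income":         [25000, 90000, 40000, 120000, 30000],
    "years_employed": [1, 10, 3, 15, 2],
    "past_defaults":  [2, 0, 1, 0, 1],
    "defaulted":      [1, 0, 1, 0, 1],  # 1 = defaulted on the loan
})
X = data[["income", "years_employed", "past_defaults"]]
y = data["defaulted"]
model = LogisticRegression().fit(X, y)

applicant = pd.DataFrame(
    {"income": [55000], "years_employed": [4], "past_defaults": [0]}
)
print(model.predict_proba(applicant)[0, 1])  # estimated default probability
```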

 

3. Customer Segmentation Using Clustering

Banks often segment customers based on their transaction data, demographic data, and financial behavior. K-Means Clustering is widely used for grouping customers into segments for personalized marketing campaigns.

Example: Group customers into clusters such as high-net-worth individuals, middle-income earners, and young professionals to tailor financial products for each group.
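
A minimal clustering sketch with invented customer data:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customers: annual income and average monthly spend
customers = pd.DataFrame({
    "income": [250000, 45000, 52000, 300000, 38000, 60000],
    "spend":  [12000, 2500, 2800, 15000, 1900, 3200],
})

# Group customers into 2 segments for targeted campaigns
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
customers["segment"] = kmeans.labels_
print(customers)
```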

 

Tools and Technologies for ML in Banking:

  • Analytics Platforms: Banks use SAS, Tableau, and Power BI for visualization and dashboards.

  • Data Processing: Python libraries like Pandas and NumPy are used for data manipulation, while Apache Spark is used for processing large datasets.

  • Cloud Services: Banks are increasingly moving toward cloud services like Amazon SageMaker, Google AI Platform, and Azure Machine Learning for developing, training, and deploying ML models.

  • MLOps Tools: MLflow and Kubeflow are popular tools for managing machine learning models in production, tracking experiments, and automating deployments.

 
 

An End-to-End Machine Learning Project:

 

In this section, we will walk through an entire machine learning project from start to finish: predicting house prices in Boston using a well-known dataset. The process covers data collection, data cleaning, exploratory data analysis (EDA), model building, model evaluation, and finally deploying the model as a web application. Let’s dive in step by step!

 

1. Data Collection

The first step in any machine learning project is acquiring the data. For this project, we will use the Boston House Price Dataset, which contains information on housing prices in different neighborhoods of Boston. The data can be downloaded from platforms like Kaggle.

  • Tools: We’ll use Pandas for handling the data once we have it, allowing us to load and manipulate the dataset.
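
A minimal loading sketch; the exact file name depends on your Kaggle download:

```python
import pandas as pd

# Load the dataset (adjust the file name to match your download)
df = pd.read_csv("boston_house_prices.csv")

print(df.shape)   # rows and columns
print(df.head())  # first few records
```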

2. Data Cleaning

After collecting the data, it’s time to clean it. This involves removing or filling any missing values, handling outliers, and ensuring the dataset is in the right format.

We can identify missing values using Pandas and either fill them with a suitable value (like the mean) or drop them altogether if they’re not significant.

We may also want to drop unnecessary columns that won’t contribute to the model’s performance.
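
Continuing with the df loaded in step 1, a cleaning sketch might look like this (the dropped column name is purely illustrative):

```python
# Count missing values per column
print(df.isnull().sum())

# Fill numeric gaps with the column mean...
df = df.fillna(df.mean(numeric_only=True))

# ...or drop columns that add no value (the name here is hypothetical)
df = df.drop(columns=["unnecessary_column"], errors="ignore")
```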

 

3. Exploratory Data Analysis (EDA)

EDA is crucial to understanding the relationships between different features in the dataset. We’ll use visualization libraries like Matplotlib and Seaborn to create visualizations such as histograms, box plots, and correlation matrices to explore patterns.

  • Correlation: We’ll calculate the correlation between different features to see which ones have the most influence on the house prices.
  • Categorical Data: If there are categorical features, such as neighborhood types, we may need to convert them into numerical values using techniques like one-hot encoding.
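
Continuing the walkthrough, a short EDA sketch; MEDV (median home value) is the usual target column in this dataset, but check your copy:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Distribution of the target variable
sns.histplot(df["MEDV"])
plt.show()

# Correlation matrix: which features move with price?
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# One-hot encode any categorical columns
df = pd.get_dummies(df)
```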
 

4. Model Building

Now that the data is clean and well-understood, we can move on to building the model. For this project, we will use a regression model since we are predicting continuous values (house prices).

We will use Scikit-Learn (sklearn) to split the dataset into training and testing sets, build a regression model, and evaluate its performance.
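
Continuing with the cleaned df, a minimal build might look like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Separate the features from the target (column name may differ)
X = df.drop(columns=["MEDV"])
y = df["MEDV"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
```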

 

5. Checking for Overfitting and Underfitting

We want to ensure that the model generalizes well and doesn’t overfit the training data. We’ll evaluate the model on both the training and test sets using metrics like Mean Squared Error (MSE) and R-squared.
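
Using the split from the previous step:

```python
from sklearn.metrics import mean_squared_error, r2_score

for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    preds = model.predict(X_)
    print(name,
          "MSE:", round(mean_squared_error(y_, preds), 2),
          "R^2:", round(r2_score(y_, preds), 3))

# A much better train score than test score suggests overfitting;
# weak scores on both suggest underfitting.
```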

 

6. Model Optimization

If the model performs poorly, we may need to tune hyperparameters or use more advanced techniques like cross-validation to avoid overfitting or underfitting. Additionally, we might experiment with different types of regression models, such as Ridge or Lasso Regression, which include regularization terms.
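
An optimization sketch; the alpha values here are starting points to tune, not recommendations:

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of a Ridge (L2-regularized) model
ridge_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("Ridge mean R^2:", ridge_scores.mean())

# Lasso (L1) can shrink weak features' coefficients all the way to zero
lasso_scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5, scoring="r2")
print("Lasso mean R^2:", lasso_scores.mean())
```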

 

7. Creating a Web Application

Now that the model is ready, the next step is to make it accessible. We will create a simple Flask web app that allows users to input data (house features), and the app will return the predicted price.

We’ll create an HTML form for user input and use Flask to process these inputs, pass them to the machine learning model, and display the prediction.
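
Here is a trimmed-down Flask sketch, assuming the trained model was saved with joblib and (for brevity) uses only two features; a real app would collect every feature the model expects, and templates/index.html would hold the form:

```python
import joblib
import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)
model = joblib.load("house_price_model.joblib")  # saved during training

@app.route("/", methods=["GET", "POST"])
def predict():
    prediction = None
    if request.method == "POST":
        # Form field names must match the inputs in templates/index.html
        features = pd.DataFrame([{
            "RM":   float(request.form["rooms"]),
            "CRIM": float(request.form["crime_rate"]),
        }])
        prediction = round(model.predict(features)[0], 2)
    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
```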

 

8. Deployment Options

Once the web app is built, the next step is deploying it so others can access it.

Deploying on Heroku

Heroku is a platform-as-a-service where you can deploy your apps. However, Heroku has discontinued its free tier, so deploying there is now paid. If you have Azure credits, you can deploy on Azure at no extra cost.

  • Azure Deployment: With Azure, you can use Azure App Services to deploy the model. It’s free for small projects, especially if you have free credits from an Azure subscription.

  • Local Deployment: If you want to keep it simple, you can also run the app on your local machine. Users can access it via your local network or tools like ngrok.

 
That’s it for now! We will dive deeper into data science, statistics, machine learning, and much more in future articles. I hope you found this blog valuable and learned something new about the world of machine learning. If you’re interested in more, feel free to check out my other articles on topics like Data Science, SQL, interview prep, and more. Stay curious, and keep exploring!
