Internship ,machine learning basics

Note

Note: In this explanation, I am using the Ride Dynamic Pricing dataset from Kaggle as an example to illustrate each step of the machine learning process. You can find the dataset on Kaggle here.
https://www.kaggle.com/datasets/arashnic/dynamic-pricing-dataset

The dataset includes columns with information such as Number_of_Riders, Number_of_Drivers, Location_Category, Customer_Loyalty_Status, Number_of_Past_Rides, Average_Ratings, Time_of_Booking, Vehicle_Type, Expected_Ride_Duration, and Historical_Cost_of_Ride.

Here are the first five entries of the dataset:

Number_of_Riders	Number_of_Drivers	Location_Category	Customer_Loyalty_Status	Number_of_Past_Rides	Average_Ratings	Time_of_Booking	Vehicle_Type	Expected_Ride_Duration	Historical_Cost_of_Ride
90	45	Urban	Silver	13	4.47	Night	Premium	90	284.26
58	39	Suburban	Silver	72	4.06	Evening	Economy	43	173.87
42	31	Rural	Silver	0	3.99	Afternoon	Premium	76	329.80
89	28	Rural	Regular	67	4.31	Afternoon	Premium	134	470.20
78	22	Rural	Regular	74	3.77	Afternoon	Economy	149	579.68

This dataset provides detailed information about various features associated with ride pricing and is used here to demonstrate a machine learning workflow focused on dynamic pricing predictions

Overview

Step 1: Collect and Prepare the Data

Collect Data: First, we gather all the necessary data to train our model. This data can come from various sources, such as databases, APIs, surveys, or web scraping.
- Example: In this case, we already have the data on past rides from a cab service, which includes columns like Number_of_Riders, Number_of_Drivers, Location_Category, Customer_Loyalty_Status, etc.
Clean Data: Real-world data often contains missing values, errors, or inconsistencies. Cleaning the data involves filling in missing values, removing duplicates, correcting inaccuracies, and ensuring the data is consistent and usable.
- Example: Here, we need to handle any missing or incorrect values. For example, if some Average_Ratings entries are missing, we might fill those with an average rating. Or, if there are errors in Expected_Ride_Duration, we fix those to ensure accuracy.

Step 2: Exploratory Data Analysis (EDA)

What is EDA?: Exploratory Data Analysis (EDA) is a step to understand the data by identifying patterns, trends, and relationships. It helps us understand the features, distributions, and correlations, and spot any issues that might affect model performance.
Why do EDA?: EDA allows us to get familiar with the data and spot any patterns or trends that could improve the model’s predictions. For example, we might discover that certain features are more predictive than others or that some relationships are nonlinear, influencing our choice of model.
- Example: In the cab service dataset, EDA might reveal that the Time_of_Booking impacts the Historical_Cost_of_Ride—for example, peak booking times might be associated with higher costs. We might also find that customers with Customer_Loyalty_Status have more frequent rides, indicated by a higher Number_of_Past_Rides. These patterns help us understand which features might be most important for predicting updated prices.

Step 3: Define the Target for Prediction

Setting Up the Prediction Target: In some cases, we might need to create a new target variable. For example, if we want to predict a price increase, we define a new target that represents this increase. This new target isn’t necessarily a fixed percentage increase for every data point but rather reflects a general guideline for the model.
Why Define a New Target?: Creating a new target variable lets us train the model to meet specific business objectives (like a price increase) in a way that reflects typical trends without adhering to strict rules for each data point.
- Example: Here, we set up a new target variable, Updated_Price, representing a general price increase goal of at least 35%. This target isn’t an exact 35% increase on Historical_Cost_of_Ride for each ride but rather a guideline. This way, the model learns to predict prices that, on average, reflect this increase. For instance, if most rides in a given Location_Category require a higher fare, the model learns to account for that in the predicted Updated_Price.

Step 4: Split the Data (Train-Test Split)

What is a Train-Test Split?: We split our data into two sets: a training set to train the model and a testing set to evaluate it. This helps us check if the model performs well on new data, rather than just memorizing patterns in the training set.
Why Split the Data?: By testing the model on unseen data, we can assess its ability to generalize to new cases. This makes sure it isn’t just learning patterns unique to the training data but is flexible enough to make accurate predictions in different scenarios.
- Example: For the cab service data, we split it so that 80% of the ride records are used to train the model on predicting Updated_Price, while 20% are set aside to test the model. This lets us see if the model can consistently suggest reasonable price increases for rides that weren’t used in training.

Step 5: Choose and Train Multiple Models

What are Models?: Different algorithms, or models, allow us to predict the target variable. We can try multiple models, like linear regression, decision trees, and random forests, each of which captures relationships in different ways.
Why Use Multiple Models?: Some models work better for simple, linear relationships, while others are better for complex, nonlinear patterns. By trying multiple models, we increase our chances of finding one that fits the data well and gives reliable predictions.
- Example: With the cab service data, linear regression might be useful if the price increases are mostly proportional to features like Expected_Ride_Duration. If Location_Category and Vehicle_Type are critical in determining Updated_Price and involve complex relationships, then a decision tree or random forest might perform better. Testing each model helps us identify which approach best captures the factors affecting the increased prices.

Step 6: Evaluate and Choose the Best Model

Evaluate Performance: We evaluate each model on the test set using performance metrics (like mean squared error, mean absolute error, or R-squared) to check how close the predicted values are to the target.
Why Choose the Best Model?: Selecting the best-performing model ensures that we make accurate predictions that meet our objective, like the 35% increase in price. We choose the model that balances accuracy and the ability to generalize well, meaning it won’t just overfit to the training data but will also work for new cases.
- Example: For the cab service data, if the random forest model gives the closest predictions to the Updated_Price on the test data, we’d choose it. This means it reliably predicts increased prices in line with the 35% target, without being too specific to the training data. This model would then be used to suggest updated pricing for future rides.

SO why we use Predictions and not use our own logic to implement the update

Consider this sample code

# Function to apply the pricing strategies including demand-supply calculation

def apply_pricing_strategy_updated(row,

                                    demand_threshold=1.1,

                                    high_demand_cut_off=1.15,

                                    low_demand_cut_off=0.85,

                                    high_supply_cut_off=1.15,

                                    low_supply_cut_off=0.85):

    price = row['Historical_Cost_of_Ride']

    # Calculate Demand-Supply Ratio

    total_demand = row['Number_of_Riders']  # Total riders (demand)

    total_supply = row['Number_of_Drivers']  # Total drivers (supply)

    if total_supply > 0:  # Ensure no division by zero

        demand_supply_ratio = total_demand / total_supply

    else:

        demand_supply_ratio = float('inf')  # Set to infinity if no drivers are available

    # Increase price based on demand-supply ratio

    if demand_supply_ratio >= high_demand_cut_off:

        price *= 1.10  # 10% increase for significantly high demand

    elif demand_supply_ratio <= low_demand_cut_off:

        price *= 0.85 # 15% decrease for significantly low demand

    # Adjust for supply considerations

    if total_supply > high_supply_cut_off * total_demand:

        price *= 0.85  # 15% decrease if supply exceeds demand significantly

    elif total_supply < low_supply_cut_off * total_demand:

        price *= 1.10 # 10% increase if demand exceeds supply significantly
 

    # Customer Loyalty Pricing

    if row['Customer_Loyalty_Status'] == 'Regular':

        price *= 1.05  # 5% increase for regular customers

    elif row['Customer_Loyalty_Status'] == 'Silver':

        price *= 1.02  # 2% increase for silver customers

    # Ride Duration Adjustment

    if row['Expected_Ride_Duration'] > 60:

        price *= 1.05  # 5% increase for long rides (over 60 minutes)

    # Time of Booking Factor

    if row['Time_of_Booking'] in ['Night', 'Evening']:

        price *= 1.05  # 5% increase for high-demand times (night or evening)

    # Vehicle-Based Pricing Adjustment

    if row['Vehicle_Type'] == 'Premium':

        price *= 1.05  # 3% increase for premium vehicles

    elif row['Vehicle_Type'] == 'Economy':

        price *= 1.02  # 2% increase for economy vehicles

    return price


# Apply the updated pricing strategy to the dataset

df['new_updated_price'] = df.apply(apply_pricing_strategy_updated, axis=1)

# Calculate the total historical and new ride costs

total_historical_cost = df['Historical_Cost_of_Ride'].sum()

total_new_cost = df['new_updated_price'].sum()

# Calculate the overall percentage increase

total_percentage_increase = ((total_new_cost - total_historical_cost) / total_historical_cost) * 100


# Display total costs and percentage increase

print("Total Historical Ride Cost:", total_historical_cost)

print("Total New Updated Price:", total_new_cost)

print("Overall Percentage Increase:", total_percentage_increase)

# Optional: Calculate and print the overall total demand and supply metrics for the dataset

total_demand = df['Number_of_Riders'].sum()

total_supply = df['Number_of_Drivers'].sum()


print("Total Demand (Number of Riders):", total_demand)

print("Total Supply (Number of Drivers):", total_supply)


if total_supply > 0:

    overall_demand_supply_ratio = total_demand / total_supply

    print("Overall Demand-Supply Ratio:", overall_demand_supply_ratio)

else:

    print("Overall Demand-Supply Ratio: Undefined (no drivers available)")

Why Use Predictions Over Custom Logic for Pricing Updates

Rule-Based, Not Predictive:
- The provided code uses fixed rules (e.g., demand thresholds) to adjust prices. This means it does not learn from historical data or adapt to new trends; it simply applies pre-defined adjustments without flexibility.
No Learning from Data:
- There is no phase where a model learns from data patterns. Machine learning models analyze historical data to identify relationships and generalize these insights to new situations, which is absent in the provided logic.
Fixed Thresholds and Multipliers:
- The thresholds (e.g., high_demand_cut_off=1.15) are hard-coded and may not reflect real-world variability. A machine learning model would determine optimal thresholds based on actual data, allowing for more accurate adjustments.
Deterministic Output:
- The function produces the same output for the same input every time, leading to predictable but potentially inaccurate pricing. Predictive models, on the other hand, capture complex, data-driven relationships and adapt to changes in the data over time.
Limited Scalability:
- Rule-based systems require constant updates and revisions as market dynamics change. In contrast, machine learning models can be retrained with new data, maintaining accuracy as conditions evolve and patterns shift.
Inability to Capture Complex Relationships:
- The current logic does not account for interactions between various factors (e.g., how the combination of Time_of_Booking and Location_Category might influence price). Machine learning can uncover such interactions and provide nuanced pricing.

Summary

While the custom logic can provide a straightforward way to adjust prices, it lacks the adaptability, learning capability, and scalability of machine learning models. Predictive models can analyze complex patterns in historical data and adapt to changing conditions, resulting in more accurate and flexible pricing strategies.

Next Notes

1.Bias and Variance
2. Linear Regression