Internship ,machine learning basics
Note: In this explanation, I am using the Ride Dynamic Pricing dataset from Kaggle as an example to illustrate each step of the machine learning process. You can find the dataset on Kaggle here.
https://www.kaggle.com/datasets/arashnic/dynamic-pricing-dataset
The dataset includes columns with information such as Number_of_Riders, Number_of_Drivers, Location_Category, Customer_Loyalty_Status, Number_of_Past_Rides, Average_Ratings, Time_of_Booking, Vehicle_Type, Expected_Ride_Duration, and Historical_Cost_of_Ride.
Here are the first five entries of the dataset:
| Number_of_Riders | Number_of_Drivers | Location_Category | Customer_Loyalty_Status | Number_of_Past_Rides | Average_Ratings | Time_of_Booking | Vehicle_Type | Expected_Ride_Duration | Historical_Cost_of_Ride | |
|---|---|---|---|---|---|---|---|---|---|---|
| 90 | 45 | Urban | Silver | 13 | 4.47 | Night | Premium | 90 | 284.26 | |
| 58 | 39 | Suburban | Silver | 72 | 4.06 | Evening | Economy | 43 | 173.87 | |
| 42 | 31 | Rural | Silver | 0 | 3.99 | Afternoon | Premium | 76 | 329.80 | |
| 89 | 28 | Rural | Regular | 67 | 4.31 | Afternoon | Premium | 134 | 470.20 | |
| 78 | 22 | Rural | Regular | 74 | 3.77 | Afternoon | Economy | 149 | 579.68 |
This dataset provides detailed information about various features associated with ride pricing and is used here to demonstrate a machine learning workflow focused on dynamic pricing predictions
Overview
Step 1: Collect and Prepare the Data
-
Collect Data: First, we gather all the necessary data to train our model. This data can come from various sources, such as databases, APIs, surveys, or web scraping.
- Example: In this case, we already have the data on past rides from a cab service, which includes columns like
Number_of_Riders,Number_of_Drivers,Location_Category,Customer_Loyalty_Status, etc.
- Example: In this case, we already have the data on past rides from a cab service, which includes columns like
-
Clean Data: Real-world data often contains missing values, errors, or inconsistencies. Cleaning the data involves filling in missing values, removing duplicates, correcting inaccuracies, and ensuring the data is consistent and usable.
- Example: Here, we need to handle any missing or incorrect values. For example, if some
Average_Ratingsentries are missing, we might fill those with an average rating. Or, if there are errors inExpected_Ride_Duration, we fix those to ensure accuracy.
- Example: Here, we need to handle any missing or incorrect values. For example, if some
Step 2: Exploratory Data Analysis (EDA)
-
What is EDA?: Exploratory Data Analysis (EDA) is a step to understand the data by identifying patterns, trends, and relationships. It helps us understand the features, distributions, and correlations, and spot any issues that might affect model performance.
-
Why do EDA?: EDA allows us to get familiar with the data and spot any patterns or trends that could improve the model’s predictions. For example, we might discover that certain features are more predictive than others or that some relationships are nonlinear, influencing our choice of model.
- Example: In the cab service dataset, EDA might reveal that the
Time_of_Bookingimpacts theHistorical_Cost_of_Ride—for example, peak booking times might be associated with higher costs. We might also find that customers withCustomer_Loyalty_Statushave more frequent rides, indicated by a higherNumber_of_Past_Rides. These patterns help us understand which features might be most important for predicting updated prices.
- Example: In the cab service dataset, EDA might reveal that the
Step 3: Define the Target for Prediction
-
Setting Up the Prediction Target: In some cases, we might need to create a new target variable. For example, if we want to predict a price increase, we define a new target that represents this increase. This new target isn’t necessarily a fixed percentage increase for every data point but rather reflects a general guideline for the model.
-
Why Define a New Target?: Creating a new target variable lets us train the model to meet specific business objectives (like a price increase) in a way that reflects typical trends without adhering to strict rules for each data point.
- Example: Here, we set up a new target variable,
Updated_Price, representing a general price increase goal of at least 35%. This target isn’t an exact 35% increase onHistorical_Cost_of_Ridefor each ride but rather a guideline. This way, the model learns to predict prices that, on average, reflect this increase. For instance, if most rides in a givenLocation_Categoryrequire a higher fare, the model learns to account for that in the predictedUpdated_Price.
- Example: Here, we set up a new target variable,
Step 4: Split the Data (Train-Test Split)
-
What is a Train-Test Split?: We split our data into two sets: a training set to train the model and a testing set to evaluate it. This helps us check if the model performs well on new data, rather than just memorizing patterns in the training set.
-
Why Split the Data?: By testing the model on unseen data, we can assess its ability to generalize to new cases. This makes sure it isn’t just learning patterns unique to the training data but is flexible enough to make accurate predictions in different scenarios.
- Example: For the cab service data, we split it so that 80% of the ride records are used to train the model on predicting
Updated_Price, while 20% are set aside to test the model. This lets us see if the model can consistently suggest reasonable price increases for rides that weren’t used in training.
- Example: For the cab service data, we split it so that 80% of the ride records are used to train the model on predicting
Step 5: Choose and Train Multiple Models
-
What are Models?: Different algorithms, or models, allow us to predict the target variable. We can try multiple models, like linear regression, decision trees, and random forests, each of which captures relationships in different ways.
-
Why Use Multiple Models?: Some models work better for simple, linear relationships, while others are better for complex, nonlinear patterns. By trying multiple models, we increase our chances of finding one that fits the data well and gives reliable predictions.
- Example: With the cab service data, linear regression might be useful if the price increases are mostly proportional to features like
Expected_Ride_Duration. IfLocation_CategoryandVehicle_Typeare critical in determiningUpdated_Priceand involve complex relationships, then a decision tree or random forest might perform better. Testing each model helps us identify which approach best captures the factors affecting the increased prices.
- Example: With the cab service data, linear regression might be useful if the price increases are mostly proportional to features like
Step 6: Evaluate and Choose the Best Model
-
Evaluate Performance: We evaluate each model on the test set using performance metrics (like mean squared error, mean absolute error, or R-squared) to check how close the predicted values are to the target.
-
Why Choose the Best Model?: Selecting the best-performing model ensures that we make accurate predictions that meet our objective, like the 35% increase in price. We choose the model that balances accuracy and the ability to generalize well, meaning it won’t just overfit to the training data but will also work for new cases.
- Example: For the cab service data, if the random forest model gives the closest predictions to the
Updated_Priceon the test data, we’d choose it. This means it reliably predicts increased prices in line with the 35% target, without being too specific to the training data. This model would then be used to suggest updated pricing for future rides.
- Example: For the cab service data, if the random forest model gives the closest predictions to the
SO why we use Predictions and not use our own logic to implement the update
Consider this sample code
# Function to apply the pricing strategies including demand-supply calculation
def apply_pricing_strategy_updated(row,
demand_threshold=1.1,
high_demand_cut_off=1.15,
low_demand_cut_off=0.85,
high_supply_cut_off=1.15,
low_supply_cut_off=0.85):
price = row['Historical_Cost_of_Ride']
# Calculate Demand-Supply Ratio
total_demand = row['Number_of_Riders'] # Total riders (demand)
total_supply = row['Number_of_Drivers'] # Total drivers (supply)
if total_supply > 0: # Ensure no division by zero
demand_supply_ratio = total_demand / total_supply
else:
demand_supply_ratio = float('inf') # Set to infinity if no drivers are available
# Increase price based on demand-supply ratio
if demand_supply_ratio >= high_demand_cut_off:
price *= 1.10 # 10% increase for significantly high demand
elif demand_supply_ratio <= low_demand_cut_off:
price *= 0.85 # 15% decrease for significantly low demand
# Adjust for supply considerations
if total_supply > high_supply_cut_off * total_demand:
price *= 0.85 # 15% decrease if supply exceeds demand significantly
elif total_supply < low_supply_cut_off * total_demand:
price *= 1.10 # 10% increase if demand exceeds supply significantly
# Customer Loyalty Pricing
if row['Customer_Loyalty_Status'] == 'Regular':
price *= 1.05 # 5% increase for regular customers
elif row['Customer_Loyalty_Status'] == 'Silver':
price *= 1.02 # 2% increase for silver customers
# Ride Duration Adjustment
if row['Expected_Ride_Duration'] > 60:
price *= 1.05 # 5% increase for long rides (over 60 minutes)
# Time of Booking Factor
if row['Time_of_Booking'] in ['Night', 'Evening']:
price *= 1.05 # 5% increase for high-demand times (night or evening)
# Vehicle-Based Pricing Adjustment
if row['Vehicle_Type'] == 'Premium':
price *= 1.05 # 3% increase for premium vehicles
elif row['Vehicle_Type'] == 'Economy':
price *= 1.02 # 2% increase for economy vehicles
return price
# Apply the updated pricing strategy to the dataset
df['new_updated_price'] = df.apply(apply_pricing_strategy_updated, axis=1)
# Calculate the total historical and new ride costs
total_historical_cost = df['Historical_Cost_of_Ride'].sum()
total_new_cost = df['new_updated_price'].sum()
# Calculate the overall percentage increase
total_percentage_increase = ((total_new_cost - total_historical_cost) / total_historical_cost) * 100
# Display total costs and percentage increase
print("Total Historical Ride Cost:", total_historical_cost)
print("Total New Updated Price:", total_new_cost)
print("Overall Percentage Increase:", total_percentage_increase)
# Optional: Calculate and print the overall total demand and supply metrics for the dataset
total_demand = df['Number_of_Riders'].sum()
total_supply = df['Number_of_Drivers'].sum()
print("Total Demand (Number of Riders):", total_demand)
print("Total Supply (Number of Drivers):", total_supply)
if total_supply > 0:
overall_demand_supply_ratio = total_demand / total_supply
print("Overall Demand-Supply Ratio:", overall_demand_supply_ratio)
else:
print("Overall Demand-Supply Ratio: Undefined (no drivers available)")
Why Use Predictions Over Custom Logic for Pricing Updates
-
Rule-Based, Not Predictive:
- The provided code uses fixed rules (e.g., demand thresholds) to adjust prices. This means it does not learn from historical data or adapt to new trends; it simply applies pre-defined adjustments without flexibility.
-
No Learning from Data:
- There is no phase where a model learns from data patterns. Machine learning models analyze historical data to identify relationships and generalize these insights to new situations, which is absent in the provided logic.
-
Fixed Thresholds and Multipliers:
- The thresholds (e.g.,
high_demand_cut_off=1.15) are hard-coded and may not reflect real-world variability. A machine learning model would determine optimal thresholds based on actual data, allowing for more accurate adjustments.
- The thresholds (e.g.,
-
Deterministic Output:
- The function produces the same output for the same input every time, leading to predictable but potentially inaccurate pricing. Predictive models, on the other hand, capture complex, data-driven relationships and adapt to changes in the data over time.
-
Limited Scalability:
- Rule-based systems require constant updates and revisions as market dynamics change. In contrast, machine learning models can be retrained with new data, maintaining accuracy as conditions evolve and patterns shift.
-
Inability to Capture Complex Relationships:
- The current logic does not account for interactions between various factors (e.g., how the combination of
Time_of_BookingandLocation_Categorymight influence price). Machine learning can uncover such interactions and provide nuanced pricing.
- The current logic does not account for interactions between various factors (e.g., how the combination of
Summary
While the custom logic can provide a straightforward way to adjust prices, it lacks the adaptability, learning capability, and scalability of machine learning models. Predictive models can analyze complex patterns in historical data and adapt to changing conditions, resulting in more accurate and flexible pricing strategies.
Next Notes
1.Bias and Variance
2. Linear Regression