
🚗 A Deep Dive into Car Price Modeling

  • Writer: Alvaro Mejia
  • Dec 22, 2025
  • 3 min read

This project is an evolution of a car price prediction exercise initially developed during the HACKABOSS Data Science bootcamp. Rather than treating it as a one-off assignment, I revisited it with a clear goal: turn a training project into something closer to a real-world data science workflow.

The focus wasn’t just on achieving a good metric, but on understanding the data, its limitations, and the trade-offs involved in building a usable model.


Role: Data Scientist, ML Engineer & Developer

Team Size: 1 developer

Duration: 1 month

Stack: Python · Streamlit · Pandas · Plotly · Scikit-learn



📊 The Data: Rich, Messy, and Full of Extremes

The dataset includes technical, categorical, and numerical information about vehicles:

  • Mileage, power, dimensions, fuel type

  • Registration dates

  • Consumption and emissions

  • Body type, transmission, and configuration details

One of the first challenges was the extreme skewness of key variables like price, mileage, and power. Raw distributions were heavily right-tailed, making both EDA and modeling unreliable.

👉 Solution: Log-transformations (log1p) were applied strategically to stabilize variance and improve interpretability during modeling, while keeping the original values available for EDA.
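
Here's a minimal sketch of that idea using NumPy's log1p; the rows and column names are illustrative, not the actual dataset schema:

```python
import numpy as np
import pandas as pd

# Illustrative rows and column names; the real dataset schema may differ.
df = pd.DataFrame({
    "price": [8500, 12000, 450000],        # heavily right-tailed
    "mileage_km": [120000, 60000, 500],
    "power_hp": [90, 140, 700],
})

# log1p = log(1 + x): safe at zero and compresses the right tail.
for col in ["price", "mileage_km", "power_hp"]:
    df[f"{col}_log"] = np.log1p(df[col])

# Keep the originals for EDA; model on the *_log columns.
# Predictions map back to the original scale with np.expm1, the inverse of log1p.
```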


🚨 Outliers: The Core Challenge

Outliers turned out to be one of the most important aspects of this project.

Instead of simply removing extreme values, I:

  • Identified outliers using Tukey’s fences

  • Quantified their presence per variable

  • Visualized their impact across different filters (e.g. body type, emissions label)

This revealed that many outliers are not errors — they represent premium vehicles, edge configurations, or niche markets. As a result, outliers were analyzed, but not aggressively removed.
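
A minimal sketch of Tukey's fences for quantifying (rather than dropping) outliers; the DataFrame and column names below are stand-ins for the real vehicle data:

```python
import pandas as pd

# Illustrative data; in the project this is the full vehicle DataFrame.
df = pd.DataFrame({
    "price": [8500, 12000, 9500, 11000, 450000],
    "mileage_km": [120000, 60000, 80000, 95000, 500],
})

def tukey_fences(s: pd.Series, k: float = 1.5):
    """Lower and upper Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quantify outliers per variable instead of dropping them.
for col in df.columns:
    low, high = tukey_fences(df[col])
    mask = (df[col] < low) | (df[col] > high)
    print(f"{col}: {mask.sum()} outliers ({mask.mean():.1%})")
```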


🧠 Feature Engineering

Some features required domain-driven transformation:

  • car_age derived from registration year

  • Binary encoding for automatic transmission

  • Log-transformed mileage and power

  • Careful handling of numeric-but-categorical variables
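
A sketch of what these transformations look like in Pandas; the raw column names are hypothetical stand-ins for the real listing data:

```python
import numpy as np
import pandas as pd

# Hypothetical raw columns standing in for the real listing data.
raw = pd.DataFrame({
    "registration_year": [2015, 2019, 2008],
    "transmission": ["manual", "automatic", "manual"],
    "mileage_km": [120000, 40000, 210000],
    "power_hp": [110, 150, 90],
})

CURRENT_YEAR = 2025
features = pd.DataFrame({
    "car_age": CURRENT_YEAR - raw["registration_year"],
    "is_automatic": (raw["transmission"] == "automatic").astype(int),
    "mileage_log": np.log1p(raw["mileage_km"]),
    "power_log": np.log1p(raw["power_hp"]),
})
```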

Feature importance analysis showed that a small subset of features explained most of the variance: car age, power (HP), transmission type (manual or automatic), and mileage (in kilometers).
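
One way to surface such a ranking with scikit-learn (a sketch; the post doesn't specify the exact method used, and the prices below are invented purely to make the example run):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# 'features' is the engineered frame from the sketch above;
# these log-prices are invented purely to make the example run.
y = np.log1p([9500.0, 21000.0, 4200.0])

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(features, y)

for name, score in sorted(zip(features.columns, rf.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```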

This allowed the creation of a reduced model with:

  • Fewer inputs

  • Faster inference

  • Much smaller deployment complexity


🤖 Modeling & Trade-offs

Multiple models were trained and compared:

  • Random Forest

  • Gradient Boosting

  • KNN

  • Decision Trees

While ensemble models achieved the best performance (R² ≈ 0.88), they introduced a new constraint: the model was too heavy for GitHub.
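
A sketch of how such a comparison can be set up with scikit-learn; synthetic data stands in for the real feature matrix, and the cross-validation setup is an assumption, not the post's exact protocol:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real feature matrix and log-price target.
X, y = make_regression(n_samples=500, n_features=4, noise=10, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R² = {r2:.3f}")
```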

This led to a pragmatic decision:

  • Train a reduced Random Forest model

  • Accept a small performance drop (R² ≈ 0.82)

  • Gain simplicity, interpretability, and usability
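
The post doesn't detail exactly how the model was slimmed down; here's a sketch of the usual levers (fewer and shallower trees, a reduced feature set, compressed serialization), assuming joblib is used to save the model:

```python
import os

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the reduced four-feature matrix and log-price target.
X, y = make_regression(n_samples=500, n_features=4, random_state=42)

# Fewer, shallower trees on fewer features keep the serialized model small.
reduced = RandomForestRegressor(
    n_estimators=100, max_depth=12, random_state=42
).fit(X, y)

joblib.dump(reduced, "car_price_model.joblib", compress=3)  # zlib compression
print(f"{os.path.getsize('car_price_model.joblib') / 1e6:.2f} MB on disk")
```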

This reflects a real-world principle:

The “best” model is often the one you can actually use.

🖥️ Streamlit App: From Analysis to Interaction

To make the work explorable and presentable, I built a Streamlit dashboard that allows:

  • Interactive EDA with filters and hue grouping

  • Correlation analysis between selected variables

  • Outlier and missing value inspection

  • Dataset export

  • A prediction interface using the trained model

The app is designed as if it were being presented to a technical recruiter or data team, emphasizing clarity, control, and insight rather than flashy visuals.
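
A minimal sketch of what the prediction interface could look like; the widget labels, model file name, and feature order are illustrative, not the app's actual code:

```python
import joblib
import numpy as np
import streamlit as st

st.title("Car Price Prediction")

# Hypothetical file name for the reduced Random Forest saved earlier.
model = joblib.load("car_price_model.joblib")

# Inputs mirror the reduced feature set: age, power, transmission, mileage.
car_age = st.slider("Car age (years)", 0, 30, 5)
power_hp = st.number_input("Power (HP)", min_value=40, max_value=800, value=120)
automatic = st.selectbox("Transmission", ["Manual", "Automatic"]) == "Automatic"
mileage_km = st.number_input("Mileage (km)", min_value=0, value=80000)

if st.button("Predict price"):
    row = [[car_age, np.log1p(power_hp), int(automatic), np.log1p(mileage_km)]]
    price = np.expm1(model.predict(row)[0])  # invert the log1p target transform
    st.metric("Estimated price", f"{price:,.0f} €")
```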


🧩 Final Thoughts

This project reinforced several real-world lessons:

  • Outliers deserve understanding, not fear. They can feel overwhelming at first, but they often carry important insights and can heavily influence model performance.

  • Simpler models often win in practice, and they can be good enough for real use.

  • When developing a Streamlit app, it's essential to take the user's point of view and add practical features that help them extract the most useful insights possible.

  • After trying the model myself, my impression is that it may be inflating prices. It might have been better to start the train-test phase with an unsupervised technique such as clustering, grouping similar cars and then training a separate model per cluster (see the sketch after this list). This is something that could be applied in the future.

  • Other future improvements (FastAPI + Docker deployment) are planned, but the project is intentionally left complete and coherent at this stage.
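
For the cluster-then-regress idea above, a rough sketch of how it could work; the data is synthetic and k = 3 is an arbitrary choice, not a tested configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the car feature matrix and price target.
X, y = make_regression(n_samples=600, n_features=4, random_state=0)

# Step 1: group similar cars (k = 3 is an arbitrary choice here).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: train one regressor per cluster; at prediction time, assign the
# incoming car to its nearest cluster and use that cluster's model.
models = {}
for k in np.unique(labels):
    mask = labels == k
    models[k] = RandomForestRegressor(random_state=0).fit(X[mask], y[mask])
```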

