🚗 A Deep Dive into Car Price Modeling
- Alvaro Mejia
- Dec 22, 2025
- 3 min read
This project is an evolution of a car price prediction exercise initially developed during the HACKABOSS Data Science bootcamp. Rather than treating it as a one-off assignment, I revisited it with a clear goal: turn a training project into something closer to a real-world data science workflow.
The focus wasn’t just on achieving a good metric, but on understanding the data, its limitations, and the trade-offs involved in building a usable model.
Role: Data Scientist, ML Engineer & Developer
Team Size: 1 developer
Duration: 1 month
Stack: Python · Streamlit · Pandas · Plotly · Scikit-learn

📊 The Data: Rich, Messy, and Full of Extremes
The dataset includes technical, categorical, and numerical information about vehicles:
Mileage, power, dimensions, fuel type
Registration dates
Consumption and emissions
Body type, transmission, and configuration details
One of the first challenges was the extreme skewness of key variables like price, mileage, and power. Raw distributions were heavily right-tailed, making both EDA and modeling unreliable.
👉 Solution: Log-transformations (log1p) were applied strategically to stabilize variance and improve interpretability during modeling, while keeping the original values available for EDA.
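As a rough sketch of that step (column names here are illustrative, not the dataset's actual schema):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset.
df = pd.DataFrame({
    "price": [3500, 12000, 89000, 250000],
    "mileage": [210000, 95000, 12000, 500],
})

# log1p (log(1 + x)) compresses the heavy right tail and is safe at zero.
# The raw columns stay untouched so EDA can still show real-world units.
for col in ["price", "mileage"]:
    df[f"{col}_log"] = np.log1p(df[col])

# At prediction time, np.expm1 inverts the transform back to the original scale.
print(df)
```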
🚨 Outliers: The Core Challenge
Outliers turned out to be one of the most important aspects of this project.
Instead of simply removing extreme values, I:
Identified outliers using Tukey’s fences
Quantified their presence per variable
Visualized their impact across different filters (e.g. body type, emissions label)
This revealed that many outliers are not errors — they represent premium vehicles, edge configurations, or niche markets. As a result, outliers were analyzed, but not aggressively removed.
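For reference, Tukey's fences flag values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch of how the per-variable quantification could look (function and column names are illustrative):

```python
import pandas as pd

def tukey_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Fraction of values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((series < lower) | (series > upper)).mean()

# Quantify outlier presence per variable before deciding whether the
# extremes are errors or legitimate premium/niche vehicles.
df = pd.DataFrame({"price": [8000, 9000, 11000, 12500, 250000]})
print(tukey_outlier_share(df["price"]))  # 0.2 -> only 250000 lies beyond the fences
```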
🧠 Feature Engineering
Some features required domain-driven transformation, as sketched after this list:
car_age derived from registration year
Binary encoding for automatic transmission
Log-transformed mileage and power
Careful handling of numeric-but-categorical variables
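A minimal sketch of those transformations (column names and the snapshot year are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy rows; the real column names in the project may differ.
df = pd.DataFrame({
    "registration_year": [2012, 2019, 2021],
    "transmission": ["manual", "automatic", "automatic"],
    "mileage": [180000, 45000, 8000],
    "power_hp": [90, 150, 245],
})

CURRENT_YEAR = 2025  # snapshot year assumed for the example

# car_age derived from the registration year.
df["car_age"] = CURRENT_YEAR - df["registration_year"]

# Binary flag for automatic transmission.
df["is_automatic"] = (df["transmission"] == "automatic").astype(int)

# Log-transformed mileage and power (see the log1p note above).
df["mileage_log"] = np.log1p(df["mileage"])
df["power_log"] = np.log1p(df["power_hp"])

print(df[["car_age", "is_automatic", "mileage_log", "power_log"]])
```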
Feature importance analysis showed that a small subset of features explained most of the variance: car age, power (HP), transmission type (manual vs. automatic), and mileage (in kilometers).
This allowed the creation of a reduced model with:
Fewer inputs
Faster inference
Much smaller deployment complexity
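A minimal sketch of that importance-based reduction, using synthetic stand-in data and illustrative column names rather than the project's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the real project uses the engineered car features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "car_age": rng.integers(0, 25, 500),
    "power_log": rng.normal(4.8, 0.4, 500),
    "is_automatic": rng.integers(0, 2, 500),
    "mileage_log": rng.normal(11, 1, 500),
    "n_doors": rng.integers(2, 6, 500),
})
# Price driven mostly by a few columns, mimicking the observed pattern.
y = 10.5 - 0.06 * X["car_age"] + 0.8 * X["power_log"] - 0.15 * X["mileage_log"]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features and keep the small subset that explains most of the variance.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
reduced_features = importances.nlargest(4).index.tolist()
print("reduced feature set:", reduced_features)
```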
🤖 Modeling & Trade-offs
Multiple models were trained and compared:
Random Forest
Gradient Boosting
KNN
Decision Trees
While ensemble models achieved the best performance (R² ≈ 0.88), they introduced a new constraint: the serialized model was too heavy for GitHub's file-size limits.
This led to a pragmatic decision:
Train a reduced Random Forest model
Accept a small performance drop (R² ≈ 0.82)
Gain simplicity, interpretability, and usability
This reflects a real-world principle:
The “best” model is often the one you can actually use.
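To illustrate the trade-off check, here is a minimal sketch assuming joblib serialization and synthetic data; the project's actual training code may differ:

```python
import os
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car dataset, just to exercise the workflow.
X, y = make_regression(n_samples=2000, n_features=4, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately smaller forest: fewer, shallower trees shrink the artifact
# dramatically, usually at a modest cost in R².
model = RandomForestRegressor(n_estimators=50, max_depth=12, random_state=0)
model.fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))

# Check the serialized size before committing the artifact to the repo.
joblib.dump(model, "reduced_rf.joblib", compress=3)
print("size MB:", os.path.getsize("reduced_rf.joblib") / 1e6)
```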
🖥️ Streamlit App: From Analysis to Interaction
To make the work explorable and presentable, I built a Streamlit dashboard that allows:
Interactive EDA with filters and hue grouping
Correlation analysis between selected variables
Outlier and missing value inspection
Dataset export
A prediction interface using the trained model
The app is designed as if it were to be shown to a technical recruiter or data team, emphasizing clarity, control, and insight rather than flashy visuals.
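As an illustration of how the prediction interface could be wired up in Streamlit (widget labels, feature order, and the model path are hypothetical, not the app's actual code):

```python
import joblib
import numpy as np
import streamlit as st

st.title("Car price estimator")

# Hypothetical path; the real app loads its own trained artifact.
model = joblib.load("reduced_rf.joblib")

# Collect the reduced feature set from simple widgets.
car_age = st.slider("Car age (years)", 0, 30, 5)
power_hp = st.number_input("Power (HP)", 40, 600, 120)
mileage = st.number_input("Mileage (km)", 0, 500_000, 80_000)
is_automatic = st.checkbox("Automatic transmission")

if st.button("Estimate price"):
    # Features in the same order/transform used at training time.
    features = np.array([[car_age, np.log1p(power_hp),
                          int(is_automatic), np.log1p(mileage)]])
    pred_log = model.predict(features)[0]
    # Invert the log1p target transform back to euros.
    st.metric("Estimated price", f"{np.expm1(pred_log):,.0f} €")
```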
🧩 Final Thoughts
This project reinforced several real-world lessons:
Outliers deserve understanding, not fear. They can feel overwhelming at times, but they can provide important insights and heavily influence model performance.
Simpler models often win in practice, and they can be good enough to use.
When developing a Streamlit app, it's essential to take the user's point of view and add practical features that let them extract the most useful insights possible.
After trying the model myself, my impression is that it may be inflating prices. It might have been better to start the train-test phase with an unsupervised technique, such as clustering, to group similar cars and then train a model for each cluster. This is something that could be applied in the future.
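To make that idea concrete, here is a rough sketch of the cluster-then-regress approach. It is purely illustrative, built on synthetic data, and not something implemented in the project:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in features/prices, just to outline the idea.
X, y = make_regression(n_samples=1500, n_features=4, noise=10, random_state=0)

# 1) Group similar cars without using the price.
clusters = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X)

# 2) Train one regressor per cluster, on that cluster's rows only.
models = {}
for c in np.unique(clusters):
    mask = clusters == c
    models[c] = RandomForestRegressor(random_state=0).fit(X[mask], y[mask])

# 3) At inference, route each car to its cluster's model.
# (A real version would persist both the KMeans and the per-cluster models.)
```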
Other future improvements (FastAPI + Docker deployment) are planned, but the project is intentionally left complete and coherent at this stage.