Project Overview
A machine learning system that predicts Formula 1 race outcomes using historical race data and real-time qualifying performance. The project predicts race winners, podium finishes, points finishes, and top 5 placements. Scroll to the bottom for the GitHub link.
Data Sources
- Ergast F1 API: Historical race data (1950-2024). The features include qualifying times, grid positions, team performance, driver history, and track-specific statistics.
Technical Stack
- Python for data processing and modeling
- Machine Learning: Random Forest and Gradient Boosting Classifiers
- Web Interface: Streamlit dashboard
Project Structure
The project begins with a data collection stage that gathers and processes Formula 1 racing data. This processed data then undergoes feature engineering to create meaningful predictive variables. Finally, the system trains multiple prediction models, one for each race outcome.
The above sample data does not have any qualifying times since it is from the 1980s.
Attached here is the fully processed data in CSV form:
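The feature-engineering stage described above produces rolling statistics such as RecentAvgPosition. A minimal pandas sketch of how such a feature might be derived (the column names and data here are illustrative, not the project's actual pipeline):

```python
import pandas as pd

# Illustrative race results: one row per driver per race, in chronological order
df = pd.DataFrame({
    'Driver':   ['VER', 'VER', 'VER', 'VER', 'HAM', 'HAM', 'HAM', 'HAM'],
    'Position': [1, 2, 1, 3, 5, 4, 6, 2],
})

# Rolling mean of each driver's last 3 finishes, shifted by one race so the
# current race only sees past results (avoids leaking the target into the feature)
df['RecentAvgPosition'] = (
    df.groupby('Driver')['Position']
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
```

The `shift(1)` is the important detail: without it, the feature for a race would include that race's own finishing position.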
Data Modelling:
Here is a breakdown of the key algorithmic components of this F1 prediction system:
- Data Processing and Feature Engineering:
def load_and_clean_data(self):
    selected_features = [
        # Grid and Race Position Features
        'GridPosition', 'Position', 'PositionsGained',
        # Qualifying Performance
        'Q1_seconds', 'Q2_seconds', 'Q3_seconds', 'BestQualiTime',
        # Historical Performance
        'RecentAvgPosition', 'AvgTrackPosition', 'TrackExperience',
        # Team Performance
        'TeamSeasonPoints', 'TeamAvgPoints'
        # ... other features
    ]
This part selects and organizes the relevant features for prediction, grouping them into logical categories like qualifying performance and historical data.
- Multiple Prediction Targets:
def create_prediction_targets(self, df):
    return {
        'Race Winner': (df['Position'] == 1).astype(int),
        'Podium': (df['Position'] <= 3).astype(int),
        'Points Finish': (df['Position'] <= 10).astype(int),
        'Strong Result': ((df['Position'] <= 5) & (df['GridPosition'] > 5)).astype(int)
    }
The system creates multiple binary classification targets, allowing prediction of different race outcomes from the same data.
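Applied to a small results frame, the target construction behaves like this (a standalone version of the method above, with made-up rows for illustration):

```python
import pandas as pd

def create_prediction_targets(df):
    """Binary targets derived from finishing and grid positions."""
    return {
        'Race Winner':   (df['Position'] == 1).astype(int),
        'Podium':        (df['Position'] <= 3).astype(int),
        'Points Finish': (df['Position'] <= 10).astype(int),
        'Strong Result': ((df['Position'] <= 5) & (df['GridPosition'] > 5)).astype(int),
    }

df = pd.DataFrame({'Position': [1, 3, 12, 4], 'GridPosition': [1, 2, 15, 9]})
targets = create_prediction_targets(df)
# The last row started P9 and finished P4, so it counts as a 'Strong Result';
# the winner does not, because it already started inside the top 5.
```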
- Model Training and Evaluation:
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10),
    'Hist Gradient Boosting': HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1)
}

for target_name, y in targets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
The algorithm uses two different models (Random Forest and Gradient Boosting) and standardizes the data for better performance.
- Feature Importance Analysis:
def plot_feature_importance(self, model, feature_columns, target_name):
    if hasattr(model, 'feature_importances_'):
        importance_df = pd.DataFrame({
            'feature': feature_columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=True)
This analyzes which features most strongly influence the predictions, helping understand what factors matter most for race outcomes.
The algorithm combines historical data with current race weekend information to make predictions, using ensemble methods (Random Forest and Gradient Boosting) that are particularly good at handling the complex relationships in racing data. The multiple prediction targets allow for different levels of success to be predicted simultaneously.
Distribution Plots:
- GridPosition: The orange line (points finishers, labeled as 1) peaks around positions 3-5, while non-points finishers (blue, labeled as 0) typically start around positions 8-12. This clear separation shows that the starting position strongly influences point finishes.
- PositionsGained: Points finishers (orange) show a strong peak around 0 to +5 positions gained, while non-points finishers (blue) have a broader distribution toward negative values, indicating they typically lose positions.
- Qualifying Times (Q1, Q2, Q3): The overlapping distributions between points and non-points finishers suggest qualifying times aren't as decisive for predicting points finishes as other factors, which aligns with their low importance scores in the feature importance graph.
Feature Importance Graph:
The bar chart shows that Points and PositionsGained are the most influential features for predicting points finishes, with importance scores of ~0.5 and ~0.35 respectively. Grid position and number of laps are important, while qualifying times (Q1, Q2, Q3) have minimal impact.
Similar Distribution Plots and Feature Importance graphs have also been made for the Winner, Podium Finishers, Top 5, and Top Qualifiers.
Confusion Matrices:
A confusion matrix is a visual representation that shows how accurately the model predicts different race outcomes. For each prediction target (like winning, a podium, or a points finish), the matrix shows four key pieces of information: true positives (outcomes correctly predicted to happen), true negatives (outcomes correctly predicted not to happen), false positives (outcomes predicted but not achieved), and false negatives (outcomes achieved but not predicted).
The color intensity in the matrices helps visualize the distribution of predictions. Darker blues indicate more predictions in that category, while lighter blues show fewer cases. This helps quickly identify where the model performs best and where it might need improvement.
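These four cells can be computed directly with scikit-learn; a minimal sketch on toy labels (the values here are illustrative, not the project's actual predictions):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # actual outcomes (1 = points finish)
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]   # model predictions

# scikit-learn's convention: rows are the true class, columns the predicted class:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```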
Our model predicts true negatives correctly and has achieved zero false negatives; however, it has not been able to accurately predict true positives. To change this, I will add more racing data from 50 years ago, try to incorporate live-changing race and track data, and see where it goes.
Results
The model demonstrates predictive capability for podium and points-finish outcomes.
Here is a snapshot of the final dashboard where one can input information and find out who’s going to win the next race of the season.
GitHub Repository:
F1-Race-Winner-Prediction
ewshu • Updated Jan 14, 2025