How to Compare Football Players the Data Way: A Personal Project (2024)

Introduction:

Last summer, I started a personal project to compare football players using data. Inspired by the Moneyball approach, I decided to publish my findings on Medium. In this post, I share the method I used to analyze and compare players from the top 5 European leagues over the last four seasons. The approach makes comparisons fairer and helps identify players with similar playing styles and performances.

Method Overview:

The process runs from data collection through to techniques such as Principal Component Analysis (PCA) and cosine similarity. Here's a brief overview of the method:

Data Collection: I gathered player data from FBRef, focusing on the top 5 European leagues over the last four seasons.
Data Aggregation: For players with multiple seasons under their belt, I aggregated their data.
Normalization: To ensure fair comparisons, I expressed every feature per 90 minutes played.
Combining Features: I combined different aspects of play (shooting, passing, defending, etc.) into a single row for each player.
Dimensionality Reduction: I used PCA to reduce dimensionality and simplify the data.
Player Comparison: Finally, I applied cosine similarity to find and sort players by their similarity score to the target player.

Detailed Methodology:

1. Data Collection:

I started by collecting comprehensive player data from FBRef. This included statistics from various aspects of the game such as shooting, passing, defending, and more. I focused on players from the top 5 European leagues over the last four seasons.

2. Data Aggregation:

For players with multiple seasons of data, I aggregated their statistics. This step ensured that the data represented the player's overall performance rather than being biased by a single season.
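A minimal sketch of this aggregation step, using plain Python dictionaries in place of dataframes. The column names and sample rows are made up for illustration, and summing raw counts (rather than averaging per-season rates) is an assumption, not necessarily what the project does.

```python
from collections import defaultdict

# Hypothetical per-season rows; names and numbers are illustrative only.
season_rows = [
    {"player": "Player A", "season": "2021-22", "minutes": 1800, "goals": 6, "assists": 4},
    {"player": "Player A", "season": "2022-23", "minutes": 2700, "goals": 9, "assists": 11},
    {"player": "Player B", "season": "2022-23", "minutes": 900, "goals": 2, "assists": 1},
]


def aggregate_seasons(rows, count_columns):
    """Sum raw counts per player across all of their seasons."""
    totals = defaultdict(lambda: dict.fromkeys(count_columns, 0))
    for row in rows:
        for column in count_columns:
            totals[row["player"]][column] += row[column]
    return dict(totals)


career_totals = aggregate_seasons(season_rows, ["minutes", "goals", "assists"])
print(career_totals["Player A"])
# {'minutes': 4500, 'goals': 15, 'assists': 15}
```

Summing counts together with minutes keeps the later per-90 step well defined, since a season with more playing time naturally carries more weight.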

3. Normalization:

To compare players fairly, I expressed every feature per 90 minutes played: each raw total is divided by the player's minutes and multiplied by 90. This adjustment ensures that players who have played different amounts of time are compared on an equal footing.
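As a rough illustration of per-90 normalization, the sketch below converts raw totals into per-90 rates. The function name `per_90`, the `minutes` key, and the sample numbers are hypothetical.

```python
def per_90(totals, minutes_key="minutes"):
    """Convert raw totals into per-90-minute rates."""
    minutes = totals[minutes_key]
    if minutes <= 0:
        raise ValueError("player has no recorded minutes")
    return {
        f"{stat}_per90": value * 90 / minutes
        for stat, value in totals.items()
        if stat != minutes_key
    }


# Two made-up players with the same output rate but very different playing time.
starter = {"minutes": 2700, "goals": 9, "assists": 12}
substitute = {"minutes": 900, "goals": 3, "assists": 4}

print(per_90(starter))
print(per_90(substitute))
```

Both players end up with the same per-90 rates even though the starter's raw totals are three times larger. In practice it is common to also drop players below some minimum number of minutes, because rates computed from very little playing time are unstable; that threshold is not part of the sketch.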

4. Combining Features:

I combined various dataframes representing different aspects of play (shooting, passing, defending, etc.) into a single row for each player. This comprehensive dataset provided a holistic view of each player's abilities.
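One way to picture this step, sketched with plain dictionaries standing in for the dataframes. The category tables, stat names, and the choice to keep only players present in every table (an inner join) are assumptions made for the example.

```python
# Hypothetical per-category tables, each keyed by player name.
shooting = {"Player A": {"shots_per90": 2.1, "goals_per90": 0.30}}
passing = {
    "Player A": {"key_passes_per90": 2.8},
    "Player B": {"key_passes_per90": 0.9},
}
defending = {
    "Player A": {"tackles_per90": 1.2},
    "Player B": {"tackles_per90": 3.4},
}


def combine_categories(categories):
    """Merge category tables into one flat row per player.

    Only players present in every category are kept, and each column is
    prefixed with its category name to avoid clashes between tables.
    """
    tables = list(categories.values())
    common_players = set(tables[0]).intersection(*tables[1:])
    combined = {}
    for player in sorted(common_players):
        row = {}
        for category, table in categories.items():
            for stat, value in table[player].items():
                row[f"{category}_{stat}"] = value
        combined[player] = row
    return combined


player_rows = combine_categories(
    {"shooting": shooting, "passing": passing, "defending": defending}
)
print(player_rows)
```

Player B is dropped here because the shooting table has no row for them. Keeping such players and filling the gaps with zeros is the other obvious option, but it changes what the similarity scores mean.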

5. Dimensionality Reduction:

To manage the complexity of the dataset, I applied PCA. This technique projects the many correlated statistics onto a smaller set of components that retain most of the variance, making it easier to compare players.
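The sketch below shows what PCA does under the hood, using NumPy's SVD rather than any particular library's PCA class. Standardizing each column first is an assumption on my part (the steps above do not say whether features were scaled), and the feature matrix is made up.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project standardized features onto the top principal components."""
    X = np.asarray(features, dtype=float)
    # Standardize each column so high-volume stats do not dominate (assumption).
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become zeros instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    scores = Z @ Vt[:n_components].T
    return scores, explained_ratio[:n_components]


# Made-up per-90 rows: shots, key passes, tackles, interceptions.
features = [
    [3.1, 1.0, 0.5, 0.3],
    [2.8, 1.4, 0.6, 0.4],
    [0.9, 2.9, 1.5, 1.1],
    [0.4, 0.8, 3.2, 2.6],
    [0.3, 0.6, 2.9, 2.8],
]
scores, explained_ratio = pca_reduce(features, n_components=2)
print(scores.shape)  # (5, 2): five players, two components each
print(explained_ratio.round(3))
```

The explained-variance ratios are a practical guide for choosing how many components to keep: add components until the cumulative share of variance is high enough for your purposes.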

6. Player Comparison:

Using cosine similarity, I compared the target player to all other players in the dataset. Cosine similarity measures the cosine of the angle between two vectors, providing a similarity score. I then sorted the players by their similarity score to identify those with the most similar playing style and performance.
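For two player vectors u and v, cosine similarity is the dot product of u and v divided by the product of their lengths. The sketch below ranks players against a target with that formula; the player names and two-dimensional vectors are invented, and `most_similar` is a hypothetical helper, not a function from the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, top_n=3):
    """Return the top_n (name, score) pairs most similar to the target."""
    target_vector = vectors[target]
    scored = [
        (name, cosine_similarity(target_vector, vector))
        for name, vector in vectors.items()
        if name != target
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up reduced vectors (for example, two principal-component scores).
player_vectors = {
    "Target": [2.0, 1.0],
    "Similar": [1.9, 1.2],
    "Different": [-1.5, 0.5],
    "Opposite": [-2.0, -1.0],
}
ranking = most_similar("Target", player_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Scores range from -1 (opposite profile) to 1 (same profile), so sorting in descending order puts the closest matches first.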

Why Use PCA and Cosine Similarity?

Principal Component Analysis (PCA):

Simplification: PCA reduces the number of dimensions in the dataset while retaining most of the information (variance). This makes the data easier to visualize and analyze.
Noise Reduction: By keeping only the leading principal components, PCA discards much of the noise and redundancy among correlated statistics, allowing us to concentrate on the directions that explain most of the variance in player performance.
Efficiency: Working with fewer dimensions speeds up computations, making the analysis more efficient with little loss of information.

Cosine Similarity:

Measure of Similarity: Cosine similarity measures the cosine of the angle between two vectors, effectively quantifying how similar two players are in terms of their performance metrics. A cosine similarity score closer to 1 indicates higher similarity.
Scale Invariance: Unlike Euclidean distance, cosine similarity is not affected by the magnitude of the vectors. This property is particularly useful in player comparison, as it ensures that the similarity score is based purely on the direction (pattern) of performance metrics, not their scale.
Interpretability: The cosine similarity score provides an intuitive and straightforward measure of similarity, making it easy to identify and rank players with similar playing styles and performance levels.
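The scale-invariance point can be checked with a small made-up example: one profile, the same profile doubled, and a profile of equal size but a different shape. The vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


profile = [2.0, 1.0, 0.5]
doubled = [4.0, 2.0, 1.0]   # same pattern, twice the volume
reshaped = [0.5, 1.0, 2.0]  # same overall size, different pattern

for label, other in [("doubled", doubled), ("reshaped", reshaped)]:
    print(
        f"{label}: cosine={cosine_similarity(profile, other):.3f}, "
        f"euclidean={math.dist(profile, other):.3f}"
    )
```

Euclidean distance treats the doubled profile as further away than the reshaped one, while cosine similarity scores it as identical in pattern. The flip side is that cosine similarity alone will not tell you that one player produces twice as much of everything, so it is best read as a measure of style rather than of output level.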

Inspiration:

The idea of using cosine similarity was inspired by my end-of-studies project for my research master's degree, where I worked on face similarity using deep learning. During this project, I explored various methods for measuring similarity, and found that cosine similarity was the most effective metric. You can check out the project here: Face Similarity using Deep Learning. This experience helped me apply the same concept to player comparisons in football analytics.

Practical Application:

Here’s a practical example of how I applied this method:

Selecting a Target Player: Let's say we want to find players similar to Kevin De Bruyne.
Data Preparation: Collect and normalize De Bruyne's data along with the rest of the players.
Feature Combination and PCA: Combine and reduce the dataset dimensions using PCA.
Cosine Similarity: Calculate the cosine similarity between De Bruyne and other players.
Ranking: Sort players based on their similarity scores to De Bruyne.

Code and Implementation:

For those interested in the technical details, you can find the complete code for this project on my GitHub repository: Moneyball for Football: A Data Science Approach for Recruiting Players.

The code includes:

Data collection and aggregation scripts
Normalization procedures
PCA implementation
Cosine similarity calculation

Conclusion:

This project highlights the power of data science in football analytics. By using advanced techniques like PCA and cosine similarity, we can make more informed and objective player comparisons. This approach can help clubs in recruiting players, analysts in evaluating performance, and fans in understanding the game better.

Call to Action:

If you're interested in football analytics or data science, I encourage you to check out the full project on GitHub. Feel free to experiment with the code, apply it to your favorite players, and share your findings. Let's continue to push the boundaries of football analysis together!

Enjoy the content, and I'll see you in the dugout!