PinballSpinner is a website for various projects centered around pinball. Projects include:
Pinalytics is an (upcoming) site dedicated to analyzing IFPA Tournament and Player data mashed up with geography information.
Hurry-Up Tutorials are a series of videos that aim to share basic strategies for a number of different pinball machines, giving a simple overview of specific games.
Machine Links compiles various go-to resources for learning about specific games. It was created specifically for Pinburgh but has since been expanded into a general game-link resource.
MatchPlayAPI is a set of resources related to extracting data from the MatchPlay.events API.
This post discusses the tech stack used to power the Pinalytics site. This page represents Version 1.0.
Why Pinalytics?
I play competitive pinball, and I love data. Pinball tournaments and leagues generate data points about players, locations, and event results from which insights can be drawn. Tim Sexton from Stern Pinball created the Pinball Statistics Power Rankings, which focus on individual player performance and which I took inspiration from for this project, but the focus here is at a macro level: the overall “strength” of a region.
I also created Pinalytics to have a project that uses dbt. What the heck is dbt? dbt (“data build tool”) is:
a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.
So what are the problems that you’re trying to solve?
Compile Relevant Player and Tournament Data
Classify Players and Tournaments into interesting cohorts
Add an extra layer to US geography; instead of just using state lines, we tie ZIP codes to Designated Marketing Areas centered on urban areas.
Capture main metrics for an area on a rolling 48-month basis, which matches the window during which an event’s value counts toward rankings (see the sketch after this list).
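To make the rolling 48-month idea concrete, here’s a minimal sketch of the kind of window check involved, assuming a simple “months before a reference date” rule; the function name and example dates are illustrative and not taken from the actual project code.

```python
# Minimal sketch of a rolling 48-month window check (illustrative only):
# an event counts toward an area's metrics if it falls within the 48 calendar
# months before the reference date.
from datetime import date

from dateutil.relativedelta import relativedelta  # python-dateutil

def in_rolling_window(event_date: date, as_of: date, months: int = 48) -> bool:
    """Return True if event_date falls inside the rolling window ending at as_of."""
    window_start = as_of - relativedelta(months=months)
    return window_start < event_date <= as_of

if __name__ == "__main__":
    print(in_rolling_window(date(2021, 6, 15), as_of=date(2023, 1, 31)))  # True
    print(in_rolling_window(date(2018, 6, 15), as_of=date(2023, 1, 31)))  # False
```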
Cut to the Chase. What’s the Stack?
Python – to extract data from the IFPA website
Excel – Manipulate the flat files needed as additional sources
MySQL – Database storage (version 5.7.38) that is compatible with my current hosting service
dbt – Data Warehousing – GitHub Link
Python – A basic Data API for use by the front end
Website – PHP-driven
Data Sources
IFPA Tournament Data
The IFPA (International Flipper Pinball Association) is the governing body supporting competitive pinball events. They have an API which gives relevant information about:
Players
Tournaments
Tournament Results
Calendar – A basic seed file was generated to allow calendar roll-ups across different vectors.
ZIP Codes – Basic ZIP code data for the United States
ZIP to DMA Mappings – DMAs are “Designated Marketing Areas”. You can see an interactive map of the Designated Marketing Areas here. They are closely tied to media and television markets.
Why DMAs? They help break out of state lines and align each ZIP code with a more natural regional association.
For example, I live near Philadelphia, PA. Pennsylvania has a number of thriving pinball scenes (across five different DMAs), but for me in Philadelphia it’s very hard to make a day trip to Pittsburgh (300 mi / 483 km) for an event.
The goal of tying to a DMA is to look at the strength of a region and not just a state. Currently, it’s easy enough to look at a State or Province via the IFPA site to see the standings for the yearly North American Championships. But that doesn’t tell the whole story. DMAs allow us (for the US) to go a level deeper and find out what’s happening in different regions. Here’s a map of Pennsylvania, which includes DMAs from:
Buffalo, NY
Elmira, NY
Erie, PA
Harrisburg/Lancaster/Lebanon/York, PA
Johnstown/Altoona, PA
Philadelphia, PA
Washington, DC
Wilkes Barre/Scranton, PA
Youngstown, OH
Python Scripting
If you want good data and analytics, then you have to have the right data to begin with. The stack has some very basic Python scripts that iterate through the player, tournament, and results APIs and dump out the results in a structured CSV format.
The scripts included the time, requests, json, csv, math, and datetime modules.
These would be run once a month to grab the latest dataset. A full refresh is required to make sure you have the latest and greatest, specifically for recently created events or events where there was a change after the initial publication of results.
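To give a feel for the extraction pattern, here’s a minimal sketch of one of these loops against the public IFPA API. The endpoint path, parameters, and CSV columns are assumptions for illustration only; the actual scripts aren’t public yet.

```python
# Illustrative sketch of the monthly extraction loop (not the project's script).
# The base URL, endpoint shape, and response fields are assumptions -- check the
# IFPA API documentation for the real ones.
import csv
import time

import requests

API_KEY = "YOUR_IFPA_API_KEY"                 # hypothetical placeholder
BASE_URL = "https://api.ifpapinball.com/v1"   # assumed base URL

def fetch_tournament(tournament_id):
    """Fetch a single tournament record; return None on a miss."""
    resp = requests.get(
        f"{BASE_URL}/tournament/{tournament_id}",
        params={"api_key": API_KEY},
        timeout=30,
    )
    if resp.status_code != 200:
        return None
    return resp.json()

def dump_tournaments(start_id, end_id, out_path="raw_tournaments.csv"):
    """Iterate over an ID range and write a structured CSV for the RAW layer."""
    fieldnames = ["tournament_id", "tournament_name", "event_date", "city", "country"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for tid in range(start_id, end_id + 1):
            payload = fetch_tournament(tid)
            if payload:
                writer.writerow({"tournament_id": tid, **payload})
            time.sleep(0.5)  # be polite to the API; this is what keeps the run linear

if __name__ == "__main__":
    dump_tournaments(1, 100)
```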
Future improvements will likely include:
Better extraction for incremental updates
Changes to the code to speed up the monthly extraction process, which is currently very linear
Publish the GitHub Repo to make it public (needs some cleanup)
MySQL Database – RAW Schema
My current web hosting utilizes MySQL 5.7. Is it the latest and greatest? No, but it gets the job done. It’s also the database stack I have the most familiarity with from building other websites. I explored the possibility of using Postgres given its natural compatibility with dbt, but in the end I settled on something more familiar to make the front-end build easier.
With a local development environment, MySQL Workbench, and the necessary IP addresses opened up, we’re finally ready to rock and roll with dbt development.
dbt
Here’s where the fun begins. dbt allows us to take the raw data inputs and use modular SQL files to cleanse, enhance, and produce analytics outputs that can be consumed by the front end.
Sources and Seeds -> There are six main source tables in both database and flat-file formats, and this is where it all starts.
Staging -> Raw files come in, cleansed files come out. This stage is used to get rid of bad lines of data and start to enhance the outputs.
Marts -> I’m very much a believer in Kimball Modeling, and I love taking the concepts of dimensionality and facts to make sure the marts layer is a clean output where the grain is simple, clean, and testable. I view Marts as a “Grand Central Station” through which all data must pass before moving on to the next layer.
Metrics -> This is the “final” layer of the dbt stack. The goal of the metrics layer is to have data products ready to go to be consumed by applications or people who need it.
Exposures -> Connections to where the data products are going.
In thinking about the DAG (Directed Acyclic Graph), in an ideal state your DAG would look simple and wonderful and something like this:
But in reality, things are of course messier and more complicated.
And that’s OK! In the case of this project, there’s work that goes into producing the Marts layer, which then has to be spun further with additional Staging models in order to make it to the Metrics layer.
Here’s a very zoomed-out view of the resulting DAG for this project. The red circles represent the Marts layer…which then continues on to a bunch more Staging files before we get to the Metrics layer.
Should we separate out the “Staging for Marts” and “Staging for Metrics” concepts in a future version? Perhaps! For now, we’ll call this “good enough” for this first iteration.
So, how’s that MySQL choice working out from a dbt perspective? Pretty well! But there were some lessons learned along the way.
The MySQL server that my hosting company uses seems very underpowered. I started by doing development directly against it but ran into time-out and connection errors, which is why I ended up installing a local MySQL environment.
MySQL 5.7 doesn’t have CTEs. It does support subqueries, but because of the performance issues I was facing early on, I decided to:
Break out what would normally be a CTE into a Staging model
Materialize the Staging models
Also, MySQL is (understandably) a lot more persnickety than a data warehouse like Snowflake in terms of primary key and index definitions, so thought needed to go into each model: what’s the key here, and what is the model going to have to reference?
As a result, I learned a lot about pre-hooks and post-hooks and the proper way to make them work in a MySQL environment. Example Below:
dbt Testing helps automate data quality checks. The project here is relatively simple, and I thought we’d have some straight-ahead tests that would easily pass:
Unique
Not Null
Well, it turns out I had a problem somewhere in the tournament_results model on the unique key!
There were tournament results that weren’t keyed to a specific player because of player-suppression rules, so we have results that don’t have a player associated with them. They’re still valid data points, so the key was changed to ensure we kept the integrity of the table.
At the end of it all, we have our main Mart tables:
fct_ifpa_players
fct_ifpa_tournaments
fct_ifpa_tournament_results
…which then get spun into different Metrics models for front-end consumption, like:
international_summary
geography_summary
top_players_by_geography
…which can then be connected to Exposures to represent the data products being used downstream.
Python – A basic Data API for use by the front end
Now that we have the data sorted and formatted the way we want for the application layer, we need a way to get that data to the application.
Ideally we want it to look like this:
Why? Because then we can centrally control (and store) the various SQL queries to be served up programmatically for the web application, specifically for the charts. We can also store those queries in dbt, making maintenance simpler.
In reality though, here’s where it’s at:
Thanks to my friend Ryan Weicker, who is leading the way with a basic Python/Flask application that will maintain the connection to the database and serve up the data as digestible CSVs. It’ll be used specifically for making charts. For now, though, it’s a lot of on-the-fly manipulation of queries to get the outputs into the format needed.
Website
The website is a basic PHP site, with the following plugins and add-ons:
Chart.js is used for the charts. I tried out D3.js, but it ended up being more complicated than what I needed and harder to implement.
select2 is used for the interactive dropdowns.
What’s Next?
We’re at the point where the basics are in place across the board to make incremental updates over time. Right now the plan is to keep the data feed flowing over the next few months as individual months roll along.
Potential Version 2 enhancements might include:
Additional geography breakouts. Right now it’s only broken out for US metro areas. I tried to find a Canadian equivalent, but the best I could find was this breakout, which had no connection to postal codes. The concept could be expanded to other areas, but I’d need outside expertise.
Additional metric sets. What are other good metrics for this data set? This is where I presume the pinball community will chime in. Ideas I have include engagement scores based on how frequently players play, or “customer cohorts” based on how long they’ve been active players.
Additional dimensionality and attributes. Right now the grouping of things like tournaments has been simple, but there’s more to do around event formats, Tournament Directors, or even text analysis of event descriptions.
Additional datasets. Right now the data is focused on just some of the core items available, but there’s more to potentially do around integrating other sets like data from the MatchPlay.events API or the NeverDrains universe.
Finish the API layer. This will make building future graphs easier, but it’ll also open up the possibility of sharing the API with others so that they can build their own applications or data analysis on top.
This post discusses the tech stack used to power the Pinalytics site. This page represents Version 2.0 and is an update to the previous Version 1.0.
Why Pinalytics?
I play competitive pinball, and I love data. Pinball tournaments and leagues generate data points about players, locations, and event results from which insights can be drawn. Tim Sexton from Stern Pinball created the Pinball Statistics Power Rankings, which focus on individual player performance and which I took inspiration from for this project, but the focus here is at a macro level: the overall “strength” of a region.
I also created Pinalytics to have a project that uses dbt. What the heck is dbt? dbt (“data build tool”) is:
a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.
Python – A basic Data API for use by the front end
Website – PHP-driven
Data Sources
IFPA Tournament Data
The IFPA (International Flipper Pinball Association) is the governing body supporting competitive pinball events. They have an API which gives relevant information about:
Players
Tournaments
Tournament Results
Calendar – A basic seed file was generated to allow calendar roll-ups across different vectors.
ZIP Codes – Basic ZIP code data for the United States
ZIP to DMA Mappings – DMAs are “Designated Marketing Areas”. You can see an interactive map of the Designated Marketing Areas here. They are closely tied to media and television markets.
Why DMAs? They help break out of state lines and align each ZIP code with a more natural regional association.
For example, I live near Philadelphia, PA. Pennsylvania has a number of thriving pinball scenes (across five different DMAs), but for me in Philadelphia it’s very hard to make a day trip to Pittsburgh (300 mi / 483 km) for an event.
The goal of tying to a DMA is to look at the strength of a region and not just a state. Currently, it’s easy enough to look at a State or Province via the IFPA site to see the standings for the yearly North American Championships. But that doesn’t tell the whole story. DMAs allow us (for the US) to go a level deeper and find out what’s happening in different regions. Here’s a map of Pennsylvania, which includes DMAs from:
Buffalo, NY
Elmira, NY
Erie, PA
Harrisburg/Lancaster/Lebanon/York, PA
Johnstown/Altoona, PA
Philadelphia, PA
Washington, DC
Wilkes Barre/Scranton, PA
Youngstown, OH
Python Scripting
If you want good data and analytics, then you have to have the right data to begin with. The stack has some very basic Python scripts that iterate through the player, tournament, and results APIs and dump the contents directly into a raw Postgres database.
The scripts include the time, requests, json, csv, math, psycopg2, and datetime modules.
These will be run weekly to grab the latest dataset, but the overall website will focus on the most recently completed month. A full refresh is required to make sure you have the latest and greatest, specifically for recently created events or events where there was a change after the initial publication of results.
That all being said, it currently takes about 10 hours to run the tournaments script, 10 hours to run the players script, and about 48 hours for a full refresh of tournament history. Lots of potential optimization in the future on this front!
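As a rough illustration of the load step, here’s a minimal sketch of pulling one record and landing the raw JSON in a Postgres schema with psycopg2. The endpoint, table name, columns, and connection details are placeholders, not the project’s actual schema.

```python
# Illustrative raw-load sketch (placeholder schema and credentials).
import json

import psycopg2
import requests

API_KEY = "YOUR_IFPA_API_KEY"                 # hypothetical placeholder
BASE_URL = "https://api.ifpapinball.com/v1"   # assumed base URL

def load_tournament(conn, tournament_id):
    """Fetch one tournament and upsert the raw JSON payload into the raw schema."""
    resp = requests.get(
        f"{BASE_URL}/tournament/{tournament_id}",
        params={"api_key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    payload = resp.json()

    with conn.cursor() as cur:
        # Assumes a raw.ifpa_tournaments table with a unique tournament_id
        # and a text/jsonb payload column -- both placeholders.
        cur.execute(
            """
            INSERT INTO raw.ifpa_tournaments (tournament_id, payload)
            VALUES (%s, %s)
            ON CONFLICT (tournament_id) DO UPDATE SET payload = EXCLUDED.payload
            """,
            (tournament_id, json.dumps(payload)),
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect(
        host="your-rds-endpoint.amazonaws.com",  # placeholder AWS RDS host
        dbname="pinalytics",
        user="loader",
        password="***",
    )
    load_tournament(conn, 12345)
    conn.close()
```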
Future improvements will likely include:
Better logic to allow for incremental updates based on tournament recency
Changes to the code to speed up the extraction process, which is currently very linear (see the sketch after this list)
Publish the GitHub Repo to make it public (needs some cleanup)
Move the Python scripts off my local desktop and into some kind of cloud environment
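On the “very linear” point above, one possible speed-up is to fan requests out over a small thread pool with concurrent.futures. This is a sketch of a possible future direction rather than current project code, and the endpoint details are the same placeholders as before.

```python
# Illustrative parallel-extraction sketch using a thread pool (not project code).
from concurrent.futures import ThreadPoolExecutor

import requests

API_KEY = "YOUR_IFPA_API_KEY"                 # hypothetical placeholder
BASE_URL = "https://api.ifpapinball.com/v1"   # assumed base URL

def fetch_player(player_id):
    """Fetch one player record; return None if the call fails."""
    try:
        resp = requests.get(
            f"{BASE_URL}/player/{player_id}",
            params={"api_key": API_KEY},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None

def fetch_players_parallel(player_ids, max_workers=8):
    """Fan an ID range out over a small thread pool instead of a serial loop."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_player, player_ids))
    return [r for r in results if r is not None]

if __name__ == "__main__":
    players = fetch_players_parallel(range(1, 201))
    print(f"Fetched {len(players)} player records")
```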
Postgres 13 Database – RAW Schema
Previously I was using MySQL 5.7, since that was what was available from my web host. In Version 2.0 I wanted the database to be compatible with dbt Cloud, which supports Postgres (among other things). After some experimenting in AWS RDS, I settled on a Postgres 13.11 database because it works with dbt Cloud and it works around some authentication limitations of my host’s PHP/Postgres services.
With a local development environment, pgAdmin 4, and the necessary IP addresses opened up, we’re finally ready to rock and roll with dbt development.
dbt
Here’s where the fun begins. dbt allows us to take the raw data inputs and use modular SQL files to cleanse, enhance, and produce analytics outputs that can be consumed by the front end.
Sources and Seeds -> There are six main source tables in both database and flat-file formats, and this is where it all starts.
Staging -> Raw files come in, cleansed files come out. This stage is used to get rid of bad lines of data and start to enhance the outputs.
Marts -> I’m very much a believer in Kimball Modeling, and I love taking the concepts of dimensionality and facts to make sure the marts layer is a clean output where the grain is simple, clean, and testable. I view Marts as a “Grand Central Station” through which all data must pass before moving on to the next layer.
Metrics -> This is the “final” layer of the dbt stack. The goal of the metrics layer is to have data products ready to go to be consumed by applications or people who need it.
Exposures -> Connections to where the data products are going.
In thinking about the DAG (Directed Acyclic Graph), in an ideal state your DAG would look simple and wonderful and something like this:
But in reality, things are of course messier and more complicated.
And that’s OK! In the case of this project, there’s work that goes into producing the Marts layer, which then has to be spun further with additional Staging models in order to make it to the Metrics layer.
Here’s a very zoomed-out view of the resulting DAG for this project. The red circles represent the Marts layer…which then continues on to a bunch more Staging files before we get to the Metrics layer.
Should we separate out the “Staging for Marts” and “Staging for Metrics” concepts in a future version? Perhaps! For now, we’ll call this “good enough” for this iteration.
Since I was previously using MySQL for Version 1.0, Version 2.0 put a focus on converting from MySQL to Postgres. The goal was to make sure everything was working with the new tech stack, without much aim for net-new functionality in the database or the front end.
Major changes included:
Since Postgres supports CTEs, I was finally able to take advantage of them.
MySQL from my hosting provider was quite inefficient, which required a lot of post-hooks for key definitions. Version 2.0 removes a lot of those primary key definitions because processing and JOIN handling are much more efficient in Postgres.
Using the dbt Cloud IDE! What a breath of fresh air compared with using dbt Core locally.
dbt Testing
dbt Testing helps automate data quality checks. The project here is relatively simple, and I thought we’d have some straight-ahead tests that would easily pass:
Unique
Not Null
Well, it turns out I had a problem somewhere in the tournament_results model on the unique key!
There were tournament results that weren’t keyed to a specific player because of player-suppression rules, so we have results that don’t have a player associated with them. They’re still valid data points, so the key was changed to ensure we kept the integrity of the table.
At the end of it all, we have our main Mart tables:
fct_ifpa_players
fct_ifpa_tournaments
fct_ifpa_tournament_results
…which then get spun into different Metrics models for front-end consumption, like:
international_summary
geography_summary
top_players_by_geography
…which can then be connected to Exposures to represent the data products being used downstream.
Python – A basic Data API for use by the front end
Now that we have the data sorted and formatted the way we want for the application layer, we need a way to get that data to the application.
Ideally we want it to look like this:
Why? Because then we can centrally control (and store) the various SQL queries to be served up programmatically for the web application, specifically for the charts. We can also store those queries in dbt, making maintenance simpler.
In reality though, here’s where it’s at:
Thanks to my friend Ryan Weicker, who is leading the way with a basic Python/Flask application that will maintain the connection to the database and serve up the data as digestible CSVs. It’ll be used specifically for making charts. For now, though, it’s a lot of on-the-fly manipulation of queries to get the outputs into the format needed.
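For a sense of what the “ideal” version could look like, here’s a rough Flask sketch that runs a stored query against the warehouse and returns the result as CSV. The route, query, and connection details are made up for illustration; this is not the actual application being built.

```python
# Illustrative Flask data-API sketch (placeholder query, route, and credentials).
import csv
import io

import psycopg2
from flask import Flask, Response

app = Flask(__name__)

GEOGRAPHY_SUMMARY_SQL = "SELECT * FROM geography_summary"  # placeholder query

def get_connection():
    """Open a connection to the warehouse (placeholder credentials)."""
    return psycopg2.connect(
        host="your-db-host", dbname="pinalytics", user="api_user", password="***"
    )

@app.route("/geography_summary.csv")
def geography_summary():
    """Run the stored query and return the result as CSV for the charts."""
    conn = get_connection()
    try:
        with conn.cursor() as cur:
            cur.execute(GEOGRAPHY_SUMMARY_SQL)
            columns = [desc[0] for desc in cur.description]
            rows = cur.fetchall()
    finally:
        conn.close()

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)
    return Response(buf.getvalue(), mimetype="text/csv")

if __name__ == "__main__":
    app.run(debug=True)
```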
Website
The website is a basic PHP site, with the following plugins and add-ons:
Chart.js is used for the charts. I tried out D3.js, but it ended up being more complicated than what I needed and harder to implement.
select2 is used for the interactive dropdowns.
What’s Next?
Now that we’re on Postgres and dbt Cloud, I’ve been able to set up scheduled jobs with the goal of refreshing the site once a month. It’s going to go a long way toward automation and keeping up-to-date data points available.
Potential Version 3 enhancements might include:
Additional geography breakouts. Right now it’s only broken out for US metro areas. I tried to find a Canadian equivalent, but the best I could find was this breakout, which had no connection to postal codes. The concept could be expanded to other areas, but I’d need outside expertise.
Additional metric sets. What are other good metrics for this data set? This is where I presume the pinball community will chime in. Ideas I have include engagement scores based on how frequently players play, or “customer cohorts” based on how long they’ve been active players.
Additional dimensionality and attributes. Right now the grouping of things like tournaments has been simple, but there’s more to do around event formats, Tournament Directors, or even text analysis of event descriptions.
Additional charts. I’d love to get some maps embedded into the site.
Additional datasets. Right now the data is focused on just some of the core items available, but there’s more to potentially do around integrating other sets like data from the MatchPlay.events API or the NeverDrains universe.
Finish the API layer. This will make building future graphs easier, but it’ll also open up the possibility of sharing the API with others so that they can build their own applications or data analysis on top.