With a larger dataset you could come up with some really interesting statistics on individual players' performance, use of certain units, the success of various strategies and build orders, etc.
It would be really fun to try to predict the outcome of a game based on the first 3 minutes or so.
If it's quite good it could be useful as a tool for people trying to get better (although various tooling and custom maps have been around for these types of purposes since the game came out, and most people, even pros, seem to just grind ladder for practice).
I was also hoping it could be used to analyze a replay and indicate how "back and forth" the game was, or if it was more one-sided, etc. Or if the top 100 pros were part of the featurization, then you could potentially determine how the odds change given a certain game state depending on who the pro is.
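One rough way to put a number on "back and forth" versus one-sided, as a sketch only: assume you can already derive some per-interval advantage series from a replay (say, player A's army value minus player B's, sampled every 30 seconds). That series, the sampling interval, and both function names below are assumptions for illustration, not something any existing parser is known to emit.

```python
def lead_changes(advantage):
    """Count sign flips in the advantage series, skipping exact ties."""
    changes = 0
    last_sign = 0
    for value in advantage:
        sign = (value > 0) - (value < 0)
        if sign == 0:
            continue  # a tie does not change who was last ahead
        if last_sign != 0 and sign != last_sign:
            changes += 1
        last_sign = sign
    return changes


def one_sidedness(advantage):
    """Share of non-tied samples held by whoever led most often (0.5 to 1.0)."""
    ahead = sum(1 for v in advantage if v > 0)
    behind = sum(1 for v in advantage if v < 0)
    total = ahead + behind
    if total == 0:
        return 0.5  # nobody was ever ahead
    return max(ahead, behind) / total


# Made-up samples: positive means player A is ahead, negative means player B.
close_game = [100, 250, -50, -300, 20, 400]
stomp = [10, 20, 30]

close_changes = lead_changes(close_game)
stomp_changes = lead_changes(stomp)
stomp_share = one_sidedness(stomp)
```

Many lead changes with a share near 0.5 would suggest a back-and-forth game, while zero lead changes with a share of 1.0 would suggest a stomp. Whether army value is a good proxy for "who is ahead" is a separate question.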
Feedback like "your baneling run-bys are most cost-effective with 7 banelings around the ten minute mark" is firstly unlikely to be accurate, and secondly unlikely to be useful to a pro because their play is so situational.
There also isn't enough data in the replay files to piece together the kind of specific information they would want.
SC2 has extremely high variance, even at the pro level. It's very hard to mine useful insights from game data. Even pro builds, which are supposedly standardized, have crazy variation: https://sc2.gg/reports/top-openings-2022/.
There are so many variations in games, and there are often adjustments on the fly or even sometimes mess-ups.
(which both include a full JSON parser for each parser)
It's not easy to work with proprietary formats, but they've both become pretty popular, so I would 100% recommend sinking more time into this project as long as it scratches your itch. Gamers are always looking for more stats and deeper insights.
If you search Direct Strike Starcraft on YouTube you should be able to see the gameplay itself.
For context, I have an open source video player designed for esports coaches. The main feature is that you can load in multiple video streams at once and synchronise them together. Mainly for FPS games like Valorant / Apex Legends (https://www.vodon.gg/ if you want to explore it).
I'm starting to get access to some streams of data from the games via coaches that use the tool. My very naïve approach was simply to load game events into the video timeline (so you could easily skip forwards to deaths, kills etc) but I hadn't thought about loading this into data analysis tools.
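For what it's worth, the "events on the video timeline" idea can be sketched in a few lines. Everything about the data shape here is an assumption: the `game_time_s` and `kind` keys, the event kinds, and the single constant offset between game clock and video clock are all hypothetical, and real game feeds will look different.

```python
from bisect import bisect_right


def build_markers(events, video_offset_s, kinds=("kill", "death")):
    """Turn game events into sorted (video_time_s, kind) timeline markers."""
    return sorted(
        (e["game_time_s"] + video_offset_s, e["kind"])
        for e in events
        if e["kind"] in kinds
    )


def next_marker(markers, current_s):
    """Return the first marker strictly after current_s, or None."""
    times = [t for t, _ in markers]
    i = bisect_right(times, current_s)
    return markers[i] if i < len(markers) else None


# Made-up events; the recording is assumed to start 12.5 s before the game clock.
events = [
    {"game_time_s": 95.0, "kind": "kill"},
    {"game_time_s": 30.0, "kind": "death"},
    {"game_time_s": 60.0, "kind": "round_start"},
]
markers = build_markers(events, video_offset_s=12.5)
skip_target = next_marker(markers, current_s=42.5)
```

The same list of `(time, kind)` rows is also what you would hand to a data analysis tool, so the two uses do not need separate pipelines.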
The game events themselves for Valorant seem like they'd be enough to almost construct an online replay from them as well, which could complement the recorded gameplay (i.e. construct a dynamic map of where everybody is that could be brought up on screen).
It's a very cool space. If you'd like to chat more, my email is in my profile.
https://www.nature.com/articles/s41597-023-02510-7
Feel free to look into the tools on my GitHub (https://github.com/Kaszanas). Since this is mostly the topic of my PhD I guess I will be updating the dataset in the near future. You may want to try and test your parser against it.
Further research for you would probably include running logistic regression on aggregated data from each of the replays, to get a model that can discern between winners and losers and to see which parameters are key in your data.
Example: https://www.researchgate.net/publication/363613604_Determina...
And, going even further, embedding the games as time-series data via various methods.
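To make the logistic regression suggestion concrete, here is a minimal sketch in plain Python (plain batch gradient descent, not scikit-learn or any particular paper's method). The feature names and all the numbers are made up for illustration: each row is assumed to be a standardized "player minus opponent" difference aggregated over one replay, so every game contributes a winner row and a mirrored loser row.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    """Predicted probability that the row belongs to the winner."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical features, standardized, one winner row per (made-up) replay.
FEATURES = ["collection_rate_diff", "apm_diff"]
winner_rows = [[1.0, 0.2], [0.8, -0.3], [1.2, 0.1]]
rows = winner_rows + [[-v for v in r] for r in winner_rows]  # mirrored loser rows
labels = [1] * len(winner_rows) + [0] * len(winner_rows)

weights, bias = fit_logistic(rows, labels)
# Rank features by absolute weight as a crude "which parameters are key" view.
ranked = sorted(zip(FEATURES, weights), key=lambda fw: abs(fw[1]), reverse=True)
```

Comparing absolute weights only makes sense if the features are on the same scale, which is why the rows are assumed to be standardized. With real replays you would also want a held-out set before reading anything into the ranking.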
I implemented a basic one in Rust a while back: https://github.com/ZephyrBlu/rust-parser
And a full one in Python with a few bells and whistles ages ago: https://github.com/ZephyrBlu/zephyrus-sc2-parser
Don't maintain either of them though :(, and the Rust one is super rough.
SC2 is a very interesting area for data analysis, but at the same time I found it very challenging. There is so much nuance and inconsistency across games that it can be really hard to accurately do things like categorize builds or measure build timings.
The area I ended up focusing on was builds, and I feel like I did some interesting stuff there: https://sc2.gg/reports/top-openings-2022/.
I found personal statistics less interesting than aggregate statistics. Even pro games are very volatile, ladder games even more so. Extremely hard to get reliable signal out of them if you're trying to track things across games. Even simple things like Collection Rate are poor indicators without significant categorization work (Matchup, build, opponent build, etc).
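As a sketch of what that categorization work might look like in its simplest form: bucket games by matchup and build before comparing Collection Rate between wins and losses. The record layout, the field names, and every number below are made up for illustration, and a real version would also need the opponent's build plus far more games per bucket.

```python
from collections import defaultdict
from statistics import mean


def collection_rate_gap(games):
    """Mean winner-minus-loser collection rate per (matchup, build) bucket."""
    buckets = defaultdict(lambda: {"win": [], "loss": []})
    for g in games:
        outcome = "win" if g["won"] else "loss"
        buckets[(g["matchup"], g["build"])][outcome].append(g["collection_rate"])
    gaps = {}
    for key, groups in buckets.items():
        # Skip buckets that lack either wins or losses: nothing to compare.
        if groups["win"] and groups["loss"]:
            gaps[key] = mean(groups["win"]) - mean(groups["loss"])
    return gaps


# Made-up games from one player's point of view.
games = [
    {"matchup": "ZvT", "build": "hatch first", "collection_rate": 2100, "won": True},
    {"matchup": "ZvT", "build": "hatch first", "collection_rate": 1900, "won": False},
    {"matchup": "ZvT", "build": "12 pool", "collection_rate": 1500, "won": True},
    {"matchup": "ZvT", "build": "12 pool", "collection_rate": 1700, "won": False},
    {"matchup": "PvP", "build": "4 gate", "collection_rate": 1800, "won": True},
]
gaps = collection_rate_gap(games)
```

In this toy data the gap flips sign between the two ZvT builds, which is the kind of thing that gets averaged away if you pool everything into one number.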
[0]: https://youtu.be/yBCe8SqGwK8
[1]: https://jku-vds-lab.at/publications/2022_embedding_structure...