Notice how they watched the game and got the statistics like that. The restrictions are about using the scoreboard and the data displays and reselling/commercialising that data. It is, however, legal to watch the game and compile and distribute your own stats, due to the game entering public domain.
Due to this, many betting companies and data collection companies have to pay people to watch the game rather than just scraping the scoreboard (which is the context in which I learnt about this). Ironically, at-venue OCR is a common way to get scoreboard data.
I'd be curious to see what useful insights could be gleaned from the match commentary. You have the main commentator giving play-by-play objective reporting and then a 'colour' commentator giving some subjective analysis during breaks in play. I bet there are a lot of interesting ways this could be used.
- The relatively trivial task of extracting textual data from the screen.
- The task of obfuscating that they're publishing other people's work as their own.
When I clicked the article I assumed they'd try to automatically construct analysis of the game by using AI to analyze frames of the game, but that's not what they are doing. They are extracting some trivial information from the frames, and then they process the audio of the referee mic and commentary.
In other words, the analysis has already been done by humans and they just want to re-publish this analysis as their own, without paying money for it. So they run it through an AI because in today's legal environment this seems to completely exempt you from copyright infringement or plagiarism laws.
A few years ago, media companies were rent-seeking parasites who leveraged the jack-booted thugs of law enforcement to protect an artificial monopoly using IP laws that were massive overreach and contrary to the interests of humanity.
Today, suddenly, media companies are pillars of society whose valuable contributions must be protected from the scourge of theft by everything from VC backed AI companies to armchair hackers who don’t respect the sanctity of IP.
It’s amazing how mutable these principles are. I’m sure plenty of people are somewhere between the two extremes, but the shift is so dramatic that I am 100% sure many individuals have completely revised their opinions of IP companies based largely on worries about their own work being disrupted.
At the very least it should create some empathy for the lawyers and business folk we all despised for their rent-seeking blah blah blah. They were just honestly espousing the positions their financial incentives aligned them to.
{ "current_play": "ruck", }
So the vision model can correctly identify that there's a ruck going on and that the ball is most likely in the ruck.
Why not build on this? Which team is in possession? Who was the ball carrier at the start of the ruck, and who tackled him? Who joined the ruck, and how quickly did they get there? How quickly did the attacking team get the ball back in hand, or the defending team turn over possession? What would be a good option for the outhalf if he got the ball right now?
All of these except the last would be straightforward enough for a human observer with basic rugby knowledge going through the footage frame by frame, and I bet it would be really valuable to analysts. It seems like computer vision technology is at a stage where this could be automated too.
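To make those questions concrete, here is a minimal sketch of the record an analyst (or a vision pipeline) might fill in per ruck. Everything in it is an assumption for illustration: the `RuckEvent` name, its fields, the "home"/"away" labels and the timings are hypothetical, not something from the article.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuckEvent:
    """One ruck, timed in seconds from the start of the clip (hypothetical schema)."""
    attacking_team: str
    ball_carrier: str
    tackler: str
    formed_at: float                      # moment the ruck forms
    ball_out_at: Optional[float] = None   # moment the ball is back in hand
    won_by: Optional[str] = None          # team that came away with the ball
    arrivals: dict = field(default_factory=dict)  # player -> arrival time

    def ruck_speed(self) -> Optional[float]:
        """Seconds from ruck formation to the ball being playable again."""
        if self.ball_out_at is None:
            return None
        return self.ball_out_at - self.formed_at

    def is_turnover(self) -> bool:
        """True when the defending team came away with the ball."""
        return self.won_by is not None and self.won_by != self.attacking_team

    def arrival_delays(self) -> dict:
        """Seconds each player took to join, relative to ruck formation."""
        return {player: t - self.formed_at for player, t in self.arrivals.items()}


# Made-up example values, purely to show the shape of the data.
ruck = RuckEvent(
    attacking_team="home", ball_carrier="home 12", tackler="away 7",
    formed_at=10.0, ball_out_at=13.5, won_by="home",
    arrivals={"home 1": 10.5, "home 6": 11.0},
)
print(ruck.ruck_speed(), ruck.is_turnover(), ruck.arrival_delays())
```

The hard part is obviously filling in the fields from footage, not storing them, but a fixed schema like this at least pins down what the model is being asked to find.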
not sure if it is done by a human or not
curious how “an AI can do it” yields much difference in terms of result for the casual watcher
TFA mentions comparing a frame with and without - but how do you generate that frame without? If you can already do it, what's useful about doing that?
And then he does a good ol' regular crop on the original image to get the UI excerpt to feed the vision model.
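For reference, a plain crop is only a slice over rows and columns. A rough sketch, assuming the frame is a list of pixel rows and that the scoreboard sits in a fixed box; the coordinates are made up, and in practice you would more likely reach for an image library such as Pillow's `Image.crop`:

```python
def crop(frame, left, top, right, bottom):
    """Return the sub-grid covering rows top..bottom-1 and columns left..right-1."""
    if not (0 <= top <= bottom <= len(frame)):
        raise ValueError("row bounds fall outside the frame")
    if not all(0 <= left <= right <= len(row) for row in frame):
        raise ValueError("column bounds fall outside the frame")
    return [row[left:right] for row in frame[top:bottom]]


# A toy 4x6 "frame" where each pixel is just its (row, column) position.
frame = [[(r, c) for c in range(6)] for r in range(4)]

# Hypothetical scoreboard box: top-left corner, 3 pixels wide and 2 tall.
scoreboard = crop(frame, left=0, top=0, right=3, bottom=2)
print(scoreboard)
```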
I'm surprised there aren't enough fans willing to do that if you could gamify it.
In seriousness, this is a cool project and shows how sophisticated an analysis LLMs can do in a plug-and-play manner. They may not always be the best solution, but they are a fantastic baseline that can be deployed and adapted to a use case in less than an hour.
The scope is a bit different. The study uses an LLM to interpret pose estimation data and describe the behavior in each frame. The output is text which can be used to create embeddings of behavior. As someone who works in ethology, that's a clever (but maybe expensive) idea.
I think the author could use something similar, with multi-person pose estimation models.
> The plan was simple.
You know you're in for a funny read.
More seriously though, the JSON example from a vision language model is interesting but does not take into account how much extrapolation (hallucination) the model will insert over time.
For instance, even if not visible in the image, your VLM will probably start inserting details (such as the color of the team's jersey) based on knowing the team's three-letter identifier.
So the reliability of the system will go down over time, and it probably compounds if you're using some of that info to feed further steps in the loop.
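A back-of-the-envelope way to see the compounding, assuming (simplistically) that each step in the loop is independently correct with the same probability. The 0.99 figure is an illustrative guess, not a measured accuracy for any model, and `chained_reliability` is just a name for this sketch:

```python
def chained_reliability(per_step: float, steps: int) -> float:
    """Probability that all `steps` chained outputs are correct, assuming each
    step is independently correct with probability `per_step`."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Illustrative only: 0.99 is an assumed per-step accuracy.
for steps in (1, 10, 100):
    print(steps, round(chained_reliability(0.99, steps), 3))
```

Under those assumptions, even a step that is right 99% of the time leaves you with roughly a one-in-three chance of a fully clean chain after 100 steps, which is why feeding earlier output back into later prompts is risky.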
You really need to take a 'full pitch' feed directly from the venue, rather than what is broadcast.
For now.
I was hoping for more.