I think this is an important thing to remember/consider. I can't tell you how many personal projects I've stalled on while worrying about costs, "XYZ service/platform/API is expensive", without ever considering what "expensive" actually means.
Yes, they could have used OCR/image-recognition software, but what's easier than piping an image to an API and asking it?
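For concreteness, "piping an image to an API and asking it" is roughly one function. This is a hedged sketch that only builds the request body in the shape of Anthropic's Messages API image blocks; the model id and the question are placeholders, not the post author's actual values, and actually sending it would need an API key and HTTP call.

```python
import base64


def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Build a Messages-API-style JSON body asking a question about an image.

    The structure follows Anthropic's documented image content blocks;
    the model name below is a placeholder you'd swap for a real one.
    """
    return {
        "model": "claude-3-5-sonnet-latest",  # placeholder model id
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }
```

The whole "computer vision pipeline" collapses into one JSON payload plus a one-line prompt, which is the cost/effort trade-off being discussed.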
LLMs frustrate me with their inconsistency/"fuzziness" (repeating instructions, putting them in all caps, saying "please" just rubs me the wrong way), but I know I personally have a bad habit of thinking "that would be too expensive" or "how does it scale to X" when neither the cost nor the scale would ever be a real issue in the thing I'm writing.
With that said, I wonder why they used AI at all here. Could they not have keyed off certain keywords or other information present in a screen scrape, rather than rely on Claude to parse it?
I’m reminded of the guy who set up an iPhone farm to use the iOS on-device OCR because he couldn’t find anything better.
If you literally just need to detect whether it's at the firmware splash screen or not, simply checking whether enough pixels in the image are white would detect that splash screen just fine.
Maybe it will need more complex logic/detection down the road. But it's easy and cheap OCR for now.
What was kinda funnier was that I tried to get Claude to generate its own Go client code to upload the image and run the prompt; it totally hallucinated on that part :).
And of course the brilliant use of AI and discussion of how cost-effective it is. “Hook it up to an AI to save money” is the world we can look forward to. In this case the problem is recognizing which state a thing is in from a list of known states. Once the LLM gives the state in text form, all kinds of automation are unlocked. I think that class of problem - converting a state based on an image into text form - is wildly common, and will be on the lookout for it in my own automation work!
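The "known list of states" part is what makes this class of problem automatable: if you ask the model to answer with exactly one state name, the fuzzy text reply can be normalized into something a state machine can act on. A small hypothetical sketch (the state names and the 'unknown' fallback are my invention, not from the post):

```python
# Hypothetical set of device states the prompt would list for the model.
KNOWN_STATES = {"splash", "menu", "booting", "error"}


def parse_state(llm_reply: str) -> str:
    """Map a free-text model reply onto one of the known states.

    Assumes the prompt asked the model to answer with exactly one
    state name; anything else falls back to 'unknown'.
    """
    word = llm_reply.strip().lower().strip(".!?\"'")
    return word if word in KNOWN_STATES else "unknown"
```

Once the reply is a clean enum-like string, "if state == 'splash': press_button()" style automation follows naturally.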
that's on me (author); I tried to cut the content down to a manageable post size that covered some interesting stuff - but probably dropped the connective tissue in the process. We'll keep this in mind for next time.
In this case, you’re assuming a huge number of things like infrastructure and other requirements are in place, and all of those things take a lot of time and work, if they’re even appropriate at all.
But even if that's the case, couldn't they use something like VRAM if they're running out of memory, or are we back to square one?