I am telling you this because I read between the lines that you believe current technology is a reason for you to be hopeful. Sure, it should be. But never forget, your child can do much more than you as a sighted person will ever be able to understand. Don't let them drown in your own misery. Let them discover what they can do. You will be surprised what they come up with. And don't fall for Gear Acquisition Syndrome. Sure, tools are nice, and they do get better, which is also nice. I LOVE vision models, to stay on topic somehow. However, I still leave my house with only a cane and my phone in my pocket. I do occasionally ask Siri "Where am I" to get an address if I happen to have forgotten where exactly I am at the moment. But at the end of the day, my cane is what shows me the way. Most tech is hype; plain old hearing and your sense of touch get you much farther than you might think.
Wish you all the best for your own journey, and the development of your child.
BUT NOW... THE FUTURE IS HERE.... an all-knowing god-like cell phone can tell these poor miserable individuals what the objects in their own homes are! No more tragic Mr. Magoo-ian accidents!
But thank you for posting this; it certainly enlightened me! I'll admit, all these AI solutions
Opened an issue for them to confirm this: https://github.com/apple/ml-fastvlm/issues/7
Especially if the API gives app developers a way to load their custom LoRA fine-tunes onto OS-standard foundation models at runtime, then you can (ideally) have the best of both worlds -- fine-tuned app-specific models with reasonable app sizes.
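For anyone unfamiliar with why this keeps app sizes reasonable: a LoRA adapter is just two small low-rank matrices added on top of a frozen base weight, so the app ships kilobytes while the OS ships the big model. A minimal NumPy sketch of the idea (toy sizes, not any real API):

```python
import numpy as np

# Stand-in for one frozen weight matrix of an on-device foundation model.
d_out, d_in, rank = 8, 8, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))

# A LoRA adapter is only A (rank x d_in) and B (d_out x rank):
# tiny compared to W, and loadable at runtime without touching W.
A = rng.normal(size=(rank, d_in))
B = rng.normal(size=(d_out, rank))
alpha = 1.0  # scaling hyperparameter

def forward(x, adapter=None):
    """Base forward pass, with an optional adapter applied at runtime."""
    y = W @ x
    if adapter is not None:
        B_, A_, scale = adapter
        # Low-rank update: the base weights stay frozen and shared.
        y = y + scale * (B_ @ (A_ @ x))
    return y

x = rng.normal(size=d_in)
base_out = forward(x)                               # OS-standard behavior
tuned_out = forward(x, adapter=(B, A, alpha / rank))  # app-specific behavior
```

The adapter changes the model's behavior without duplicating or modifying the base weights, which is exactly the "best of both worlds" trade-off.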
I'd like to make a private Qwen or similar for my kids to prompt with a button and voice control. It doesn't need vision... Although eventually that'd be very cool.
Siri just sucks.
We might not be there yet...
However, if you're looking for instruction following (like an agent), I've tried to implement my own agent and have lost faith. Even GPT-4.1 will regularly gaslight me that no, it definitely ran the tool call to add the event to my calendar, when it just didn't. I can't get any more adherence out of it.
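The only thing that has helped me is refusing to take the transcript's word for it: have the app record every real tool invocation and check that log, instead of trusting the model's claim that it ran. A minimal sketch (all names hypothetical):

```python
class ToolLog:
    """The app, not the model, is the source of truth about what ran."""

    def __init__(self):
        self.calls = []

    def record(self, name, **kwargs):
        self.calls.append((name, kwargs))

    def ran(self, name):
        return any(n == name for n, _ in self.calls)


log = ToolLog()

def add_calendar_event(title, when):
    # Hypothetical tool; a real one would hit a calendar API.
    log.record("add_calendar_event", title=title, when=when)
    return {"status": "created", "title": title}

# The model *claims* it added the event, but the log says otherwise.
model_claims_done = True
if model_claims_done and not log.ran("add_calendar_event"):
    # Retry (or surface an error) instead of trusting the claim.
    add_calendar_event("Dentist", "2025-06-01 09:00")
```

It doesn't stop the model from lying, but it turns a silent failure into something the app can detect and retry.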
cool to see people doing stuff with smaller models
"There's a tree. There's a tree. There's a tree. There's a number of pedestrians. There's a tree. There's a sign." does not strike me as useful feedback for getting around.
"Pavement. Row of stores to the left. Joe's Grocery Store. Doors. Door handle. A shelf with baked goods. A shelf with canned goods. A shelf with bottles. Coke bottle. Large Pepsi bottle. Apple juice bottle. Passageway. Checkout. Payment terminal. Door. Door handle. Pavement. ..."
Tesseract is awful for handwriting.