Yes, good point, JS/TS is definitely behind Python. That might explain some of it.
I expect most models to become multimodal in the future and am building towards that. A lot of the core logic of agents will nevertheless be text-based imo, so that’s a central piece, but I already added text-to-image and speech-to-text, and plan to add text-to-speech next.