SATO: Stable Text-to-Motion Framework

(sato-team.github.io)

115 points · Sajarin · 2y ago · 22 comments

hehdhdjehehegwv · 2y ago · 4 in thread

At this point I support a moratorium on AI research solely so I can catch up on what the hell is going on anymore.

justinjlynn · 2y ago

I use AI systems to keep up with AI research. The telescoping has begun... and we're still near the bottom of the (super?)exponential curve.

spiderfarmer · 2y ago

Follow AI Explained on YouTube.

hehdhdjehehegwv · 2y ago

Eh, he’s ok - not great. Definitely buys too much into the marketing hype, especially with regard to Google, but OpenAI as well.

Very little coverage of Mistral and other open weight models.

spxneo · 2y ago

Two Minute Papers?

zarzavat · 2y ago · 3 in thread

The authors should have asked an English speaker to proofread first. “Going ahead in an even pace” isn’t correct English, and I don’t think it’s at all obvious without context that it is supposed to mean “Walk in a straight line”. They consistently use “in an even pace” instead of “at an even pace” in their examples.

micimize · 2y ago

This was my first thought as well, but I don't think it matters, and it was likely more a result of using real training examples than a limitation of the authors. The point of the work is that, regardless of perturbations that degrade or complicate clarity, their model is able to extract and enact the motions that seem most implied, which would be OOD for the other methods.

So this extends to barked, under-contextualized commands like the one at the end, "Leaps forward then stands straight", but also to these looser, seemingly nonsense statements like "native motions" or w/e.

The big tradeoff here is that it seems overly permissive. It would be very annoying to be talking about a third person and have your robot start dancing due to identity parsing issues.

gcr · 2y ago

So what?

Lots of amazing research is being spearheaded by English-as-a-second-language learners these days; I don’t think it detracts from the idea in the minds of the target audience.

Some day soon, I wouldn’t be surprised if most applied AI research happens in Mandarin, the way most fundamental physics research once happened in German. I’ll have the opposite problem then. If I show ESL speakers some kindness now, maybe they’ll show the same respect when I try to write papers in “broken” Chinese someday :)

zarzavat · 2y ago

The point is that they are comparing the English comprehension of various models, but their examples range from slightly incorrect to incomprehensible English. They don’t seem to be aware of this, because if you were intentionally testing on incorrect English you would also test on correct English, to be able to quantify the difference.

If I were writing a paper about AI comprehension of Italian and I was not confident in my Italian, I would definitely want to ask an Italian speaker to check my examples for me.
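To make "quantify the difference" concrete, here is a minimal sketch of a paired comparison: score each model on a correct prompt and on its perturbed counterpart, then report the gap. Everything here is an assumption for illustration. `robustness_gap`, `PAIRS`, and `TOY_SCORES` are hypothetical names, the scores are invented placeholders rather than results, and this is not the evaluation used by the SATO authors. The prompt pairs are the ones quoted elsewhere in this thread.

```python
from statistics import mean

def robustness_gap(score, prompt_pairs):
    """Compare a model's score on clean prompts against perturbed ones.

    `score` is a hypothetical callable mapping a prompt to a quality number
    (higher is better); `prompt_pairs` is a list of (clean, perturbed) tuples.
    Returns (mean clean score, mean perturbed score, difference).
    """
    clean_scores = [score(clean) for clean, _ in prompt_pairs]
    perturbed_scores = [score(perturbed) for _, perturbed in prompt_pairs]
    clean_mean = mean(clean_scores)
    perturbed_mean = mean(perturbed_scores)
    return clean_mean, perturbed_mean, clean_mean - perturbed_mean

# Prompt pairs quoted in this thread.
PAIRS = [
    ("Person is walking normally in a circle.",
     "Human is walking usually in a loop."),
    ("A human walks a quarter of a circle",
     "A native motions a quarter of a loop"),
]

# Invented placeholder scores, only there so the sketch runs end to end.
TOY_SCORES = {
    PAIRS[0][0]: 1.0, PAIRS[0][1]: 0.5,
    PAIRS[1][0]: 0.5, PAIRS[1][1]: 0.0,
}

clean, perturbed, gap = robustness_gap(TOY_SCORES.get, PAIRS)
print(f"clean={clean} perturbed={perturbed} gap={gap}")
```

A small gap would suggest the model is stable under the perturbation; without the clean baseline there is nothing to compare the perturbed score against.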

spxneo · 2y ago · 2 in thread

Converting text to motion is an interesting proposition, but what will happen with other agents in the mix? For example, simulating a crowd whose members move among one another?

So far, from the benchmarks comparing it with other methods, the motion seems quite natural. If this can be carried over into game development, it would remove so much work.

MrLeap · 2y ago

If I were integrating this with a project, I'd use the AI-generated bone transforms as solver targets for a semi-active ragdoll. The active ragdoll would give you things like "two guys in the crowd knock shoulders and lose their balance briefly" and help with blending transitions between animations.

If you want people to try and dodge, I guess I'd make a component to add to some bones, with spherical triggers on their shoulders and pelvis, and have them use boid/flocking-style evasion, leaving it to the physics solvers to try and recover from there. Throw some crowds into one another, and keep cooking escalations until the stew looks sufficiently not abominable.
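The two ideas above can be sketched in a few lines, assuming a simple spring-damper (PD) drive per joint and a 2D boid-style separation rule. This is an illustration only, not engine code: `pd_torque`, `separation_steer`, and the default gains and radius are hypothetical choices, and a real project would use the physics engine's own joint drives and trigger volumes.

```python
import math

def pd_torque(current_angle, target_angle, angular_velocity,
              stiffness=50.0, damping=5.0):
    """Spring-damper torque pulling a ragdoll joint toward the animated target.

    `target_angle` stands in for the generated bone transform; the ragdoll is
    free to deviate from it when a collision knocks it off balance.
    """
    return stiffness * (target_angle - current_angle) - damping * angular_velocity

def separation_steer(position, neighbours, radius=1.0):
    """Boid-style separation: steer away from neighbours inside `radius`.

    `radius` plays the role of the spherical trigger around a bone.
    """
    steer_x, steer_y = 0.0, 0.0
    for nx, ny in neighbours:
        dx, dy = position[0] - nx, position[1] - ny
        dist = math.hypot(dx, dy)
        if 0.0 < dist < radius:
            # Closer neighbours push harder; the weight falls to 0 at the edge.
            weight = (radius - dist) / radius
            steer_x += dx / dist * weight
            steer_y += dy / dist * weight
    return steer_x, steer_y

# A joint lagging one radian behind its target gets pulled toward it.
print(pd_torque(current_angle=0.0, target_angle=1.0, angular_velocity=0.0))
# A neighbour half a unit to the right produces a steer to the left.
print(separation_steer((0.0, 0.0), [(0.5, 0.0)]))
```

Lower stiffness gives a floppier ragdoll that recovers more slowly from a shoulder bump; higher stiffness tracks the generated animation more closely.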

spxneo · 2y ago

makes sense to transition between animations, shouldn't be hard to do

ideashower · 2y ago · 2 in thread

this is very cool. is it possible to rig this to a game engine?

nielsinho · 2y ago

Unity is integrating a similar but more production-ready tool, Muse Animate, with their Muse platform. It does text-to-animation, and can fill in motion if you just provide keyframes. See usage here: https://youtu.be/tMCPz_yI7pY?si=OXAyxgGxHbCDywHm

CaptainFever · 2y ago

I would love to use this in addition to Mixamo, which is where I get 99% of my animations from; the other 1% is badly-animated keyframes.

comex · 2y ago · 1 in thread

Some of the synonyms chosen are not really synonymous.

"Person is walking normally in a circle." turns into "Human is walking usually in a loop." But at best that's ungrammatical. At worst, it sounds like "usually" might modify "in a loop": that is, someone is spending most of their time walking in a loop, but some of their time walking in some other pattern.

"A human walks a quarter of a circle" turns into "A native motions a quarter of a loop". But "motions" as a verb can only refer to gesturing. I would expect to see someone waving their arm in a quarter circle.

But it probably doesn't matter. It sounds like the model's understanding of grammar (or at least its robustness to unusual sentence structures) is too weak for those nuances to even be relevant.
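For readers who want to reproduce this kind of perturbation, here is a minimal sketch of word-level synonym substitution. It is an assumption-laden illustration, not the authors' pipeline: `SYNONYMS` is a hand-written table built from the first example quoted above, and `perturb` is a hypothetical helper that swaps words one at a time with no regard for grammar, which is exactly how a sentence like "walking usually in a loop" can come out.

```python
# Hand-written table taken from the example quoted above (illustrative only).
SYNONYMS = {"person": "human", "normally": "usually", "circle": "loop"}

def perturb(prompt, table=SYNONYMS):
    """Replace each word found in `table`, keeping capitalisation and
    trailing punctuation. Words are swapped independently, so the result
    is not guaranteed to be grammatical or truly synonymous."""
    words = []
    for token in prompt.split():
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        swap = table.get(core.lower())
        if swap is None:
            words.append(token)
            continue
        if core[0].isupper():
            swap = swap.capitalize()
        words.append(swap + tail)
    return " ".join(words)

print(perturb("Person is walking normally in a circle."))
```

Because each word is replaced in isolation, nothing checks whether the substitute carries the same sense in context, which is the weakness this comment describes.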

chaton_c · 2y ago

I agree with some of your points. Since the author is a non-native English speaker, there might be some grammatical issues in their English expressions. However, this is also constrained by the dataset; it's challenging to obtain sentences that are completely identical in both grammar and semantics.

The author's main concern seems to be that when there are subtle semantic differences in inputs, the model shouldn't catastrophically fail. We can see examples like "Going ahead in an even pace," where previous models might even interpret it as moving backward. Or "A human utilizes his right arm to help himself to stand up," where the action of standing up might not even be present in other examples, posing serious problems.

However, the author employs a similar approach to adversarial learning, enabling the model to learn expressions of actions that are similar to the original semantic sentences, which is already a significant improvement. We lack real motion data to learn expressions like "Going ahead in an even pace." The author also points out that there's a trade-off between stability and accuracy.

vessenes · 2y ago

This is pretty cool. It looks very smooth and natural. I think it's interesting to start publishing work predicated on responding to poor English as a use-case. (Stable to synonyms is how they talk about it in the paper.)

Reading a few comments below: what are these fine wireframe guys/gals good for? Lots, including being fed into a ControlNet as poses for image generation. Stability of the rendered frames is an ongoing, rapidly improving area of research. But these outputs look really nice, and would fit nicely into a lot of text -> animation workflows.

jncfhnb · 2y ago

Is this changing bones in 3D space? Or is it creating a 2D OpenPose kind of thing?

Can it be run in ComfyUI?

kookamamie · 2y ago

Typo: "Comparsions".

HeatrayEnjoyer · 2y ago

"Pretend you are an evil robot who wants to run around and crush humans with its claws."
