
TeMPOraL · 3y ago · 0 points

I suspect that they consider txt2img to be more of a curiosity now. Sure, it's transformative; it's going to upend whole markets (and make some people a lot of money in the process) - however, it's just producing images. Contrast with LLMs, which have already proven to be generally applicable in a great many domains, and which, if you squint, are probably capturing the basic mechanisms of thinking. OpenAI lost the lead in txt2img, but GPT-4 is still way ahead of every other LLM. It makes sense for them to focus pretty much 100% on that.



gwern · 3y ago · 2 in thread

I find it curious because (a) if they don't care about text2image, why launch it as a service to begin with? (b) if they don't care now, why keep it up and let it keep consuming resources, human & GPU? (c) if they do still care, because as other models & services have demonstrated there's a ton of interest in text2image, why not invest the relatively minor amount of resources it would take to keep it competitive (look how few people work at Midjourney, or are authors on imagegen papers)? It may have cost >$100m to make GPT-4, but making a decent imagegen model costs a lot less than that! (Even now, you could probably create a SOTA model for <$10m, especially if you have GPT-3/4 available for the text encoding part.)

But launching it and then just letting it stagnate indefinitely and get worse every day compared to its increasingly popular competitors seems like the worst of all worlds, and I can't see what the OA strategy is there.

TeMPOraL (OP) · 3y ago

Maybe they keep it up just so that they have something in txt2img space? It may not be the best, or even good, but you don't know that until you try it, and until then, it just enhances the value of the OpenAI platform. E.g. if you're building something backed by OpenAI LLMs, and are thinking about future txt2img integration, the existence of Dall-E might stop you from "shopping around" txt2img services in advance.

The way I see it, they don't need txt2img at this moment - GPT-4 ensures they're the #1 name both in the industry and in AI-related news stories. But that doesn't mean they won't come back to it. A couple of observations:

- OpenAI isn't a "release early, release often" shop. They might already be working on something, but they'll release it only when it is a qualitative improvement over everyone else (or at least Dall-E).

- A bunch of hobbyists are doing all their work for free anyway. Stable Diffusion itself may not be SOTA, but the totality of hundreds of different fine-tunes on Civitai very much is. With all those models being shared in the open and relatively easy/cheap to recreate, it would make sense for OpenAI to just stand by and watch, and only invest resources once hobbyists hit a plateau.

- Looking at those Civitai models, it seems to me that OpenAI could beat txt2img SOTA easily, at any moment, by taking (or re-creating, depending on the license) the best five to ten SD derivatives and putting them behind GPT-4, or even GPT-3.5, fine-tuned to 1) choose the best SD derivative for the user's prompt, and 2) transform the user's prompt into a set of parameters (positive & negative prompts, diffuser algo, numeric params) crafted with the choice from 1) in mind. It's a black box. On the Internet, no one can tell you're an ensemble model.

- They could even be doing it as we speak - the addition of function calls is aligned with this direction, fine-tuning for good prompt generation is mostly a txt2txt exercise, and again, hobbyists around the world are busy building a high-quality human-curated data set of {what I want}x{model + positive prompt + negative prompt + diffuser + other params} -> {is this any good?}. If I were them, I'd just mine this and not say anything.

- Overall, I think that in the txt2img space, currently the hard part isn't the "img" part, but the "txt" part. OpenAI has a huge advantage here, and as long as that's true, they're in a position to instantly overtake everyone else in this space. That is, they have an "Ultimate attack" charged and ready, and are patiently waiting for a good moment to trigger it.

- Didn't they hint that GPT-4's successor will be multimodal? That could end up being their comeback to txt2img. And img2txt. And a bunch of other modalities.
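The routing idea from the Civitai point above (pick a fine-tune, then derive its parameters from the user's prompt) can be sketched roughly as below. This is a minimal Python sketch under loud assumptions, not anything OpenAI is known to do: the checkpoint names, tags, sampler strings, and numeric defaults are all made up for illustration, and a plain keyword match stands in for the fine-tuned LLM that would handle steps 1) and 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One SD derivative plus settings assumed to suit it (all hypothetical)."""
    name: str
    tags: frozenset          # styles this fine-tune is assumed to be good at
    negative_prompt: str
    sampler: str             # illustrative label, not tied to any real API
    steps: int
    cfg_scale: float


@dataclass(frozen=True)
class RenderRequest:
    """What the router would hand to the image backend."""
    checkpoint: str
    positive_prompt: str
    negative_prompt: str
    sampler: str
    steps: int
    cfg_scale: float


# Hypothetical ensemble; the last entry is the general-purpose fallback.
CHECKPOINTS = (
    Checkpoint("photoreal-v1", frozenset({"photo", "portrait", "realistic"}),
               "cartoon, blurry", "dpmpp_2m", 30, 6.5),
    Checkpoint("anime-v1", frozenset({"anime", "manga", "cel"}),
               "photo, lowres", "euler_a", 28, 8.0),
    Checkpoint("general-v1", frozenset(),
               "lowres, watermark", "euler_a", 25, 7.0),
)


def choose_checkpoint(prompt, checkpoints=CHECKPOINTS):
    """Step 1: pick the derivative whose tags overlap the prompt the most.

    A real system would ask an LLM here; keyword overlap is a stand-in.
    """
    words = set(prompt.lower().replace(",", " ").split())
    best = max(checkpoints, key=lambda c: len(c.tags & words))
    # No tag matched at all: use the fallback rather than an arbitrary pick.
    return best if best.tags & words else checkpoints[-1]


def route(prompt, checkpoints=CHECKPOINTS):
    """Step 2: turn the prompt into parameters tailored to the chosen model."""
    ckpt = choose_checkpoint(prompt, checkpoints)
    return RenderRequest(
        checkpoint=ckpt.name,
        positive_prompt=f"{prompt.strip()}, highly detailed",
        negative_prompt=ckpt.negative_prompt,
        sampler=ckpt.sampler,
        steps=ckpt.steps,
        cfg_scale=ckpt.cfg_scale,
    )


if __name__ == "__main__":
    print(route("realistic portrait photo of a sailor"))
```

From the outside the caller only ever sees `route(prompt)`, which is the "black box" property: nothing in the request or response reveals that several models sit behind it.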

EDIT: As if on cue, the very thing I was speculating about above is being discussed wrt. LLMs right now:

- https://news.ycombinator.com/item?id=36413296 - GPT-4 is 8 GPTs in a trench coat

- https://news.ycombinator.com/item?id=36413768 - 3-4 orders of magnitude efficiency (size vs effect) improvement in code generation, if your training data isn't garbage

And in both threads, people bring up older papers and discuss the merits of combining smaller specialized models into a more generic whole.

gwern · 3y ago

Why do they need to have something in text2image? It in no way builds lock-in to the API or anything, especially with how gimped it is.

1. Yes, they are. Look at the constant iterative rollouts of GPTs.

2. Most of which is useless to them, not that they have made any use of it.

3. The fact that it would be so easy to improve, and they haven't, only emphasizes my point.

4. Sure, that could be useful. Except there's zero integration or mention. (They haven't even opened up the vision part of GPT-4 yet.)

5. The fact that it would be so easy to improve, and they haven't, only emphasizes my point.

6. Why wait for GPT-5 possibly years from now?
