In other text-to-image algorithms I'm familiar with (the ones you'll typically see passed around as colab notebooks that people post outputs from on Twitter), the basic idea is to encode the text, and then try to make an image that maximally matches that text encoding. But this maximization often leads to artifacts - if you ask for an image of a sunset, you'll often get multiple suns, because that's even more sunset-like. There are a lot of tricks and hacks to regularize the process so that it's not so aggressive, but it's always an uphill battle.
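The core loop of those notebooks is roughly this (a toy sketch with a made-up stand-in for the similarity model, not any particular notebook's code):

```python
import numpy as np

def clip_similarity(image, text_embedding):
    # Toy stand-in for CLIP: mean-pool the image into a 3-vector
    # "embedding" and compare it to the text embedding via dot product.
    return float(image.mean(axis=(0, 1)) @ text_embedding)

def maximize_image(image, text_embedding, steps=50, lr=0.5):
    # Gradient ascent on the similarity score. For this linear toy
    # model the gradient at every pixel is just the text embedding
    # (scaled by the mean-pooling); a real notebook backprops through
    # CLIP instead.
    grad = np.broadcast_to(text_embedding, image.shape) / image[..., 0].size
    for _ in range(steps):
        image = image + lr * grad
    return image

sunset = np.array([1.0, 0.5, -0.5])  # hypothetical "sunset" embedding
start = np.random.default_rng(0).normal(size=(8, 8, 3))
result = maximize_image(start, sunset)
```

Nothing in the loop rewards plausibility, only similarity, which is why the optimizer happily adds a second sun if that raises the score.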
Here, they instead take the text embedding, use a trained model (what they call the 'prior') to predict the corresponding image embedding - this removes the dangerous maximization. Then, another trained model (the 'decoder') produces images from the predicted embedding.
This feels like a much more sensible approach, but one that is only really possible with access to the giant CLIP dataset and computational resources that OpenAI has.
But there's no real rhyme or reason, it is a sort of alchemy.
Is text encoding strictly worse or is it an artifact of the implementation? And if it is strictly worse, which is probably the case, why specifically? What is actually going on here?
I can't argue that their results are not visually pleasing. But I'm not sure what one can really infer from all of this once the excitement washes over you.
Blending photos together in a scene in photoshop is not a difficult task. It is nuanced and tedious but not hard, any pixel slinger will tell you.
An app that accepts a smattering of photos and stitches them together nicely can be coded up any number of ways. This is a fantastic and time saving photoshop plugin.
But what do we have really?
"Koala dunking basketball" needs to "understand" the separate items and select hoops and a koala from the image library where the angles and shadows roughly match.
Very interesting, potentially useful. But if it doesn't spit out exactly what you want, you can't edit it further.
I think the next step has got to be that it conjures up a 3d scene in Unreal or blender so you can zoom in and around convincingly for further tweaks. Not a flat image.
I'm not sure if I'm speaking clearly; I just don't understand what the difference is between training "text encoding to an image" vs. "text embedding to image embedding". In both cases you have some kind of "sunset" on the left (even though it's obviously just a dot in a multi-dimensional space, not the letters), and you try to maximize it when training the model to get either an image embedding or an image straight away.
I think that using something like this for porn could potentially offer the biggest benefit to society. So much has been said about how this industry exploits young and vulnerable models. Cheap autogenerated images (and in the future videos) would pretty much remove the demand for human models and eliminate the related suffering, no?
EDIT: typo
It's almost impossible to even give an affirmative answer to that question without making yourself a target. And as much as I err on the side of creator freedom, I find myself shying away from saying yes without qualifications.
And if you don't allow cp, then by definition you require some censoring. At that point it's just a matter of where you censor, not whether. OpenAI has gone as far as possible on the censorship, reducing the impact of the model to "something that can make people smile." But it's sort of hard to blame them, if they want to focus on making models rather than fighting political battles.
One could imagine a cyberpunk future where seedy AI cp images are swapped in an AR universe, generated by models ran by underground hackers that scrounge together what resources they can to power the behemoth models that they stole via hacks. Probably worth a short story at least.
You could make the argument that we have fine laws around porn right now, and that we should simply follow those. But it's not clear that AI generated imagery can be illegal at all. The question will only become more pressing with time, and society has to solve it before it can address the holistic concerns you point out.
OpenAI ain't gonna fight that fight, so it's up to EleutherAI or someone else. But whoever fights it in the affirmative will probably be vilified, so it'd require an impressive level of selflessness.
That doesn't mean that it's all bad, and that there's no recreational use for it. We have limits on the availability of various other artificial stimulants. We should continue to have limits on the availability of porn. Where to draw that line is a real debate.
[1] https://en.wikipedia.org/wiki/Wirehead_(science_fiction)
This author's books are great at putting these sorts of moral ideas to the test in a sci-fi context. This specific tome portrays virtual wars and virtual "hells", the hope being that this is more civilized than waging real war or torturing real living entities. However some protagonists argue that virtual life is indistinguishable from real life, and so sacrificing virtual entities to save "real" ones is a fallacy.
Or some such, it's been a while.
If people are exposed to stimuli, they will pursue increasingly stimulating versions of it. I.e., if they see artificial CP, they will often begin to become desensitized (habituated) and pursue real CP or even live children thereafter.
Conversely, if people are not exposed to certain stimuli, they will never be able to conceptualize them, and thus will be unable to think about them.
Obviously you cannot eliminate all CP but minimizing the overall levels of exposure / ease of access to these kinds of things is way more appropriate than maximizing it.
I can get why the people who worked hard on it and spent money building it don't want to be associated with porn.
I'm not saying that AI will pass all Turing tests. But it doesn't need to, as far as having a virtual girlfriend/prostitute goes.
* I recommend reading the Risks and Limitations section that came with it because it's very thorough: https://github.com/openai/dalle-2-preview/blob/main/system-c...
* Unlike GPT-3, my read of this announcement is that OpenAI does not intend to commercialize it, and that access to the waitlist is indeed more for testing its limits (and as noted, commercializing it would make it much more likely to lead to interesting legal precedent). Per the docs, access is very explicitly limited: (https://github.com/openai/dalle-2-preview/blob/main/system-c... )
* A few months ago, OpenAI released GLIDE ( https://github.com/openai/glide-text2im ) which uses a similar approach to AI image generation, but suspiciously never received a fun blog post like this one. The reason for that in retrospect may be "because we made it obsolete."
* The images in the announcement are still cherry-picked, which is presumably why they also compared DALL-E 1 vs. DALL-E 2 on non-cherry-picked images.
* Cherry-picking is relevant because AI image generation is still slow unless you do real shenanigans that likely compromise image quality, although OpenAI likely has better infra for handling large models, as they have demonstrated with GPT-3.
* It appears DALL-E 2 has a fun endpoint that links back to the site for examples with attribution: https://labs.openai.com/s/Zq9SB6vyUid9FGcoJ8slucTu
Maybe give it another five years, a few more $billion and a few more petabytes/flops and it will be good. Then finally everyone can generate art for their own Magic: the Gathering cards.
(That's the end goal, right?)
An example off the top of my head: this could be used as advertising or recruitment for controversial organizations or causes. Would it be wrong for the USA to use this for military recruitment? Israel? Ukraine? Russia?
Another example: this could be used to glorify and reinforce actions which our society does not consider immoral but other societies - or our own future society - will. It wasn't long ago that the US and Europe did a full 180 on their treatment of homosexuality. Will we eventually change our minds about eating meat, driving cars, etc?
Have they gone too far in a desperate bid to prevent the AI from being capable of harm? Have they not gone far enough? I don't know. If I were that worried about something being misused, I don't think I could ever bring myself to work on it in the first place. But I suppose the onward march of technology is inevitable.
GLID-3: https://colab.research.google.com/drive/1x4p2PokZ3XznBn35Q5B...
and a new Latent Diffusion notebook: https://colab.research.google.com/github/multimodalart/laten...
have both appeared recently and are getting remarkably close to the original Dall-E (maybe better as I can't test the real thing...)
So - this was pretty good timing if OpenAI want to appear to be ahead of the pack. Of course I'd always pick a model I can actually use over a better one I'm not allowed to...
glid-3 is a relatively small model trained by a single guy on his workstation (aka me) so it's not going to be as good. It's also not fully baked yet so ymmv, although it really depends on the prompt. The new latent diffusion model is really amazing though and is much closer to DALLE-2 for 256px images.
I think the open source community will rapidly catch up with Openai in the coming months. The data, code and compute are all there to train a model of similar size and quality.
OpenAI has a low resolution checkpoint for similar functionality as this - called GLIDE - and the output is super boring compared to community driven efforts, in large part because of similar dataset restrictions as this likely has been subjected to.
I don't see a run button?
Oh.. maybe "Runtime -> Run All" from the menu ...
Shows me a spinning circle around "Download model" ...
26% ...
Fascinating, that Google offers you a computer in the cloud for free ..
Now it is running the model. Wow, I'm curious ..
Ha, it worked!
Nothing compared to the images in the Dall-E 2 article but still impressive.
For the first category, Dall-E 2 and Codex are promising but not there yet. It's not clear how long it'll take them to reach the point where you no longer need people. I'm guessing 2-4 years but the last bits can be the hardest.
As for the second category, we are not there yet. Self-driving cars/planes, and lots of other automation, will be here and mature way before an AI can read and communicate through emails, understand project scope and then execute. Also lots of harmonization will have to take place in the information we exchange: emails, docs, chats, code, etc... That is, unless the AI is able to open a browser and type an address itself.
It's important to note that we still need professionals to guarantee the quality of the output from AIs, including this one. As noted in their issue tracker, DALL-E has very specific limitations, but these can be easily solved by employing dedicated professionals, who are trained to tame the AI and properly finish the raw output.
So, if I were running OpenAI, I'd clearly be experimenting with how their AIs and humans interact, and building a training program around it for producing practical outputs. (Actually, I work in consumer robotics, and human adoption has been the biggest hurdle there. Thus, my claim here.)
--
In the case of fine art, though, I don't think they'll get hit by this AI advancement. The biggest problem is that you simply can't get the exact image you want with this AI. Even humans cannot transfer visual information in verbal form without a significant loss of detail, and thus a loss of quality. It's the same with AI, but worse, because the AI relies on the bias in a specific set of training data, and it never truly understands the human context in it (at the current level of technology).
Additionally, the rise of no-code development is just extending the functionality of designers. I didn't take design seriously (as a career choice) growing up because I didn't see a future in it, now it pays my bills and the demand for my services just grows by the day.
Similar argument to make with chess AI: it didn't make chess players obsolete, it made them stronger than ever.
[0] Which is why releasing your code is so beneficial.
Using something like this could really help automate or at least kickstart the more mundane parts of content creation. (At least when you are using high resolution, true color imagery.)
There are some 3D image generation techniques, but they aren't based on polygonal modeling, so 3D artists are safe for now.
Preventing Harmful Generations

We’ve limited the ability for DALL·E 2 to generate violent, hate, or adult images. By removing the most explicit content from the training data, we minimized DALL·E 2’s exposure to these concepts. We also used advanced techniques to prevent photorealistic generations of real individuals’ faces, including those of public figures.
"And we've also closed off a huge range of potentially interesting work as a result"

I can't help but feel a lot of the safeguarding is more about preventing bad PR than anything. I wish I could have a version with the training wheels taken off. And there's enough other models out there without restriction that the stories about "misuse of AI" will still circulate.
(side note - I've been on HN for years and I still can't figure out how to format text as a quote.)
It's their service, their call.
I have some hobby projects, almost nobody uses them, but you bet I'll shut stuff down if I felt something bad was happening, being used to harass someone, etc. NOT "because bad PR" but because I genuinely don't want to be a part of that.
If you want some images / art made, don't expect someone else to make them for you. Get your own art supplies and get to work.
It makes me wonder what they're planning to do with this? If they're deliberately restricting the training data, it means their goal isn't to make the best AI they possibly can. They probably have some commercial applications in mind where violent/hateful/adult content wouldn't be beneficial. Children's books? Stock photos? Mainstream entertainment is definitely out. I could see a tool like this being useful during pre-production of films and games, but an AI that can't generate violent/adult content wouldn't be all that useful in those industries.
I don't think there is a way comparable to markdown, since the formatting options are limited: https://news.ycombinator.com/formatdoc
So your options are literal quotes, "code" formatting like you've done, italics like I've done, or the '>' convention, but that doesn't actually apply formatting. Would be nice if it were added.
This is exactly the sort of thing that gets a company mired in legal issues, vilified in the media, and shut down. I can not blame them for avoiding that potential minefield.
But at least we can get another billion meme'd comics with apes wearing sunglasses, so that's good news, right?
It's just soul-crushing that all the modern, brilliant engineering is driven by abysmal, not even high-school art-class grade aesthetics and crowd-pleasing ethics that are built around the idea of not disturbing some 1000 very vocal twitter users.
Death of culture really.
Companies like OpenAI have a responsibility to society. Imagine the prompt “A photorealistic Joe Biden killing a priest”. If you asked an artist to do the same they might say no. Adding guardrails to a machine that can’t make ethical decisions is a good thing.
Their document about all the measures they took to prevent unethical use is also a document about how to use a re-implementation of their system unethically. They literally hired a "red team" of smart people to come up with the most dangerous ideas for misusing their system (or a re-implementation of it), and featured these bad ideas prominently in a very accessibly written document on their website. So many fascinating terrible ideas in there! They make a very compelling case that the technology they are developing has way more potential for societal harm than good. They had me sold at "Prompt: Park bench with happy people. + Context: Sharing as part of a disinformation campaign to contradict reports of a military operation in the park."
That's no hot take. It's literally the reason.
This doorway is downright impossible https://cdn.openai.com/dall-e-2/demos/variations/modified/fl...
At this point, it still seems like it's pushing pixels around until it's "good enough" when you squint at it.
Some of the images also hit me with a creep factor, like the bears on the corgis in the art gallery, but that may be only because I know it's AI generated.
The nature of creative work will certainly change, creatives will adopt tools such as Dall-E 2. In certain narrow cases they might be replaced, such as if you are asking a creative to generate a very specific image, but how often is that the case? The majority of the time tools such as Dall-E 2 will act as an accelerator for creatives and help them increase their output.
I think art will survive, just like photography didn't kill painting. The idea of art might simply begin to encompass this new means of production, which no longer requires the steady hand but still requires a discerning eye. Sure, we might say that the "artist" is simply a curator, picking which algorithmic output is most worthy of display, but these distinctions have historically been fluid, and challenging ideas of art has long been one of art's functions as well.
Jumping out of the conceptual box to generate novel PURPOSE is not the domain of a Dall-E 2. You've still gotta ask it for things. It's a paintbrush. Without a coherent story, it's an increasingly impressive stunt (or a form of very sophisticated 'retouching brush').
If you can imagine better than the next guy, Dall-E 2 is your new tool for expression. But what is 'better'?
Even if an AI could generate an exactly equivalent painting, I would pay $0 for it. It wouldn't mean anything to me.
But...that's always been the case for creatives.
> for all but the most famous
OK DALL-E, generate our logo in the style of ${most famous}

40+ years ago, it was hard to access the equipment necessary to learn music production, so only a small slice of the population was able to learn these skills. And the lack of availability made the process take years.
Today, you can download free software that enables music production, and if you have a good ear, can create something "good" in weeks. This has led to an explosion of musical experimentation by the youth: a teenager can now create a great electronic dance song with devices they already own if they have the right creativity, taste and dedication.
Similarly, everyone has an imagination - many people have visual imaginations. The gating factor of art production is largely the mechanical memory of how to transform mental concepts into the right shapes and hues to express that visual concept to others.
With these sorts of tools we are going to have an explosion of art hobbyists. I've played with some similar, more primitive AI art generation tools and it is a lot of fun. People will be creating works of art from their couch while watching TV that rival the quality of what professionals are producing today.
Or when synthesizers and computer music were invented: that they would displace talented musicians who know how to play an instrument, and that now everybody without a musical education would be able to produce music, thus devaluing actual musicians.
Maybe everyone will have an AI image as their desktop wallpaper, but if you've got cash you'll want something with provenance and rarity to brag about.
Also, I think creatives are valued for their imagination. If you wanted something decent, would you pay someone to sift through a million AI generated images to find a gem, or just pay an artist you like to create one for you?
By the same logic you should also complain about any number of IDEs, development tools, WordPress, and game-maker systems like RPG Maker or Unity. After all, if anyone can just leverage a free physics and collision system without a complete understanding of rigid-body Newtonian mechanics to roll their own engine, it'll be too uniform.
First, it creates a random 10x10 pixel blurry image and asks a neural net: "Could this be a duck wearing a hat on Mars?" and the neural net replies "No, because all the pictures I've ever seen of Mars have lots of red color in them" so the system tweaks the pixels to make them more red, put some pixels in the center that have a plausible duck color, etc.
After it has a 10x10 image that is a plausible duck on Mars, the system scales the image to 20x20 pixels, and then uses 4 different neural nets on each corner to ask "Does this look like the upper/lower left/right corner of a duck wearing a hat on Mars?" Each neural net is just specialized for one corner of the image.
You keep repeating this with more neural nets until you have a pretty 1000x1000 (or whatever) image.
Here's more of a 'not 15 year old' explanation: https://ml.berkeley.edu/blog/posts/dalle2/
The system consists of a few components. First, CLIP. CLIP is essentially a pair of neural networks, one is a 'text encoder', and the other is an 'image encoder'. CLIP is trained on a giant corpus of images and corresponding captions. The image encoder takes as input an image, and spits out a numerical description of that image (called an 'encoding' or 'embedding'). The text encoder takes as input a caption and does the same. The networks are trained so that the encodings for a corresponding caption/image pair are close to each other. CLIP allows us to ask "does this image match this caption?"
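The training objective behind "close to each other" can be sketched as a symmetric cross-entropy over a batch of matched pairs. This is a simplified version of a CLIP-style contrastive loss, assuming the two encoders have already produced normalized embeddings:

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Pairwise similarities between every image and every caption in
    # the batch; matching pairs sit on the diagonal.
    logits = image_embs @ text_embs.T / temperature

    def diagonal_cross_entropy(l):
        # Softmax cross-entropy where the "correct class" for row i
        # is column i, computed in a numerically stable way.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Symmetric: image->caption and caption->image directions.
    return (diagonal_cross_entropy(logits)
            + diagonal_cross_entropy(logits.T)) / 2

# Perfectly matched embeddings should score a much lower loss than
# shuffled (mismatched) ones.
matched = np.eye(4)
shuffled = np.roll(matched, 1, axis=0)
```

Minimizing this pulls each caption's embedding toward its own image and pushes it away from every other image in the batch.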
The second part is an image generator. This is another neural network, which takes as input an encoding, and produces an image. Its goal is to be the reverse of the CLIP image encoder (they call it unCLIP). The way it works is pretty complicated. It uses a process called 'diffusion'. Imagine you started with a real image, and slowly repeatedly added noise to it, step by step. Eventually, you'd end up with an image that is pure noise. The goal of a diffusion model is to learn the reverse process - given a noisy image, produce a slightly less noisy one, until eventually you end up with a clean, realistic image. This is a funny way to do things, but it turns out to have some advantages. One advantage is that it allows the system to build up the image step by step, starting from the large scale structure and only filling in the fine details at the end. If you watch the video on their blog post, you can see this diffusion process in action. It's not just a special effect for the video - they're literally showing the system process for creating an image starting from noise. The mathematical details of how to train a diffusion system are very complicated.
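In code, the reverse process is just a loop. This is a toy sketch: the `denoise` function is a stand-in for the trained network, and a real diffusion model predicts the noise to remove rather than blending toward a known target:

```python
import numpy as np

def sample(denoise, shape, steps=50, seed=0):
    # Start from pure noise and repeatedly ask the model for a
    # slightly less noisy image.
    x = np.random.default_rng(seed).normal(size=shape)
    for t in reversed(range(steps)):
        x = denoise(x, t)  # each call strips away a little noise
    return x

# Toy "denoiser" that pulls the image toward a flat gray target; a
# trained model would instead estimate and subtract the added noise.
target = np.full((4, 4), 0.5)
out = sample(lambda x, t: x + 0.2 * (target - x), (4, 4))
```

After enough steps the pure-noise starting point has been nudged all the way to a clean image, which is exactly what the blog post's video visualizes.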
The third is a "prior" (a confusing name). Its job is to take the encoding of a text prompt, and predict the encoding of the corresponding image. You might think that this is silly - CLIP was supposed to make the encodings of the caption and the image match! But the space of images and captions is not so simple - there are many images for a given caption, and many captions for a given image. I think of the "prior" as being responsible for picking which picture of "a teddy bear on a skateboard" we're going to draw, but this is a loose analogy.
So, now it's time to make an image. We take the prompt, and ask CLIP to encode it. We give the CLIP encoding to the prior, and it predicts for us an image encoding. Then we give the image encoding to the diffusion model, and it produces an image. This is, obviously, over-simplified, but this captures the process at a high level.
Why does it work so well? A few reasons. First, CLIP is really good at its job. OpenAI scraped a colossal dataset of image/caption pairs, spent a huge amount of compute training it, and came up with a lot of clever training schemes to make it work. Second, diffusion models are really good at making realistic images - previous works have used GAN models that try to generate a whole image in one go. Some GANs are quite good, but so far diffusion seems to be better at generating images that match a prompt. The value of the image generator is that it helps constrain your output to be a realistic image. We could have just optimized raw pixels until we get something CLIP thinks looks like the prompt, but it would likely not be a natural image.
To generate an image from a prompt, DALL-E 2 works as follows. First, ask CLIP to encode your prompt. Next, ask the prior what it thinks a good image encoding would be for that encoded prompt. Then ask the generator to draw that image encoding. Easy peasy!
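That three-step flow is simple enough to write down directly. The lambdas below are made-up stand-ins just to show the data flow, not real models:

```python
def generate(prompt, text_encoder, prior, decoder):
    text_emb = text_encoder(prompt)  # 1. CLIP text encoding
    image_emb = prior(text_emb)      # 2. prior: text emb -> image emb
    return decoder(image_emb)        # 3. diffusion decoder: emb -> image

image = generate(
    "a teddy bear on a skateboard",
    text_encoder=lambda p: [float(len(p))],           # fake embedding
    prior=lambda e: [2 * e[0]],                       # fake prediction
    decoder=lambda e: f"<image from embedding {e}>",  # fake renderer
)
```

All the heavy lifting lives inside the three models; the pipeline itself is just composition.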
This and the current AI-generated art scene make it look like artwork is now a "solved" problem. See AI-generated art on Twitter etc.
There is a strong relation between the prompt and the generated images, but just like GPT-3, it fails to fully understand what was being asked. If you take the prompt out of the equation and see the generated artwork on its own, it's up to your interpretation, just like any artwork.
Creating great _art_ that Grayson Perry (for example) would recognise as such is probably AGI-complete, because it requires a deep understanding of the human condition, society, and a lot of reasoning skills.
A great artist could certainly use Dall-E 2 as part of their method, though.
For example, using DeMorgan's theorem, we can build any logic circuit out of all NAND or NOR gates:
https://www.electronics-tutorials.ws/boolean/demorgan.html
https://en.wikipedia.org/wiki/NAND_logic
https://en.wikipedia.org/wiki/NOR_logic
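The NAND construction is easy to check in a few lines of code (my own sketch, modeling each gate as a function):

```python
def nand(a, b):
    return not (a and b)

# De Morgan lets every basic gate be expressed with NAND alone:
def not_(a):
    return nand(a, a)              # !(a AND a) = !a

def and_(a, b):
    return not_(nand(a, b))        # !!(a AND b) = a AND b

def or_(a, b):
    return nand(not_(a), not_(b))  # !(!a AND !b) = a OR b

def xor_(a, b):
    # The classic four-NAND XOR construction.
    m = nand(a, b)
    return nand(nand(a, m), nand(b, m))
```

The same exercise works with NOR as the universal gate, with the roles of AND and OR swapped.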
Dall-E 2's level of associative comprehension is so far beyond the old psychology bots in the console pretending to be people, that I can't help but wonder if it's reached a level where it can make any association.
For example, I went to an AI talk about 5 years ago where the guy said that any of a dozen algorithms like K-Nearest Neighbor, K-Means Clustering, Simulated Annealing, Neural Nets, Genetic Algorithms, etc can all be adapted to any use case. They just have different strengths and weaknesses. At that time, all that really mattered was how the data was prepared.
I guess fundamentally my question is, when will AGI start to become prevalent, rather than these special-purpose tools like GPT-3 and Dall-E 2? Personally I give it less than 10 years of actual work, maybe much less. I just mean that to me, Dall-E 2 is already orders of magnitude more complex than what's required to run a basic automaton to free humans from labor. So how can we adapt these AI experiments to get real work done?
The MIT Limits to Growth study predicts the collapse of global civilization around 2040:
https://www.vice.com/amp/en/article/z3xw3x/new-research-vind...
> So how can we adapt these AI experiments to get real work done?
You're missing a step here - the difference between "imagining doing something" and "actually doing something". An ML model can produce thoughts, but that isn't necessarily the same direction of research as actually doing things in real life, much less becoming superhuman and taking over the world etc.
In your imagination, everything always goes your way.
Artificial General Intelligence
>For example, I went to an AI talk about 5 years ago where the guy said that any of a dozen algorithms like K-Nearest Neighbor, K-Means Clustering, Simulated Annealing, Neural Nets, Genetic Algorithms, etc can all be adapted to any use case. They just have different strengths and weaknesses. At that time, all that really mattered was how the data was prepared.
How do you suppose KNN is going to generate photorealistic images? I don't understand the question here
>I guess fundamentally my question is, when will AGI start to become prevalent, rather than these special-purpose tools like GPT-3 and Dall-E 2?
Actual AGI research is basically non-existent, and GPT-3/Dall-E 2 are not AGI-level tools.
>Personally I give it less than 10 years of actual work, maybe less
Lol...
>I just mean that to me, Dall-E 2 is already orders of magnitude more complex than what's required to run a basic automaton to free humans from labor.
Categorically incorrect
While technical work will always have a place -- I think that much creative work will become more like the management of a team of highly-skilled, niche workers -- with all the frustrations, joys, and surprises that entails.
The upside is that it’s more “intuitive” and requires much less detail and technique, as the AI infers the detail and technique. The downside is that it’s really hard to know what the AI will generate or get it to generate something really specific.
I believe the future will combine the heuristics of AI-generation with the specificity of traditional techniques. For example, artists may start with a rough outline of whatever they want to draw as a blob of colors (like in some AI image-generation papers). Then they can fill in details using AI prompts, but targeting localized regions/changes and adding constraints, shifting the image until it’s almost exactly what they imagined in their head.
You can definitely make them incremental. You can give it a task like "make a more accurate description from initial description and clarification". Even GPT-3-based models available today can do these tasks.
Once this is properly productionized it would be possible to implement stuff just talking with a computer.
Isn't that essentially what programming already is?
Imagine waking up and telling your (preferably locally hosted) voice assistant that today really feels like a Rembrandt day and the AI just generates new paintings for you.
Curbing Misuse

Our content policy does not allow users to generate violent, adult, or political content, among other categories. We won’t generate images if our filters identify text prompts and image uploads that may violate our policies. We also have automated and human monitoring systems to guard against misuse.
- https://github.com/openai/dalle-2-preview/blob/main/system-c...
Have a favorite painter? Here's 10,000 new paintings like theirs.
https://www.henrirousseau.net/war.jsp
However, this painting has themes of violence and politics plus some nude dead bodies, so it violates the content policy: "Our content policy does not allow users to generate violent, adult, or political content, among other categories."
So what you'd get is some kind of sanitized, watered-down, tepid version of Rousseau: the kind of boring drivel suitable for corporate lobbies everywhere, guaranteed not to offend or disturb anyone. It's difficult to find words... horrific? dystopian? atrocious? No, just no.
However, the fidelity of their music AI kinda sucks at this point, but I'm sure we'll get pitch-perfect versions of this concept as the singularity gets closer :)
Imagine not just DALL-E 2 but a single model which can be trained on different kinds of media and generate music, images, video and more.
The series:
- covers essential lessons for AI creatives of the future
- shares details on how to compete creatively in the future
- talks about how to make money through multimodal AI
- makes predictions about AI’s effects on society
- discusses, at a very basic level, the ethics of multimodal AI and the philosophy of creativity itself
By my understanding, it's the most comprehensive set of videos on this topic.
The series is free to watch entirely on YouTube: GPT-X, DALL-E, and our Multimodal Future https://www.youtube.com/playlist?list=PLza3gaByGSXjUCtIuv2x9...
From the paper:
> Limitations
>
> Although conditioning image generation on CLIP embeddings improves diversity, this choice does come with certain limitations. In particular, unCLIP [Dall-E 2] is worse at binding attributes to objects than a corresponding GLIDE model.
The binding problem is interesting. It appears that the way Dall-E 2 / CLIP embeds text leads to the concepts within the text being jumbled together. In their example "a red cube on top of a blue cube" becomes jumbled and the resulting images are essentially: "cubes, red, blue, on top". Opens a clear avenue for improvement.
Here's an example from my prompt ("a group of farmers picking lettuce in a field digital painting"): https://labs.openai.com/s/jb5pzIdTjS3AkMvmAlx69t7G
1. Deepmind, who solved Go and protein folding, and seems to be really onto something.
2. Everyone else, spending billions to build machines that draw astronauts on unicorns, and smartish bot toys.
The same technology that is drawing cute unicorns can be used for endless other use cases. Perhaps the PR side of the launch and the subject matter they show unveil their product is just that, PR.
It's like Apple's Memoji thing (not sure if I'm spelling it correctly). You can think of it as trivial and a waste of talent to use their Camera/FaceID to animate cute animals based on facial expressions, but that same tech will enable lots of other things to come.
1. step-by-step guidance for a blind person navigating the use of a public restroom.
2. an EMS AI helping you to save someone's life in an emergency.
3. an AI coach that can teach you a new sport or activity.
4. an omnipresent domain-expert that can show you how to make a gourmet meal, repair an engine, or perform a traditional tea ceremony.
5. a personal assistant that can anticipate your information need (what's that person's name? where's the exit? who's the most interesting person here? etc.) and whisper the answer in your ear just as you need it.
Now, add all of the above to an AR capability where you can now think or speak of something interesting and complex, and have it visualized right before your eyes. With this capability, I could augment my imagination with almost super-human capabilities that allow one to solve complex problems almost as if it was an internal mental monologue.
All of these scenarios are just a short hop from where we're at now, so mark my words: we will have "borgs" like those described above long before we reach anything like general AI.
For example, recent phone cameras can estimate depth per pixel from single images. Hundreds of millions of these devices are deployed. A decade ago this was AI/CV research lab stuff.
This seems to me like a big step towards AGI; a key component of consciousness seems (in my opinion) to be the ability to take words and create a mental picture of what's being described. Is that the long term goal WRT researching a model like this?
> Curbing Misuse [...]
That's great, nowadays the big AI is controlled by mostly benevolent entities. How about when someone real nasty gets a hold of it? In a decade the models anyone can download will make today's GPT-3 etc look like pong right?
Recommender systems etc are already shaping society and culture with all kinds of unintended effects. What happens when mindless optimizing models start generating the content itself?
There doesn't seem to be an equivalent movement with AI-generated art, probably because the understanding of how the models are trained from large datasets is not mainstream yet. I would imagine thousands of those same artists/consumers would be up in arms if they had a basic understanding of ML and millions of average people were beginning to feed the models their own keywords.
This I think ties in with the "responsibility" principles that OpenAI outlines. Once the generation technique has been reverse-engineered and can be used without limits, there is no way to uninvent it. It can be made illegal, but humans can always find a way around laws if they want something badly enough. This could have drastic consequences if enough artists believe that the training violates their respect or other intangible humanistic qualities. With technological advancement that can never be put back in the bottle and spreads to occupy the entire consciousness of the Internet, their options for recourse will be far different than being able to tell a single fringe art group siphoning others' content to pack up and leave.
https://news.ycombinator.com/item?id=30931614
I point this out because while Dall-E 2 seems interesting (I'm out of my depth, so delegating to the conversation taking place here), the timing of its release as well as accompanying press blasts within the last hour from sites like TheVerge—verified via wayback machine queries and time-restricted googling—seems both noteworthy and worth a deeper conversation given what was just published about Worldcoin.
To be clear, it's worth asking if Dall-E 2 was published ahead of schedule without an actual product release (only a waitlist) to potentially move the spotlight away from Worldcoin.
- In support of your argument, the Buzzfeed News investigation likely has been in the works for weeks, meaning Altman et al have had more than just a couple days to throw together a Dall-E 2 soft launch
- However, weren't OpenAI's GPT (2 and 3) announced to the world in similar fashion? e.g. demos and whitepapers and waitlists, but not a full product release?
- Throwing together a Dall-E 2 soft launch just in time to distract from the investigation would require a conspiracy, i.e. several people being at least vaguely aware that deadlines have been accelerated for external reasons. Is the Worldcoin story big enough to risk tainting OpenAI, which seems like a much more prominent part of Altman's portfolio?
I listed some of them here - https://news.ycombinator.com/item?id=30934732, just because I remembered there had been previous discussions and listing related previous discussions is a thing.
The internet's own proverb has never been more important to keep in mind. A dose of skepticism is a must.
Art is truth.
The people didn't program Dall-E how to make art. They taught it to recognize patterns and to create something by extrapolating from those patterns, all on its own. So the AI isn't a projection of what they think is good art; it's projecting what it thinks is good art, based on a prompt. The output is its best effort at a feeling, even if the feeling had to be supplied by a living person. So it's still art that's as good as the feeling it came from, fleeting feelings being lower quality than those that required more time and thought.
I think the results are being poisoned by the fact that most old paintings have deteriorated colors, so the training data looks nothing like the originals. It's certainly a lot yellower than https://cdn.openai.com/dall-e-2/demos/variations/originals/g...
https://twitter.com/sama/status/1511724264629678084?s=20&t=6...
Sam Altman demonstrates Dall-E 2 using twitter suggestions - https://news.ycombinator.com/item?id=30933478 - April 2022 (3 comments)
You can join it by following the steps in the guide here: https://github.com/huggingface/community-events/tree/main/hu...
There will also be talks from awesome folks at EleutherAI, Google, and Deepmind
> By removing the most explicit content from the training data, we minimized DALL·E 2’s exposure to these concepts
> We won’t generate images if our filters identify text prompts and image uploads that may violate our policies
The 'how to prevent superintelligences from eating us' crowd should be taking note: this may be how we regulate creatures larger than ourselves in the future
And even how we regulate the ethics of non-conscious group minds like big companies
I suspect trends in design will move towards those areas that AI struggles with (assuming there are any left!)
I think we passed that point a while ago, but seeing this makes me think we aren't too far off from computers composing pieces that actually sound good too.
As for the text-driven part, I would have to mess with some non-pre-canned prompts to see how useful it is.
But I do think we could have guessed that this sort of approach would be better (at least at a high level - I'm not claiming I could have predicted all the technical details!). The previous approaches were sort of the best that people could do without access to the training data and resources - you had a pretrained CLIP encoder that could tell you how well a text caption and an image matched, and you had a pretrained image generator (GAN, diffusion model, whatever), and it was just a matter of trying to force the generator to output something that CLIP thought looked like the caption. You'd basically do gradient ascent to make the image look more and more and more like the text prompt (all the while trying to balance the need to still look like a realistic image). Just from an algorithm aesthetics perspective, it was very much a duct tape and chicken wire approach.
The analogy I would give is if you gave a three-year-old some paints, and they made an image and showed it to you, and you had to say, "this looks like a little like a sunset" or "this looks a lot like a sunset". They would keep going back and adjusting their painting, and you'd keep giving feedback, and eventually you'd get something that looks like a sunset. But it'd be better, if you could manage it, to just teach the three-year-old how to paint, rather than have this brute force process.
Obviously the real challenge here is "well how do you teach a three-year-old how to paint?" - and I think you're right that that question still has a lot of alchemy to it.
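The brute-force loop described above (gradient ascent on an image to make a fixed scorer happier) can be sketched in miniature. This is a hedged toy, not a real CLIP-guided generator: the "image" is a small vector, and `score` is a made-up stand-in for CLIP similarity with a single optimum.

```python
# Toy sketch of the CLIP-guided loop: repeatedly nudge the "image" so a
# fixed scorer likes it more. `score` stands in for CLIP similarity.
import random

random.seed(0)
DIM = 16
target = [random.uniform(-1, 1) for _ in range(DIM)]  # the scorer's "perfect match"

def score(img):
    """Stand-in for CLIP similarity: higher when img is closer to target."""
    return -sum((x - t) ** 2 for x, t in zip(img, target))

def grad(img):
    """Analytic gradient of score w.r.t. the image pixels."""
    return [-2 * (x - t) for x, t in zip(img, target)]

img = [0.0] * DIM          # start from a blank "image"
lr = 0.1
history = [score(img)]
for _ in range(100):       # gradient ascent, duct tape and chicken wire
    g = grad(img)
    img = [x + lr * gx for x, gx in zip(img, g)]
    history.append(score(img))

print(history[0], history[-1])  # score climbs toward 0, the maximum
```

The failure mode from the parent comment falls out of this shape: with nothing pulling the image back toward realism, the loop happily overshoots into whatever the scorer rates highest, which is how you get the extra suns.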
For anyone pondering such questions, I would recommend reading "The Past, Present, and Future of AI Art" - https://thegradient.pub/the-past-present-and-future-of-ai-ar...
This is really already the case, actually. Most artworks have “value” because they have a compelling narrative, not because they look pretty. So I think we can expect future artists to really emphasize their background, life story, process of making the art, etc. All things that cannot be done by a machine.
When you have a digital display of pixels, if you randomly color pixels at 24 fps then you will eventually display every movie that can be or will ever be made, powerset notwithstanding. This can also be tied to digital audio.
In short, while mind-blowingly large, the space of display through digital means is finite.
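The "finite but mind-blowingly large" claim is easy to put numbers on. A back-of-envelope sketch (assuming a 1080p display with 24-bit color and a 2-hour movie at 24 fps):

```python
# Count the distinct frames a 1080p 24-bit display can show, and the
# distinct 2-hour movies; only the log is printable, the numbers are huge.
import math

width, height = 1920, 1080
bits_per_pixel = 24
bits_per_frame = width * height * bits_per_pixel  # 49,766,400 bits

# log10 of the number of distinct frames (the number itself has ~15M digits)
log10_frames = bits_per_frame * math.log10(2)
print(f"distinct frames: 10^{log10_frames:,.0f}")

fps, hours = 24, 2
frames_per_movie = fps * 3600 * hours  # 172,800 frames
log10_movies = frames_per_movie * log10_frames
print(f"distinct 2-hour movies: 10^{log10_movies:,.0f}")
```

Finite, yes, but at roughly 10^(2.6 trillion) possible movies, "eventually" is doing a lot of work in the random-pixels argument.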
> Prices are per 1,000 tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is 35 tokens.
Further down, in the FAQ[2]:
> For English text, 1 token is approximately 4 characters or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.
> To learn more about how tokens work and estimate your usage…
> Experiment with our interactive Tokenizer tool.
And it goes on. When most questions in your FAQ are about understanding pricing—to the point you need to offer a specialised tool—perhaps consider a different model?
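For what it's worth, the FAQ's rules of thumb do reduce to two-line arithmetic (1 token ≈ 0.75 words). The per-1K-token price below is a made-up placeholder, not OpenAI's actual pricing:

```python
# Rough cost estimate from OpenAI's stated rule of thumb: 1 token ~ 0.75 words.
def estimate_tokens(word_count):
    return word_count / 0.75

def estimate_cost(word_count, price_per_1k_tokens):
    return estimate_tokens(word_count) / 1000 * price_per_1k_tokens

# Sanity check against the FAQ's example: Shakespeare ~900,000 words
print(f"{estimate_tokens(900_000):,.0f} tokens")        # ~1.2M, matching the FAQ
print(f"${estimate_cost(900_000, 0.02):,.2f}")          # at a hypothetical $0.02/1K
```

That the estimate is this simple arguably strengthens the point: the confusion isn't the math, it's the unfamiliar unit.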
Great work.
Looking forward to when they start creating movies from scripts.
Now the rant:
I think if OpenAI genuinely cared about the ethical consequences of the technology, they would realise that any algorithm they release will be replicated in implementation by other people within some short period of time (a year or two). At that point, the cat is out of the bag and there is nothing they can do to prevent abuse. So really all they are doing is delaying abuse, and in no way stopping it.
I think their strong "safety" stance has three functions:
1. Legal protection
2. PR
3. Keeping their researchers' consciences clear
I think number 3 is dangerous because researchers are put under the false belief that their technology can or will be made safe. This way they can continue to harness bright minds that no doubt have ethical leanings to create things that they otherwise wouldn't have.
I think OpenAI are trying to have their cake and eat it too. They are accelerating the development of potentially very destructive algorithms (and profiting from it in the process!), while trying to absolve themselves of the responsibility. Putting bandaids on a tumour is not going to matter in the long run. I'm not necessarily saying that these algorithms will be widely destructive, but they certainly have the potential to be.
The safety approach of OpenAI ultimately boils down to gatekeeping compute power. This is just gatekeeping via capital. Anyone with sufficient money can replicate their models easily and bypass every single one of their safety constraints. Basically they are only preventing poor bad actors, and only for a limited time at that.
These models cannot be made safe as long as they are replicable.
To produce scientific research requires making your results replicable.
Therefore, there is no ability to develop abusable technology in a safe way. As a researcher, you will have blood on your hands if things go wrong.
If you choose to continue research knowing this, that is your decision. But don't pretend that you can make the algorithms safer by sanitizing models.
There's certainly research happening around this, and RL in games is a great test bed, but people choosing actions will be safe from automation longer than people not choosing actions, if that makes sense. It's the person who decides "hire this person" vs the person who decides "I'll use this particular shade of gray."
[0] The best example is when X causes Y and X also causes Z, but your data only includes Y and Z. Without actually manipulating Y, you can't see that Y doesn't cause Z, even if it's a strong predictor.
[1] Another example is the datasets. You need two different labels depending on what happens if you take action A or B, which you can't have simultaneously outside of simulations.
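Footnote [0]'s confounding example is easy to simulate. A minimal sketch, assuming a hidden common cause X driving both Y and Z with independent noise:

```python
# Toy version of footnote [0]: X causes both Y and Z; Y and Z are strongly
# correlated even though neither causes the other.
import random
import statistics

random.seed(42)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]        # hidden common cause
y = [xi + random.gauss(0, 0.5) for xi in x]       # Y = X + noise
z = [xi + random.gauss(0, 0.5) for xi in x]       # Z = X + noise (no Y -> Z arrow)

def pearson(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / (sum((ai - ma) ** 2 for ai in a) ** 0.5 *
                  sum((bi - mb) ** 2 for bi in b) ** 0.5)

print(f"corr(Y, Z) = {pearson(y, z):.2f}")  # high, despite no causal link
```

A model trained only on (Y, Z) pairs would learn this correlation as if it were predictive structure; only intervening on Y would reveal that it does nothing to Z.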
Nowadays there’s lots of great low/no code platforms, like Retool, that represent a far greater threat to the amount of code that needs to be produced than AI ever will.
To use a cliche: code is a bug, not a feature. Abstracting away the need for code is the future, not having a machine churn out the same code we need today.
Caravaggio is probably chortling from wherever he is ..
And this observation may have great consequences for the visual arts. I had a lot of joy looking at different Dall-E interpretations, trying to find the flaw in each one that prevents it from being a piece of art of equal value to the original. It is a ready-made tool for searching for explanations of the power of art. It cannot say which detail makes a picture an artwork, but it lets you see multiple data points and narrow the hypothesis space. My main conclusion is that the pearl earring has nothing to do with the power of the piece. It is something in the eye, and probably in the slightly opened mouth. (Somehow Dall-E pictured all interpretations with closed lips, so that seems to be an important thing, but I need more variation along this axis to be sure.)
[1] https://en.wikipedia.org/wiki/Girl_with_a_Pearl_Earring [2] https://yourartshop-noldenh.com/awol-erizku-girl-with-the-pe...