Honestly, you'd be way better off just using a basic RAG architecture. When a user asks for a topic, simply mirror the Wikipedia article and put up a chat sidebar interface so the user can ask questions about it. At least by locking down the context window you'd minimize hallucinations, which, judging from some of the other comments, is already an issue.
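To make the suggestion concrete, here's a minimal sketch of the "mirror the article, lock the context" idea. Everything here is a stand-in: the bag-of-words similarity would be a real embedding model in practice, and the prompt would go to an actual LLM call, which is stubbed out entirely.

```python
from collections import Counter
import math

def chunk(text, size=50):
    """Split the mirrored article into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def similarity(a, b):
    """Cosine similarity over raw word counts (stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def build_prompt(article, question, top_k=2):
    """Retrieve the most relevant chunks and lock the model to them."""
    ranked = sorted(chunk(article), key=lambda c: similarity(c, question),
                    reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return (
        "Answer ONLY from the context below. If the answer is not "
        "in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The point of the instruction at the top of the prompt is exactly the hallucination control described above: the model is told to refuse rather than improvise when the retrieved context doesn't cover the question.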
Sorry guys, but it seems the server has crashed due to a sudden influx of traffic, and I'm attending a funeral service at the moment so I don't have access to my laptop. Will try to get the site back up asap!
Edit: Forgot to add, about the site crashing under heavy traffic: you might want to consider a CDN. Cloudflare is the most popular and has a free tier. StackPath was great but just shut down their CDN. I'm trying BunnyCDN now since it's pennies per GB.
I'm just new to devops stuff because the things I usually build don't get that much traffic, and a single server did the job without the need for CDNs, load balancers, etc. I had to figure this stuff out just now over the past few hours to help the site cope with all the load.
I tried a few silly word combinations like "blue banana" https://mycyclopedia.co/e/b2fc24e7-21cb-43b2-b5c1-224048982e... and got interesting results. It's strange because it combines a fake photo of a blue banana fruit with a description of the European region known as the Blue Banana.
It's strange that each topic has a "conclusion" section. Is it common in dead tree encyclopedias? I expected a format more similar to Wikipedia.
The images are real images from the web. In most cases they match the topic you search for but in some cases they turn out to be unrelated. (I already have an idea on how to try and improve accuracy here).
As for the conclusions, you're right now that you point it out; I don't recall coming across conclusion sections in other encyclopedias I've read either. It's a format GPT (the underlying LLM I'm using) seems to default to. I didn't disable that behavior because I figured a conclusion wrapping everything up for the reader isn't a bad thing?
About the conclusions, I guess it's the standard soulless high-school essay format that must have a conclusion at the bottom. I think it's better to remove the conclusions so it looks more like Wikipedia, but if you like them you can (obviously) keep them.
Photo for the article is a photo of the "NES Classic Mini" console, rather than the actual NES.
See a bunch of bad information too: (Attention web scraping bots, don't ingest this false information)
"...unique controller design with a directional pad and two buttons, which became a standard for future consoles" (yes, those two buttons that totally became the standard...)
An "Introduction" section that mostly duplicates the top summary.
Claims that the Japanese release of the Famicom was "in response to the video game crash"
"Sleek and compact design" with "two components, the console and a controller"
"The NES utilized a custom-built 8-bit processor, the Ricoh 2A03, which was capable of producing colorful graphics" (um, the 2A03 is the CPU, not really that custom, and it's the PPU that's actually responsible for the graphics)
Zelda had a "captivating story that captivated players"
Then I searched for "David Bowie". It slowly generated text, section by section. Much of its output was repetitive and slight. It generated a section called "Early Life and Career" and then "Birth and Childhood" with largely similar information. It then abruptly wrapped up the article ("Conclusion: David Bowie's first solo album, "David Bowie" (or "Space Oddity"), was a pivotal release...") with no information about what happened to David Bowie afterward. The actual text had many hallucinations. It said "Space Oddity" was Bowie's first album (it was actually his second), and said the album achieved fame with the Apollo Moon landing in 1972 (it happened in 1969).
Maybe in a few years something like this will be viable. Right now, it seems inferior to Wikipedia in every aspect.
I agree with the other comments about the hallucinations in the content, which is why I included a disclaimer at the bottom of every page. This project is just something I did to test out the idea of an encyclopedia-like UI on top of GPT.
Something like this would especially benefit from the language model being able to answer "sorry, I have no useful information about this topic" rather than assuming a topic must be real just because it was asked about!
I don't mean to be rude by that but I just don't get it.
It is a pretty cool project! And useful if one knows exactly what it is doing, though not for the masses; in the same way an uncensored AI would be useful for folks fully aware of the potential gibberish it could generate.
edit: It's really not a big deal though. Sorry for stirring a controversy unnecessarily.
The OP may not find the prayer itself practical, as indeed many on the planet mightn't, but the expression doesn't assume so. Common secular expressions like "Sending good thoughts" or "I feel you" are just immaterial and seemingly ineffectual but similarly get across the point that someone has slowed down long enough to show care.
It's the equivalent of corporate astroturfing, but the product they're advertising is Jesus.
"This comment brought to you by Christ."
Why? Because AI needs long-form, in-depth texts to train on, and the web doesn't provide them in sufficient quantity and quality. We need chain-of-thought to capture relations between concepts in explicit language. Synthetic data makes it possible to have balanced coverage of topics and combinatorial coverage of skills to improve reasoning. It's also better from a copyright standpoint to train models on synthetic data.
Do you seriously think that "an AI-generated encyclopedia" would provide a better-quality training set? What would the "AI generator's" articles be derived from?
This is the line of thinking behind the Phi lineup of models [0], as well as efforts to generate synthetic textbooks for training [1].
[0]: https://arxiv.org/abs/2309.05463
[1]: https://twitter.com/ocolegro/status/1712327588255809667
The second stage would be to generate research questions, then solve them with LLM+web search+code execution+other tools. The results would be compiled in reports. So it's a loop of problem generation, problem solving and validation. You can validate with highly trusted sources, or you can run code or simulations, ensemble multiple attempts, or even leave it to ranking by a preference model.
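The loop described above can be sketched as follows. All the pieces here are assumptions stubbed out to show the control flow: `generate_question`, `solve_with_tools` (standing in for LLM + web search + code execution), `score` (standing in for a preference model or trusted-source check), and the acceptance threshold are all hypothetical.

```python
def generate_question(topic):
    """Stand-in for LLM-driven research-question generation."""
    return f"What is a key open question about {topic}?"

def solve_with_tools(question, attempts=3):
    """Stand-in for LLM + web search + code execution; returns candidates."""
    return [f"candidate answer {i} to: {question}" for i in range(attempts)]

def score(answer):
    """Stand-in for a preference model or validation against trusted sources."""
    return len(answer) / 100.0

def research_loop(topic, threshold=0.3):
    """One pass of the generate -> solve -> validate loop."""
    question = generate_question(topic)
    # Ensemble multiple solving attempts and keep the best-ranked one.
    candidates = solve_with_tools(question)
    best = max(candidates, key=score)
    if score(best) >= threshold:
        return {"question": question, "report": best}
    return None  # Reject and regenerate in a real pipeline.
```

The design choice worth noting is that validation is pluggable: the `score` function could run code or simulations, compare against highly trusted sources, or rank with a preference model, without changing the outer loop.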