Honestly, you'd be way better off just using a basic RAG architecture. When a user asks for a topic, simply mirror the Wikipedia article and put up a chat sidebar interface so the user can ask questions about it. At least by locking down the context window you'd minimize hallucinations, which, judging from some of the other comments, is already an issue.
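To make the suggestion concrete, here's a minimal sketch of the "mirror the article, lock the context" idea. Everything here is a stand-in: the bag-of-words similarity would be a real embedding model in practice, and the prompt would go to an actual LLM call, which is stubbed out entirely.

```python
from collections import Counter
import math

def chunk(text, size=50):
    """Split the mirrored article into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def similarity(a, b):
    """Cosine similarity over raw word counts (stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def build_prompt(article, question, top_k=2):
    """Retrieve the most relevant chunks and lock the model to them."""
    ranked = sorted(chunk(article), key=lambda c: similarity(c, question),
                    reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return (
        "Answer ONLY from the context below. If the answer is not "
        "in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The point of the instruction at the top of the prompt is exactly the hallucination control described above: the model is told to refuse rather than improvise when the retrieved context doesn't cover the question.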
Sorry guys, but it seems the server has crashed due to a sudden influx of traffic, and I'm attending a funeral service at the moment so I don't have access to my laptop. Will try to get the site back up asap!
Edit: Forgot to add, about the site crashing under heavy traffic: you might want to consider a CDN. Cloudflare is the most popular and has a free tier. StackPath was great but just shut down their CDN. I'm trying BunnyCDN now since it's pennies per GB.
I'm just new to devops stuff because the things I usually build don't get that much traffic, and a single server did the job without the need for CDNs, load balancers, etc. I had to figure this stuff out just now over the past few hours to help the site cope with all the load.
I tried a few silly word combinations like "blue banana" https://mycyclopedia.co/e/b2fc24e7-21cb-43b2-b5c1-224048982e... and got interesting results. It's strange because it combines a fake photo of a blue banana fruit with a description of the European region known as the Blue Banana.
It's strange that each topic has a "conclusion" section. Is it common in dead tree encyclopedias? I expected a format more similar to Wikipedia.
The images are real images from the web. In most cases they match the topic you search for but in some cases they turn out to be unrelated. (I already have an idea on how to try and improve accuracy here).
As for the conclusions, you're right now that you point it out; I don't recall coming across conclusion sections in other encyclopedias I've read either. It's a format GPT (the underlying LLM I'm using) seems to default to. I didn't disable that behavior because I figured a conclusion wrapping everything up for the reader isn't a bad thing?
About the conclusions, I guess it's the standard soulless high-school essay format that must have a conclusion at the bottom. I think it's better to remove the conclusions so it looks more like Wikipedia, but if you like them you can (obviously) keep them.
Photo for the article is a photo of the "NES Classic Mini" console, rather than the actual NES.
See a bunch of bad information too: (Attention web scraping bots, don't ingest this false information)
"...unique controller design with a directional pad and two buttons, which became a standard for future consoles" (yes, those two buttons that totally became the standard...)
An "Introduction" section that mostly duplicates the top summary.
Claims that the Japanese release of the Famicom was "in response to the video game crash"
"Sleek and compact design" with "two components, the console and a controller"
"The NES utilized a custom-built 8-bit processor, the Ricoh 2A03, which was capable of producing colorful graphics" (um, the 2A03 is the CPU, not really that custom, and it's the PPU that's actually responsible for the graphics)
Zelda had a "captivating story that captivated players"
Then I searched for "David Bowie". It slowly generated text, section by section. Much of its output was repetitive and slight. It generated a section called "Early Life and Career" and then "Birth and Childhood" with largely similar information. It then abruptly wrapped up the article ("Conclusion: David Bowie's first solo album, "David Bowie" (or "Space Oddity"), was a pivotal release...") with no information about what happened to David Bowie afterward. The actual text had many hallucinations. It said "Space Oddity" was Bowie's first album (it was actually his second), and said the album achieved fame with the Apollo Moon landing in 1972 (it happened in 1969).
Maybe in a few years something like this will be viable. Right now, it seems inferior to Wikipedia in every aspect.
I agree with the other comments about the hallucinations in the content, which is why I included a disclaimer at the bottom of every page. This project is just something I did to test out the idea of an encyclopedia-like UI on top of GPT.
Something like this would especially benefit from the language model being able to answer "sorry, I have no useful information about this topic" rather than assuming a topic must be real just because it was asked about!
I don't mean to be rude by that but I just don't get it.
It is a pretty cool project! And useful if one knows exactly what it is doing, though not for the masses; in the same way an uncensored AI would be useful for folks fully aware of the potential gibberish it could generate.
edit: It's really not a big deal though. Sorry for stirring a controversy unnecessarily.
The OP may not find the prayer itself practical, as indeed many on the planet mightn't, but the expression doesn't assume so. Common secular expressions like "Sending good thoughts" or "I feel you" are just immaterial and seemingly ineffectual but similarly get across the point that someone has slowed down long enough to show care.
It's the equivalent of corporate astroturfing, but the product they're advertising is Jesus.
"This comment brought to you by Christ."
Why? Because AI needs long-form, in-depth texts to train on, and the web doesn't provide them in sufficient quantity and quality. We need chain-of-thought to capture relations between concepts in explicit language. Synthetic data makes it possible to have balanced coverage of topics and combinatorial coverage of skills to improve reasoning. It's also better from a copyright standpoint to train models on synthetic data.
Do you seriously think that "an AI-generated encyclopedia" would provide a better-quality training set? What would the "AI generator's" articles be derived from?
This is the line of thinking behind the Phi lineup of models [0], as well as efforts to generate synthetic textbooks for training [1].
[0]: https://arxiv.org/abs/2309.05463
[1]: https://twitter.com/ocolegro/status/1712327588255809667
The second stage would be to generate research questions, then solve them with LLM+web search+code execution+other tools. The results would be compiled in reports. So it's a loop of problem generation, problem solving and validation. You can validate with highly trusted sources, or you can run code or simulations, ensemble multiple attempts, or even leave it to ranking by a preference model.
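The loop described above can be sketched as follows. All the pieces here are assumptions stubbed out to show the control flow: `generate_question`, `solve_with_tools` (standing in for LLM + web search + code execution), `score` (standing in for a preference model or trusted-source check), and the acceptance threshold are all hypothetical.

```python
def generate_question(topic):
    """Stand-in for LLM-driven research-question generation."""
    return f"What is a key open question about {topic}?"

def solve_with_tools(question, attempts=3):
    """Stand-in for LLM + web search + code execution; returns candidates."""
    return [f"candidate answer {i} to: {question}" for i in range(attempts)]

def score(answer):
    """Stand-in for a preference model or validation against trusted sources."""
    return len(answer) / 100.0

def research_loop(topic, threshold=0.3):
    """One pass of the generate -> solve -> validate loop."""
    question = generate_question(topic)
    # Ensemble multiple solving attempts and keep the best-ranked one.
    candidates = solve_with_tools(question)
    best = max(candidates, key=score)
    if score(best) >= threshold:
        return {"question": question, "report": best}
    return None  # Reject and regenerate in a real pipeline.
```

The design choice worth noting is that validation is pluggable: the `score` function could run code or simulations, compare against highly trusted sources, or rank with a preference model, without changing the outer loop.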