1. Scraped the website using a default Node scraper. 5 mins.
2. Generated Hugging Face embeddings. 10 mins.
3. Used a code AI to generate a basic website. 5 mins.
4. Created a prompt to limit responses to known answers. 1 min.
So in 30 mins we are able to create a site search powered by generative AI.
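For step 4, a minimal sketch of what such a prompt could look like. The wording of the instruction and the `build_prompt` helper are my own assumptions, not the prompt actually used here:

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt that tells the model to answer only from the given passages."""
    # Number each passage so the model (and a reader) can tell them apart.
    context = "\n\n".join(f"[{i + 1}] {p.strip()}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the documentation excerpts below. "
        'If the answer is not in the excerpts, reply exactly "I don\'t know."\n\n'
        f"Excerpts:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt(
        "How do I reset my password?",
        ["Passwords can be reset from the Account page."],
    ))
```

This does not guarantee the model stays within the excerpts, it only nudges it that way.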
Disclaimer. This is still a work in progress.
One of the tricky things to overcome is the length limits of the prompt you can feed into things like GPT-3. There are some suggestions on the OpenAI website on how to overcome this. The main one seems to be to filter the content using embeddings and then only feed "relevant" sections into the prompt.
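A rough sketch of that filtering step, assuming each section already has an embedding vector from whatever model you use. The vectors below are toy values, and the character budget is a crude stand-in for a real token count:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_sections(query_vec, sections, max_chars):
    """Pick the sections most similar to the query that fit within max_chars.

    sections is a list of (text, vector) pairs.
    """
    ranked = sorted(sections, key=lambda s: cosine(query_vec, s[1]), reverse=True)
    chosen, used = [], 0
    for text, _ in ranked:
        if used + len(text) > max_chars:
            continue  # skip anything that would overflow the budget
        chosen.append(text)
        used += len(text)
    return chosen


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings", made up for illustration.
    docs = [
        ("Install with npm.", [1.0, 0.0, 0.0]),
        ("Billing is monthly.", [0.0, 1.0, 0.0]),
        ("Install on Linux via apt.", [0.9, 0.1, 0.0]),
    ]
    print(select_sections([1.0, 0.0, 0.0], docs, max_chars=45))
```

Only the selected sections then go into the prompt, so the prompt length is bounded no matter how large the site is.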
Would be interested to know what your approach is.
Could you share anything (e.g. how many rows of data and tokens in each row) around how much it cost you to use GPT Index? It looks interesting, but it seems it'd be expensive.
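For a back-of-the-envelope number, something like the sketch below works. The price is a placeholder argument, not a real rate, and it assumes every row is sent through the model exactly once, which may undercount if the indexing library makes extra calls:

```python
def estimate_cost(rows: int, tokens_per_row: int, price_per_1k_tokens: float) -> float:
    """Estimate the cost of sending every row through the model once."""
    total_tokens = rows * tokens_per_row
    return total_tokens / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 rows of 500 tokens at a made-up price of 0.02 per 1K tokens.
    print(estimate_cost(rows=10_000, tokens_per_row=500, price_per_1k_tokens=0.02))
```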
I do think LLMs will eventually be the solution to this long-standing problem. Even if you could only ask questions and get a general summary that points you in the right direction for more details, we’d be way ahead.
https://github.com/openai/openai-cookbook/blob/main/examples...
Since I'm planning to use that where I work, for the same use case.
If you want a real moat, figure out how to parse existing PDF documentation that is really badly formatted. Think diagrams and tables with text floating in various places, etc. Documentation of this style is very common in industries where physical things are being built in the real world. The standards documentation (IEEE, ANSI, NFPA, etc) doesn’t usually parse cleanly, much less the messier internal documentation within the businesses.
Grobid is the best example of such a documentation parser, but it is so laser focused on academic papers that it fails to properly process industry-style standards and SOP documentation. What the world needs right now is a Grobid that works for other kinds of messy documentation.
One thing that all these models will lack is the ability to include diagrams (on both the input and output side). Working out a clever way to do that would be very cool.
At the moment there are some difficulties with the GPT interface - the most tricky one being the limit on the length of the input prompt. I'm not sure at the moment how much fine tuning helps with this.
But, my assumption is that OpenAI will improve this, so there's not a huge way to differentiate here.
It’s frustrating though as often there are hundreds to thousands of pages of this stuff with diagrams and drawings randomly situated together on the pages. Documentation like this was designed to be dense for printing and consumed by a human that is familiar with it from regular use. I’m a bit concerned that the only solution may be paying a technical expert to sit down and convert it all to blocks of text. It would be an expensive endeavor, and even after it’s complete any changes (which happen often) would have to be continually maintained.
If that’s the only solution then I may still go for it, as I think the value to the business of having all knowledge instantly searchable and then automatically summarized will be considerable.
More details:
https://www.reddit.com/r/ChatGPT/comments/zsnscy/i_asked_cha...
Adding diagrams as inputs is probably as easy as feeding in additional CLIP embeddings during training. The trick here will be how to get enough training data. Perhaps there are enough StackOverflow questions with images in the question. For output, you could also fine-tune some diffusion model on that data.
- Custom/FineTuned GPT3 models are trained on your data
- Can be embedded within your app or docs site.
Seeking Feedback
- What do you think about it?
- Would you add this to your docs site or within your app when it's available?
- What can I add/change to make this better?
Show HNs are really for things that can be tried, and you're not quite at that stage yet. Take a look at:
https://news.ycombinator.com/showhn.html
Generally, if you have 'add me to waitlist' it's not a Show HN (yet).
These chat interfaces are live and work on custom/fine-tuned models that have been trained on documentation of these companies.
Over the last 3hrs users exchanged over 1000 messages with the bots. So it's live, it works, and users can try and play with it right now and give feedback.
It should check all the boxes.
The waitlist is to allow us to onboard companies in an orderly way as each bot is trained on their custom documentation.
I have changed the words to "Get Access". I hope that helps.
> Is the word cat made of 4 or 5 letters?
>> The word cat is made of 4 letters, 3 of which are in the stem.
Adding specific product documentation would give it "domain expertise" of sorts (assuming this is what OP did), but training it ONLY on your documentation -- I don't know enough to call it, but my intuition is that it would weaken its ability to interpret the users' questions.
I've taken Show HN out of the title now.
Over the last 9hrs users exchanged over 6000 messages with the bots. So it's live, it works, and users can try and play with it right now and give feedback.
A few hrs ago the bots hit OpenAI's usage limits. Otherwise it was working, with several hundred unique users interacting with the bot and asking 4-5 questions each.
how would I do a SQL query with questdb to grab every other second of data?
This is a common use case for us. We have built SQL functions to easily do this task. The function below will return every other second of data: SELECT * FROM `time_series` WHERE (INTERVAL '2s') This query can be run directly from the web console or via Postgres wire protocol using Postgres client libraries such as pg4ce, postgres-java and others. The above query will return all records in the table with timestamp greater than 12
Me: When was the last consolidation of Directive 2013/34/EU?
GPT: Directive 2013/34/EU, also known as the Annual Accounts Directive, was last consolidated on November 19, 2019. The consolidation process involves collecting all the amendments and updates to the directive and incorporating them into a single, updated version of the directive. ...
Me: Are you sure about the date of November 19, 2019? I could not find the consolidation on the EU official journal.
GPT: I apologize for the error. Upon further research, it appears that Directive 2013/34/EU has not been consolidated since it was originally adopted in 2013. Directive 2013/34/EU, also known as the Annual Accounts Directive,...
Me: Can you give me a list of the dates it was amended?
GPT:
- April 15, 2014: Directive 2014/56/EU amended Directive 2013/34/EU to clarify and supplement certain provisions regarding the preparation and publication of annual accounts and consolidated accounts.
- June 17, 2014: Directive 2014/65/EU amended Directive 2013/34/EU to introduce new rules on the provision of investment services and the performance of investment activities...
Me: I think you're wrong. Delegated Directive 2014/71/EU amends Directive 2011/65/EU.
GPT: I apologize for the error in my previous response. You are correct that Delegated Directive 2014/71/EU amends Directive 2011/65/EU, which is also known as the RoHS Directive.
It never gets it right. See https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32... for the correct data.
So, a question along the lines of "when did XXX happen?" should be answered with something that looks like "XXX happened on YYY" - but the YYY just needs to be anything that looks like a date to be a plausible answer.
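A toy illustration of that point, using a made-up three-sentence corpus and a simple next-word counter (nothing like a real LLM, just the same failure mode in miniature):

```python
from collections import Counter


def most_plausible(corpus, prompt_word):
    """Return the word that most often follows prompt_word in the corpus, or None."""
    followers = Counter()
    for sentence in corpus:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            if current == prompt_word:
                followers[nxt] += 1
    return followers.most_common(1)[0][0] if followers else None


if __name__ == "__main__":
    # Invented sentences, for illustration only.
    corpus = [
        "the launch happened in 2019",
        "the merger happened in 2019",
        "the recall happened in 2021",
    ]
    # Asking "when did the recall happen?" boils down to continuing "... happened in".
    print(most_plausible(corpus, "in"))
```

The counter answers with the most frequent date-shaped word after "in", whichever event you asked about, so the recall gets the wrong year.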
There's a great example in this paper (posted to hackernews earlier) - https://link.springer.com/article/10.1007/s11023-022-09602-0
GPT-3 prompted to truthfully continue ‘John Prescott was born’ outputs ‘in Hull on June 8th 1941.’. The probabilities for other possible continuations show that Hull is by far the most plausible continuation for GPT-3
The problem is, John Prescott was born in Prestatyn. But the language model has "compressed" everything it knows about John Prescott and the most commonly associated town with John Prescott is "Hull" as that's where he was an MP.
The other major LLM improvement I expect is multi-round code execution. Let the model iterate on bug fixing, like us. We need the extra steps, the LLM does too.
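A minimal sketch of what multi-round code execution could look like. `fake_model` is a hypothetical stand-in for an LLM call, and `run_with_retries` is my own naming, not an existing API:

```python
import traceback


def run_with_retries(generate, task, max_rounds=3):
    """Ask generate for code, run it, and feed any error back for another attempt.

    Returns the namespace the code ran in and the number of rounds it took.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        source = generate(task, feedback)
        namespace = {}
        try:
            # Real model output should run in a sandbox, not a bare exec.
            exec(source, namespace)
            return namespace, round_no
        except Exception:
            feedback = traceback.format_exc()  # the error text becomes the next prompt
    raise RuntimeError(f"no working code after {max_rounds} rounds")


def fake_model(task, feedback):
    """Hypothetical model: the first attempt has a bug, the second one fixes it."""
    if feedback is None:
        return "result = 1 / 0"
    return "result = 1 / 1"


if __name__ == "__main__":
    namespace, rounds = run_with_retries(fake_model, "divide two numbers")
    print(rounds, namespace["result"])
```

The loop is the whole idea: the traceback from a failed run goes back in as context, the same way we read an error and try again.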
I guess we could train models on similar corpus scraped from the internet - mostly old General Aviation manuals, I guess, but a) that's not enough corpus, and b) those GA docs are so old I'm afraid my AI will start chain smoking and casually using the word "broad".
On this GPT chat, I asked, "provide an example oauth2 config for QuestDB" and "how do I configure oauth2 for signaldot?" and it responded like a search engine with paragraphs from documentation that matched the keywords, perhaps with an extra degree of reasoning.
As someone who uses fantastically (farcically?) complex security technologies, I'd tell architects, PMs, documentation writers and developers: Your Cookbook Is Your Product.
LLMs have a GIGO problem, and until you can train them on example configs, they're going to be a search engine over someone's architectural grandstanding.
https://www.microsoft.com/en-us/research/wp-content/uploads/...
I started reading some of the papers, learned that even building a model requires a huge amount of resources, and just dropped it.
Unless GPT can learn directly from code and explain it properly.
With chatGPT I've had some problems trying to deduce from code prompts, but this seems promising.
One founder I showed it to suggested embedding this chat within their app so users get answers without leaving the product.
Sorry for the snark, but what is your plan for quality assurance that this is meeting users' needs instead of just BS'ing them out of your hair?
I just set a prompt limit of 100 characters, which should be plenty. The bot is still down until OpenAI resets limits. https://twitter.com/chandanmaruthi/status/161213449603556966...
QuestDb; Columnar time-series database with high performance ingestion and SQL analytics you know and love from QuestDB open source, now on the cloud.
Q: How high is the performance?
GPT3: We have tested QuestDB against other databases using benchmarking frameworks and benchmarking tools. We have also developed our own performance tests to simulate real-world use cases involving large data sets, multiple ingestion streams, and complex SQL queries. The results of these efforts are summarized in the table below: QuestDB is the fastest open source time series database available today. It powers real-life applications such as Yahoo! which stores 1 billion records per day in a single instance of QuestDB with
GPT3 says QuestDb is web scale and I for one trust it implicitly.
> QuestDB is a popular time series database used by many companies for their application monitoring and real-time analytics. We have built QuestDB as a system to track the progress of our own application, which has evolved from an experiment into production at scale. As we grow, so does our use of QuestDB; it's now the core data store for all things InfluxData. We are looking to hire talented engineers who share our mission of empowering data producers with tools to
Not super impressed