A few weeks ago I asked on Hacker News, "I'm in the middle of a graduate degree and am reading lots of papers. How could I get ChatGPT to use my whole library as context when answering questions?"
And I was told, basically, "It's really easy! First you just extract all of the text from the PDFs into arxiv, parse to separate content from style, then store that in a DuckDB database with zstd compression, then just use some encoder model to process all of these texts into a Qdrant database. Then use Vicuna or Guanaco 30b GPTQ, with LangChain, and..."
I was like, ok... guess I won't be asking ChatGPT where I can find which paper talked about which thing after all.
What is the value add of generative output?
If you're dealing with 100s of papers, then having a front end that can deal with vague queries would be a huge benefit.
>This is a minimal package for doing question and answering from PDFs or text files (which can be raw HTML). It strives to give very good answers, with no hallucinations, by grounding responses with in-text citations.
Sliding window chunking, RAG, etc. seem more sophisticated than the other document LLM tools, so I would love to try this out if you ever add the ability to run LLMs locally!
Speed goes from 20 tokens per second to 15 tokens per second as it nears the ~3k token context length, with output quality similar to ChatGPT 3.5.
One aspect I don't quite understand is why you filter by the sliding window chunks vs just using the medium chunks? If I understand it correctly, you find the large chunks that contain the matched small chunks from the first retrieval. Then in the third retrieval, you are getting the medium chunks that comprise the large chunks? What extra value does that provide?
The third retrieval focuses on "medium chunks" within these identified large chunks. This ensures that only the most relevant information is passed to the Language Model, enhancing both time efficiency and focus. For example, if you're asking for a paper summary, I can zero in on medium chunks within the Abstract, Introduction, and Conclusion sections, eliminating noise from other irrelevant sections. Additionally, this strategy helps manage token limitations, like GPT-3.5's 4000-token cap, by selectively retrieving information.
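To make the token-cap point concrete, here is a rough sketch of packing ranked chunks into a fixed budget. It is not the project's code: `pack_chunks`, the `reserved` headroom, and the whitespace word count used as a token estimate are all assumptions for illustration.

```python
def pack_chunks(ranked_chunks, max_tokens=4000, reserved=500):
    """Greedily keep the highest-ranked chunks that fit the context budget."""
    budget = max_tokens - reserved  # leave room for the question and the answer
    kept, used = [], 0
    for chunk in ranked_chunks:     # assumed to be sorted best-first
        cost = len(chunk.split())   # crude token estimate: whitespace words
        if used + cost > budget:
            continue                # skip chunks that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept, used


# Illustrative chunks of 2000, 2500 and 1000 "tokens".
chunks = ["abstract " * 2000, "methods " * 2500, "conclusion " * 1000]
kept, used = pack_chunks(chunks)
print(len(kept), used)
```

A real tokenizer would give different counts than a word split, so the budget here is only approximate.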
% python docs2db.py
Processing files: 6%
Traceback (most recent call last):
File "[...]/IncarnaMind/docs2db.py", line 179, in process_metadata
file_name = doc[0].metadata["source"].split("/")[-1].split(".")[0]
IndexError: list index out of range
`for d in doc: print("metadata:", d.metadata)`
before `file_name = doc[0].metadata["source"].split("/")[-1].split(".")[0]`
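The IndexError suggests `doc` came back as an empty list for at least one file. A minimal sketch of a guard, assuming `doc` is a list of objects with a `metadata` dict; `safe_file_name` and the `Doc` stand-in are hypothetical names, not part of IncarnaMind:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for one loaded document page."""
    metadata: dict = field(default_factory=dict)


def safe_file_name(doc: list) -> Optional[str]:
    """Return the source file's base name, or None if nothing was loaded."""
    if not doc:  # loader returned no pages, so doc[0] would raise IndexError
        return None
    source = doc[0].metadata.get("source", "")
    # Same split logic as the line in the traceback.
    return source.split("/")[-1].split(".")[0] or None


print(safe_file_name([]))
print(safe_file_name([Doc({"source": "papers/rag.pdf"})]))
```

Whether skipping such files is the right behavior is a separate question; this only avoids the crash.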
Opened up an issue in GitHub so as not to pollute this thread.
The retrieval process consists of three stages. The first stage retrieves small chunks from multiple documents to create a document filter using their metadata. This filter is then applied in the second stage to extract relevant large chunks, essentially sections of documents, which further refines our search parameters. Finally, using both the document and large chunk filters, the third stage retrieves the most pertinent medium-sized chunks of information to be passed to the Language Model, ensuring a focused and relevant response to your query.
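The three stages can be sketched roughly as follows. This is a toy illustration, not the actual implementation: the `Chunk` fields, the word-overlap `score` (a stand-in for embedding similarity), and the `retrieve` signature are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc: str      # source document (metadata)
    section: str  # id of the large chunk this text belongs to
    text: str


def score(query: str, text: str) -> int:
    """Toy relevance: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top_k(query, chunks, k):
    ranked = sorted(chunks, key=lambda c: score(query, c.text), reverse=True)
    return [c for c in ranked[:k] if score(query, c.text) > 0]


def retrieve(query, small, large, medium, k=2):
    # Stage 1: small chunks across all documents give a document filter.
    doc_filter = {c.doc for c in top_k(query, small, k)}
    # Stage 2: large chunks from those documents give a section filter.
    candidates = [c for c in large if c.doc in doc_filter]
    section_filter = {(c.doc, c.section) for c in top_k(query, candidates, k)}
    # Stage 3: medium chunks passing both filters are what the LLM sees.
    candidates = [c for c in medium if (c.doc, c.section) in section_filter]
    return top_k(query, candidates, k)


small = [Chunk("a.pdf", "intro", "sliding window chunking"),
         Chunk("b.pdf", "methods", "protein folding energy")]
large = [Chunk("a.pdf", "intro", "we propose sliding window chunking for retrieval"),
         Chunk("a.pdf", "related", "prior work on summarization"),
         Chunk("b.pdf", "methods", "protein folding energy minimization")]
medium = [Chunk("a.pdf", "intro", "sliding window chunking for retrieval"),
          Chunk("a.pdf", "related", "prior work on summarization"),
          Chunk("b.pdf", "methods", "chunking of protein sequences")]

result = retrieve("sliding window chunking", small, large, medium)
print([c.text for c in result])
```

Note that the `b.pdf` medium chunk mentions "chunking" but is still dropped, because its document never passed the stage 1 filter.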
Another issue I've run into with doc-answer LLMs is that they don't handle synonyms well. If I don't know the terminology for the tool, say llama-index [0], I can't ask around the concept to see if something like what I'm describing exists.
A part of me thinks a lang-chain with the LLM in it might be useful.
Something like
1. User makes vague query "hey, llama-index, how do I create a moving chunk answer thing with llama-index?"
2. Initial context comes back to the LLM, and the LLM determines there is no straightforward answer to the question.
2a. The LLM might ask followup questions "when you say X, what do you mean?" to clarify terms it doesn't have ready answers for.
2b. The LLM says "hm, let me think about that. I'll email you when I have a good answer."
2c. The LLM reads the docs and relevant materials and attempts to solve the problem.
3. The LLM emails the user with a potential answer to the question.
4. The LLM stashes the solution text in the docs if the user OKs the plan, and updates an embedding table to include words/terms used that the docs didn't contain.
This last step is the most important. Some kind of method to capture common questions and answers, synonyms, etc. would ensure that the model has access to (potentially) increasingly robust information.
Of course, you can only generate these categories once you see what kind of questions your users ask, but this means your product can continuously improve.
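As a rough sketch of that last step, a small table could map the words users actually type to the terms the docs use, and then expand later queries. `SynonymTable` and the example terms are hypothetical; a real system would more likely update the embedding index than do plain string matching.

```python
from collections import defaultdict


class SynonymTable:
    """Maps terms users actually typed to terms the docs use."""

    def __init__(self):
        self._aliases = defaultdict(set)

    def learn(self, user_term: str, doc_term: str, approved: bool) -> None:
        # Only store mappings the user has OK'd.
        if approved:
            self._aliases[user_term.lower()].add(doc_term.lower())

    def expand(self, query: str) -> str:
        """Append known doc terms for any learned phrase found in the query."""
        lowered = query.lower()
        extra = sorted(
            term
            for alias, terms in self._aliases.items()
            if alias in lowered
            for term in terms
        )
        return query if not extra else f"{query} {' '.join(extra)}"


table = SynonymTable()
table.learn("moving chunk", "sliding window", approved=True)
table.learn("answer thing", "query engine", approved=False)  # not OK'd, ignored
expanded = table.expand("how do I create a moving chunk answer thing?")
print(expanded)
```

The expanded query then goes through the normal retrieval path, so a vague phrasing still reaches the right section of the docs.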
not sure if it's my fault, but i kept getting random errors during the first run, like
LookupError: ***********************************
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('stopwords')
the fix is quite easy: just go into a python repl and execute those two lines.
after downloading the requisite resources, it works fine!
We have a similar thing (w/ UIs for search/chat) at https://github.com/arguflow/arguflow .
- nick@arguflow.gg