Show HN: IncarnaMind-Chat with your multiple docs using LLMs

(github.com)

57 points · joeyxiong · 2y ago · 28 comments

SamBam · 2y ago · 4 in thread

This looks awesome, and really useful.

A few weeks ago I asked on Hacker News "I'm in the middle of a graduate degree and am reading lots of papers, how could I get ChatGPT to use my whole library as context when answering questions?"

And I was told, basically, "It's really easy! First you just extract all of the text from the PDFs into arxiv, parse to separate content from style, then store that in a DuckDB database, with zstd compression, then just use some encoder model to process all of these texts into a Qdrant database. Then use Vicuna or Guanaco 30b GPTQ, with langchain, and....."

I was like, ok... guess I won't be asking ChatGPT where I can find which paper talked about which thing after all.

skeptrune · 2y ago

I don't know why you need the "ask chatGPT" piece. Why not just semantic search on the documents?

What is the value add of generative output?

all2 · 2y ago

I think the value is "Hey, I remember a paper talking about X topic with Y sentiment, it also mentioned data from <vague source>. Which paper was that?"

If you're dealing with 100s of papers, then having a front end that can deal with vague queries would be a huge benefit.

jarvist · 2y ago

https://github.com/whitead/paper-qa

>This is a minimal package for doing question and answering from PDFs or text files (which can be raw HTML). It strives to give very good answers, with no hallucinations, by grounding responses with in-text citations.

ajhai · 2y ago

We built https://github.com/trypromptly/LLMStack to serve exactly this persona. A low-code platform to quickly build RAG pipelines and other LLM applications.

smcleod · 2y ago · 3 in thread

Only supports private / closed LLMs like OpenAI and Claude. People need to design for local LLMs first, then for-profit providers.

joeyxiong (OP) · 2y ago

Yeah, this can definitely be used with local models, but the problem is that most personal computers cannot host large LLMs, and the cost is not lower than with closed LLMs. For organisations, though, local LLMs are a better choice.

bytefactory · 2y ago

I think local LLMs are great for tinkerers, and with quantization can run on most modern PCs. I am not comfortable sending my personal data over to OpenAI/Anthropic, so I've been playing around with https://github.com/PromtEngineer/localGPT/, GPT4All, etc., which keep the data all local.

Sliding window chunking, RAG, etc. seem more sophisticated than the other document LLM tools, so I would love to try this out if you ever add the ability to run LLMs locally!

stevenhuang · 2y ago

It's a lot closer these days with 30B 4bit quantized GPTQ fitting in 1 RTX 3090.

Goes from 20 tokens per second to 15 tokens per second nearing the ~3k token context length, with similar quality in output to chatgpt 3.5.
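
A back-of-the-envelope check of why that fits: weights-only memory is roughly parameter count times bits per weight. This is a rough sketch, assuming exactly 30e9 parameters and ignoring GPTQ's scale/zero-point metadata, the KV cache, and activations, so real usage will be higher. `weight_memory_gib` is an illustrative helper, not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GiB (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8 / 2**30


RTX_3090_VRAM_GIB = 24  # the RTX 3090 ships with 24 GB of VRAM

fp16 = weight_memory_gib(30e9, 16)  # assumed parameter count: exactly 30e9
int4 = weight_memory_gib(30e9, 4)

print(f"fp16 weights:  {fp16:.1f} GiB")
print(f"4-bit weights: {int4:.1f} GiB")
print("4-bit fits on one card:", int4 < RTX_3090_VRAM_GIB)
```

The headroom left after the 4-bit weights is what the context (KV cache) has to live in, which is one plausible reason throughput drops as the context fills up.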

pstorm · 2y ago · 2 in thread

I'm impressed by your chunking and retrieval strategies. I think this aspect is often overly simplistic.

One aspect I don't quite understand is why you filter by the sliding window chunks vs just using the medium chunks? If I understand it correctly, you find the large chunks that contain the matched small chunks from the first retrieval. Then in the third retrieval, you are getting the medium chunks that comprise the large chunks? What extra value does that provide?

joeyxiong (OP) · 2y ago

Thank you for your comment. The sliding window approach allows me to dynamically identify relevant "large chunks," which can be thought of as sections in a document. Often, your questions may pertain to multiple such sections. Using only medium chunks for retrieval could result in sparse or fragmented information.

The third retrieval focuses on "medium chunks" within these identified large chunks. This ensures that only the most relevant information is passed to the Language Model, enhancing both time efficiency and focus. For example, if you're asking for a paper summary, I can zero in on medium chunks within the Abstract, Introduction, and Conclusion sections, eliminating noise from other irrelevant sections. Additionally, this strategy helps manage token limitations, like GPT-3.5's 4000-token cap, by selectively retrieving information.
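
To illustrate the token-limit point only, here is a minimal sketch of packing already-ranked chunks into a fixed budget. It is not taken from the IncarnaMind code: `pack_chunks` is a hypothetical helper, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
def pack_chunks(ranked_chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that still fit in max_tokens.

    ranked_chunks must already be sorted from most to least relevant.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # this chunk would overflow; a later, smaller one may still fit
        kept.append(chunk)
        used += cost
    return kept, used


kept, used = pack_chunks(["a b c", "d e f g", "h"], max_tokens=4)
print(kept, used)
```

The fewer irrelevant chunks that reach this step, the more of the budget is spent on text that actually answers the question, which is the benefit the pre-filtering is meant to deliver.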

pstorm · 2y ago

Ah I see! So, the large/sliding window chunks act as a pre-filter for the medium chunks. That makes a lot of sense. I appreciate the response.

SamBam · 2y ago · 2 in thread

Testing it out. I'm getting an error after I added my pdfs to the data directory and then ran

    % python docs2db.py    
      Processing files:   6%
        Traceback (most recent call last):
          File "[...]/IncarnaMind/docs2db.py", line 179, in process_metadata
          file_name = doc[0].metadata["source"].split("/")[-1].split(".")[0]
        IndexError: list index out of range

joeyxiong (OP) · 2y ago

Hi, I've pushed the new commit to the main branch. Could you please test it out? If it still has this error, you can check if your doc has relevant metadata.

Add this just before the `file_name = doc[0].metadata["source"].split("/")[-1].split(".")[0]` line:

    for d in doc:
        print("metadata:", d.metadata)
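
Since neither `split` call can raise `IndexError`, the traceback suggests that `doc` itself was an empty list for the failing PDF, that is, the loader returned no pages. A defensive version of that line might look like the sketch below. `safe_file_name` is a hypothetical helper (not in the repo), and the `SimpleNamespace` objects only stand in for the loader's document objects.

```python
from types import SimpleNamespace


def safe_file_name(doc):
    """Return the source file name without extension, or None if it cannot be derived."""
    if not doc:  # empty list: doc[0] would raise IndexError
        return None
    source = doc[0].metadata.get("source")
    if not source:  # missing or empty "source" metadata
        return None
    # Same expression as the original line, applied only once it is safe to do so
    return source.split("/")[-1].split(".")[0]


ok_doc = [SimpleNamespace(metadata={"source": "data/paper.pdf"})]
print(safe_file_name(ok_doc))  # file name without directory or extension
print(safe_file_name([]))      # None instead of an IndexError
```

A caller could then skip (and log) any file for which this returns `None` instead of aborting the whole run.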

SamBam · 2y ago

One PDF is causing the issue. When I remove it I get another issue, `No such file or directory: 'database_store/file_names.pkl'`.

Opened up an issue in GitHub so as not to pollute this thread.

gsuuon · 2y ago · 2 in thread

Those diagrams are nice! What did you use to make them? The sliding window mechanic is interesting but I'm not seeing how the first, second and third retrievers relate. Only the final medium chunks are used, but how are those arrived at?

joeyxiong (OP) · 2y ago

Hi, I created the diagrams using Figma.

The retrieval process consists of three stages. The first stage retrieves small chunks from multiple documents to create a document filter using their metadata. This filter is then applied in the second stage to extract relevant large chunks, essentially sections of documents, which further refines our search parameters. Finally, using both the document and large chunk filters, the third stage retrieves the most pertinent medium-sized chunks of information to be passed to the Language Model, ensuring a focused and relevant response to your query.
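
One way to read the three stages described above, as a toy sketch rather than the project's actual code: the `Chunk` type, the word-overlap `score` (a stand-in for embedding similarity), and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    doc: str      # source document (the metadata used for the document filter)
    section: int  # index of the large chunk this text belongs to


def score(query: str, text: str) -> int:
    """Toy relevance: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top_k(query, chunks, k):
    ranked = sorted(chunks, key=lambda c: score(query, c.text), reverse=True)
    return [c for c in ranked[:k] if score(query, c.text) > 0]


def retrieve(query, small, large, medium, k=1):
    # Stage 1: small chunks across all documents -> document filter
    docs = {c.doc for c in top_k(query, small, k)}
    # Stage 2: large chunks within those documents -> large-chunk filter
    in_docs = [c for c in large if c.doc in docs]
    sections = {(c.doc, c.section) for c in top_k(query, in_docs, k)}
    # Stage 3: medium chunks that fall inside the selected large chunks
    candidates = [c for c in medium if (c.doc, c.section) in sections]
    return top_k(query, candidates, k)


small = [Chunk("sliding window chunking", "a.pdf", 0), Chunk("protein folding", "b.pdf", 0)]
large = [Chunk("intro about sliding window chunking methods", "a.pdf", 0),
         Chunk("unrelated appendix on licensing", "a.pdf", 1),
         Chunk("sliding window in protein folding", "b.pdf", 0)]
medium = [Chunk("sliding window chunking splits text", "a.pdf", 0),
          Chunk("window licensing terms", "a.pdf", 1),
          Chunk("sliding window protein", "b.pdf", 0)]

hits = retrieve("sliding window chunking", small, large, medium)
print([c.text for c in hits])
```

Only the medium chunks returned by the last stage would be passed to the model; the small and large chunks serve purely as filters.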

gsuuon · 2y ago

So to rephrase: stage 1 - it finds the top k most relevant small chunks, stage 2 - it searches the source documents of those small chunks for most relevant large chunks, stage 3 - it searches medium chunks in the source documents of relevant small chunks that are contained in the large chunks found in stage 2?

all2 · 2y ago · 1 in thread

A team where I work recently rolled out a doc-answer LLM and context was an issue we ran into. Retrieved doc chunks didn't have nearly enough context to answer some of the broader questions well.

Another issue I've run into with doc-answer LLMs is that they don't handle synonyms well. If I don't know the terminology for the tool, say llama-index [0], I can't ask around the concept to see if something like what I'm describing exists.

A part of me thinks a lang-chain with the LLM in it might be useful.

Something like

1. User makes vague query "hey, llama-index, how do I create a moving chunk answer thing with llama-index?"

2. Initial context comes back to the LLM, and the LLM determines there is no straightforward answer to the question.

2a. The LLM might ask followup questions "when you say X, what do you mean?" to clarify terms it doesn't have ready answers for.

2b. The LLM says "hm, let me think about that. I'll email you when I have a good answer."

2c. The LLM reads the docs and relevant materials and attempts to solve the problem.

3. Email the user with a potential answer to the question.

4. Stashes the solution text in the docs if the user OKs the plan. Updates an embedding table to include words/terms used that the docs didn't contain.

This last step is the most important. Some kind of method to capture common questions and answers, synonyms, etc. would ensure that the model has access to (potentially) increasingly robust information.
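
A very small sketch of the capture step, assuming a plain dictionary in place of a real embedding table: once a user confirms that their wording maps to a term the docs actually use, later queries get expanded with that term. `SynonymTable` and its methods are hypothetical names.

```python
class SynonymTable:
    """Maps user wording to the terminology the docs actually use."""

    def __init__(self):
        self._doc_term = {}  # user phrase -> phrase found in the docs

    def learn(self, user_term: str, doc_term: str) -> None:
        self._doc_term[user_term.lower()] = doc_term.lower()

    def expand(self, query: str) -> str:
        lowered = query.lower()
        extra = [doc for user, doc in self._doc_term.items()
                 if user in lowered and doc not in lowered]
        return query if not extra else query + " " + " ".join(extra)


table = SynonymTable()
table.learn("moving chunk", "sliding window")  # recorded after the user OKs an answer
print(table.expand("how do I create a moving chunk answer thing"))
```

The expanded query is what would be sent to the retriever, so the next person who asks with the same wording gets matched to the right passages immediately.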

sergiotapia · 2y ago

You can have a pre-qualification step to qualify the answer into several highly specific categories. These categories have highly tailored context that allow much better answers.

Of course, you can only generate these categories once you see what kind of questions your users ask, but this means your product can continuously improve.
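
A minimal sketch of such a pre-qualification step, using keyword overlap where a real system would more likely use an LLM or a trained classifier. The categories, keywords, and context strings below are invented for illustration.

```python
# Hypothetical categories; in practice they would come from observed user questions.
CATEGORIES = {
    "install": ({"install", "setup", "dependency", "error"},
                "Installation and troubleshooting notes"),
    "retrieval": ({"chunk", "chunking", "retrieval", "embedding"},
                  "Chunking and retrieval design notes"),
}
FALLBACK = ("general", "General project overview")


def prequalify(question: str):
    """Pick the category with the most keyword hits and return (name, context)."""
    words = set(question.lower().replace("?", " ").split())
    best, best_hits = FALLBACK[0], 0
    for name, (keywords, _context) in CATEGORIES.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = name, hits
    context = CATEGORIES[best][1] if best in CATEGORIES else FALLBACK[1]
    return best, context


print(prequalify("How does chunking work?"))
print(prequalify("What is the license?"))
```

The returned context would be prepended to the prompt, so each category can carry its own tailored background.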

dilap · 2y ago · 1 in thread

I feel like an LLM trained on Slack could be something like the perfect replacement for trying to maintain docs.

Palmik · 2y ago

There are many such products if you look around (e.g. [1]). Fine-tuning is probably not going to be enough to retain all the information in this case, so you definitely want to augment with retrieval.

[1] https://news.ycombinator.com/item?id=34909921

augusteo · 2y ago · 1 in thread

thank you for the effort.

not sure if it's my fault, but i kept getting random errors during first run like

    LookupError: ***********************************

      Resource stopwords not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('stopwords')

---

the fix is quite easy, just need to go into python repl and execute those.

after downloading the requisite resources, it works fine!

joeyxiong (OP) · 2y ago

Thanks for pointing it out. It's not your fault at all. I should add a downloader for the prerequisite resources to the code as well. Will merge the new code into the main branch.
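
Such a downloader could be as small as the sketch below. `ensure_nltk_resources` is a hypothetical helper, not the code that was merged. It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` fetching it. Only `stopwords` is known from this thread; any other resource the project needs would have to be added to the list.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each missing NLTK resource and return the paths that were fetched.

    find/download default to nltk.data.find and nltk.download; they are
    parameters only so the helper can be exercised without network access.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the real functions are needed
        find = find or nltk.data.find
        download = download or nltk.download
    fetched = []
    for path in resources:
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            download(path.split("/")[-1])  # "corpora/stopwords" -> "stopwords"
            fetched.append(path)
    return fetched
```

Calling `ensure_nltk_resources(["corpora/stopwords"])` once at start-up would then replace the manual REPL step.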

skeptrune · 2y ago

Can we talk about how dynamic chunking works by any chance? That is the most interesting piece imo.

We have a similar thing (w/ UIs for search/chat) at https://github.com/arguflow/arguflow.

- nick@arguflow.gg
