Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, etc.

github.com

516 points · bukacdan · 1y ago · 57 comments

staplung · 1y ago

Anyone know how it handles ligatures? Depending on font and tooling the word "fish" may end up in various docs as the glyphs [ﬁ, s, h] or [f, i, s, h].

According to a quick check against /usr/share/dict/words "fi" occurs in about 1.5% of words and "fl" occurs in about 1%. There are other ligatures that sometimes occur but those are the most common in English I believe.

I don't have any sense of how common ligature usage is anymore (I notice that the word "Office" in the title of this article is not rendered with a ligature by Chrome) but it might be insanity inducing to end up on the wrong side of a failed search where ligatures were not normalized.
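For anyone who wants to repeat the dictionary check, here is a minimal sketch. The helper names are made up for illustration, and the path is simply the one mentioned above; the numbers you get will depend on which word list your system ships.

```python
def fraction_containing(words, fragment):
    """Return the share of non-empty words that contain `fragment` (case-insensitive)."""
    cleaned = [w.strip().lower() for w in words if w.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for w in cleaned if fragment in w)
    return hits / len(cleaned)


def load_words(path="/usr/share/dict/words"):
    """Read one word per line; the default path is an assumption about the host system."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


# Small in-memory sample so the sketch runs even without a system word list.
sample = ["fish", "office", "flag", "cat", "dog", "Final", "", "tree"]
print(f"fi: {fraction_containing(sample, 'fi'):.1%}")
print(f"fl: {fraction_containing(sample, 'fl'):.1%}")
```

On a real system, `fraction_containing(load_words(), "fi")` should give the kind of figure quoted above.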

kranner · 1y ago

Seems to work well when it's searching the PDF text layer as ligatures are a font rendering effect. You're right — ligatures are not as common in modern books.

Might be iffier in OCR mode: it seems to use Tesseract, which is known to have issues recognising ligatured text.

shellac · 1y ago

The (standard) ripgrep regex engine has full unicode support. My reading of that is that it should handle such equivalences like matching the decomposed version.

burntsushi · 1y ago

It does not. Almost no regex engine does that.

To add more color to this, the precise details of what "Unicode support" means are documented here: https://github.com/rust-lang/regex/blob/master/UNICODE.md

In effect, all of UTS#18 Level 1 is covered with a couple caveats. This is already a far cry better than most regex engines, like PCRE2, which has limited support for properties and no way to do subtraction or intersection of character classes. Other regex engines, like JavaScript, are catching up. While UTS#18 Level 1 makes ripgrep's Unicode support better than most, it does not make it the best. The third party Python `regex` library, for example, has very good support, although it is not especially fast[1].

Short of building UTS#18 2.1[2] support into the regex engine (unlikely to ever happen), it's likely ripgrep could offer some sort of escape hatch. Perhaps, for example, an option to normalize all text searched to whatever form you want (nfc, nfd, nfkc or nfkd). The onus would still be on you to write the corresponding regex pattern though. You can technically do this today with ripgrep's `--pre` flag, but having something built-in might be nice. Indeed, if you read UTS#18 2.1, you'll note that it is self-aware about how difficult matching canonical equivalents is, and essentially suggests this exact work-around instead. The problem is that it would need to be opt-in and the user would need to be aware of the problem in the first place. That's... a stretch, but probably better than nothing.

[1]: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...

[2]: https://unicode.org/reports/tr18/#Canonical_Equivalents
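To make the normalization work-around concrete, here is a rough sketch in Python rather than in ripgrep itself. It is not how ripgrep or rga behaves; `search_normalized` is a hypothetical helper that normalizes both the pattern and the searched text before matching. One detail worth knowing: the "fi" ligature (U+FB01) only has a compatibility decomposition, so the K forms (NFKC or NFKD) fold it to "f" + "i", while NFC and NFD leave it alone.

```python
import re
import unicodedata


def search_normalized(pattern, text, form="NFKC"):
    """Return all regex matches after normalizing both pattern and text to `form`.

    Hypothetical helper for illustration; normalizing a pattern is only safe
    for simple literals like the ones used here.
    """
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_text = unicodedata.normalize(form, text)
    return re.findall(norm_pattern, norm_text)


# The first word uses the single-codepoint "fi" ligature, U+FB01.
sample = "\ufb01sh and fish"

print(re.findall("fish", sample))                     # plain search
print(search_normalized("fish", sample))              # NFKC on both sides
print(search_normalized("fish", sample, form="NFC"))  # canonical form only
```

The plain search and the NFC search each find only the second word; the NFKC search finds both. The same normalization step could in principle live in a preprocessor script used with `--pre`, which is the opt-in escape hatch described above.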


virtualritz · 1y ago

> I notice that the word "Office" in the title of this article is not rendered with a ligature by Chrome

Chrome mobile on Android does render Office with what looks like at least an fi ligature for me (it should use an ffi one but still).

Maybe it depends on the font?

miki123211 · 1y ago

> I don't have any sense of how common ligature usage is anymore

It's much more common in PDFs than it is on the web, at least where the underlying plaintext is concerned.

wanderingmind · 1y ago

Awesome tool and I use it often. One underutilized feature of rga is its integration with fuzzy search (fzf), which gives you interactive output instead of running commands and collecting their output in sequence. So in short, use rga-fzf instead of rga on the CLI.

iwishiknewlisp · 1y ago

The fzf repo has a guide/example code for ripgrep integration that works pretty well.

https://github.com/junegunn/fzf/blob/master/ADVANCED.md#ripg...

sim7c00 · 1y ago

wish i knew about this when working support jobs, sifting for logs and lines in zip 'support file' packages. very nice!

justinmayer · 1y ago

Integrating ripgrep-all with fzf makes for a powerful combination when you want to recursively search the contents of a given directory.

I have been using a shell function to do this, and it works wonderfully well: https://github.com/phiresky/ripgrep-all/wiki/fzf-Integration

The built-in rga-fzf command appeared in v0.10 and ostensibly obviates the need for the above shell function, but the built-in command produces errors for me on macOS: https://github.com/phiresky/ripgrep-all/issues/240

bomewish · 1y ago

I also have a custom rga-fzf function, I think adapted from that:

  rga-fzf () {
    local query=$1
    local extension=$2
    if [ -z "$extension" ]
    then
      RG_PREFIX="rga --files-with-matches --no-ignore"
    else
      RG_PREFIX="rga --files-with-matches --no-ignore --glob '*.$extension'"
    fi
    echo "RG Prefix: $RG_PREFIX"
    echo "Search Query: $query"
    FZF_DEFAULT_COMMAND="$RG_PREFIX '$query'" fzf --sort \
      --preview="[[ ! -z {} ]] && rga --colors 'match:bg:yellow' --pretty --context 15 {q} {} | less -R" \
      --multi --phony -q "$query" \
      --color "hl:-1:underline,hl+:-1:underline:reverse" \
      --bind "change:reload:$RG_PREFIX {q}" \
      --preview-window="50%:wrap" \
      --bind "enter:execute-silent(echo {} | xargs -n 1 open -g)"
  }

It allows one to continually open lots of files found through rga-fzf, so one can look at them in $EDITOR all at once. Useful sometimes.

Gehinnn · 1y ago

Love this for searching in movie subtitles!

nanna · 1y ago

Lazy question but anyone integrated this with Emacs Dired, to transparently search all the files?

setopt · 1y ago

According to Reddit [1], you can use the existing rg.el package, and just point it to the rga binary instead of the rg binary, and it is supposed to just work.

[1]: https://www.reddit.com/r/emacs/comments/1eghspj/comment/lg6q...

nanna · 1y ago

Huh, thanks yeah that does switch the binary to rga, but with rga you need to specify a wildcard operator for the path parameter in order to search PDFs, otherwise it only searches plaintext files, and I'm not sure how to make rg.el's RG function add that... must be a variable but not finding it.

  (use-package rg
    ;; ripgrep
    :ensure t
    :config
    (setq rg-executable (executable-find "rga")
     rg-buffer-name "rga"))

hprotagonist · 1y ago

https://randomeffect.net/post/2022/10/07/use-ripgrep-all-fro... possibly?

rectang · 1y ago

To what extent does reading these formats accurately require the execution of code within the documents? In other words, not just stuff like zip expansion by a library dependency of rga, but for example macros inside office documents or JavaScript inside PDFs.

Note: I have no reason to believe such code execution is actually happening — so please don't take this as FUD. My assumption is that a secure design would involve running only external code and thus would sacrifice a small amount of accuracy, possibly negligible.

fwip · 1y ago

Also note that it's not necessarily safe to read these documents even if you don't intend on executing embedded code. For example, reading from pdfs uses poppler, which has had a few CVEs that could result in arbitrary code execution, mostly around image decoding. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=poppler

(No shade to poppler intended, just the first tool on the list I looked at.)

westurner · 1y ago

Couldn't or shouldn't each parser be run in a container with systemd-nspawn or LXC or another container runtime? (Even if all it's doing is reading a file format into process space as NX data not code as the current user)


rectang · 1y ago

That's a qualitatively different kind of security topic, though. On the one hand, we have a bug in a tool that reads a passive format with complete accuracy. On the other we have the need to sacrifice some amount of accuracy to avoid executing embedded code in a dynamic file format.

sim7c00 · 1y ago

this is why i do like to try and parse shit myself for my own tools. not that that's without risk, but i don't share my code so it's untargeted. however, to support a wide variety like this the tools are ok. most code in a pdf honestly will not target pdftotext, i think. i think it would target the thing people open pdfs with, like browsers and maybe a few readers like adobe and foxit reader. pdftotext seems more like an 'academic target', like a nice exercise but not very fruitful in an actual attack. i might be wrong tho.


traverseda · 1y ago

None of them really execute "code". Pandoc has a pretty good write-up of the security implications of running it, which I think applies just as much to the other ones, with the added caveat of zip bombs.

https://pandoc.org/MANUAL.html#a-note-on-security

It's just text, this isn't ripgrepping through your excel macros, just the data that's actually in the excel file.
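On the zip bomb caveat, here is a rough sketch of the kind of sanity check one could run before unpacking an untrusted archive. This is not something rga is known to do; the function names and both thresholds are made up for illustration. It also relies on the sizes declared in the archive's directory, which a hostile file can misreport, so treat it as a first filter and not as a guarantee.

```python
import io
import zipfile


def archive_stats(data: bytes):
    """Return (compressed_bytes, uncompressed_bytes) as declared in the zip directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = archive.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


def looks_like_zip_bomb(data: bytes, max_uncompressed=100 * 1024 * 1024, max_ratio=100.0):
    """Flag archives that declare a huge payload or an extreme compression ratio.

    Both limits are arbitrary illustrative defaults, not recommended values.
    """
    compressed, uncompressed = archive_stats(data)
    if uncompressed > max_uncompressed:
        return True
    return compressed > 0 and uncompressed / compressed > max_ratio


def make_archive(name: str, payload: bytes, method=zipfile.ZIP_DEFLATED) -> bytes:
    """Build a one-member zip in memory, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


repetitive = make_archive("padding.txt", b"a" * 1_000_000)
ordinary = make_archive("note.txt", b"hello world", method=zipfile.ZIP_STORED)
print(looks_like_zip_bomb(repetitive), looks_like_zip_bomb(ordinary))
```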

maxerickson · 1y ago

I don't think there's a default excel adapter in rga.

(I wanted one somewhat recently, and then doing a find for xls on the linked page returns 0 results)


maxerickson · 1y ago

On average, the macros in an Office document add features to the software and aren't run to render any content. So like toggling a group of settings or inserting some content or whatever. They may change the content, but it's done at a point in time by the user, not each time the document is opened.

And then, on average, most users don't use macros in their documents.

So yes, negligible.

anthk · 1y ago

Use Recoll for that; check the recommended dependencies from your package manager. Synaptic is good for this with a click from the right mouse button on the package.

EDIT: For instance, under Trisquel/Ubuntu/Debian and derivatives, click on 'recollcmd', and with the right click button mark all the dependencies.

Install RecollGUI for a nice UI.

Now you will have something like Google Search but libre on your own desktop.

hagbard_c · 1y ago

To take it further, install recoll-webui [1] and SearxNG [2], enable the recoll engine in the latter and point it at the former for a web-accessible search engine for local as well as remote content. Make sure to put local content behind a password or other type of authentication unless you intend for it to be searchable by outside visitors.

Source: I made the recoll engine for Searx/SearxNG and have been using this system for many years now with a full-text index over close to a terabyte worth of data.

[1] https://github.com/koniu/recoll-webui

[2] https://github.com/searxng/searxng

gjadi · 1y ago

For emacs users in the room, there is consult-recoll.

nullifidian · 1y ago

It's somewhat similar to 'recoll' in its functionality, only with recoll you need to index everything before search. It even uses the same approach of using third-party software like poppler for extracting the contents.

medoc · 1y ago

By the way, Recoll also has a utility named rclgrep which is an index-less search. It does everything that Recoll can do which can reasonably be done without an index (e.g.: no proximity search, no stem expansion etc.). It will search all file types supported by Recoll, including embedded documents (email attachments, archive members, etc.). It is not built or distributed by default, because I think that building an index is a better approach, but it's in the source tar distribution and can be built with -Drclgrep=true. Disclosure: I am the Recoll developer.

nanna · 1y ago

Wow this is a gem of a comment. I use Recoll heavily, it's a real super power for an academic, but I had no idea about rclgrep. Thank you for all your work.


rollcat · 1y ago

I think an index of all documents (including the contained text etc) should be a standardized component / API of every modern OS. Windows has had one since Vista (no idea about the API though), Spotlight has been a part of OS X for two decades, and there are various solutions for Linux & friends; however as far as I can tell there's no cross-platform wrapper that would make any or all of these easy to integrate with e.g. your IDE. That would be cool to have.

usefulcat · 1y ago

Would be cool if it also searched metadata in images, audio, video.

lafrenierejm · 1y ago

See https://github.com/phiresky/ripgrep-all/issues/221 for an example of doing this via a custom adapter and `exiftool`.

gcr · 1y ago

Does this work on Android? I’d love to put this on my eInk tablet so I could get actual search for my book library.

nsonha · 1y ago

Is there anything like this with vector and multimodal search as well? I know I'm asking too much.

wdkrnls · 1y ago

How does this compare with ugrep? I know that does many of these things while sticking with C++.

pdpi · 1y ago

Ugrep seems to be a completely new codebase, whereas RGA is a layer on top of ripgrep. Based on the benchmarks on the ripgrep github repo, rg is a bit better than 7x faster than ugrep.

runningj · 1y ago

it would be great if it supported xlsx.

lafrenierejm · 1y ago

I have an open PR [1] that's relevant. The proposed changes would allow users to process XLS and XLSX files like any other Zip archive.

[1]: https://github.com/phiresky/ripgrep-all/pull/247
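For readers wondering why treating XLSX as a Zip archive gets you anywhere: an .xlsx file is a zip container holding XML parts, so walking the members and searching their text already surfaces cell strings. The sketch below is only an illustration of that idea, not the code in the PR. `grep_zip_members` and `make_demo_archive` are hypothetical names, and the demo archive is a stand-in with XLSX-style member names, not a valid workbook.

```python
import io
import re
import zipfile


def grep_zip_members(data: bytes, pattern: str):
    """Return (member_name, matched_text) pairs for regex matches in each archive member."""
    regex = re.compile(pattern)
    results = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            # Decode leniently; real members may be binary or use another encoding.
            text = archive.read(name).decode("utf-8", errors="replace")
            for match in regex.finditer(text):
                results.append((name, match.group(0)))
    return results


def make_demo_archive() -> bytes:
    """Build a tiny in-memory zip that mimics the layout of an .xlsx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr(
            "xl/sharedStrings.xml",
            "<sst><si><t>quarterly revenue</t></si></sst>",
        )
    return buffer.getvalue()


for member, hit in grep_zip_members(make_demo_archive(), r"revenue"):
    print(f"{member}: {hit}")
```

The raw XML markup comes along for the ride, so a real adapter would probably want to strip tags before handing text to the search.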

skeptrune · 1y ago

This is sweet

jedisct1 · 1y ago

Ever heard of ugrep?

seanthemon · 1y ago

Seems this project predates ugrep and has a nicer interface.

mbrubeck · 1y ago

Both projects were started around the same time in 2019. The initial git commits were about five weeks apart:

https://github.com/Genivia/ugrep/commit/e37c986dd842adc3b2c2...

https://github.com/phiresky/ripgrep-all/commit/16b4277d361ce...


dcreater · 1y ago

ripgrep's GitHub README shows clearly superior performance to ugrep

burjui · 1y ago

C++ & autotools. No, thanks.

filterfish · 1y ago

Or: apt-get install ugrep

dcreater · 1y ago

What dependencies does it install (besides ripgrep), and does it create a bunch of caches or indices that clog up storage and/or memory?

lafrenierejm · 1y ago

> What dependencies does it install

For all of the built-in adapters to work, you'll need ffmpeg, pandoc, and poppler-utils. See the Scoop package [1] for a specific example of this.

> does it create a bunch of caches, that clog up storage and/or memory?

YMMV, but in my opinion ripgrep-all is pretty conservative in its caching. The cache files are all isolated to a single directory (whose location respects OS convention) and their contents are limited to plaintext that required processing to extract.

[1]: https://github.com/ScoopInstaller/Main/blob/master/bucket/rg...
