According to a quick check against /usr/share/dict/words "fi" occurs in about 1.5% of words and "fl" occurs in about 1%. There are other ligatures that sometimes occur but those are the most common in English I believe.
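For anyone who wants to repeat that kind of check, here is a rough sketch in Python. The helper name `fraction_containing` is made up for illustration, and the numbers you get will depend on which word list your system ships (the file may not exist at all on some systems).

```python
import os

def fraction_containing(words, needle):
    """Return the fraction of non-empty words containing `needle` (case-insensitive)."""
    cleaned = [w.strip().lower() for w in words if w.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for w in cleaned if needle in w)
    return hits / len(cleaned)

# Path taken from the comment above; not every system has this file.
DICT_PATH = "/usr/share/dict/words"

if os.path.exists(DICT_PATH):
    with open(DICT_PATH, encoding="utf-8", errors="replace") as f:
        words = f.readlines()
    for pair in ("fi", "fl"):
        print(f"{pair}: {fraction_containing(words, pair):.2%}")
```

This counts words containing the letter pair, which is only an upper bound on how often a ligature could be used, since whether one is actually rendered depends on the font and the typesetting software.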
I don't have any sense of how common ligature usage is anymore (I notice that the word "Office" in the title of this article is not rendered with a ligature by Chrome) but it might be insanity inducing to end up on the wrong side of a failed search where ligatures were not normalized.
Might be iffier in OCR mode: it seems to use Tesseract, which is known to have issues recognising ligatured text.
To add more color to this, the precise details of what "Unicode support" means are documented here: https://github.com/rust-lang/regex/blob/master/UNICODE.md
In effect, all of UTS#18 Level 1 is covered with a couple caveats. This is already a far cry better than most regex engines, like PCRE2, which has limited support for properties and no way to do subtraction or intersection of character classes. Other regex engines, like JavaScript's, are catching up. While UTS#18 Level 1 makes ripgrep's Unicode support better than most, it does not make it the best. The third party Python `regex` library, for example, has very good support, although it is not especially fast[1].
Short of building UTS#18 2.1[2] support into the regex engine (unlikely to ever happen), it's likely ripgrep could offer some sort of escape hatch. Perhaps, for example, an option to normalize all text searched to whatever form you want (nfc, nfd, nfkc or nfkd). The onus would still be on you to write the corresponding regex pattern though. You can technically do this today with ripgrep's `--pre` flag, but having something built-in might be nice. Indeed, if you read UTS#18 2.1, you'll note that it is self-aware about how difficult matching canonical equivalents is, and essentially suggests this exact work-around instead. The problem is that it would need to be opt-in and the user would need to be aware of the problem in the first place. That's... a stretch, but probably better than nothing.
[1]: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...
[2]: https://unicode.org/reports/tr18/#Canonical_Equivalents
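To make the normalization escape hatch concrete, here is a minimal sketch of a preprocessor in Python. It is not something ripgrep ships. It assumes the file is UTF-8 text, reads it into memory in one go, and the script name `normalize.py` below is hypothetical. The only ripgrep behavior it relies on is that `--pre` runs the command with the file path as its first argument and searches whatever the command writes to stdout.

```python
import sys
import unicodedata

def normalize_text(text, form="NFKC"):
    """Normalize `text` to the given Unicode form (NFC, NFD, NFKC or NFKD)."""
    return unicodedata.normalize(form, text)

def main(argv):
    # Expect exactly one argument: the path of the file to normalize.
    if len(argv) != 2:
        sys.stderr.write("usage: normalize.py FILE\n")
        return 2
    # Undecodable bytes are replaced rather than raising, so binary-ish
    # files do not abort the search.
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(normalize_text(f.read()))
    return 0

if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Saved as an executable script with a suitable shebang line, this could be used as `rg --pre ./normalize.py office`. NFKC folds compatibility characters such as the "ﬃ" ligature (U+FB03) into plain "ffi", so a pattern written in ordinary ASCII matches. NFC and NFD leave ligatures alone, which is why the choice of form matters. A preprocessor is spawned once per file, so expect searches to be noticeably slower.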
Chrome mobile on Android does render Office with what looks like at least an fi ligature for me (it should use an ffi one but still).
Maybe it depends on the font?
It's much more common in PDFs than it is on the web, at least where the underlying plaintext is concerned.
https://github.com/junegunn/fzf/blob/master/ADVANCED.md#ripg...
I have been using a shell function to do this, and it works wonderfully well: https://github.com/phiresky/ripgrep-all/wiki/fzf-Integration
The built-in rga-fzf command appeared in v0.10 and ostensibly obviates the need for the above shell function, but the built-in command produces errors for me on macOS: https://github.com/phiresky/ripgrep-all/issues/240
rga-fzf () {
    local query=$1
    local extension=$2
    if [ -z "$extension" ]
    then
        RG_PREFIX="rga --files-with-matches --no-ignore"
    else
        RG_PREFIX="rga --files-with-matches --no-ignore --glob '*.$extension'"
    fi
    echo "RG Prefix: $RG_PREFIX"
    echo "Search Query: $query"
    FZF_DEFAULT_COMMAND="$RG_PREFIX '$query'" fzf --sort \
        --preview="[[ ! -z {} ]] && rga --colors 'match:bg:yellow' --pretty --context 15 {q} {} | less -R" \
        --multi --phony -q "$query" \
        --color "hl:-1:underline,hl+:-1:underline:reverse" \
        --bind "change:reload:$RG_PREFIX {q}" \
        --preview-window="50%:wrap" \
        --bind "enter:execute-silent(echo {} | xargs -n 1 open -g)"
}
It allows one to continually open lots of files found through rga-fzf, so one can look at them in $EDITOR all at once. Useful sometimes.
[1]: https://www.reddit.com/r/emacs/comments/1eghspj/comment/lg6q...
(use-package rg
  ;; ripgrep
  :ensure t
  :config
  (setq rg-executable (executable-find "rga")
        rg-buffer-name "rga"))

Note: I have no reason to believe such code execution is actually happening, so please don't take this as FUD. My assumption is that a secure design would involve running only external code and thus would sacrifice a small amount of accuracy, possibly negligible.
(No shade to poppler intended, just the first tool on the list I looked at.)
https://pandoc.org/MANUAL.html#a-note-on-security
It's just text, this isn't ripgrepping through your excel macros, just the data that's actually in the excel file.
(I wanted one somewhat recently, and then doing a find for xls on the linked page returns 0 results)
And then, on average, most users don't use macros in their documents.
So yes, negligible.
EDIT: For instance, under Trisquel/Ubuntu/Debian and derivatives, click on 'recollcmd', and with the right click button mark all the dependencies.
Install RecollGUI for a nice UI.
Now you will have something like Google Search, but libre, on your own desktop.
Source: I made the recoll engine for Searx/SearxNG and have been using this system for many years now with a full-text index over close to a terabyte worth of data.
https://github.com/Genivia/ugrep/commit/e37c986dd842adc3b2c2...
https://github.com/phiresky/ripgrep-all/commit/16b4277d361ce...
For all of the built-in adapters to work, you'll need ffmpeg, pandoc, and poppler-utils. See the Scoop package [1] for a specific example of this.
> does it create a bunch of caches, that clog up storage and/or memory?
YMMV, but in my opinion ripgrep-all is pretty conservative in its caching. The cache files are all isolated to a single directory (whose location respects OS convention) and their contents are limited to plaintext that required processing to extract.
[1]: https://github.com/ScoopInstaller/Main/blob/master/bucket/rg...