I wanted to do this for a very long time. It is more of a sketch or prototype. I'd really appreciate your feedback!
The entire purpose of this project seems to hinge on this assertion, but there isn't a single example.
I don't understand what makes regex unnatural for editing? What is meant by editing? Why do people struggle with groups?
There are lots of examples of the syntax for this project, but why is it better than regular regex?
If there were a few examples showing "here's the vanilla regex version, here's my version, here's why my version makes this easier" I might be able to understand this project.
Just skimming through the readme, I'm also in doubt that the paradigm shift would really be any better than folk regexp.
That doesn't make the existence of the project any less cooler, of course. Congratulation to the author, and don't let such a comment narrow down your motivation, as it's really not the point. If you enjoy do it, great. If you learn something in the process, awesome. If it can lead to something that I can't envision, how marvelous. But it doesn't, that was still a great epic and you can be proud.
pattern = r'(x)y(z)'
replacement = r'\1Y\2'
result = re.sub(pattern, replacement, text)
I would like to replace it with 'xy:Yz' pattern:
result = re.trre('xy:Yz', text)
If you need your x, z to be more complicated patterns or even regex themselves it can be more handy using this approach.
I guess I'm still struggling to see how it's simpler overall.
Most of the examples on your page don't involve groups at all, e.g.:
$ echo 'catcatcat' | trre '((cat):(dog))*'
dogdogdog
That already seems a lot more complicated than just: re.sub('cat', 'dog', 'catcatcat')
I don't need to use groups that often in regex replacements, and when I do I'm already trying to do something complicated, and it's not clear to me why the colon syntax is easier to write, easier to understand, or if it's as flexible.Not trying to criticize the project, just genuinely trying to understand the specific strengths and limitations of the proposed syntax. E.g. what if I want to turn xyz into zYx?
IIUC
Great project
I found the operator precedence unnatural, and it looks like a lot of other folks in this thread did too. I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`.
> I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`
It was confusing to me too until I remembered that we all kind of use regexes sort of wrong. They're "really" supposed to be considered as generators and not matchers. So IIR cat|dog as a "regular expression" (not a regex) is supposed to formaly expand to
{catog,cadog}
For matching, this set of strings can then be substring matched against some larger text.
The problem is that almost no regex matching engine actually does this, and so now they'll do all kinds of strange things either to meet our expectations, or for efficiency or something.
If you go and try a bunch of different regex tools you'll get variations that either service (cat)|(dog) or (cat)|(dog)|(ca[td]og) or something else.
So from a more formal conceptualization I think cat:dog should produce ca(t:d)og not (cat):(dog). But our experience with "regex" tools has subverted that formalization and now everybody just puts parens around expressions they want to alternate.
My real minor issue with this proposal, as interesting and well thought out as it is, is that it feels like it's just trying to get back at regular expressions as generators, which they actually are and it's coming from a place on the other side of a few decades of how we've been abusing them as regexes for user expectations. In other words, the problem is the tooling, not the syntax.
source: I've worked adjacent to this space in the past and if you've never thought of regexes as string set generators you can toy with the idea here
https://onlinestringtools.com/generate-string-from-regex
but again, how these generator tools works is also very specific. The ones I used to work with had a variety of ways to specify constraints on closures and such to restrict the generators.
In pure theoretical computer science, regular expressions exist as an abstract concept independent from syntax or parsers. They are an “algebra”, which means they are composed of elements connected with operators, but they are not inherently tied to a syntax. In the most fundamental formulation of regular expressions (the one in the Chomsky hierarchy), the only operators are alteration (which modern syntaxes express as “|”), the Kleene star (“*”) and — notably — concatenation, which modern syntaxes simply omit, in a way comparable to how modern mathematics notation omits the multiplication operator when you write “2x”.
In the same way that maths needs rules to define whether “2x²” means “(2x)²” or “2(x²)”, regex syntax needs such rules too. This is called operator precedence. I’m sure you’ve heard that before, but you just might not have realized that the regular expression “ab” has an operator in it because it is typically not written.
Now I’m not going to argue that the operator precedence in maths notation is haphazard or without reason — but it is arbirary. It was arbitrarily chosen to be the most useful to mathematicians using the notation. And it turns out that giving exponentiation higher precedence than (invisible) multiplication (meaning: “2x²” means “2(x²)” rather than “(2x)²”) is more useful.
So coming back to the original example, whether “cat:dog” means “ca(t:d)og” or “(cat):(dog)” is simply a matter of defining the precedence of the “:” operator relative to the concatenation operator. You can argue (and I would agree with you) that one is more useful than the other, and therefore preferable (in the same way that “(cat)|(dog)” is more useful than “ca(t|d)og”), but neither of them is more fundamentally correct or primal or, as you put it, “supposed to formally expand to”.
If I shift it behind concatenation there could be another problem. E.g. with non-associative : should be illegal. And I am not sure how to treat this:
cat:dog:mouse
In the current version I inject the epsilon (empty string). It looks natural E.g. to remove every second letter I could run '..:' which is technically '.(.:eps)':
echo 'abcde' | ./trre '..:'
result: 'ace'
actually ':' association could have a meaning as a composition of regular relations; but I found it too complicated for now.
I would not worry about "cat:dog:mouse" because intuitively it is clearly correct and it means replacing cat with mouse. With parentheses it could be written as "((cat:dog):mouse)".
Then there is no syntactic special case. This is just EXPR:EXPR; the special case is that both EXPR are character class syntax, and so the tr-like range mapping applies.
anyhow
wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
> wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
That's something I am still not sure about. I took a hundred examples and it looked more natural this way (: lower then .). But I can change it with the change of one digit in the code, literally. That's why I'm posting here. I need some real feedback.
also,
* colon as member of a character class (and dash and how they interact)
* posix character classes (and the colon in the syntax)
* word boundaries are really useful in practice
* think of ünicøde early and look at what russ cox did there
boundaries, what do you decide to exclude? for example back references and grouping have fun with DFAs (again see russ cox and re2)
composition is fantastic, and a shortcut to exponential land because grammars (as in Chomsky hierarchy) can easily be expressed with transducers, yay.
boundaries will also clarify use cases and allow to state: "if you want to do xyz please use re2 and a bit of code instead"
and one "technicality" that hit me once with re2 was a static buffer for resumable iteration. I'd loved to reallocate the buffer and drop it's beginning and extend it at the end but alas, the state of re2 has hard pointers into the buffer, not relative ones. I think this was when re2 hit the end of the buffer mid-expression during incremental parse. so you can't reallocate and instead you have to drop the hard earned state and resume.
anyhow, it's been quite a while so I'm no longer in the thicket of this :-D
what's your driver? curiosity? or some itch?
but I really enjoy seeing your project :-)
More generally speaking, since regular expressions effectively define a parse tree on their matches, being able to perform more general transformations of those trees would be useful.
":'(':(\\')|[^"'])*":'If I understand it correctly you want to change something inside the "..." block and change the quotas to single '.
It can be done by this expression:
echo '"hello world" "hello again!"' | ./trre "\":'.+?:-\":'"
'-' '-'
So I substitute the text inside "" by symbol - using this expression ".+?:-" and simultaneously changing the surrounding quota.
Question mark means non-greedy mode.
$ echo 'cat' | trre 'c:da:ot:g'
dog
Feels strange. What is happening here; the grammar says TRRE <- TRRE* TRRE|TRRE TRRE.TRRE
TRRE <- REGEX REGEX:REGEX
What is the parse tree here? Why is "c" not being replaced with "da"? Or why isn't c being removed and "da" being replaced by "ot"?I do like the idea of having a search/replace semantic that is more intuitive than grouping operators; back in MS-DOS days you could do "ren .log .txt" and this would work which feels bananas to my modern bash-minded way of thinking, but it's very obvious looking at this what it is supposed to do.
If the operator were explicit (let’s call it ~), the example would look like this:
$ echo 'cat' | trre 'c:d~a:o~t:g'
dog
With unnecessary parentheses: $ echo 'cat' | trre '(c:d)~(a:o)~(t:g)'
dogThere is a hidden operator of concatenation as for usual regular expressions. In the code I denote it as lower dot '.' (as in the old Thompson's implementation).
> Why is "c" not being replaced with "da"?
It is all about precedence. According to the discussion I think I've chosen a wrong one and it raises confusion. Current version of precedence table is this:
| 1 | Escaped characters | \<special character> | | 2 | Bracket expression | [] | | 3 | Grouping | () | | 4 | Single-character-ERE duplication | * + ? {m,n} | | 5 | Transduction | : | | 6 | Concatenation | . (implicit) | | 8 | Alternation | | |
So the ':' is stronger then '.' (implicit concatenation).
If we instead require regex to be non-empty (breaking the deletion examples), then the ambiguity becomes that of concatenation: whether it's '(((c:d)(a:o))(t:g))' or '((c:d)((a:o)(d:g)))'. Assuming associativity, this would not matter.
but yes now that I read it , it also makes confusion , theoretically your point makes valid , I also believe that c should be replaced by da after I read the repo , I am not sure ...
I remember a Finnish researcher from PARC coming to one of my classes at UT to show how you can use FSTs for handling Finnish morphology, which is, on its face, quite a feat.
There are some good tutorials in the form of homework assignments (from like Johns Hopkins and some others) that go through Pynini use cases.
Why not just disallow this? I understand it would make the grammar more difficult to specify - but I don't see any good reason to keep it.
The rationale was to implement a fun operation called transducer composition. It is possible to do simple operation on strings and compose trre's like filters. But I haven't finished it yet. So again, a fair point.
And then if `-g` were specified, you could use this to create infinite generators, which could be a useful construct in some cases -- like `yes` but more powerful.
EDIT: Another interesting use case, if I am understanding correctly: if this worked, then you could use `:(<regex>)` to have it output an example of a string that matches any given regex. `-g :(<regex>)` produces a generator of every string in the language matched by that regex. `-g :(<regex>) | head -n 100` would give you 100 examples.
$ echo 'cat dog' | trre 'c:bat|d:hog' bat hog
$ echo '' | trre ':a*' # <- do NOT do this dog
$ echo '' | trre ':(repeat-10-times){10}' dog
> cat
I got lost here.
make && sh test.sh
cc -std=c99 -Wall -Wextra -Wpedantic -o2 trre_nft.c -o trre
cc -std=c99 -Wall -Wextra -Wpedantic -o2 trre_dft.c -o trre_dft
test.sh: 14: Syntax error: "(" unexpected
Using bash fixed it.Then I ran one of the generator examples:
echo '' | trre -g ':(0|1){,3}?'
And I got this error: ./trre: invalid option -- 'g'
Usage: ./trre [-d] [-m] expr [file]$ make && bash test.sh
with 'bash' instead.
For the second part it is a bug in the README. Thank you for pointing this out! I had to be more careful before the publication. Fixed. Try '-ma' flags instead.
$echo '' | trre -ma ':(0|1){,3}?'
Trre seems like an arbitrarily different version of `sed -e /a/b/ ..`. This method of search and replace is essentially ported everywhere else, from awk to vim to IntelliJ, and has always gotten the job done adequately and feels as natural as anything else I've learned.
Am I missing something?
p.s I just realized I've been regex'ing heavily for 21 (!) years now, time flies.
The sed is superior, actually. I do not cover all the functions sed provides. I think of it more like 'tr' + regexp. But it has different underlying engine and might be faster and more expressive for some use cases (e.g. tokenization, morphology).
My only quick comment is - the link to `theory.pdf` in README is broken (your pdf is in docs/ dir, so just need to change the url to include docs/).
This, combined with probabilistic approach can be even more interesting. Probabilistic approaches regexes exist since 70s if memory serves right.
You could port this to like V7 Unix from 1979 or earlier; why didn't they get this idea? :)
Tools like sed build a transducer around the whole automaton: s/this/that/g.
> Tools like sed build a transducer around the whole automaton: s/this/that/g.
That sounds reasonable. Could you provide any links on sed internals? Thanks.
https://gitlab.com/rosie-pattern-language/rosie/-/blob/maste...
I.e - why is trre "(cat):(dog)" an improvement over s/cat/dog? What's the improvement of "(x:)or" over s/xor/or? And so on. Pretty much all the examples match (at least in my head) to relatively easy regexps.
I think the core advantage, if there is one, would be in group logic, so maybe the examples should lean into that - even before explaining the basics. I'd explain why it's actually a better choice before explaining the full syntax, or hello-world use cases.
For the caesar cipher example, it screams for a "apply this in reverse" - it's a pretty common request for a lot of text replacements, but it's super clear with this one. (Because programmer brain immediately screams "why express logic twice")
I don't know if it's useful (yet), but I think it's great you're trying to explore alternatives to a long-established status quo. (Caveat: That usually means there's a good chance it won't land. But it's still great to see an exploration)
echo "regular expressions" | ./trre "[a:A-z:a]"
REGULAR EXPRESSIONS
In fact, "[a:A-z:x]" seems to do the same thing as "[a:A-z:Z]" for all x.The Xerox FST book https://press.uchicago.edu/ucp/books/book/distributed/F/bo36... (Xerox XRCE's finite-state-tools comprising the compilers lexc, xfst and twolc, the languages they compile and the formal language finite state transducers describe including linguistic applications)
The XFST book https://press.uchicago.edu/ucp/books/book/distributed/F/bo36...
FOMA: the open source clone https://fomafst.github.io/
FOMA: the paper (Holden, M. (2009) Proc. EACL) https://aclanthology.org/E09-2008.pdf
FOMA: the open source clone https://fomafst.github.io/
FOMA: the paper (Holden, M. (2009) Proc. EACL) https://aclanthology.org/E09-2008.pdf
$ echo 'cat' | trre 'c:da:ot:g'
dog
Why?Elsewhere, the readme says that ":" is "non-associative", and I had a look at the language grammar but haven't figured out how to parse a sequence of ":".
This is a common modus operandi for me in VS Code.
Adding a new char to be escaped seems like another annoyance.
I try to avoid tools that make the hard bit harder, and the easy bit easier