I have over 50 unreleased patches. There are some bugfixes, including a compiler one, involving dynamically scoped variables used as optional parameters:
(defvar v)
(defun f (: (v v)))
(call (compile 'f)) ;; blows up in virtual machine with "frame level mismatch"
Patch for that:

diff --git a/share/txr/stdlib/compiler.tl b/share/txr/stdlib/compiler.tl
index e76849db..ccdbee83 100644
--- a/share/txr/stdlib/compiler.tl
+++ b/share/txr/stdlib/compiler.tl
@@ -868,7 +868,7 @@
,*(whenlet ((spec-sub [find have-sym specials : cdr]))
(set specials [remq have-sym specials cdr])
^((bindv ,have-bind.loc ,me.(get-dreg (car spec-sub))))))))))
- (benv (if specials (new env up nenv co me) nenv))
+ (benv (if need-dframe (new env up nenv co me) nenv))
(btreg me.(alloc-treg))
(bfrag me.(comp-progn btreg benv body))
(boreg (if env.(out-of-scope bfrag.oreg) btreg bfrag.oreg))
There is now support in the printer for limiting the depth and length.
I added a derived hook into the OOP system: a struct being notified that it is being inherited from.
That section made me chuckle. Admirable if true.
DSLs, OTOH, are in short supply. While awk or plain sed are great for shell programming, this is the only (open source) DSL I'm aware of targeting certain types of NLP-esque "munging". This space is mostly full of statistical approaches, which, while conceptually pure, don't allow the kind of flexibility that would be useful in many applications.
I wonder if, eventually, the DSL portion of TXR could be sheared off (possibly via metacircular evaluation of the TXR Lisp?) into something that's portable across Lisps, or at least to semi-standardized Scheme implementations?
Mostly true for very high-level languages like Lisp/Scheme, or ML/OCaml/F#/Haskell, when set against not-so-high-level languages like C, C++, or Java.
Against Racket, I wouldn't be so sure. Nor against Ruby.
Python and JavaScript are high-level languages, but they are crippled by some bad design decisions.
"Customized sort based on multiple columns of CSV". In R, something like this: `library(tidyverse); read_delim("file.tsv", delim = "@") %>% arrange(.[[2]]) %>% group_by(.[[2]]) %>% arrange(match(.[[3]], c("arch.", "var.", "ver.", "anci.", "fam.")), .[[3]]) %>% group_by(.[[2]], .[[3]]) %>% mutate(n = n()) %>% arrange(desc(n)) %>% ungroup() %>% select(1:4)`
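For comparison, the same multi-column sort can be sketched in Python. The sample rows and column meanings here are made up, and the grouped-arrange pipeline of the R version is only approximated by a single composite sort key:

```python
from collections import Counter

# Hypothetical stand-in rows for the "@"-delimited file in the R example;
# column 2 is the grouping key, column 3 the category.
rows = [
    ("a", "g1", "ver."),
    ("b", "g1", "arch."),
    ("c", "g2", "arch."),
    ("d", "g1", "arch."),
]

# Fixed category order, taken from the R example above.
ORDER = {name: i for i, name in enumerate(["arch.", "var.", "ver.", "anci.", "fam."])}

# Group sizes per (column 2, column 3) pair, mirroring the mutate(n = n()) step.
counts = Counter((r[1], r[2]) for r in rows)

# Sort by column 2, then by the fixed category order of column 3,
# then by descending group size.
rows.sort(key=lambda r: (r[1], ORDER.get(r[2], len(ORDER)), -counts[(r[1], r[2])]))
print(rows)
```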
"Extract text from HTML table". In R, something like this would suffice: `library(rvest); library(tidyverse); read_html(URL_GOES_HERE) %>% html_nodes("div.scoreTableArea") %>% html_table() %>% write_delim("out.csv", delim = "\t")`
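For the same table-extraction task without rvest, here is a minimal stdlib-only Python sketch; it handles only flat tables (no nesting, no attributes), which is far less than what rvest's html_table does:

```python
from html.parser import HTMLParser

# Collect the text of each td/th cell, one list of cells per tr row.
class TableText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

p = TableText()
p.feed("<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>")
print(p.rows)
```

Writing the rows out tab-separated, as the R write_delim call does, is then one "\t".join per row.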
"Get n-th Field of Each Create Referring to Another File". In R: `library(tidyverse); file1 = read_delim("file1.txt", delim = " ", col_names = FALSE); chunks = readChar("file2.txt", 999999) %>% str_split(";") %>% unlist() %>% map(function(x) { matches = str_match(str_trim(x), '^create table "(.*)"([^(]*)\\(((.|\n)*)\\)$'); title = matches[, 2]; fields = matches[, 4] %>% str_split(",") %>% unlist() %>% str_trim(); return(tibble(table_name = rep(title, length(fields)), n = 1:length(fields), field = fields)) }) %>% bind_rows(); file1 %>% left_join(chunks, by = c("X1" = "table_name", "X2" = "n"))`
The third example trades off a little clarity for a little robustness by adding a regex instead of assuming the SQL table definition is one field per line.
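The same split-and-regex chunking can be sketched in Python; the SQL text below is made-up sample data shaped like the statements in question:

```python
import re

# Hypothetical SQL dump: split on ";", then pull the table name and the
# comma-separated field list out of each "create table" statement.
sql = '''create table "def".something (
  f01 char (10),
  f02 char (10)
);
create table "abc".something (
  x01 char (10)
);'''

chunks = []
for stmt in sql.split(";"):
    m = re.match(r'create table "(.+?)"([^(]*)\(((?:.|\n)*)\)\s*$', stmt.strip())
    if m:
        fields = [f.strip() for f in m.group(3).split(",")]
        chunks += [(m.group(1), n + 1, f) for n, f in enumerate(fields)]

print(chunks)
```

Joining these (table_name, n, field) triples against the first file is then an ordinary dict lookup, standing in for the left_join step.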
TXR Lisp has support for that type of functional transformation of structured data, with fairly tidy syntax. If a need for a full blown HTML parsing library arises, someone will come up with one; maybe me. It could end up integrated into the TXR flex/Yacc parser, which would make it fast.
In the "Get n-th Field" task, what we can do is snarf the data as a string, then remove all the commas and semicolons. It then parses as TXR Lisp with the lisp-parse function, resulting in this:
(create table (qref "def" something)
(f01 char (10) f02 char (10) f03 char (10) f04 date)
create table (qref "abc" something)
(x01 char (10) x02 char (1) x03 char (10))
create table (qref "ghi" something)
(z01 char (10) z02 intr (10) z03 double (10) z04 char (10) z05 char (10)))
That seems to open an avenue to a solution. E.g. we can now partition it into pieces that start with the create symbol:

28> (partition *26 (op where (op eq 'create)))
((create table (qref "def" something) (f01 char (10) f02 char (10) f03 char (10) f04 date))
(create table (qref "abc" something) (x01 char (10) x02 char (1) x03 char (10)))
(create table (qref "ghi" something) (z01 char (10) z02 intr (10) z03 double (10) z04 char (10) z05
char (10))))
Now the (qref "def" something) parts are in fixed positions, followed by fixed-shape triplets.
The only problem with this type of solution is that it takes the example data too literally. The user's actual data might not cleanly parse this way.
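In Python terms, the partition-at-create step and the fixed-position extraction look roughly like this; strings, tuples, and lists stand in for the Lisp symbols and nested forms, and the token data is a hand-transcribed fragment of the parse above:

```python
# Stand-ins for the symbols of the parsed form above.
toks = ["create", "table", ("qref", "def", "something"),
        ["f01", "char", 10, "f02", "char", 10],
        "create", "table", ("qref", "abc", "something"),
        ["x01", "char", 10]]

# Partition into pieces that start with "create".
parts, cur = [], []
for t in toks:
    if t == "create" and cur:
        parts.append(cur)
        cur = []
    cur.append(t)
parts.append(cur)

# Each piece now has a fixed shape: the table name sits inside the qref
# form at [2][1], and the first field name at [3][0].
names = [(p[2][1], p[3][0]) for p in parts]
print(names)
```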
If you just put two spaces of indentation on every line, you get a verbatim block in typewriter font,
  like
  this.
"Good luck, you're on your own!"
The HTML version that most people would be using has a TOC with two-way navigation to the section headings and is hyperlinked. Of course, man page reading allows easy searching.
Another edit preserving more of the original would be to replace the final "with no" with something like "even excluding any"...
Basic TXR matching is really quite simple. Match some patterns, generate a report at the end. The patterns are interleaved with the text being matched, so it's more like a more powerful (but far more readable) version of regexes than a normal programming language.
You can learn it quickly based on the provided examples.
It's just a few straightforward commands, although you have to wrap your mind around how the backtracking parser works.
Most of the manual is about the Lisp. I never used that part, and I don't think it's really needed for 95+% of all text parsing/summarizing tasks.
Edit: 10 years ago in this case.
Most transformations that we do on data do not require Turing completeness or recursion. I think it would be useful to write these down in a language with semantics that are easy to analyze.
I don't see why we would want to rule out a pattern function invoking itself (directly, or through intermediaries); if that hurts, then just don't do that.
(Though I understand that there are languages deliberately designed without unbounded loops or recursion, for justifiable reasons.)
"It's statically-typed and type-infered.
It also infers memory consumption and guarantees O(n) memory use.
It is designed for concise one-liner computations right in the shell prompt.
It features both a mathematics library and a set of data slicing and aggregation primitives.
It is faster than all other interpreted languages with a similar scope. (Perl, Python, awk, ...)
It is not Turing-complete. (But can compute virtually anything nonetheless.)
It is self-contained: distributed as a single statically linked binary and nothing else.
It has no platform dependencies."
I am a little suspicious that you may be the author ;)
(PERL = Perversion Excused by Random Lispiness)
1. Parsimony.
2. Performance vs awk and friends.
3. Multi threading.
4. Ideal use cases.
For these things TXR is great.
If you want to do multithreading or need the best performance, it's probably not the thing to use.
I tend to use Notepad++ when starting out on a data-wrangling adventure. It has an uncanny ability, unlike any other editor, to open hundreds of files at the same time and to perform regex operations on all of them without dropping dead. I use Notepad++ for the initial manual exploration to get the lay of the problem, and then switch to R for the actual analysis.
I assume, then, that your file sizes are not so big. N++ is not good with big (>25% of your RAM) files, refusing to open them.
Is R/tidyverse also limited in the size of file it can handle? In my job I routinely work with up to 100GB files.
https://i.imgur.com/pvCnmSa.png
I can accept that doing something non-standard leads to some rough edges like this, but I'm not sure how many web developers know this is an issue. At least it has surprised me how many websites have this problem of assuming the default color is bright white.