Turning the IDE Inside Out with Datalog (opens in new tab)

(petevilter.me)

275 pointsarjunnarayan5y ago61 comments

61 comments

53 comments · 24 top-level

Nice article and great idea, but as is traditional there are some slight fudges of what Datalog is or what Horn clauses are etc, that I'd like to unfudge, slightly. It's Sunday! What better than to start our day with a very quick and almost not completely fudgy intro to logic programming? Le'ts go!

To begin with, Datalog is not a "cousin" of Prolog as stated in the section "Interlude: Brief Intro to Datalog". Datalogs (there are many variants!) are subsets of Prolog. For example, a typical datalog is the language of definite clauses with no function symbols [¹] and with no negation as failure [²]. Another datalog may allow only the cons function in order to handle lists; etc.

Otherwise the syntax of datalog is identical to Prolog, but there is a further difference, in that Prolog is evaluated from the "top down" whereas Datalog is evaluated from the "bottom up". What that means is that given a "query" (we'll come to the scare quotes in a moment) Prolog will try to find "rules" whose heads unify with a literal in the query (A literal is an atom, or the negation of an atom; "p(χ,α)" is an atom.) whereas datalog will first generate the set of all ground atoms that are consequences of the program in the context of which the query was made, then determine whether the atoms in the query are in the set of consequences of the program [³]. The reason for the different execution model is that the bottom-up evaluation is guaranteed to terminate [⁴] whereas Prolog's top-down evaluation can "go infinite" [⁵]. There is of course another, more subtle difference: Prolog can "go infinite" because of the Halting problem, from which datalog does not suffer because, unlike Prolog, it does not have Universal Turing Machine expressivity [⁶].

So in short, datalog is a restricted subset of Prolog that has the advantage of being decidable, while Prolog in general is not, but is also incomplete while Prolog is complete [⁷].

Now, the other slight fudge in the article is about "rules", "facts" and "queries". Although this is established and well-heeled logic programming terminology, it fudges the er fact that those three things are the same kind of thing, namely, they are, all three of them, Horn clauses [⁸].

Specifically, Horn clauses are clauses with a single positive literal.

Crash course in FOL: an atom is a predicate symbol followed by a set of terms in parentheses. Terms are variables, functions or constants (constants are functions with 0 arity, i.e. 0 arguments). A literal is an atom, or the negation of an atom. A clause is a disjunction of literals. A clause is Horn when it has at most 1 positive literal. A Horn clause is a definite clause when it has exactly 1 positive literal.

The following are Horn clauses:

  ¬P(χ) ∨ ¬P(υ)
  P(χ) ∨ ¬Q(χ)
  Q(α)
  Q(β)

In logic programming tradition, we write clauses as implications (because ¬A ∨ B ≡ A → B) and with the direction of the implication arrow reversed to make it easier to read long implications with multiple premises. So the three clauses above are written as:

  ←P(χ), P(υ) (a)
  P(χ) ← Q(χ) (b)
  Q(α)←       (c)
  Q(β)←       (d)

And those are a "query", (a), a "rule", (b) and two "facts", (c) and (d).

Note that (b,c,d) are definite clauses (they have exactly one positive literal, i.e. their head literal, which is what we call the consequent in the implication). Facts have only a positive literal; I like to read the dangling implication symbol as "everything implies ...", but that's a bit idiosyncratic. The bottom line is that definite clauses with no negative literals can be though of as being always true, hence "facts". Queries, i.e. Horn clauses with no positive literals, are the opposite: "nothing implies" their body literals (my idiosyncratic reading) so they are "always false". Queries are also called "goals". Finally, definite clauses with both positive and negative literals can be thought of as "conditionally true".

Prolog and datalog programs are written as sets of definite clauses, i.e. sets of "facts" and "rules". So, when we want to reason about the "facts" and "rules" in the program, we make a "query". Then, the language interpreter, which is a top-down resolution theorem prover [⁹] in the case of Prolog, or bottom-up fixpoint calculation in the case of datalog [¹⁰], determines whether our "query" is true. If the query includes any variables then the interpreter also returns evey set of variable substitutions that make the query true.

In the example above, (a) has two variables, χ and υ and evaluating (a) in the context of (b,c,d) would return a "true" result with the variable substitution {χ/α,υ/β}, i.e. (a) is true iff χ = α and υ = β.

And that's how Horn clauses and definite clauses become "rules", "facts" and "queries".

Next time: how the leopard got its stripes and the hippopotamus learned to love the first order predicate calculus.

_________________

[¹] This is my (first) slight fudge because constants are also functions, with 0 arguments. So, to be formal, the typical datalog above has "no functions of arity more than 0".

[²] Negation-as-failure makes a language non-monotonic, in the sense that introducing new "facts" can change the meaning of a theory, i.e. a program.

[³] So, its Least Herbrand Model, or its Least Fix-Point (LFP).

[⁴] Because it finds the LFP of the query and the program.

[⁵] Unless evaluated by SLG resolution, a.k.a. tabling, similar to memoization.

[⁶] Although higher-order datalogs, that allow for predicate symbols as terms of literals have UTM expressivity, indeed a UTM can be defined in a higher-order datalog fragment where clauses have up to two body literals with at most two arguments:

  utm(S,S) ← halt(S).
  utm(S,T) ← execute(S,S₁), utm(S₁,T).
  execute(S,T) ← instruction(S,P), P(S,T).

Originally in:

Tärnlund, S.-A. (1977). Horn clause computability. BIT Numerical Mathematics, 17(2), 215–226.

[⁷] Less fudgy, definite programs are refutation complete under SLD resolution, meaning that any atom that is entailed by a definite program can be derived by SLD resolution. A definite program is a set of definite clauses, explanation of which is coming right up.

[⁸] Long time ago, I explained this to a colleague who remarked that all the nice syntactic elegance in query languages falls apart the moment you try to make a query, which usually has a very different syntax than the actual rows of the tables in the database. So I said "that's the point! Queries are also Horn clauses!" and his immediate remark was "That just blew my mind". It's been so long and I'm so used to the idea that I haven't a clue whether this is really mind blowing. Probably, being my usual excited self, I just said it in a way that it sounded mind blowing (gesticulating widely and jumping up and down enthusiastically, you can picture the scene) so my colleague was just being polite. That was over drinks at the pub after work anyway.

[⁹] Resolution is an inference rule that allows the derivation of new atoms from a set of clauses. In theorem proving it's used to refute a goal clause by deriving the empty clause, □. Since a goal is a set of negated literals, refuting the goal means basically that the negated literals are true. So our query is true in the context of our program.

[¹⁰] Datalog's bottom-up evaluation uses something called a TP operator. It basically does what I said above, starts with the ground atoms in a program and then derives the set of consequences of the clauses in the program. In each iteration, the set of consequences are added to the program and the process is repeated, until no new consequences are derived. As stated above, the process is guaranteed to complete because every datalog definite program has a least fixpoint, which is also its Least Herbrand Model (we won't go into Herbrand Models and Herbrand interpretations, but, roughly, an LHM is the smallest set of atoms that make the union of a definite program and a goal true). A more complete introduction to LHMs and LFPs and how they are used in bottom-up evaluation for datalog can be found here:

https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...

tom_mellior5y ago

Sorry, but you are also fudging things.

> Datalogs (there are many variants!) are subsets of Prolog.

No. First of all, this is not true syntactically. There are Datalogs that allow non-Horn clauses with several terms in a goal head:

    a, b :- c.

This is not allowed in Prolog. So (such) Datalogs are not subsets of Prolog syntactically.

Second, it is not true semantically either, not even for the common syntactic subset. Consider:

    ancestor_of(Parent, Child) :-
        child_of(Child, Parent).
    ancestor_of(Ancestor, Person) :-
        ancestor_of(Ancestor, Parent),
        child_of(Person, Parent).

This is left-recursive, so typical queries will not terminate in Prolog, i.e., have no finite solutions. But as you say, Datalogs are decidable and any query terminates, so you will get solutions, which is different semantics from Prolog. So it's not meaningful to say that Datalog is a semantic subset of Prolog.

Datalog and Prolog are like C++ and Java: One is an extension of a subset of the other, or equivalently, there is a non-empty common subset with similar-ish semantics. This is not a very useful statement? I agree! But it is what it is. They are different languages.

YeGoblynQueenne5y ago

Yes, you're right and I'm also fudging things- but didn't I say that upfront? I start my comment by announcing an "almost not completely fudgy intro to logic programming"!

More seriously, you're right about syntax so thanks for the correction.

But, regarding semantics, the ancestor_of/2 program above can terminate in Prolog, evaluated by SLG resolution, as per my footnote 5. There are still situations where Prolog will not terminate when evaluating a normal program even under SLG resolution, but left recursion is not one of those.

Edit: also, if a Datalog program is also Prolog, and assuming that the program terminates under Prolog, then Prolog and Datalog will both compute its LHM. So it makes sense to say that the two languages are semantically at least very similar and to explain one in terms of the other. They are both much closer than what each is to ASP, for example. It really depends on what assumptions one makes- and that's where the "fudging" comes in.

Anyway, thanks for the correction. I sure could have done a better job of that comment. Did you find anything other that was very wrong in my comment? I'd appreciate it if you pointed it out.

tom_mellior5y ago

> Did you find anything other that was very wrong in my comment?

I'm not sure about the statement that "Datalog is evaluated from the "bottom up". [...] datalog will first generate the set of all ground atoms that are consequences of the program in the context of which the query was made, then determine whether the atoms in the query are in the set of consequences of the program". I think it's true that Datalog behaves as if it were evaluated like this, but AFAIK Datalog systems can do lots of very aggressive optimizations that change the actual evaluation. I'm not an expert on Datalog.

Anyway, I was mostly dissatisfied with the general thrust of the comment, trying to establish a "subset" relationship. I think the article's "cousins" comment is fair. Prolog is older and influenced Datalog very much. Prolog is also Turing complete, so it is strictly more powerful. But Datalog's semantics allow some optimizations that wouldn't be possible in Prolog, so it can be a lot faster on appropriate classes of problems.

1 more reply

vilterp5y ago

Hi, author here. Thank you so much for this unfudging! I knew I was fudging things up a bit, and I didn't really understand the relationships between Prolog, Datalog, Horn clauses, and first-order logic. This clears things up a lot, though I think I'll need to read it a couple more times for it to sink in! Thanks again.

YeGoblynQueenne5y ago

Thank you for your article! Also note tom_mellior's corrections above, particularly about datalogs with two head literals. I'm fudging things up also!

YeGoblynQueenne5y ago

Hey dang, I added some whitespace twice (I needed to add a newline above the first set of Horn clauses to make it into a blockquote) right after I made my comment and the comment "fell" maybe five places. Is that as it should be?

Edit: I did make two or three more changes after adding the whitespace that weren't whitespace, but single characters.

mkl5y ago

(Not dang!) New comments start at the top, so they have a chance to get upvotes, and quickly drop down if they don't get votes relative to other comments. It sounds normal.

YeGoblynQueenne5y ago

Yes, but a new comment doesn't immediately sink five places, unless the comments above it are very highly upvoted - or the comment is very large. That is, there is a grace period that keeps a comment on top of a thread for a little while, presumably to give it a chance to be read. But, I've noticed that large comments are automatically pushed down into a thread so the grace period for them is much reduced (and probably weighted by the size of the comment).

That's not absolute, because I remember it stated elesewhere by dang that larger comments are considered more likely to be more substantial (for sound reasons). But it seems that holds up to a certain point. There is probably some kind of comment size range, of maximum and minimum comment size, outwith which a comment is pushed down.

There are other factors, also that cause a comment to automatically sink. For example, I have the (admittedly very bad) habbit of editing and re-editing my comments over and over again. I've noticed that if I do this in a sufficiently large comment, the comment immediately sinks when I refresh the thread page after an edit. If the comment is smaller, it can take quite a bit more editing. I think this behaviour is also weighted by user karma or time since account creation because my comments started sinking less often as my karma increased. Or maybe the rules of the site changed in the meantime.

Bottom line: large comments with some edits sink like lead.

1 more reply

sriram_malhar5y ago· 6 in thread

Niko Matsakis had an insightful observation about this approach in a similar post (which is referenced by OP):

http://smallcultfollowing.com/babysteps/blog/2017/01/26/lowe...

The key observation to me was that the traditional Datalog/Prolog way of unifying is through syntactic equality, which is a bit too simple to express the kind of equality needed in Rust and elsewhere. You can express it in Datalog, but as it gets farther away from the source, error-generation suffers.

tannhaeuser5y ago

"Insightful observation" doesn't quite do it justice ;) There's a whole discipline in discrete math dealing with the possibilities and limitations of equational theories and reasoning. And it could be said their limitations gave rise to the various constraint formalisms that were introduced in the 80s and 90s as Prolog extensions and sometimes (syntactic) generalizations.

YeGoblynQueenne5y ago

Well then, please elaborate because I know nothing of the discipline you describe and it certainly sounds interesting (though it also sounds a little fussy).

tannhaeuser5y ago

I'm not going to summarize classic equational reasoning (with it's deep connection to algebraic geometry and whatnot) here in a single HN post ;) I can point you to some classic works/authors in the field, though (and I'm sure some fellow HNers can provide some more): Gauss, Knuth, Bendix, Gröbner, Buchberger, Bachmaier, Euclid, Ganzinger, Davis, Putnam, Robinson, and Colmerauer for Prolog 2. There are also category theory papers relevant to reasoning about data structures, and of course Damas, Hindley, Miller for type theories. The way is the goal here.

1 more reply

YeGoblynQueenne5y ago

>> The key observation to me was that the traditional Datalog/Prolog way of unifying is through syntactic equality, which is a bit too simple to express the kind of equality needed in Rust and elsewhere.

I was surprised by this comment because Horn logic, although syntactically restricted, has Universal Turing Machine expressivity and it's difficult to see what Rust needs to do that needs more expressive power than a UTM(!).

Reading the post I understand that the author is speaking about unification in particular, not Horn clauses and resolution, where unification is typically used, however there is still a subtle error, which is the identificatin of unification as "syntactic equality". While unification can be used to decide whether two terms are syntacticall equal, in the sense that if they are, they will unify, unification on its own is not syntactic equality.

First of all, say that T₁ = f(χ) and T₂ = f(α) are two terms. Note that χ is a variable and α is a constant. Clearly T₁ ≠ T₂ and that is true for "syntactic" equality, because χ ≠ α.

Now suppose that θ = {χ/α} is a substitution of the variables in {T₁,T₂}. Now, θT₁ = Τ₂. Read θT₁ as "T₁ with θ applied". In other words, T₁ is equal to T₂ _if_ χ is substituted with α. A unification algorithm (normally, Robinson's unification algorithm) is used to establish that two terms are equal given a substitution of their variables. In particular, a substitution of variables that allows two terms to unify is a "unifier". Note that unifiers can be composed, so for example, if T₃ = f(χ,υ), T₄ = f(α,β), φ = {χ/α} and ψ = {υ/β} then φT₃ ≠ T₄ but φψT₃ = T₄.

I noticed in the examples of Rust-in-Prolog given in the link above that the author subtly fudges the syntax of Prolog, for example Bar, which is normally a variable in Prolog, is used as a constant, whereas ?A and ?B which would be not-variables in Prolog (but partially ground terms) are used as variables. I wonder if this is part of a now-divide-by-zero step in the author's reasoning, that I can't figure out exactly because I don't know any Rust. For example, what is "?A" and what is "Bar" in the context of Rust?

In any case, the author is of course correct that unification is not type equality- but then, nothing is except for type equality. As far as I understand these things anyway- I'm not a type nerd. Bottom line, you can't implement "type equality" as any other kind of "equality" and you'll have to jump through hoops -the hoops of defining your higher-level notion of equality- in what ever language you choose.

tom_mellior5y ago

> you'll have to jump through hoops -the hoops of defining your higher-level notion of equality- in what ever language you choose.

Right. The author is just disappointed that Prolog does have a built-in "equality" that isn't the equality they want. A nice solution in Prolog would be to write a meta-interpreter or to do term rewriting on the input program. This would allow the author to write

    clone(usize).

and have it be interpreted as

    clone(A) :- type_equal(A, usize).

YeGoblynQueenne5y ago

Yes. I just don't see why this is such a big hassle. Particularly because -I'm guessing- whatever the Rust compiler does to decide type equality can't be less hassle-y than that.

But, I might misunderstand the problem since, again, I don't know Rust.

kibwen5y ago· 3 in thread

In the Rust ecosystem, I believe the next-generation IDE support is already beginning to incorporate some ideas like this, although I believe it uses Prolog rather than Datalog (I'd be interested to learn more about how they differ). The logic library is called Chalk ( https://github.com/rust-lang/chalk ), and it's designed to be a full trait resolution engine for Rust (trait resolution being a necessary step to e.g. autocomplete methods on types); rust-analyzer has been using it for quite a while now, and it's slowly being incorporated into rustc itself. Not as impressive as basing the entire IDE on queries, of course!

The Rust project is also using the differential-datalog library mentioned in the OP to underlie their third-generation borrow checker: https://github.com/rust-lang/polonius

mmarx5y ago

> I believe it uses Prolog rather than Datalog (I'd be interested to learn more about how they differ).

Datalog is the positive, function-free fragment of Prolog, i.e., what Prolog offers over Datalog is negation and “value invention” through function terms. One consequence of that is that deciding whether a Prolog program entails a fact is undecidable (i.e., Prolog is Turing-complete), deciding whether a Datalog program entails a fact is can be done in polynomial time with respect to the size of the database.

namibj5y ago

It seems polonius is using datafrog instead. Where do you see a differential engine referenced?

kibwen5y ago

The post linked from the readme mentions it: https://smallcultfollowing.com/babysteps/blog/2018/04/27/an-... , although it's two years old at this point so I wouldn't be surprised if things have changed.

1 more reply

simplify5y ago· 2 in thread

Fantastic work! Datalog/Prolog are old-yet-futuristic technologies that I wished every developer knew about, as it solves a lot of problems in very elegant ways (as attested to by this post).

I noticed you (assuming you're the author) used what looks like a non-traditional Datalog syntax – which makes sense to me as, IMO, Datalog/Prolog desperately need first-class support for a record-like syntax to finally break into mainstream. Is there any prior work to this syntax, or did you just develop it as you needed it?

tannhaeuser5y ago

There's a proposal due to the prolog-commons initiative for a dictionary-like object (that is also implemented in SWI Prolog, I believe).

And in Prolog you can, of course, just trivially use your own term structure for item-values:

    p([ item : "value", ... ])

You could even use JSON-like terms in Prolog (with maybe a little help by the op/3 directive to sort out parsing priorities):

    A_Prolog_Term = {
      x: y,
      z: [a, b, {1, d} ]
    }

But the more fundamental approach IMHO would be to use nested knowledge-base terms:

    p(
      q(whatever).
      r(X) :- q(X).
    ).

As pioneered in the 1980s for attribute grammars, using ^^ and other special graph tokens as nested "implies" operator, though (see "definite-clause translation grammars").

tom_mellior5y ago

Agreed on the niceness of the keyword syntax, I would very much like to have than in Prolog too. SWI-Prolog has record-like data structured called dicts (IIRC) but you can't use them directly as clause heads the way the article does.

sandermvanvliet5y ago· 2 in thread

I would have expected a mention of Roslyn which powers Visual Studio (Not code). Sort of similar to the IntelliJ approach but it is also what drives the C# compiler, it maintains the code model (actually 2 of them, syntactic and semantic) which makes it pretty powerful. It is multi language (VB, C# and I think F# now too?) but it’s perhaps not as universal as the Language Server Approach

sideeffffect5y ago

I'm pretty sure that Roslyn doesn't do F#

mkl5y ago

Roslyn is mentioned in the Related Work section.

disposedtrolley5y ago· 2 in thread

Are there any examples of "programs as databases" being applied to structured text like JSON or YAML? Something like a generic system which takes a set of rules/facts and the source data, and transforms these into a queryable data structure.

I'm working with a lot of OpenAPI [1] specifications currently, some of which span tens of thousands of lines. Heaps of parent, child, sibling type relationships, and objects which are defined-once and referenced in many places. It would be nice if I could perform a search like "select all schemas used in this path" or "select the count of references to this parameter".

[1] https://github.com/OAI/OpenAPI-Specification/blob/master/ver...

sjg0075y ago

I mean, put them into leucene, elastic search or maybe mongodb?

disposedtrolley5y ago

Well yes, but that doesn't solve the parsing problem right?

My point is that some YAML or JSON documents conform to a specification which can be codified as a set of rules, which combined with the data form facts. I'm asking if there are existing systems to parse these files given I write these rules in a particular syntax, spitting out a queryable representation of the data.

habosa5y ago· 2 in thread

This is extremely cool, great work! You're right, IDEs can do so much magic but they're limited by the 'spells' their creators have imagined. Many times I've wanted to do a simple semantic refactor that I could express as logic but the IDE can't handle it.

(Also hi Pete!)

duncanawoods5y ago

I don’t see what that has to with the IDE. Just grab a parser/ast library for your language.

I’ve written a few of my own refactorings for big design changes that might touch e.g. every view in a huge application in ways no series of regexes could cover.

It’s hard work. Definitely one of those tricky “cost of automation” decisions. You can do an awful lot of dumb grunt work in the time it takes to debug a refactoring.

vilterp5y ago

Oh hey Sam! Been a while! Thanks for reading man :)

brabel5y ago· 1 in thread

Jetbrain IDEs (e.g. IntelliJ) have structural search which is basically a query language for code: https://www.jetbrains.com/help/idea/structural-search-and-re...

I know there's more to just queriability in the post, but I thought this should've been mentioned when discussing what existing IDEs can offer.

agumonkey5y ago

we're not far from coccinelle-ing the codebase

jierenchen5y ago· 1 in thread

This is very cool! It's really awesome to see this code as data concept gaining a lot of traction recently. Hope to see this project developed further.

I'm working on a similar project here: https://sourcescape.io/, but intended for use outside the IDE on larger collections of code (like the codebase of a large company.)

Agreed on the Prolog/Datalog approach of expressing a query as a collection of facts. CodeQL does the same. From one datastore nerd to another, I actually think this is a relatively unexplored area of querying knowledge graphs (code being a very complex, dense knowledge graph.)

Very excited to see where you go next with this "percolate search" functionality in the IDE.

geordimort5y ago

There’s a lot of research on this area. The fact that Semmle was acquired by Github/Microsoft is a testament to the maturity of the field.

afarrell5y ago· 1 in thread

One challenge to this that languages which easily allow metaprogramming can encourage codebases where it is hard to write tools to gain insight into the structure of a codebase.

You can say “don’t do that”, but I didn’t. Past coworkers did.

geordimort5y ago

This is just a prototype to showcase ideas. In real life even parsing of C++ or Java is much harder to pull off than one anticipated.

nine_k5y ago· 1 in thread

This is really nice. Key points:

- Parse the program like an IDE would, but expose the data in an open queryable database format (both line and unlike a language server).

- Use Datalog for storing the facts and inferring new facts about data.

(A fun fact: the Datalog implementation they use is written in Haskell and generates programs in Rust.)

kibwen5y ago

Here's the link to the Datalog implementation you mention (from VMWare, I wonder what they use it for?): https://github.com/vmware/differential-datalog

pjmlp5y ago

Rejuvenating that many devs are finally looking into this kind of workflows.

For prior work on IDE as database check "Energize/Cadillac & Lucid's Demise", https://www.dreamsongs.com/Cadillac.html

You can see it in action on this 1993 marketing video from Lucid, https://www.youtube.com/watch?v=pQQTScuApWk

IBM also tried the same concept on Visual Age for C++, version 4, which was one of the very last versions of the product.

http://www.edm2.com/index.php/VisualAge_C%2B%2B_4.0_Review

Both suffered from too heavy hardware requirements for what most companies were willing to pay for, otherwise we could have had Smalltalk like tooling, with incremental compilation and reloading for C++ already during the early 90's.

sideeffffect5y ago

This is a similar idea to that of SemanticDB, which is used in the Scala community

https://scalameta.org/docs/semanticdb/guide.html

It is a queryable database of semantic information about the program which is generated by the compiler (compiler plugin, to be precise). Once generated, other tools which need semantic information, like linters or language servers, can consume it without having to worry about how to actually generate it.

You might enjoy a talk about it: How We Built Tools That Scale to Millions of Lines of Code by Eugene Burmako

https://www.youtube.com/watch?v=C770WpI_odM

Kythe by Google is also a similar thing: https://kythe.io/

ilaksh5y ago

Reminds me a little of "Intentional Programming" or Structured Editors.

Davidbrcz5y ago

This idea of turning a program into a database and using prolog/dalalog on it is not new.

The most successful example is Semmle (bought by Github), which has been doing it for years now, with a SQL-like syntax for requests (named ".QL").

gavinray5y ago

This is how Rust Analyzers intellisense works, it uses a query engine called Salsa that stores the symbols in a database and the linting and completion/semantics are entirely query-driven:

https://github.com/salsa-rs/salsa

Good video describing it's use for working with Rust's AST:

https://youtu.be/i_IhACacPRY?t=1348

z3t45y ago

I've written an IDE in JS where all internal structures are available in the runtime which can be hooked into. I can however not see the usability in what the article describe. So I would like to se some practical examples. What if you could do queries like in the article, what sort of queries would you run !?

coderdd5y ago

With https://github.com/TreeTide/underhood, my goal is to provide a read-only view of code, geared for understanding and debugging. Having to maintain the ability to edit comes with constraints.

geordimort5y ago

Sorry to disappoint but for people doing programming language semantics there’s nothing new here. The hard part is always to go from the made-up language to a set of multiple full-blown languages.

tom_mellior5y ago

> I want to see how easy it is to add more advanced features to FP, like generics

This will be interesting to see, since the straight-forward implementation of generics for functional programming languages uses unification, which is not available in Datalog. It will probably be possible to encode things for any given program, since its universe of types should be finite. But it will involve jumping through hoops to encode something equivalent to unification.

pgt5y ago

Ask yourself: what is the difference between a database and a programming language? The question is worth exploring.

airocker5y ago

I think the biggest win here could be transactions. One commit can be worked upon by more than one person at once.

LeonB5y ago

The .net dependency and code maintenance tool “NDepend” has a feature called CodeQuery that this reminded me of.

You can write queries such as:

from m in Application.Methods where m.NbLinesOfCode > 30 && m.IsPublic select m

jtwaleson5y ago

I was very confused as I had read the title as '... with Datadog" (the monitoring tool).

Anyway, I think both are a good idea. Datadog -> integrating monitoring information back into the IDE. Datadog -> working with the actual semantics of the code rather than just blobs of text.

j / k navigate · click thread line to collapse

61 comments

53 comments · 24 top-level

YeGoblynQueenne5y ago· 8 in thread

So in short, datalog is a restricted subset of Prolog that has the advantage of being decidable, while Prolog in general is not, but is also incomplete while Prolog is complete [⁷].

Specifically, Horn clauses are clauses with a single positive literal.

The following are Horn clauses:

  ¬P(χ) ∨ ¬P(υ)
  P(χ) ∨ ¬Q(χ)
  Q(α)
  Q(β)

  ←P(χ), P(υ) (a)
  P(χ) ← Q(χ) (b)
  Q(α)←       (c)
  Q(β)←       (d)

And those are a "query", (a), a "rule", (b) and two "facts", (c) and (d).

And that's how Horn clauses and definite clauses become "rules", "facts" and "queries".

Next time: how the leopard got its stripes and the hippopotamus learned to love the first order predicate calculus.

_________________

[¹] This is my (first) slight fudge because constants are also functions, with 0 arguments. So, to be formal, the typical datalog above has "no functions of arity more than 0".

[²] Negation-as-failure makes a language non-monotonic, in the sense that introducing new "facts" can change the meaning of a theory, i.e. a program.

[³] So, its Least Herbrand Model, or its Least Fix-Point (LFP).

[⁴] Because it finds the LFP of the query and the program.

[⁵] Unless evaluated by SLG resolution, a.k.a. tabling, similar to memoization.

  utm(S,S) ← halt(S).
  utm(S,T) ← execute(S,S₁), utm(S₁,T).
  execute(S,T) ← instruction(S,P), P(S,T).

Originally in:

Tärnlund, S.-A. (1977). Horn clause computability. BIT Numerical Mathematics, 17(2), 215–226.

https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...

tom_mellior5y ago

Sorry, but you are also fudging things.

> Datalogs (there are many variants!) are subsets of Prolog.

No. First of all, this is not true syntactically. There are Datalogs that allow non-Horn clauses with several terms in a goal head:

    a, b :- c.

This is not allowed in Prolog. So (such) Datalogs are not subsets of Prolog syntactically.

Second, it is not true semantically either, not even for the common syntactic subset. Consider:

    ancestor_of(Parent, Child) :-
        child_of(Child, Parent).
    ancestor_of(Ancestor, Person) :-
        ancestor_of(Ancestor, Parent),
        child_of(Person, Parent).

YeGoblynQueenne5y ago

Yes, you're right and I'm also fudging things- but didn't I say that upfront? I start my comment by announcing an "almost not completely fudgy intro to logic programming"!

More seriously, you're right about syntax so thanks for the correction.

Anyway, thanks for the correction. I sure could have done a better job of that comment. Did you find anything other that was very wrong in my comment? I'd appreciate it if you pointed it out.

tom_mellior5y ago

> Did you find anything other that was very wrong in my comment?

1 more reply

vilterp5y ago

YeGoblynQueenne5y ago

Thank you for your article! Also note tom_mellior's corrections above, particularly about datalogs with two head literals. I'm fudging things up also!

YeGoblynQueenne5y ago

Edit: I did make two or three more changes after adding the whitespace that weren't whitespace, but single characters.

mkl5y ago

(Not dang!) New comments start at the top, so they have a chance to get upvotes, and quickly drop down if they don't get votes relative to other comments. It sounds normal.

YeGoblynQueenne5y ago

Bottom line: large comments with some edits sink like lead.

1 more reply

sriram_malhar5y ago· 6 in thread

Niko Matsakis had an insightful observation about this approach in a similar post (which is referenced by OP):

http://smallcultfollowing.com/babysteps/blog/2017/01/26/lowe...

tannhaeuser5y ago

YeGoblynQueenne5y ago

Well then, please elaborate because I know nothing of the discipline you describe and it certainly sounds interesting (though it also sounds a little fussy).

tannhaeuser5y ago

1 more reply

YeGoblynQueenne5y ago

tom_mellior5y ago

> you'll have to jump through hoops -the hoops of defining your higher-level notion of equality- in what ever language you choose.

    clone(usize).

and have it be interpreted as

    clone(A) :- type_equal(A, usize).

YeGoblynQueenne5y ago

Yes. I just don't see why this is such a big hassle. Particularly because -I'm guessing- whatever the Rust compiler does to decide type equality can't be less hassle-y than that.

But, I might misunderstand the problem since, again, I don't know Rust.

kibwen5y ago· 3 in thread

The Rust project is also using the differential-datalog library mentioned in the OP to underlie their third-generation borrow checker: https://github.com/rust-lang/polonius

mmarx5y ago

> I believe it uses Prolog rather than Datalog (I'd be interested to learn more about how they differ).

namibj5y ago

It seems polonius is using datafrog instead. Where do you see a differential engine referenced?

kibwen5y ago

1 more reply

simplify5y ago· 2 in thread

Fantastic work! Datalog/Prolog are old-yet-futuristic technologies that I wished every developer knew about, as it solves a lot of problems in very elegant ways (as attested to by this post).

tannhaeuser5y ago

There's a proposal due to the prolog-commons initiative for a dictionary-like object (that is also implemented in SWI Prolog, I believe).

And in Prolog you can, of course, just trivially use your own term structure for item-values:

    p([ item : "value", ... ])

You could even use JSON-like terms in Prolog (with maybe a little help by the op/3 directive to sort out parsing priorities):

    A_Prolog_Term = {
      x: y,
      z: [a, b, {1, d} ]
    }

But the more fundamental approach IMHO would be to use nested knowledge-base terms:

    p(
      q(whatever).
      r(X) :- q(X).
    ).

As pioneered in the 1980s for attribute grammars, using ^^ and other special graph tokens as nested "implies" operator, though (see "definite-clause translation grammars").

tom_mellior5y ago

sandermvanvliet5y ago· 2 in thread

sideeffffect5y ago

I'm pretty sure that Roslyn doesn't do F#

mkl5y ago

Roslyn is mentioned in the Related Work section.

disposedtrolley5y ago· 2 in thread

[1] https://github.com/OAI/OpenAPI-Specification/blob/master/ver...

sjg0075y ago

I mean, put them into leucene, elastic search or maybe mongodb?

disposedtrolley5y ago

Well yes, but that doesn't solve the parsing problem right?

habosa5y ago· 2 in thread

(Also hi Pete!)

duncanawoods5y ago

I don’t see what that has to with the IDE. Just grab a parser/ast library for your language.

I’ve written a few of my own refactorings for big design changes that might touch e.g. every view in a huge application in ways no series of regexes could cover.

It’s hard work. Definitely one of those tricky “cost of automation” decisions. You can do an awful lot of dumb grunt work in the time it takes to debug a refactoring.

vilterp5y ago

Oh hey Sam! Been a while! Thanks for reading man :)

brabel5y ago· 1 in thread

Jetbrain IDEs (e.g. IntelliJ) have structural search which is basically a query language for code: https://www.jetbrains.com/help/idea/structural-search-and-re...

I know there's more to just queriability in the post, but I thought this should've been mentioned when discussing what existing IDEs can offer.

agumonkey5y ago

we're not far from coccinelle-ing the codebase

jierenchen5y ago· 1 in thread

This is very cool! It's really awesome to see this code as data concept gaining a lot of traction recently. Hope to see this project developed further.

I'm working on a similar project here: https://sourcescape.io/, but intended for use outside the IDE on larger collections of code (like the codebase of a large company.)

Very excited to see where you go next with this "percolate search" functionality in the IDE.

geordimort5y ago

There’s a lot of research on this area. The fact that Semmle was acquired by Github/Microsoft is a testament to the maturity of the field.

afarrell5y ago· 1 in thread

One challenge to this that languages which easily allow metaprogramming can encourage codebases where it is hard to write tools to gain insight into the structure of a codebase.

You can say “don’t do that”, but I didn’t. Past coworkers did.

geordimort5y ago

This is just a prototype to showcase ideas. In real life even parsing of C++ or Java is much harder to pull off than one anticipated.

nine_k5y ago· 1 in thread

This is really nice. Key points:

- Parse the program like an IDE would, but expose the data in an open queryable database format (both line and unlike a language server).

- Use Datalog for storing the facts and inferring new facts about data.

(A fun fact: the Datalog implementation they use is written in Haskell and generates programs in Rust.)

kibwen5y ago

Here's the link to the Datalog implementation you mention (from VMWare, I wonder what they use it for?): https://github.com/vmware/differential-datalog

pjmlp5y ago

Rejuvenating that many devs are finally looking into this kind of workflows.

For prior work on IDE as database check "Energize/Cadillac & Lucid's Demise", https://www.dreamsongs.com/Cadillac.html

You can see it in action on this 1993 marketing video from Lucid, https://www.youtube.com/watch?v=pQQTScuApWk

IBM also tried the same concept on Visual Age for C++, version 4, which was one of the very last versions of the product.

http://www.edm2.com/index.php/VisualAge_C%2B%2B_4.0_Review

sideeffffect5y ago

This is a similar idea to that of SemanticDB, which is used in the Scala community

https://scalameta.org/docs/semanticdb/guide.html

You might enjoy a talk about it: How We Built Tools That Scale to Millions of Lines of Code by Eugene Burmako

https://www.youtube.com/watch?v=C770WpI_odM

Kythe by Google is also a similar thing: https://kythe.io/

ilaksh5y ago

Reminds me a little of "Intentional Programming" or Structured Editors.

Davidbrcz5y ago

This idea of turning a program into a database and using prolog/dalalog on it is not new.

The most successful example is Semmle (bought by Github), which has been doing it for years now, with a SQL-like syntax for requests (named ".QL").

gavinray5y ago

This is how Rust Analyzers intellisense works, it uses a query engine called Salsa that stores the symbols in a database and the linting and completion/semantics are entirely query-driven:

https://github.com/salsa-rs/salsa

Good video describing it's use for working with Rust's AST:

https://youtu.be/i_IhACacPRY?t=1348

z3t45y ago

coderdd5y ago

With https://github.com/TreeTide/underhood, my goal is to provide a read-only view of code, geared for understanding and debugging. Having to maintain the ability to edit comes with constraints.

geordimort5y ago

Sorry to disappoint but for people doing programming language semantics there’s nothing new here. The hard part is always to go from the made-up language to a set of multiple full-blown languages.

tom_mellior5y ago

> I want to see how easy it is to add more advanced features to FP, like generics

pgt5y ago

Ask yourself: what is the difference between a database and a programming language? The question is worth exploring.

airocker5y ago

I think the biggest win here could be transactions. One commit can be worked upon by more than one person at once.

LeonB5y ago

The .net dependency and code maintenance tool “NDepend” has a feature called CodeQuery that this reminded me of.

You can write queries such as:

from m in Application.Methods where m.NbLinesOfCode > 30 && m.IsPublic select m

jtwaleson5y ago

I was very confused as I had read the title as '... with Datadog" (the monitoring tool).

Anyway, I think both are a good idea. Datadog -> integrating monitoring information back into the IDE. Datadog -> working with the actual semantics of the code rather than just blobs of text.

j / k navigate · click thread line to collapse