To begin with, Datalog is not a "cousin" of Prolog as stated in the section "Interlude: Brief Intro to Datalog". Datalogs (there are many variants!) are subsets of Prolog. For example, a typical datalog is the language of definite clauses with no function symbols [¹] and with no negation as failure [²]. Another datalog may allow only the cons function in order to handle lists; etc.
Otherwise the syntax of datalog is identical to Prolog, but there is a further difference, in that Prolog is evaluated from the "top down" whereas Datalog is evaluated from the "bottom up". What that means is that given a "query" (we'll come to the scare quotes in a moment) Prolog will try to find "rules" whose heads unify with a literal in the query (A literal is an atom, or the negation of an atom; "p(χ,α)" is an atom.) whereas datalog will first generate the set of all ground atoms that are consequences of the program in the context of which the query was made, then determine whether the atoms in the query are in the set of consequences of the program [³]. The reason for the different execution model is that the bottom-up evaluation is guaranteed to terminate [⁴] whereas Prolog's top-down evaluation can "go infinite" [⁵]. There is of course another, more subtle difference: Prolog can "go infinite" because of the Halting problem, from which datalog does not suffer because, unlike Prolog, it does not have Universal Turing Machine expressivity [⁶].
So in short, datalog is a restricted subset of Prolog that has the advantage of being decidable, while Prolog in general is not, but is also incomplete while Prolog is complete [⁷].
Now, the other slight fudge in the article is about "rules", "facts" and "queries". Although this is established and well-heeled logic programming terminology, it fudges the er fact that those three things are the same kind of thing, namely, they are, all three of them, Horn clauses [⁸].
Specifically, Horn clauses are clauses with a single positive literal.
Crash course in FOL: an atom is a predicate symbol followed by a set of terms in parentheses. Terms are variables, functions or constants (constants are functions with 0 arity, i.e. 0 arguments). A literal is an atom, or the negation of an atom. A clause is a disjunction of literals. A clause is Horn when it has at most 1 positive literal. A Horn clause is a definite clause when it has exactly 1 positive literal.
The following are Horn clauses:
¬P(χ) ∨ ¬P(υ)
P(χ) ∨ ¬Q(χ)
Q(α)
Q(β)
In logic programming tradition, we write clauses as implications (because ¬A ∨
B ≡ A → B) and with the direction of the implication arrow reversed to make it
easier to read long implications with multiple premises. So the three clauses
above are written as: ←P(χ), P(υ) (a)
P(χ) ← Q(χ) (b)
Q(α)← (c)
Q(β)← (d)
And those are a "query", (a), a "rule", (b) and two "facts", (c) and (d).Note that (b,c,d) are definite clauses (they have exactly one positive literal, i.e. their head literal, which is what we call the consequent in the implication). Facts have only a positive literal; I like to read the dangling implication symbol as "everything implies ...", but that's a bit idiosyncratic. The bottom line is that definite clauses with no negative literals can be though of as being always true, hence "facts". Queries, i.e. Horn clauses with no positive literals, are the opposite: "nothing implies" their body literals (my idiosyncratic reading) so they are "always false". Queries are also called "goals". Finally, definite clauses with both positive and negative literals can be thought of as "conditionally true".
Prolog and datalog programs are written as sets of definite clauses, i.e. sets of "facts" and "rules". So, when we want to reason about the "facts" and "rules" in the program, we make a "query". Then, the language interpreter, which is a top-down resolution theorem prover [⁹] in the case of Prolog, or bottom-up fixpoint calculation in the case of datalog [¹⁰], determines whether our "query" is true. If the query includes any variables then the interpreter also returns evey set of variable substitutions that make the query true.
In the example above, (a) has two variables, χ and υ and evaluating (a) in the context of (b,c,d) would return a "true" result with the variable substitution {χ/α,υ/β}, i.e. (a) is true iff χ = α and υ = β.
And that's how Horn clauses and definite clauses become "rules", "facts" and "queries".
Next time: how the leopard got its stripes and the hippopotamus learned to love the first order predicate calculus.
_________________
[¹] This is my (first) slight fudge because constants are also functions, with 0 arguments. So, to be formal, the typical datalog above has "no functions of arity more than 0".
[²] Negation-as-failure makes a language non-monotonic, in the sense that introducing new "facts" can change the meaning of a theory, i.e. a program.
[³] So, its Least Herbrand Model, or its Least Fix-Point (LFP).
[⁴] Because it finds the LFP of the query and the program.
[⁵] Unless evaluated by SLG resolution, a.k.a. tabling, similar to memoization.
[⁶] Although higher-order datalogs, that allow for predicate symbols as terms of literals have UTM expressivity, indeed a UTM can be defined in a higher-order datalog fragment where clauses have up to two body literals with at most two arguments:
utm(S,S) ← halt(S).
utm(S,T) ← execute(S,S₁), utm(S₁,T).
execute(S,T) ← instruction(S,P), P(S,T).
Originally in:Tärnlund, S.-A. (1977). Horn clause computability. BIT Numerical Mathematics, 17(2), 215–226.
[⁷] Less fudgy, definite programs are refutation complete under SLD resolution, meaning that any atom that is entailed by a definite program can be derived by SLD resolution. A definite program is a set of definite clauses, explanation of which is coming right up.
[⁸] Long time ago, I explained this to a colleague who remarked that all the nice syntactic elegance in query languages falls apart the moment you try to make a query, which usually has a very different syntax than the actual rows of the tables in the database. So I said "that's the point! Queries are also Horn clauses!" and his immediate remark was "That just blew my mind". It's been so long and I'm so used to the idea that I haven't a clue whether this is really mind blowing. Probably, being my usual excited self, I just said it in a way that it sounded mind blowing (gesticulating widely and jumping up and down enthusiastically, you can picture the scene) so my colleague was just being polite. That was over drinks at the pub after work anyway.
[⁹] Resolution is an inference rule that allows the derivation of new atoms from a set of clauses. In theorem proving it's used to refute a goal clause by deriving the empty clause, □. Since a goal is a set of negated literals, refuting the goal means basically that the negated literals are true. So our query is true in the context of our program.
[¹⁰] Datalog's bottom-up evaluation uses something called a TP operator. It basically does what I said above, starts with the ground atoms in a program and then derives the set of consequences of the clauses in the program. In each iteration, the set of consequences are added to the program and the process is repeated, until no new consequences are derived. As stated above, the process is guaranteed to complete because every datalog definite program has a least fixpoint, which is also its Least Herbrand Model (we won't go into Herbrand Models and Herbrand interpretations, but, roughly, an LHM is the smallest set of atoms that make the union of a definite program and a goal true). A more complete introduction to LHMs and LFPs and how they are used in bottom-up evaluation for datalog can be found here:
https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...
> Datalogs (there are many variants!) are subsets of Prolog.
No. First of all, this is not true syntactically. There are Datalogs that allow non-Horn clauses with several terms in a goal head:
a, b :- c.
This is not allowed in Prolog. So (such) Datalogs are not subsets of Prolog syntactically.Second, it is not true semantically either, not even for the common syntactic subset. Consider:
ancestor_of(Parent, Child) :-
child_of(Child, Parent).
ancestor_of(Ancestor, Person) :-
ancestor_of(Ancestor, Parent),
child_of(Person, Parent).
This is left-recursive, so typical queries will not terminate in Prolog, i.e., have no finite solutions. But as you say, Datalogs are decidable and any query terminates, so you will get solutions, which is different semantics from Prolog. So it's not meaningful to say that Datalog is a semantic subset of Prolog.Datalog and Prolog are like C++ and Java: One is an extension of a subset of the other, or equivalently, there is a non-empty common subset with similar-ish semantics. This is not a very useful statement? I agree! But it is what it is. They are different languages.
More seriously, you're right about syntax so thanks for the correction.
But, regarding semantics, the ancestor_of/2 program above can terminate in Prolog, evaluated by SLG resolution, as per my footnote 5. There are still situations where Prolog will not terminate when evaluating a normal program even under SLG resolution, but left recursion is not one of those.
Edit: also, if a Datalog program is also Prolog, and assuming that the program terminates under Prolog, then Prolog and Datalog will both compute its LHM. So it makes sense to say that the two languages are semantically at least very similar and to explain one in terms of the other. They are both much closer than what each is to ASP, for example. It really depends on what assumptions one makes- and that's where the "fudging" comes in.
Anyway, thanks for the correction. I sure could have done a better job of that comment. Did you find anything other that was very wrong in my comment? I'd appreciate it if you pointed it out.
I'm not sure about the statement that "Datalog is evaluated from the "bottom up". [...] datalog will first generate the set of all ground atoms that are consequences of the program in the context of which the query was made, then determine whether the atoms in the query are in the set of consequences of the program". I think it's true that Datalog behaves as if it were evaluated like this, but AFAIK Datalog systems can do lots of very aggressive optimizations that change the actual evaluation. I'm not an expert on Datalog.
Anyway, I was mostly dissatisfied with the general thrust of the comment, trying to establish a "subset" relationship. I think the article's "cousins" comment is fair. Prolog is older and influenced Datalog very much. Prolog is also Turing complete, so it is strictly more powerful. But Datalog's semantics allow some optimizations that wouldn't be possible in Prolog, so it can be a lot faster on appropriate classes of problems.
Edit: I did make two or three more changes after adding the whitespace that weren't whitespace, but single characters.
That's not absolute, because I remember it stated elesewhere by dang that larger comments are considered more likely to be more substantial (for sound reasons). But it seems that holds up to a certain point. There is probably some kind of comment size range, of maximum and minimum comment size, outwith which a comment is pushed down.
There are other factors, also that cause a comment to automatically sink. For example, I have the (admittedly very bad) habbit of editing and re-editing my comments over and over again. I've noticed that if I do this in a sufficiently large comment, the comment immediately sinks when I refresh the thread page after an edit. If the comment is smaller, it can take quite a bit more editing. I think this behaviour is also weighted by user karma or time since account creation because my comments started sinking less often as my karma increased. Or maybe the rules of the site changed in the meantime.
Bottom line: large comments with some edits sink like lead.
http://smallcultfollowing.com/babysteps/blog/2017/01/26/lowe...
The key observation to me was that the traditional Datalog/Prolog way of unifying is through syntactic equality, which is a bit too simple to express the kind of equality needed in Rust and elsewhere. You can express it in Datalog, but as it gets farther away from the source, error-generation suffers.
I was surprised by this comment because Horn logic, although syntactically restricted, has Universal Turing Machine expressivity and it's difficult to see what Rust needs to do that needs more expressive power than a UTM(!).
Reading the post I understand that the author is speaking about unification in particular, not Horn clauses and resolution, where unification is typically used, however there is still a subtle error, which is the identificatin of unification as "syntactic equality". While unification can be used to decide whether two terms are syntacticall equal, in the sense that if they are, they will unify, unification on its own is not syntactic equality.
First of all, say that T₁ = f(χ) and T₂ = f(α) are two terms. Note that χ is a variable and α is a constant. Clearly T₁ ≠ T₂ and that is true for "syntactic" equality, because χ ≠ α.
Now suppose that θ = {χ/α} is a substitution of the variables in {T₁,T₂}. Now, θT₁ = Τ₂. Read θT₁ as "T₁ with θ applied". In other words, T₁ is equal to T₂ _if_ χ is substituted with α. A unification algorithm (normally, Robinson's unification algorithm) is used to establish that two terms are equal given a substitution of their variables. In particular, a substitution of variables that allows two terms to unify is a "unifier". Note that unifiers can be composed, so for example, if T₃ = f(χ,υ), T₄ = f(α,β), φ = {χ/α} and ψ = {υ/β} then φT₃ ≠ T₄ but φψT₃ = T₄.
I noticed in the examples of Rust-in-Prolog given in the link above that the author subtly fudges the syntax of Prolog, for example Bar, which is normally a variable in Prolog, is used as a constant, whereas ?A and ?B which would be not-variables in Prolog (but partially ground terms) are used as variables. I wonder if this is part of a now-divide-by-zero step in the author's reasoning, that I can't figure out exactly because I don't know any Rust. For example, what is "?A" and what is "Bar" in the context of Rust?
In any case, the author is of course correct that unification is not type equality- but then, nothing is except for type equality. As far as I understand these things anyway- I'm not a type nerd. Bottom line, you can't implement "type equality" as any other kind of "equality" and you'll have to jump through hoops -the hoops of defining your higher-level notion of equality- in what ever language you choose.
Right. The author is just disappointed that Prolog does have a built-in "equality" that isn't the equality they want. A nice solution in Prolog would be to write a meta-interpreter or to do term rewriting on the input program. This would allow the author to write
clone(usize).
and have it be interpreted as clone(A) :- type_equal(A, usize).But, I might misunderstand the problem since, again, I don't know Rust.
The Rust project is also using the differential-datalog library mentioned in the OP to underlie their third-generation borrow checker: https://github.com/rust-lang/polonius
Datalog is the positive, function-free fragment of Prolog, i.e., what Prolog offers over Datalog is negation and “value invention” through function terms. One consequence of that is that deciding whether a Prolog program entails a fact is undecidable (i.e., Prolog is Turing-complete), deciding whether a Datalog program entails a fact is can be done in polynomial time with respect to the size of the database.
I noticed you (assuming you're the author) used what looks like a non-traditional Datalog syntax – which makes sense to me as, IMO, Datalog/Prolog desperately need first-class support for a record-like syntax to finally break into mainstream. Is there any prior work to this syntax, or did you just develop it as you needed it?
And in Prolog you can, of course, just trivially use your own term structure for item-values:
p([ item : "value", ... ])
You could even use JSON-like terms in Prolog (with maybe a little help by the op/3 directive to sort out parsing priorities): A_Prolog_Term = {
x: y,
z: [a, b, {1, d} ]
}
But the more fundamental approach IMHO would be to use nested knowledge-base terms: p(
q(whatever).
r(X) :- q(X).
).
As pioneered in the 1980s for attribute grammars, using ^^ and other special graph tokens as nested "implies" operator, though (see "definite-clause translation grammars").I'm working with a lot of OpenAPI [1] specifications currently, some of which span tens of thousands of lines. Heaps of parent, child, sibling type relationships, and objects which are defined-once and referenced in many places. It would be nice if I could perform a search like "select all schemas used in this path" or "select the count of references to this parameter".
[1] https://github.com/OAI/OpenAPI-Specification/blob/master/ver...
My point is that some YAML or JSON documents conform to a specification which can be codified as a set of rules, which combined with the data form facts. I'm asking if there are existing systems to parse these files given I write these rules in a particular syntax, spitting out a queryable representation of the data.
(Also hi Pete!)
I’ve written a few of my own refactorings for big design changes that might touch e.g. every view in a huge application in ways no series of regexes could cover.
It’s hard work. Definitely one of those tricky “cost of automation” decisions. You can do an awful lot of dumb grunt work in the time it takes to debug a refactoring.
I know there's more to just queriability in the post, but I thought this should've been mentioned when discussing what existing IDEs can offer.
I'm working on a similar project here: https://sourcescape.io/, but intended for use outside the IDE on larger collections of code (like the codebase of a large company.)
Agreed on the Prolog/Datalog approach of expressing a query as a collection of facts. CodeQL does the same. From one datastore nerd to another, I actually think this is a relatively unexplored area of querying knowledge graphs (code being a very complex, dense knowledge graph.)
Very excited to see where you go next with this "percolate search" functionality in the IDE.
You can say “don’t do that”, but I didn’t. Past coworkers did.
- Parse the program like an IDE would, but expose the data in an open queryable database format (both line and unlike a language server).
- Use Datalog for storing the facts and inferring new facts about data.
(A fun fact: the Datalog implementation they use is written in Haskell and generates programs in Rust.)
For prior work on IDE as database check "Energize/Cadillac & Lucid's Demise", https://www.dreamsongs.com/Cadillac.html
You can see it in action on this 1993 marketing video from Lucid, https://www.youtube.com/watch?v=pQQTScuApWk
IBM also tried the same concept on Visual Age for C++, version 4, which was one of the very last versions of the product.
http://www.edm2.com/index.php/VisualAge_C%2B%2B_4.0_Review
Both suffered from too heavy hardware requirements for what most companies were willing to pay for, otherwise we could have had Smalltalk like tooling, with incremental compilation and reloading for C++ already during the early 90's.
https://scalameta.org/docs/semanticdb/guide.html
It is a queryable database of semantic information about the program which is generated by the compiler (compiler plugin, to be precise). Once generated, other tools which need semantic information, like linters or language servers, can consume it without having to worry about how to actually generate it.
You might enjoy a talk about it: How We Built Tools That Scale to Millions of Lines of Code by Eugene Burmako
https://www.youtube.com/watch?v=C770WpI_odM
Kythe by Google is also a similar thing: https://kythe.io/
The most successful example is Semmle (bought by Github), which has been doing it for years now, with a SQL-like syntax for requests (named ".QL").
https://github.com/salsa-rs/salsa
Good video describing it's use for working with Rust's AST:
This will be interesting to see, since the straight-forward implementation of generics for functional programming languages uses unification, which is not available in Datalog. It will probably be possible to encode things for any given program, since its universe of types should be finite. But it will involve jumping through hoops to encode something equivalent to unification.
You can write queries such as:
from m in Application.Methods where m.NbLinesOfCode > 30 && m.IsPublic select m
Anyway, I think both are a good idea. Datadog -> integrating monitoring information back into the IDE. Datadog -> working with the actual semantics of the code rather than just blobs of text.