What are your experiences when writing a new programming language, where and how did you start?
I would recommend a lot of books here as well for a solid base and giving you the background needed to write more complex languages including static types, compilers, JITs and VMs. But for me, because I really do not have that much time, that would stretch over too long time and I lose interest, so I stick to things I know I can finish (or rather, get working and achieve something a previous incantation did not achieve; smallest, simplest, fastest etc).
[1] https://gist.github.com/tluyben/16ee2645c4c8aed813005d51488d...
It's available in a variety of languages and has a nice guide (linked) to various steps.
The vast majority of languages are parsed into ASTs of some form, and Lisp is basically directly programming in AST data structures that are executable. Seeing that source code is basically just another data structure to be manipulated and converted was the major "Ah hah!" moment in demystifying compilers for me.
Forth is a great lower-level representation of nested expression evaluation, and sheds a lot of light on how to organize that evaluation in your code generation, converting foo(bar(baz(val),blort(val2))) into a linear stream of operations. Even for register machines, Forth's stack represents the intermediate values that must be retained between expressions in either registers or memory.
Once this basis of how programming languages work is established, then the ideas of new programming language design from the source code perspective can be explored in practical terms.
For example here's a version with classes / OO-behaviour:
https://github.com/haifenghuang/monkey
Or here's my version which "just" adds some more standard-library facilities, allowing you to open/read/write files:
I found the language fun to play with, and I used a virtually identical lexer and parser-idea in a recent project of my own - rather than "parsing" an input-file with regular-expressions, which is how I would have previously approached the implementation of simple DSLs. I guess that showed I'd learned something useful!
Maybe that book is more practical than I realize though.
It helped me write the scripting language that I use in one of my projects.
Of course if you pick a language like forth (mentioned below) the whole problem is sort of moot.
That said, everyone who learns to program usually learns more than one language. For me, it was after about the fifth language that the syntax of a particular language fell away and what remained was the algorithm that was being expressed. It was this ability to see the algorithm distinct from the syntactic construction of the language that lets me see where the language is "helping" and where it is "hurting" in expressing the algorithm.
The advice then for starting points for developing new languages would be to take some algorithms you know well and implement it in as many languages as you can find. Ideally you'll have at least one imperative language (BASIC, C, Fortran, COBOL all qualify), one functional language (haskell/APL), one lambda language (lispish), and one semantic language (Ada, Mesa, Modula) Typically sorting, string handling, and asynchronous execution (or schedulers) all make for good algorithms to try. See what is "easy" and what is "hard" to implement based on the features the language brings to the table.
Then optimize for 'easy'.
That will gives you verbs and structure that support your need, once you have those then giving them names and a syntactic structure is fairly straight forward.
Write a simple interpreter for a C-like language by parsing it into a syntax tree, constructing a stack of symbol tables, and evaluating the nodes.
Once that works, write a version that emits code for a stack-based virtual machine (a simple stack machine written concisely might eat just a few screenfuls of code).
When that works, symbolically evaluate the stack bytecode into variable references, pick a simple architecture (I like MSP430 for this but ARM will do) do a trivial register allocator, implement register spill to the stack, and emit real code.
Then read Lisp In Small Pieces.
Worked OK for me!
A surprising amount of the complexity you might perceive in a "real" programming language just comes down to how expressions are parsed.
After that, it's a question of where you want to go with your language, and what distinguishing features you'd like it to have relative to other languages. Languages which just change some syntax and keywords but otherwise look like C are common, but they tend not to get usage beyond their creators. Much of the academic research lately is about type systems (TAPL has already been mentioned) - there's research into dependent types, effect typing, total functional programming, linear types already made it into Rust's borrow checker, etc.
However, many of the worst problems in industry - dependencies, versioning, smoothly migrating from a prototype to robust production code, code navigation, security, parallelization & distribution, the memory hierarchy - have gotten relatively little attention in academic research, and are desperately awaiting solutions. If you have some industry experience, it may be worth trying to imagine how you might fix these problems if you were allowed to change the programming language at will.
[1] https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...
[2] http://web.mit.edu/alexmv/6.037/sicp.pdf
[3] https://www.amazon.com/Modern-Compiler-Implementation-Andrew...
Considering c++ is one of the most widely-used languages out there, I would question this.
Ohm [0] is a good place to start. It's a parser generator. If you can understand the math example [1], you can probably realize how things get built upwards. Then you can focus on what you want your language to do, how you'll interpret it, what problems it would express, &c.
[0] https://ohmlang.github.io/
[1] https://github.com/harc/ohm/blob/master/examples/math/index....
Also consider reading about OMeta, if Ohm isn't your thing.
Markup languages are not generally thought of as programming languages, but maybe that's an archaic view.
At any rate, 90% of the material in this thread is way overkill if all one wants to do is parse Markdown.
I have found too many language designers have not thought about this clearly enough. When you go to actually use the language, you will suddenly find all sorts of oddities in how you construct the program you are trying to write. Undefined behaviour is a curse, implementation defined behaviour is a curse.
If your language allows you to write something and it passes the compiler, then make sure that it is completely defined so that no matter what there are no subtle surprises.
You pose the question but don't seem to answer it - this area is called semantics, and there are several approaches and notations you can use. Types and Programming Languages by Pierce is the most common starting point.
By the way, how much are you doing on Katadhin these day?
Take a top-down approach, and focus on semantics over syntax.
If you want to design a new language, think of a moderately sized representative example of a problem you have day-to-day, and write a sketch of a program in the magical pseudocode you wish you were able to write. It doesn’t have to be a big problem, or a big language, just a concrete problem domain.
Then, piece by piece, investigate the details of how to make that notation work in reality: how do you parse, typecheck, optimise, compile, and implement a runtime for it? Figure out the smallest set of core primitive operations needed to express the semantics of your language—that’s your core calculus, which you can prove things about or use as an intermediate representation in your compiler.
In addition, a productive strategy for producing an actual implementation, followed in some compiler courses, is to build your compiler (or interpreter) in such a way that you always have a working implementation at every step—of a language that starts small and grows over time.
For example, write a program that takes in a “hello world” or other primitive program written in your language, and just produces the expected output without any analysis—the result of the program, or a binary (machine code, .NET/JVM bytecode, &c.) with the right behaviour. Then, incrementally add features—outputting other messages, doing other things than simple input/output, evaluating complex terms, rejecting invalid programs, and ultimately taking advantage of the semantic features that make your language unique. This style of development helps encourage you to write a test suite of increasingly complex example programs, which are an essential part of a language implementation—you don’t need to follow strict rules like TDD, but you do need to test everything.
Other books burn a lot of time on things like lexing and parsing; this one does not. In the lisp tradition, it assumes s-expressions and gets to the juicy bits of programming language semantics immediately: lexical scope, first class functions, objects, etc.
A few other tips:
* Use a parser generator. Parsing is more or less a solved problem; get it out of the way as fast as possible.
* Writing a language that compiles to another language is very helpful as you can often borrow the underlying language's implementation of a feature. Doubly helpful if you can leverage the underlying language's libraries for things like syscalls.
* Make a todo list (I like Trello for this) full of self-contained tasks. Pick one at a time, think through how you would implement that feature, and go to work. Compilers look like a lot of work in the beginning, but if you take one step at a time, one day you'll look back and have accomplished more than you realize. The list can also be helpful to avoid scope creep; add any new ideas to your backlog and deal with them later.
* Write unit tests, as you'll very likely want to refactor something down the line. Once your compiler becomes complex enough, add functional tests that verify entire programs can compile and run and produce the expected output.
By using Lisp-y language you can ignore the tedious and uninteresting task of parsing for the real meat of compiling and (some) optimizing.
I've been exploring exactly this, compiling with Prolog, this past month or so and I'm simply blown away. It's so simple, elegant, and easy. It will be faster and easier for you to learn the basics of Prolog and then learn how to write compilers using it, than to learn to write compilers in another language (even a language you already know.) Prolog is an extremely simple language despite its incredible power. It's very different from imperative/functional languages so it might take a minute to wrap your head around it, but then you're operating on a higher plane.
Read "Logic Programming and Compiler Writing " by David Warren http://sovietov.com/tmp/warren1980.pdf But beware: it's dated. Or, if you can get a copy of "The Art of Prolog" you can read Chapter 24, which covers the same concepts but has more modern code. Specifically, you're going to want to use Definite Clause Grammars (DCG):
https://en.wikipedia.org/wiki/Definite_clause_grammar http://www.pathwayslms.com/swipltuts/dcg/index.html https://www.metalevel.at/prolog/dcg http://www.amzi.com/manuals/amzi/pro/ref_dcg.htm
Study "Toy compiler in Prolog" http://faculty.cooper.edu/smyth/cs225/ch7/prolog.htm And maybe this: https://github.com/fusiongyro/dailyprogrammer/tree/master/20...
I can vouch for this. I wrote a parser for Oberon in two days. I wrote an assembler for the RISC CPU used in Project Oberon in another two days. Both of them are no longer than two pages and the code is essentially just grammar rules. I haven't (yet) written the bit in-between that translates the AST from the parser into assembly code for the chip, but only because I have other fish to fry. But I can see how to do it and it would be very easy. The compiler is described in the Project Oberon book if you want to attempt that language.
I'm implementing a compiler for Joy on top of the assembler and it's been like a dream. Because the code is so simple, and the editor I'm using has an extension that uses the Prolog interpreter to lint as I go, it's actually hard to introduce bugs. I write the asm code then abstract patterns into the compiler to refactor and (lightly) abstract the raw asm into what are effectively AST nodes. But there is no language above this. I think I have to write a blog post called "Compiler for No Language" or something. Eventually, I am going to abstract my way to "atomic" nodes that represent Joy functions and then I'll have a Joy compiler. But in the meantime, I'm floating around in a semantic sea between parsing and assembly, and it's warm and tropical and you would not believe the fish I've seen.
I have to conclude that anyone who is writing any sort of compiler and NOT using Prolog is working way too hard.
I've been programming in Python for twenty years, happy as a clam, and now you couldn't pay me enough to NOT use Prolog.
"These are the words of VAL. They are true. Hear them."
Another approach is to look at SICP, it is Scheme oriented and more oriented towards macros. See: https://mitpress.mit.edu/sites/default/files/sicp/index.html
SQL is a deceptively complex language, the optimisations that it performs are very complex. Markdown, on the other hand, is more of a transpiler to HTML and is not Turing Complete.
There's also a followup article implementing additional data types and tail call optimization. http://norvig.com/lispy2.html
Now add error handling (i.e., unrecognized operators, or stack underflow condition) -- figure out what you want to do, such as crash the interpreter with an error message, or return an error and let the user continue.
Now, read up on the Shunting Yard algorithm. This lets you take a regular (infix) math expression, and convert it to RPN (postfix) notation, which you can then feed directly into the RPN interpreter ("stack machine") that you just wrote.
Now look at handling parentheses, which should be covered in the same explanation of the Shunting Yard algorithm.
Next, add in variables in the original stack language, and figure out what minimal changes to the second piece (the algebraic expression parser) you need to do to get it to handle variables.
Finally, add in functions, and you have a simple language. I followed this exact formula when putting together a toy language many years ago, after which I learned how much I didn't know that I didn't know about language design. But I know enough now for a second shot.
Oh, and some other resources to study that may help -- Look for the reference books for the Postscript language. The "Blue" book teaches you Postscript, and I believe it is the "Red" book that tells you enough of the internals for you to write your own Postscript interpreter (or to at least get inspiration on creating you own version of a Stack machine interpreter).
https://www.elsevier.com/books/programming-language-pragmati...
If you like excellent free university lectures online, this is great:
https://www.youtube.com/watch?v=3N__tvmZrzc
In reality - programming languages are most likely what you think they are, if you simply start with the basics like what's a variable, what's a function. How do I call functions inside other functions. Do I want variables to have types. What kind of types?
If you start with the very very basics - global variables, global functions, you can't call functions within other functions - and implement that, you'll get the feel for it, which is most likely as far as you'll want to go, because it gets very tedious, very quickly, as soon as you want something as 'simple' as proper unicode support, garbage collection, generics, inheritance, foreign-function-interface and the list goes on and on.
To get your feet wet regarding what it's all about - those video lectures above are top notch.
[1] https://ferret-lang.org/ [2] https://nakkaya.com/2011/06/29/ferret-an-experimental-clojur... [3] https://nakkaya.com/2010/08/24/a-micro-manual-for-lisp-imple...
Then continue enhancing that tiny language, add more features.
Start with an interpreter, they are easier to grasp. You can always add a compiler later.
I grabbed Scheme R5RS standard and started implementing it in Kotlin: https://github.com/kovrik/scheme-in-kotlin
Then added more Clojure-like and Racket-like features, plus some Kotlin features.
Also I would recommend staying away from lexer and parser generators and just use hand built recursive descent parser and some set of ad-hoc lexical analysis subroutines.
This book taught me how to write a compiler and build a programming language of my own.
To answer you question "where and how did you start?": This is where I started learning about compilers and programming language implementation.
Here is its description from its website:
* Comprehensive treatment of compiler construction.
* JavaCC and Yacc coverage optional.
* Entire book is Java oriented.
* Powerful software package available to students that tests and evaluates their compilers.
* Fully defines many projects so students can learn how to put the theory into practice.
* Includes supplements on theory so that the book can be used in a course that combines compiler construction with formal languages, automata theory, and computability theory.
What I promise you with this book: You'll learn how to write your own programing language. Not only how compilers and about the language but also about computer architecture. This book is very easy to read and understand. It starts with very basic topics then slowly building from it until you'll grok implementing the language given the specification. You'll have the chance to build a compiler from scratch or by using JavaCC and Yacc.
If you're happy to consider Markdown a language (which it is, just not a programming language), you could also consider a (subset of) XML parser, or JSON parser. It's a nice place to start, because it's very small scope, a subset of PL (parsing), and the completed thing is actually usable and useful.
Plus, it gets you a step on the path to a programming language - arguably, the fundamental step.
I recommend a recursive descent parser, and this dragon compiler book (https://en.m.wikipedia.org/wiki/Principles_of_Compiler_Desig... Your local uni library has it.). An early chapter gives a complete calculator (arithmetic parser), in very simple and clear C code.
NB the most recent dragon book uses horrific over-complex architecture-astronaut, design-patterns OO java code. OOP has its place, but not like this. Not like this.
write a forth or lisp lexer, parser and interpreter.
then write a small language that compiles to a target, like js, make sure your language is close in semantics to the target language to avoid headaches.
Yes, I'm the author! There are other, more in-depth, compiler books on my homepage: http://t3x.org
The home page also contains the source code to many compilers and interpreters at different levels of complexity, from trivial to real-world.
Ever since I started programming, the idea of writing my own language was a sort of "white whale". I have one idea for a language that I think could be usable in a very particular domain. For that, I wrote a very primitive tree-walking interpreter (terrible performance, very technically uninteresting). The idea for the language was what I wanted to explore. I recently started writing a regex->jvm bytecode compiler. That's much lower level, and if I stick with it, performance will be what I optimize for. In neither case did I do anything interesting with parsers, though I used to love learning about new parsing techniques.
In that vein, see dfox's suggestion to "just use hand built recursive descent parser and some set of ad-hoc lexical analysis subroutines". That's the fastest way to get started, whether or not it'll be the best in the long term.
So, my advice is to explore and figure out which things you're interested in, and what you want to accomplish. That will tell you where you want to invest your time.
Another option is to build a VM, that accepts as input byte in a text (not binary) form. Less efficient, but easier to implement, and you can write your bytecode by hand instead of also building a compiler to do so. Like object-oriented assembly language, if you will. If you're interested, I'll scan and share my dissertation report with you.
You can also implement a domain-specific language. Did you, in the past, feel that a task you were doing would be simpler if there was a custom language for it? If so, maybe you can try building that DSL and then using that to make your main task easier.
You can begin with writing a tiny lexer and a recursive descent parser that simply writes out the value of an expression that the user typed in. Then you can make it generate assembly for computing the expression instead (hardcoding some further assembly to print the result out at the end of the execution of the program), turning it into either a traditional compiler or a JIT compiler. Once you do that, it should not be too hard to see how to add variables and simple loops. This will already demystify things quite a bit and make further learning easier.
I specifically recommend implementing your own parser and generating assembly on your own, at least if your primary goal is learning. You could instead use an existing parser generator and e.g. a library for generating JVM bytecode, but those ready-made components are high level enough to turn it all into the usual exercise in plumbing libraries together, rather than a learning experience across many layers of the computing stack.
1. Cute hacks like using the eval function in Python and rewriting the string before passing it in. This is basically metaprogramming by macro-processing and a good stepping stone to thinking about what a programming language really accomplishes.
2. Writing various Forth-like and Lisp-like languages. This let me explore a working system without spending too much time on parsing - I could try interpreting the AST directly or compiling it to an intermediate language.
3. Gradually learned that a lot of the features and technologies in general purpose programming languages aren't things I want to care about(how to implement arithmetic, perform variable assignment, perform typechecking etc.). Focused on data languages like JSON and small supplemental languages for e.g. scheduling concurrency.
4. Found a useful model for prototyping semantics before adding a syntax, by envisioning the whole language in the form of a state machine that builds expressions one API call at a time, then compiles/executes the result with a "commit" call or similar. This is a practical technique for many domain specific problems, and makes the necessary syntax emerge naturally.
A lot of what goes into making a useful production language isn't really on the theoretical side, but on finish-and-polish stuff: fast builds, helpful docs and error messages, well-rounded libraries, debugging tools, ease of deployment and so on. Learning this through the process of writing small PLs helped change how I viewed language selection to be more like picking any other application, vs a point of religion: choosing something with approximately the right features and tooling so that the groundwork is easy, and then magnifying that leverage by adding a bit of custom syntax, static checking, code generation, what-have-you on top. Turning the tech into an airtight abstraction generalized across a whole language is a ton of work, but getting 80% of it in for the specific app takes a fraction of the time in comparison.
-1. You need to actively set aside time each day for this task. -2. Leave time for this to last a long time, two months, minimum. -3. Plan every step out. -4. Repeat, repeat, REPEAT!! As in, repeat your training tasks. People have to repeat something seven times to burn it into permanent memory, and you need to remind yourself of it at least once a month for a year, then once every six months after this.
If you do those four things, which are surely no easy task, you should be an excellent programmer in no time. Good luck :)
I've captured the entire process on video: https://www.youtube.com/watch?v=0_PX2Ue6F_A&list=PLWMUVtnFsZ...
The YouTube playlist is an archive of my daily streams on Twitch.
When I was learning, my only resources were books, google, and PHP-src and CPython.
The idea was then to have a language that maintains the elegance and clarity, and a compiler that generates in my case C code. This let me focus on the "core" of a compiler, without too much emphasis on parsing, tons of optimizations, corner cases.
So this would be my advice. Pick a domain, maybe focus on getting highly optimized code, design a small language and build your first compiler.
Specifically Chapter 5 - Language Structures and Compilers
The next step is to actually parse code with an abstract syntax tree. The easy path here is actually javascript. There are many transpilers for JS so things like bluebird/harmony can work.
Next step is to do thing "the right way"(tm) with yak/bison and build a compilable language. For that, I suggest you read a few books on languages and understand concepts such as backus-naur form before continuing.
Another approach is to look at rpy - the restricted python subset used for pypy, see for example "other languages" on:
I started out just working on a set of linked/cooperating libraries and then decided to offer both the libraries and a custom language. I am writing something for my own use but when I release the MVL, I obviously also hope to get users and pull requests.
But it's much better to make languages explicit. And there's no need to invent new grammars and parsers.. keep those for real extreme cases.
i express my languages in python, its easiest to use as base and build semantics on top (and even syntax). One can just use function/class-based declarations, of domain-specific notions, and then interpret them any way one wants. Or one can even turn function-calls into "statements", and "with"-construct into "blocks-of-statements", etc. Gives you strict debuggable proto-syntax, on which to build your own "syntax"/semantics/error-checking/reporting.
but any way, try to see/feel the Language behind the syntax/implementation. have fun
- kernel language, which is not a practical language, but contains all the concepts for building one of a given model
- growing said kernel language one concept at a time, when its expressiveness is stretched
- the case for multi paradigm languages, with a real example throughout the book (Oz)
- and I mean multiparadigm (although the author dislikes this word): first declarative, then with concurrency, then with ports, then with explicit state, then object oriented with and without concurrency (which in fact isn't really an extension of the kernel language), then relational ... all being simple but very careful extensions to that kernel language
- the case for each concept, which results in an extension to the language, is very carefully explained and many alternatives are considered, and much justification is given to why it would improve the language; if you are considering designing your own language, you could do worse than learning from the author's carefully described approach
- each concept results in an addition to the grammar, the semantics of which are described in detail; they are also described formally in an appendix
- whenever a new programming model is described (which is basically a set of concepts), an archetypal example is discussed: Haskell (concurrent declarative), Erlang (message passing), Java (concurrent OO), Prolog (relational)
- techniques for programming with each concept are introduced
- each chapter has exercises at the end, several of which consist of doing your own extensions to the language; some of these are research level
- as an example of how deep but accessible this book is, check the "Principal programming paradigms" poster [pdf] [2]
- the extensive bibliography at the end is fantastic
And since I believe that it hasn't been mentioned in the comments and in case you don't know yet, check out "Lambda the Ultimate", "the programming languages weblog" [3]
[1] https://www.info.ucl.ac.be/~pvr/book.html
[2] https://www.info.ucl.ac.be/~pvr/paradigmsDIAGRAMeng108.pdf
A lot of the resources out there are going to assume you have one or two sides of this triangle already and are ready to bring it all together. If you don't then you'll probably feel like you're in over your head.
I would recommend checking out "Let's Build a Compiler". It's a bit old-fashioned but at least things are pretty simple. https://compilers.iecc.com/crenshaw/
Different flavors of Markdown are not a bad place to start. Try and find something with a real syntax. As I recall, the original/official version is defined by a Perl program and has no true grammar. But other variants probably do.
SQL is not simple. Just the parser for PostgreSQL's dialect was around 1500 lines of code, last I checked. Although, if you did a small subset, maybe it wouldn't be so bad.
I really like the book Semantics with Applications which discusses the meaning of different programming languages. When I was starting out, it seemed to me like there was a lot riding on the syntactic differences between languages like C and Pascal. Now having seen the fringe, a lot of those differences are not as significant to me as they were before. This book helps you drill into what makes some languages different below the surface, and how to model those differences formally.
I think it's eminently possible to make a small language, like "While", in a few pages of code. I would start with a very simple unambiguous grammar, and a little interpreter for it, but if you really want the full experience, generate assembler. This paper sort of presents the language I'm talking about, but there is a lot here about type theory and semantics that may not be interesting to you, so just ignore that. https://arxiv.org/pdf/1603.08949.pdf When I took compilers ~15 years ago, we generated code for nasm because gas syntax was too weird for us. Anyway.
Another approach that looks interesting to me is "many small steps": http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf The idea here is to start with a program that only knows about integers, but handle everything end-to-end. Then you slowly add features one-by-one, but in an end-to-end fashion. I think this would be a lot easier to iterate and it hearkens back to Crenshaw's approach.
A lot of compiler construction material focuses on Lisp and Scheme. This is because it sort of hides or ignores the parsing problem. You will probably have to learn some other languages to build a compiler; it wouldn't hurt to know some Lisp or Scheme to be able to access these resources. A lot of people build a Lisp-like language as their first for this reason too. Just something I'm pointing out.
Anyway, it's a noble goal and best of luck!
If you have a specific problem, you can then write a small DSL for it. I once had the opportunity to write a small "test language" for the sub-protocols of H.323. These protocols are specified using SDL (https://en.wikipedia.org/wiki/Specification_and_Description_...) and i just converted the tokens in the spec directly into my language and put a simple markdown syntax around it. The tests simply exercise the protocol FSM.
Eg:
<H245>
<Test1>
<Prestimuli>
Var1=Val1 //these are assignments
Var2=Val2
</Prestimuli>
<Stimuli>
MsgFromNetwork=<msg name> //trigger to FSM
</Stimuli>
<Poststimuli>
Var1=Val1.1 //these are assertions
Var2=Val2.1
</Poststimuli>
</Test1>
</H245>
Though simple, this worked nicely, allowing me to push off the actual writing of the test cases to non-programmers who just had to convert the SDL diagrams from the spec. to text as given above.