6.13 Create your own language with a lot of transpiling: in LispE - naver/lispe GitHub Wiki

Creating a new programming language

Version française

You're tired of the increasingly abstruse syntaxes that new languages impose on you. You dream of a pure, simple language that speaks to your inner self, the one that likes to meditate under an olive tree, eyes fixed on the distant horizon... Uh... Well...

There is an answer to that... You can create your own language ...

Yes, you can... And then you look at the literature and you say to yourself: "maybinotfinely".

Because it's full of things that nobody ever uses, concepts that are scary just to say out loud. The tree of abstract syntax, a BNF grammar .

Frustrated, you give up on the idea and come back to this Python code that crashes because the toto variable that originally contained a list now contains an integer.

Don't give up. The wonderful thing about computer science is that the things you dream about, you can often implement...

No... I'm kidding... Let's be honest, most simple solutions are often slammed into the ground.

Except, the one I'm going to present here.

LispE

We'll do everything in LispE. LispE is a rather original Lisp that is hosted on the GitHub of Naver, a company that offers in Korea all the services on the Internet, from search engine to online shopping, that you can dream of.

The main feature of LispE is that it relies on the use of arrays rather than linked lists, as is traditionally the case. This has many advantages, especially in terms of access to objects within a list via its index. Note in passing that linked lists are also part of LispE but they do not have a central role that they generally have in other implementations.

The article you are reading is actually hosted in the Wiki of the GitHub in question.

Important: You will find here the pre-compiled versions for Windows and Mac (including M1 and Intel).

For Linux, I advise you to read the page: compiling on Linux

Let's create our own language.

Our language

If you have arrived here, it is because you still want to know.

Know Oh! reader! that I have designed a wonderful tool that will make you feel a lot of admiration.

Because LispE contains all the necessary tools to make a powerful and simple language.

Here is an example of what it looks like:

function fact(a)
   if a <> 1 then
      a * fact(a-1)
   then
      1
   endif
endfunction

a = fact(10)
println a

No unattractive ";", no untimely "(...)". Clean, simple and powerful code.

Of course, you could create another syntax, but the idea is to create your own first language, and it might as well be simple, because otherwise it could be a bit complicated. Obviously.

Playground

For a full example of the language, see: Example

It contains all the instructions to execute the example in Basic saved in the code variable.

You can run it directly in LispE.

BNF grammar

The role of a grammar is to produce a description of our new language in a mathematical form. But a grammar is also a language with its own conventions, which we will now describe in detail.

For you must know that computer science is just that: conventions, some of which were imagined by brilliant minds levitating above Saturn.

Description of the grammar

Oh! reader, look at this instruction: A = 10.

We'll write a rule to capture the sacred essence of it.

assignment := variable %= number

We assume that variable is used to recognize a variable name and number is used to recognize a... number. The % is used to escape the = operator.

Except that our variables can be assigned with something else than numbers, right? Yes, of course, in this case we will separate the possible values by a ^, also called disjunction operator.

assignment := variable %= [calculus^variable^number^string]

The disjunction is enclosed in two [..] to indicate that it applies to these four elements.

number, string and variable will be defined outside the grammar when we transpile the whole thing... Because, I tell you, we are going to transpile a lot.

But calculus, eh!!!

We'll write another rule, two in fact:

operator := [%< %<]^[%> %>]^[%^ %^]^[%* %*]^%&^%|^%+^%-^%*^%/^%%^%^
calculus := [number^string] [operator [number^string]]+

Transpiling is the process of transforming code written in language X into code written in language Y. For example, a transpiler can take a Java code and transform it into an equivalent C++ code. Although in this case, it is more of a cryptographic issue.

In this case, we're going to transform our grammar into a LispE program. And it's going to be pretty simple and straightforward. In fact, each rule name will become the name of a function. The disjunctions will be compiled as or and the conjunctions as and.

Let's go back to the previous example:

assignment := variable %= [calculus^variable^number^string]
operator := [%< %<]^[%> %>]^[%^ %^]^[%* %*]^%&^%|^%+^%-^%*^%/^%%^%^
calculus := [number^string] [operator [number^string]]+

The assignment rule will become a LispE function. It is composed of a conjunction and a disjunction:

(defun C_assignment (tokens i0 v)
   (check (< (car i0) (size tokens))
      (setq v0 ())
      (if (and
            (setq i1 (clone i0))
            (setq v1 ())
            (C_variable tokens i1 v1)
            (compare tokens "=" i1 v1 nil)
            (or
               (C_calculus tokens i1 v1)
               (C_variable tokens i1 v1)
               (C_number of tokens i1 v1)
               (C_string tokens i1 v1)
            )
            (set@ i0 0 (car i1))
            (setq v0 v1)
         )
         (push v (cons 'assignment v0)); remember this line carefully
      )
   )
)

We put C_ in front of the name to avoid confusion with real LispE instructions. As we can see in this example, C_variable, C_compare, C_number and C_string can be perfectly defined outside the grammar. In this way, we can modulate the operation of our parser in the finest possible way.

Tokens

The above method expects a list of tokens. And to produce tokens, we perform a tokenization. Tokenization consists in cutting a string in significant segments that we put in a list.

"A = 10" --> ("A" "=" "10")

And in fact, this operation, LispE gives it to you for free: tokenize_rules.

Tokenization with this instruction is done with a set of rules which you can modify.

; We get the default ruleset
(setq tok (tokenizer_rules))

;And now we can tokenize
(setq tokens (tokenize_rules tok "A=10")) ; ("A" "=" "10")

And now that we have our tokens, we can easily apply our rules on them:

; We get the default set of rules
(setq tok (tokenizer_rules))

; And now we can tokenize
(setq tokens (tokenize_rules tok "A=10")) ; ("A" "=" "10")

(setq v ())
(C_assignment tokens '(0) v)

And this is where the miracle happens. v now contains: (assignment (variable "A") (number 10)).

For what is produced here is a tree: The Abstract Syntax Tree.

Let's go further, and show those skeptics how powerful this solution is.

But this time we'll use our real example, the one whose grammar is here: Basic.

We'll try to parse: if a < 10 then a = a + 1 endif:

(setq tokens (tokenize_rules tok "if a < 10 then a = a + 1 endif"))

; tokens: ("if" "a" "<" "10" "then" "a" "=" "a" "+" "1" "endif")

(setq v ())
(C_analysis tokens '(0) v)

; v now contains: 
(if 
   (comparison
    (variable "a")
    (comparator "<")
    (anumber 10)
   )
   (then
     (declaration
         (variable "a")
         (computing
           (variable "a")
           (operator "-")
           (anumber 1)
         )
     )
   )
)

Let's note something cool right away. The number of items in sublists is defined by the grammar. This sounds obvious, but it is absolutely fundamental. Because the next step is to translate this tree into LispE code.

First level of transpiling

By the way, the program that transpiles the grammar is located here: compiler.

It takes the grammar as input and generates the basic.lisp file which contains the transpiled code.

Roughly speaking, the grammar is tokenized line by line, and for each of these lines an equivalent LispE function is generated.

Some functions like: compare, C_anumber, C_Word, C_astring are predefined outside the grammar. They are used to detect certain types of token like a number or a string for example.

Patterns

We make extensive use of the notion of pattern programming to be able to perform this transformation.

Indeed, we move token by token in a list and we look for some particular cases, like the presence of the disjunction operator. By default, the rule is transformed into a (and..) structure and the disjunctions into (or..) structures.

Instructions in conditions

The nice thing about LispE is that you can perfectly initialize a variable in an and or in an or. This makes code generation much easier.

The variables i0 and v are for the first one the index of the current token in tokens and the other one the reception variable of the already recognized subtrees. In the case of a and structure, we create local variables to store the different subtrees. When we reach the end of the and, we then copy these subtrees into the main variable, otherwise we exit without changing anything. In this way, we can test several analysis paths without any issue.

Note also that the use of square brackets [..] results in an additional (and...) structure, the purpose of which is to guarantee that the sequence of elements will be correctly recognized (see the code of: comparator for a nice example. There is even the original rule in the file).

Finally, note that when a rule contains one of the Kleene ?+* operators, we turn that part of the rule into a particular function starting with O_, P_ or S_ respectively, O for optional, P for plus and S for star.

Obviously, this compilation does not need to be repeated every time you want to compile a program to Basic. Once basic.lisp is produced, you can use it as many times as you like.

A Secret of the Universe Unveiled

Otherwise, by the way, this is what the Abstract Syntax Tree looks like for our function: fact, when run through the mill of our grammar in Lispian form.

(function
   "fact"
   (variables (variable "a"))
   (if (comparison
         (variable "a")
         (comparator "<" ">")
         (anumber 1)
      )
      (then
         (computing
            (variable "a")
            (operator "*")
            (call
               "fact"
               (computing
                  (variable "a")
                  (operator "-")
                  (anumber 1)
               )
            )
         )
      )
      (else (anumber 1))
   )
)

As you can see from the result, this tree looks a lot like a Lisp program. There are parentheses all over the place, and there is even a whiff of prefixed calls.

You're selling us Lisp, we get it, I've got some really important stuff to finish. I'm out of here.

And then, with a distant look and a heart full of pity, I stop you and say: "My sister or my brother, don't be angry. If the AST looks like Lisp, it is because Lisp expresses its code directly in the form of an Abstract Syntax Tree. For in the beginning of the code was the AST and Lisp is its purest expression."

Ode to Lisp

There you go Oh! reader! I have just revealed to you the best kept secret of computer science. Lisp is the simplest language to compile, because in Lisp everything is an AST. There is also Forth which is in the same case, but it does it backwards.

In other words, where other languages suffer from a serious lack of vegetation, LispE has been vegan since the beginning. Where other languages have to go through some crazy code to produce the AST needed for further programming, in Lisp, you directly write that tree. With Lisp, YOU ARE THE COMPILER.

Parentheses are the ultimate embodiment of code purity. They free the code from its binary gangue to expose its tree-like structure.

Transpiling, the return

Yes, we have now produced an AST: the Abstract Syntax Tree.

But this is not enough, we must now produce the corresponding Lisp code.

Roughly speaking, where we have: A=10+20, we would like to produce: (setq A (+ 10 20)).

And where we have:

function fact(a)
   if a <> 1 then
      a * fact(a-1)
   then
      1
   endif
endfunction

It would be nice if we had:

(defun fact(a)
  (if (neq a 1)
      (* a (fact (- a 1)))
      1
   )
)

Because this code, LispE knows how to execute it.

Pattern programming

As much as I exaggerated a little earlier on the importance of pattern programming, this is really fundamental.

You've seen the look of the AST above. I put some nice indentations in what is a list anyway.

And this list contains a lot of keywords that we're going to use. Keywords that are systematically in position 0 of our sub-lists.

Do you follow?

Well, I'll show you on a piece of tree:

(computing
   (variable "a")
   (operator "-")
   (anumber 1)
)

Do you see the keywords: computing, variable, operator and anumber?

They are always in the first position, because of course that's how they were constructed by basic.lisp.

Remember the line in the C_assignment code:

(push v (cons 'assignment v0)); remember this line carefully

v0 is the subtree produced by C_assignment, and we put it in our final structure by pushing assignment on top of it.

And all the code in basic.lisp is built like this.

Keeping the tree slim

Also, each element of the rule should produce one and only one element in v0, at the position that corresponds to it.

assignment := variable %= [calculus^variable^number^string]

In the above case, there should be one element for the initial variable, one for the = sign, and one possible element among calculus, variable, number or string.

However, if you have a look on the actual output, the final tree does not contain all these elements.

(assignment
    (variable "A")
    (anumber 10)
)

Don't worry, there are flags in the grammar to avoid overloading the tree with useless information. For example, the = sign is totally redundant, and therefore useless as it has been used to identify an assignment. Obviously, for the rest of the analysis, we can do without it.

This avoids ending up with a bloated tree, with information that will have to be analyzed during the transpilation, the return.

# The ! in front of assignment indicates that operators and strings can be ignored
!assignment := variable %= number

The pattern functions

So, we now know several things that we will use.

  • It is a tree structure that we will go through recursively
  • The lists start with a keyword
  • The rules have been compiled in such a way that we know the number of elements expected in advance

Note that for the number of elements, the presence of the Kleene operators can sometimes make things a little more complicated.

Tailor Made

First of all, the special transpiler that we will write depends closely on the grammar that we have written.

As much as the compiler can take any grammar as input to generate the right program, the transpiler depends on the particular keywords in the grammar to generate the right code.

If you thought this was going to be simple, sorry.

For Basic, we have already written the corresponding program: transpiler.

Pattern Headers

You can see right away that there is a function: parsing defined with the keyword defpat for pattern definition.

Pattern programming consists in proposing several possible configurations for a given list. We then let the interpreter decide which function to choose.

Here are for example the different possibilities in our transpiler:

(defpat parsing ( ['assignment $ d] )...)
(defpat parsing ( ['indexes $ d] )...)
(defpat parsing ( ['dim n $ d] )...)
(defpat parsing ( ['dimstring n $ d] )...)
(defpat parsing ( ['dimvariablestring n $ d] )...)
(defpat parsing ( ['dimvariable n $ d] )...)
(defpat parsing ( ['variable d] )...)
(defpat parsing ( ['stringvariable $ d] )...)
(defpat parsing ( ['minus d] )...)
(defpat parsing ( ['anumber d] )...)
(defpat parsing ( ['string d] )...)
(defpat parsing ( ['comparator $ d] )...)
(defpat parsing ( ['method n m $ arguments] )...)
(defpat parsing ( ['call n $ arguments] )...)
(defpat parsing ( ['function name parameters $ code] )...)
(defpat parsing ( ['forin init rg $ code] )...)
(defpat parsing ( ['for init test inc $ code] )...)
(defpat parsing ( ['while test $ code] )...)
(defpat parsing ( ['then $ code])...)
(defpat parsing ( ['else $ code])...)
(defpat parsing ( ['if test then $ else] )...)
(defpat parsing ( ['computing $ rest] )...)
(defpat parsing ( ['comparison a1 op a2 $ d] )...)
(defpat parsing ( ['multiop x $ arguments] )...)
(defpat parsing ( ['parenthetic $ code] )...)

; Skip a label
(defpat parsing ( [ [atom_ x] $ d] )...)

; Fallback functions
(defpat parsing ( [ x $ d] )...)
(defpat parsing ( x )...)
(defpat parsing () )...)

Obviously we have only given here the headers of the functions.

Some details in passing, important for the understanding of the system.

  • Each parameter describes a list composed of a certain number of elements, whose value can be hardcopied.
  • The $ is used here as an end-of-list operator. In other words, after the $, we put the rest of the list in the variable that follows.
  • The first element of each of these lists comes directly from the rule heads of the grammar. As explained above, we have arranged for this label to always occupy this first place.

It should also be noted that each of these patterns is indexed on its first element. Thus, the first element of the list in argument makes it possible to jump directly to this function, which makes the whole process pretty efficient. If several functions are indexed on the same element, they will be tested in the order in which they were declared.

; if we run: 
(parsing '(if ...))

; the "if" will be used to directly jump to
(defpat parsing ( ['if test then $ else] )...))

The last three functions have too general a parameter to allow indexing. They are used as fallback functions if no other function could be applied.

Let's note in passing the rule that allows to skip a label. The idea is that the grammar can build subtrees without necessarily having a real semantic interpretation. For example, when a combination is particularly heavy or redundant, we can create an entry in the grammar to avoid carrying around something too complicated.

For example, in the grammar, we can admire the callitem rule whose usefulness from a semantic point of view is very limited. It's just a big disjunction that would have made a mess of the other rules because of its size.

The execution

We'll now move on to the execution of the code and show how a piece of Basic code can become LispE.

For this, I have already prepared a piece of code that should enlighten you: Example.

The variable code keeps a piece of code in Basic. We will call the function transpile which will transpile this code into LispE.

(defun transpile (code)
   ; Transpiling into LispE
   (setq tree (abstract_tree code))

   ; __root__ is a special function that defines the top block of a LispE program
   (setq code '(__root__))
   (push code '(trace true))

   (ife (in (car tree) "Error")
      (setq code (list 'println (join tree " ")))
      ; Here we loop element by element and we call parsing
      ; We put the result in code
      (loop line (cdr tree)
         (setq c (parsing line))
         (push code c)
      )
   )
   code
)

As you can see, it is rather straightforward. We call the famous function abstract_tree that we created with the compiler.

It will automatically take our code and generate a list that contains the abstract syntax tree.

We then apply the function parsing to each of the tree elements.

Let's take an example:

Suppose we want to parse fact.

We produce the following tree with abstract_tree.

(function
   "fact
   (variables (variable "a"))
   (if (comparison
         (variable "a")
         (comparator "<" ">")
         (anumber 1)
      )
      (then
         (computing
            (variable "a")
            (operator "*")
            (call
               "fact"
               (computing
                  (variable "a")
                  (operator "-")
                  (anumber 1)
               )
            )
         )
      )
      (else (anumber 1))
   )
)

The presence of the keyword function at the top of our list will automatically trigger the following function:

(defpat parsing ( ['function name parameters $ code] )...)

This will allow us to give a value to name, parameters and code:

name: "fact"
parameters: (variables (variable "a"))
code: (if (comparison...))

And now we can start to appreciate the simplicity of this type of programming.

Here is the body of parsing for the keyword: 'function

(defpat parsing ( ['function name parameters $ code] )
   (setq code (maplist 'parsing code false))   
   (nconcn (list 'defun (atom name) (map 'parsing (cdr parameters)) code)
)

We apply parsing on each element of code, which allows us to produce the body of our function. This is equivalent to doing:

(parsing '(if ...))

There is only one element in this case.

But each element in the if will in turn be parsed with a parsing call. In this way, each element, no matter how deep it is in the tree, receives the same type of analysis, ensuring consistency and balance.

Here is a simple version of how to parse if.

I've put the rule above it so that the choice of pattern in the function is clearly visible.

; !if := $If comparison then else? $EndIf
; Since we put a ! in front of the if, all the keywords: 
; IF and ENDIF are skipped in the tree generation.
; No need to worry about them in the pattern.

(defpat parsing ( ['if test then $ else] )
   (setq test (parsing test))
   (setq then (parsing then))
   (if else
       ; after the $, we put the result in a list, hence the call to car for the else
       (list 'if test then (parsing (car else))) 
       (list 'if test then)
   )   
)

An if is composed of two or three sections: test, then and else.

Each of these sections is parsed with a recursive call to parsing.

; Then... We look at the list size to introduce a block
(defpat parsing ( ['then $ code])
   (if (eq 1 (size code))
      (parsing code)
      (consb 'block (maplist 'parsing code false))
   )
)

; Else...
(defpat parsing ( ['else $ code])
   (if (eq 1 (size code))
      (parsing code)
      (consb 'block (maplist 'parsing code false))
   )
)

We then merge the different lists thus obtained to return the final result.

We apply the same thing to the parameters.

We then merge (nconcn) all these lists to generate our function, which we return as the final result.

Special cases

Ok, I admit, I simplified the above functions a bit, but not so much if you look at the real code. For example, I added in the Basic language, the ability to manipulate variables with rather complicated dimensions: dim complicated[5][5][5].

If you take a look at the code, you will see that I implement them as a flat list of 5*5*5 elements (125). In the rest of the code, the content of complicated is accessed via the atshape operator which needs the list '(5 5 5) as an argument.

When you read the code, it doesn't necessarily jump out at you...

In the same way, we can declare a list by executing: Data 10,20,30, "A", "B" EndData.

Just to have a way to create these lists, which is consistent with our new language.

Example of code

To have examples of code that can be written in Basic, I invite you to look at: example, you will have a picture of the whole formalism that we have defined for this language.

And Python...

Let's be honest, the grammar method cannot be applied to indented languages. Some people will think I'm saying this out of pure malice.

In fact, not really, the main reason is that grammars can hardly integrate the notion of variable spaces that we use to distinguish blocks in Python.

However, there is a solution which consists in inserting dummy labels at strategic places. And it happens that I wrote a program LispE that does exactly that: indentation python

It takes a Python program and generates:

def Loss(a, X,Y):
     s = []
     for x in X:
         s.append(sum([w * e for (w,e) in zip(a,x)])
     end#
     return sum([(yy - e)**2 for (yy,e) in zip(Y,s)])
end#

It then adds an end# to indicate where a block ends. It then becomes possible to write a grammar to transpile Python code into LispE.

I'll let you write the grammar, if you like.

Conclusion

This article has a very simple goal, to show the steps needed to create a language. We chose LispE as a reference because this family of languages (the Lisp family) is particularly adapted to this kind of work. Actually, many languages in the past, especially object languages, started their career as a Lisp implementation.

The nice thing about this method is that you just have to modify the grammar a little bit, and quickly you can make your own language. For example, I created a French version of basic called basicois. You just have to apply the compiler on it and you have a French language parser. (In fact, it is already available: basicois_example.lisp).

And then fact becomes:

fonction fact(a)
   si a <> 1 alors
      a * fact(a-1)
   sinon
      1
   finsi
finfonction

Of course, you can make your own version in Greek or German, if you feel like it. Besides, if you don't modify any of the rule heads, you can use transpiler as is.

The grammar even contains a mechanism for overloading certain LispE function names with new names:

; The following rule is an overload rule
; It is compiled as a call to link
; @affiche := print

; Now affiche will be interpreted as a call to print
(link "affiche" 'print)

This is not the definitive way to create languages, clearly, but this is a nice way to play with different formalisms. You can easily create new languages tailored to specific audiences, with a more flexible syntax for example.

LispE is pretty rich in terms of functionalities (see help). Many aspects of modern programming such as functional programming or array programming are part of experience. But Lisp might prove too much of a challenge with its minimal syntax, however, nothing prevents you from designing different formalisms to tap into these functionalities.

⚠️ **GitHub.com Fallback** ⚠️