6.14 Text Generation LispE with a Grammar - naver/lispe GitHub Wiki

Old school text generation

Version française

The method we will present here belongs to the old school, the one before GPT3 and other language models. But sometimes, if you don't have a dozen GPUs at hand and a good terabyte of hard disk, well, the normal configuration of a computer scientist today... Ah, this is not your case... I didn't know... Sorry...

So there are some traditional methods to achieve similar results. Well, it won't write your year-end report...

I'm going to present you one that I've concocted with the language that has been developed here at NLE during the last two years: LispE.

The grammar

In the case of an old-fashioned generation, you need a generation grammar.

For example:

  S : NP VP
  PP : PREP NNP
  NP : DET NOUN
  NP : DET NOUN PP
  NP : DET ADJ NOUN
  NNP : DET NLOC
  NNP : DET ADJ NLOC
  VP : VERB NP
  DET : "the" "a" "this" "that
  NOUN : "cat" "dog" "bat" "man" "woman" "child" "puppy
  NLOC : "house" "barn" "flat" "city" "country
  VERB: "eats" "chases" "dicks" "sees"
  PREP: "of" "from" "in"
  ADJ: "big" "small" "blue" "red" "yellow" "petite"

There is also a grammar for French, but it is admittedly a bit complicated to read, especially because of the agreement rules.

Compile this thing

This grammar is rather simple to read. We start with a sentence node "S", which is composed of a nominal group and a verbal group. The rules that follow give the different forms that each of these groups can take. Thus a nominal group: NNP can be broken down into a determiner followed by an adjective and a noun.

The compilation of this grammar consists in creating a large dictionary indexed on the left parts of these rules:

{
   %ADJ:("big" "small" "blue" "red" "yellow" "petite")
   %DET:("the" "a" "this" "that")
   %NLOC:("house" "barn" "flat" "city" "country")
   %NOUN:("cat" "dog" "bat" "man" "woman" "child" "puppy")
   %PREP:("of" "from" "in")
   %VERB:("eats" "chases" "bites" "sees")
   ADJ:"%ADJ"
   DET:"%DET"
   NLOC:"%NLOC"
   NNP:(("DET" "NLOC") ("DET" "ADJ" "NLOC"))
   NOUN:"%NOUN"
   NP:(
      ("DET" "NOUN")
      ("DET" "NOUN" "PP")
      ("DET" "ADJ" "NOUN")
   )
   PP:(("PREP" "NNP"))
   PREP:"%PREP"
   S:(("NP" "VP"))
   VERB:"%VERB"
   VP:(("VERB" "NP"))
}

Some lines are simple copy/paste of the rules above, except for the lexical rules which are preceded by a "%". The goal is to be able to differentiate between applying a rule and generating words.

Analyze and generate with the same grammar

This is certainly the nice thing about the approach we propose here.

We will use this grammar in both directions, which means that we can feed it a piece of sentence and let it finish.

For example, if we start with: a cat, it can then propose its own continuations.

Note that here, the continuations will draw random words from the word lists. This can result in completely ridiculous sentences... or not.

The first step

The user provides the beginning of a sentence, but also, and this is fundamental, the initial symbol corresponding to what (s)he wants to produce.

This symbol is an entry point in our grammar. We will choose: S.

In other words, we will ask the system to produce a sentence.

In the first step we have two lists in parallel:

   Words   Categories
("a "cat")  ("S")

The replacement

S is an entry point in the grammar whose value is: ("NP" "VP")

So we replace the structure above to reflect this possibility.

  Words     Categories
("a "cat") ("NP" "VP")

The head of the category list is now: NP.

Since there are several possible rules for NP, we'll just loop around to find the one that covers our list of words:

  Words        Categories
("a "cat") ("DET" "Noun" "VP")

Now our head is DET which points to a lexical item. We just have to check that "a" belongs to the list associated with "DET".

This is the case, we can then eliminate elements from both lists:

  Words  Categories
("cat") ("Noun" "VP")

We can do the same operation for "Noun", the word list is then empty.

Words Categories
()     ("VP")

We then switch to the generation mode.

Generation

VP returns a list with only one element: ("Verb" "NP")

     Categories             Words
  ("Verb" "NP")          ("a" "cat")

Note that "Generated" contains as initial value the words coming from our sentence.

Since Verb is a lexical item, we draw a word at random from our list of verbs:

     Categories             Words
      ("NP")         ("a "cat" "chases")

We then draw a rule at random from those associated with NP:

     Categories             Words
("Det" "Adj" "Noun")    ("a "cat" "chases")

The job is now very simple, just draw a determiner, an adjective and a noun at random from their respective list:

     Categories                Words
        ()         ("a "cat" "chases" "a" "big" "dog")

Since the list of categories is now empty we stop there and returns our sentence.

Implementation detail in LispE

If you take a quick look at the code of the parser, you will observe the presence of two functions: match and generate. These functions are based on the extensive use of defpat, the pattern programming functions in LispE.

match

match is used to check if the words in a sentence can be parsed by the grammar. The conditions for match to succeed are twofold:

  • Either the word list and the category list are empty
  • Either the word list is empty and the system continues in generation mode on the remaining categories

; We have used up all our words and categories
; No need to go further
(defpat match ([] [] consume) 
   (nconcn consume "$$") 
)

; We stop and generate, the word list is empty
(defpat match ( current_pos [] consume)   
   (generate current_pos consume)
)

; We check the rule associated to the leading category
; consp checks if an object is a list. If it is not the case, it is a lexical rule.
; If not, we loop over the possible rules. 
(defpat match ( [POS $ current_pos] [w $ sentence] consume)
   (setq rule (key grammar POS))
   (if (consp rule) ; if it is a group of rules, we loop to find the right one
      (loop r rule
         (setq poslst (match (nconcn r current_pos) (cons w sentence) consume)
         (if poslst
            (return poslst) ; we find one we stop
         )
      )
      (if (in (key grammar rule) w) ; otherwise it is a lexical rule and we check if the current word is part of it
         (match current_pos sentence (nconcn consume w))
      )
   )
)

generate

Generation is the final step. Thanks to pattern programming, this operation is reduced to two functions.

; Generating a word
; We are looking for a rule
; This one is either a normal rule (consp) or a lexical rule
(defpat generate([POS $ current_pos] tree)
   (setq r (key grammar POS))
   (if (consp r)
      ; here places the categories of a randomly drawn rule on top
      (generate (nconcn (random_choice 1 r 30) current_pos) tree) 
      ; here we add a word drawn at random
      (generate current_pos (nconc tree (random_choice 1 (key grammar r) 30)) 
   )
)  

; There are no more categories available, we place an end-of-sequence symbol to indicate that 
; all was generated
(defpat generate ([] tree) (nconc tree "%%") )

Conclusion

For those who have already had the opportunity to work with Prolog, this way of designing a program should seem very familiar. For others, this way of programming may seem rather confusing. The use of a pattern to distinguish different functions with the same name but different arguments is called "polymorphism". This kind of operation is also available in C++:

    Element* provideString(wstring& c);
    Element* provideString(string& c);
    Element* provideString(wchar_t c);
    Element* provideString(u_uchar c);

For example, these lines of code come from the interpreter LispE itself.

What distinguishes defpat here from the example above, however, is the richness and complexity of the patterns that can be dynamically used to parse a list of words and categories. Instead of a static compiled call, we have here a very flexible method that allows us to concentrate on the code specific to the detected pattern.

In particular, this method allows tree or graph traversal without the programmer ever getting lost in the tangle of special cases. If the list of elements evolves, it is often enough to add an additional function to take these new elements into account without redesigning the rest of the program.