5.15 Transducer - naver/lispe GitHub Wiki

Transducer: lispe_transducer

This object handles lexicons. It supports lookup, lookdown, and parsing a sentence against the lexicon content. Both lookup and parse can be combined with Edit Distance flags and a threshold.

It exposes the following methods:

(deflib transducer_flags () Display the edit distance flags)
(deflib transducer_factorize (trans) factorize the arcs and states of the automaton.)
(deflib transducer_parse (trans sentence (option 0) (threshold 0) (flags 0)) parse a sentence based on lexicon content)
(deflib transducer_lookdown (trans word (lemma 0)) look down for the surface form matching a word+pos+features. lemma is optional: if set to 1 or 2, then the string to look for is only a lemma. If set to 2, it also returns the features with the surface form)
(deflib transducer_lookup (trans word (threshold 0) (flags 0)) lookup for a word with a threshold and flags)
(deflib transducer_compilergx (trans (regular nil) (vector nil) (name nil)) Compile a regular expression combined with a vector and store it with name. If no parameters compile the 'regular' expressions for numbers.)
(deflib transducer_store (trans name (normalize true) (latintable 1)) store an automaton)
(deflib transducer_build (trans inputfile outputfile (normalize true) (latintable 1)) Build a transducer file out of a text file containing on the first line surface form, then on next line lemma+features)
(deflib transducer_add (trans value (normalize true) (latintable 1)) add a dictionary to the automaton)
(deflib transducer_load (trans filename) Load a transducer)
(deflib transducer ((filename)) create a transducer object)

Edit Distance

The Edit Distance flags are:

  • a_first: Allow flags to apply to the first character of the string
  • a_change: Allow a character to be changed
  • a_delete: Allow a character to be deleted
  • a_insert: Allow the insertion of a character
  • a_switch: Allow two consecutive characters to be switched
  • a_nocase: Case is not taken into account
  • a_repetition: Allow characters to be repeated in sequence
  • a_vowel: Allow matching against accented vowels
  • a_surface: Return the surface form in lookdown
  • a_longest: longest match
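
As a minimal sketch of how these flags combine, the lines below build a flag value once and reuse it for a lookup. The lexicon file english.tra and the misspelled word are assumptions made for illustration; only functions listed above are used.

```lisp
(use 'lispe_transducer)

; Assumption: english.tra is a compiled lexicon stored next to this program
(setq tr (transducer (+ _current "english.tra")))

; Flags are binary values: | merges them into a single value
; Here: ignore case and allow two consecutive characters to be switched
(setq flags (| a_nocase a_switch))

; Threshold 1, combined flags; "chekc" is a deliberately misspelled input
(println (transducer_lookup tr "chekc" 1 flags))
```

Keeping the combined value in a variable makes it easy to apply the same tolerance to both transducer_lookup and transducer_parse.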

transducer_parse

The options in transducer_parse are the following:

0: only surface and lemma when it is different
1: only surface and lemma when it is different with offsets
2: surface, lemma and features
3: surface, lemma, features and offsets
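
The option can be followed by a threshold and flags, as the signature of transducer_parse above indicates. The following is a sketch under assumptions: the file english.tra and the misspelled sentence are hypothetical, and the output depends on the lexicon content.

```lisp
(use 'lispe_transducer)

; Assumption: english.tra is a compiled English lexicon stored next to this program
(setq tr (transducer (+ _current "english.tra")))

; "drinkng" is deliberately misspelled
(setq sentence "the boy is drinkng some water")

; option 3: surface, lemma, features and offsets
; threshold 1, with character insertion and deletion both allowed
(println (transducer_parse tr sentence 3 1 (| a_insert a_delete)))
```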

Creating a lexicon

The input compiled into a lexicon, whether through transducer_build or through transducer_add, has a similar structure. In the case of a file, the first line should be a surface form, and the next line should be a lemma followed by its features, separated by a tab character (\t). This two-line pattern then repeats for each entry:

classes
class\t+Plural+Noun
class
class\t+Singular+Noun
etc.

The function transducer_build takes such a file as input and generates a file that contains the corresponding transducer. The two other parameters, normalize and latintable, only come into play when processing a word or a text.

a) Normalization means that the lexicon can match words without being case sensitive. Hence, this lexicon will recognize CLASS as a word.

b) The system has been implemented to recognize words in UTF8 encoding (the transducers themselves are stored in Unicode). However, it is possible to tell the system how to take Latin encodings into account through the latintable parameter. For instance, 5 corresponds to Latin 5, the Cyrillic character table. The default value is 1, Latin 1.
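
A possible build step might look like the sketch below. The file names words.txt and words.tra are hypothetical, and words.txt is assumed to already exist in the two-line format described above. Reloading the result into a fresh object is a cautious choice, since this page does not say whether transducer_build also loads the transducer into trans.

```lisp
(use 'lispe_transducer)

; Hypothetical file names, located next to this program
; words.txt alternates surface-form lines with lemma<TAB>features lines
(setq source (+ _current "words.txt"))
(setq target (+ _current "words.tra"))

(setq tr (transducer))

; normalize = true and latintable = 1 are the documented defaults, spelled out here
(transducer_build tr source target true 1)

; Load the compiled file into a new object and query it
(setq lex (transducer target))
(println (transducer_lookup lex "classes"))
```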

transducer_add

This method expects its argument to be a map in which each key is a surface form and each value is the lemma with its features. Since a key can appear only once in a map, this format may be a problem for storing ambiguous words.
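
A minimal sketch of this call, reusing the two entries from the file example above. The output name small.tra is hypothetical, and the sketch stores and reloads the automaton before querying it because this page does not state whether entries are searchable immediately after transducer_add.

```lisp
(use 'lispe_transducer)

(setq tr (transducer))

; key: surface form, value: lemma + tab + features
(setq entries {"classes":"class\t+Plural+Noun" "class":"class\t+Singular+Noun"})

; normalize and latintable keep their default values
(transducer_add tr entries)

; Hypothetical output file
(setq target (+ _current "small.tra"))
(transducer_store tr target)

; Reload and query
(setq lex (transducer target))
(println (transducer_lookup lex "classes"))
```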

Regular expressions

The regular expressions processed by transducer are very limited:

  1. %c: defines a character, c is a UTF8 character ...
  2. $.. : defines a string
  3. u-u: defines an interval between two Unicode characters
  4. [..]: defines a sequence of characters
  5. {...}: defines a disjunction of strings
  6. .+: structure should occur at least once.
  7. (..): defines an optional structure
  8. !n: inserts the feature structure found at position n in the feature list (n>=1).

Example

; This regular expression recognizes Roman numerals

; Features are provided as a list
; !1 indicates which element in the list to pick up: here the first one
(transducer_compilergx trans "{DMCLXVI}+!1" '("\t+Rom"))

Example

Here is an example of how to load a transducer and apply it to words and sentences:

(use 'lispe_transducer)

; We load our transducer: _current is the path to this Lisp program
(setq tr (transducer (+ _current "english.tra")))

; A lookup on "check": no flags, no threshold
(println (transducer_lookup tr "check"))

; These flags are actually binary values that we combine with the | operator
; the threshold is: 1, no more than one modification to the whole string
(println (transducer_lookup tr "chack" 1 (| a_first a_change)))

; Different outputs according to the option flag

(println (transducer_parse tr "the boy is drinking some water"))
(println (transducer_parse tr "the boy is drinking some water" 1))
(println (transducer_parse tr "the boy is drinking some water" 2))
(println (transducer_parse tr "the boy is drinking some water" 3))
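
The opposite direction, from a lemma to its surface forms, can be sketched with transducer_lookdown. The lemma "drink" is an assumption about what english.tra contains. Only the lemma-only modes (1 and 2) are shown, since this page does not spell out the exact format of a full word+pos+features query.

```lisp
(use 'lispe_transducer)

; Assumption: english.tra is a compiled lexicon stored next to this program
(setq tr (transducer (+ _current "english.tra")))

; lemma = 1: the string to look for is only a lemma
(println (transducer_lookdown tr "drink" 1))

; lemma = 2: same query, the features are returned with each surface form
(println (transducer_lookdown tr "drink" 2))
```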

Pattern Programming and Lexicons

Here is an actual example (in dialogues.lisp) of how pattern programming can be combined with lexicon transducers:

(use 'lispe_transducer)

; We load our lexicon in a transducer object: english
(setq english (transducer (+ _current "english.tra")))

; This is a dictionary
; note $:z, which acts as a "rest of dictionary" operator
; z contains the rest of the dictionary
(defpat traversing({k:l $:z} result)
   (traversing l result)
   (traversing z result)
)

; This is a list
(defpat traversing ((x $ r) result)
   (traversing x result)
   (traversing r result)
)

; This is a string, we want to extract all prepositions in sentences
; of at least 20 characters...
(defpat traversing ((string_ (< 20 (size sentence))) result)
   ;we parse our sentence with our lexicon
   (setq words (transducer_parse english sentence 2))
   (scanning words result)
)

; the fallback function, when nothing else matches
(defpat traversing (_ _) nil)

; A preposition
(defun isprep (x)
   (loop e x
      (check (in e "+Prep")
         (return true)
      )
   )
)

; scanning all words from the parse list
(defun scanning(parse result)
   (loop x parse
      (check (isprep x)
         (push result (at x 0))
      )
   )
)

(setq result ())

; We load our JSON structure and start processing it
(traversing (json_read (+ _current "dialogue.json")) result)
(println (unique result))