The ppm star Package - mtpearce/idyom GitHub Wiki

This package implements the Prediction by Partial Match* statistical model (originally developed for lossless data compression). The package has its own simple interface and can be used independently of the other IDyOM machinery built on top of it.

ppm:build-model

(defun build-model (sequences alphabet)
  "Train a PPM* model on <sequences> composed from the specified <alphabet>."
  (let ((model (ppm:make-ppm alphabet :escape :c :mixtures t
                             :update-exclusion nil :order-bound nil)))
    (ppm:model-dataset model sequences :construct? t :predict? nil)
    model))

This function trains a PPM model on the supplied sequences composed from the supplied alphabet. For example:

CL-USER> (ppm:build-model '((a b r a c a d a b r a)) '(a b c d r))
#<PPM-STAR:PPM {1005B17243}>
CL-USER> 

ppm:ppm-predict

Here is a simple use of the package to generate probabilities for each element of each sequence in a list of sequences composed of some alphabet of symbols:

CL-USER> 
(defun predict (sequences alphabet) 
  (let ((model (ppm:make-ppm alphabet :escape :c :mixtures t 
                             :update-exclusion nil :order-bound nil)))
    (ppm:model-dataset model sequences :construct? t :predict? t)))
PREDICT
CL-USER> (predict '((a b a b a b a b)) '(a b))
((0 (A ((A 0.5) (B 0.5))) (B ((A 0.75) (B 0.25))) (A ((A 0.5) (B 0.5)))
  (B ((A 0.36363637) (B 0.6363636))) (A ((A 0.6666667) (B 0.33333334)))
  (B ((A 0.35714284) (B 0.64285713))) (A ((A 0.6666667) (B 0.33333334)))
  (B ((A 0.3529412) (B 0.6470588)))))
CL-USER> 

The output is a list of lists, one for each sequence in the list supplied to the function. Each of these lists is itself a list, composed of the symbol that appeared at that position in the sequence and a probability distribution reflecting the models predictions for that position. The probability of the symbol appearing at that location can be obtained by looking up the symbol in the probability distribution or one can use the distribution to compute the entropy (uncertainty) of the model's prediction at that location.

An extended version of this simple function appears in ppm-ui.lisp as ppm:ppm-predict.

ppm:ngram-frequencies

The code in ppm-ui.lisp also shows another example application: ppm:ngram-frequencies, which returns the n-gram counts of a given order (n) in a list of sequences composed from symbols in the supplied alphabet. Here are some examples:

CL-USER> (ppm:ngram-frequencies '((a b r a c a d a b r a)) '(a b c d r) 3)           
(((D A B) 1) ((C A D) 1) ((R A C) 1) ((B R A) 2) ((A D A) 1) ((A C A) 1)
 ((A B R) 2))
CL-USER> (ppm:ngram-frequencies '((l e t l e t t e r t e l e)) '(e l t r) 1)                     
(((R) 1) ((T) 4) ((E) 5) ((L) 3))
CL-USER> (ppm:ngram-frequencies '((a g c g a c g a g)) '(a c g) 2)
(((C G) 2) ((G A) 2) ((G C) 1) ((A C) 1) ((A G) 2))
CL-USER> (ppm:ngram-frequencies '((m i s s i s s i p p i)) '(i m p s) 4)
(((S I P P) 1) ((S I S S) 1) ((S S I P) 1) ((S S I S) 1) ((I P P I) 1)
 ((I S S I) 3) ((M I S S) 1))
CL-USER> (ppm:ngram-frequencies '((a s s a n i s s i m a s s a)) '(a i m n s) 2)
(((M A) 1) ((I M) 1) ((I S) 1) ((N I) 1) ((S I) 1) ((S A) 2) ((S S) 3)
 ((A N) 1) ((A S) 2))
CL-USER> (ppm:ngram-frequencies '((a s s a n i s s i m a s s a)
                                 (m i s s i s s i p p i))
                               '(a s n i m p)
                               2)
(((P I) 1) ((P P) 1) ((M I) 1) ((M A) 1) ((I P) 1) ((I M) 1) ((I S) 3)
 ((N I) 1) ((S I) 3) ((S A) 2) ((S S) 5) ((A N) 1) ((A S) 2))
CL-USER> (ppm:ngram-frequencies '((a b r a c a d a b r a)
                                 (l e t l e t t e r t e l e)
                                 (a s s a n i s s i m a s s a)
                                 (m i s s i s s i p p i)
                                 (w o o l o o b o o l o o))
                               '(a b c d e i l m n o p r s t w)
                               3)
(((O B O) 1) ((O L O) 2) ((O O B) 1) ((O O L) 2) ((W O O) 1) ((P P I) 1)
 ((M I S) 1) ((M A S) 1) ((I P P) 1) ((I M A) 1) ((I S S) 3) ((N I S) 1)
 ((S I P) 1) ((S I S) 1) ((S I M) 1) ((S A N) 1) ((S S I) 3) ((S S A) 2)
 ((T E L) 1) ((T E R) 1) ((T T E) 1) ((T L E) 1) ((E L E) 1) ((E R T) 1)
 ((E T T) 1) ((E T L) 1) ((L O O) 2) ((L E T) 2) ((D A B) 1) ((C A D) 1)
 ((R T E) 1) ((R A C) 1) ((B O O) 1) ((B R A) 2) ((A N I) 1) ((A S S) 2)
 ((A D A) 1) ((A C A) 1) ((A B R) 2))
CL-USER> 

ppm:write-model-to-postscript

There is also a function to write the model to a postscript representation of a suffix tree:

CL-USER> (ppm-star:write-model-to-postscript 
          (ppm-star:build-model '((a b r a c a d a b r a)) '(a b c d r))
          "/tmp/ppm.ps")
NIL
⚠️ **GitHub.com Fallback** ⚠️