Installing Third Party Software - e5pecial/nltk GitHub Wiki
How NLTK Discovers Third Party Software
NLTK finds third-party software through environment variables or through path arguments passed in API calls. This page lists installation instructions and the associated environment variables.
Java
Java is not required by NLTK, but some third-party software depends on it. NLTK finds the `java` binary via the system `PATH` environment variable, or through `JAVAHOME` or `JAVA_HOME`.
To locate Java libraries (jar files), NLTK checks the Java `CLASSPATH` variable; in addition, each dependency usually has its own environment variables, which are searched as well.
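The lookup described above can be approximated with the standard library. This is a minimal sketch, not NLTK's actual implementation: the function name `find_java`, the search order, and the assumption that the executable sits in a `bin` subdirectory of `JAVAHOME`/`JAVA_HOME` are all illustrative.

```python
import os
import shutil


def find_java(environ=os.environ):
    """Return the path of a java executable, or None if none is found."""
    # 1. Look for "java" on the PATH taken from the given environment.
    found = shutil.which("java", path=environ.get("PATH", ""))
    if found:
        return found

    # 2. Fall back to JAVAHOME / JAVA_HOME (directory layout is an assumption).
    exe = "java.exe" if os.name == "nt" else "java"
    for var in ("JAVAHOME", "JAVA_HOME"):
        home = environ.get(var)
        if not home:
            continue
        for candidate in (os.path.join(home, "bin", exe), os.path.join(home, exe)):
            if os.path.isfile(candidate):
                return candidate
    return None
```

Running `find_java()` before creating a Java-backed wrapper is a quick way to tell whether a failure comes from the Java installation or from the third-party package itself.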
Windows
- Download and install the JDK from Java's official website: http://www.oracle.com/technetwork/java/javase/downloads/index.html?ssSourceSiteId=otnjp
Linux
It is best to use the package manager to install java.
Stanford Tagger, NER, Tokenizer and Parser.
To install:
- Make sure java is installed (version 1.8+)
- Download & extract the stanford tokenizer package (contains the stanford tagger): http://nlp.stanford.edu/software/lex-parser.shtml
- Download & extract the stanford NER package http://nlp.stanford.edu/software/CRF-NER.shtml
- Download & extract the stanford POS tagger package http://nlp.stanford.edu/software/tagger.shtml
- Download & extract the stanford Parser package: http://nlp.stanford.edu/software/lex-parser.shtml
- Add the directories containing `stanford-postagger.jar`, `stanford-ner.jar` and `stanford-parser.jar` to the `CLASSPATH` environment variable
- Point the `STANFORD_MODELS` environment variable to the directories containing the Stanford tokenizer, POS, NER and parser models (e.g. `arabic.tagger`, `arabic-train.tagger`, `chinese-distsim.tagger`, `stanford-parser-x.x.x-models.jar`, ...)
- e.g. `export STANFORD_MODELS=/usr/share/stanford-postagger-full-2015-01-30/models:/usr/share/stanford-ner-2015-04-20/classifier`
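Instead of exporting the variables in the shell, they can be set from Python before the Stanford wrappers are created. The following is a minimal sketch that reuses the install locations from the example above; `add_to_path_var` is a hypothetical helper, not part of NLTK.

```python
import os


def add_to_path_var(name, directories, environ=os.environ):
    """Append directories to a path-list variable such as CLASSPATH, skipping duplicates."""
    entries = [p for p in environ.get(name, "").split(os.pathsep) if p]
    for directory in directories:
        if directory not in entries:
            entries.append(directory)
    environ[name] = os.pathsep.join(entries)
    return environ[name]


# A plain dict is used here so the real environment is left untouched;
# omit the last argument to modify os.environ instead.
env = {}
add_to_path_var("CLASSPATH", [
    "/usr/share/stanford-postagger-full-2015-01-30",
    "/usr/share/stanford-ner-2015-04-20",
], env)
add_to_path_var("STANFORD_MODELS", [
    "/usr/share/stanford-postagger-full-2015-01-30/models",
    "/usr/share/stanford-ner-2015-04-20/classifier",
], env)
```

`os.pathsep` is `:` on Linux and macOS and `;` on Windows, so the same code builds a valid list on each platform.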
Tadm (Toolkit for Advanced Discriminative Modeling)
To install
- Download & compile TADM: http://tadm.sourceforge.net/
- Set the environment variable `TADM` to point to the tadm binaries directory.
Megam (MEGA Model Optimization Package)
To install
- Download & compile MEGAM's source: http://www.umiacs.umd.edu/~hal/megam/
- Set the environment variable `MEGAM` to point to the MEGAM directory.
- If using the MacPorts version of OCaml, modify the MEGAM Makefile to specify the following: `WITHCLIBS =-I /opt/local/lib/ocaml/caml` and `WITHSTR =str.cma -cclib -lcamlstr`
C&C Tools/Boxer
To install
- Checkout & compile the latest SVN revision http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion
- Set the environment variable `CANDC` to point to the C&C directory.
Prover9 & Mace4
To install
- Download & extract Prover9 & Mace4: http://www.cs.unm.edu/~mccune/mace4/
- Set the environment variable `PROVER9` to point to the binaries directory.
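TADM, MEGAM, C&C and Prover9 each rely on a single environment variable that points to a directory. A small diagnostic such as the following can confirm that those variables are set and refer to existing directories. It is only a sketch: `check_tool_dirs` is a hypothetical helper, and it does not verify that the binaries inside the directories actually work.

```python
import os

# Variable names taken from the sections above.
BINARY_DIR_VARS = ("TADM", "MEGAM", "CANDC", "PROVER9")


def check_tool_dirs(names=BINARY_DIR_VARS, environ=os.environ):
    """Map each variable name to 'unset', 'missing' or 'ok'."""
    report = {}
    for name in names:
        value = environ.get(name)
        if not value:
            report[name] = "unset"      # variable absent or empty
        elif os.path.isdir(value):
            report[name] = "ok"         # points to an existing directory
        else:
            report[name] = "missing"    # set, but the directory does not exist
    return report


if __name__ == "__main__":
    for name, status in check_tool_dirs().items():
        print(f"{name}: {status}")
```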
Malt Parser
To install
- Make sure java is installed
- Download & extract the Malt Parser: http://www.maltparser.org/download.html
- Set the environment variable `MALT_PARSER` to point to the MaltParser directory, e.g. `/home/user/maltparser-1.8/` in Linux.
- When using a pre-trained model, set the environment variable `MALT_MODEL` to point to the `.mco` file, e.g. `engmalt.linear-1.7.mco` from http://www.maltparser.org/mco/mco.html.
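The two variables can also be read in Python and handed to the parser explicitly. This is a hedged sketch: `malt_arguments` and `make_malt_parser` are hypothetical helpers, and the keyword names `parser_dirname` and `model_filename` are assumed from NLTK 3.x's `nltk.parse.malt.MaltParser`, so check them against the installed version.

```python
import os


def malt_arguments(environ=os.environ):
    """Collect MaltParser settings from the environment.

    model_filename is None when MALT_MODEL is unset, i.e. when no
    pre-trained model is being used.
    """
    return {
        "parser_dirname": environ.get("MALT_PARSER", ""),
        "model_filename": environ.get("MALT_MODEL"),
    }


def make_malt_parser(environ=os.environ):
    """Build a MaltParser from the environment (requires NLTK, MaltParser and Java)."""
    # Imported lazily so the helper above works without NLTK installed.
    from nltk.parse.malt import MaltParser  # keyword names below are assumptions
    return MaltParser(**malt_arguments(environ))
```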
Hunpos Tagger
To install
- Download & extract the hunpos tagger and a model file: https://code.google.com/p/hunpos/downloads/list
- Set the environment variable `HUNPOS_TAGGER` to point to the directory containing the `hunpos-tag` binary
- NLTK also searches for the model files using the same environment variable, so you can put the model file in the same location (NB: the model file path can also be passed to the `nltk.tag.hunpos.HunposTagger` class via the `path_to_model` argument)
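The model lookup described above can be illustrated as follows. This is a sketch of the idea rather than NLTK's own search code: `resolve_hunpos_model` and `make_hunpos_tagger` are hypothetical helpers, and any model file name passed to them is the caller's choice.

```python
import os


def resolve_hunpos_model(model_name, environ=os.environ):
    """Return a usable model path, or None if the model cannot be found.

    The name is tried as given first, then relative to $HUNPOS_TAGGER.
    """
    if os.path.isfile(model_name):
        return model_name
    base = environ.get("HUNPOS_TAGGER")
    if base:
        candidate = os.path.join(base, model_name)
        if os.path.isfile(candidate):
            return candidate
    return None


def make_hunpos_tagger(model_name):
    """Create a HunposTagger with an explicit model path (requires NLTK and hunpos)."""
    model_path = resolve_hunpos_model(model_name)
    if model_path is None:
        raise FileNotFoundError(f"hunpos model not found: {model_name}")
    # Imported lazily so the lookup helper works without NLTK installed.
    from nltk.tag.hunpos import HunposTagger
    return HunposTagger(path_to_model=model_path)
```

Resolving the path up front produces a plain `FileNotFoundError` that names the missing model, which is usually easier to act on than a failure inside the wrapper.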
Senna for Various NLP Tasks
To install
- Download & extract the Senna files: http://ml.nec-labs.com/senna/
- Set the environment variable `SENNA` to point to the senna directory. NLTK searches for the binary executable files via this environment variable, but the directory path can also be passed to the `nltk.tag.senna.SennaTagger` class via the `senna_path` argument.
CRFSuite for CRF Tagger
To install
- Download & compile: http://www.chokkan.org/software/crfsuite/
- Set the environment variable `CRFSUITE` to point to the directory containing `crfsuite` (for Linux) or `crfsuite.exe` (for Windows). NLTK searches for the binary executable files via this environment variable, but the executable file path can also be passed to the `nltk.tag.crfsuite.CRFTagger` class via the `file_path` argument.
REPP Tokenizer
To install
mkdir -p /path/to/where/you/wanna/save/repp
svn co http://svn.delph-in.net/repp/trunk /path/to/where/you/wanna/save/repp
cd /path/to/where/you/wanna/save/repp/
autoreconf -i
./configure CPPFLAGS=-P
make
- The installation instructions above have been tested on Linux and Mac OS. For more information, see http://moin.delph-in.net/ReppTop
- After installing, you can set the environment variable `REPP_TOKENIZER` to point to the directory containing the `repp` tokenizer, e.g. `/path/to/where/you/wanna/save/repp/`; you can then instantiate the tokenizer object without specifying any parameter, e.g. `tokenizer = nltk.tokenize.ReppTokenizer()`
- Alternatively, you can create the `ReppTokenizer` object directly by passing in the directory containing the `repp` tokenizer, without setting the environment variable, i.e. `tokenizer = nltk.tokenize.ReppTokenizer('/path/to/where/you/wanna/save/repp')`
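Before creating the tokenizer with an explicit directory, it can help to confirm that the build produced the files the wrapper needs. The sketch below makes an assumption worth verifying against your checkout: that a successful build leaves the executable at `src/repp` and the configuration at `erg/repp.set` inside the REPP directory. `missing_repp_files` and `make_repp_tokenizer` are hypothetical helpers.

```python
import os

# Relative paths assumed to exist after a successful build (an assumption,
# not something stated in the installation steps above).
EXPECTED_FILES = (
    os.path.join("src", "repp"),
    os.path.join("erg", "repp.set"),
)


def missing_repp_files(repp_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from repp_dir."""
    return [rel for rel in expected
            if not os.path.exists(os.path.join(repp_dir, rel))]


def make_repp_tokenizer(repp_dir):
    """Create a ReppTokenizer for repp_dir (requires NLTK and a built REPP)."""
    missing = missing_repp_files(repp_dir)
    if missing:
        raise FileNotFoundError(f"REPP build looks incomplete, missing: {missing}")
    # Imported lazily so the check above works without NLTK installed.
    from nltk.tokenize import ReppTokenizer
    return ReppTokenizer(repp_dir)
```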
If the `./configure CPPFLAGS=-P` step shows an error like this on Mac:
configure: error: required ICU library are missing
please install and link the ICU library (`brew install icu4c && brew link icu4c --force`) and then retry from the `./configure CPPFLAGS=-P` step. If for any reason you need to unlink icu4c, try: `brew unlink icu4c`.