ctakes ytex - apache/ctakes GitHub Wiki
Contains the following components:
- Semantic Similarity - compute the similarity between concepts
- Word Sense Disambiguation - use semantic similarity to disambiguate terms that have multiple meanings
- Bag-of-Words exporter - export sparse matrices from a database
- YTEX Database setup scripts
- Java Data Loader - simple utility for loading delimited files into database tables
Getting started
- Create ytex.properties:
create src\main\resources\org\apache\ctakes\ytex\ytex.properties
(see examples in ctakes-ytex/src/main/resources/org/apache/ctakes/ytex) - Add jdbc drivers to maven repo: if you are using ms sql server / or oracle, add the jdbc driver(s) to your local maven repo (see comments in dependencies section of pom.xml)
- Add ms sql auth dlls to path: if you are using ms sql server with integrated auth, make sure the ms sql jdbc auth directory is in your system PATH (sqljdbc_4.0\enu\auth\x64)
- Unzip ctakes-ytex-resources.zip: extract to ctakes-ytex-res/src/main
- run the maven build from the command line for ctakes-ytex, ctakes-ytex-uima projects. From the ctakes root directory:
mvn -pl ctakes-ytex,ctakes-ytex-uima -DskipTests
- set up your database: Open a shell/command prompt, set the CTAKES_HOME variable to be the checkout directory of ctakes (i.e. the parent of this file's directory), and run the ant script:
Windows
set CTAKES_HOME=xxx
cd %CTAKES_HOME%\ctakes-ytex\scripts\data
ant -Dconfig.local=..\..\target\classes all > ..\..\target\build.out 2>&1
Linux
CTAKES_HOME=xxx
export CTAKES_HOME
cd ${CTAKES_HOME}/ctakes-ytex/scripts/data
ant -Dconfig.local=../../target/classes all > ../../target/build.out 2>&1
You should be all set now.
Developing
We have currently tested mysql and ms sql server, oracle is pending. HSQL is used only for unit testing.YTEX is driven by a property file, ytex.properties. The property file used when running ytex java programs (aside from junit tests) is src\main\resources\org\apache\ctakes\ytex\ytex.properties. Refer to the example properties files under /ctakes-ytex/src/main/resources/org/apache/ctakes/ytex
YTEX relies on some config files generated from templates via an ant build script. These config files are
generated automatically by the maven build (which runs the ant build script).
The maven eclipse integration is a bit buggy; you should run the maven build from the command line so that
things work correctly.
For development, you must set up a database. There are 2 types of database setups, depending on if you have a UMLS installation (and you've configured ytex to use it):
- With UMLS - YTEX will generate a dictionary lookup table from your UMLS database. This requires tokenizing every string in MRCONSO which will take a while. If you have UMLS installed and want to use it, configure the umls.schema/umls.catalog properties in ytex.properties.
- Without UMLS - YTEX will load a pre-fabricated dictionary lookup table generated from the UMLS 2013AA. This is included in the ctakes-resources zip from sourceforge. You have to copy v_snomed_fword_lookup.txt to ctakes-ytex\scripts\data\umls.
Testing
YTEX is dependent upon a database which is set up by the ant script scripts\data\build.xml
Note that all ytex-related tables in these databases will be dropped and recreated every time the maven build is run (don't put data you want to keep there).
Maven runs the ant script prior to running the tests. By default, we set up an hsqldb database for testing in the TEMP dir. Override this by:
- passing -Ddb.type=[mysql|mssql|orcl] to the maven command line. If you do that, we will use the \src\test\resources\org\apache\ctakes\ytex\ytex.${db.type}.properties which point to the ytex_test schema (mysql/orcl) or catalog (mssql).
- or dropping your own ytex.properties in \src\test\resources\org\apache\ctakes\ytex.
Creating the ctakes-ytex-resources.zip
See scripts/build-nonasf.xmlCollection Readers
Annotation Engines
Output Writers
Utilities
Read documents from a database.
Source class: DBCollectionReader
Source package: org.apache.ctakes.ytex.uima
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
No available configuration parameters.
Annotates Dates based upon whether or not text can be normalized to a date.
Source class: DateAnnotator
Source package: org.apache.ctakes.ytex.uima.annotators
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Base Token
No available configuration parameters.
Use regex to identify the Named Entities. Read the named entity regex - concept id map from the db.
Source class: NamedEntityRegexAnnotator
Source package: org.apache.ctakes.ytex.uima.annotators
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Identified Annotation
No available configuration parameters.
Use negex to assign polarity to Named Entities.
Source class: NegexAnnotator
Source package: org.apache.ctakes.ytex.uima.annotators
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Sentence, Identified Annotation
No available configuration parameters.
Writes XMI files with full representation of input text and all extracted information.
Source class: DBConsumer
Source package: org.apache.ctakes.ytex.uima.annotators
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Document Id
No available configuration parameters.
Create MedicationEventMention/EntityMention annotations for each set of CandidateConcept annotations that span the same text.
Source class: MetaMapToCTakesAnnotator
Source package: org.apache.ctakes.ytex.uima.annotators
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Products: Identified Annotation
No available configuration parameters.