Troubleshooting, testing and live coding - nathanmarz/cascalog GitHub Wiki
Troubleshooting
Catching data errors with traps
You can use Cascading traps with Cascalog to capture tuples whose processing fails. To store those tuples in a sink tap (for example a local file or an hfs-textline), use the :trap keyword with an error sink:
(def errors (lfs-textline "file:///tmp/people.bad_records" :sinkmode :replace))
;; or (stdout) or (hfs-textline "hdfs:///tmp/...") if running on Hadoop
(<- [?name ?age]
    (people ?name ?age)
    (:trap errors)
    (< ?age 40))
Testing
You can use the functions and macros from the cascalog.testing namespace together with clojure.test to test your queries. See Cascalog's own tests for examples. They use, for example, fact?- to execute a query and compare its output with the expected tuples, or a form such as (facts query => (produces [[3 10] [1 5] [5 11]])) where (def query (<- ...)). Read Sam Ritchie's blog post Cascalog Testing 2.0 for more details and examples of midje-cascalog 0.4.0.
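As a rough illustration of the clojure.test route, the sketch below checks a small query against expected tuples. It assumes a Cascalog 1.x setup in which cascalog.testing provides the test?<- macro (later versions may keep it in a different namespace); the namespace example.query-test, the people data and the test name are hypothetical.

```clojure
(ns example.query-test
  (:use clojure.test
        cascalog.api
        cascalog.testing))

;; Hypothetical in-memory source: a vector of tuples works as a generator.
(def people [["ben" 21] ["jim" 42]])

(deftest young-people-test
  ;; Assumption: test?<- builds the query from the output vars and predicates,
  ;; runs it, and compares the resulting tuples with the expected ones.
  (test?<- [["ben" 21]]
           [?name ?age]
           (people ?name ?age)
           (< ?age 40)))
```

Run it like any other clojure.test test, for example with (run-tests 'example.query-test) from the REPL.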
Live coding
There are certain features that support live, interactive coding:
- Use simple Clojure collections as data sources, for example (def people [["ben" 21] ["jim" 42]]).
- During development you can easily change parts of your Cascalog code into standard Clojure functions and call them from the REPL, for example by replacing (defaggregateop with (defn in a custom operator.
- Queries can of course be executed from the REPL.
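The defaggregateop-to-defn swap can be sketched in plain Clojure. The example assumes the aggregator body uses three arities (initial state, fold step, final result), which is the shape defaggregateop is understood to take here; the name sum-ages and the data are hypothetical.

```clojure
;; Hypothetical aggregator, written with defn instead of defaggregateop
;; so that it can be called directly from the REPL.
(defn sum-ages
  ([] 0)                        ; initial state
  ([total age] (+ total age))   ; fold one value into the state
  ([total] [total]))            ; final state -> output tuple

;; A plain Clojure collection standing in for a data source.
(def people [["ben" 21] ["jim" 42]])

;; Drive the three arities by hand, roughly as an aggregation would.
(def total  (reduce sum-ages (sum-ages) (map second people)))
(def result (sum-ages total))
```

Here total evaluates to 63 and result to [63]. Once the logic behaves as expected, changing defn back to defaggregateop turns it into an operator again.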
Logging with Log4j (local mode only)
When all the taps in a job are lfs-textlines or vectors (or stdout), you can run the -main in your jar directly using java -jar, instead of submitting it with hadoop jar. This is sometimes called local mode.
When your jobs are running in this local mode, you can have a lot of information logged with log4j just by putting a standard log4j.xml in the classpath root of your jar. Any exceptions thrown in jobs will be printed to the configured log file with their full stacktrace.