
Frequently Asked Questions

What is JesterJ?

JesterJ is a system for loading data into search engines. It can be used as a general-purpose ETL platform, but its primary focus is making it quick and easy to work with data in a format appropriate for search platforms. Specifically, the atomic unit of data is a "document" object that maps each key to multiple values (like a MultiMap). This differs from traditional ETL systems, which focus on databases, conceptualize the in-flight data as a row with columns (keys), and support only a single value per row/column.
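To make the distinction concrete, here is a minimal sketch in plain Java (this is an illustration, not JesterJ's actual Document API) contrasting a row-shaped record with a multimap-shaped document:

```java
import java.util.List;
import java.util.Map;

public class DocumentVsRow {
    public static void main(String[] args) {
        // A traditional ETL "row": exactly one value per column.
        Map<String, String> row = Map.of(
            "id", "book-42",
            "author", "First Author Only");

        // A search-style "document": a key may hold many values,
        // which is natural for multi-valued search fields.
        Map<String, List<String>> document = Map.of(
            "id", List.of("book-42"),
            "author", List.of("First Author", "Second Author"),
            "category", List.of("fiction", "fantasy", "bestseller"));

        System.out.println(row.get("author"));      // one value
        System.out.println(document.get("author")); // all values
    }
}
```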

Is it stable?

Version 1.0beta3 is the first "stable" version. It should receive only bug fixes (no new features) before being released as 1.0. Prior versions were not really stable enough for general usage.

How much data can it handle?

How much time do you have? As of April 2023 the code has demonstrated stability across millions of documents. Taking the arbitrary standard that an ingest running over 12 hours is unacceptable, JesterJ is recommended for any corpus up to about 30 million documents. Recent observations indicate that a plan with one input and one output can sustain roughly 800 one-kilobyte documents per second. This is admittedly not very fast, but it was not the processing of documents within JesterJ that was the bottleneck: performance currently appears to be limited by Cassandra write times (possibly due to an index on one table that perhaps can be eliminated). This also means that increasing the number of destinations, or the number of distinct paths through your plan, will work but will slow things down; the same plan with 10 send-to-Solr steps instead of 1 unfortunately dropped to 250-300 documents per second. 1.0 is the "make it work" release, and future releases will seek to improve performance.
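As a sanity check, the 30-million figure follows directly from the observed single-destination rate and the 12-hour standard:

```
800 docs/sec × 12 hours × 3,600 sec/hour ≈ 34.5 million documents
```

so a corpus of up to roughly 30 million documents fits inside the 12-hour window with a little headroom.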

Aren't there Spark/Hadoop based solutions that scale really well?

Yes. These types of systems have awesome potential, but they tend to require a LOT of setup time: set up a Spark cluster, set up a ZooKeeper cluster, write custom code to run inside Spark against a Spark RDD. Spark systems are quite good once one invests in them, but many companies can't afford to spend tens of thousands of dollars just to get data flowing. Hadoop and MapReduce systems tend to be very high bandwidth and very high latency, so they are only appropriate for a search index that doesn't need to show new data in a timely fashion. JesterJ is meant to be very, very easy to start using, and also to scale well into large production usage. Handling very large data sets or extreme throughput may still require Hadoop- or Spark-based solutions, but the other 95% of projects should be able to use JesterJ.

What about Data Import Handler?

Solr's Data Import Handler is very good if you simply want to reflect exactly what is in your database in your search engine. However, if you want to combine data from multiple tables, you quickly find yourself writing increasingly complex queries, and any data enrichment (such as geo-locating) has to be performed separately, with the intermediate result stored back in your database (or a secondary database). Certainly this can be done, and if database ETL expertise is already available in house it may be a reasonable solution, but such systems often devolve into multiple disconnected steps with temporary tables, and in that situation fault tolerance is nearly impossible. With JesterJ your search technologist does not need to be an expert DBA too: they can massage the data in the most common language for search work (Java), and the entire process can be maintained in a single location. As noted above, classic ETL systems become inefficient when they need to handle multi-valued fields, which usually have to be expressed as many rows and aggregated at the last second in custom code; handling several multi-valued fields for the same document can get really complicated, as the sketch below illustrates.
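To make that concrete, here is an illustrative sketch in plain Java (not JesterJ or DIH code; the table and field names are invented) of the last-second aggregation a row-oriented pipeline forces on you when one book has several authors and several categories:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RowAggregation {
    public static void main(String[] args) {
        // What a JOIN across book -> author and book -> category returns:
        // two multi-valued fields explode into a cross product of rows.
        List<String[]> rows = List.of(
            new String[]{"book-42", "First Author",  "fiction"},
            new String[]{"book-42", "First Author",  "fantasy"},
            new String[]{"book-42", "Second Author", "fiction"},
            new String[]{"book-42", "Second Author", "fantasy"});

        // The "custom code at the last second": collapse the rows back
        // into one document with de-duplicated multi-valued fields.
        Map<String, Set<String>> doc = new LinkedHashMap<>();
        for (String[] r : rows) {
            doc.computeIfAbsent("id", k -> new LinkedHashSet<>()).add(r[0]);
            doc.computeIfAbsent("author", k -> new LinkedHashSet<>()).add(r[1]);
            doc.computeIfAbsent("category", k -> new LinkedHashSet<>()).add(r[2]);
        }
        System.out.println(doc);
        // {id=[book-42], author=[First Author, Second Author],
        //  category=[fiction, fantasy]}
    }
}
```

With JesterJ's multimap-style documents, this reassembly step disappears because a field was never restricted to one value per row in the first place.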

Does it work with Apache Solr?

Yes! Solr is the platform the project maintainer uses most, so this is the best-tested integration.

Does it work with Elastic?

Unfortunately, not any more :( Support was added initially, but no users are known to be using it, and the Elastic dev team has a tendency to break backwards compatibility. Its old notions of allowable Lucene versions were holding back Solr-related advancements (for which there were users and paying customers). An upgrade was attempted, but too many incompatibilities with prior APIs, plus new dependency-verification code added by the Elastic dev team, made the upgrade too difficult. If you're interested in using JesterJ with Elastic, we've got a ticket you can work on to get it going again here: https://github.com/nsoft/jesterj/issues/124

Does it do... X ?

Possibly. There is a growing number of standard implementations for things like renaming fields, scanning rows in a database, or sending documents to Solr, but of course there's always a task you need that we haven't added. The other major goal of JesterJ is to hide the complexity of the infrastructure and make writing custom tasks very, very simple. Simply write a Java class that implements the Processor interface (one method that takes a Document and returns an array of Documents) and you're off and running, as the sketch below shows.
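For illustration, here is a minimal sketch of such a class. The package name, the exact method signatures, and the Document accessor used are assumptions based on the description above; consult the JesterJ javadoc and bundled examples for the authoritative API:

```java
// Sketch only: package, method names, and the Document.put accessor are
// assumptions based on the FAQ's description; check the JesterJ javadoc.
import org.jesterj.ingest.model.Document;
import org.jesterj.ingest.model.Processor;

public class AddIndexedDateProcessor implements Processor {

  @Override
  public String getName() {
    return "add-indexed-date";
  }

  // The one required method: take a Document, return an array of Documents.
  // Returning several elements would presumably split the document into
  // multiple downstream documents.
  @Override
  public Document[] processDocument(Document document) {
    // Documents are multimap-like, so adding a value to a field is just
    // another put alongside any existing values for that key.
    document.put("indexed_date", java.time.Instant.now().toString());
    return new Document[] { document };
  }
}
```

A processor like this would then be wired into a step in your plan definition alongside the standard implementations.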

Does it contain cryptographic software subject to export control?

Yes. No software written for this project falls into that category, but the distributed bundles ending in -node.jar contain all dependencies, including Apache Tika and software from BouncyCastle.org. Please see the front page (Home) of this wiki for more details and links to information about the bundled software.