Release Notes Heritrix 3.4.0 20200304
Summary of changes since Heritrix 3.4.0-20190418. See the full changelog for more details.
This release updates the Berkeley Database from the very old version 4.1.6 to version 7.5.11. This resolves a long-standing bug when recovering from checkpoints multiple times, but it also means that Heritrix state files from previous versions are not compatible with this version. In other words:
Any crawl state folders from previous versions of Heritrix are not compatible with this version! You can only use this new release with new crawls!
Additions
- ExtractorYoutubeDL enables the discovery of video URLs using the external tool youtube-dl. #257 (nlevitt)
- WARC writing is now configurable with its own processor chain making it easier to write extra records. #285 (nlevitt)
- MatchesListDecideRule gained a timeoutPerRegexSeconds option to help debug runaway regular expressions. #290 (csrster)
- Added support for forced queue assignment and parallel queues. #286 (adam-miller)
- JDK 11 is now supported. #269-#273 (ato)
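
To illustrate why a per-regex timeout such as timeoutPerRegexSeconds helps with runaway regular expressions, here is a minimal sketch of one general way to bound regex matching time in plain Java. It is not Heritrix's implementation: the class `RegexTimeoutSketch`, the helper `matchesWithTimeout`, and the wrapper class below are hypothetical names invented for this example, and the approach (running the match on a worker thread and interrupting it through a `CharSequence` wrapper) is an assumption about how such a limit could be built.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical example class; not part of Heritrix.
class RegexTimeoutSketch {

    /** Wraps the input so the matcher aborts once its thread is interrupted. */
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads characters through this method, so a
            // runaway match notices the interrupt on its next character read.
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException("regex matching interrupted");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns true only if the whole input matches within the time limit. */
    static boolean matchesWithTimeout(Pattern pattern, String input, long timeoutSeconds) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(
                () -> pattern.matcher(new InterruptibleCharSequence(input)).matches());
        try {
            return result.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Logging the offending pattern is what makes the limit useful for debugging.
            System.err.println("Regex timed out after " + timeoutSeconds + "s: " + pattern.pattern());
            result.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A nested quantifier like this can backtrack heavily on non-matching input.
        Pattern risky = Pattern.compile("(a+)+b");
        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
        System.out.println(matchesWithTimeout(risky, input, 1));
    }
}
```

In this sketch a timed-out pattern is treated as a non-match and its source is logged, so a slow expression in a long list can be identified instead of stalling the calling thread indefinitely.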
Changes
- BDB was upgraded to 7.5.11. See the warning at the top. #281 (anjackson)
- Heritrix now uses Guava's bloom filter and base32 encoder. #300, #304 (hennekey)
Removals
- JDK 7 is no longer supported. #269-#273 (ato)
Bugfixes
- A number of performance and reliability improvements were made to the unit tests.