Dev extractor guide - achakra/seck GitHub Wiki

About Extractor:

The Extractor processes the various document formats downloaded by the crawler, using the Apache Tika library. These include HTML, document, and media types.

Design Plan:

                  [ExtractorFactory]
                         |
                 [Abstract Extractor]   
                         |
        --------------------------------------
        |                |                   |
[HtmlExtractor]  [DocumentExtractor]   [MediaExtractor]
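
The hierarchy above could be sketched roughly as follows. This is only an illustration of the factory layout, not the project's code: every class name carries a hypothetical `Sketch` prefix, the `extract()` bodies are placeholders, and dispatching on the file extension is an assumption (the real factory may well detect types another way, for example through Tika).

```java
import java.io.File;
import java.util.Locale;

// Hypothetical stand-in for the abstract extractor in the diagram.
abstract class SketchExtractor {
    protected final File file;

    SketchExtractor(File file) {
        this.file = file;
    }

    // Placeholder: returns a label instead of real extracted content.
    abstract String extract();
}

class SketchHtmlExtractor extends SketchExtractor {
    SketchHtmlExtractor(File file) { super(file); }

    @Override
    String extract() { return "html:" + file.getName(); }
}

class SketchDocumentExtractor extends SketchExtractor {
    SketchDocumentExtractor(File file) { super(file); }

    @Override
    String extract() { return "document:" + file.getName(); }
}

class SketchMediaExtractor extends SketchExtractor {
    SketchMediaExtractor(File file) { super(file); }

    @Override
    String extract() { return "media:" + file.getName(); }
}

class SketchExtractorFactory {
    // Assumption: the concrete extractor is chosen from the file extension.
    SketchExtractor getExtractor(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1);
        switch (ext) {
            case "html":
            case "htm":
                return new SketchHtmlExtractor(file);
            case "jpg":
            case "png":
            case "bmp":
            case "mp3":
            case "wav":
                return new SketchMediaExtractor(file);
            default:
                // Everything else is treated as a document in this sketch.
                return new SketchDocumentExtractor(file);
        }
    }
}
```

Callers only ever see the abstract type, so `new SketchExtractorFactory().getExtractor(new File("test.docx")).extract()` works the same way for all three kinds of file.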

Sample code for extracting any of the above document types:

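  // Wrap the downloaded file; "test.docx" is only an example name.
  String fileName = "test.docx";
  File file = new File(fileName);

  // Ask the factory for an extractor for this file.
  ExtractorFactory efactory = new ExtractorFactory();
  Extractor ex = efactory.getExtractor(file);

  // Run the extraction and read the metadata from the result.
  Bloblet b = ex.extract();
  Metadata m = b.getMetadata();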

Html Extractor:

HtmlExtractor extracts HyperText Markup Language documents [text/html].

The class retrieves the HTML title, the text content, the link targets (href), and the images (img).
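
To make the four fields concrete, here is a minimal JDK-only sketch that pulls them out of an HTML string. It is not how HtmlExtractor is implemented: the class name `HtmlFieldsSketch` and its methods are hypothetical, and the regular expressions are a deliberately naive stand-in for a real HTML parser such as the one Tika provides (they only handle double-quoted attributes, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the fields an HTML extractor returns.
class HtmlFieldsSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Text of the first <title> element, or "" if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // All text with tags removed and whitespace collapsed (includes the title text).
    static String content(String html) {
        return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    // Values of href attributes on <a> tags, in document order.
    static List<String> hrefs(String html) {
        return firstGroups(HREF, html);
    }

    // Values of src attributes on <img> tags, in document order.
    static List<String> imgs(String html) {
        return firstGroups(IMG, html);
    }

    private static List<String> firstGroups(Pattern pattern, String html) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }
}
```

For production use a proper parser is the safer choice, since regular expressions break on malformed or unusual markup.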


Document Extractor:

DocumentExtractor extracts metadata for the following document types:

  • Microsoft Office document formats [application/msword, application/vnd.ms-powerpoint, application/vnd.visio, application/vnd.ms-outlook]
  • OpenDocument Format [application/vnd.oasis.opendocument.*]
  • Portable Document Format [application/pdf]
  • Rich Text Format [application/rtf]
  • Text formats [text/plain]
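
The list above can be read as a simple MIME type check. The sketch below shows one way such a check might look; the class `DocumentTypeSketch` and the method `isSupported` are hypothetical names, and whether the project performs the check like this is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a MIME type is in the document list.
class DocumentTypeSketch {
    // Types that must match exactly.
    private static final Set<String> EXACT = new HashSet<>(Arrays.asList(
            "application/msword",
            "application/vnd.ms-powerpoint",
            "application/vnd.visio",
            "application/vnd.ms-outlook",
            "application/pdf",
            "application/rtf",
            "text/plain"));

    // application/vnd.oasis.opendocument.* is matched by prefix.
    private static final String ODF_PREFIX = "application/vnd.oasis.opendocument.";

    static boolean isSupported(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return EXACT.contains(base) || base.startsWith(ODF_PREFIX);
    }
}
```

With this sketch, `isSupported("application/vnd.oasis.opendocument.text")` is accepted through the prefix rule, while `text/html` is rejected and left to the HTML extractor.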

Media Extractor:

MediaExtractor extracts metadata from images (.jpg, .png, .bmp) and audio files (.mp3, .wav) using the Apache Tika library. Examples of the metadata include image size and camera type. To use it, create a MediaExtractor object and call its extract method:

  MediaExtractor metadataExtractor = new MediaExtractor();
  metadataExtractor.extract(filename);
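
As an illustration of the kind of metadata involved, the following sketch reads the image size using only the JDK's `javax.imageio` package. It is not MediaExtractor's implementation and does not use Tika: `ImageSizeSketch`, `imageSize`, and `blankPng` are hypothetical names, and the map keys `width` and `height` are chosen for this example only.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.ImageIO;

// Hypothetical illustration: image size as simple key/value metadata.
class ImageSizeSketch {
    // Decodes the image bytes and reports the pixel dimensions.
    static Map<String, String> imageSize(byte[] encoded) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(encoded));
            Map<String, String> meta = new LinkedHashMap<>();
            if (image != null) { // null means no registered reader understood the bytes
                meta.put("width", Integer.toString(image.getWidth()));
                meta.put("height", Integer.toString(image.getHeight()));
            }
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a blank PNG in memory so the sketch can be tried without a file.
    static byte[] blankPng(int width, int height) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `ImageSizeSketch.imageSize(ImageSizeSketch.blankPng(4, 3))` yields a map with a width of 4 and a height of 3. Camera type and audio tags are not covered here; those come from the Tika-based extractor.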