Dev extractor guide - achakra/seck GitHub Wiki
About Extractor:
Extractor extracts content and metadata from the various document formats downloaded by the crawler, using the Apache Tika library. These include HTML, document, and media types.
Design Plan:
                  [ExtractorFactory]
                         |
                 [Abstract Extractor]   
                         |
        --------------------------------------
        |                |                   |
[HtmlExtractor]  [DocumentExtractor]   [MediaExtractor]
Sample code for extracting any of the above document types:
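  // The file to extract; any of the supported types can be used here.
  String fileName = "test.docx";
  File file = new File(fileName);
  // The factory picks the extractor that matches the file.
  ExtractorFactory efactory = new ExtractorFactory();
  Extractor ex = efactory.getExtractor(file);
  // extract() returns a Bloblet, which carries the extracted metadata.
  Bloblet b = ex.extract();
  Metadata m = b.getMetadata();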
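The design above can be illustrated with a minimal, self-contained sketch. It is not the seck implementation: every class name prefixed with Sketch is hypothetical, the dispatch on file extension is an assumption (the real ExtractorFactory may rely on Tika's type detection instead), and kind() stands in for the real extract(), which returns a Bloblet.

```java
import java.util.Locale;

// Hypothetical stand-in for ExtractorFactory. Assumption: the choice is made
// from the file extension; the real factory may detect the MIME type via Tika.
class SketchExtractorFactory {
    static SketchExtractor getExtractor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "html": case "htm":
                return new SketchHtmlExtractor(fileName);
            case "jpg": case "png": case "bmp": case "mp3": case "wav":
                return new SketchMediaExtractor(fileName);
            case "doc": case "docx": case "ppt": case "vsd": case "msg":
            case "odt": case "pdf": case "rtf": case "txt":
                return new SketchDocumentExtractor(fileName);
            default:
                throw new IllegalArgumentException("No extractor for: " + fileName);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"index.html", "test.docx", "photo.jpg"}) {
            System.out.println(name + " -> " + getExtractor(name).kind());
        }
    }
}

// Hypothetical stand-in for the abstract Extractor in the diagram.
abstract class SketchExtractor {
    protected final String fileName;

    SketchExtractor(String fileName) {
        this.fileName = fileName;
    }

    // Reports which branch of the hierarchy handles the file.
    abstract String kind();
}

class SketchHtmlExtractor extends SketchExtractor {
    SketchHtmlExtractor(String fileName) { super(fileName); }
    @Override String kind() { return "html"; }
}

class SketchDocumentExtractor extends SketchExtractor {
    SketchDocumentExtractor(String fileName) { super(fileName); }
    @Override String kind() { return "document"; }
}

class SketchMediaExtractor extends SketchExtractor {
    SketchMediaExtractor(String fileName) { super(fileName); }
    @Override String kind() { return "media"; }
}
```

Keeping the selection logic in the factory means callers only ever see the abstract type, so a new extractor can be added without touching calling code.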
Html Extractor:
HtmlExtractor extracts HyperText Markup Language documents [text/html].
The class retrieves the HTML title, content, hrefs, and imgs.
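To make the four retrieved fields concrete, here is a rough sketch that pulls the title, text content, hrefs, and imgs out of an HTML string. It deliberately swaps in plain java.util.regex for Tika's HTML parser so that it runs without dependencies; HtmlFieldsSketch and its methods are hypothetical names, and regex matching like this only copes with simple, well-formed markup with double-quoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the real HtmlExtractor relies on Apache Tika.
class HtmlFieldsSketch {
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG =
        Pattern.compile("<img\\s[^>]*?src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the text of the first <title> element, or "" if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns the visible text: title/script/style blocks and all tags removed.
    static String content(String html) {
        return html
            .replaceAll("(?is)<(script|style|title)[^>]*>.*?</\\1>", " ")
            .replaceAll("<[^>]+>", " ")
            .replaceAll("\\s+", " ")
            .trim();
    }

    static List<String> hrefs(String html) { return firstGroups(HREF, html); }

    static List<String> imgs(String html) { return firstGroups(IMG, html); }

    // Collects capture group 1 of every match of the pattern.
    private static List<String> firstGroups(Pattern pattern, String html) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head><body>"
            + "<p>Hello <a href=\"a.html\">link</a></p><img src=\"pic.png\"></body></html>";
        System.out.println(title(html));
        System.out.println(content(html));
        System.out.println(hrefs(html));
        System.out.println(imgs(html));
    }
}
```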
Document Extractor:
DocumentExtractor extracts metadata for the following document types:
- Microsoft Office document formats [application/msword, application/vnd.ms-powerpoint, application/vnd.visio, application/vnd.ms-outlook]
- OpenDocument Format [application/vnd.oasis.opendocument.*]
- Portable Document Format [application/pdf]
- Rich Text Format [application/rtf]
- Text formats [text/plain]
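As an illustration of how the list above could be checked in code, the sketch below tests a MIME type against the listed types, treating the trailing `*` in the OpenDocument entry as a prefix wildcard. DocumentTypeSketch and isSupported are hypothetical names, not part of DocumentExtractor, and the Visio entry is written in its full application/vnd.visio form.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the seck code base.
class DocumentTypeSketch {
    // MIME types taken from the list above; a trailing '*' matches any suffix.
    static final List<String> SUPPORTED = Arrays.asList(
        "application/msword",
        "application/vnd.ms-powerpoint",
        "application/vnd.visio",
        "application/vnd.ms-outlook",
        "application/vnd.oasis.opendocument.*",
        "application/pdf",
        "application/rtf",
        "text/plain");

    static boolean isSupported(String mimeType) {
        for (String pattern : SUPPORTED) {
            if (pattern.endsWith("*")) {
                // Compare against everything before the wildcard.
                String prefix = pattern.substring(0, pattern.length() - 1);
                if (mimeType.startsWith(prefix)) {
                    return true;
                }
            } else if (pattern.equals(mimeType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSupported("application/pdf"));
        System.out.println(isSupported("application/vnd.oasis.opendocument.text"));
        System.out.println(isSupported("image/png"));
    }
}
```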
Media Extractor:
MediaExtractor extracts metadata from images (.jpg, .png, .bmp) and music files (.mp3, .wav) using the Apache Tika library. Examples of the metadata include image size, camera type, ... To use it, create an object and call the extract function.
MediaExtractor metadataExtractor = new MediaExtractor();
metadataExtractor.extract(filename);