Dev extractor guide - achakra/seck GitHub Wiki
About Extractor:
The extractor processes the various document formats downloaded by the crawler, using the Apache Tika library. These include HTML, documents, and media types.
Design Plan:
                  [ExtractorFactory]
                           |
                 [Abstract Extractor]
                           |
       -----------------------------------------
       |                   |                   |
[HtmlExtractor]   [DocumentExtractor]   [MediaExtractor]
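The layout above can be sketched in plain Java as follows. This is a minimal illustration only: the `Sketch*` class names, the `kind()` method, and the dispatch on file extension are all assumptions made for the example, and the real factory may well rely on Tika's type detection instead.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical stand-in for the abstract extractor in the diagram.
abstract class SketchExtractor {
    protected final File file;

    SketchExtractor(File file) {
        this.file = file;
    }

    // Reports which family of formats this extractor handles.
    abstract String kind();
}

class SketchHtmlExtractor extends SketchExtractor {
    SketchHtmlExtractor(File file) { super(file); }
    @Override String kind() { return "html"; }
}

class SketchDocumentExtractor extends SketchExtractor {
    SketchDocumentExtractor(File file) { super(file); }
    @Override String kind() { return "document"; }
}

class SketchMediaExtractor extends SketchExtractor {
    SketchMediaExtractor(File file) { super(file); }
    @Override String kind() { return "media"; }
}

class SketchExtractorFactory {
    // Assumption: choose the subclass from the file extension.
    // Anything that is not HTML or a known media type falls back
    // to the document extractor.
    SketchExtractor getExtractor(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        String ext = dot >= 0 ? name.substring(dot + 1) : "";
        switch (ext) {
            case "html":
            case "htm":
                return new SketchHtmlExtractor(file);
            case "jpg":
            case "png":
            case "bmp":
            case "mp3":
            case "wav":
                return new SketchMediaExtractor(file);
            default:
                return new SketchDocumentExtractor(file);
        }
    }
}
```

Callers only ever see the abstract type, so supporting a new format family means adding one subclass and one branch in the factory.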
Sample code for extracting any of the above document types:
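String fileName = "test.docx";
File file = new File(fileName);  // java.io.File
// The factory returns the extractor that matches the file type.
ExtractorFactory efactory = new ExtractorFactory();
Extractor ex = efactory.getExtractor(file);
// extract() returns a Bloblet, from which the metadata can be read.
Bloblet b = ex.extract();
Metadata m = b.getMetadata();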
Html Extractor:
HtmlExtractor extracts HyperText Markup Language documents [text/html].
The class retrieves the HTML title, content, links (hrefs), and images (imgs).
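To make the retrieved fields concrete, here is a rough sketch that pulls the title, hrefs, and imgs out of an HTML string. It deliberately uses simple regular expressions from the Java standard library as a stand-in, not the Tika-based parsing that HtmlExtractor itself uses, and it only handles double-quoted attributes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a regex stand-in, not the project's HtmlExtractor.
class HtmlFieldsSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG = Pattern.compile(
            "<img\\s[^>]*?src\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the text of the first <title> element, or "" if none exists.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns the href value of every <a> tag, in document order.
    static List<String> hrefs(String html) {
        return allMatches(HREF, html);
    }

    // Returns the src value of every <img> tag, in document order.
    static List<String> imgs(String html) {
        return allMatches(IMG, html);
    }

    private static List<String> allMatches(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }
}
```

Regular expressions are fragile on real-world markup, which is one reason to prefer a proper parser such as the one Tika provides.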
Document Extractor:
DocumentExtractor extracts metadata for the following document types:
- Microsoft Office document formats [application/msword, application/vnd.ms-powerpoint, application/vnd.visio, application/vnd.ms-outlook]
- OpenDocument Format [application/vnd.oasis.opendocument.*]
- Portable Document Format [application/pdf]
- Rich Text Format [application/rtf]
- Text formats [text/plain]
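One way to check whether a detected MIME type belongs to the list above is sketched below. The class and method names are hypothetical, and the handling of the `application/vnd.oasis.opendocument.*` wildcard as a prefix match is an assumption about how that entry is meant to be read.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the project's DocumentExtractor.
class DocumentTypeSketch {
    // Exact MIME types taken from the list of supported document formats.
    private static final Set<String> EXACT = Set.of(
            "application/msword",
            "application/vnd.ms-powerpoint",
            "application/vnd.visio",
            "application/vnd.ms-outlook",
            "application/pdf",
            "application/rtf",
            "text/plain");

    // Assumption: the OpenDocument wildcard is treated as a prefix match.
    private static final String ODF_PREFIX =
            "application/vnd.oasis.opendocument.";

    static boolean isDocumentType(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String base = mimeType.split(";", 2)[0]
                .trim()
                .toLowerCase(Locale.ROOT);
        return EXACT.contains(base) || base.startsWith(ODF_PREFIX);
    }
}
```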
Media Extractor:
MediaExtractor extracts metadata from images (.jpg, .png, .bmp) and music files (.mp3, .wav) using the Apache Tika library. The metadata includes fields such as image size and camera type. To use it, create an object and call the extract function:
MediaExtractor metadataExtractor = new MediaExtractor();
metadataExtractor.extract(filename);
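For a feel of what a field such as image size looks like, the following sketch reads the width and height of an image with the standard `javax.imageio` API. It is a stand-in for illustration, not the Tika-based MediaExtractor; the class name, the method names, and the `width`/`height` keys are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.ImageIO;

// Hypothetical helper; uses javax.imageio rather than Apache Tika.
class ImageSizeSketch {
    // Decodes the image and returns its dimensions as string metadata.
    static Map<String, String> imageSize(byte[] imageBytes) {
        try {
            ImageIO.setUseCache(false); // keep everything in memory
            BufferedImage img =
                    ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (img == null) {
                throw new IOException("Unsupported or corrupt image data");
            }
            Map<String, String> meta = new LinkedHashMap<>();
            meta.put("width", String.valueOf(img.getWidth()));
            meta.put("height", String.valueOf(img.getHeight()));
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a blank PNG in memory so the sketch needs no file on disk.
    static byte[] samplePng(int width, int height) {
        try {
            ImageIO.setUseCache(false);
            BufferedImage img = new BufferedImage(
                    width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(img, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Fields such as camera type are stored in embedded metadata blocks that this sketch does not read.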