Scrape: retrieve video meta data - nova-video-player/aos-AVP GitHub Wiki

Scraping is the retrieval of video metadata from two web services: thetvdb.com and tmdb.com.

If you experience an issue with the NOVA scraper, you can use the simple test program at https://github.com/nova-video-player/TestScraper to troubleshoot what is going wrong and propose enhancements.

In the code, ShowScraper2.java and MovieScraper2.java perform this task. They work from the file name, which is first pre-processed to clean out known bad text patterns.

This cleaning is performed in MovieDefaultMatcher.java.
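As a rough illustration of this kind of file name cleaning, here is a minimal sketch. It is not the actual MovieDefaultMatcher.java logic: the class name `FileNameCleaner`, the method `clean` and the list of junk tokens are all assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the real MovieDefaultMatcher implementation.
final class FileNameCleaner {
    // Assumed examples of "known bad" tokens; the real list lives in the Nova sources.
    // Everything from the first junk token to the end of the name is dropped.
    private static final Pattern JUNK = Pattern.compile(
            "(?i)\\b(hdtv|bluray|x264|x265|720p|1080p|webrip)\\b.*$");
    // A trailing ".xyz" of 2 to 4 alphanumeric characters is treated as the extension.
    private static final Pattern EXTENSION = Pattern.compile("\\.[A-Za-z0-9]{2,4}$");
    // Dots and underscores are commonly used instead of spaces in file names.
    private static final Pattern SEPARATORS = Pattern.compile("[._]+");

    static String clean(String fileName) {
        String name = EXTENSION.matcher(fileName).replaceFirst("");
        name = SEPARATORS.matcher(name).replaceAll(" ");
        name = JUNK.matcher(name).replaceFirst("");
        return name.trim();
    }
}
```

With these assumed patterns, `FileNameCleaner.clean("White.Collar.hdtv.x264.mkv")` yields `White Collar`. The extension rule is deliberately naive and would also strip a trailing `.2019` from a name without an extension, so a real matcher needs more care.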

Supported languages are registered in BaseScraper2.java.

For TV shows, Nova has switched to the thetvdb-java library to retrieve metadata from thetvdb.com.

Changes made to the thetvdb.com backend on 02/11/2019 caused many scrape issues, cf. https://forums.thetvdb.com/viewtopic.php?f=122&t=60239

To provide better scrape results, the TV show scrape process now splits the results of the thetvdb search into three categories:

  1. shows without a valid poster
  2. shows with a valid poster and a numeric slug
  3. shows with a valid poster and a non-numeric slug

The list of shows is then ordered with category 3 first, then 2, then 1. Category 3 is further sorted by increasing Levenshtein distance between the pre-processed video file name and the title returned by thetvdb, so that the closest match comes first.
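The ordering described above can be sketched as follows. This is only an illustration under assumptions, not the code from ShowScraper2.java: the `Show` class and the `ShowRanker` helper are hypothetical stand-ins for the thetvdb search results, and the comparison is made case-insensitive for convenience.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the three-category ordering; not the actual Nova code.
final class ShowRanker {
    // Assumed minimal view of a thetvdb search result.
    static final class Show {
        final String title;
        final String slug;
        final boolean hasPoster;

        Show(String title, String slug, boolean hasPoster) {
            this.title = title;
            this.slug = slug;
            this.hasPoster = hasPoster;
        }
    }

    // Classic two-row dynamic programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1: no valid poster, 2: poster and numeric slug, 3: poster and non-numeric slug.
    static int category(Show s) {
        if (!s.hasPoster) return 1;
        return s.slug.matches("\\d+") ? 2 : 3;
    }

    // Category 3 first, then 2, then 1; only category 3 is sorted by distance.
    // List.sort is stable, so categories 2 and 1 keep the order returned by the search.
    static List<Show> rank(List<Show> results, String cleanedFileName) {
        String query = cleanedFileName.toLowerCase(Locale.ROOT);
        List<Show> ranked = new ArrayList<>(results);
        ranked.sort(Comparator
                .comparingInt((Show s) -> -category(s))
                .thenComparingInt(s -> category(s) == 3
                        ? levenshtein(query, s.title.toLowerCase(Locale.ROOT))
                        : 0));
        return ranked;
    }
}
```

With a query of "white collar", a result titled "White Collar" is placed ahead of "White Collar Brawlers" when both have a valid poster and a non-numeric slug, whatever order the search returned them in.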

This helps a lot for shows like White Collar, cf. https://www.thetvdb.com/search?query=white%20collar

Note that the Levenshtein distance can cause problems with multilingual search. For instance, in French "White Collar" returns "FBI: Duo très spécial", which obviously has a large Levenshtein distance from the query. This problem has since been solved by computing the Levenshtein distance against both the local-language and the English titles, selecting the show ID with the minimum distance, and then re-matching the show title in the local language. A preferable mitigation would be for the thetvdb backend itself to sort the results by popularity. Such a request has been made to thetvdb at https://gitlab.thetvdb.com/site/thetvdb_api/issues/75 and at https://forums.thetvdb.com/viewtopic.php?f=17&t=60976
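The two-language comparison can be sketched like this. It is a simplified illustration, not the Nova implementation: `BilingualMatcher` and `Candidate` are hypothetical names, and each candidate is assumed to already carry both its local-language and its English title.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of matching on both local and English titles; not the actual Nova code.
final class BilingualMatcher {
    // Assumed shape: one search result with its title in both languages.
    static final class Candidate {
        final int showId;
        final String localTitle;
        final String englishTitle;

        Candidate(int showId, String localTitle, String englishTitle) {
            this.showId = showId;
            this.localTitle = localTitle;
            this.englishTitle = englishTitle;
        }
    }

    // Classic two-row dynamic programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance of a candidate = the smaller of its local and English title distances.
    static int distance(Candidate c, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        return Math.min(
                levenshtein(q, c.localTitle.toLowerCase(Locale.ROOT)),
                levenshtein(q, c.englishTitle.toLowerCase(Locale.ROOT)));
    }

    // Returns the candidate with the minimum distance, or null for an empty list.
    // The caller keeps winner.showId and displays winner.localTitle.
    static Candidate best(List<Candidate> candidates, String query) {
        Candidate winner = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Candidate c : candidates) {
            int d = distance(c, query);
            if (d < bestDistance) {
                winner = c;
                bestDistance = d;
            }
        }
        return winner;
    }
}
```

In the French example, a file named "White Collar" matches the English title exactly, so the show is selected by its ID even though its French title is far from the query. The French title is then the one shown to the user.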

TIP:

For quick identification of recordings from external backends, without additional NFO files, you can rename the files using this quick-and-dirty technique. If the original file name is RECORDING_FILE_WHATEVER.TS, prepend the title as follows:

  • Movies: TITLE.hdtv.RECORDING_FILE_WHATEVER.TS
  • TV-Shows: TITLE_S00E00.hdtv.RECORDING_FILE_WHATEVER.TS

Then, when the directory is re-scanned, the scraping process will identify the content based on the TITLE.
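For a large number of recordings, the renaming can be scripted. The sketch below only builds the target names following the two patterns above. `RecordingRenamer` and its methods are hypothetical helpers, not part of Nova.

```java
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper that builds the target names from the TIP above; not part of Nova.
final class RecordingRenamer {
    // Movies: TITLE.hdtv.<original file name>
    static Path movieTarget(Path recording, String title) {
        return recording.resolveSibling(title + ".hdtv." + recording.getFileName());
    }

    // TV shows: TITLE_SxxExx.hdtv.<original file name>, with two-digit season and episode.
    static Path episodeTarget(Path recording, String title, int season, int episode) {
        String tag = String.format(Locale.ROOT, "%s_S%02dE%02d", title, season, episode);
        return recording.resolveSibling(tag + ".hdtv." + recording.getFileName());
    }
}
```

The actual rename can then be done with `java.nio.file.Files.move(recording, target)`, followed by a re-scan of the directory.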