How To set up your .env file

Getting started: What is the .env.example file?

If you ended up on this page of our GitHub Wiki, you have most probably already seen the .env- or .env.example-file mentioned in our short Readme. That's great!

If you cloned the oeh-search-etl-repository, you'll see a .env.example-file sitting within the converter-folder. From your IDE's project-view, it'll look like this: converter/.env.example.

This file holds configuration variables for you. Among other settings, it tells Scrapy which LOG_LEVEL or MODE to use and which edu-sharing server to communicate with. As the file extension suggests, this is just an example file that you still need to adapt to your use case.

You most definitely need a .env file!

As mentioned in our Readme.md, before you're able to build crawlers by yourself, you'll need a functional .env file inside your converter/ folder. To get there, you have two easy options:

  • The manual way, by copy-pasting the .env.example and renaming the copied file to .env with the help of your file explorer
  • or by simply entering cp converter/.env.example converter/.env in your Terminal from the project root folder

You should now see a file called .env inside your converter/ folder. Please take a peek inside the file and read the comments above each setting; these will help with the following explanations.

Individual .env-Settings explained

When building your own crawlers, you might want to know what the crawler is currently doing while it's running. LOG_LEVEL sets a cutoff threshold for log messages, similar to the setLevel() method of Python's logging module, and thereby controls how much information you see during a crawl. Currently four levels are supported:

  • ERROR
  • WARNING
  • INFO
  • DEBUG

Similar to Python's logging levels, whatever is below your selected threshold will be ignored and not displayed on the console. While you're building a crawler and testing its functionality, you might want to set LOG_LEVEL = "DEBUG" at the beginning.
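For example, during development the relevant part of your converter/.env could look like this (an illustrative sketch; "DEBUG" is the most verbose level):

```
# Show all log messages while developing a new crawler.
# Switch to "INFO" or "WARNING" once the spider runs reliably.
LOG_LEVEL = "DEBUG"
```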

Different modes for different stages of crawler development

Since the MODE setting is immensely useful for debugging purposes, here's a quick overview of which settings interact with each other:

| MODE = | Output location | Requires edu-sharing server connection | Interacts with additional .env settings |
|---|---|---|---|
| "csv" | output_<spider_name>.csv | ❌ | CSV_ROWS |
| "edu-sharing" | no local output file by default | ✔️ | EDU_SHARING_BASE_URL, EDU_SHARING_USERNAME, EDU_SHARING_PASSWORD |
| "json" | output_<spider_name>.json | ❌ | (see Addendum) |
| "None" | check your Terminal/Console while the crawler is running | ❌ | |

If you want to locally test your Scrapy spider first, without any edu-sharing server interaction, we recommend the setting MODE = "json". Besides the output in your terminal, this will net you a useful .json file that you can use to verify that the metadata your spider has gathered is correctly formatted and fits the data model for further processing.

Hint: If you always want to have a .json-file available after a spider has finished crawling, no matter the currently selected mode, you can add the line JSON = "1" to your .env-configuration file.
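Put together, a local-testing configuration might look like this sketch (variable names as used in .env.example; the values are the ones discussed above):

```
# Write gathered items to output_<spider_name>.json instead of an edu-sharing server.
MODE = "json"
# Optional: always produce a .json file, no matter which MODE is selected.
JSON = "1"
```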

If you prefer a tabular overview, the csv mode might come in handy: it interacts with CSV_ROWS and allows you to customize which fields of the gathered dataset appear in your output_<spider_name>.csv file.
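A corresponding .env sketch could look like this; the field names in CSV_ROWS are placeholders, so replace them with the fields of the data model that you actually want to export:

```
# Tabular output: one row per scraped item in output_<spider_name>.csv.
MODE = "csv"
# Hypothetical, comma-separated list of fields to export as columns:
CSV_ROWS = "lom.general.title,lom.general.description"
```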

Once your crawler is stable enough, use the edu-sharing-mode!

While debugging your Scrapy spiders, most of the time you'll be using the "local" modes like json or csv. Once your spider has reached a stable state and you want to test the gathered metadata in conjunction with an actual edu-sharing instance, make sure that your server settings are correctly set up in EDU_SHARING_BASE_URL, EDU_SHARING_USERNAME and EDU_SHARING_PASSWORD. (Otherwise you'll see error messages related to es_connector.py in your terminal.)

Once you're sure that your crawler works flawlessly, it's time to flip the DRY_RUN switch: set DRY_RUN = False so that the gathered metadata actually ends up in the edu-sharing database. You should then see the data that your crawler has successfully scraped from a source inside the SYNC_OBJECTS folder of your edu-sharing web interface, as individual entries within a sub-folder named after your spider's name value.
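A rough sketch of the corresponding .env section might look like this; the URL and credentials are placeholders and need to be replaced with the details of your own edu-sharing instance:

```
# Upload crawled items to an edu-sharing instance (placeholder values!):
MODE = "edu-sharing"
EDU_SHARING_BASE_URL = "https://your-edu-sharing-server.example/edu-sharing/"
EDU_SHARING_USERNAME = "your_crawler_user"
EDU_SHARING_PASSWORD = "your_password"
# Only set this to False once you trust your crawler's output:
DRY_RUN = False
```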

Thumbnails? No problem with SPLASH (Web Thumbnailer)

Depending on the data source your spiders are crawling, you might need to take a screenshot of the website itself if it doesn't offer any images suitable as a thumbnail. By default, DISABLE_SPLASH = False, which means that your crawler will tell the server configured in the SPLASH_URL setting to take a screenshot of the entry that is currently being scraped.

For debugging purposes it's perfectly fine to set DISABLE_SPLASH = True as well as DRY_RUN = True so that you're not causing additional traffic towards the source of your web-crawl.
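As a sketch, the Splash-related part of the .env could look like this; the URL is a placeholder for wherever your Splash instance is actually reachable:

```
# Let Splash take website screenshots as a thumbnail fallback:
DISABLE_SPLASH = False
# Placeholder URL, point this at your own Splash instance:
SPLASH_URL = "http://localhost:8050"
```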

Addendum

If you're wondering where the .env settings are used, a look into oeh-search-etl/converter/settings.py will answer that question. The settings.py file also allows you to customize the fields exported into your .json output by modifying the FEED_EXPORT_FIELDS list. For further information, please consult the official Scrapy documentation on Feed exports.