# How to set up your `.env` file
## Getting started: What is the `.env.example` file?

If you ended up on this page of our GitHub Wiki, you have most probably already seen the `.env`- or `.env.example`-file mentioned in our short Readme. That's great!
If you cloned the `oeh-search-etl`-repository, you'll see a `.env.example`-file sitting within the `converter`-folder. From your IDE's project view, it'll look like this: `converter/.env.example`.
This file holds configuration variables for you. Among other settings, it tells Scrapy which `LOG_LEVEL` and `MODE` to use and which `EDU_SHARING`-server to communicate with. As the file's ending suggests, this is just an example file that you still need to configure to fit your use case.
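To give you a rough idea of what such a file contains, here is an illustrative sketch; the real `converter/.env.example` is the authoritative reference, and its variable names, comments and default values may differ:

```
# illustrative sketch only – check converter/.env.example for the real defaults and comments
LOG_LEVEL = "DEBUG"
MODE = "json"
EDU_SHARING_BASE_URL = "http://localhost:8080/edu-sharing/"  # placeholder URL
```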
## You most definitely need an `.env` file!

As mentioned in our `Readme.md`, before you're able to build crawlers by yourself, you'll need a functional `.env`-file sitting inside your `converter/`-folder. To get there, you have two easy choices:
- The manual way: copy the `.env.example` and rename the copied file to `.env` with the help of your file explorer
- Or simply enter `cp converter/.env.example converter/.env` in your terminal from the project root folder
You should now see a file called `.env` inside your `converter/`-folder. Please take a peek inside the file and read the comments above each setting; they will help with the following explanations.
## Individual `.env`-settings explained

When building your own crawlers, you might want to know what the crawler is currently doing while it's running. `LOG_LEVEL` controls how much information you see during a crawl process, slightly similar to Python's own `setLevel`-parameter within its `logging` module: it sets a cutoff threshold for event log messages. Currently four levels are supported:

- `ERROR`
- `WARNING`
- `INFO`
- `DEBUG`
Similar to Python's logging levels, whatever is below your selected threshold will be ignored and not displayed on the console. While you're building a crawler and testing its functionality, you might want to set `LOG_LEVEL = "DEBUG"` at the beginning.
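In `.env` terms, that could look like the following sketch (adjust the threshold to your needs once the crawler matures):

```
# during development: show everything, including DEBUG messages
LOG_LEVEL = "DEBUG"
# later on you might only want to see errors, e.g.:
# LOG_LEVEL = "ERROR"
```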
## Different modes for different stages of crawler development

Since the `MODE`-setting is immensely useful for debugging purposes, here's a quick overview of which settings interact with each other:
| `MODE =` | Output location | Requires `EDU_SHARING`-server connection? | Interacts with additional `.env`-settings |
|---|---|---|---|
| `"csv"` | `output_<spider_name>.csv` | ❌ | `CSV_ROWS` |
| `"edu-sharing"` | no local output file by default | ✔️ | `EDU_SHARING_BASE_URL`, `EDU_SHARING_USERNAME`, `EDU_SHARING_PASSWORD` |
| `"json"` | `output_<spider_name>.json` | ❌ | ❌ (see Addendum) |
| `"None"` | check your Terminal/Console while the crawler is running | ❌ | ❌ |
If you want to locally test your Scrapy spider first without any edu-sharing server interaction, we recommend the setting `MODE = "json"`. Besides the output in your terminal, this will net you a useful `.json`-file that you can use to verify that the metadata your spider has gathered is correctly formatted and fits the data model for further processing.
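For this local-testing stage, the relevant lines in your `.env` could look like this (illustrative sketch):

```
# local test run: write results to output_<spider_name>.json, no edu-sharing server needed
MODE = "json"
LOG_LEVEL = "DEBUG"
```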
Hint: If you always want a `.json`-file available after a spider has finished crawling, no matter the currently selected mode, you can add the line `JSON = "1"` to your `.env`-configuration file.
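So if you are running in, say, `csv`-mode but still want the JSON dump alongside it, the combination could look like this (sketch):

```
MODE = "csv"
# additionally write output_<spider_name>.json on every run, regardless of MODE
JSON = "1"
```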
If you prefer a tabular overview, the `csv`-mode might come in handy: it interacts with `CSV_ROWS` and allows you to customize which fields of the gathered dataset you want to see in your `output_<spider_name>.csv` file.
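The exact value format of `CSV_ROWS` is described in the comments of `.env.example`; assuming it takes a comma-separated list of field names (please verify this against your copy of the example file; the field names below are purely hypothetical placeholders), a sketch could look like this:

```
MODE = "csv"
# ASSUMPTION: comma-separated field names – check the comment in .env.example for the exact syntax
CSV_ROWS = "title,description"
```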
## Once your crawler is stable enough, use the `edu-sharing`-mode!

While debugging your Scrapy spiders, most of the time you'll be using the "local" modes like `json` or `csv`. But once your Scrapy spider has reached a stable state and you want to actually test the gathered metadata in conjunction with an edu-sharing instance, you'll need to make sure that your server settings are correctly set up in `EDU_SHARING_SERVER`, `EDU_SHARING_USERNAME` and `EDU_SHARING_PASSWORD`. (Otherwise you'll notice error messages regarding `es_connector.py` in your terminal.)
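Such a configuration could look roughly like the sketch below; the variable names follow the table above and the values are placeholders, so take the exact names and defaults from your own `.env.example`:

```
MODE = "edu-sharing"
# placeholder values – point these at your own edu-sharing instance
EDU_SHARING_BASE_URL = "http://localhost:8080/edu-sharing/"
EDU_SHARING_USERNAME = "admin"
EDU_SHARING_PASSWORD = "changeme"
```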
Once you're sure that your crawler works flawlessly, it's time to flip the `DRY_RUN`-switch: set `DRY_RUN = False` so that your gathered metadata ends up in the edu-sharing database. You should now be able to see the data that your crawler has successfully scraped from a source inside the `SYNC_OBJECTS`-folder of your edu-sharing web interface, as individual entries within a sub-folder identical to your spider's `name`-value.
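In `.env` terms, that final switch could look like this (sketch; remember that with `DRY_RUN = False` the crawl really writes to the connected edu-sharing instance):

```
MODE = "edu-sharing"
# False = actually persist the gathered items to the edu-sharing database; keep it True while testing
DRY_RUN = False
```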
## Thumbnails? No problem with `SPLASH` (Web Thumbnailer)

Depending on the data source that your spiders are crawling, you might need to take a screenshot of the website itself if it isn't offering any images that would be suitable as a thumbnail. By default `DISABLE_SPLASH = False`, which means that your crawler will tell the server configured in the `SPLASH_URL` setting to take a screenshot of the entry that is currently being scraped.
For debugging purposes it's perfectly fine to set `DISABLE_SPLASH = True` as well as `DRY_RUN = True`, so that you're not causing additional traffic towards the source of your web crawl.
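A typical debugging setup as described above could therefore look like this (sketch):

```
# debugging: no thumbnail screenshots via Splash, nothing persisted to edu-sharing
DISABLE_SPLASH = True
DRY_RUN = True
```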
## Addendum
If you're wondering where the `.env`-settings are used, a look into `oeh-search-etl/converter/settings.py` will answer that question.

The `settings.py`-file also allows you to customize the fields to be exported into your `.json`-output by modifying the `FEED_EXPORT_FIELDS`-list. For further information, please consult the official Scrapy documentation on Feed exports.