Enabling local data sources - digitalmethodsinitiative/4cat GitHub Wiki
Most of 4CAT's data sources use external APIs. However, the tool is also capable of capturing, storing, and querying locally saved data, for instance with 4chan and 8kun data (see the data source overview for a list of all local data sources). These data are stored in a PostgreSQL database and can be queried with Sphinx search.
This page explains how to enable the collection and querying of local data sources.
Enable local data source collection
The first step is to enable the collection of locally stored data.
Step 1: Add database tables
We first need to generate the database tables for the local data sources you want to add.
This is done by running the SQL query stored in the database.sql file in the data source's datasources/ folder (e.g. datasources/fourchan/database.sql).
How to run this SQL query will depend on your specific installation. Usually, this involves running a command through psql from the data source folder like so:
psql -U username -d mydatabase -a -f database.sql
On manual, local 4CAT installations, you can also use the query tool in software like pgAdmin. If you're using Docker, the following code adds the database tables for 4chan collection to a fourcat database with user fourcat:
docker exec -it 4cat_backend /bin/bash
cd datasources/fourchan/
psql --host=db --port=5432 --user=fourcat --dbname=fourcat -f database.sql
Step 2: Enable the data source
Once the database tables are generated, let's enable the data source through 4CAT's Web interface.
Navigate to Control Panel -> Settings -> Data sources in 4CAT. Then, enable the desired data sources by ticking their checkboxes.
Step 3: Set individual data source settings
Enabling a local data source generates a specific menu for that data source on the Data sources settings page in the Control Panel (e.g. "4chan search"). Here you might have to make some adjustments. For instance, for imageboard data collection, you have to specify which boards you want to scrape, such as 4chan's /pol/.
You can add more than one board to the list, e.g. ["pol", "v", "fit"]. You can also specify the interval with which boards are scraped, and whether to download images.
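As a sketch, the value of the boards setting is a JSON list of board names, for example:

```json
["pol", "v", "fit"]
```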
Step 4: Restart 4CAT to start collection
Go to Control Panel -> Restart or Upgrade and click the Restart button. If you're using Docker, you can also use the Docker Desktop interface to stop and start the 4cat_backend container.
After 4CAT restarts, you should begin to see log messages showing collected data.
Congrats! You're collecting data in a local PostgreSQL database. The data source will now show up in the Create dataset page.
Enabling text search
To execute queries for most local data sources, however, you will also need to run a full-text search engine. To do so, we need to install Sphinx search and index the database.
The instructions will differ based on whether you're using 4CAT through Docker or if you're running it manually.
Installing and running Sphinx via Docker
Step 1: Create a sphinx.conf file
- Run the command `docker exec 4cat_backend python3 helper-scripts/generate_sphinx_config.py` to create a Sphinx configuration file, which contains information on all of the enabled local data sources (per the steps above).
- Copy the `sphinx.conf` file to the host machine's current working directory, so we can edit the file. You can do so with the command `docker cp 4cat_backend:/usr/src/app/helper-scripts/sphinx.conf ./` You will later copy the `sphinx.conf` file to a new Sphinx container.
Step 2: Update sphinx.conf file
- Ensure `sql_host` is the 4CAT database container name, e.g., `sql_host = db` (older 4CAT versions did not do this automatically).
- Change the `listen` hosts from `localhost` to `0.0.0.0`. This allows Sphinx to receive connections from other containers and, if desired, your host machine.
listen = 0.0.0.0:9213
listen = 0.0.0.0:9306:mysql41
Step 3: Create a sphinxsearch container
This container will index your collected data and allow you to search the data with 4CAT. The Docker image can be found here. To create the container, run the following command:
docker run -it --publish 9306 --name 4cat_sphinx -d macbre/sphinxsearch:3.3.1 /bin/sh
Step 4: Connect the Sphinx container to the 4CAT network
- Run `docker network ls` to identify the 4CAT network, likely `4cat_default`.
- Run `docker network connect 4cat_default 4cat_sphinx`, assuming `4cat_default` is the name of your 4CAT network and you used the `--name 4cat_sphinx` option when creating the `sphinxsearch` container in the previous step.
Step 5: Update Sphinx host setting in 4CAT
Edit the "Sphinx host" setting in 4CAT via Control Panel -> Settings -> 4CAT Tool Settings. There are two options:
Option 1
- Set "Sphinx host" to the name of the `sphinxsearch` container (e.g., `4cat_sphinx`).
Option 2
- Run `docker network inspect 4cat_default` after adding the Sphinx container to the network. Find the new Sphinx container in the Containers section and copy its `IPv4Address`.
- In the 4CAT Control Panel, go to "4CAT Tool Settings" and change the "Sphinx host" value to the Sphinx IP address you just copied.
Note on older 4CAT versions
Prior to 2023-07, the host for Sphinx was hard-coded to run alongside 4CAT, but it must be updated for a Docker container setup.
This only affects the 4chan data source. Change this line in datasources/fourchan/search_4chan.py to the Sphinx container IP address.
- Change the `MySQLDatabase` host (default is `localhost`) to the Docker IP address found via inspecting the 4CAT Docker network: `docker network inspect 4cat_default`. (You can copy the file to your host directory in order to edit it via `docker cp 4cat_backend:/usr/src/app/datasources/fourchan/search_4chan.py ./`, or edit it directly in the container if desired.)
- After updating, copy the file back to the `4cat_backend` container (i.e., `docker cp datasources/fourchan/search_4chan.py 4cat_backend:/usr/src/app/datasources/fourchan/`).
Step 6: Create indexes and run Sphinx
We finally need to create full-text search indexes for any data that you have already collected. Generating indexes means Sphinx will create fast lookup tables so words can be searched quickly. Afterwards, we start Sphinx by executing ./searchd. Follow these steps:
# Copy the `sphinx.conf` file we generated above to the sphinx bin folder
docker cp sphinx.conf 4cat_sphinx:/opt/sphinx/sphinx-3.3.1/bin/
# Connect to container
docker exec -it 4cat_sphinx /bin/sh
# Navigate to sphinx-3.3.1/bin/
cd /opt/sphinx/sphinx-3.3.1/bin/
# Create data and data/binlog folders IN the sphinx folder (sphinx-3.3.1/data/)
mkdir ../data
mkdir ../data/binlog
# run indexer
./indexer --all
# start searchd
./searchd
This generates full-text search indexes for all the local data sources you enabled and activates Sphinx. Make sure to keep the container running, and restart ./searchd whenever you restart the container!
To index newly collected posts, you can run docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate" whenever the container is running.
Docker troubleshooting
- You can check what Sphinx is listening on by running `netstat -nlp` inside the Sphinx container (`docker exec -it sphinx_container_id /bin/bash`).
Installing and running Sphinx manually
If you're not using Docker, you can also install and run Sphinx manually.
- Download the Sphinx 3.3.1 source code.
- Create a `sphinx` directory somewhere in the directory of your 4CAT instance, e.g. `4cat/sphinx/`. In it, paste all the unzipped contents of the `sphinx-3.3.1.zip` file you just downloaded (so that it's filled with the directories `api`, `bin`, etc.). In the `sphinx` directory, also create a folder called `data`, and in this `data` directory, one called `binlog`.
- Add a Sphinx configuration file. You can generate one by running the `generate_sphinx_config.py` script in the folder `helper-scripts`. After running `generate_sphinx_config.py`, a file called `sphinx.conf` will appear in the `helper-scripts` directory. Copy this file to the `bin` folder in your `sphinx` directory (in the example above: `4cat/sphinx/bin/sphinx.conf`).
- Generate indexes for the posts that you already collected (if you haven't run any scrape yet, you can do this later). Generating indexes means Sphinx will create fast lookup tables so words can be searched quickly. In your command line interface, navigate to the `bin` directory of your Sphinx installation and run the command `./indexer --all` (Linux) or `indexer.exe --all` (Windows). This should generate the indexes.
  - If you get the error `No such file or directory, will not index.`, make sure there's a `data` folder in the `sphinx` directory.
- Finally, before executing any search queries, make sure Sphinx is active. In your command line interface, run `./searchd` (Linux) or `searchd.exe` (Windows; see known issues below if you get an error), once again within Sphinx's `bin` folder. Make sure to leave this process running (you may want to use something like `tmux`).
See the Sphinx docs for more information.
Sphinx is now ready for search via 4CAT!
You will need to re-run the indexer (docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate" for Docker, ./indexer --all for Linux, and indexer.exe --all for Windows) to update Sphinx's indexes with newly collected data. As your data grows, this can take a lot of time, so we run the indexer nightly via a cronjob script.
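For instance, a nightly reindex in a Docker setup could be scheduled with a crontab entry like the following (the 03:00 schedule and the `4cat_sphinx` container name are assumptions; adjust them to your installation):

```crontab
# Re-index all Sphinx indexes nightly at 03:00, picking up newly collected posts
0 3 * * * docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate"
```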
Known Issues
- On Windows, you might encounter the error `The code execution cannot proceed because ssleay32.dll was not found` (see also this page). This can be solved by downloading Sphinx version 3.1.1 and copying the following files from the 3.1.1 `bin` directory to your 3.3.1 `bin` directory:
  - libeay32.dll
  - msvcr120.dll
  - ssleay32.dll
- On Linux, you might run into permission issues. Make sure to execute the scripts with the right user.