BASC 4archive
The 4archive, one of the few selective public archives of 4chan data, shut down on May 7th, 2015.
Sometimes you might only have access to a SQL database backup. To work with this database, you will have to create a local MySQL/MariaDB server, and import the database backup.
- Install MySQL/MariaDB. Some Linux distros may have already migrated to MariaDB, which is a drop-in replacement for MySQL.
- Start the MySQL/MariaDB service. On Debian/Ubuntu:

  ```
  sudo service mysql start
  ```

- Run the MySQL/MariaDB secure installation script as root, and follow the instructions. Make sure to drop the test database and reload the privilege tables.

  ```
  sudo mysql_secure_installation
  ```

- Once you're finished, log in as the root MySQL user.

  ```
  mysql -u root -p
  ```
- Create a user to manage the databases. For example, I used the user `localuser` with the password `password`.

  ```
  MariaDB [(none)]> CREATE USER 'localuser'@'localhost' IDENTIFIED BY 'password';
  Query OK, 0 rows affected (0.00 sec)

  MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO 'localuser'@'localhost'
      -> WITH GRANT OPTION;
  Query OK, 0 rows affected (0.00 sec)
  ```
- Now create a new database to import the backup into. In this example, we call it `4archive`. Then quit.

  ```
  MariaDB [(none)]> create database 4archive;
  Query OK, 1 row affected (0.00 sec)

  MariaDB [(none)]> quit
  ```
Now you can import the database. As the `localuser` user, we access the SQL server on `localhost` and insert the `4archive_dump.sql` dump into the `4archive` database.

```
$ mysql -u localuser -p -h localhost 4archive < 4archive_dump.sql
```
The first step was to convert the MySQL dump into a manageable offline database. I chose SQLite, since it is a portable and simple SQL database format. (Later on, I might look into CouchDB or MongoDB due to the workings of our chan.zip format.)

I imported the MySQL dump into a local MariaDB server (via Linux, of course). Then, following this export guide, I used the Ruby gem `sequel` to convert it to SQLite.
```
$ gem install sequel mysql sqlite3
$ sequel mysql://<user>:<password>@<host>/<database> -C sqlite://db.sqlite
```

For example, for our 4archive database:

```
$ sequel mysql://localuser:password@localhost/4archive -C sqlite://4archive.sqlite
```

The generic command saves to the file `db.sqlite`; our example writes to `4archive.sqlite` instead. Nice and easy.
Well, not yet. MySQL uses Latin1 encoding by default, which has long been superseded by UTF-8 encoding (used in SQLite). This usually isn't an issue for characters within the ASCII set, but accented Latin characters might get corrupted in the process. You should convert the MySQL database to use UTF-8 if possible, or otherwise take this possible bug into account.
This primarily affects the `post.original_image_name` entry. Since `post.body` (containing the actual comment text) is stored as a BLOB, it is not affected.
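If accented filenames did get mangled, one common failure mode is double encoding: UTF-8 bytes that were read as Latin-1. The snippet below is only a rough sketch of one way to repair that in the converted SQLite file; it assumes the `4archive.sqlite` filename from the example above, a TEXT `original_image_name` column, and an integer `id` primary key on `posts` (mirroring `threads.id`), none of which is guaranteed by the dump.

```python
import sqlite3

# Sketch, not the project's actual tooling: repair names that came through
# double-encoded (UTF-8 bytes that were stored/read as Latin-1).
conn = sqlite3.connect("4archive.sqlite")
rows = conn.execute(
    "SELECT id, original_image_name FROM posts "
    "WHERE original_image_name IS NOT NULL").fetchall()

for row_id, name in rows:
    try:
        # Mojibake round-trips back to the intended UTF-8 string; text that
        # is already correct either raises here or round-trips to itself.
        fixed = name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        continue
    if fixed != name:
        conn.execute(
            "UPDATE posts SET original_image_name = ? WHERE id = ?",
            (fixed, row_id))

conn.commit()
conn.close()
```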
The next step is to obtain all the Imgur/Imageshack images. 4archive was unique in that it hosted no images, instead depending on third-party services to do the hard work.
However, it's a good idea to keep these images archived in the Internet Archive, so we need to make a list of images to scrape. Another good idea is to sort them into chan.zip-style folders.
I constructed the following SQL JOIN statement to generate a report of all image URLs and accompanying information. It was written for SQLite, but should work with MySQL as well.
This statement:

- Conducts an INNER JOIN between the `posts` and `threads` tables.
- Displays the columns `board`, `thread_id`, `post_id`/`chan_id`, `image_url`, `image_name`, and `original_image_name`.
- Only displays posts that contain images.
- Orders by `board`, then `thread_id`, then `post_id`/`chan_id`, so everything is chronologically sequential.
```
SELECT threads.board,
       threads.thread_id,
       posts.chan_id AS post_id,
       posts.image_url,
       posts.chan_image_name AS image_name,
       posts.original_image_name
FROM posts
INNER JOIN threads ON threads.id = posts.threads_id
WHERE posts.image_url IS NOT NULL
ORDER BY threads.board,
         threads.thread_id,
         posts.chan_id;
```
While we could just make a list of all images in an unsorted fashion, a better method is to put the Imgur/Imageshack links into folders laid out as `board/thread/image_urls.txt`, as mandated by the `chan.arc` standard. That way, the images would be automatically sorted into folders when they are downloaded.

We created a Python script to do the hard work of creating the folders and dumping the image URLs to a text file (e.g. `b/21451323/imageurls.txt`). It takes only a few minutes to generate all the folders and text files.
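The script itself isn't reproduced here, but a minimal sketch of the approach might look like the following. It assumes the `4archive.sqlite` file and the column names used in the JOIN above; the real script may differ in details such as the output filename.

```python
import os
import sqlite3

# Sketch: run the JOIN from the report above and append each image URL to
# board/thread_id/image_urls.txt, creating the folders as needed.
QUERY = """
SELECT threads.board,
       threads.thread_id,
       posts.image_url
FROM posts
INNER JOIN threads ON threads.id = posts.threads_id
WHERE posts.image_url IS NOT NULL
ORDER BY threads.board, threads.thread_id, posts.chan_id;
"""

conn = sqlite3.connect("4archive.sqlite")
for board, thread_id, image_url in conn.execute(QUERY):
    folder = os.path.join(board, str(thread_id))
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "image_urls.txt"), "a") as urls:
        urls.write(image_url + "\n")
conn.close()
```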
The next step is to actually download the images; there is a massive amount of data to scrape.
One major hurdle is that some posts never had their accompanying images archived, for whatever reason. Instead of a NULL value, these posts carry a placeholder such as `/images/images-404.png` or `/images/thread-404.png`. We use the regex `/images/\w+-404` to match any occurrence of this in the database, and ignore it.
```python
import re

# ignore placeholder 404 urls such as /images/images-404.png
regex404 = re.compile(r"/images/\w+-404")
if regex404.match(image_url):
    continue
```
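To show where that filter fits, here is a hedged sketch of a download pass over the folders generated earlier. The actual BASC scraper is not shown here, and a real run would at minimum need rate limiting and retries.

```python
import os
import re
import urllib.request

# Sketch: walk the board/thread folders, read each image_urls.txt,
# skip the 404 placeholders, and download everything else alongside it.
regex404 = re.compile(r"/images/\w+-404")

for root, _dirs, files in os.walk("."):
    if "image_urls.txt" not in files:
        continue
    with open(os.path.join(root, "image_urls.txt")) as url_list:
        for image_url in (line.strip() for line in url_list):
            if not image_url or regex404.match(image_url):
                continue  # blank line or 404 placeholder, nothing to fetch
            filename = os.path.join(root, image_url.rsplit("/", 1)[-1])
            if not os.path.exists(filename):
                urllib.request.urlretrieve(image_url, filename)
```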
The next step was to export all the threads from this database into chan.zip format, which can then be hosted on GitHub Pages.

This meant that we finally needed to implement the JSON templating system, using Jinja2 or a similar engine.
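As an illustration only, rendering a thread with Jinja2 might look something like this; neither the template nor the field names below are the project's actual chan.zip format.

```python
from jinja2 import Template  # pip install Jinja2

# Hypothetical thread template; the real chan.zip layout may differ.
THREAD_TEMPLATE = Template("""\
<h1>/{{ board }}/ - thread {{ thread_id }}</h1>
{% for post in posts %}
<div class="post" id="p{{ post.post_id }}">
  {% if post.image_url %}<a href="{{ post.image_url }}">{{ post.image_name }}</a>{% endif %}
  <blockquote>{{ post.body }}</blockquote>
</div>
{% endfor %}
""")

# Render one thread's rows (e.g. pulled from the SQLite database) to HTML.
html = THREAD_TEMPLATE.render(
    board="b",
    thread_id=21451323,
    posts=[{"post_id": 1, "image_url": None, "image_name": None,
            "body": "example post"}],
)
print(html)
```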