Sitemap Generation - internetarchive/openlibrary GitHub Wiki
See original documentation at: http://gio.blog.archive.org/2015/02/02/ol-sitemap-generation/ http://gio.blog.archive.org/2015/02/04/ol-how-to-generate-the-sitemaps/
Sitemaps are generated on ol-home
using the latest data dump ol_dump.txt.gz
as the source file. See Generating Data Dumps for more on generating ol_dump
The last dump is available at: http://openlibrary.org/data/ol_dump_works_latest.txt.gz for more details, you can see https://openlibrary.org/developers/dumps.
To generate the Sitemap, execute this code on ol-home:
ssh ol-home
cd /1/var/tmp # this way you can rsync it from rsync://ol-home/var_1/tmp/
python /1/var/lib/openlibrary/deploy/openlibrary/openlibrary/data/sitemap.py ol_dump_works_latest.txt.gz
# or is this the right script? /opt/openlibrary/openlibrary/scripts/2009/01/sitemaps/sitemap.py
After the sitemap is generated you need to place it in /1/var/lib/openlibrary/sitemaps as defined in /olsystem/etc/nginx/sites-available/openlibrary.conf on ol-www1.
location ~ ^/static/(docs|tour|sitemaps|jsondumps|images/shelfview|sampledump.txt.gz)(/.*)?$ {
root /1/var/lib/openlibrary/sitemaps;
autoindex on;
rewrite ^/static/(.*)$ /$1 break;
}
You can do this using rsync:
sudo rsync -av rsync://ol-home/var_1/tmp/sitemaps/sitemaps /1/var/lib/openlibrary/sitemaps/
curl -I https://openlibrary.org/static/sitemaps/siteindex.xml.gz
HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Wed, 04 Feb 2015 16:48:50 GMT
Content-Type: text/plain
Content-Length: 14689
Last-Modified: Wed, 04 Feb 2015 12:48:06 GMT
Connection: keep-alive
Accept-Ranges: bytes
#!/bin/bash
SITEINDEX="/1/var/lib/openlibrary/sitemaps/sitemaps/siteindex.xml.gz"
SERVERS="ol-covers0 ol-www0"
for SERVER in $SERVERS; do
LAST_UPDATED=$(ssh $SERVER ls -l --time-style=long-iso $SITEINDEX | cut -d' ' -f6)
echo "Sitemaps on $SERVER were last updated on ${LAST_UPDATED}."
done
echo "Ensure that the file dates on servers ($SERVERS) are the first day of the current month."
Sitemaps on ol-covers0 were last updated on 2023-07-01.
Sitemaps on ol-www0 were last updated on 2023-07-01.
Ensure that the file dates on servers (ol-covers0 ol-www0) are the first day of the current month.