# Maven Indexing
To obtain a representative test base for Jade, we decided to index the Maven Central repository.
The main repository is hosted by Sonatype, but it is difficult to access programmatically.
Starting in 2015, Google began hosting a mirror of the repository on Google Cloud Storage. The root of this repository is located at https://storage.googleapis.com/maven-central/.
Although the `robots.txt` file says all user-agents are prohibited, we emailed Les Vogel through the [email protected] address, and he said we were fine to do whatever we wanted (within reason).
Accessing files on Google Cloud Storage is relatively straightforward. A (very simple) example is given here under the "Cloud Storage" heading, but a more thorough walkthrough will be given below.
(All example code given is in Python 3.7.0.)
## Accessing the repository
To access files on Google Cloud Storage (GCS), it appears necessary to have an account with Google Cloud Platform (GCP). Accounts can be created for free. Once an account is created, follow these instructions (under the heading "Obtaining and providing service account credentials manually" in the "GCP CONSOLE" box) to obtain an authentication `.json` file on your local machine. Almost nothing matters about the configuration except that the file correctly corresponds to your account.
Once you have your file (which I will refer to as `auth.json`) on your local machine, install the GCS Python library via pip:

```
$ pip install google-cloud-storage
```
(Note that if you use both Python 2 and 3, you may need to specify `pip3` or else a full path to the appropriate `pip` executable for your Python interpreter of choice.)
To interact with the repository, we need to obtain a `Bucket`:

```python
from google.cloud import storage

MAVEN_BUCKET = 'maven-central'
AUTH_FILE = 'auth.json'

# Authenticate with the service-account credentials, then look up the bucket.
client = storage.Client.from_service_account_json(AUTH_FILE)
bucket = client.get_bucket(MAVEN_BUCKET)
```
"Bucket" is the GCP term for what might otherwise be called a "repository". It is essentially just a collection of files (called "blobs" in the GCP lexicon) and metadata about those files. The maven-central
bucket contains all of the files of the Maven repository and sufficient metadata to process those files (such as by recreating a local Maven clone, or determining which files are largest, etc.).
## Building a local index of files in the repository
We now have access to the bucket in the form of a `Bucket` object in the interpreter. There are many methods on the `Bucket`, but we only care about `bucket.list_blobs()`, which provides an iterator over all the blobs (objects) in the bucket (repository). (This method accepts an optional parameter, `max_results`, which denotes the maximum number of blobs to iterate through.)
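For instance, a quick sanity check (a minimal sketch, reusing the `bucket` object from above) can peek at the first few blobs before committing to the full listing:

```python
# List just the first ten blobs as a smoke test.
for blob in bucket.list_blobs(max_results=10):
    print(blob.name, blob.size)
```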
A tab-separated index file (`index.tsv`) can be generated:

```python
# Write one line per blob: index number, blob name, size in bytes.
with open('index.tsv', 'w') as f:
    for i, blob in enumerate(bucket.list_blobs(), start=1):
        f.write(f"{i}\t{blob.name}\t{blob.size}\n")
```
The file will have three columns: the number of the blob in the index, the name of the blob (which is the full file name in the repository), and the size of that blob in bytes. For example, here are the first ten lines of the `index.tsv` file I generated:
```
1	README.md	1238
2	index.html	3155
3	repos/central/data/./94a8262a403880.properties	301
4	repos/central/data/./9e9bbc30f020cf.properties	310
5	repos/central/data/./9e9bbc30f020cf.properties.md5	32
6	repos/central/data/./9e9bbc30f020cf.properties.sha1	40
7	repos/central/data/./archetype-catalog.xml	6552513
8	repos/central/data/./archetype-catalog.xml.md5	32
9	repos/central/data/./archetype-catalog.xml.sha1	40
10	repos/central/data/./fb69c44c24b38.properties	307
```
Note that building the complete index took just shy of 9 hours, and there does not appear to be a faster way to perform this operation.
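Given how long the listing runs, it may be worth emitting periodic progress so a stalled run is noticeable; a minimal sketch (the reporting interval of 100,000 is arbitrary):

```python
# Same indexing loop as above, but report progress every 100,000 blobs.
with open('index.tsv', 'w') as f:
    for i, blob in enumerate(bucket.list_blobs(), start=1):
        f.write(f"{i}\t{blob.name}\t{blob.size}\n")
        if i % 100_000 == 0:
            print(f"indexed {i:,} blobs so far")
```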
## Downloading individual blobs
To download a blob `blob-name` to a file `file-name`, simply do:

```python
blob = bucket.get_blob('blob-name')
blob.download_to_filename('file-name')
```
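Blob names contain `/` separators, so if you want downloads to mirror the repository layout, the parent directories need to exist first. A minimal sketch (the helper name is our own, not part of the GCS API):

```python
import os

def download_blob(bucket, blob_name, dest_root='.'):
    """Download a blob to dest_root, mirroring the repository's layout."""
    path = os.path.join(dest_root, blob_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    bucket.get_blob(blob_name).download_to_filename(path)
```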
## Processing the index
Now we have an index (`index.tsv`) that tells us all of the files in Maven as well as their sizes. As of this writing, a little processing provided the following statistics from the index (a sketch of this processing appears after the list):
- Index file is 8.3 GiB
- ~71M blobs (71,364,531)
- ~7.8M .jar files (7,752,139)
- ~270k artifacts (269,285)
- Total size of all files in repo: ~9TiB (10,103,426,642,816 bytes)
- Total size of just .jar files: ~4TiB (4,586,501,379,706 bytes)
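The processing was straightforward; a minimal sketch of how the blob and `.jar` totals can be derived from `index.tsv` (artifact counting is omitted, since it requires parsing group/artifact directory paths):

```python
# Tally overall and .jar-only counts and byte totals from the index.
total_count = total_size = jar_count = jar_size = 0
with open('index.tsv') as f:
    for line in f:
        _, name, size = line.rstrip('\n').split('\t')
        total_count += 1
        total_size += int(size)
        if name.endswith('.jar'):
            jar_count += 1
            jar_size += int(size)
print(f"{total_count:,} blobs, {total_size:,} bytes")
print(f"{jar_count:,} .jar files, {jar_size:,} bytes")
```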
## Hash and signature files
Further processing the list of files reveals that the predominant file types by extension are (a sketch of the tally appears after the list):

- `.md5` (18,072,497)
- `.sha1` (18,050,130)
- `.asc` (10,896,218)
- `.jar` (7,752,139)
- `.json` (6,307,140)
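A minimal sketch of that tally, treating everything after the last dot as the extension:

```python
from collections import Counter

# Count filename extensions across the whole index.
extensions = Counter()
with open('index.tsv') as f:
    for line in f:
        name = line.split('\t')[1]
        _, dot, ext = name.rpartition('.')
        extensions['.' + ext if dot else '(no extension)'] += 1

for ext, count in extensions.most_common(5):
    print(ext, count)
```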
`.md5` and `.sha1` files contain only hashes used to verify the integrity of other files. That is, a file `foo.bar` may have a `foo.bar.md5` or a `foo.bar.sha1` (or both), in which case `foo.bar.md5` and/or `foo.bar.sha1` contain hashes of the file `foo.bar`.
`.asc` files contain GPG signatures for a similar purpose. So a `foo.bar.asc` file contains the GPG signature of `foo.bar`.
It may be worth noting that most `.asc` files seem to also have corresponding `.md5` and `.sha1` files, such that it is common to see all of the following:
```
foo.bar
foo.bar.asc
foo.bar.asc.md5
foo.bar.asc.sha1
foo.bar.md5
foo.bar.sha1
```
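As an aside, these companion files make it easy to check the integrity of a download. A minimal sketch of verifying a blob against its `.md5` companion (the `foo.bar` names are placeholders, and we assume the `.md5` file's first token is the hex digest):

```python
import hashlib

# Download the file and its .md5 companion, then compare digests.
bucket.get_blob('foo.bar').download_to_filename('foo.bar')
# Take the first token in case a filename follows the digest.
expected = bucket.get_blob('foo.bar.md5').download_as_string().decode().split()[0]
with open('foo.bar', 'rb') as f:
    actual = hashlib.md5(f.read()).hexdigest()
print('OK' if actual == expected else 'MISMATCH')
```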
We wanted to be sure of this co-occurrence assertion, though. We needed to verify that every `foo.md5`, `foo.sha1`, or `foo.asc` corresponds to an existing `foo`. To that end, we used some shell one-liners.
### Verifying whether every hash file corresponds to a base file
First, we produced a file containing just the filenames for every file in the index. We did this so we could later use the `comm` utility, which does a fast byte-wise line-by-line comparison of two files (and which expects both inputs to be sorted; conveniently, GCS lists blobs in lexicographic order, so `filenames.txt` comes out already sorted). This was done by:

```
$ perl -ane 'print "$F[1]\n"' < index.tsv > filenames.txt
```
This puts the filenames in the file `filenames.txt`.
Then we produced a file containing the names of files which we expect to exist, based on the presence of their hash files (either `.md5` or `.sha1`):

```
$ perl -ane 'if ($F[1] =~ /\.(md5|sha1)/) {print "$`\n"}' < index.tsv > hash-basenames.txt
```

(Perl's `` $` `` variable holds the portion of the string before the regex match, i.e., the expected base filename.)
Applied to the example file names from the previous section, this would produce one line per hash file:

```
foo.bar.asc
foo.bar.asc
foo.bar
foo.bar
```
We can see that there are some duplicates. To remove duplicates and also sort the output, we wrote small programs (TODO: link to those):

```
$ ./uniqsemisort < hash-basenames.txt > sorted-hash-basenames.txt
```
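(If those programs are unavailable, a plain `sort -u` should give an equivalent de-duplicated, sorted result.)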
For example, the previous file names would be reduced to:

```
foo.bar
foo.bar.asc
```
Now we can compare the expected filenames (in `sorted-hash-basenames.txt`) to the full list of existing filenames (`filenames.txt`):

```
$ comm -1 -3 filenames.txt sorted-hash-basenames.txt > hash-comparison.txt
```
The resulting output file, `hash-comparison.txt`, contains a list of files which we expected to exist (based on the presence of either a `.md5` or a `.sha1` file) but which did not exist.
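The same comparison can also be done in Python with set operations, at the cost of holding every filename in memory (substantial for a 71M-line index); a minimal sketch for cross-checking:

```python
# Recompute the missing-file list from index.tsv using set difference.
names = set()
expected = set()
with open('index.tsv') as f:
    for line in f:
        name = line.split('\t')[1]
        names.add(name)
        for ext in ('.md5', '.sha1'):
            if name.endswith(ext):
                expected.add(name[:-len(ext)])

missing = sorted(expected - names)
print(len(missing))
```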
### Analyzing the missing hash files
We came up with some 2,985 missing files.
There are 54 `central-metadata.json` files. These are not listed when browsing Maven Central's folders through the browser, but they are accessible.

There are 495 `maven-metadata.xml` files. All of these are nested inside dot-folders (either `.DAV` or `.svn`). There is also one extra `#maven-metadata.xml`.

There are 154 `*.gz` files. 150 of these are in the top-level `.index` directory and appear to be concerned with the index itself; the other 4 exist elsewhere.
### Verifying whether every GPG signature file corresponds to a base file
A similar process can be used for verifying the `.asc` (GPG signature) files. Assuming we already have `filenames.txt` from before:

```
$ perl -ane 'if ($F[1] =~ /\.asc/) {print "$`\n"}' < index.tsv > asc-basenames.txt
```
Then we uniqsemisort it:

```
$ ./uniqsemisort < asc-basenames.txt > sorted-asc-basenames.txt
```
And compare to the original filenames:

```
$ comm -1 -3 filenames.txt sorted-asc-basenames.txt > asc-comparison.txt
```
The resulting output file, `asc-comparison.txt`, contains a list of files which we expected to exist (based on the presence of a `.asc` file) but which did not exist.
### Analyzing the missing GPG signature files
We found 4,471 files in `sorted-asc-basenames.txt`.