Archiver Overview - sociam/xray-archiver GitHub Wiki

Archiver Overview

Introduction

The Archiver is responsible for retrieving app meta-data, downloading and analysing app APKs, and storing all resulting data in the PostgreSQL database. Aside from the Analyser, which is implemented in PostgreSQL, everything is implemented in Node.js.

The process of archiving applications begins with the collation of search terms, which are then used to identify apps to download, analyse, and retrieve meta-data for.

Explorer

The search term Explorer builds a queue of terms that can be used to request lists of apps from the Google Play Store. It operates by requesting search term auto-completions from the Google Play Store and adding each response to the X-Ray database.

The input strings are the distinct results of the cross product of the alphabet (plus an empty character). Each string is entered into the Google Play search, which returns a list of 5 auto-complete suggestions.

For example:

Input : "fa"

Output : [
    "Facebook",
    "Facebook Messenger",
    "Fake GPS",
    "Face Time",
    "Fantasy Football"
]
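The set of input strings described above can be sketched as follows. This is a minimal illustration, not the Explorer's actual code: it assumes the cross product is taken over two positions (which matches the two-letter input "fa" in the example) and that the alphabet is the 26 lowercase Latin letters. The names `ALPHABET` and `buildSeedTerms` are hypothetical.

```javascript
// Assumption: lowercase Latin letters only; the real Explorer may use a different alphabet.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz'.split('');

// Build every distinct string of length 1 or 2 by pairing symbols,
// where the empty string stands in for the "empty char".
function buildSeedTerms(alphabet = ALPHABET) {
  const symbols = ['', ...alphabet];
  const terms = new Set(); // a Set removes duplicates such as '' + 'a' and 'a' + ''
  for (const first of symbols) {
    for (const second of symbols) {
      const term = first + second;
      if (term.length > 0) {
        terms.add(term); // skip the pair of two empty characters
      }
    }
  }
  return [...terms].sort();
}

const seeds = buildSeedTerms();
console.log(seeds.length); // 26 single letters + 26 * 26 pairs = 702
```

Each resulting string would then be sent to the auto-complete endpoint, and the suggestions stored as search terms.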

The auto-complete suggestions made by the Google Play Store are based on the searches commonly made from devices within the same region. This means that requesting search terms in the US returns different suggestions from those returned when requesting them in the UK.

The Explorer is implemented in Node.js.

Retriever

The Retriever uses the information on the Google Play Store to build a repository of mobile app meta-data, including the number of downloads and the rating that an app has on the Google Play Store. For simplicity, the Retriever makes use of google-play-scraper, a Node.js package that provides a set of scraping methods specifically tailored to an app's Google Play Store page.
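As a rough sketch of what fetching meta-data for a single app might look like, the function below takes the scraper as a parameter and keeps only a few fields. It assumes the scraper exposes an `app({ appId })` method that resolves to a record with `appId`, `title`, `installs` and `score`, as google-play-scraper does. The function name `fetchAppMetadata` and the choice of fields are hypothetical and do not reflect the Retriever's actual schema.

```javascript
// `scraper` is any object with an async app({ appId }) method.
// Passing it in (rather than importing it here) keeps the sketch testable offline.
async function fetchAppMetadata(scraper, appId) {
  const details = await scraper.app({ appId });
  // Keep only the fields this sketch cares about: download count and rating.
  return {
    appId: details.appId,
    title: details.title,
    installs: details.installs,
    score: details.score,
  };
}
```

With the real package, the imported google-play-scraper module would be passed as `scraper`, along with the package name of the app to look up.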

The Retriever relies on a queue of search terms to know which apps it should scrape information for. The terms are stored in the search_terms table, and the Retriever does not care where they come from, so instead of using the Explorer to initialise the table, you could populate it with your own set of search terms.

The Retriever begins with the search term that was least recently used, and requests the app package names for the 120 apps that the Google Play Store deems most relevant to that term. It then requests additional information for each resulting package name, one at a time, inserting the results into the xray DB as it goes.
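The loop described above could be sketched roughly as follows. This is an in-memory illustration under several assumptions: the terms are plain objects with a numeric `lastSearched` timestamp instead of rows in the search_terms table, the scraper exposes `search({ term, num })` and `app({ appId })` as google-play-scraper does, and `saveApp` is a hypothetical callback standing in for the database insert. The names `pickOldestTerm` and `retrieveOnce` are also hypothetical.

```javascript
const APPS_PER_TERM = 120; // number of search results requested per term

// Return the entry with the oldest lastSearched timestamp, or null if there are none.
function pickOldestTerm(terms) {
  if (terms.length === 0) {
    return null;
  }
  return terms.reduce((oldest, t) => (t.lastSearched < oldest.lastSearched ? t : oldest));
}

// Process one search term: search, fetch details one app at a time, save each result.
async function retrieveOnce(terms, scraper, saveApp, now = Date.now()) {
  const entry = pickOldestTerm(terms);
  if (entry === null) {
    return null;
  }
  const results = await scraper.search({ term: entry.term, num: APPS_PER_TERM });
  for (const { appId } of results) {
    const details = await scraper.app({ appId }); // sequential, not parallel
    await saveApp(details);
  }
  entry.lastSearched = now; // mark the term as used so other terms come first next time
  return entry.term;
}
```

Awaiting each `app` call inside the loop mirrors the one-at-a-time behaviour described above, and updating `lastSearched` afterwards is what moves the term to the back of the queue in this sketch.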