Purpose Extraction - UCI-Networking-Group/OVRseen GitHub Wiki
This page explains the workflow of OVRseen that leverages Polisis to extract data collection purposes from privacy policies.
The dependencies for this part have been covered in the dependencies for OVRseen's network-to-policy consistency analysis. Please also run the following commands to activate a Python virtual environment (with the right dependencies) before using OVRseen.
OVRseen/virtualenv $ ./python3_venv.sh
OVRseen/virtualenv $ source python3_venv/bin/activate
Before we use OVRseen to perform privacy policy analysis using Polisis, we have to first set up the following.
- Polisis API key: Polisis is released as an online service. The authors expose an HTTP API for batch processing. To access the HTTP API, we need to obtain an API key from the authors. Please check their website for contact information. The steps below assume a valid API key.
- HTTP server: Polisis takes the URL of a privacy policy webpage as input. It then crawls the webpage, runs NLP models on their server, and performs the analysis. We observed that not all websites can be successfully crawled by Polisis, and some privacy policies we collected were not in HTML format. To make Polisis work reliably, we recommend hosting the privacy policy HTML files on your own web server and using its URLs as the input to Polisis. One can purchase a cheap VPS (e.g., Amazon EC2/Lightsail) for this purpose.
There are different ways to host HTML files on the Internet. Here, we assume two things: (1) the NGINX HTTP server software is installed on a Linux server, and (2) the server has a publicly accessible IP address or domain (e.g., 203.0.113.113).
Please copy the privacy policy HTML files from a local machine to the (remote) Linux server. These HTML files were obtained in step 4) of the Setup in OVRseen's network-to-policy consistency analysis.
$ ssh <user>@203.0.113.113 mkdir /srv/ovrseen
$ scp -r ext/html_policies/ <user>@203.0.113.113:/srv/ovrseen/privacy-policies/
This is a sample of the NGINX configuration to serve the directory over the Internet.
server {
    listen 80 default_server;
    listen [::]:80 default_server;
    location /ovrseen/ {
        alias /srv/ovrseen/;
        autoindex on;
    }
}
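To make the effect of the alias directive concrete, the following Python sketch mimics how the sample configuration above translates a request path into a file path on the server. The function name resolve_alias and the file name example.html are hypothetical and used only for illustration; this is not part of NGINX or OVRseen, and it only models the simple prefix substitution used here.

```python
from typing import Optional


def resolve_alias(request_path: str,
                  location: str = "/ovrseen/",
                  alias: str = "/srv/ovrseen/") -> Optional[str]:
    """Replace the matched location prefix with the alias path.

    Returns None when the request path does not fall under the location.
    """
    if not request_path.startswith(location):
        return None
    return alias + request_path[len(location):]


# Hypothetical policy file copied into /srv/ovrseen/privacy-policies/
print(resolve_alias("/ovrseen/privacy-policies/example.html"))
```

Under this mapping, a file stored in /srv/ovrseen/privacy-policies/ is served under the /ovrseen/privacy-policies/ URL path.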
Finally, once the server is set up properly, we should be able to access the privacy policies through http://203.0.113.113/ovrseen/privacy-policies/.
Please note that, unfortunately, the Polisis authors mentioned that they had to discontinue their online privacy policy analysis service as of September 2021 due to a technical issue.
To run this analysis, we reuse the ext/ directory that we created after running all the steps of OVRseen's network-to-policy consistency analysis.
1) Please run the following command to make a copy of ext/ for purpose extraction.
OVRseen/privacy_policy/purpose_extraction $ cp -Tr ../network-to-policy_consistency/ext ./ext
2) Please generate a list of privacy policy URLs as inputs to the Polisis API. In the following command, we again assume 203.0.113.113 as the public IP address of the server.
OVRseen/privacy_policy/purpose_extraction $ ls ext/html_policies/ | awk 'BEGIN{ print "App.Title,Privacy.Policy" }{ gsub(/\.html$/, ""); printf("%s,http://203.0.113.113/ovrseen/privacy-policies/%s.html\n", $1, $1); }' > url_list.csv
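For readers who prefer Python over awk, the same URL list can be sketched as follows. This is a minimal sketch, not part of OVRseen: the helper names build_url_rows and to_csv are hypothetical, and the base URL is assumed to be the location where the policies are served (here, the http://203.0.113.113/ovrseen/privacy-policies/ address mentioned earlier). Unlike the awk one-liner, it skips files that do not end in .html and keeps file names containing spaces intact; it does not URL-encode file names.

```python
import csv
import io
from typing import Iterable, List, Tuple


def build_url_rows(filenames: Iterable[str], base_url: str) -> List[Tuple[str, str]]:
    """Map each <title>.html file name to a (title, URL) pair."""
    rows = []
    for name in sorted(filenames):
        if not name.endswith(".html"):
            continue  # ignore anything that is not an HTML policy file
        title = name[:-len(".html")]
        rows.append((title, f"{base_url.rstrip('/')}/{name}"))
    return rows


def to_csv(rows: List[Tuple[str, str]]) -> str:
    """Render the rows as CSV text with the header used by url_list.csv."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["App.Title", "Privacy.Policy"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical file names; in practice one would pass os.listdir("ext/html_policies").
demo = build_url_rows(["app_b.html", "app_a.html"],
                      "http://203.0.113.113/ovrseen/privacy-policies/")
print(to_csv(demo), end="")
```

Writing the returned string to url_list.csv would give a file in the same two-column layout that the awk command produces.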
3) Please run the analyze_polisis.py script to interact with the Polisis API and save the results into ext/polisis_output/. Please replace <API_KEY> in the command line with a valid Polisis API key.
OVRseen/privacy_policy/purpose_extraction $ python3 analyze_polisis.py ext/polisis_output/ url_list.csv <API_KEY>
4) To extract data collection purposes, we need to process the output from Polisis in step 3) above. The output (i.e., polisis_output.zip) is also available in the intermediate_outputs folder of our datasets. Our extract_datasets.sh script should have found and copied the right files into the right locations in OVRseen's directory structure. If this script has not been run, please move polisis_output.zip into ext/ and run the following command.
OVRseen/privacy_policy/purpose_extraction/ext $ unzip polisis_output.zip
5) If step 1) above has not been done, please run the following command to make a copy of ext/ for purpose extraction.
OVRseen/privacy_policy/purpose_extraction $ cp -r ../network-to-policy_consistency/ext/* ./ext
6) Then, we run the following commands to map the data flows analyzed by PoliCheck to the text segments annotated with purposes by Polisis (this may take a while). This process is explained in detail in Section 4.2 in our paper.
OVRseen/privacy_policy/purpose_extraction $ python3 process-polisis-analysis-json.py ext
OVRseen/privacy_policy/purpose_extraction $ python3 extract-polisis-purpose.py ext
Finally, we will obtain ext/policheck_results_w_purposes_expanded.csv, which can also be found in intermediate_outputs in our datasets. The results we reported in Section 4.2 and Figure 7 were obtained from this output CSV file. For convenience, our script also prints out the following statistics, which correspond to the statistics in Section 4.2 and Figure 7.
all flows: 1135
consistent flows: 776
success and found purpose: 224 (expanded to 370 tuples)
success but no purpose: 69
failed: 66
merger acquisition: 64
service operation and security: 14
legal requirement: 9
marketing: 27
analytics research: 70
advertising: 119
additional service feature: 38
personalization customization: 12
basic service feature: 17
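As a quick sanity check on a reproduced run, the per-purpose counts can be compared against the number of expanded tuples: in the output above, the nine purpose counts sum to 370, the same number reported as "expanded to 370 tuples". The Python sketch below parses the printed lines and performs that comparison. It is not part of OVRseen; the helper name parse_counts is hypothetical, and treating the purpose lines as a breakdown of the expanded tuples is an inference from the matching totals.

```python
import re

# Relevant lines copied from the statistics printed above.
STATS = """\
success and found purpose: 224 (expanded to 370 tuples)
merger acquisition: 64
service operation and security: 14
legal requirement: 9
marketing: 27
analytics research: 70
advertising: 119
additional service feature: 38
personalization customization: 12
basic service feature: 17
"""


def parse_counts(text: str) -> dict:
    """Parse 'label: number' lines into a {label: int} dictionary."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s*(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_counts(STATS)
# Sum every line except the summary line itself.
purpose_total = sum(v for k, v in counts.items()
                    if k != "success and found purpose")
expanded = int(re.search(r"expanded to (\d+) tuples", STATS).group(1))
print(purpose_total, expanded)  # 370 370
```

If the two numbers differ after a fresh run, the inputs in ext/ are a reasonable first place to look.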