Repository - GiselleSerate/pandorica GitHub Wiki

Branches

master and develop should be self-explanatory branches. poc-mod is built for a current POC which needs a list of current bad domains (use is limited to to_file_parser.py).

Files in pandorica

Dockerfile: Defines the container that Pandorica will run in for testing. It currently doesn't work for the general case, but swapping out the CMD should be the only change needed to make this file useful for non-test applications as well.

Jenkinsfile: Defines the autorun Jenkins pipeline.

requirements.txt: Defines things that need to be pip installed. (Don't necessarily just pip freeze into here--not everything needs to be in here, some dependencies will just be resolved automatically by pip.)

Files in install

setup.sh: From SafeNetworking; installs mappings into Elasticsearch. Used only during first-time setup of your database.

Files in src

domain_docs.py: Don't run directly--defines status codes and documents to write to the database.

domain_processor.py: Tags untagged domains using AutoFocus. Safe to interrupt and resume at will.

interval_calculator.py: Calculates residence/reinsert intervals. Safe to interrupt and resume at will.

notes_parser.py: Downloads only the latest notes, then parses all downloaded notes. It is mostly safe to interrupt this script, as long as the actual download has happened.

pandorica.py: Runs the entire pipeline. In general, avoid running this script manually and use Jenkins. This ensures that logs persist in Jenkins and that all the HTML files end up downloaded to a folder with permissions such that the Jenkins agent can access them. (Raw Python does have the minuscule advantage over Jenkins that it only checks the Elasticsearch connection once instead of three times, but at the timescales we're dealing with this is negligible, and it's not worth the headache of trying to make sure Jenkins remains happy with the state of the database and the downloaded files.)

scraper.py: Don't run directly--contains the engtools scraper class which you can include in other things.

setup.py: Don't run directly--allows the src directory to be pip installed.

to_file_parser.py: Built for the POC team. Downloads the latest version notes off the firewall and parses out a subset of the added domains to a text file. None of this interacts with Elasticsearch.

Files in lib

.defaultrc: Gets loaded before the .panrc (if a .panrc exists). It will override environment variables set previously, so if it's necessary, you can delete/comment out lines from this in your local install.
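The load order described above can be sketched roughly as follows. This is only an illustration, not the code Pandorica uses: the `load_rc` helper is hypothetical, and it assumes the rc files hold plain `KEY=VALUE` lines with `#` comments, which the wiki does not confirm.

```python
import os
from pathlib import Path


def load_rc(path, environ=os.environ):
    """Hypothetical helper: copy KEY=VALUE lines from an rc file into environ.

    Values from the file override whatever was already set. Returns False
    if the file does not exist, True otherwise.
    """
    rc = Path(path)
    if not rc.is_file():
        return False
    for raw in rc.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE (assumed format).
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        environ[key.strip()] = value.strip()
    return True
```

Under these assumptions, calling `load_rc` on the .defaultrc first and on the .panrc second means the .panrc wins wherever both set the same key, and a line that is commented out or deleted in the .defaultrc leaves any previously set environment variable untouched.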

dns.py: From SafeNetworking; a collection of document definitions.

dnsutils.py: From SafeNetworking; a collection of functions to ask AutoFocus about domains and tags.

setuputils.py: Some functions to connect to and wait for Elasticsearch to be up. Presumably this could be in an __init__.py or something if we refactor Pandorica to be a module, but currently this just gets called manually if need be.
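A "wait until the service is up" helper of the kind described here might look something like the sketch below. The function name, parameters, and retry policy are all assumptions made for illustration; the actual functions in setuputils.py may differ.

```python
import time
from typing import Callable


def wait_for_service(ping: Callable[[], bool], retries: int = 5, delay: float = 1.0) -> bool:
    """Hypothetical helper: call ping() until it returns True or retries run out."""
    for attempt in range(retries):
        try:
            if ping():
                return True
        except Exception:
            # Treat connection errors as "not up yet" and try again.
            pass
        # Sleep between attempts, but not after the final one.
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Taking the check as a callable keeps the sketch independent of any particular client; with the Elasticsearch Python client, the client's `ping` method would presumably be a natural thing to pass in.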

sfnutils.py: From SafeNetworking; a collection of functions to get information from Elasticsearch caches.

Files in test

.testrc: Gets loaded to configure the test. In Jenkins, no .panrc is loaded; we need to establish a connection to Elasticsearch regardless.

Jenkinsfile: Configures the Jenkins test.

Updates_3026-3536.html: HTML document to parse during the test.

docker-compose.yaml: Defines containers to bring up during the test.

test_parser.py: Runs full test.

testcases.py: Defines parameters of the test we're running in a class.