ART Unit Testing

ART Testing Framework

The Adversarial Robustness Toolbox (ART) is a library that supports multiple machine learning frameworks simultaneously. For this reason, tests written for ART must be written with the understanding that they will be run across all frameworks ART supports.

This page will clarify how tests should be written to achieve this end, presenting the conventions used as well as the various test helper tools available in ART to simplify this process.

ART makes heavy use of pytest functionality such as fixtures. General information about fixtures can be found here.

The following sections present good example ART tests that can be used as templates.

1. Running a test with a specific framework

While debugging tests, it can be useful to run a given test with a specific framework. To do so, pass the command line argument mlFramework with the relevant framework name.

pytest -q tests/estimators/classification/test_common_deeplearning.py --mlFramework=pytorch

The mlFramework argument accepts the following frameworks: tensorflow, keras, keras_tf, pytorch, mxnet and scikitlearn. If no framework is provided, ART will run the tests with a default framework of its choice.
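
For instance, reusing the test module from the command above, the same tests can be run with a different framework, or with ART's default, by changing only this flag:

# run the same tests with Keras instead of PyTorch
pytest -q tests/estimators/classification/test_common_deeplearning.py --mlFramework=keras

# omit the flag to let ART choose its default framework
pytest -q tests/estimators/classification/test_common_deeplearning.py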

2. Writing Framework Agnostic Tests

In order to make tests framework agnostic, ART provides a number of pytest fixtures which hide any framework-specific concerns of the test code within the pytest conftest.py files. This makes writing tests for ART much easier and cleaner. A list of the most relevant ART fixtures can be found below.

As a general rule, tests should only implement the test logic, regardless of the framework being used. Any framework-specific code should be hidden away in the relevant pytest conftest.py files.

The following example presents a typical ART test.

import numpy as np
import pytest

from tests.utils import ARTTestException


@pytest.mark.framework_agnostic
def test_myTest(art_warning, get_default_mnist_subset, get_image_classifier_list):
    try:
        (x_train_mnist, y_train_mnist), (x_test_mnist, y_test_mnist) = get_default_mnist_subset

        classifier, sess = get_image_classifier_list(one_classifier=True)

        # example test code
        labels = np.argmax(y_test_mnist, axis=1)
        accuracy_2 = np.sum(np.argmax(classifier.predict(x_test_mnist), axis=1) == labels) / x_test_mnist.shape[0]
        assert accuracy_2 == 0.99
    except ARTTestException as e:
        art_warning(e)

  • get_default_mnist_subset: This test uses the get_default_mnist_subset fixture, which takes care of retrieving the MNIST dataset, shaped correctly for whatever framework the test is run with. The PyTorch and TensorFlow frameworks, for example, expect different image channel orderings; this fixture provides the test with the channel ordering corresponding to the framework being used.

  • get_image_classifier_list: The get_image_classifier_list fixture is used quite extensively within the tests and creates an image classifier using the framework the test is being run with. If a framework-specific implementation of an ART component does not exist yet, the test will fail gracefully and simply output a warning to notify that the test could not be run with this framework due to the missing component.

  • @pytest.mark.framework_agnostic: The @pytest.mark.framework_agnostic marker should be used in most cases. It indicates that, although the test can be run successfully in any framework, it does not depend on any framework-specific implementation. Hence there is no need to run the same test across all frameworks; one framework suffices, and ART will run the test with a randomly chosen framework. While most tests fit this category, there are exceptions. Tests located in test_common_deeplearning.py, for example, must always be run with all frameworks since they check whether the framework-specific implementations of ART classifiers produce exactly the same outputs.

  • try/except and art_warning: In some cases, framework-specific implementations of the classifiers or other components needed by a test will not have been written yet for a given framework. In order to move on gracefully to the next test, ART tests should be wrapped in a try/except clause and the caught ARTTestException passed to art_warning. This produces a report after testing completes, listing the component implementations currently missing for each framework.

3. ART Test Conventions

In addition to using fixtures, the following conventions are used across ART tests.

3.1 Test Framework Independence

  1. Test file names and test names themselves should not contain any reference to a specific framework. For instance, a test named test_feature_pytorch should be renamed to test_feature.
  2. As a rule of thumb, any framework-specific test code (e.g. if framework == "tensorflow": do this) should be placed in a relevant fixture in the appropriate conftest.py file (see the sketch after this list).
  3. In order to keep each test's framework-specific limitations (if any) explicit, please do not place pytest markers on a test class. Markers should instead be placed on each test function. In other words, the following pattern
@pytest.mark.skipMlFramework("tensorflow","scikitlearn", etc...)
class TestMyNewFeature:
    def test_feature1(self, param1):
        pass

    def test_feature2(self, param1):
        pass

should be replaced by the following

@pytest.mark.skipMlFramework("tensorflow","scikitlearn", etc...)
def test_feature1(param1):
    pass


@pytest.mark.skipMlFramework("tensorflow","scikitlearn", etc...)
def test_feature2(param1):
    pass

  4. In order to increase test code readability, please refrain from hardcoding np.asarray() literals within the test code. Instead, please use the store_expected_values and expected_values fixtures for that purpose (see section 6 below).
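
As referenced in item 2 above, the sketch below illustrates what such a conftest.py fixture could look like. It is purely hypothetical (the fixture name expected_channel_ordering and its return values are made up for illustration); the point is that the if framework == "..." branching lives in conftest.py, while the test itself only consumes the result. The framework fixture used here is listed in section 4.3.

import pytest


@pytest.fixture
def expected_channel_ordering(framework):
    # hypothetical fixture placed in a conftest.py file: the framework-specific
    # branching is hidden here so that the tests themselves stay framework agnostic
    if framework == "pytorch":
        return "channels_first"  # PyTorch expects channels-first (NCHW) images
    return "channels_last"  # e.g. TensorFlow/Keras default to channels-last (NHWC)

A test then simply declares expected_channel_ordering as an argument and contains no framework checks of its own.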

3.2 New Fixtures

In order to keep ART tests maintainable over time, it is essential we use standardised fixtures across all tests. For this reason, new fixtures should be created only as a last resort.

If you feel a new fixture should be created, please follow these guidelines:

  1. Before creating a new pytest fixture, please ensure a similar one hasn't already been created for another test (existing fixtures can be found in the conftest.py files throughout the project).
  2. If a similar fixture already exists, please refrain from creating a new one. Instead, either a) try to alter your test to use the existing fixture, or b) improve the existing fixture to take your new use case into account.
  3. If you feel there is a genuine need for a new fixture, it should be placed in a conftest.py file located in the directory of the test file using it.
  4. If you do create a new fixture, please reach out to the project owners so that we can add it to this documentation.

3.3 Reducing test code duplication

  1. An ART-wide random generator master seed is already set in the project root conftest.py file, so there is no need to add master_seed(1234) calls within test code.
  2. If the same test needs to be run for multiple combinations of parameters, please do not create loops over each parameter combination. Instead, please use standard pytest parametrization with @pytest.mark.parametrize (e.g. test_common_deeplearning.py::test_loss_functions()).
  3. If test code is repeated across tests, please encapsulate the repeated code in a method named backend_<testing_this> and call this method in each test (e.g. test_membership_inference.py::backend_check_accuracy()). A sketch of this pattern is shown below.
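
As mentioned in items 2 and 3, the sketch below combines both conventions. The test name, the parameter values and the body of the backend_check_accuracy helper are made up for illustration; only the fixture names come from this page, and ARTTestException is assumed to be importable from tests/utils.py as in the first example above.

import numpy as np
import pytest

from tests.utils import ARTTestException


def backend_check_accuracy(classifier, x_test, y_test, min_accuracy):
    # shared assertion logic, called from several tests instead of being copy-pasted
    labels = np.argmax(y_test, axis=1)
    accuracy = np.sum(np.argmax(classifier.predict(x_test), axis=1) == labels) / x_test.shape[0]
    assert accuracy >= min_accuracy


@pytest.mark.framework_agnostic
@pytest.mark.parametrize("min_accuracy", [0.9, 0.95])
def test_my_feature(art_warning, get_default_mnist_subset, get_image_classifier_list, min_accuracy):
    try:
        (_, _), (x_test_mnist, y_test_mnist) = get_default_mnist_subset
        classifier, _ = get_image_classifier_list(one_classifier=True)
        backend_check_accuracy(classifier, x_test_mnist, y_test_mnist, min_accuracy)
    except ARTTestException as e:
        art_warning(e)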

4. Common useful ART fixtures and markers to be aware of

Here is a list of the most common ART fixtures available when writing tests. They can be found in the pytest conftest.py files within the project.

4.1 Dataset fixtures:

  • get_mnist_dataset: provides the MNIST dataset with the image channels ordered for the framework being used
  • get_iris_dataset: provides the iris dataset, formatted for the framework being used
  • get_default_mnist_subset: provides a smaller subset of the MNIST dataset
  • image_data_generator: returns a data generator over the MNIST dataset, specific to the framework the test is being run with
  • mnist_shape: provides the shape of the MNIST dataset based on where the channel dimension is positioned
  • image_iterator: returns an image iterator specific to the framework the test is being run with
  • create_test_image: creates a default test image
  • audio_data: creates audio fixtures of shape (nb_samples=3,) with elements of variable length
  • audio_batch_padded: creates audio fixtures of shape (batch_size=2,) with elements of variable length

4.2 Component fixtures:

  • image_dl_estimator: provides an image deep learning estimator corresponding to the framework the test is being run with
  • tabular_dl_estimator: provides a tabular deep learning estimator corresponding to the framework the test is being run with
  • image_dl_estimator_defended: provides a defended version of the estimator returned by image_dl_estimator
  • image_dl_estimator_for_attack: provides an image deep learning estimator, for the framework the test is being run with, that can be used to perform a specific attack
  • decision_tree_estimator: returns a decision tree estimator specific to the framework the test is being run with
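
As a rough sketch of how these fixtures are typically consumed: the example below assumes that image_dl_estimator follows the same factory pattern as get_image_classifier_list in the example above (i.e. it is called inside the test to obtain an estimator and an optional session); check the fixture definition in the relevant conftest.py for its exact signature. The test name and assertion are made up for illustration.

@pytest.mark.framework_agnostic
def test_my_estimator_feature(art_warning, image_dl_estimator, get_default_mnist_subset):
    try:
        # assumption: the fixture returns a callable yielding (estimator, session),
        # mirroring the get_image_classifier_list pattern shown earlier on this page
        estimator, sess = image_dl_estimator()
        (_, _), (x_test_mnist, y_test_mnist) = get_default_mnist_subset

        predictions = estimator.predict(x_test_mnist)
        assert predictions.shape[0] == x_test_mnist.shape[0]
    except ARTTestException as e:
        art_warning(e)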

4.3 Test Util fixtures:

  • framework: returns the name of the framework this test is currently running with
  • create_test_dir: creates a temporary test directory
  • store_expected_values: stores any large values needed for a test in a json file; the expected_values fixture can be used thereafter to retrieve these values
  • expected_values: retrieves the values expected for a given test, previously stored using store_expected_values. This fixture identifies whether a value needed for the test should take into account the framework the test is being run with or not.

5. ART markers:

ART provides the following markers:

  • @pytest.mark.framework_agnostic: indicates that, although the test can be run successfully in any framework, it does not depend on any framework-specific implementation. Hence there is no need to run the same test across all frameworks; one randomly chosen framework suffices.
  • @pytest.mark.skipMlFramework("tensorflow","scikitlearn", etc...): indicates that a test should be skipped for the specified mlFramework values (for instance because it currently fails when run with those frameworks). Valid values are: "tensorflow1", "tensorflow2", "keras", "kerastf", "pytorch", "mxnet", "scikitlearn", as well as "tensorflow" (shorthand for "tensorflow1" and "tensorflow2"), "dl_frameworks" (shorthand for "tensorflow", "keras", "kerastf", "pytorch", "mxnet") and "non_dl_frameworks" (shorthand for "scikitlearn").
  • @pytest.mark.skip_travis(): to be used in exceptional circumstances; indicates that this test should be ignored by Travis.
  • @pytest.mark.skipModule("apex.amp", etc...): checks whether a python module exists within the virtual environment and, if not, skips the test (currently supported values are "apex.amp" and "deepspeech_pytorch").
  • @pytest.mark.only_with_platform("keras") (DEPRECATED): this marker is deprecated and should only be used for legacy tests that are not yet framework agnostic. Use @pytest.mark.skipMlFramework instead.
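
As a purely illustrative sketch (the test name is hypothetical and the marker arguments are simply valid values taken from the list above), several markers can be stacked on a single test:

@pytest.mark.skipMlFramework("non_dl_frameworks")
@pytest.mark.skipModule("apex.amp")
def test_my_mixed_precision_feature(art_warning, image_dl_estimator):
    try:
        # test body omitted; the test is skipped for scikitlearn and whenever
        # the apex.amp module is not installed in the virtual environment
        pass
    except ARTTestException as e:
        art_warning(e)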

6. Testing code with Expected Values

At times, tests need to assert that a given component produces an expected value. Such expected values can be numerous and consist of very large arrays, which makes the test code unnecessarily convoluted and much harder to read. ART provides two helper fixtures which cache any expected values required, and thus make your test code smaller and much more readable.

While writing your test, the first version using hardcoded expected values can use the store_expected_values fixture to cache those values as follows:

@pytest.mark.framework_agnostic
def test_myTest(get_default_mnist_subset, get_image_classifier_list, store_expected_values):
    try:
        (x_train_mnist, y_train_mnist), (x_test_mnist, y_test_mnist) = get_default_mnist_subset

        classifier, sess = get_image_classifier_list(one_classifier=True)

        expected_value1 = np.asarray(
            [
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                2.3582461e-03,
                4.8802234e-04,
                1.6699843e-03,
                -6.4777887e-05,
                -1.4215634e-03,
                -1.3359448e-04,
                2.0448549e-03,
                2.8171093e-04,
                1.9665064e-04,
                1.5335126e-03,
                1.7000455e-03,
                -2.0136381e-04,
                6.4588618e-04,
                2.0524357e-03,
                2.1990810e-03,
                8.3692279e-04,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
                0.0000000e00,
            ]
        )
    
        # ... more expected value arrays 
    
        # example test code
        labels = np.argmax(y_test_mnist, axis=1)
        accuracy_2 = np.sum(np.argmax(classifier.predict(x_test_mnist), axis=1) == labels) / x_test_mnist.shape[0]
        assert accuracy_2 == expected_value1
    
        store_expected_values(expected_value1, expected_value2, ...)
    except NotImplementedError as e:
        warnings.warn(UserWarning(e))

Once the expected values have been cached, the final version of the test can be made more readable and simpler by using the expected_values fixture as follows:

@pytest.mark.framework_agnostic
def test_myTest(get_default_mnist_subset, get_image_classifier_list, expected_values):
    try:
        (x_train_mnist, y_train_mnist), (x_test_mnist, y_test_mnist) = get_default_mnist_subset
    
        # create an image classifier for the framework this test is being run with
        classifier, sess = get_image_classifier_list(one_classifier=True)

        # retrieve the cached expected values
        (expected_value1, expected_value2, ...) = expected_values
    
        # example test code
        labels = np.argmax(y_test_mnist, axis=1)
        accuracy_2 = np.sum(np.argmax(classifier.predict(x_test_mnist), axis=1) == labels) / x_test_mnist.shape[0]
        assert accuracy_2 == expected_value1
    except NotImplementedError as e:
        warnings.warn(UserWarning(e))