Experiments

What is an experiment?

An experiment in this context is an empirical study that tests some hypothesis. For example, what is the impact of solar energy supply on customer prices? Or, at the higher-complexity end, an experiment might try to discover dominance relationships among broker strategies -- see for example Patrick Jordan et al., “Empirical Game-Theoretic Analysis of the TAC Supply Chain Game,” in Proceedings of AAMAS ’07.

Experiment design

An experiment typically involves a hypothesis, a baseline, a treatment, and an analysis that determines whether the observed effect of the treatment allows the null hypothesis to be rejected with at least some level of confidence, typically 95%. A good reference is Paul Cohen's book Empirical Methods for Artificial Intelligence (MIT Press 1995). Eric Sodomka's paper on empirical studies gives good background on controlling exogenous variability in an experimental setup like Power TAC.
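
As a concrete illustration of the analysis step, here is a minimal sketch in Python (using scipy) that pairs baseline and treatment results by initial-condition set and asks whether the mean difference is distinguishable from zero at 95% confidence. The profit numbers and variable names are hypothetical placeholders, not output of any Power TAC tool.

# Minimal sketch of the analysis step: paired t-test of a treatment against a baseline.
# Assumes one summary value (e.g., total broker profit) per simulation session,
# paired by initial-condition set (same seeds, weather, boot record).
# The numbers below are hypothetical placeholders.
from scipy import stats

baseline  = [412_000, 398_500, 430_200, 405_750, 421_300]
treatment = [425_100, 401_900, 447_600, 418_200, 433_800]

t_stat, p_value = stats.ttest_rel(treatment, baseline)   # paired test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("null hypothesis rejected at 95% confidence")
else:
    print("cannot reject the null hypothesis; consider adding instances")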

A visualization of an experiment is shown below. Each circle in the grid is a simulation session. Sessions along the x-dimension have different initial conditions (random seeds, weather); sessions along the y-dimension have different treatments related to the hypothesis.

(Figure: experiment grid)

The baseline is just a well-defined setup of server and broker configuration. A treatment is some variation on the baseline. It could be removing or adding one or more brokers, swapping out one or more brokers, or changing some aspect of broker or server configuration. Note that changing configuration in a broker is in some sense equivalent to swapping out a broker. For larger changes, it's often helpful to define a "trajectory" between the baseline and the treatment to see how the system behaves along that path. So if the treatment is 10,000 electric vehicles, then the trajectory might run from 100 to 10,000 in several steps, perhaps in search of some sort of "knee" in a significant performance measure; a simple way to lay out such steps is sketched below. For a given set of initial conditions, the baseline and each step in the trajectory each require their own simulation session.
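
For trajectory-style treatments like the electric-vehicle example above, it can help to lay out the steps before configuring anything. A minimal sketch, assuming geometric spacing; the population values and step count are illustrative, not part of any Power TAC configuration:

# Sketch: generate trajectory steps between a baseline and a treatment value.
# Geometric spacing is often more useful than linear spacing when searching
# for a "knee" in a performance measure. Values here are illustrative only.
import numpy as np

baseline_population = 100
treatment_population = 10_000
steps = 5

trajectory = np.geomspace(baseline_population, treatment_population, steps)
print([int(round(p)) for p in trajectory])   # [100, 316, 1000, 3162, 10000]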

Each experiment instance in a given experiment consists of two or more simulation sessions (y-dimension), one for the baseline and one for each treatment (steps in a trajectory). Larger numbers of instances (x-dimension) generally provide higher confidence for a given effect size. As outlined in the Sodomka paper, controlling exogenous sources of variability can reduce the number of simulation sessions needed, often by a large factor. In Power TAC, the sources of variability include weather data, the model state at the end of the boot session, the random seeds used for the various models, and the behavior of brokers. Of these, all but the broker behavior can be controlled, by selecting a number of fixed sets of weather sequences, boot records, and seeds. If the experiment consists of 20 instances, then you need 20 sets of initial conditions, each containing a weather sequence (or a location and date range), a boot record, and a full set of sim seeds; one way to organize these is sketched below. Note that weather data and random-number seeds can be extracted from existing state logs. The weather sequences and seed sets can be re-used across experiment sets, but boot records can be re-used only if the treatments do not affect the behavior of the simulation during the boot period. Game length is normally determined by one of the random seeds, but it may be useful to fix the game length (through server configuration) across treatments to reduce unnecessary variability.
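
The bookkeeping is easier if each weather-boot-seed combination is treated as one named instance; the full grid of sessions is then just the cross product of instances and configurations. A minimal sketch, with hypothetical placeholder names that simply mirror the per-instance vars in the example configuration near the end of this page:

# Sketch: the experiment grid as a cross product of instances (fixed initial
# conditions) and configurations (baseline plus treatments).
# File names and configuration names are hypothetical placeholders.
from itertools import product

instances = [   # one (seeds, boot, weather) set per instance
    (f"filenameS{i:02d}", f"filenameB{i:02d}", f"filenameW{i:02d}")
    for i in range(20)
]
configurations = ["baseline", "treatment-1", "treatment-2", "treatment-3"]

sessions = list(product(configurations, instances))
print(len(sessions), "sim sessions to run")   # 4 * 20 = 80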

Note that some treatments affect the content of the boot record. For example, varying the population or behavior of some customer or customer type will almost certainly change the customer-profile information in the boot record. For this type of treatment, it is necessary to generate a new boot record for each sim session.

Running an experiment

--- info on using the experiment scheduler ---

Analyzing results

In general, it does not make sense to run an experiment until you know exactly how you will measure the effect you are studying. Usually that will involve writing data extractors and analyzers. Many examples are available in the powertac-tools repo.
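
The details depend on the effect being measured, but the shape of the analysis is usually the same: pull one summary number per sim session, group by configuration, and compare. A minimal sketch, assuming a summary CSV with configuration, instance, and value columns; this layout is an assumption for illustration, not the output format of any powertac-tools analyzer:

# Sketch: aggregate one summary value per sim session and compare configurations.
# Assumes a file session-summaries.csv with columns
# configuration, instance, value -- a hypothetical layout.
import csv
from collections import defaultdict
from statistics import mean, stdev

results = defaultdict(list)          # configuration name -> per-session values
with open("session-summaries.csv") as f:
    for row in csv.DictReader(f):
        results[row["configuration"]].append(float(row["value"]))

for config, values in results.items():
    print(f"{config}: n={len(values)}  mean={mean(values):.1f}  sd={stdev(values):.1f}")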

It often happens that the effect is found, but the necessary confidence level is not reached. In many cases, this can be resolved by running another experiment set with the same baseline and treatment, but different weather-boot-seed sets.
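
A rough power calculation can suggest how many instances the combined experiment should contain before you schedule the follow-up set. A sketch using statsmodels; the effect size is a hypothetical placeholder and would be estimated from the first set of results:

# Sketch: estimate the number of instances needed to detect a paired
# (baseline vs. treatment) effect of a given standardized size.
# The effect size below is a hypothetical placeholder.
from statsmodels.stats.power import TTestPower

observed_effect_size = 0.45   # Cohen's d estimated from the first experiment set
required_n = TTestPower().solve_power(effect_size=observed_effect_size,
                                      alpha=0.05, power=0.8)
print(f"roughly {required_n:.0f} instances in total; run the difference as a new set")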

Adding treatments and instances

Frequently in this type of work, you will set up an experiment with a few treatments along some interesting dimension, and some number of instances (commonly 20-30, rarely over 40). Analysis of the results may suggest adding one or more treatments, either to expand the range of your intervention or to more finely subdivide an interesting portion of the range. You may also find, after running an experiment with 20 instances (remember, 20 instances with one baseline and 3 treatments adds up to 80 games, each of which typically takes two hours or more), that you can't quite get to 95% confidence. In these cases, you would like to be able to bring up the experiment again, add instances or treatments, and have the tool run the additional games and add them to the original experiment record.

Example

Here is an early example of an XML file that could specify an experiment comparing two different brokers in a custom configuration. Note that the actual experiment scheduler will almost certainly provide a user interface for this information, backed by a relational database.

<powertac-experiment>
  <experiment-set name="xxx">
    <server-version>1.5.0-SNAPSHOT</server-version>

    <!-- Brokers used in each instance -->
    <broker name="AgentUDE17" keepLogs="false"/>
    <broker name="COLDPower17" keepLogs="false"/>

    <!-- config common to all instances -->
    <config-var name="accounting.accountingService.bankInterest"
                value=".06"/>
    <config-var name="distributionutility.distributionUtilityService.feePerPoint"
                value="-24.0"/>
    <config-var name="server-main.competitionControlService.gameLength"
                value="1700"/>

    <instances>
      <!-- Each entry here defines one instance (one set of initial
           conditions); the number of entries determines the number of
           sims to run for each treatment configuration. If you wanted
           to replicate the environment from an archived game, you
           could just specify the same state log for each entry. You
           need to specify at least seeds and weather. If boot records
           are not given here, then a new one will be generated for
           each sim. -->
      <var seeds="filenameS00" boot="filenameB00" weather="filenameW00"/>
      <var seeds="filenameS01" boot="filenameB01" weather="filenameW01"/>
      <var seeds="filenameS02" boot="filenameB02" weather="filenameW02"/>
      <var seeds="filenameS03" boot="filenameB03" weather="filenameW03"/>
      <var seeds="filenameS04" boot="filenameB04" weather="filenameW04"/>
      <!-- typically 10 or more of these -->
    </instances>

    <reuse-boot-record>true</reuse-boot-record>

    <!-- Finally, here are the treatments -->
    <treatments>
      <!-- Here we define the treatments. Each treatment configuration
           will be run n times, where n is the number of entries in the
           instances element above. Each treatment is defined by a
           single broker or config-var value. -->
      <broker name="ewiTest022" keepLogs="true"/>
      <broker name="ewiTest023" keepLogs="true"/>
    </treatments>

  </experiment-set>
</powertac-experiment>
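
Since the per-instance vars are the most repetitive part of such a file, it may be convenient to generate that section rather than write it by hand. A sketch that reproduces the instances element of the example above; the attribute names follow the example, and the file-naming scheme is a placeholder:

# Sketch: generate the <instances> element shown in the example above.
# Attribute names match the example; file names are placeholders.
import xml.etree.ElementTree as ET

instances = ET.Element("instances")
for i in range(20):
    ET.SubElement(instances, "var", {
        "seeds":   f"filenameS{i:02d}",
        "boot":    f"filenameB{i:02d}",
        "weather": f"filenameW{i:02d}",
    })

ET.indent(instances)   # pretty-print; Python 3.9+
print(ET.tostring(instances, encoding="unicode"))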