Run Commands - PathmindAI/nativerl GitHub Wiki

Commands

To initiate training, set the following environment variables first. This step must be repeated every time you train a brand new RL policy.

Set Java Paths

export JAVA_HOME=`pwd`/jdk8u222-b10
export JDK_HOME=$JAVA_HOME
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64/:$LD_LIBRARY_PATH
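
To double-check that the paths resolved correctly (assuming the jdk8u222-b10 folder sits in your current working directory, as in the exports above), you can run:

java -version           # should report an OpenJDK 1.8.0_222 build
echo $LD_LIBRARY_PATH   # should include the jre/lib/amd64/server path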

Initialize Conda Environment

source conda/bin/activate
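
To confirm the environment is active (assuming the bundled conda folder used in the command above), check which Python is being picked up:

which python   # should point inside the conda/ directory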

Training Configuration

export REWARD_TERMS_SNIPPET='rewardTermsRaw[0] = after.goalReached * 0.1;' # Set your reward function here (see the multi-term example below)
export MAX_TIME_IN_SEC='43200' # Max training time in seconds (default is 43200, i.e. 12 hours)
export NUM_SAMPLES='4' # Parallel trials. We recommend 4.
export MAX_ITERATIONS='500' # Max training iterations (500 is usually sufficient)
export CHECKPOINT_FREQUENCY='25'
export RESUME=${RESUME:='false'}
export USER_LOG='true'
export SCHEDULER='PBT' # Population-Based Training: https://www.deepmind.com/blog/population-based-training-of-neural-networks
export EPISODE_REWARD_RANGE='0.01'
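
Because rewardTermsRaw is an array, the reward snippet can combine several weighted terms. Below is a sketch with a second, hypothetical observation named tardiness; replace it with a variable your own simulation actually exposes:

export REWARD_TERMS_SNIPPET='
    rewardTermsRaw[0] = after.goalReached * 0.1;
    rewardTermsRaw[1] = -after.tardiness * 0.05;
' # tardiness is a hypothetical example; use your own model's variables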

Compute Configuration

export NUM_WORKERS='1' # If you have additional CPU cores, we recommend setting this to 3 or 4.
export NUM_CPUS='1' # Doesn't do much, so keep this at 1.

Simulation Configuration

export MAIN_AGENT='Main' # The name of your main agent (typically "Main")
export EXPERIMENT_CLASS='Simulation' # The name of your simulation agent (typically "Simulation")
export EXPERIMENT_TYPE='Simulation'
export MAX_MEMORY_IN_MB='4096'

RL Configuration

export MULTIAGENT='false' # Set to 'true' for multi-agent training (see the example below)
export FREEZING='false'
export ACTIONMASKS='false' # Set to 'true' if you are using action masking
export TRAIN_BATCH_MODE='complete_episodes'
export NUM_HIDDEN_NODES='256'
export NUM_HIDDEN_LAYERS='2'
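
As an example, a multi-agent model that uses action masking would flip those two flags and leave the rest at their defaults:

export MULTIAGENT='true'   # enable multi-agent training
export ACTIONMASKS='true'  # the simulation must provide action masks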

Does Nothing, but Required by AnyLogic

export NAMED_VARIABLE='true'
echo > setup.sh
mkdir -p database
touch database/db.properties

Execute Training

bash train.sh
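
Since these variables need to be re-exported for every new policy, it can be convenient to collect the whole sequence into one wrapper script. Below is a minimal sketch; the script name run_training.sh and the directory layout are assumptions, so adjust it to your setup:

#!/usr/bin/env bash
# run_training.sh -- hypothetical wrapper; assumes jdk8u222-b10/, conda/, and train.sh
# live in the working directory, as in the sections above
set -e

# Java paths (see "Set Java Paths" above)
export JAVA_HOME=`pwd`/jdk8u222-b10
export JDK_HOME=$JAVA_HOME
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64/:$LD_LIBRARY_PATH

# Conda environment
source conda/bin/activate

# Copy the training, compute, simulation, and RL exports from the sections above here,
# then keep the AnyLogic placeholders:
export NAMED_VARIABLE='true'
echo > setup.sh
mkdir -p database
touch database/db.properties

# Launch training
bash train.sh

Each run is then simply bash run_training.sh.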

Key Configurations

You will not need to touch most of these configurations. However, some are crucial to ensure the policy learns well.

Training Duration

The settings below determine the duration of training. These should be good enough for 90% of use cases. For particularly complex simulations, try 24 hours and 2000 training iterations.

export MAX_TIME_IN_SEC='43200' # Default is 12 hours (43200 seconds)
export MAX_ITERATIONS='500' # 500 training iterations
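
For example, the 24-hour / 2,000-iteration suggestion for complex simulations translates to:

export MAX_TIME_IN_SEC='86400' # 24 hours
export MAX_ITERATIONS='2000'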

Early Stopping Criteria

To minimize wasted training time, NativeRL has a built-in early stopping mechanism. There are several stopping criteria, but the one below is the main one to focus on for now.

export EPISODE_REWARD_RANGE='0.01' # Stop if the reward doesn't change by more than 1% over 75 iterations

Compute

The settings below are immensely important. They dictate how much “experience” the RL policy gathers in each training iteration. The more, the better, but we are limited by compute.

export NUM_WORKERS='4' # More workers means more parallel AnyLogic simulation runs (i.e. episodes)
export NUM_SAMPLES='4' # Read about Population-Based Training to understand this one
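
If you are unsure how many workers your machine can handle, a rough heuristic (our assumption, not an official recommendation, and it assumes a Linux box with nproc available) is to leave one core free for the trainer itself:

CORES=$(nproc)                                        # number of available CPU cores
export NUM_WORKERS=$(( CORES > 1 ? CORES - 1 : 1 ))   # keep one core free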