Configuration Walkthrough

This document explains each configuration option for neosr. Templates can be used for convenience.

Initial Notes

  • Make sure to use single quotes (') instead of double quotes (") in paths if you are on Windows (this does not apply to unix-like systems). It applies to dataroot_* and other options that expect system paths, but not to other options. For example:
[datasets.train]
dataroot_gt = 'C:\My_Images_Folder\gt\' # notice the single-quote '
dataroot_lq = 'C:\My_Images_Folder\lq\'
  • Avoid special characters in both paths and file names. Although parsing UTF-8 isn't a problem in most cases, it can still potentially cause issues.

  • Prefer full paths over relative paths. Both can be used, but full paths avoid confusion.

  • Sub-directories are parsed by default.

  • Do not mix OTF degradations with paired datasets and the default model. OTF should always be used with the otf model and the otf dataloader.

Launch Options

This section describes the relevant launch options.


--auto_resume

The --auto_resume argument will resume training if the name option in your config file corresponds to an existing folder in /experiments and if a model is found under the /experiments/model_name/models/ folder.

neosr-train config.toml --auto_resume

--launcher

The --launcher argument specifies the job launcher. This is only useful if you're doing distributed training (multiple GPUs). Possible options are none, pytorch and slurm. See the distributed training section for more information.

uv run python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 train.py -opt config.toml --launcher pytorch

[!NOTE] The env var CUDA_VISIBLE_DEVICES might be needed to make sure all devices are visible. You can either set it using ~/.profile or by passing it directly on the command line before python: CUDA_VISIBLE_DEVICES=0,1,2,3,4 python ...

Header Options


name

The name option sets the folder name where your training files will be stored. It's a convention to use a prefix based on the scale factor of the model you're training:

name = "4x_mymodel"

model_type

The model_type option specifies which model should be used. If you are training with a paired or single dataset, you should set it to image (the default):

model_type = "image" # or "otf" for on-the-fly degradations

If you want to use on-the-fly degradations, set it to otf instead.


scale

The scale option sets the scale ratio of your generator. It can be 1x, 2x or 4x:

scale = 4

num_gpu

The num_gpu option sets the number of GPUs. You don't have to specify this option unless you're doing distributed training. The default is "auto", which gets the number of GPUs using torch.cuda.device_count.

num_gpu = "auto"

use_amp and bfloat16

The use_amp option enables Automatic Mixed Precision to speed up training. If you are using a GPU with tensor cores (Nvidia Turing or higher), using AMP is recommended. The bfloat16 option sets the dtype to BFloat16 instead of the default float16. Using this is recommended if your GPU has support (Nvidia Ampere or higher).

[!IMPORTANT] Beware that AMP can cause loss of precision and may cause instabilities. Not all functions are implemented in bfloat16 yet, so if you find an error while using it, please update to PyTorch Nightly and report it on the issue tracker.

# Turing or newer
use_amp = true

# Ampere or newer
bfloat16 = true

fast_matmul

This option enables TF32 or double-bfloat16 computation (determined by your hardware and the heuristics of set_float32_matmul_precision) for all float32 operations. In practice, it increases performance without affecting final results in most cases.

fast_matmul = true

compile

The compile option is experimental. It enables PyTorch's torch.compile(), which can speed up training.

compile = true

[!NOTE] For now, only Linux supports torch.compile, because Triton is not officially supported on Windows yet.


manual_seed

The manual_seed option enables deterministic training. You should use it if your goal is to make precise tests/comparisons. It is recommended that you use manual seed values >= 1024.

[!IMPORTANT] If you are not making experiments and just want to train a real-world model, leave this option commented, otherwise training performance will decrease significantly.

manual_seed = 1024

Note: Using high seeds is recommended because of torch.Generator.


Dataset options

This section describes the options within:

[datasets.train]

(dataset) type

The type option specifies the type of dataset loader. Possible options are paired, single and otf. The single type should only be used for inference, not training. The paired option is the default one and will only work if you have LQ images set in dataroot_lq. The otf type is meant to be used together with the otf model type.

[datasets.train]
type = "paired" # For paired datasets
#type = "otf" # For on-the-fly degradations

dataroot_gt, dataroot_lq

The dataroot_gt and dataroot_lq options are the folder paths to your dataset. This can be either normal images or an LMDB database. The "gt" (ground truth) images are the ideal ones, the ones you want your model to learn to produce. The "lq" (low quality) images are the degraded ones. For a folder with images, just include the path:

[datasets.train]
dataroot_gt = 'C:\My_Images_Folder\gt\'
dataroot_lq = 'C:\My_Images_Folder\lq\'

If you're using LMDB, both paths should end with the .lmdb suffix:

[datasets.train]
dataroot_gt = "/home/user/dataset_gt.lmdb"
dataroot_lq = "/home/user/dataset_lq.lmdb"

meta_info

The meta_info option (optional) points to a text file listing the image file names. It is optional, but recommended to avoid unexpected training aborts due to dataset errors such as file name mismatches.

[!NOTE] If you use create_lmdb.py to convert all your images into an LMDB database, the meta_info option is not necessary, as the script will automatically generate and link it.

[datasets.train]
meta_info = 'C:\meta_info.txt'

patch_size

The patch_size option is one of the most important options you have to change. It sets the size to which each image will be cropped before being sent to the network. A random area of each image is cropped at every new batch.

Notes on patch_size:

  • patch_size is the crop size applied to your LQ images, in pixels. The GT pair will be cropped to patch_size multiplied by your scale ratio. For example, if you set patch_size = 32 and scale = 4, your GT crop will be 128px and your LQ crop will be 32px (see the example below).

  • Commonly used constant values for patch_size are: 32, 48, 64, 96 and 128.

  • Depending on the arch you're using, you may encounter tensor size mismatches and other problems with some patch_size values. In general, multiples of 8 or 16 should work on most networks.

  • For transformers, your patch_size must be divisible by the window size. Standard values for window size are 8, 12, 16, 24 and 32.

  • Increasing patch_size will lead to better end results (better model restoration accuracy), but VRAM usage will increase quadratically.

[datasets.train]
patch_size = 48
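
To make the crop arithmetic concrete, a small sketch assuming a 4x model:

# assuming scale = 4 in the header
[datasets.train]
patch_size = 48 # LQ crop: 48px
# GT crop = patch_size * scale = 48 * 4 = 192px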

batch_size

The batch_size option specifies the number of images to feed the network in each iteration.

Notes on batch_size:

  • Large batches have a normalizing effect, i.e. training becomes more stable.
  • Research shows that larger batch sizes not only stabilize training, but also make the network learn faster. They may also improve the accuracy of the final restoration, although this depends on the optimizer you're using.
  • Common batch_size values are: 4, 8 and 16. Anything higher than 64 can be considered "high batch" (in research).
  • batch_size sets the batch size per GPU.

[datasets.train]
batch_size = 8

accumulate

The accumulate option specifies the number of batches to be accumulated, also known as Gradient Accumulation. Using this option effectively trades training speed for less VRAM usage. Your batch_size is multiplied by the accumulate value, giving the effective batch size. For example: if batch_size is 2 and accumulate is 8, your effective batch will be 16 (see the sketch below). Default: 1.

[datasets.train]
accumulate = 1
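
A sketch of the example mentioned above, training at an effective batch of 16 while only holding 2 images in VRAM per step:

[datasets.train]
batch_size = 2
accumulate = 8
# effective batch = batch_size * accumulate = 2 * 8 = 16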

color = "y"

The color option converts all dataset images to grayscale. This is only useful if you're training a monochrome model, and should not be in your configuration file at all unless you want your model to generate monochrome images. Note: you need to change the network options to match the number of channels of your dataset. If you use color = "y", the number of input and output channels of your network should be set to 1.

[datasets.train]
color = "y"
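
A hedged sketch of matching the network channels (the parameter names in_ch and out_ch below are illustrative; the actual names depend on the arch, see the arch-specific options):

[datasets.train]
color = "y"

[network_g]
type = "..." # your chosen arch
in_ch = 1    # illustrative name, check the arch-specific options
out_ch = 1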

use_hflip, use_rot

The use_hflip and use_rot options are augmentations. They flip and rotate images during training to increase variety. This is a standard basic augmentation that has been shown to improve models. Both default to true even if not present in the configuration file.

[datasets.train]
use_hflip = true
use_rot = true

augmentations, aug_prob

The augmentations and aug_prob options specify which augmentations to use and their probabilities, respectively. Currently supported are MixUp, CutMix, ResizeMix and CutBlur. Both options are specified as lists, and each probability value corresponds to the augmentation at the same position in the list. For example:

[datasets.train]
augmentation = [ "none", "mixup", "cutmix", "resizemix", "cutblur" ]
aug_prob = [ 0.0, 0.5, 0.5, 0.7, 0.7 ]

The configuration above will run all 4 augmentations, giving probability 0 to none (meaning some augmentation will always be applied), 0.5 to MixUp and CutMix (50% chance of applying), and 0.7 to ResizeMix and CutBlur (70% chance). Make sure the number of probability values matches the number of augmentation types.

[!NOTE] CutBlur is meant to be applied to real-world SR. If applied to bicubic-only it may cause undesired effects.


num_worker_per_gpu

The num_worker_per_gpu option is the number of threads used by the PyTorch DataLoader. By default, it is set to 4. To automatically fetch the maximum number of workers supported by the system, use the value "auto":

[datasets.train]
num_worker_per_gpu = "auto"

dataset_enlarge_ratio

The dataset_enlarge_ratio option is used to artificially increase the size of the dataset. If your dataset is too small, training will cycle through epochs too quickly, causing slowdowns. Using this option will virtually multiply the dataset by N times, so epochs will take longer to complete.

[datasets.train]
dataset_enlarge_ratio = 10
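
For example, assuming a hypothetical dataset of 1,000 image pairs, the configuration below makes each epoch behave as if it contained 10,000 pairs:

[datasets.train]
dataset_enlarge_ratio = 10
# 1,000 pairs * 10 = 10,000 virtual pairs per epoch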

Validation

This section describes the options within [datasets.val] and [val].

[!IMPORTANT] By default, validation doesn't tile the inputs. This means if your val images have a large resolution, you might run out of VRAM while validation is running. For this reason, it is recommended that you tile all your validation images to smaller resolutions (such as 256x256 or 512x512) before starting training. Alternatively, use the tile option as mentioned below.

The validation options, when enabled, will automatically run your model on a folder of images every time the val_freq iteration count is reached. For example:

[datasets.val]
name = "any_name"
type = "single"
dataroot_lq = 'C:\folder\path\'
[val]
val_freq = 1000

The configuration above will perform inference on the dataroot_lq folder every 1000 iterations (val_freq = 1000) and save the images in the visualization/ folder of your model. Alternatively, you can use a paired validation set (both GT and LQ) and calculate metrics such as PSNR, SSIM and DISTS:

[datasets.val]
name = "any_name"
type = "paired"
save_img = false
dataroot_gt = "/folder/path/gt/"
dataroot_lq = "/folder/path/lq/"
[val]
val_freq = 1000
tile = 200
save_lq = false
[val.metrics.psnr]
type = "calculate_psnr"
[val.metrics.ssim]
type = "calculate_ssim"
[val.metrics.dists]
type = "calculate_dists"
better = "lower"
[val.metrics.topiq]
type = "calculate_topiq"

The tile option sets the number of tiles each image will be split into during validation. This prevents out-of-memory errors, at the expense of slower validation inference. The save_img option defaults to true; when set to false, validation still runs and metrics are shown in the log, but the images are not saved. The save_lq option defaults to true and copies the original LQ validation images into the val folder, for easier comparisons. Validation results are saved in experiments/model_name/visualization/. Metric values can be seen in the training log file and/or with tensorboard or wandb (see the Logger options below), and are also printed to the terminal. Currently supported metrics are PSNR, SSIM, DISTS and TOPIQ.


path

The path options describe the paths for pretrained models or a resume state.

[path]
# Generator Pretrain
pretrain_network_g = 'C:\path\to\pretrain.pth'
# Discriminator Pretrain
pretrain_network_d = 'C:\path\to\pretrain.pth'

If you want to use a pretrain that has a different upscale ratio, and/or you want to load a pretrain that was trained on a slightly different version of the arch, you can use the following option:

[path]
strict_load_g = false

[!NOTE] Unless you have a very specific need, do not change the configuration above.

If you have a .state file that you want to load, comment out the pretrain_network_* option and use resume_state instead:

[path]
resume_state = 'C:\path\to\pretrain.state'

If you wish to save the experiments/ folder to another location, you can specify it using experiments_root:

[path]
experiments_root = 'C:\path\to\folder\'

If you wish to print the network structure to the logger, for debugging purposes, use print_network:

[path]
print_network = true

network_g and network_d

These options describe which network architecture should be used. For a list of supported architectures, see the neosr readme.md. Unless the template files have a network parameter explicitly commented, all network parameters are set to defaults based on their research papers. This means that you don't need to manually type any parameters, just use the arch name. For example:

[network_g]
type = "rgt"
[network_d]
type = "unet"

The above option will train the RGT generator with the U-Net discriminator.

[!IMPORTANT] Some networks have a parameter to specify the upscaling factor. It should be set to the same value as your scale option. The name of this parameter varies for each arch (upsampling, upscale, etc.), see the arch-specific options. By default, it will always be set to 4, so if you're training a 2x model make sure this parameter matches.
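
A hedged sketch for a 2x model (the parameter name upscale is illustrative; check the arch-specific options for the exact name used by your network):

scale = 2

[network_g]
type = "..." # your chosen arch
upscale = 2  # must match the scale option above; name varies per arch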


Train

These options describe the main training options, such as optimizers and losses.


eco, eco_iters, eco_init

These options enable ECO (Empirical Centroid-oriented Optimization), a strategy to make optimization smoother by blending the LQ/GT with the model's output and gradually decreasing that blend. The eco option is a bool flag that enables or disables it (default: false). eco_iters specifies the number of iterations over which ECO ramps up to full effect (default: 80000). eco_init specifies the iteration at which ECO starts to take effect (default: 15000). eco_init only takes effect if pretrain_network_g is None.

[train]
eco = true
eco_iters = 80000
eco_init = 15000

[!IMPORTANT] eco needs a pretrained network to work properly. If a pretrained network is not provided, it will skip the first 15k iters before enabling eco. Using a pretrained network is highly recommended.


ema

This option enables an Exponential Moving Average (EMA) of the weights between iterations. This can lead to smoother convergence, and is particularly recommended if sam and schedule_free are enabled.

[train]
ema = 0.999

Where 0.999 is the decay rate.


sam, sam_init

This option enables Sharpness-Aware Minimization (SAM), a technique to improve optimization, leading to smoother convergence and better generalization. However, SAM requires two forward-backward passes, which results in a decrease of ~50% in training speed. Currently, FriendlySAM can be enabled by using the following options:

[train]
sam = "fsam"
sam_init = 1000

Where sam_init specifies the iteration when SAM will be enabled.

[!IMPORTANT] When training from scratch and with low batch sizes (less than 8), SAM could cause NaNs. In that case, use sam_init to start SAM only after N iterations. Also be careful when using AMP (automatic mixed precision): due to limitations in PyTorch's GradScaler, SAM does not scale gradients to the appropriate precision ranges, which could lead to NaNs.


warmup_iter

This option linearly ramps up the learning rate for the specified iter count. For example:

[train]
warmup_iter = 10000

If you start training with a learning rate of 1e-4 using the above option, the learning rate will start from 0 and increase to 1e-4 (linearly) when it reaches 10k iterations.

This technique is used to reduce overfitting when fine-tuning models. The reference value is 2% of the total iterations you want to train your model for (see the sketch below). If unsure, use a value between 3200 and 10000. If -1 is specified, warmup is disabled. Note: warmup_iter should not be used with schedule-free optimizers. Instead, use warmup_steps from within the optimizer settings.
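
A sketch of the 2% rule, assuming a 500k-iteration run:

[logger]
total_iter = 500000

[train]
warmup_iter = 10000 # 2% of 500000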


grad_clip

This option controls Gradient Clipping. By default, grad_clip is enabled, but it can be disabled by setting it to false:

[train]
grad_clip = false

Gradient Clipping provides more stable training and allows for training with higher learning rates.


clamp

This option controls clamping of image inputs to the 0-1 range. By default, clamp is enabled; setting it to false disables it:

[train]
clamp = false

optim_g and optim_d

The optim_ options set the optimizers for the generator and discriminator, and their options. For the supported optimizers, see the wiki. For their respective options, see pytorch documentation and pytorch-optimizer.

[train.optim_g]
type = "adamw"
lr = 1e-4
weight_decay = 0.0
betas = [ 0.9, 0.99 ]
fused = true
[train.optim_d]
type = "adamw"
lr = 1e-4
weight_decay = 0.0
betas = [ 0.9, 0.99 ]
fused = true

The above option will set AdamW to a learning rate of 1e-4 (scientific notation).

[!NOTE] The fused: true option can only be used with Adam and AdamW and is experimental. Some networks may not work properly when set to true.


scheduler

This option sets the learning rate scheduler. Supported types are MultiStepLR and CosineAnnealing.

[train.scheduler]
type = "multisteplr"
milestones = [ 60000, 120000 ]
gamma = 0.5

The above option sets the MultiStepLR scheduler to reduce the learning rate by gamma = 0.5 at iteration counts of 60k and 120k. Example using CosineAnnealing:

[train.scheduler]
type = "cosineannealing"
T_max = 350000
eta_min = 4e-5

The setting above uses the CosineAnnealing scheduler, reducing the learning rate to 4e-5 by 350k iterations.

[!NOTE] For more information, see the pytorch documentation.


Losses

For options on all losses, please read the dedicated wiki page.


Logger

These options describe the logger configuration.


total_iter

Sets the total number of iterations for training. When the total_iter value is reached, the training script stops and saves the final models.

[logger]
total_iter = 500000 # end of training will be 500k iter

[!NOTE] The total_iter value is tracked based on epochs. If you modify the batch size or the dataset size mid-training, it may reach the end of training even though the specified number of iterations has not been reached. If you encounter this issue, increase total_iter to a high value and resume training.


print_freq

This option sets how often training information is printed to the terminal and log file.

[logger]
print_freq = 100

The above option will print training information every 100 iterations.


save_checkpoint_freq

This option sets the frequency at which model and state files are saved.

[logger]
save_checkpoint_freq = 1000

The above option will save the models and state every 1000 iterations.


use_tb_logger, save_tb_img and wandb

This option enables tensorboard logging. A folder will be created inside experiments/tb_logger/ containing the files needed to initialize tensorboard.

[logger]
use_tb_logger = true
save_tb_img = true

The save_tb_img option saves the validation images to tensorboard. Note: tensorboard initialization can be slow and the log can get big when save_tb_img is enabled.

To start tensorboard, use the following command:

uv run tensorboard --logdir tb_logger/

It will start a localhost server (usually on port 6006), which you can then open in any browser. For more details on using tensorboard, see the documentation.

Alternatively, you can use wandb by using the following option:

[logger]
use_tb_logger = true
[logger.wandb]
project = "experiments/tb_logger/project/"
resume_id = 1

The option use_tb_logger: true is required to use wandb.


Distributed Training

These options describe the distributed training configuration.


backend

This option specifies the backend to use for distributed training.

[dist_params]
backend = "nccl"
port = 29500

The above option will set up distributed training using NVIDIA's NCCL library on port 29500. You can also launch training with slurm by passing a command line argument:

neosr-train config.toml --launcher slurm

Or through pytorch:

uv run python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 train.py -opt config.toml --launcher pytorch

[!NOTE] The env var CUDA_VISIBLE_DEVICES might be needed to make sure all devices are visible. You can either set it using ~/.profile or by passing it directly on the command line before python: CUDA_VISIBLE_DEVICES=0,1,2,3,4 python ...