# Configuration Walkthrough

This document explains each configuration option of neosr. Templates can be used for convenience.
## Initial Notes
- Make sure to use single-quotes (`'`) instead of double-quotes (`"`) in paths if you are on Windows (this does not apply to unix-like systems). It applies to `dataroot_*` and other options that expect system paths, but not to other options. For example:

  ```toml
  [datasets.train]
  dataroot_gt = 'C:\My_Images_Folder\gt\' # notice the single-quote '
  dataroot_lq = 'C:\My_Images_Folder\lq\'
  ```
- Avoid special characters in both your paths and file names. Although parsing UTF-8 isn't a problem in most cases, it can still potentially cause problems.
- Prefer full paths over relative paths. Both can be used, but full paths avoid user confusion.
- Sub-directories are parsed by default.
- Do not mix OTF degradations with `paired` and `default`. OTF should always be used with model `otf` and dataloader `otf`.
## Launch Options

This section describes the relevant launch options.
### --auto_resume

The `--auto_resume` argument will resume training if the `name` option in your config file corresponds to an existing folder in `experiments/` and a model is found under the `experiments/model_name/models/` folder.

```bash
neosr-train config.toml --auto_resume
```
### --launcher

The `--launcher` argument specifies the job launcher. This is only useful if you're doing distributed training (multiple GPUs). Possible options are `none`, `pytorch` and `slurm`. See the distributed training section for more information.

```bash
uv run python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 train.py -opt config.toml --launcher pytorch
```

> [!NOTE]
> The env var `CUDA_VISIBLE_DEVICES` might be needed to make sure all devices are visible. You can either set it in `~/.profile` or pass it directly on the command line before python: `CUDA_VISIBLE_DEVICES=0,1,2,3,4 python ...`
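For instance, a hedged example combining the two (the GPU indices and `--nproc_per_node` count below are arbitrary placeholders; adjust them to your setup):

```bash
# Sketch: restrict the run to GPUs 0 and 1, then launch distributed training
# using the same entry point shown above.
CUDA_VISIBLE_DEVICES=0,1 uv run python -m torch.distributed.launch --nproc_per_node=2 --master_port=29500 train.py -opt config.toml --launcher pytorch
```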
## Header Options

### name

The `name` option sets the folder name where your training files will be stored. It's a convention to use a prefix based on the scale factor of the model you're training:

```toml
name = "4x_mymodel"
```
### model_type

The `model_type` option specifies which model should be used. If you are training with a `paired` or `single` dataset, you should set it to `default`:

```toml
model_type = "image" # "otf"
```

If you want to use on-the-fly degradations, set it to `otf` instead.
### scale

The `scale` option sets the scale ratio of your generator. It can be 1x, 2x or 4x:

```toml
scale = 4
```
### num_gpu

The `num_gpu` option sets the number of GPUs. You don't have to specify this option unless you're doing distributed training. The default is `"auto"`, which gets the number of GPUs using `torch.cuda.device_count()`.

```toml
num_gpu = "auto"
```
### use_amp and bfloat16

The `use_amp` option enables Automatic Mixed Precision to speed up training. If you are using a GPU with tensor cores (Nvidia Turing or higher), using AMP is recommended. The `bfloat16` option sets the dtype to BFloat16 instead of the default float16. Using it is recommended if your GPU has support (Nvidia Ampere or higher).

> [!IMPORTANT]
> Beware that AMP can cause loss of precision and may cause instabilities. Not all functions are yet implemented in `bfloat16`, so if you find an error while using it, please update to Pytorch Nightly and report it on the issues page.

```toml
# Turing or newer
use_amp = true
# Ampere or newer
bfloat16 = true
```
### fast_matmul

This option enables TF32 or double bfloat16 (determined by your hardware and the heuristics of `set_float32_matmul_precision`) in all float32 precision operations. In practice, it increases performance without affecting final results in most cases.

```toml
fast_matmul = true
```
### compile

The `compile` option is experimental. It enables pytorch's new `torch.compile()`, which can speed up training.

```toml
compile = true
```

> [!NOTE]
> For now, only linux has support for `torch.compile`, due to Triton not being officially supported on Windows yet.
### manual_seed

The `manual_seed` option enables deterministic training. You should use it if your goal is to make precise tests/comparisons. It is recommended that you use manual seed values >= 1024.

> [!IMPORTANT]
> If you are not running experiments and just want to train a real-world model, leave this option commented out, otherwise training performance will decrease significantly.

```toml
manual_seed = 1024
```

Note: using high seeds is recommended because of `torch.Generator`.
## Dataset options

This section describes the options within `[datasets.train]`.

### type (dataset)

The `type` option specifies the type of dataset loader. Possible options are `paired`, `single` and `otf`. The `single` type should only be used for inference, not training. The `paired` option is the default one and will only work if you have LQ images set in `dataroot_lq`. The `otf` type is meant to be used together with model `otf`.

```toml
[datasets.train]
type = "paired" # For paired datasets
#type = "otf" # For on-the-fly degradations
```
### dataroot_gt, dataroot_lq

The `dataroot_gt` and `dataroot_lq` options are the folder paths to your dataset. This can be either normal images or an LMDB database. The "gt" (ground-truth) images are the ideal ones, the images you want your model to transform your inputs into. The "lq" (low quality) images are the degraded ones.

For a folder with images, just include the path:

```toml
[datasets.train]
dataroot_gt = 'C:\My_Images_Folder\gt\'
dataroot_lq = 'C:\My_Images_Folder\lq\'
```

If you're using LMDB, both paths should end with the `.lmdb` suffix:

```toml
[datasets.train]
dataroot_gt = "/home/user/dataset_gt.lmdb"
dataroot_lq = "/home/user/dataset_lq.lmdb"
```
### meta_info

The `meta_info` option is a text file describing the image file names. It is optional, but recommended to avoid unexpected training aborts due to dataset errors such as file name mismatches.

> [!NOTE]
> If you use `create_lmdb.py` to convert all your images into an LMDB database, the `meta_info` option is not necessary, as the script will automatically generate and link it.

```toml
[datasets.train]
meta_info = 'C:\meta_info.txt'
```
### patch_size

The `patch_size` option is one of the most important options you have to change. It sets the size to which each image will be cropped before being sent to the network. A random area of each image is cropped at every new batch.

Notes on `patch_size`:

- `patch_size` is the crop size applied to your LQ, in pixels. The GT pair will be cropped to `patch_size` multiplied by your scale ratio. For example, if you set `patch_size: 32` and `scale: 4`, your GT will be 128px and your LQ will be 32px.
- Commonly used constant values for `patch_size` are: `32`, `48`, `64`, `96` and `128`.
- Depending on the arch you're using, you may encounter tensor size mismatches and other problems with some `patch_size` values. In general, multiples of 8 or 16 should work on most networks.
- For transformers, your `patch_size` must be divisible by the window size. Standard values for window size are 8, 12, 16, 24 and 32.
- Increasing `patch_size` will lead to better end results (better model restoration accuracy), but VRAM usage will increase quadratically.

```toml
[datasets.train]
patch_size = 48
```
### batch_size

The `batch_size` option specifies the number of images fed to the network in each iteration.

Notes on `batch_size`:

- Large batches have a normalizing effect, i.e. training becomes more stable.
- Research shows that a larger batch size not only stabilizes training, but also makes the network learn faster. It may also improve the accuracy of the final restoration, although this depends on the optimizer you're using.
- Common `batch_size` values are: 4, 8 and 16. Anything higher than 64 can be considered "high batch" (in research).
- `batch_size` sets the batch size per GPU.

```toml
[datasets.train]
batch_size = 8
```
### accumulate

The `accumulate` option specifies the number of batches to be accumulated, also known as Gradient Accumulation. Using this option effectively trades training speed for less VRAM usage. Your `batch_size` number will be multiplied by the `accumulate` value to train at the resulting batch size. For example: if `batch_size` is 2 and `accumulate` is 8, your effective batch will be 16. Default: 1.

```toml
[datasets.train]
accumulate = 1
```
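As a concrete illustration of the example above (the values are illustrative only):

```toml
[datasets.train]
batch_size = 2   # per-GPU batch
accumulate = 8   # gradients accumulated over 8 batches -> effective batch of 16
```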
### color = "y"

The `color` option converts all dataset images to grayscale. This is only useful if you're training a monochrome model, and it should not be in your configuration file at all unless you want your model to generate monochrome images.

Note: you need to change the network options to match the number of channels of your dataset. If you use `color: y`, the number of input and output channels of your network should be set to `1`.

```toml
[datasets.train]
color = "y"
```
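As a hypothetical sketch of matching the network channels to a grayscale dataset — the channel parameter names (`in_chans`/`out_chans` below) are assumptions and vary per arch, so check the arch-specific options:

```toml
[datasets.train]
color = "y"

[network_g]
type = "..."    # your chosen arch
in_chans = 1    # assumed parameter name; varies per arch
out_chans = 1   # assumed parameter name; varies per arch
```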
### use_hflip, use_rot

The `use_hflip` and `use_rot` options are augmentations. They rotate and flip images during training to increase variety. This is a standard basic augmentation that has been shown to improve models. Both default to `true`, even if not present in the configuration file.

```toml
[datasets.train]
use_hflip = true
use_rot = true
```
### augmentations, aug_prob

The `augmentation` and `aug_prob` options specify which augmentations to use and their probabilities, respectively. Currently supported are `MixUp`, `CutMix`, `ResizeMix` and `CutBlur`. These options are specified as lists. Each probability value corresponds to its respective position in the `augmentation` list. For example:

```toml
[datasets.train]
augmentation = [ "none", "mixup", "cutmix", "resizemix", "cutblur" ]
aug_prob = [ 0.0, 0.5, 0.5, 0.7, 0.7 ]
```

The configuration above will run all 4 augmentations, giving probability 0 to none (meaning some augmentation will always be applied), 0.5 to MixUp and CutMix (50% chance of applying), and 0.7 to ResizeMix and CutBlur (70% chance). Make sure the number of probability values and the number of augmentation types are the same.

> [!NOTE]
> `CutBlur` is meant to be applied to real-world SR. If applied to bicubic-only models, it may cause undesired effects.
### num_worker_per_gpu

The `num_worker_per_gpu` option is the number of threads used by the Pytorch Dataloader. By default, it is set to 4. To automatically fetch the maximum number of workers supported by the system, use the option `"auto"`:

```toml
[datasets.train]
num_worker_per_gpu = "auto"
```
### dataset_enlarge_ratio

The `dataset_enlarge_ratio` option is used to artificially increase the size of the dataset. If your dataset is too small, training will reach an epoch too fast, causing slowdowns. Using this option will virtually multiply the dataset by N times, so epochs will take longer to reach.

```toml
[datasets.train]
dataset_enlarge_ratio = 10
```
## Validation

This section describes the options within `[datasets.val]` and `[val]`.

> [!IMPORTANT]
> By default, validation doesn't tile the inputs. This means that if your val images have a large resolution, you might run out of VRAM while validation is running. For this reason, it is recommended that you tile all your validation images to smaller resolutions (such as 256x256 or 512x512) before starting training. Alternatively, use the `tile` option as mentioned below.

The validation options, when enabled, will automatically run your model on a folder of images every time the `val_freq` iter is reached. For example:

```toml
[datasets.val]
name = "any_name"
type = "single"
dataroot_lq = 'C:\folder\path\'

[val]
val_freq = 1000
```

The configuration above will perform inference on the `dataroot_lq` folder whenever it reaches 1000 iterations (`val_freq: 1000`) and save the images in the `visualization/` folder of your model.
Alternatively, you can use a paired validation set (both GT and LQ) and calculate metrics such as PSNR, SSIM and DISTS:

```toml
[datasets.val]
name = "any_name"
type = "paired"
save_img = false
dataroot_gt = "/folder/path/gt/"
dataroot_lq = "/folder/path/lq/"

[val]
val_freq = 1000
tile = 200
save_lq = false

[val.metrics.psnr]
type = "calculate_psnr"
[val.metrics.ssim]
type = "calculate_ssim"
[val.metrics.dists]
type = "calculate_dists"
better = "lower"
[val.metrics.topiq]
type = "calculate_topiq"
```

The `tile` option sets the number of tiles each image will be cut into during validation. This prevents out-of-memory errors, at the expense of slower validation inference. The `save_img` option defaults to true; when set to false, validation will run and metrics will be shown in the log, but the images won't be saved. The `save_lq` option defaults to true and will copy the original LQ validation images into the val folder, for easier comparisons.

Validation results are saved in `experiments/model_name/visualization/`. The metric values can be seen in the training log file, and/or with `tensorboard` or `wandb` (see Logger options below), as well as printed on the terminal. Currently supported metrics are PSNR, SSIM, DISTS and TOPIQ.
## path

The `path` options describe the paths for pretrained models or resume state.

```toml
[path]
# Generator Pretrain
pretrain_network_g = 'C:\path\to\pretrain.pth'
# Discriminator Pretrain
pretrain_network_d = 'C:\path\to\pretrain.pth'
```

If you want to use a pretrain that has a different upscale ratio and/or you want to load a pretrain that was trained on a slightly different version of the arch, you can use the following option:

```toml
[path]
strict_load_g = false
```

> [!NOTE]
> Unless you have a very specific need, do not change the configuration above.

If you have a `.state` file that you want to load, comment out the `pretrain_network_*` option and use `resume_state` instead:

```toml
[path]
resume_state = 'C:\path\to\pretrain.state'
```

If you wish to save the `experiments/` folder to another location, you can specify it using `experiments_root`:

```toml
[path]
experiments_root = 'C:\path\to\folder\'
```

If you wish to print the network structure in the logger, for debugging purposes, use `print_network`:

```toml
[path]
print_network = true
```
## network_g and network_d

These options describe which network architecture should be used. For a list of supported architectures, see the neosr `readme.md`.

Unless the template files have some network parameter explicitly commented, all network parameters are set to defaults based on their research papers. This means that you don't need to manually type any parameters, just use their names. For example:

```toml
[network_g]
type = "rgt"

[network_d]
type = "unet"
```

The above option will train the RGT generator with the U-Net discriminator.

> [!IMPORTANT]
> Some networks have a parameter to specify the upscaling factor. It should be set to the same value as your `scale` option. The name of this parameter varies for each arch (`upsampling`, `upscale`, etc.), see the arch-specific options. By default, it will always be set to `4`, so if you're training a 2x model make sure this parameter matches.
## Train

These options describe the main training options, such as optimizers and losses.

### eco, eco_iters, eco_init

These options enable ECO (Empirical Centroid-oriented Optimization). This is a strategy to make optimization smoother by blending LQ/GT with the model's output and gradually decreasing that blend. The `eco` option is a bool flag that enables or disables it (default: false). `eco_iters` specifies the number of iters over which ECO reaches its full effect (default: 80000). `eco_init` specifies when ECO should start to take effect (default: 15000). `eco_init` only takes effect if `pretrain_network_g` is None.

```toml
[train]
eco = true
eco_iters = 80000
eco_init = 15000
```

> [!IMPORTANT]
> `eco` needs a pretrained network to work properly. If a pretrained network is not provided, it will skip the first 15k iters before enabling `eco`. Using a pretrained network is highly recommended.
### ema

This option applies an Exponential Moving Average to the weights between each iteration. This can lead to smoother convergence, and is particularly recommended if `sam` and `schedule_free` are enabled.

```toml
[train]
ema = 0.999
```

Where `0.999` is the decay rate.
### sam, sam_init

This option enables Sharpness-Aware Minimization, a technique to improve optimization, leading to smoother convergence and better generalization. However, SAM requires two forward-backward passes, which results in a decrease of ~50% in training speed. Currently, FriendlySAM can be enabled by using the following options:

```toml
[train]
sam = "fsam"
sam_init = 1000
```

Where `sam_init` specifies the iteration at which SAM will be enabled.

> [!IMPORTANT]
> When training from scratch and with low batch sizes (less than 8), SAM could cause `NaN`. In that case, use `sam_init` to start SAM only after N iterations. Be careful when using AMP (automatic mixed precision): due to limitations in pytorch's GradScaler, SAM does not scale gradients to appropriate precision ranges, which could lead to `NaN`.
### warmup_iter

This option linearly ramps up the learning rate over the specified iter count. For example:

```toml
[train]
warmup_iter = 10000
```

If you start training with a learning rate of 1e-4 using the above option, the learning rate will start from 0 and increase linearly to 1e-4 when it reaches 10k iterations. This technique is used to reduce overfitting when fine-tuning models. The reference value is 2% of the total iters you want to train your model for (for example, 10000 for a 500k-iter run). If unsure, use a value between `3200` and `10000`. If `-1` is specified, warmup is disabled.

Note: `warmup_iter` should not be used with schedule-free optimizers. Instead, use `warmup_steps` from within the optimizer settings.
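For schedule-free optimizers, a hedged sketch of what that looks like — the optimizer type name below is an assumption, check the optimizers wiki page for the exact supported names:

```toml
[train.optim_g]
type = "adamw_sf"    # assumed schedule-free optimizer name
lr = 1e-3
warmup_steps = 1600  # warmup handled by the optimizer, not by warmup_iter
```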
### grad_clip

This option controls Gradient Clipping. By default, `grad_clip` is enabled, however it can be disabled by setting it to false:

```toml
[train]
grad_clip = false
```

Gradient Clipping provides more stable training and allows for training with higher learning rates.
### clamp

This option controls clamping of image inputs to the 0-1 range. By default, `clamp` is enabled.

```toml
[train]
clamp = false
```
### optim_g and optim_d

The `optim_*` options set the optimizers for the **g**enerator and **d**iscriminator, and their options. For the supported optimizers, see the wiki. For their respective options, see the pytorch documentation and pytorch-optimizer.

```toml
[train.optim_g]
type = "adamw"
lr = 1e-4
weight_decay = 0.0
betas = [ 0.9, 0.99 ]
fused = true

[train.optim_d]
type = "adamw"
lr = 1e-4
weight_decay = 0.0
betas = [ 0.9, 0.99 ]
fused = true
```

The above option will set AdamW to a learning rate of 1e-4 (scientific notation).

> [!NOTE]
> The `fused: true` option can only be used with Adam and AdamW and is experimental. Some networks may not work properly when it is set to true.
### scheduler

This option sets the learning rate scheduler. Supported types are `MultiStepLR` and `CosineAnnealing`.

```toml
[train.scheduler]
type = "multisteplr"
milestones = [ 60000, 120000 ]
gamma = 0.5
```

The above option sets the MultiStepLR scheduler to reduce the learning rate by `gamma: 0.5` at iter counts of 60k and 120k.

Example using CosineAnnealing:

```toml
[train.scheduler]
type = "cosineannealing"
T_max = 350000
eta_min = 4e-5
```

The setting above will set the CosineAnnealing scheduler, reducing the learning rate to `4e-5` when it reaches 350k iters.

> [!NOTE]
> For more information, see the pytorch documentation.
## Losses

For options on all losses, please read the dedicated wiki page.
## Logger

These options describe the logger configuration.

### total_iter

Sets the total number of iterations for training. When the `total_iter` value is reached, the model will stop the training script and save the last models.

```toml
[logger]
total_iter = 500000 # end of training will be 500k iter
```

> [!NOTE]
> The `total_iter` variable is based on epochs. If you modify the batch size or the dataset size mid-training, it may reach the end of training even though it has not reached the specified number of iterations. If you encounter this issue, increase `total_iter` to a high value and resume training.
### print_freq

This sets the frequency of printing training information to the terminal and log file.

```toml
[logger]
print_freq = 100
```

The above option will print training information every 100 iterations.

### save_checkpoint_freq

This option sets the frequency of saving model files and the state file.

```toml
[logger]
save_checkpoint_freq = 1000
```

The above option will save models and state at every 1k iter count.
### use_tb_logger, save_tb_img and wandb

The `use_tb_logger` option enables `tensorboard`. A folder will be created inside `experiments/tb_logger/` containing the files needed to initialize tensorboard.

```toml
[logger]
use_tb_logger = true
save_tb_img = true
```

The `save_tb_img` option saves the validation images to tensorboard. Note: tensorboard initialization can be slow and the log can get big when `save_tb_img` is enabled.

To start tensorboard, use the following command:

```bash
uv run tensorboard --logdir tb_logger/
```

It will open a localhost server (usually on port 6006), which you can then open in any browser. For more details on using tensorboard, see the documentation.

Alternatively, you can use `wandb` with the following option:

```toml
[logger]
use_tb_logger = true

[logger.wandb]
project = "experiments/tb_logger/project/"
resume_id = 1
```

The `use_tb_logger: true` option is required to use `wandb`.
## Distributed Training

These options describe the distributed training configuration.

### backend

This option specifies the backend to use for distributed training.

```toml
[dist_params]
backend = "nccl"
port = 29500
```

The above option will set up distributed training using the nvidia `nccl` library on port 29500.

You can also launch training with `slurm` by passing a command line argument:

```bash
neosr-train config.toml --launcher slurm
```

Or through pytorch:

```bash
uv run python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 train.py -opt config.toml --launcher pytorch
```

> [!NOTE]
> The env var `CUDA_VISIBLE_DEVICES` might be needed to make sure all devices are visible. You can either set it in `~/.profile` or pass it directly on the command line before python: `CUDA_VISIBLE_DEVICES=0,1,2,3,4 python ...`