<a href="https://colab.research.google.com/github/facebookresearch/vissl/blob/v0.1.6/tutorials/Large_Scale_Training_V0_1_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Scale Training with VISSL (mixed precision, LARC, ZeRO etc.)

In this tutorial, we show the configuration settings that users can set for training large models.

You can make a copy of this tutorial by `File -> Open in playground mode` and make changes there. DO NOT request access to this tutorial.

# Using LARC

LARC is a technique, introduced in the paper *Large Batch Training of Convolutional Networks* by **Yang You, Igor Gitman, Boris Ginsburg** (https://arxiv.org/abs/1708.03888), for improving the convergence of large batch size trainings.
LARC uses the ratio between gradient and parameter magnitudes to calculate an adaptive local learning rate for each individual parameter.

See the [LARC paper](https://arxiv.org/abs/1708.03888) for the calculation of the learning rate. In practice, it modifies the gradients of parameters as a proxy
for modifying the learning rate of the parameters.
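
The gradient rescaling described above can be sketched in plain Python. This is a minimal illustration for a single parameter tensor, written after the update rule in Apex's LARC as we understand it; the function name `larc_scale_gradients` and the toy values are assumptions made for this example and are not part of VISSL or Apex.

```python
import math


def larc_scale_gradients(params, grads, lr, trust_coefficient=0.001,
                         eps=1e-8, weight_decay=0.0, clip=False):
    """Return the gradients of ONE parameter tensor after LARC-style rescaling.

    Hypothetical helper for illustration; `params` and `grads` are flat lists.
    """
    param_norm = math.sqrt(sum(p * p for p in params))
    grad_norm = math.sqrt(sum(g * g for g in grads))
    if param_norm == 0.0 or grad_norm == 0.0:
        return list(grads)  # nothing to adapt, leave the gradient untouched

    # Local learning rate from the ratio of parameter norm to gradient norm.
    adaptive_lr = trust_coefficient * param_norm / (
        grad_norm + weight_decay * param_norm + eps
    )
    if clip:
        # Clipping variant: once multiplied by lr, the rate is min(adaptive_lr, lr).
        adaptive_lr = min(adaptive_lr / lr, 1.0)

    # Fold weight decay into the gradient, then scale the result.
    return [(g + weight_decay * p) * adaptive_lr for p, g in zip(params, grads)]


weights = [3.0, 4.0]    # L2 norm is 5
gradients = [0.6, 0.8]  # L2 norm is 1
scaled = larc_scale_gradients(weights, gradients, lr=0.1, eps=0.0)
clipped = larc_scale_gradients(weights, gradients, lr=0.1, eps=0.0, clip=True)
```

Because only the gradients change, the wrapped optimizer (SGD here) keeps its usual update rule, which is why the technique can act as a proxy for a per-parameter learning rate.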




## How to enable LARC

VISSL supports the LARC implementation from [NVIDIA's Apex LARC](https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py). To use LARC, users need to set the config option
`OPTIMIZER.use_larc=True`. VISSL exposes LARC parameters that users can tune. The full list of LARC parameters exposed by VISSL:


```yaml
OPTIMIZER:
  name: "sgd"
  use_larc: False  # supported for SGD only for now
  larc_config:
    clip: False
    eps: 1e-08
    trust_coefficient: 0.001
```

**NOTE:** LARC is currently supported for the SGD optimizer only in VISSL.
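
When experimenting, it can be convenient to pass these options as command-line overrides instead of editing the YAML file. The sketch below flattens a nested dictionary into dotted `key=value` strings. The helper `to_overrides` is hypothetical, and the `config.` prefix is an assumption about how overrides are passed to the VISSL launcher; check it against your own setup.

```python
def to_overrides(settings, prefix="config"):
    """Flatten a nested dict into dotted `key=value` override strings.

    Hypothetical helper; the `config.` prefix is an assumption.
    """
    overrides = []
    for key, value in settings.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as `larc_config`.
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


larc_settings = {
    "OPTIMIZER": {
        "name": "sgd",
        "use_larc": True,
        "larc_config": {"clip": False, "eps": 1e-08, "trust_coefficient": 0.001},
    }
}
larc_overrides = to_overrides(larc_settings)
```

The resulting strings can then be appended to the training command, one argument per override.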




# Using Apex


To make Apex easy to use, VISSL provides `anaconda` and `pip` packages of Apex (compiled with optimized C++ extensions/CUDA kernels). The Apex
packages are provided for all versions of `CUDA (9.2, 10.0, 10.1, 10.2, 11.0), PyTorch >= 1.4 and Python >=3.6 and <=3.9`.

Follow VISSL's instructions to [install apex in pip](https://github.com/facebookresearch/vissl/blob/master/INSTALL.md#step-2-install-pytorch-opencv-and-apex-pip) and to [install apex in conda](https://github.com/facebookresearch/vissl/blob/master/INSTALL.md#step-3-install-apex-conda).

# Using Mixed Precision

Many self-supervised approaches leverage mixed precision training by default to improve training speed and reduce the model's memory requirement.
For this, we use the [NVIDIA Apex Library with AMP](https://nvidia.github.io/apex/amp.html#o1-mixed-precision-recommended-for-typical-use).

Users can tune the AMP level to the levels supported by NVIDIA. See the [Apex documentation on AMP opt levels](https://nvidia.github.io/apex/amp.html#opt-levels) for details.

To use mixed precision training, set the following parameters in the configuration file:


```yaml
MODEL:
  AMP_PARAMS:
    USE_AMP: True
    # Use O1 as it is more robust and stable than O3. If you want to use O3, we recommend
    # the following setting:
    # {"opt_level": "O3", "keep_batchnorm_fp32": True, "master_weights": True, "loss_scale": "dynamic"}
    AMP_ARGS: {"opt_level": "O1"}
```
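
The `loss_scale` argument in the O3 recommendation exists because small gradient values underflow in half precision. The sketch below illustrates the idea with Python's standard `struct` module, which can round a float to IEEE half precision. The helper `to_fp16`, the gradient value, and the fixed scale of 1024 are assumptions chosen for illustration; dynamic loss scaling adjusts the scale during training instead of fixing it.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8     # fine in fp32, below the smallest fp16 subnormal
loss_scale = 1024.0  # illustrative fixed scale

unscaled = to_fp16(tiny_grad)             # the value is lost in fp16
scaled = to_fp16(tiny_grad * loss_scale)  # the scaled value survives the round trip
recovered = scaled / loss_scale           # unscale in higher precision
```

Scaling the loss before the backward pass scales every gradient by the same factor, so dividing by that factor afterwards recovers values that would otherwise have been flushed to zero.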

# Using ZeRO

**ZeRO: Memory Optimizations Toward Training Trillion Parameter Models** is a technique developed by **Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He** in [this paper](https://arxiv.org/abs/1910.02054).
When training models with billions of parameters, GPU memory becomes a bottleneck. ZeRO can offer 4x to 8x reductions in memory, thus allowing larger models to fit in memory.


## How does ZeRO work?


Memory requirement of a model can be broken down roughly into:

1. activations memory
2. model parameters
3. parameter momentum buffers (optimizer state)
4. parameter gradients

ZeRO *shards* the optimizer state and the parameter gradients across devices, which reduces the memory needed per device. See [here](https://fairscale.readthedocs.io/en/latest/deep_dive/oss_sdp_fsdp.html) for a deep dive by [FairScale](https://github.com/facebookresearch/fairscale).
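
To get a feel for the effect of sharding, the following sketch does back-of-the-envelope accounting for items 2 to 4 of the list above. It assumes fp32 values, SGD with a single momentum buffer, and perfectly even sharding, and it ignores activations and communication buffers. The function `per_device_gib` is hypothetical, and the numbers are illustrative rather than measured; real savings depend on the optimizer and the implementation.

```python
def per_device_gib(num_params, world_size, shard_state_and_grads,
                   bytes_per_value=4, optimizer_buffers=1):
    """Rough per-device memory (GiB) for parameters, gradients and optimizer state.

    Hypothetical helper: assumes fp32 values and even sharding across devices.
    """
    params = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    opt_state = num_params * bytes_per_value * optimizer_buffers
    if shard_state_and_grads:
        # Each device keeps only its own slice of the gradients and optimizer state.
        grads /= world_size
        opt_state /= world_size
    return (params + grads + opt_state) / 2**30


# 2**30 parameters (about one billion) trained on 8 devices.
baseline = per_device_gib(2**30, world_size=8, shard_state_and_grads=False)
sharded = per_device_gib(2**30, world_size=8, shard_state_and_grads=True)
```

Optimizers that keep more state per parameter (set `optimizer_buffers` higher in this sketch) benefit more, since a larger share of the footprint is sharded.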


## How to use ZeRO in VISSL?

VISSL uses the [FairScale](https://github.com/facebookresearch/fairscale) library, which implements ZeRO in PyTorch.
Using ZeRO in VISSL involves only configuration changes and no code changes.

In order to use ZeRO, the user needs to set `OPTIMIZER.name=zero` and nest the desired optimizer (for example SGD) settings in `OPTIMIZER.base_optimizer`.

An example of using ZeRO with SGD, with the LARC settings nested in the base optimizer (set `use_larc: True` to enable LARC):
```yaml
OPTIMIZER:
  name: zero
  base_optimizer:
    name: sgd
    use_larc: False
    larc_config:
      clip: False
      trust_coefficient: 0.001
      eps: 0.00000001
    weight_decay: 0.000001
    momentum: 0.9
    nesterov: False
```

**NOTE**: ZeRO works seamlessly with LARC and mixed precision training. Using ZeRO with activation checkpointing is not yet enabled, primarily because activation checkpointing needs manual gradient reduction.


# Using the Stateful Data Sampler

## Issues with the PyTorch DistributedSampler for large data training

The PyTorch [torch.utils.data.distributed.DistributedSampler](https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L12) is the default sampler used for many trainings. However, it becomes limiting for large batch size trainings for 2 reasons:

1. **Large datasets cause shuffling slowdowns.** Assuming shuffling is enabled, each trainer shuffles the full data and then gets a view of this shuffled data. If the dataset is large (100 million, 1 billion or more), generating a very large permutation on each trainer can lead to large CPU memory consumption per machine. Hence, it becomes difficult to use the PyTorch default `DistributedSampler` when the user wants to train on large data and for several epochs (for example: 10 epochs of 100M images).

2. **Training cannot be resumed easily mid-epoch.** When the training is resumed mid-epoch, the sampler will serve the full dataset. However, in the case of large data trainings (like 1 billion images or more), one usually trains for 1 epoch only. Since such a training might take weeks, and machines often fail, we want the training to resume from the middle of the epoch. The PyTorch sampler will instead serve the full 1 billion images.


To solve both of the above issues, VISSL provides a custom sampler, `StatefulDistributedSampler`, which inherits from the PyTorch `DistributedSampler` and fixes the above issues in the following manner:

- The sampler creates the view of the data per trainer and then shuffles only the data that the trainer is supposed to view. This lessens the CPU memory requirement.

- The sampler adds an instance variable `start_iter` which tracks the model's iteration number within a given epoch. When the training is resumed, `start_iter` is set to the last iteration number and the sampler serves only the remainder of the data.
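
The two fixes above can be sketched in a few lines of plain Python. This is a simplified toy, not VISSL's implementation: the class `ToyStatefulSampler`, its constructor arguments, and the choice to give each trainer a contiguous slice of indices are assumptions made for illustration.

```python
import random


class ToyStatefulSampler:
    """Hypothetical, simplified stand-in for a stateful distributed sampler."""

    def __init__(self, dataset_size, world_size, rank, batch_size, seed=0):
        # Per-trainer share of the data; any remainder is dropped in this toy.
        self.num_samples = dataset_size // world_size
        self.rank = rank
        self.batch_size = batch_size
        self.seed = seed
        self.epoch = 0
        self.start_iter = 0  # iteration to resume from within the epoch

    def set_start_iter(self, start_iter):
        self.start_iter = start_iter

    def __iter__(self):
        # Build only this trainer's view of the data, then shuffle that view.
        first = self.rank * self.num_samples
        indices = list(range(first, first + self.num_samples))
        random.Random(self.seed + self.epoch).shuffle(indices)
        # On resume, skip the samples consumed by the completed iterations.
        return iter(indices[self.start_iter * self.batch_size:])


fresh = ToyStatefulSampler(dataset_size=100, world_size=4, rank=1, batch_size=5)
full_epoch = list(fresh)

resumed = ToyStatefulSampler(dataset_size=100, world_size=4, rank=1, batch_size=5)
resumed.set_start_iter(3)  # pretend 3 iterations finished before a failure
remaining = list(resumed)
```

Seeding the shuffle with the epoch number keeps the order reproducible, so a resumed run sees exactly the samples the interrupted run had not yet consumed.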



## How to use VISSL custom DataSampler


Using the VISSL-provided custom sampler `StatefulDistributedSampler` only requires setting the configuration options below:


```yaml
DATA:
  TRAIN:
    USE_STATEFUL_DISTRIBUTED_SAMPLER: True
  TEST:
    USE_STATEFUL_DISTRIBUTED_SAMPLER: True
```

**NOTE**: Users can use `StatefulDistributedSampler` for the training dataset and use the PyTorch default `DistributedSampler` for the test set. It is not mandatory to use the same sampler type for all data splits.

# Activation Checkpointing

Activation checkpointing is a very powerful technique to reduce the memory requirement of a model. This is especially useful when training very large models with billions of parameters.



## How does it work?

Activation checkpointing trades compute for memory. It discards intermediate activations during the forward pass, and recomputes them during the backward pass. In
our experiments, using activation checkpointing, we observe negligible compute overhead in memory-bound settings while getting big memory savings.

In summary, this technique offers 2 benefits:

- saves GPU memory that can be used to fit large models
- allows increasing the training batch size for a given model

We recommend that users read the documentation available [here](https://pytorch.org/docs/stable/checkpoint.html) for further details on activation checkpointing.
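
The compute-for-memory trade can be seen in a toy scalar "network". The sketch below is not PyTorch's implementation: the layer list, the function names, and the `num_splits` parameter are assumptions made for illustration. It backpropagates once while storing every layer input, and once while storing only one input per segment and recomputing the rest.

```python
import math

# Each layer is a (function, derivative) pair; 16 layers in total.
LAYERS = [(math.sin, math.cos), (math.tanh, lambda x: 1 - math.tanh(x) ** 2)] * 8


def backward_full(x):
    """Store every layer input during forward, then backpropagate."""
    inputs = []
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), inp in zip(reversed(LAYERS), reversed(inputs)):
        grad *= df(inp)
    return grad, len(inputs)  # gradient, number of stored activations


def backward_checkpointed(x, num_splits):
    """Store only segment inputs; recompute the rest during backward."""
    seg = max(1, len(LAYERS) // num_splits)
    segments = [LAYERS[i:i + seg] for i in range(0, len(LAYERS), seg)]
    checkpoints = []
    for segment in segments:  # forward: keep one value per segment
        checkpoints.append(x)
        for f, _ in segment:
            x = f(x)
    grad, peak = 1.0, 0
    for segment, start in zip(reversed(segments), reversed(checkpoints)):
        inputs, y = [], start
        for f, _ in segment:  # recompute this segment's activations
            inputs.append(y)
            y = f(y)
        # Upper bound on values held at once: checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for (_, df), inp in zip(reversed(segment), reversed(inputs)):
            grad *= df(inp)
    return grad, peak


grad_full, stored_full = backward_full(0.5)
grad_ckpt, stored_ckpt = backward_checkpointed(0.5, num_splits=4)
```

Both functions produce the same gradient, but the checkpointed version holds fewer values at any one time at the cost of running each layer's forward computation a second time.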


## How to use activation checkpointing in VISSL?

VISSL integrates the activation checkpointing implementation directly from PyTorch, available [here](https://pytorch.org/docs/stable/checkpoint.html).
Using activation checkpointing in VISSL only requires a few settings in the configuration file. The required settings are shown below:

```yaml
MODEL:
  ACTIVATION_CHECKPOINTING:
    # whether to use activation checkpointing or not
    USE_ACTIVATION_CHECKPOINTING: True
    # how many times the model should be checkpointed. User should tune this parameter
    # and find the number that offers best memory saving and compute tradeoff.
    NUM_ACTIVATION_CHECKPOINTING_SPLITS: 8
DISTRIBUTED:
  # if True, does the gradient reduction in DDP manually. This is useful during
  # activation checkpointing and can sometimes save the memory used by the PyTorch
  # gradient buckets.
  MANUAL_GRADIENT_REDUCTION: True
```