Best practices in training neural networks

Here are some best practices for training (large, deep) neural networks:

Use a linear warmup to your initial learning rate.
Use cosine decay once your model starts to plateau.
Use a batch size warmup schedule.
Read a mind-numbing number of papers.
Exclude your embeddings from weight decay.
Set AdamW \(\epsilon\) to 1e-8.
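
As a rough illustration, several of the tips above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a definitive recipe: the function names, the `decay_start` step (standing in for "when the model starts to plateau", which you would pick by watching your loss curve), the linear shape of the batch size ramp, and the convention that embedding parameters have "embed" in their name are all hypothetical choices made for the example.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, decay_start, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, hold, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Hold the peak rate until decay begins (decay_start is chosen by hand).
        return peak_lr
    # Cosine decay over the remaining steps, clamped at the end of training.
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def batch_size(step, start_size, final_size, ramp_steps):
    """Linear batch size warmup, rounded down to a multiple of start_size."""
    if step >= ramp_steps:
        return final_size
    size = start_size + (final_size - start_size) * step / ramp_steps
    return max(start_size, int(size) // start_size * start_size)


def split_weight_decay(named_params, weight_decay):
    """Build optimizer parameter groups that exempt embeddings from weight decay."""
    decay, no_decay = [], []
    for name, param in named_params:
        # Assumption: embedding parameters have "embed" in their name.
        if "embed" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

With PyTorch, the groups returned by `split_weight_decay(model.named_parameters(), weight_decay)` could be passed to `torch.optim.AdamW(groups, lr=peak_lr, eps=1e-8)`, and `learning_rate` could be applied to the optimizer's parameter groups at every step.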