Best practices in training neural networks
Here are some best practices for training (large, deep) neural networks:
- Warm up the learning rate linearly to its initial (peak) value.
- Apply cosine learning rate decay once your model starts to plateau.
- Use a batch size warmup schedule.
- Read a mind-numbing number of papers.
- Exclude weight decay from your embeddings.
- Set AdamW \(\epsilon\) to 1e-8.
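
The schedule items in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the default values (peak rate, step counts, batch sizes), and the choice to start the cosine decay at a fixed step rather than by detecting a plateau are all hypothetical and not taken from the list itself.

```python
import math


def learning_rate(step, peak_lr=3e-4, warmup_steps=1000,
                  total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    All default values are illustrative assumptions, not recommendations.
    """
    if step < warmup_steps:
        # Ramp linearly; the last warmup step reaches exactly peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def batch_size(step, start=32, final=512, ramp_steps=10_000, multiple=32):
    """Linearly ramp the batch size, rounded down to a multiple of `multiple`.

    The linear ramp and the rounding are assumptions made for this sketch.
    """
    if step >= ramp_steps:
        return final
    size = start + (final - start) * step / ramp_steps
    return max(start, int(size) // multiple * multiple)


if __name__ == "__main__":
    for s in (0, 500, 999, 1000, 50_000, 100_000):
        print(s, learning_rate(s), batch_size(s))
```

In a training loop, these functions would be called once per step and their results written into the optimizer and the data loader. In practice the decay start and the ramp length depend on the model and the data.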