Conference paper
Partial data permutation for training deep neural networks
Abstract
Random data permutation is considered a best practice for training deep neural networks. When the input is large, permuting the full dataset is costly and limits scaling on distributed systems. Some practitioners therefore use partial or no permutation, which may result in poor convergence. We propose a partitioned data permutation scheme as a low-cost alternative to full data permutation. By analyzing their entropy, we show that the two sampling schemes are asymptotically identical. We also show that with minibatch SGD, both sampling schemes produce unbiased estimators of the true gradient and share the same bound on the second moment of the gradient; thus they have similar convergence properties. Our experiments confirm that, in practice, SGD achieves similar training performance with both sampling schemes. We further show that, due to inherent randomness in training such as data augmentation and dropout, even cheaper sampling schemes than partial permutation, such as sequential sampling, can achieve good performance. However, if no extra randomness is present in training, sampling schemes with low entropy can indeed degrade performance significantly.
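To make the comparison concrete, the sketch below contrasts full permutation with one plausible reading of partitioned permutation: the index range is split into fixed partitions, indices are shuffled only within each partition, and the partition order is shuffled. The function names, the partitioning strategy, and the choice of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def full_permutation(n, rng):
    """Full data permutation: shuffle all n sample indices each epoch."""
    return rng.permutation(n)

def partitioned_permutation(n, num_parts, rng):
    """Hypothetical partitioned permutation: split the index range into
    num_parts contiguous partitions, shuffle within each partition, and
    shuffle the order of the partitions. Only n/num_parts indices are
    permuted at a time, which is cheaper than a full shuffle and maps
    naturally onto distributed workers that each own one partition."""
    parts = np.array_split(np.arange(n), num_parts)
    shuffled_parts = [rng.permutation(p) for p in parts]
    rng.shuffle(shuffled_parts)  # permute the order of the partitions
    return np.concatenate(shuffled_parts)

def minibatch_indices(order, batch_size):
    """Yield minibatch index slices following the given sample order."""
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, batch_size = 16, 4
    for epoch in range(2):
        order = partitioned_permutation(n, num_parts=4, rng=rng)
        for batch in minibatch_indices(order, batch_size):
            pass  # run one minibatch SGD step on the samples at these indices
```

Under this reading, each epoch permutes only within partitions rather than over the whole dataset, which is the low-cost property the abstract refers to; sequential sampling would correspond to skipping the shuffles entirely and reading indices in order.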