Author(s): Bartosz Ludwiczuk. Originally published on Towards AI.

· Introduction
· Vanishing gradient issue
· Mitigation of the vanishing gradient issue
· Training 1000-layer network
· Training component analysis
· Diving Deeper into Skip Connections
· 10000-layer network

Mean gradient for the 1st layer in all experiments

Introduction

One of the largest Convolutional Networks, ConvNext-XXLarge[1] from OpenCLIP[2], boasts approximately 850 million parameters and 120 layers (counting all convolutional and linear layers). This is a dramatic increase compared to the 8 layers of AlexNet[3], but still fewer than the 1001-layer experiment introduced in the PreResNet[4] paper. Interestingly, about a decade ago, training networks with more than 100 layers was considered nearly impossible due to the vanishing gradient problem. Advancements such as improved activation functions, normalization layers, and skip connections have since significantly mitigated this issue, or so it seems. But is the problem truly solved? In this blog post, I will explore:

· What components enable training neural networks with more than 1,000 layers?
· Is it possible to train a 10,000-layer Convolutional Neural Network successfully?

Vanishing gradient issue

Before diving into experiments, let's briefly revisit the vanishing gradient problem, a challenge that many sources have already explored in detail. The vanishing gradient problem occurs when the gradients in the early layers of a neural network become extremely small, effectively halting their ability to learn useful features. This issue arises due to the chain rule used during backpropagation, where the gradient is propagated backward from the final layer to the first. If the gradient in any layer is close to zero, the gradients of the preceding layers shrink exponentially.

A major cause of this behavior is the saturation of activation functions. To illustrate this, I trained a simple 5-layer network using the sigmoid activation function, which is particularly prone to saturation. You can find the code for this experiment on GitHub. The goal was to observe how the gradient norms of the network's weights evolve over time.

Gradient Norms Per Layer (Vanishing Gradient Issue). FC5 is the top layer, FC1 is the first layer. Image by author

The plot above shows the gradient norms for each linear layer over several training iterations. FC5 represents the final layer, while FC1 represents the first.

Vanishing Gradient Problem: In the first training iteration, there is a huge difference in gradient norms between FC5 and FC4, with FC4 being approximately 10x smaller. By the time we reach FC1, the gradient is reduced by a factor of ~10,000 compared to FC5, leaving almost nothing of the original gradient to update the weights. This is a textbook example of the vanishing gradient problem, primarily driven by activation function saturation.

Sigmoid activation function and its gradient. The plot also shows the pre-activation values and the corresponding activation and gradient values. Image by author

Let's delve deeper into the root cause: the sigmoid activation function. To understand its impact, I analyzed the first layer's pre-activation values (the inputs to the sigmoid). The findings:

· Most pre-activation values lie in the flat regions of the sigmoid curve, resulting in activations close to 0 or 1.
· In these regions, the sigmoid gradient is nearly zero, as shown in the plot above.

This means that any gradient passed backward through these layers is severely diminished, effectively disappearing by the time it reaches the first layers.
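To make that probe concrete, here is a minimal sketch of the kind of setup described above: a 5-layer fully connected network with sigmoid activations, where the gradient norm of each layer's weights is logged after every backward pass. The layer sizes, random input batch, and plain SGD loop are placeholder assumptions for illustration only; the exact experiment lives in the linked GitHub repository.

```python
import torch
import torch.nn as nn

# 5-layer fully connected network with sigmoid activations (FC1 ... FC5).
model = nn.Sequential(
    nn.Linear(32, 128), nn.Sigmoid(),   # FC1
    nn.Linear(128, 128), nn.Sigmoid(),  # FC2
    nn.Linear(128, 128), nn.Sigmoid(),  # FC3
    nn.Linear(128, 128), nn.Sigmoid(),  # FC4
    nn.Linear(128, 10),                 # FC5 (output logits)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
linear_layers = [m for m in model if isinstance(m, nn.Linear)]

for step in range(100):
    # Random placeholder batch; the real experiment trains on an actual dataset.
    x = torch.randn(64, 32)
    y = torch.randint(0, 10, (64,))

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()

    # Gradient norm of each Linear layer's weight matrix, FC1 ... FC5.
    grad_norms = {f"FC{i + 1}": layer.weight.grad.norm().item()
                  for i, layer in enumerate(linear_layers)}
    optimizer.step()

    if step % 20 == 0:
        print(step, grad_norms)
```

Running this for a few dozen iterations should reproduce the qualitative pattern from the plot above, with the FC1 gradient norm sitting several orders of magnitude below FC5's.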
The maximum gradient of the sigmoid function is 0.25, achieved at the midpoint of the curve. Even under ideal conditions, with 5 layers the maximum gradient diminishes to 0.25⁵ ≈ 1e-3. This reduction becomes catastrophic for networks with 1,000 layers, rendering the first layers' gradients negligible.

Skip connection. Source: Deep Residual Learning for Image Recognition, Kaiming He

Mitigation of the vanishing gradient issue

Several advancements have been instrumental in addressing the vanishing gradient problem, making it possible to train very deep neural networks. The key components are:

1. Activation Functions (e.g., Tanh, ReLU, GeLU)

Modern activation functions mitigate vanishing gradients by offering higher maximum gradient values and shrinking the regions where the gradient is zero. For example:

· ReLU (Rectified Linear Unit) has a maximum gradient of 1.0 and eliminates the saturation problem for positive inputs. This ensures gradients remain significant during backpropagation.
· Other functions, such as GeLU[5] and Swish[6], smooth out the gradient landscape, further improving training stability.

2. Normalization Techniques (e.g., BatchNorm[7], LayerNorm[8])

Normalization layers play a crucial role by adjusting pre-activation values to have a mean close to zero and a consistent variance. This helps in two significant ways:

· It reduces the likelihood of pre-activation values entering the saturation regions of activation functions, where gradients are nearly zero.
· It keeps the activations well-distributed across layers, ensuring more stable training.

For instance, BatchNorm[7] normalizes the input to each layer based on batch statistics during training, while LayerNorm[8] normalizes across the features of each sample, making it more effective in some scenarios.

3. Skip Connections (Residual Connections)

Skip connections, introduced in architectures like ResNet[9], allow the input signal to bypass one or more intermediate layers by adding the input directly to the layer's output, i.e., y = x + F(x). This mechanism addresses the vanishing gradient problem by:

· Providing a direct pathway for gradients to flow back to earlier layers without being multiplied by small derivatives or passed through saturating activation functions.
· Preserving gradients even in very deep networks, ensuring effective learning for earlier layers.

Because the skip path applies no multiplications or transformations, gradients remain intact, making skip connections a simple yet powerful tool for training ultra-deep networks.

Skip connection equation. Image by author

Training 1000-layer network

For this experiment, all training was conducted on the CIFAR-10[10] dataset. The baseline architecture was ConvNext[1], chosen for its scalability and effectiveness in modern vision tasks. To define successful convergence, I used a validation accuracy of >50% (compared to the 10% accuracy of random guessing). The source code is on GitHub, and all runs are available on Wandb. The following parameters were used across all experiments:

· Batch size: 64
· Optimizer: AdamW[11]
· Learning rate scheduler: OneCycleLR
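As a rough illustration of this setup, the sketch below wires the pieces listed above (CIFAR-10, batch size 64, AdamW, OneCycleLR) into a PyTorch training loop. The model constructor, learning rate, weight decay, transform, and epoch count are placeholder assumptions rather than the article's exact configuration; the real code and the deep ConvNext variants are in the linked repository.

```python
import torch
import torchvision
import torchvision.transforms as T

# CIFAR-10 with the batch size listed above; the transform is a bare-bones placeholder.
train_set = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True, transform=T.ToTensor()
)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64, shuffle=True, num_workers=2
)

# Placeholder model: a stock torchvision ConvNeXt with a 10-class head.
# The article's networks are custom ConvNext variants with far more layers.
model = torchvision.models.convnext_tiny(num_classes=10)

epochs = 30  # placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # placeholder lr/wd
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(train_loader)
)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch
```

Keeping the batch size, optimizer, and scheduler fixed like this means that depth is the only variable changing across the experiments that follow.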
My primary objective was to replicate the findings of the PreResNet paper and investigate how adding more layers impacts training. Starting with a 26-layer network as […]