Architecture Decisions Flow
Contents
Data Augmentation
- Train the network at one image size (224x224) and then fine-tune it for fewer epochs at a larger size (448x448, for example); see the sketch after this list
- Train an image detection network with an image classification dataset
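
A minimal sketch of the resize-then-fine-tune item above, assuming PyTorch/torchvision; the transform pipeline and the tiny model are illustrative, and global average pooling is what lets the same weights accept both resolutions.

```python
import torch.nn as nn
from torchvision import transforms

def make_train_transform(size):
    # Same augmentation pipeline, parameterized by training resolution.
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

small_tf = make_train_transform(224)  # phase 1: train here for most epochs
large_tf = make_train_transform(448)  # phase 2: fine-tune the same weights briefly

# Global average pooling keeps the classifier input-size agnostic,
# so a network trained at 224x224 also accepts 448x448 inputs.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10),
)
```
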
Initialization
- Random
- Xavier
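
A small NumPy sketch of Xavier (Glorot) uniform initialization for a fully connected layer; the formula is the standard fan-in/fan-out one, the function name and sizes are ours.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform: W ~ U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)); keeps activation variance
    # roughly constant across layers at initialization.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

W = xavier_uniform(fan_in=512, fan_out=256)  # weights for y = W @ x
```
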
Activation functions
- Sigmoid
- ReLU
- Leaky ReLU
- ELU
- SELU
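
For reference, a NumPy sketch of the activations listed above; the SELU constants are the fixed values from the self-normalizing networks paper, the default slopes are common choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small non-zero slope for negative inputs avoids dead units.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    # Fixed alpha/scale from the SELU paper.
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```
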
Loss functions
- hinge loss \(L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)\)
- cross-entropy loss
- Triplet loss - embedding (metric-learning) loss from FaceNet; see the sketch after this list
- Center loss - from "A Discriminative Feature Learning Approach for Deep Face Recognition" (Wen et al., ECCV 2016)
- Angular Softmax (A-Softmax) - from SphereFace
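
A NumPy sketch of two of the losses above: the multi-class hinge loss exactly as in the formula, and a single-triplet version of the FaceNet triplet loss (the margin value here is an assumption).

```python
import numpy as np

def hinge_loss(scores, y, delta=1.0):
    # L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + delta)
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0  # exclude the j == y_i term
    return margins.sum()

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull anchor-positive together, push anchor-negative apart by at
    # least `margin` (squared Euclidean distances between embeddings).
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)
```
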
Regularization
- Dropout
- GaussianDropout
- L1, L2, Lp
- Label smoothing (see the sketch after this list)
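
A small sketch of label smoothing in a common formulation (the epsilon value is an assumption): the one-hot target is mixed with a uniform distribution over the classes.

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    # True class gets 1 - eps + eps / K, every other class gets eps / K.
    onehot = np.eye(num_classes)[y]
    return onehot * (1.0 - eps) + eps / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=5)  # shape (2, 5)
```
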
Normalization
- Batch Norm
- Layer Norm
- Weight Norm
- Fusing parameters (e.g. folding BatchNorm statistics into the preceding convolution for inference; see the sketch after this list)
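
A sketch of the parameter-fusing item above, under the usual interpretation of folding BatchNorm into the preceding convolution at inference time; shapes assume a conv weight of (out_ch, in_ch, kH, kW).

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, running_mean, running_var, eps=1e-5):
    # Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the
    # convolution itself, so inference runs a single fused conv.
    scale = gamma / np.sqrt(running_var + eps)      # (out_ch,)
    w_fused = w * scale.reshape(-1, 1, 1, 1)        # rescale each filter
    b_fused = (b - running_mean) * scale + beta     # fold the shift into the bias
    return w_fused, b_fused
```
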
Optimizers
- First order (see the sketch after this list)
  - Gradient Descent
  - Momentum
  - Adam
- Second order
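
First-order update rules sketched in NumPy: plain gradient descent, classical momentum, and Adam; the hyperparameter defaults are the commonly used values.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, mu=0.9):
    # Accumulate a velocity and move along it.
    v = mu * v - lr * grad
    return w + v, v

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Per-parameter step sizes from bias-corrected first/second moments.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # t is the step count, starting at 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```
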
Convolutions
- Usual convolutions
  - Prefer stacks of 3x3 kernels over larger ones: the same receptive field with fewer parameters
- 1x1 convs(pointwise convolutions) from Network-in-network(NiN)
- Flattened convolutions(Cx1, 1xC kernels)(Paper)
- Depthwise separable convolutions (Xception); see the sketch after this list
  - 1x1 convs followed by per-channel (depthwise) 3x3 convs
  - Per-channel (depthwise) 3x3 convs followed by 1x1 convs over all features
- Grouped convolutions (initially in AlexNet, updated in ResNeXt)
- Shuffled Grouped Convolutions (ShuffleNet)
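
A PyTorch sketch of the depthwise separable ordering referenced above (depthwise 3x3 via groups=in_ch, then a pointwise 1x1), plus a plain grouped convolution; channel counts are illustrative.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    # Depthwise 3x3 (one filter per input channel, groups=in_ch)
    # followed by a pointwise 1x1 that mixes channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

def grouped_conv(in_ch, out_ch, groups=4):
    # Grouped 3x3 convolution as in AlexNet/ResNeXt; both channel counts
    # must be divisible by `groups`.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, groups=groups)
```
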
Other architecture decisions
- Average pooling as part of the last classifier
- Use strided, non-overlapping convolutions for downsampling instead of average/max pooling
- Inception module (parallel branches with different filter sizes plus 1x1 convs, concatenated afterwards)
- Bypassing features over two layers (as in ResNet or Highway Networks); see the sketch after this list
- Concatenating features from current layer with features from previous ones(as in DenseNet)
- Combine Inception Block with DenseNet approach
- Switch from a Cartesian coordinate system to a polar coordinate system
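
A minimal PyTorch sketch of the bypass idea above (a ResNet-style identity shortcut over two conv layers); channel counts and the activation choice are assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Identity shortcut over two 3x3 convolutions, as in ResNet.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # bypass: add the input back
```
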
A systematic evaluation of CNN modules
- Link to initial paper
- use ELU non-linearity without batchnorm or ReLU with it.
- apply a learned color space transformation of RGB.
- use the linear learning rate decay policy.
- use a sum of the average and max pooling layers (see the sketch after this list).
- use mini-batch size around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
- use fully-connected layers as convolutional and average the predictions for the final decision.
- when investing in increasing the training set size, check whether a plateau has not been reached.
- cleanliness of the data is more important than the size.
- if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
- if your network has a complex and highly optimized architecture, e.g. GoogLeNet, be careful with modifications.
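
A sketch of the "sum of average and max pooling" recommendation above, assuming PyTorch; the window size and stride are assumptions.

```python
import torch.nn as nn

class SumPool2d(nn.Module):
    # Element-wise sum of average and max pooling over the same window.
    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.avg = nn.AvgPool2d(kernel_size, stride)
        self.max = nn.MaxPool2d(kernel_size, stride)

    def forward(self, x):
        return self.avg(x) + self.max(x)
```
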