Architecture Decisions Flow

Data Augmentation

Train the network at one image size (224x224), then fine-tune for fewer epochs at a larger size (448x448, for example)
Train an object detection network using an image classification dataset

Usual convolutions
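3x3 kernels are generally the better choice (stacked 3x3 layers cover the receptive field of a larger kernel with fewer parameters)
1x1 convolutions (pointwise convolutions), from Network in Network (NiN)
Flattened convolutions (Cx1 and 1xC kernels) (Paper)
Depthwise separable convolutions (Xception)
- 1x1 convolutions first, then per-channel (depthwise) 3x3 convolutions
- Per-channel (depthwise) 3x3 convolutions first, then 1x1 convolutions across all features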
Grouped convolutions (initially in AlexNet, updated in ResNeXt)
Shuffled grouped convolutions (ShuffleNet)
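
To make the trade-offs between these convolution variants concrete, here is a minimal sketch in plain Python that counts weights (biases ignored) for a standard, grouped, and depthwise separable convolution, and shows the channel reordering used by shuffled grouped convolutions. The function names and the 256-channel example are illustrative assumptions, not taken from any of the cited papers' code.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution, biases ignored."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1x1 convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes all channels
    return depthwise + pointwise


def channel_shuffle(channels, groups):
    """Reorder channels so the next grouped convolution mixes across groups.

    Equivalent to reshaping to (groups, n), transposing, and flattening.
    """
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


if __name__ == "__main__":
    print("standard 3x3:      ", conv_params(256, 256, 3))
    print("grouped 3x3 (g=32):", conv_params(256, 256, 3, groups=32))
    print("depthwise sep. 3x3:", depthwise_separable_params(256, 256, 3))
    print("shuffle:           ", channel_shuffle(list(range(6)), groups=2))
```

For 256 input and output channels, the grouped and depthwise separable variants need far fewer weights than the standard 3x3 convolution, which is the main reason these variants are used.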

Average pooling as part of the last classifier
Use strided convolutions without overlap (stride equal to kernel size) instead of average/max pooling
Inception module (parallel branches with different filter sizes, using 1x1 convolutions, whose outputs are then concatenated)
Bypass features over two layers (as in ResNet or Highway Networks)
Concatenate features from the current layer with features from previous ones (as in DenseNet)
Combine Inception Block with DenseNet approach
Switch from Cartesian coordinate system to Polar coordinate system
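
Two of the items above come down to simple shape bookkeeping: how a strided convolution changes spatial size, and how channel counts behave under ResNet-style addition versus DenseNet-style concatenation. The sketch below uses the standard convolution output-size formula; the helper names and the example sizes are assumptions chosen for illustration.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def residual_channels(c_in):
    """ResNet-style bypass adds features element-wise, so channels stay the same."""
    return c_in


def dense_block_channels(c_in, growth_rate, num_layers):
    """DenseNet-style concatenation: every layer appends growth_rate channels."""
    return c_in + growth_rate * num_layers


if __name__ == "__main__":
    # Non-overlapping 2x2 convolution with stride 2 halves a 224-pixel axis.
    print(conv_output_size(224, kernel=2, stride=2))
    # Overlapping 3x3 window with stride 2, for comparison.
    print(conv_output_size(224, kernel=3, stride=2))
    print(residual_channels(64))
    print(dense_block_channels(64, growth_rate=32, num_layers=6))
```

The contrast in the last two calls is the design point: addition keeps the width fixed, while concatenation grows it linearly with depth, which is why dense blocks are usually followed by a layer that reduces the channel count.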

Link to initial paper
use ELU non-linearity without batchnorm or ReLU with it.
apply a learned color space transformation of RGB.
use the linear learning rate decay policy.
use a sum of the average and max pooling layers.
use a mini-batch size of around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
treat fully-connected layers as convolutional layers and average the predictions for the final decision.
when investing in increasing the training set size, check whether a plateau has already been reached.
cleanliness of the data is more important than the size.
if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
if your network has a complex and highly optimized architecture, like e.g. GoogLeNet, be careful with modifications.
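
Three of the tips above translate directly into small formulas: the linear learning rate decay, scaling the learning rate with the batch size, and summing average and max pooling. The following is a minimal sketch in plain Python; the function names, the decay-to-zero endpoint, and the example values are assumptions for illustration rather than settings from the paper.

```python
def linear_decay_lr(base_lr, step, total_steps):
    """Learning rate that falls linearly from base_lr at step 0 to 0 at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps)


def scale_lr_for_batch(base_lr, base_batch, new_batch):
    """Scale the learning rate proportionally to the mini-batch size."""
    return base_lr * new_batch / base_batch


def avg_plus_max_pool(window):
    """Sum of average pooling and max pooling over one pooling window."""
    return sum(window) / len(window) + max(window)


if __name__ == "__main__":
    # Learning rate halfway through training.
    print(linear_decay_lr(0.01, step=50, total_steps=100))
    # Batch size reduced from 256 to 64 to fit in GPU memory.
    print(scale_lr_for_batch(0.01, base_batch=256, new_batch=64))
    # One flattened 2x2 pooling window.
    print(avg_plus_max_pool([1, 2, 3, 6]))
```

In a real training loop the first two functions would typically be combined: scale the base rate once for the chosen batch size, then apply the linear decay per step.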