Architecture vs. Optimization Approaches

On this page I will try distinguish various approaches that now used for building NN. For example using convolution or LSTM is an architecture approach. But using batch normalization or not - is optimization, because it can be added to any network without any architecture changes.


  • Convs of with various kernels
  • 1x1 convolutions from Network-in-network(NiN). With them we may reduce features > perform usual convolution > increase number of features
  • Average pooling layer as part of the last classifier
  • Inception module(parallel computation of various filter with 1x1 convs and after concatenating them)
  • Flattened convolutions(Cx1, 1xC kernels)
  • Bypassing features over two layers(as in ResNet)
  • Concatenating features from current layer with features from previous ones(as in DenseNet)
  • Inception V4 - combine ResNet features propagating approach with Inception module.
  • Combine Inception Block with DenseNet approach.
  • Blog post about various Neural Networks for image classification.
  • XCeption block with separable convolutions(with or without ReLU after it).
  • Depthwise separable convolution filters initial paper
  • LSTM or GRU cell
  • attention mechanisms
  • Various activation functions
  • Max pooling or average pooling
  • Use conv with stride without overlaping, not average/max pooling
  • 1x1 convs and then separable by channels 3x3 convs
  • Separable by channnels 3x3 convs and after 1x1 convs for all features


  • Batch norm
  • Regularization loss
  • Various learning rate
  • Dropout


  • Dataset augmentation
  • Learn network to one image size(224x224) and fine tune after for less epochs to larger size(448x448 for example)
  • Train image detection network with image classification dataset

A systematic evaluation of CNN modules:

  • Link to initial paper
  • use ELU non-linearity without batchnorm or ReLU with it.
  • apply a learned colorspace transformation of RGB.
  • use the linear learning rate decay policy.
  • use a sum of the average and max pooling layers.
  • use mini-batch size around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
  • use fully-connected layers as convolutional and average the predictions for the final decision.
  • when investing in increasing training set size, check if a plateau has not been reach.
  • cleanliness of the data is more important then the size.
  • if you cannot increase the input image size, reduce the stride in the con- sequent layers, it has roughly the same effect.
  • if your network has a complex and highly optimized architecture, like e.g. GoogLeNet, be careful with modifications.