Quick Reference


One-shot learning
aims to learn not from thousands of examples but from one or only a few.
Transfer learning
applies an already trained model, with its previously acquired knowledge, to a new domain.
MLP
is an abbreviation for multilayer perceptron.
Computational neuroscience
is primarily concerned with building more accurate models of how the brain actually works.
Stochastic
a stochastic event or system is one that is unpredictable because of a random variable.
Deterministic system
is a system in which no randomness is involved in the development of future states of the system.
Gradient
is the vector of partial derivatives in each dimension.
Gradient Descent
the procedure of repeatedly evaluating the gradient and then performing a parameter update.
Stochastic Gradient Descent
depending on the source, it may mean Gradient Descent with mini-batches or with only one example at a time (on-line gradient descent).
Backpropagation
computing the gradient analytically using the chain rule.
Chain Rule
Gradient expressions can be chained by multiplying the output gradient by the local function gradient.
Bias
is a learner’s tendency to consistently learn the same wrong thing.
Variance
is the tendency to learn random things irrespective of the real signal.


Matrix multiplication

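import numpy as np

# shape (2, 3)
A = np.array([[1, 2, 3],
              [4, 5, 6]])
# shape (3, 2)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# shape (2, 2); evaluates to [[58, 64], [139, 154]]
Z = np.array([[1 * 7 + 2 * 9 + 3 * 11, 1 * 8 + 2 * 10 + 3 * 12],
              [4 * 7 + 5 * 9 + 6 * 11, 4 * 8 + 5 * 10 + 6 * 12]])
# multiply row i of A by column j of B element-wise and sum the result
for i in range(2):
    for j in range(2):
        assert Z[i, j] == np.sum(A[i] * B[:, j])
# the same product using the matrix multiplication operator
assert np.array_equal(Z, A @ B)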

Logarithms / Exponents

\begin{equation*} \ln (e ^ x) = x \end{equation*}
\begin{equation*} \log_{a} (a ^ x) = x \end{equation*}


Standard deviation: the square root of the variance, \(std = \sqrt{variance}\). For n values \(x_i\) with mean \(x_{mean}\): \(\sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_i - x_{mean})^2}\)
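
As a quick check of the formula, the sketch below computes the population standard deviation by hand (dividing by n) and lets you compare it with `np.std`, which also divides by n by default. The sample values are made up for illustration.

```python
import numpy as np

# Illustrative data (an assumption, not taken from the notes)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / len(x)
# Population variance: average squared deviation from the mean
variance = ((x - mean) ** 2).sum() / len(x)
std = np.sqrt(variance)

print(std, np.std(x))
```

Note that `np.std(x, ddof=1)` would divide by n - 1 instead, which gives the sample standard deviation.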

Activation functions

\begin{equation*} \textsf{Sigmoid} \end{equation*}
\begin{equation*} f(x) = \frac{1}{1 + e^{-x}} \end{equation*}
\begin{equation*} \textsf{Tanh} \end{equation*}
\begin{equation*} f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \end{equation*}
\begin{equation*} \textsf{Softmax} \end{equation*}
\begin{equation*} f(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}} \end{equation*}
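
A minimal NumPy sketch of the three functions, written directly from the formulas above. The function names are illustrative, and `softmax` here assumes a 1-D input vector; subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Uses the identity from the formula above: tanh(x) = 2 / (1 + e^(-2x)) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def softmax(x):
    # Shifting by the max leaves the ratios unchanged but avoids overflow in exp
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([1.0, 2.0, 3.0])
probabilities = softmax(scores)  # non-negative values that sum to 1
```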

Back propagation / Gradient descent

Back propagation

Starting from the final output, backpropagation recursively applies the chain rule to compute the gradients of every layer. For layer \(L_{i}\), the gradient is computed element by element from the gradient already computed for layer \(L_{i + 1}\).
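
To make the recursion concrete, here is a small sketch for a hypothetical two-layer scalar network (a sigmoid layer followed by a linear layer, with a squared-error loss). The network, the variable names, and the input values are all assumptions chosen for illustration; the point is that each layer's gradient is built from the gradient of the layer after it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w1, w2, x, t):
    z = w1 * x                  # layer 1 pre-activation
    h = sigmoid(z)              # layer 1 output
    y = w2 * h                  # layer 2 output
    loss = 0.5 * (y - t) ** 2   # squared error against target t
    return z, h, y, loss

def backward(w1, w2, x, t):
    z, h, y, loss = forward(w1, w2, x, t)
    dy = y - t                  # dloss/dy
    dw2 = dy * h                # gradient for the layer 2 weight
    dh = dy * w2                # gradient passed back to layer 1
    dz = dh * h * (1.0 - h)     # through the sigmoid: sigmoid'(z) = h * (1 - h)
    dw1 = dz * x                # gradient for the layer 1 weight
    return dw1, dw2

def numeric_grad(f, w, eps=1e-6):
    # Central difference, used only to check the analytic gradient
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, t = 0.5, -1.5, 2.0, 1.0
dw1, dw2 = backward(w1, w2, x, t)
check_w1 = numeric_grad(lambda w: forward(w, w2, x, t)[3], w1)
check_w2 = numeric_grad(lambda w: forward(w1, w, x, t)[3], w2)
```

The analytic values `dw1`, `dw2` should agree with the numerical estimates `check_w1`, `check_w2` up to the finite-difference error.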

Gradient descent

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. The gradient is the vector containing all of the partial derivatives. Each partial derivative is computed with respect to one input of the function while all other inputs are held fixed.
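
A minimal sketch of the update loop, assuming a simple quadratic bowl \(f(x, y) = (x - 3)^2 + (y + 1)^2\) as the function to minimize. The function, the learning rate, and the step count are illustrative choices, not values from these notes.

```python
import numpy as np

def f(p):
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

def grad_f(p):
    # Vector of partial derivatives: each one treats the other input as fixed
    return np.array([2.0 * (p[0] - 3.0), 2.0 * (p[1] + 1.0)])

def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    p = np.array(start, dtype=float)
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate
        p = p - learning_rate * grad(p)
    return p

minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print(minimum, f(minimum))
```

With this learning rate the distance to the minimum at (3, -1) shrinks by a factor of 0.8 on every step, so the loop converges quickly; a learning rate that is too large would make the iterates diverge instead.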

Stochastic gradient descent

Performs gradient descent using only a subset of the examples (a mini-batch, or a single example) for each parameter update.
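
The sketch below illustrates the mini-batch variant on a toy problem: fitting a line to noise-free points generated from \(y = 2x + 1\). The data, batch size, learning rate, and epoch count are assumptions picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0            # noise-free targets, so the exact fit is w=2, b=1

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 16

for epoch in range(200):
    order = rng.permutation(len(X))          # reshuffle the examples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = w * X[idx] + b - y[idx]
        # Gradients of the mean squared error computed on this batch only
        grad_w = 2.0 * np.mean(err * X[idx])
        grad_b = 2.0 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Setting `batch_size = 1` turns this into the on-line (one example per update) variant, and `batch_size = len(X)` recovers plain gradient descent.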

Validation metrics

Confusion matrix: a matrix containing the counts of true positives, false positives, true negatives, and false negatives.

Precision: \(\frac{{TruePositive}}{{TruePositive + FalsePositive}}\). Put another way, it is the number of true positives divided by the total number of positive predictions. A low precision indicates a large number of false positives. It answers the question of how many selected items are relevant. Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative.

Recall: \(\frac{{TruePositive}}{{TruePositive + FalseNegative}}\). Put another way, it is the number of true positives divided by the number of positive class values in the test data. Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many false negatives. It answers the question of how many relevant items are selected. Intuitively, recall is the ability of the classifier to find all the positive samples.

F1 score: \(\frac{{2*Recall*Precision}}{{Recall + Precision}}\), the harmonic mean of precision and recall, which balances the two.
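
These metrics can be computed directly from the four confusion-matrix counts. The sketch below assumes binary labels encoded as 0 and 1, uses made-up predictions, and returns 0.0 when a denominator is zero (a convention chosen here, not one stated in these notes).

```python
def confusion_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

# Illustrative labels: 3 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```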

kNN and k-means

kNN (k-nearest neighbors algorithm): a classification algorithm in which the class of an unlabeled element is assigned based on the classes of its k nearest neighbors.
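
A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote among the k nearest training points. The function name and the toy data are illustrative.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_predict(train_X, train_y, np.array([0.5, 0.5]))
label_far = knn_predict(train_X, train_y, np.array([5.5, 5.5]))
```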

k-means: a clustering algorithm. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster:

  • Choose k initial cluster centers
  • Calculate the distance from every point to every cluster center
  • Assign each point to the nearest cluster center
  • Recalculate the new cluster centers: \(v_i = (1/c_i) \sum_{j=1}^{c_i} x_j\), where \(c_i\) is the number of data points in the \(i^{th}\) cluster and \(x_j\) are the points assigned to it
  • Recalculate the distances from every point to the new cluster centers
  • If no points were reassigned, stop; otherwise repeat from the assignment step
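
The loop above can be sketched in NumPy as follows. This is one possible implementation under stated assumptions: the initial centers are k randomly chosen data points, distances are Euclidean, and an empty cluster keeps its previous center. The function name, parameters, and toy data are illustrative.

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points (a common choice)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # Distance from every point to every center, shape (n, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point was reassigned
        labels = new_labels
        # New center of each cluster = mean of the points assigned to it
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[i] = members.mean(axis=0)
    return centers, labels

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = k_means(points, k=2)
```

Which cluster receives index 0 depends on the random initialisation, so only the grouping of points, not the label values, is meaningful.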

Covariance and correlation

Both describe the degree to which two random variables or sets of random variables tend to deviate from their expected values in similar ways.

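If \(X\) and \(Y\) are two random variables, with means (expected values) \(\mu_X\) and \(\mu_Y\) and standard deviations \(\sigma_X\) and \(\sigma_Y\), respectively, then their covariance and correlation are as follows:

\begin{equation*} \mathrm{cov}_{XY} = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] \end{equation*}
\begin{equation*} \mathrm{corr}_{XY} = \rho_{XY} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \end{equation*}
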
where \(E[ ]\) is the expected value, also known as the mean.
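
As a small numerical illustration, the sketch below computes the covariance as the mean of the product of the deviations from the means, and the correlation as that covariance divided by the product of the standard deviations. It uses population (divide by n) versions throughout, and the data values are made up.

```python
import numpy as np

# Illustrative paired observations (an assumption, not from the notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Expected values are replaced by plain means over the sample
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / (x.std() * y.std())

# The same quantities from NumPy; bias=True makes np.cov divide by n
cov_np = np.cov(x, y, bias=True)[0, 1]
corr_np = np.corrcoef(x, y)[0, 1]
```

The correlation is unitless and always lies between -1 and 1, while the covariance carries the units of both variables.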


Principal component analysis (PCA)

Principal components: the components with the most variation, i.e. the directions in which the data is most spread out.

Eigenvectors and eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction. An eigenvalue is a number telling you how much variance the data has in that direction, that is, how spread out the data is along that line. The eigenvector with the highest eigenvalue is therefore the principal component.

In fact, the number of eigenvector/eigenvalue pairs equals the number of dimensions of the data set.

Dimension reduction is performed by dropping the eigenvectors with small eigenvalues, so that only the eigenvectors with large eigenvalues remain.

A Multiple Discriminant Analysis (MDA) approach also exists. In MDA we are additionally interested in finding the directions that maximize the separation (or discrimination) between different classes, for example in pattern classification problems where the dataset consists of multiple classes. This is in contrast to PCA, which ignores the class labels.

PCA step by step:

  • Compute the mean of every dimension.
  • Compute the scatter matrix \(S = \sum\limits_{k=1}^n (\pmb x_k - \pmb m)\;(\pmb x_k - \pmb m)^T\), where \(\pmb m\) is the mean vector.
  • Alternatively, compute the covariance matrix (numpy.cov function), a matrix whose element in the i, j position is the covariance between the \(i^{th}\) and \(j^{th}\) elements of a random vector.
  • Compute the eigenvectors and eigenvalues: eig_val_sc, eig_vec_sc = np.linalg.eig(scatter_matrix)
    • Eigenvalues \(\lambda\) can be obtained by solving the equation \(|\textbf{A} - \lambda \textbf{I}| = 0\), where \(\textbf{A}\) is a matrix and \(| |\) means determinant.
    • Eigenvectors \(\pmb v_j\) can then be obtained from \((\textbf{A} - \lambda_j \textbf{I})\pmb v_j = 0\).
  • Check the correctness of the eigenvectors and eigenvalues with \(\pmb\Sigma\pmb{v} = \lambda\pmb{v}\), where \(\pmb\Sigma\) is the covariance matrix, \(\pmb{v}\) an eigenvector, and \(\lambda\) its eigenvalue.
  • Sort the eigenvectors by decreasing eigenvalue.
  • Choose the k eigenvectors with the largest eigenvalues and stack them as columns to form the matrix \(\pmb W\).
  • To reduce the dimension, compute \(\pmb y = \pmb W^T \times \pmb x\).
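
The steps above can be sketched in NumPy roughly as follows. Two choices here differ from the step list and are assumptions of this sketch: it uses `np.linalg.eigh` (intended for symmetric matrices such as a covariance matrix) rather than `np.linalg.eig`, and it subtracts the mean before projecting, which is a common convention. The function name and the synthetic data are illustrative.

```python
import numpy as np

def pca(X, k):
    """X has shape (n_samples, n_features). Returns (Y, W, eigenvalues)."""
    mean = X.mean(axis=0)                     # mean of every dimension
    centered = X - mean
    cov = np.cov(centered, rowvar=False)      # (d, d) covariance matrix
    # eigh returns real eigenvalues in ascending order for symmetric input
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]        # sort by decreasing eigenvalue
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    W = eig_vecs[:, :k]                       # (d, k): top-k eigenvectors as columns
    Y = centered @ W                          # each row is W^T x for one sample
    return Y, W, eig_vals

# Synthetic 2-D data lying close to the line y = 2x
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=100)])

Y, W, eig_vals = pca(X, k=1)
```

For this data the first principal direction `W[:, 0]` should be nearly parallel to (1, 2), and the second eigenvalue should be tiny compared with the first. The sign of an eigenvector is arbitrary, so `W[:, 0]` may point either way along that line.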

L1 / L2 regularization

The idea of regularization is to add an extra term to the cost function, a term called the regularization term.

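The regularization term based on the \(L_p\) norm can be computed as \(||x||_{p}=(\sum_{i}|x_{i}|^{p})^{1/p}\), which gives the sum of absolute values for \(p = 1\) (L1) and the Euclidean length for \(p = 2\) (L2).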
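
A short sketch of the norm and of how it could be added to a cost. The helper `regularized_cost` and its `lam` (regularization strength) parameter are hypothetical names introduced for illustration. In practice the L2 penalty is often taken as the squared norm, which this sketch does not do.

```python
import numpy as np

def lp_norm(x, p):
    # ||x||_p = (sum_i |x_i|^p)^(1/p)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def regularized_cost(data_cost, weights, lam, p):
    # Hypothetical helper: original cost plus a weighted L_p penalty
    return data_cost + lam * lp_norm(weights, p)

weights = np.array([3.0, -4.0])
l1 = lp_norm(weights, 1)   # |3| + |-4| = 7
l2 = lp_norm(weights, 2)   # sqrt(3^2 + 4^2) = 5

cost_l1 = regularized_cost(1.0, weights, lam=0.1, p=1)
```

The same values are available from `np.linalg.norm(weights, 1)` and `np.linalg.norm(weights, 2)`.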

Great explanation can be found on stackoverflow or here


Something used for non-differentiable functions. Should be filled in.