Anomaly Detection Using Gaussian Distribution#

Jupyter Demos#

▶️ Demo | Anomaly Detection - find anomalies in server operational parameters like latency and threshold

Gaussian (Normal) Distribution#

The normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

Let’s say:

$x \in \mathbb{R}$

If x is normally distributed then it may be displayed as follows.

[Figure: Gaussian Distribution]

$\mu$ - mean value,

$\sigma^2$ - variance.

$x \sim \mathcal{N}(\mu, \sigma^2)$ - "~" means that "x is distributed as ..."

Then the Gaussian probability density (how likely some x is under a distribution with a given mean and variance) is given by:

$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
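As a minimal sketch of the density formula above, the function below evaluates it for a single value. The name `gaussian_pdf` and its argument names are illustrative choices, not taken from the demo notebook.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x."""
    normalizer = math.sqrt(2.0 * math.pi * sigma2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma2)
    return math.exp(exponent) / normalizer


# The density is highest at the mean and falls off symmetrically on both sides.
peak = gaussian_pdf(0.0, mu=0.0, sigma2=1.0)
tail = gaussian_pdf(3.0, mu=0.0, sigma2=1.0)
print(peak, tail)
```

Note that the function takes the variance rather than the standard deviation, so that it lines up with the sigma-squared notation used in the text.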

Estimating Parameters for a Gaussian#

We may use the following formulas to estimate the Gaussian parameters (mean and variance) for the i-th feature:

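$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$

$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$

$$i = 1, 2, \dots, n$$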

m - number of training examples.

n - number of features.
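The estimation step can be sketched in plain Python as follows. This assumes the training set is a list of m rows with n numeric features each; `estimate_gaussian` and `X_train` are hypothetical names used only for illustration.

```python
def estimate_gaussian(X):
    """Return per-feature means and variances (dividing by m, not m - 1)."""
    m = len(X)      # number of training examples
    n = len(X[0])   # number of features
    mu = [sum(row[i] for row in X) / m for i in range(n)]
    sigma2 = [sum((row[i] - mu[i]) ** 2 for row in X) / m for i in range(n)]
    return mu, sigma2


# Toy training set: 3 examples, 2 features.
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mu, sigma2 = estimate_gaussian(X_train)
print(mu, sigma2)
```

The variance here divides by m, which corresponds to the population variance (what `statistics.pvariance` computes) rather than the sample variance with m - 1.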

Density Estimation#

So we have a training set:

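$\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$

$x \in \mathbb{R}^n$

We assume that each feature of the training set is normally distributed:

$x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$

$x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$

$x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$

Then:

$$p(x) = p(x_1; \mu_1, \sigma_1^2) \, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$

$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2)$$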
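The product over features can be illustrated with a short sketch. It treats the features as independent, as the formula above does; the helper names and the example values of `mu` and `sigma2` are made up for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def density(x, mu, sigma2):
    """p(x) as the product of the per-feature Gaussian densities."""
    p = 1.0
    for x_i, mu_i, sigma2_i in zip(x, mu, sigma2):
        p *= gaussian_pdf(x_i, mu_i, sigma2_i)
    return p


# Hypothetical fitted parameters for two features.
mu = [0.0, 5.0]
sigma2 = [1.0, 4.0]

p_typical = density([0.0, 5.0], mu, sigma2)   # example sitting at the means
p_unusual = density([4.0, 12.0], mu, sigma2)  # example far from both means
print(p_typical, p_unusual)
```

With many features the product can underflow to zero. Summing log-densities instead is a common workaround, though it is not part of the formulas above.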

Anomaly Detection Algorithm#

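  1. Choose features $x_i$ that might be indicative of anomalous examples ($\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$).

  2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$ using formulas:

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$

$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$

  3. Given new example x, compute p(x):

$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$

Anomaly if $p(x) < \varepsilon$

$\varepsilon$ - probability threshold.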
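The steps above can be put together in one small sketch. The training data, the function names, and the threshold value `epsilon = 1e-3` are all assumptions chosen for illustration; in practice the threshold would be tuned on labelled data.

```python
import math


def fit(X):
    """Estimate the per-feature mean and variance from the training examples."""
    m, n = len(X), len(X[0])
    mu = [sum(row[i] for row in X) / m for i in range(n)]
    sigma2 = [sum((row[i] - mu[i]) ** 2 for row in X) / m for i in range(n)]
    return mu, sigma2


def density(x, mu, sigma2):
    """p(x): product of the per-feature Gaussian densities."""
    p = 1.0
    for x_i, mu_i, s2_i in zip(x, mu, sigma2):
        p *= math.exp(-((x_i - mu_i) ** 2) / (2.0 * s2_i)) / math.sqrt(2.0 * math.pi * s2_i)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mu, sigma2) < epsilon


# Hypothetical training set with two features per example.
X_train = [[10.0, 50.0], [11.0, 52.0], [9.0, 48.0], [10.5, 51.0], [9.5, 49.0]]
mu, sigma2 = fit(X_train)

epsilon = 1e-3  # assumed threshold, not a recommended value
flag_normal = is_anomaly([10.0, 50.0], mu, sigma2, epsilon)
flag_outlier = is_anomaly([20.0, 80.0], mu, sigma2, epsilon)
print(flag_normal, flag_outlier)
```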

Algorithm Evaluation#

The algorithm may be evaluated using the F1 score.

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0.

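[Figure: F1 Score]

$$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

Where:

$$precision = \frac{tp}{tp + fp}$$

$$recall = \frac{tp}{tp + fn}$$

tp - number of true positives.

fp - number of false positives.

fn - number of false negatives.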
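A minimal sketch of the F1 computation, assuming labels are encoded as 1 for anomaly and 0 for normal. The function name `f1_score` and the example labels are illustrative. Returning 0 when there are no true positives is a convention chosen here to avoid division by zero.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 marks an anomaly and 0 a normal example."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # Precision or recall is zero or undefined; treat F1 as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Made-up ground truth and predictions: tp = 2, fp = 1, fn = 1.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
score = f1_score(y_true, y_pred)
print(score)
```

On a labelled validation set, a function like this could be used to compare several candidate values of epsilon and keep the one with the highest score.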

References#