**1. PARADIGMS OF LEARNING**

**Interpretation of Probability**

Probability expresses the uncertainty about whether an event will occur, and it is a key concept in pattern recognition.

Let's assume two random variables **X** and **Y** such that **X** can take the values $x_i$, where $i = 1, \dots, M$, and **Y** can take the values $y_j$, where $j = 1, \dots, L$.

Also, out of a total of $N$ trials, let the number of times **X** takes the value $x_i$ be $c_i$, the number of times **Y** takes the value $y_j$ be $r_j$, and the number of times that $X = x_i$ and $Y = y_j$ occur together be $n_{ij}$.

*Note the following:*

1. The probability that **X** takes the value $x_i$ and **Y** takes the value $y_j$ is written as:

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$

This is called the joint probability of $X = x_i$ and $Y = y_j$.

2. The probability that **X** takes the value $x_i$ is given as:

$$p(X = x_i) = \frac{c_i}{N}$$

**Rules of Probability**

The two rules of probability are the sum rule and the product rule:

$$p(X) = \sum_{Y} p(X, Y) \qquad \text{(sum rule)}$$

$$p(X, Y) = p(Y \mid X)\, p(X) \qquad \text{(product rule)}$$
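As a rough illustration, the counting definitions can be checked numerically. The sketch below assumes a small, made-up list of (X, Y) observations; the names `trials`, `joint`, `marginal_x` and `conditional_y_given_x` are hypothetical and not from the original notes. It estimates the joint and marginal probabilities from counts and confirms that the sum rule, p(X) = Σ_Y p(X, Y), and the product rule, p(X, Y) = p(Y | X) p(X), hold for those estimates.

```python
from collections import Counter

# Hypothetical observations of (X, Y) pairs; the values are invented for illustration.
trials = [("a", 0), ("a", 1), ("b", 0), ("a", 0),
          ("b", 1), ("b", 1), ("a", 0), ("b", 0)]
N = len(trials)                      # total number of trials

n = Counter(trials)                  # n[(x, y)]: times X = x and Y = y occur together
c = Counter(x for x, _ in trials)    # c[x]: times X = x
y_values = sorted({y for _, y in trials})


def joint(x, y):
    """Joint probability p(X = x, Y = y) = n_ij / N."""
    return n[(x, y)] / N


def marginal_x(x):
    """Marginal probability p(X = x) = c_i / N."""
    return c[x] / N


def conditional_y_given_x(y, x):
    """Conditional probability p(Y = y | X = x) = n_ij / c_i."""
    return n[(x, y)] / c[x]


# Sum rule: p(X = x) equals the joint probability summed over every value of Y.
sum_rule_ok = all(
    abs(marginal_x(x) - sum(joint(x, y) for y in y_values)) < 1e-12
    for x in c
)

# Product rule: p(X = x, Y = y) equals p(Y = y | X = x) * p(X = x).
product_rule_ok = all(
    abs(joint(x, y) - conditional_y_given_x(y, x) * marginal_x(x)) < 1e-12
    for x in c
    for y in y_values
)

print(joint("a", 0), marginal_x("a"), sum_rule_ok, product_rule_ok)
```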

**Bayesian Model**

Bayesian model comparison uses probabilities to represent uncertainty in the choice of model, together with consistent application of the sum and product rules of probability.

**2. STATISTICAL DECISION THEORY**

**Optimal Decision**

In decision theory, an optimal decision is the choice among the possible alternatives whose expected outcome is at least as good as that of every other available option.

**Receiver Operating Characteristic Curve (ROC)**

This is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It plots the true positive rate against the false positive rate at each threshold.

**Area Under the ROC Curve**

The AUC is equal to the probability that a classifier would rank a randomly chosen positive instance higher than a randomly chosen negative one.
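The ranking interpretation of the AUC can be computed directly by comparing every positive instance with every negative one. The sketch below is a minimal illustration under assumed inputs: the function name `auc_by_ranking` and the example scores and labels are hypothetical, and ties are counted as half a win, which is a common convention rather than something stated above.

```python
def auc_by_ranking(scores, labels):
    """Estimate the AUC as the fraction of (positive, negative) pairs in which
    the positive instance receives the higher score; ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative instance")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_ranking(scores, labels)
print(round(auc, 3))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the estimate is 8/9. The pairwise loop is quadratic in the number of instances, which is fine for a sketch but not for large data sets.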

**3. PHASES OF DATA ANALYSIS USING MACHINE LEARNING, VEDA**

VEDA - Visual Exploratory Data Analysis

Exploratory data analysis in statistics and machine learning is an approach to analyzing data sets in order to summarize their key features, most often using visual methods.

**Study Design**

Study design is an aspect of exploratory data analysis: the set of methods and procedures used in collecting and analyzing measures of the variables specified in the research problem. The design of a study specifies the study type and sub-type.

**Exploratory vs Confirmatory Data Analysis**

Confirmatory data analysis tests a priori hypotheses, that is, outcome predictions made before the measurement phase begins. Exploratory data analysis instead seeks to generate a posteriori hypotheses by examining the dataset and looking for potential relations between the variables.

**What is Anomaly Detection?**

Anomaly detection is the identification of items, events or observations that do not conform to an expected pattern or to other items in the dataset. Anomalies are also referred to as outliers, novelties, noise, deviations or exceptions. Anomaly detection can be carried out using any of the following methods:

- Density-based techniques such as k-nearest neighbor and local outlier factor
- Subspace- and correlation-based techniques
- One-class SVM
- Replicator neural networks
- Cluster-analysis-based techniques
- Fuzzy-logic-based techniques
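To make the first item in the list concrete, here is a minimal sketch of a k-nearest-neighbor outlier score: each point is scored by its distance to its k-th nearest neighbor, so points in sparse regions score high. The function name `knn_outlier_scores`, the choice `k=2` and the sample points are assumptions made for illustration; this is one simple variant, not the only way to apply k-nearest neighbors to anomaly detection.

```python
import math


def knn_outlier_scores(points, k=2):
    """Return, for each point, the Euclidean distance to its k-th nearest
    neighbor. A large value suggests the point lies in a low-density region."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical 2-D data: four points close together and one far away.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]

scores = knn_outlier_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

In practice a threshold on the score (or a fixed number of top-scoring points) decides what is flagged as an anomaly; that threshold depends on the data and is not fixed by the method.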

**4. LINEAR MODELS FOR REGRESSION**

Linear Regression. Linear-in-Parameter models.

The goal of regression is to predict the value of one or more continuous target variables *t*, given the value of a D-dimensional vector **x** of the input variables.

The simplest linear regression models are linear functions of the input variables. A more useful class of functions can be obtained by taking linear combinations of a fixed set of non-linear functions of the input variables, known as basis functions.

The simplest linear model for regression is one that involves a linear combination of the input variables as shown:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D$$

where $\mathbf{x} = (x_1, \dots, x_D)^T$.

**Principle of Least Squares (LS)**

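The principle of least squares (LS) is a technique for minimizing the error between the prediction for each data point and the corresponding target value.

The error function is given as the sum of the squares of the errors over the $N$ data points, where $t_n$ is the target value for the input $\mathbf{x}_n$:

$$E(\mathbf{w}) = \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2$$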

**Principle of Maximum Likelihood (ML)**

Maximum likelihood is a procedure for estimating the value of one or more parameters of a statistical model by choosing the values that maximize the likelihood, that is, the probability of the observed data given the parameters.

**Principle of Maximum A Posteriori (MAP)**

A maximum a posteriori (MAP) estimate is an estimate of an unknown quantity that is equal to the mode of the posterior distribution.
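The difference between the two principles is easiest to see on a small case. The sketch below assumes a coin-tossing (Bernoulli) model with a Beta(a, b) prior on the probability of heads; the model, the prior and the function names `ml_estimate` and `map_estimate` are illustrative assumptions, not part of the original notes. Under these assumptions the ML estimate is the observed fraction of heads, and the MAP estimate is the mode of the Beta posterior.

```python
def ml_estimate(heads, tosses):
    """Maximum likelihood estimate of the heads probability for a Bernoulli
    model: the value that maximizes the likelihood of the observed tosses."""
    return heads / tosses


def map_estimate(heads, tosses, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior.

    The posterior is Beta(heads + a, tails + b); its mode is the formula below,
    valid when both posterior parameters are greater than 1."""
    return (heads + a - 1) / (tosses + a + b - 2)


# Hypothetical data: three tosses, all heads.
heads, tosses = 3, 3

print(ml_estimate(heads, tosses))   # ML puts all the mass on heads
print(map_estimate(heads, tosses))  # the prior pulls the estimate toward 0.5
```

With only three observations the ML estimate is 1.0, while the assumed prior moderates the MAP estimate to 0.8. As the number of tosses grows, the two estimates move closer together.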

**The Least-Squares (LS) Solution**

The least-squares method in regression analysis approximates the solution to a set of equations by minimizing the sum of the squares of the residuals made in the results of each equation.

The residual refers to the difference between the observed value and the fitted value provided by the model.
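For a single input variable, the least-squares solution has a simple closed form. The following sketch fits a straight line y = w0 + w1·x to a handful of made-up points; the function name `fit_line` and the data are hypothetical. It then computes the residuals and their sum of squares.

```python
def fit_line(xs, ts):
    """Closed-form least-squares fit of t ≈ w0 + w1 * x for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w1 = sxt / sxx               # slope
    w0 = mean_t - w1 * mean_x    # intercept
    return w0, w1


def sum_of_squares(w0, w1, xs, ts):
    """Sum of squared residuals (observed value minus fitted value)."""
    return sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))


# Hypothetical observations that roughly follow t = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [1.1, 2.9, 5.2, 6.8, 9.0]

w0, w1 = fit_line(xs, ts)
residuals = [t - (w0 + w1 * x) for x, t in zip(xs, ts)]
sse = sum_of_squares(w0, w1, xs, ts)
print(round(w0, 2), round(w1, 2), round(sse, 3))
```

Any other choice of intercept and slope gives a larger sum of squared residuals on these points, which is what "least squares" means. With several input variables the same idea is usually solved with matrix methods instead of this one-variable formula.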

**Problem of Overfitting**

Overfitting is a condition in regression where a statistical model begins to describe the random error in the data rather than the relationship between the variables.

It is the production of an analysis that corresponds too closely to a particular set of data to such extent that it may fail to fit additional data or observations reliably.
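A small experiment can make this visible. The sketch below assumes a true relationship t = 2x with a fixed, made-up noise pattern added to the training targets. It compares a least-squares straight line with a degree-7 polynomial that passes through all eight training points exactly (built by Lagrange interpolation). All names and data here are hypothetical; the point is only that the model with zero training error does worse on points it has not seen.

```python
def fit_line(xs, ts):
    """Closed-form least-squares fit of t ≈ w0 + w1 * x."""
    n = len(xs)
    mean_x, mean_t = sum(xs) / n, sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    return mean_t - w1 * mean_x, w1


def lagrange_interpolate(xs, ts, x):
    """Evaluate at x the unique polynomial passing through every (xs[i], ts[i])."""
    total = 0.0
    for i, (xi, ti) in enumerate(zip(xs, ts)):
        term = ti
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def mse(predict, xs, ts):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(x) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)


def true_f(x):
    return 2.0 * x  # assumed underlying relationship


# Training targets carry a fixed alternating error of +0.5 / -0.5.
train_x = [float(k) for k in range(8)]
train_t = [true_f(x) + (0.5 if k % 2 == 0 else -0.5) for k, x in enumerate(train_x)]
# Test points lie between the training inputs and are noise-free.
test_x = [k + 0.5 for k in range(7)]
test_t = [true_f(x) for x in test_x]

w0, w1 = fit_line(train_x, train_t)


def line(x):
    return w0 + w1 * x


def poly(x):
    return lagrange_interpolate(train_x, train_t, x)


print("train MSE:", mse(line, train_x, train_t), mse(poly, train_x, train_t))
print("test  MSE:", mse(line, test_x, test_t), mse(poly, test_x, test_t))
```

The interpolating polynomial reproduces the noisy training targets perfectly but oscillates between them, so its error on the test points is much larger than that of the simple line.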

**5. CLASSIFICATION**

Classification in machine learning is the task of identifying the class or category to which a new observation belongs. This is done on the basis of a training data set containing observations whose class is known.

An example is to categorize emails into two classes: spam and non-spam. In this case, an incoming email is the observation, while the classes are 'spam' and 'non-spam'.

The goal of classification is to take an input vector **x** and assign it to one of K discrete classes $C_k$, where $k = 1, \dots, K$.
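As one very simple way to assign an input vector to a class, the sketch below uses a nearest-centroid rule: each class is summarized by the mean of its training vectors, and a new vector goes to the class with the closest mean. This particular classifier, the two features (counts of links and exclamation marks) and all names are assumptions chosen for illustration; the notes do not prescribe a specific method.

```python
import math
from collections import defaultdict


def fit_centroids(inputs, labels):
    """Compute the mean vector (centroid) of the training inputs of each class."""
    groups = defaultdict(list)
    for x, label in zip(inputs, labels):
        groups[label].append(x)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }


def classify(x, centroids):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


# Hypothetical training set: (number of links, number of exclamation marks).
train_inputs = [(8, 6), (7, 9), (9, 7), (1, 0), (0, 2), (2, 1)]
train_labels = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]

centroids = fit_centroids(train_inputs, train_labels)
print(classify((6, 5), centroids))
print(classify((1, 2), centroids))
```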

**6. NEURAL NETWORKS**

A neural network is an interconnected network of nodes, called neurons, joined by edges that carry assigned weights. Neural networks are designed to mimic the behaviour of the biological network of the human brain.

Each connection in a neural network transmits a signal from one neuron to another.
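The idea of weighted connections passing signals between neurons can be sketched as a forward pass through a tiny network. Everything specific here is an assumption for illustration: the layer sizes, the weight and bias values, the sigmoid activation and the function names are not taken from the original notes.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the output of one layer.

    Each neuron sums its incoming signals, each multiplied by the weight on
    that connection, adds a bias, and applies the activation function."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

x = [1.0, 2.0]
hidden = layer_forward(x, hidden_weights, hidden_biases)
output = layer_forward(hidden, output_weights, output_biases)
print(hidden, output)
```

Training a network means adjusting the weights and biases so that the outputs match known targets; this sketch only shows how a signal flows through fixed weights.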

**7. DIMENSIONALITY REDUCTION**

Dimensionality reduction in machine learning is the process of reducing the number of random variables under analysis by obtaining a set of principal variables, known as principal components (PC).

Dimensionality reduction can be divided into two categories: feature selection, which tries to find a subset of the original variables (or features), and feature extraction, which transforms the data from a higher-dimensional space into fewer dimensions.

**Principal Component Analysis**

Principal Component Analysis (PCA) is an example of feature extraction. PCA performs an eigendecomposition of the covariance matrix of the original high-dimensional data. The result of this decomposition is a set of eigenvectors and a set of eigenvalues.

The eigenvectors that correspond to the largest eigenvalues are then taken as the principal components.
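The steps above can be sketched without any external library for a small data set. The code below builds the covariance matrix and then finds only the leading eigenvector with power iteration, which is a swapped-in shortcut: a full eigendecomposition routine would return all eigenvectors at once. The data, the function names and the starting vector are hypothetical.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector by power iteration."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))  # Rayleigh quotient
    return eigenvalue, v


# Hypothetical 2-D data lying close to the line y = x (both columns have mean 0).
data = [(2.0, 1.9), (0.0, 0.1), (-2.0, -2.1), (1.0, 1.2), (-1.0, -1.1)]

cov = covariance_matrix(data)
eigenvalue, pc = top_eigenpair(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of the total variance

# Project each observation onto the first principal component: 2-D -> 1-D.
# The data are already centered, so no mean is subtracted here.
projected = [sum(c * x for c, x in zip(pc, row)) for row in data]
print(pc, round(explained, 3))
```

Because the points lie almost on a line, a single component captures nearly all of the variance, and the one-dimensional projection loses very little information.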

**8. CLUSTERING**

Clustering is an unsupervised learning method that aims at finding sub-groups within the data that have similar characteristics.

**K-Means Clustering**

K-means clustering is a clustering method that aims to partition the data into k clusters in which each observation belongs to the cluster with the nearest mean.

**How it Works**

Suppose we have a data set $\{x_1, \dots, x_N\}$, which is a set of N observations of the variable x. We want to partition the data set into some number K of clusters.

Let's assume a set of D-dimensional vectors $\boldsymbol{\mu}_k$, where $k = 1, \dots, K$ and $\boldsymbol{\mu}_k$ is the centroid (or mean) associated with the kth cluster.

The goal is to assign the data points to clusters, and to find a set of vectors $\{\boldsymbol{\mu}_k\}$, in such a way that the sum of the squares of the distances of each data point to its closest $\boldsymbol{\mu}_k$ is a minimum.
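A common way to pursue this goal is to alternate two steps: assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it. The sketch below is a minimal version of that loop; the function names, the initial centroids and the data are assumptions for illustration, and the result generally depends on how the centroids are initialized.

```python
import math


def kmeans(points, centroids, iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


def objective(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


# Hypothetical 2-D data with two visible groups, K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [points[0], points[3]]

centroids, assignments = kmeans(points, initial)
print(centroids, assignments, round(objective(points, centroids, assignments), 3))
```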

**9. INDEPENDENT MODELS OF PROBABILITY DISTRIBUTION**

Conditional independence is a concept in probability theory that relates two events. Two events a and b are conditionally independent given a third event c if the occurrence of a and the occurrence of b are independent events in their conditional probability distribution given c.

In other words, a and b are conditionally independent given c if and only if, given the knowledge that *c* occurs, knowledge of whether *a* occurs provides no information on the likelihood of *b* occurring, and knowledge of whether *b* occurs provides no information on the likelihood of *a* occurring.

This can be represented as follows:

*p(a|b,c) = p(a|c)*

This means that a is conditionally independent of b given c.
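The relation p(a|b,c) = p(a|c) can be verified numerically on a small example. The sketch below assumes three binary variables and builds their joint distribution as p(c)·p(a|c)·p(b|c), which makes a and b conditionally independent given c by construction. The probability values and the function names are made up for illustration.

```python
from itertools import product

# Hypothetical model: c influences both a and b, which do not interact directly.
p_c = {0: 0.4, 1: 0.6}
p_a1_given_c = {0: 0.2, 1: 0.7}   # p(a = 1 | c)
p_b1_given_c = {0: 0.5, 1: 0.1}   # p(b = 1 | c)


def bernoulli(p_one, value):
    """Probability of a binary value when p(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


# Joint distribution p(a, b, c) = p(c) * p(a | c) * p(b | c).
joint = {
    (a, b, c): p_c[c] * bernoulli(p_a1_given_c[c], a) * bernoulli(p_b1_given_c[c], b)
    for a, b, c in product((0, 1), repeat=3)
}


def prob(a=None, b=None, c=None):
    """Probability of the given values, summing over any variable left as None."""
    total = 0.0
    for (va, vb, vc), p in joint.items():
        if a in (None, va) and b in (None, vb) and c in (None, vc):
            total += p
    return total


# Check p(a | b, c) == p(a | c) for every combination of values.
conditionally_independent = all(
    abs(prob(a=a, b=b, c=c) / prob(b=b, c=c) - prob(a=a, c=c) / prob(c=c)) < 1e-12
    for a, b, c in product((0, 1), repeat=3)
)

# Without conditioning on c, a and b are NOT independent in this example.
marginal_gap = abs(prob(a=1, b=1) - prob(a=1) * prob(b=1))
print(conditionally_independent, round(marginal_gap, 3))
```

The second check shows that conditional independence given c does not imply ordinary independence: once c is unknown, observing b still carries information about a.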


**10. FULL BAYESIAN LEARNING: MARKOV CHAIN MONTE CARLO METHODS**

A Markov chain is a model that describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
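To illustrate the Markov property on its own, the sketch below simulates a two-state chain in which the next state is drawn using only the current state. It also propagates a probability distribution over the states until it settles. The weather states, the transition probabilities and the function names are invented for illustration, and this is only a plain Markov chain, not a full Markov chain Monte Carlo sampler.

```python
import random

# Hypothetical transition probabilities: transition[current][next].
transition = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def step(state, rng):
    """Draw the next state; the draw depends only on the current state."""
    next_states = list(transition[state])
    weights = [transition[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights)[0]


def simulate(start, n_steps, seed=0):
    """Generate a chain of n_steps transitions starting from `start`."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps):
        chain.append(step(chain[-1], rng))
    return chain


def propagate(dist):
    """Advance a distribution over states by one step of the chain."""
    return {
        s: sum(dist[r] * transition[r][s] for r in transition)
        for s in transition
    }


chain = simulate("sunny", 10)

# Start from certainty of rain and apply the transition repeatedly.
dist = {"sunny": 0.0, "rainy": 1.0}
for _ in range(100):
    dist = propagate(dist)

print(chain)
print(dist)
```

For these transition probabilities the distribution settles at 5/6 sunny and 1/6 rainy regardless of the starting state. Markov chain Monte Carlo methods rely on this kind of convergence, designing a chain whose long-run distribution is the posterior of interest.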