Classification

Classification is a very important aspect of machine learning. A lot of our decisions are based on classification - choosing between alternatives. The decision could be based on several abstract parameters. We know that the parameters contribute to the decision. But we have no idea how. We can train a machine learning model based on such data using the classification algorithms.

K-Nearest Neighbors (KNN)

Classification is an important subject in Supervised Learning. A major part of machine learning applications deal with binary outputs which require classification rather than regression. KNN Classifier is one of the many classification algorithms.

Like most Machine Learning algorithms, the KNN algorithm is also inspired by a tendency of the human mind - to go along with the crowd. Conceptually, KNN just looks at the known points around the query point and predicts that its outcome is similar to the points around it. More precisely, For any new point, it checks for the K points that are closest in terms of the defined distance metric.

Once these are identified, the outcome of each of those points is identified based on the training set. And the outcome of the new point is defined based on the highest bidder in the neighborhood. For example, if we look for the 5 nearest of a given test point, if 3 of those points say positive and two say negative, the outcome is predicted as positive since that is the highest bidder.

The KNN Classifier is a good tool for classification when the size of the data and the features are within control - else the computation can get very expensive. The classifier accuracy is based on the assumption that the similar points are geometrically close to each other - which may not always be the case. The distance metric is very important. What is near according to one metric may be far away by the other.

Consider for example, data sets in form of two concentric circles - the inner circle being positive and the outer circle negative. In such a case, inventing new distance metric may help to an extent. But the cost of computation increases very rapidly with the complexity of the distance metric. The cost also increases rapidly with the number of features.

But it is a very elegant and intuitive way of classifying when the data is good.

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a classification technique. It tries to geometrically divide the data available. If the input data has N features, the data is plotted as points in an N dimensional space. Then, it identifies an N-1 dimensional structure that could separate the groups. The good separation is one which has maximum distance from the two groups. The distance from the group could be identified in various different forms - the distance from the closest points or the distance from the center, or mean of all distances, etc. That would depend upon the kind of data.

But things may not always be so simple. If the features are not separated well enough, we may not have a good plane passing through them. In that case, the scores would be very bad in spite of any tuning. In such a case, we have to rework the features to make sure the data gets segregated properly. We may have to consider a new distance metric that can separate the positive and negative clusters well enough. Or if nothing works, we might have to look for another algorithm.

Hence it is important to have a good idea about the data we have. That can help us choose the correct algorithm and save our time and computation resources.

In effect, this complements the Nearest Neighbor algorithm. Problems that are easier with Nearest Neighbor are difficult with the SVM and vice-versa.

Decision Tree

Decision Tree is an interesting concept that mimics a very common way our mind approaches a classification problem.

Suppose our training data set has n features, we can take up one feature at a time and classify data elements of that feature. The two nodes thus obtained can be classified further based on the remaining features. Thus, we get a tree structure where each node representing a given data feature. As we perform this classification, we expect to reach a state where each node has true on one side and false on the other - thus completing our model training.

Consider for example, our old breast cancer data set. We can classify the data on one feature at a time. Say we start with the size. We can define a threshold and anything less than the threshold goes to the lower node and anything higher than the threshold goes to the upper node. Then we look at the weight. And so on.. Over time, we have a Tree of nodes - each classifying the data based on a threshold of a given feature. If our thresholds are wisely chosen, then we can expect that each leaf node of a well trained tree will be able to correctly classify into a positive or negative.

Of course, we have two hyperparameters going into our assumption - the choice of order of features and the "wisely chosen" thresholds. There are several algorithms for identifying these. But, once they are chosen, the classification is not a difficult task.

The essential concept of decision tree is based on splitting the tree on the appropriate features at appropriate thresholds. But identifying these features and thresholds is very important. And overfitting, of course is the single dominant evil in any machine learning scenario. Researchers have identified several ways around these problems. Below is an introduction to the most important ones.

Linear v/s Tree models

When we have different algorithms for the same task, a natural question in any mind is - how do they compare? Which one is better? Which one should I use?

Well, the fact is that both are good at their own set of problems. There are some problems that fit much better in a linear model and there are some others that fit much better in a tree model. Intuitively, we can say that if the correlation between the input features and the output is simple and linear (in the sense that one increases/decreases uniformly with the other), then a Linear model would work much better. But if the correlation is pretty complex and not linear, then a Tree model has a better chance of working out.

Also, compared to Linear models, a Tree model is a lot easier to grasp intuitively. So, if you need humans to understand the model, then a Tree model is far better.

Optimize the Split

The first step to preparing for the Decision Tree is to identify the right set of features and split. Below are some of the important techniques to help you with that.

Gini Index

This works with categorical target variable, only Binary splits. The Gini Index is based on the concept of Purity of Population. A Population is pure if the probability of two random samples belonging to the same class is 1. You should identify the Gini index for different possible splits. The one with highest Gini Index should be the starting point.

Chi-Square

This is another way of identifying the more significant component. Chi-Square is computed from sum of squares of standardized differences between observed and expected frequencies of target variable. Specifically, the Chi-Square value is calculated as

Chi-Square = ((Actual Positive Count - Expected Positive Count)^2 / (Expected Positive Count)) ^ (1/2)

A low value of Chi-Square suggests that the Actual and Expected counts are similar. That means, the component hardly impacts the decision - the output is statistical. High difference between the Actual and Expected counts means that the component has enough influence to drive the decision significantly away from its statistical value. This more significant component should be used to start the decision.

Information Gain

Entropy has always been the known distraction in any walk of life. Most of the processes are targeted towards minimizing the entropy. In terms of information theory, higher the entropy, lower is the information. Most of our computational effort is in an attempt to gain information - so we try to minimize the entropy. The entropy of any split is computed as

entropy = - p * (log2 q) - q * (log2 p)

Where p and q are the probability of success and failure on that node. If both are 0.5 - meaning the node is redundant - the entropy is 1. On the other hand, if both are close to 0 - meaning we have almost no false negatives or false positives, this is 'the' decision node - the entropy is very low. Surely, it should be the first choice

Reduce Tree Size

Next, we look at the techniques for avoiding over fitting. When training a decision tree, it is possible that the tree grows huge to the extent that there is a node for each data element - or perhaps even more. In such a case, we have 100% accuracy for the training data and of course, it fails miserably outside the set. Thus, the fundamental reason for over fitting is the tree growing too big. The below techniques avoid this situation to in order to avoid over fitting. Logically, there are two ways to do this - don't let the tree grow and cut it down. That is what the below techniques do.

Constrain the Tree Size

The tree can be constrained by defining individual constraints on its growth. This is done by defining limits on the following:

Minimum samples for a node split

Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
Too high values can lead to under-fitting hence, it should be tuned using CV.

Minimum samples for a terminal node (leaf)

Defines the minimum samples (or observations) required in a terminal node or leaf.
Used to control over-fitting similar to min_samples_split.
Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.

Maximum depth of tree (vertical depth)

The maximum depth of a tree.
Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
Should be tuned using CV.

Maximum number of terminal nodes

This defines the maximum number of terminal nodes or leaves in a tree
It can be defined instead of max_depth.
Since the trees are binary, a depth of ‘n’ would produce a maximum of 2n leaves.

Maximum features to consider for split

The number of features to consider while searching for a best split. These will be randomly selected.
As a thumb-rule, square root of the total number of features works great but we should check upto 30-40% of the total number of features.
Higher values can lead to over-fitting but depends on case to case.

Tree Pruning

Constraining the tree works reactively - while training. That means, the tree is constrained without checking up the full data. This may be quicker, but mathematically, this is not the best way. Pruning on the other hand lets the tree grow really huge. Then tries to reduce the tree by pruning low performing leaves - bottom up. Thus, the process is a lot more consistent and results in a much better model.

Bias & Variance

Imperfection manifests in any model as Bias and Variance. These are two metrics that show how good or bad is your model. In simple terms, Bias tells us if the entire set is offset from what it is meant to be. On the other hand, Variance tells us how good or bad is the compliance within the data set. It is possible that the average of the entire set matches what it is supposed to be.

But, that is not enough if individual points do not match. This is defined by the variance. On the other hand, it is possible that none of the points match, but all of them are shifted by a constant value. This would be denoted by high bias and low variance. A small tree has high Bias and low Variance. On the other hand, a big tree tends to produce low Bias but high Variance. Naturally, we would like to find the minimum value in this U shaped error curve.

Bagging is one of the techniques used to reduce the variance of our predictions. Essentially it combines multiple classifier models trained on different sub-samples of the given data set. The training data set is sampled multiple times to get multiple subsets. Once this is available, each is used to train a model independently. The models are then merged to get the final model. Thus, the steps for bagging are:

Create Multiple DataSets:

Sampling is done with replacement on the original data and new datasets are formed.
The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model
Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting

Build Multiple Classifiers:

Classifiers are built on each data set.
Generally the same classifier is modeled on each data set and predictions are made.

Combine Classifiers:

The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
The combined values are generally more robust than a single model. There are many different implementations of this concept of Bagging. Random Forest is one of them