Analyzing a Binge Drinking Dataset with the K-Nearest Neighbors Algorithm

One of the most interesting parts of machine learning is the k-nearest neighbors algorithm, often referred to as K-NN. It is a basic building block of pattern recognition: a non-parametric method widely adopted for classification and regression analysis. Whether used for classification or regression, the input consists of the K closest training samples in the feature space. The output depends on whether K-NN is set up for classification or regression.

First, in K-NN classification, the output is a class membership. The object in question is classified by a majority vote of its nearest neighbors, and is assigned to the class most common among its K nearest neighbors. K is a positive integer, typically small. For instance, if K = 1, the object is simply assigned to the class of its single nearest neighbor.

Second, in K-NN regression, the output is a property value for the object: the average of the values of its K nearest neighbors.

Furthermore, K-NN is a form of instance-based learning, or lazy learning, in which the function is only approximated locally and all computation is deferred until classification. The K-NN algorithm is one of the simplest of all machine learning algorithms, yet it remains useful for classification and regression on large datasets. A handy refinement is to assign weights to the contributions of the neighbors, so that closer neighbors contribute more to the average than more distant ones. One notable property of the K-NN algorithm is that it is sensitive to the local structure of the data.
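To make the two modes above concrete, here is a minimal sketch of both K-NN classification (majority vote) and K-NN regression (mean of neighbor values). The feature encoding, the `train` sample, and all names are illustrative assumptions, not values from the actual dataset:

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, class_label) pairs
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]          # majority vote

def knn_regress(train, query, k=3):
    # train: list of (feature_vector, numeric_value) pairs
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return sum(value for _, value in neighbors) / k  # mean of the k values

# Hypothetical samples: (drinks per week, age) -> "high"/"low" risk label
train = [((12, 21), "high"), ((14, 19), "high"), ((2, 35), "low"),
         ((1, 40), "low"), ((10, 22), "high"), ((3, 30), "low")]
print(knn_classify(train, (11, 20), k=3))  # nearest 3 points are all "high"
```

With k = 1 this reduces to the plain nearest neighbor rule; larger k smooths the decision over more neighbors.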


Typical K-NN Algorithm

The training phase of K-NN consists simply of storing the feature vectors and class labels of the training examples. In the classification phase, k is a user-defined constant, and an unlabeled query vector is assigned the label most frequent among the k training samples closest to the query point. For continuous variables, the distance metric over the data space is commonly the Euclidean distance. For discrete variables, as in text classification, the overlap metric (Hamming distance) is used instead. In the context of gene expression microarray data, for instance, K-NN has been employed with correlation-based measures such as the Pearson and Spearman coefficients. The classification accuracy of K-NN can often be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood Components Analysis.
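The choice between the two basic metrics can be sketched as follows; the feature values and category strings are illustrative assumptions:

```python
import math

def euclidean(a, b):
    # for continuous features, e.g. drinks per week and age
    return math.dist(a, b)

def hamming(a, b):
    # for discrete features: count of positions where the vectors differ
    return sum(x != y for x, y in zip(a, b))

print(euclidean((12.0, 21.0), (10.0, 22.0)))               # continuous case
print(hamming(("male", "student"), ("male", "employed")))  # discrete case -> 1
```

The same K-NN code can use either function; only the distance callable changes with the feature type.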

It must be noted that one of the major drawbacks of the basic majority-vote rule is class skew. A more frequent class tends to dominate the prediction for a new example, simply because its members tend to be common among the K nearest neighbors. One way of tackling this problem is weighted classification, which takes into account the distance from the test point to each of its k nearest neighbors. Another way of overcoming the skew problem is abstraction of the data, for example via a self-organizing map (SOM), in which each node is a representative of a cluster of similar points, regardless of their density in the original training data. K-NN can then be applied to the SOM.
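The distance-weighting remedy can be sketched as below: each neighbor's vote is weighted by the inverse of its distance, so a single very close neighbor can outvote several distant members of a frequent class. The example points are an illustrative assumption:

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, query, k=3):
    # Weight each neighbor's vote by 1/distance so nearer neighbors dominate,
    # counteracting a frequent class that swamps the vote by sheer count.
    nearest = sorted(((math.dist(x, query), label) for x, label in train))[:k]
    scores = defaultdict(float)
    for d, label in nearest:
        scores[label] += 1.0 / (d + 1e-9)  # epsilon avoids division by zero
    return max(scores, key=scores.get)

# Three "low" points outnumber one "high" point, but the query sits
# right next to the "high" point, so the weighted vote picks "high".
train = [((0.0, 0.0), "high"),
         ((3.0, 0.0), "low"), ((3.0, 1.0), "low"), ((4.0, 0.0), "low")]
print(weighted_knn_classify(train, (0.1, 0.0), k=4))  # -> "high"
```

An unweighted majority vote over the same four neighbors would return "low", which is exactly the skew the weighting corrects.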

Parameter Selection: The special case in which the predicted class is the class of the single nearest training sample (i.e. when k = 1) is called the nearest neighbor algorithm. The best choice of k depends on the data: larger values reduce the effect of noise but make class boundaries less distinct. For K-NN to be accurate, noisy and irrelevant features should be removed.
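One common way to pick k is leave-one-out cross-validation: classify each training point using all the others and keep the k with the best accuracy. This is a sketch under assumed toy data, not a prescription from the dataset described here:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    # leave-one-out: classify each point using all the other points
    hits = sum(knn_classify(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

def pick_k(data, candidates=(1, 3, 5)):
    # choose the candidate k with the highest leave-one-out accuracy
    return max(candidates, key=lambda k: loo_accuracy(data, k))

# Two well-separated toy clusters
clean = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
print(pick_k(clean))
```

On real survey data one would usually prefer stratified cross-validation over full leave-one-out for speed, but the selection logic is the same.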

K-NN Classifiers

  1. The 1-nearest neighbor classifier: The most intuitive classifier of this type, it assigns a point, say h, to the class of its single nearest neighbor.


  2. Weighted nearest neighbor classifier: A weight-based classifier such as the one described above, in which the k nearest neighbors are weighted according to their distance from the query point. The bagged nearest neighbor classifier is closely related to this scheme.



  3. Metric learning: K-nearest neighbor classification performance can often be improved through supervised metric learning. Examples are Neighborhood Components Analysis and Large Margin Nearest Neighbor; such supervised metric learning algorithms use the label information to learn a new metric or pseudo-metric.


  4. Feature extraction: This step applies when the input data are too large to be processed directly or are suspected to be redundant (for example, repeated measurements in different units). If features are carefully extracted from the original dataset, the reduced representation can retain the relevant information from the input data, allowing the subsequent data mining task to run on an optimized, higher-quality feature space.
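A minimal sketch of one simple extraction step, dropping zero-variance (constant) columns, which carry no information for K-NN distances. The matrix and threshold are illustrative assumptions:

```python
def variance(col):
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

def select_features(X, threshold=0.0):
    # keep only columns whose variance exceeds the threshold;
    # a constant feature shifts every distance equally and can be dropped
    cols = list(zip(*X))
    keep = [j for j, col in enumerate(cols) if variance(col) > threshold]
    return [[row[j] for j in keep] for row in X], keep

X = [[1.0, 5.0, 0.0],
     [2.0, 5.0, 1.0],
     [3.0, 5.0, 0.0]]
reduced, kept = select_features(X)
print(kept)  # column 1 is constant, so only columns 0 and 2 survive
```

Real pipelines go further (scaling, correlation filtering, learned projections), but the principle of discarding uninformative inputs before K-NN is the same.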


  5. Dimension reduction: For high-dimensional datasets, dimension reduction is usually performed as a form of feature extraction before applying K-NN, because the Euclidean distance becomes unhelpful in high dimensions. This is often referred to as the curse of dimensionality. Feature extraction and dimension reduction can both be handled by techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) as a pre-processing step, followed by K-NN on the feature vectors in the reduced-dimension space. This approach is known as low-dimensional embedding.
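A sketch of the PCA step, projecting assumed high-dimensional data down to two components before any K-NN is run (the data here are random placeholders, not the dataset discussed above):

```python
import numpy as np

def pca_project(X, n_components=2):
    # centre the data, then project onto the top principal directions
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigen-decomposition of covariance
    order = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return Xc @ vecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))        # stand-in for 10-dimensional survey data
Z = pca_project(X, n_components=2)   # low-dimensional embedding for K-NN
print(Z.shape)                       # (50, 2)
```

K-NN is then trained on `Z` instead of `X`, so every distance computation touches 2 coordinates rather than 10.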


  6. Decision boundary: Nearest neighbor rules implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.


  7. Data reduction: This is one of the most challenging issues when dealing with huge datasets. Typically, only a subset of the data points is needed for accurate classification. Those data points are known as prototypes and can be found as follows:


  1. Select the class-outliers, that is, training data that K-NN classifies incorrectly.
  2. Separate the rest of the data into two sets: first, the prototypes needed for the classification decisions; second, the absorbed points, which K-NN classifies correctly using the prototypes alone. The absorbed points can then be removed from the training set.
  3. Selection of class-outliers: a training example surrounded by examples of other classes is called a class outlier. Causes of class outliers include random error, too few training examples of that class, and missing features.
  4. Condensed nearest neighbor for data reduction: an algorithm designed to reduce the dataset for K-NN classification. It extracts a set of prototypes from the training dataset that can classify the examples almost as accurately as the entire dataset. It operates iteratively, repeatedly scanning the members of the dataset and searching for an element whose nearest prototype has a different label.
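The condensed nearest neighbor step above can be sketched as follows. This is a minimal rendering of the iterative scan (Hart-style), with illustrative toy clusters standing in for real data:

```python
import math

def nearest_label(prototypes, x):
    # label of the prototype closest to x
    return min(prototypes, key=lambda p: math.dist(p[0], x))[1]

def condense(data):
    # start from one point; keep scanning and add every point that the
    # current prototype set misclassifies, until a full pass adds nothing
    prototypes = [data[0]]
    changed = True
    while changed:
        changed = False
        for x, y in data:
            if nearest_label(prototypes, x) != y:
                prototypes.append((x, y))
                changed = True
    return prototypes

data = [((0.0, 0.0), "low"), ((0.1, 0.0), "low"), ((0.0, 0.1), "low"),
        ((5.0, 5.0), "high"), ((5.1, 5.0), "high"), ((5.0, 5.1), "high")]
proto = condense(data)
print(len(proto))  # two tight clusters condense to far fewer than 6 points
```

By construction the reduced set classifies every original training point correctly with 1-NN, which is exactly the consistency property the prototypes are meant to preserve.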