代码收藏家技术教程 2022-08-18

knn(k近邻算法)——python

1. 基本定义

2. 算法原理

2.1 算法优缺点

2.2 算法参数

2.3 变种

3.算法中的距离公式

4.案例实现

4.1 导入相关库

4.2 读取数据

4.3 读取变量名

4.4 定义X,Y数据

4.5 分离训练集和测试集

4.6 计算欧式距离

4.7 可视化距离矩阵

4.8 预测样本

4.9 查看正确率

4.10 交叉验证

5. scikit-learn的算法实现

5.1 对上述的再次实现：

5.2 另一种实现方式

1. 基本定义

k最近邻(k-Nearest Ne ighbor)算法是比较简单的机器学习算法。它采用测量不同特征值之间的距离方法进行分类。它的思想很简单：如果一个样本在特征空间中的多个最近邻(最相似〉的样本中的大多数都属于某一个类别，则该样本也属于这个类别。第一个字母k可以小写，表示外部定义的近邻数量。

简而言之，就是让机器自己按照每一个点的距离，距离近的为一类。

2. 算法原理

knn算法的核心思想是未标记样本的类别，由距离其最近的k个邻居投票来决定。
具体的，假设我们有一个已标记好的数据集。此时有一个未标记的数据样本，我们的任务是预测出这个数据样本所属的类别。knn的原理是，计算待标记样本和数据集中每个样本的距离，取距离最近的k个样本。待标记的样本所属类别就由这k个距离最近的样本投票产生。
假设X_test为待标记的样本，X_train为已标记的数据集，算法原理的伪代码如下：

遍历X_train中的所有样本，计算每个样本与X_test的距离，并把距离保存在Distance数组中。
对Distance数组进行排序，取距离最近的k个点，记为X_knn。
在X_knn中统计每个类别的个数，即class0在X_knn中有几个样本，class1在X_knn中有几个样本等。
待标记样本的类别，就是在X_knn中样本个数最多的那个类别。

2.1 算法优缺点

优点：准确性高，对异常值和噪声有较高的容忍度。

缺点：计算量较大，对内存的需求也较大。

2.2 算法参数

其算法参数是k，参数选择需要根据数据来决定。

k值越大，模型的偏差越大，对噪声数据越不敏感，当k值很大时，可能造成欠拟合；

k值越小，模型的方差就会越大，当k值太小，就会造成过拟合。

2.3 变种

knn算法有一些变种，其中之一是可以增加邻居的权重。默认情况下，在计算距离时，都是使用相同权重。实际上，可以针对不同的邻居指定不同的距离权重，如距离越近权重越高。这个可以通过指定算法的weights参数来实现。
另一个变种是，使用一定半径内的点取代距离最近的k个点。当数据采样不均匀时，可以有更好的性能。在scikit-learn里，RadiusNeighborsClassifier类实现了这个算法变种。

3.算法中的距离公式

与我们的线性回归不同，在这里我们并没有什么公式可以进行推导。KNN分类算法的核心就在于计算距离，随后按照距离分类。

在二维笛卡尔坐标系，相信初中同学应该对这个应该不陌生，他有一个更加常见的名字，直角坐标系。其中，计算两个点之间的距离公式，常用的有欧氏距离。点A(2,3)，点B(5,6)，那么AB的距离为

这，便是欧氏距离。但和我们平常经常遇到的还是有一些区别的，欧氏距离是可以计算多维数据的，也就是矩阵(Matrix)。这可以帮我们解决很多问题，那么公式也就变成了

4.案例实现

我们使用knn算法及其变种，对Pina印第安人的糖尿病进行预测。数据集可从下面下载。
链接：蓝奏云

4.1 导入相关库

# 导入相关模块
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import pandas as pd

4.2 读取数据

#读取数据
data=pd.read_excel('D:\桌面\knn.xlsx')
print(data)

label_need=data.keys()
print(label_need)

X = data[label_need].values[:,0:8]
y = data[label_need].values[:,8]
print(X)
print(y)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.2)

# 打印训练集和测试集大小
print('X_train=', X_train.shape)
print('X_test=', X_test.shape)
print('y_train=', y_train.shape)
print('y_test=', y_test.shape)

# 测试实例样本量
num_test = X.shape[0]
# 训练实例样本量
num_train = X_train.shape[0]
# 基于训练和测试维度的欧氏距离初始化
dists = np.zeros((num_test, num_train)) 
# 测试样本与训练样本的矩阵点乘
M = np.dot(X, X_train.T)
# 测试样本矩阵平方
te = np.square(X).sum(axis=1)
# 训练样本矩阵平方
tr = np.square(X_train).sum(axis=1)
# 计算欧式距离
dists = np.sqrt(-2 * M + tr + np.matrix(te).T) 
print(dists)

dists = compute_distances(X_test, X_train)
plt.imshow(dists, interpolation='none')
plt.show()

# 测试样本量
num_test = dists.shape[0]
# 初始化测试集预测结果
y_pred = np.zeros(num_test) 
# 遍历   
for i in range(num_test):
    # 初始化最近邻列表
    closest_y = []
    # 按欧氏距离矩阵排序后取索引，并用训练集标签按排序后的索引取值
# 最后拉平列表
# 注意np.argsort函数的用法
    labels = y_train[np.argsort(dists[i, :])].flatten()
    # 取最近的k个值
    closest_y = labels[0:k]
    # 对最近的k个值进行计数统计
    # 这里注意collections模块中的计数器Counter的用法
    c = Counter(closest_y)
    # 取计数最多的那一个类别
    y_pred[i] = c.most_common(1)[0][0] 
print(y_pred)

4.9 查看正确率

查看实际和预测相符的个数：

# 找出预测正确的实例
num_correct = np.sum(y_test_pred == y_test)
print(num_correct)

计算正确率：

# 计算准确率
accuracy = float(num_correct) / X_test.shape[0]
print('Got %d/%d correct=>accuracy:%f'% (num_correct, X_test.shape[0], accuracy))

# 折交叉验证
num_folds = 5
# 候选k值
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
# 训练数据划分
X_train_folds = np.array_split(X_train, num_folds)
# 训练标签划分
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
# 遍历所有候选k值
for k in k_choices:
    # 五折遍历    
    for fold in range(num_folds): 
        # 对传入的训练集单独划出一个验证集作为测试集
        validation_X_test = X_train_folds[fold]
        validation_y_test = y_train_folds[fold]
        temp_X_train = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        temp_y_train = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])       
        # 计算距离
        temp_dists = compute_distances(validation_X_test, temp_X_train)
        temp_y_test_pred = predict_labels(temp_y_train, temp_dists, k=k)
        temp_y_test_pred = temp_y_test_pred.reshape((-1, 1))       
        # 查看分类准确率
        num_correct = np.sum(temp_y_test_pred == validation_y_test)
        num_test = validation_X_test.shape[0]
        accuracy = float(num_correct) / num_test
        k_to_accuracies[k] = k_to_accuracies.get(k,[]) + [accuracy]

打印不同 k 值不同折数下的分类准确率：

# 打印不同 k 值不同折数下的分类准确率
for k in sorted(k_to_accuracies):    
      for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

不同 k 值不同折数下的分类准确率的可视化：

for k in k_choices:
    # 取出第k个k值的分类准确率
    accuracies = k_to_accuracies[k]
    # 绘制不同k值准确率的散点图
    plt.scatter([k] * len(accuracies), accuracies)
# 计算准确率均值并排序
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
# 计算准确率标准差并排序
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
# 绘制有置信区间的误差棒图
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
# 绘图标题
plt.title('Cross-validation on k')
# x轴标签
plt.xlabel('k')
# y轴标签
plt.ylabel('Cross-validation accuracy')
plt.show()

5.1 对上述的再次实现：

# 导入KneighborsClassifier模块
from sklearn.neighbors import KNeighborsClassifier
# 创建k近邻实例
neigh = KNeighborsClassifier(n_neighbors=10)
# k近邻模型拟合
neigh.fit(X_train, y_train)
# k近邻模型预测
y_pred = neigh.predict(X_test)
# # 预测结果数组重塑
# y_pred = y_pred.reshape((-1, 1))
# 统计预测正确的个数
num_correct = np.sum(y_pred == y_test)
print(num_correct)
# 计算准确率
accuracy = float(num_correct) / X_test.shape[0]
print('Got %d / %d correct => accuracy: %f' % (num_correct, X_test.shape[0], accuracy))

5.2 另一种实现方式

5.2.1 加载数据

import pandas as pd
data = pd.read_csv('D:\桌面\knn.csv')
print('dataset shape {}'.format(data.shape))
data.info()

5.2.2 分离训练集和测试集

X = data.iloc[:, 0:8]
Y = data.iloc[:, 8]
print('shape of X {}, shape of Y {}'.format(X.shape, Y.shape))

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train,Y_test = train_test_split(X, Y, test_size=0.2)

5.2.3 模型比较

使用普通的knn算法、带权重的knn以及指定半径的knn算法分别对数据集进行拟合并计算评分

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# 构建3个模型
models = []
models.append(('KNN', KNeighborsClassifier(n_neighbors=2)))
models.append(('KNN with weights', KNeighborsClassifier(n_neighbors=2, weights='distance')))
models.append(('Radius Neighbors', RadiusNeighborsClassifier(n_neighbors=2, radius=500.0)))

# 分别训练3个模型，并计算得分
results = []
for name, model in models:
    model.fit(X_train, Y_train)
    results.append((name, model.score(X_test, Y_test)))
for i in range(len(results)):
    print('name: {}; score: {}'.format(results[i][0], results[i][1]))

权重算法，我们选择了距离越近，权重越高。RadiusNeighborsClassifier模型的半径选择了500.从输出可以看出，普通的knn算法还是最好。

问题来了，这个判断准确吗？答案是：不准确。

因为我们的训练集和测试集是随机分配的，不同的训练样本和测试样本组合可能导致计算出来的算法准确性有差异。

那么该如何解决呢？

我们可以多次随机分配训练集和交叉验证集，然后求模型评分的平均值。

scikit-learn提供了KFold和cross_val_score()函数来处理这种问题。

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

results = []
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_result = cross_val_score(model, X, Y, cv=kfold)
    results.append((name, cv_result))
    
for i in range(len(results)):
    print('name: {}; cross_val_score: {}'.format(results[i][0], results[i][1].mean()))

上述代码，我们通过KFold把数据集分成10份，其中1份会作为交叉验证集来计算模型准确性，剩余9份作为训练集。cross_val_score()函数总共计算出10次不同训练集和交叉验证集组合得到的模型评分，最后求平均值。看起来，还是普通的knn算法性能更优一些。

5.2.4 模型训练及分析

据上面模型比较得到的结论，我们接下来使用普通的knn算法模型对数据集进行训练，并查看对训练样本的拟合情况以及对测试样本的预测准确性情况：

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, Y_train)
train_score = knn.score(X_train, Y_train)
test_score = knn.score(X_test, Y_test)
print('train score: {}; test score : {}'.format(train_score, test_score))

从这里可以看到两个问题。

对训练样本的拟合情况不佳，评分才0.84多一些，说明算法模型太简单了，无法很好地拟合训练样本。

模型准确性不好，0.66左右的预测准确性。

我们画出曲线，查看一下。

我们首先定义一下这个画图函数，代码如下：

from sklearn.model_selection import learning_curve
import numpy as np

def plot_learning_curve(plt, estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o--', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

然后我们调用这个函数画一下图看看：

from sklearn.model_selection import ShuffleSplit

knn = KNeighborsClassifier(n_neighbors=2)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
plt.figure(figsize=(10,6), dpi=200)
plot_learning_curve(plt, knn, 'Learn Curve for KNN Diabetes', X, Y, ylim=(0.0, 1.01), cv=cv)