Exploring Titanic Survival: Data Analysis and Prediction

Table of Contents

  • 1. Importing Libraries
  • 2. Importing Machine Learning Models
  • 2.1 Logistic Regression
  • 2.2 Random Forest
  • 2.3 Support Vector Machine
  • 2.4 K-Nearest Neighbors
  • 2.5 Decision Tree
  • 3. Data Processing
  • 3.1 Loading the Data
  • 3.2 Handling Missing Values in the Training Set
  • 3.2.1 Checking the Dimensions
  • 3.2.2 Counting Missing Values
  • 3.2.3 Handling the Missing Embarkation Port (Embarked)
  • 3.2.4 Handling the Missing Ages (Age)
  • 3.2.5 Handling the Missing Cabin Numbers (Cabin)
  • 3.3 Feature Engineering on the Training Set
  • 3.3.1 Printing the First 5 Rows
  • 3.3.2 Processing Pclass (Passenger Class)
  • 3.3.3 Processing Names
  • 3.3.4 Processing Sex
  • 3.3.5 Processing Age
  • 3.3.6 Processing Family
  • 3.3.7 Processing Ticket and Fare
  • 3.3.8 Processing the Embarkation Port
  • 3.4 Processing the Test Set
  • 3.5 Examining Correlations
  • 3.6 Splitting into Training and Validation Sets
  • 4. Training the Models
  • 4.1 Logistic Regression
  • 4.2 Random Forest
  • 4.3 Support Vector Machine
  • 4.4 K-Nearest Neighbors
  • 4.5 Decision Tree
  • 5. Evaluating the Models
  • 5.1 Logistic Regression
  • 5.2 Random Forest
  • 5.3 Support Vector Machine
  • 5.4 K-Nearest Neighbors
  • 5.5 Decision Tree
  • 5.6 Comparing the 5 Models
  • 6. Making Predictions
  • 6.1 Logistic Regression
  • 6.2 Random Forest
  • 6.3 Support Vector Machine
  • 6.4 K-Nearest Neighbors
  • 6.5 Decision Tree
  • 6.6 Submitting the Results to Kaggle for Scoring
  • 7. Complete Code
  • 8. Summary
    1. Importing Libraries

    I usually like to import every library I am going to use right at the start of the code. This lays out the overall approach clearly and keeps the code layout tidy.
    Below are the libraries needed for the Titanic project:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns 
    import random
    

    For how to use matplotlib, numpy, and pandas, you can refer to another post of mine:
    Essential tools for getting started with machine learning (a hand-holding tutorial): matplotlib, numpy, pandas, all in one post
    Here is the link to the Titanic dataset: https://pan.baidu.com/s/1C8mRRSkSdBsVRZ5zv_zCbA?pwd=w45i
    Extraction code: w45i

    2. Importing Machine Learning Models

    Five models are imported here so that we can compare how they differ in performance.

    2.1 Logistic Regression

    from sklearn.linear_model import LogisticRegression
    

    2.2 Random Forest

    from sklearn.ensemble import RandomForestClassifier 
    

    2.3 Support Vector Machine

    from sklearn.svm import SVC 
    

    2.4 K-Nearest Neighbors

    from sklearn.neighbors import KNeighborsClassifier 
    

    2.5 Decision Tree

    from sklearn.tree import DecisionTreeClassifier 
    

    3. Data Processing

    3.1 Loading the Data

    #training set
    train_data = pd.read_csv(r"D:\data\python\taitanic\train.csv")
    #test set
    test_data = pd.read_csv(r"D:\data\python\taitanic\test.csv")
    

    3.2 Handling Missing Values in the Training Set

    3.2.1 Checking the Dimensions

    >>>print('Training set:', train_data.shape, 'Test set:', test_data.shape)
    >Training set: (891, 12) Test set: (418, 11)
    

    3.2.2 Counting Missing Values

    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age            177
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Cabin          687
     Embarked         2
     dtype: int64
    

    3.2.3 Handling the Missing Embarkation Port (Embarked)

    Count how many passengers boarded at each port:

    >>>plt.figure(figsize=(10,5),dpi=100)
    >>>train_data['Embarked'].value_counts().plot(kind='bar')
    >>>plt.savefig(r"D:\data\python\exercise\test2\1.png")
    

    Port S has by far the most passengers, and only two values are missing, so we fill the blanks with the most frequent port.

    >>>train_data['Embarked'] = train_data['Embarked'].fillna('S')
    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age            177
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Cabin          687
     Embarked         0
     dtype: int64
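
    Rather than hardcoding 'S', the same fill can also be computed from the data itself; a small equivalent sketch using the column's mode:

    # Equivalent fill using the most frequent embarkation port instead of a hardcoded 'S'
    most_common_port = train_data['Embarked'].mode()[0]   # 'S' for this dataset
    train_data['Embarked'] = train_data['Embarked'].fillna(most_common_port)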
    

    3.2.4 Handling the Missing Ages (Age)

    Fill the missing ages (Age) with the mean:

    >>>train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age              0
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Cabin          687
     Embarked         0
     dtype: int64
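
    Filling every missing age with one overall mean is simple but coarse. A common refinement, shown only as a sketch (the rest of this article keeps the mean fill above), is to impute with the median age of each Sex/Pclass group:

    # Alternative sketch: impute Age with the median of each (Sex, Pclass) group.
    # It would have to run instead of the mean fill above, while Sex and Pclass
    # are still raw columns.
    train_data['Age'] = (train_data.groupby(['Sex', 'Pclass'])['Age']
                                   .transform(lambda s: s.fillna(s.median())))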
    

    3.2.5 Handling the Missing Cabin Numbers (Cabin)

    At this point every missing value except Cabin has been filled. The Cabin column has far too many missing values; filling them would make the resulting data unreliable, so we drop the column instead.

    >>>train_data.drop(columns = 'Cabin', axis=1,inplace=True)
    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age              0
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Embarked         0
     dtype: int64
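
    Dropping Cabin is the pragmatic choice with 687 of 891 values missing. For reference, another option sometimes used (a sketch only, not applied below) is to keep just the deck letter and treat unknown cabins as their own category:

    # Alternative sketch: keep the deck letter and mark missing cabins as 'U' (unknown).
    # It would need to run before the drop above, while Cabin still exists.
    deck = train_data['Cabin'].fillna('U').str[0]      # e.g. 'C85' -> 'C'
    deck_dummies = pd.get_dummies(deck, prefix='Deck')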
    

    With that, the missing values in the training set are fully handled.

    3.3 Feature Engineering on the Training Set

    3.3.1 Printing the First 5 Rows

    >>>train_data.head()
    >
    	PassengerId	Survived	Pclass					Name								Sex		Age		SibSp	Parch	Ticket				Fare	Embarked
    0	1			0			3		Braund, Mr. Owen Harris								male	22.0	1		0		A/5 21171			7.2500	S
    1	2			1			1		Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1		0		PC 17599			71.2833	C
    2	3			1			3		Heikkinen, Miss. Laina								female	26.0	0		0		STON/O2. 3101282	7.9250	S
    3	4			1			1		Futrelle, Mrs. Jacques Heath (Lily May Peel)		female	35.0	1		0		113803				53.1000	S
    4	5			0			3		Allen, Mr. William Henry							male	35.0	0		0		373450				8.0500	S
    

    3.3.2 Processing Pclass (Passenger Class)

    Use get_dummies for one-hot encoding, with column prefix Pclass:

    >>>pclassdf1 = pd.DataFrame()
    >>>pclassdf1 = pd.get_dummies(train_data['Pclass'] , prefix='Pclass' )
    >>>train_data = pd.concat([train_data, pclassdf1], axis=1)
    >>>train_data.drop('Pclass',axis=1, inplace=True)
    >>>train_data.head()
    	PassengerId	Survived					Name								Sex		Age		SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3
    0	1			0			Braund, Mr. Owen Harris								male	22.0	1		0		A/5 21171			7.2500	S			0			0			1
    1	2			1			Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1		0		PC 17599			71.2833	C			1			0			0
    2	3			1			Heikkinen, Miss. Laina								female	26.0	0		0		STON/O2. 3101282	7.9250	S			0			0			1
    3	4			1			Futrelle, Mrs. Jacques Heath (Lily May Peel)		female	35.0	1		0		113803				53.1000	S			1			0			0
    4	5			0			Allen, Mr. William Henry							male	35.0	0		0		373450				8.0500	S			0			0			1
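
    If get_dummies is new to you, here is what it does on a tiny stand-alone Series (an illustrative snippet, nothing to do with the Titanic data itself):

    # pd.get_dummies turns one categorical column into one indicator column per category
    demo = pd.Series([3, 1, 3, 2], name='Pclass')
    print(pd.get_dummies(demo, prefix='Pclass'))
    #    Pclass_1  Pclass_2  Pclass_3
    # 0         0         0         1
    # 1         1         0         0
    # 2         0         0         1
    # 3         0         1         0
    # (recent pandas versions print True/False instead of 1/0)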
    

    3.3.3 Processing Names

    >>>def gettitle(name):
    >>>    str1 = name.split(',')[1] #Mr. Owen Harris
    >>>    str2 = str1.split('.')[0]#Mr
    >>>    str3 = str2.strip()
    >>>    return str3
    #store the extracted title feature
    >>>titledf1 = pd.DataFrame()
    #map: apply a function to every element of the Series
    >>>titledf1['Title'] = train_data['Name'].map(gettitle)
    #check the title categories
    >>>titledf1['Title'].value_counts()
    >Mr              517
     Miss            182
     Mrs             125
     Master           40
     Dr                7
     Rev               6
     Col               2
     Major             2
     Mlle              2
     Capt              1
     Lady              1
     Jonkheer          1
     Mme               1
     Sir               1
     the Countess      1
     Ms                1
     Don               1
     Name: Title, dtype: int64
    
    #mapping from the title string in the name to a title category
    >>>title_mapdict = {
    >>>                    "Capt":       "Officer",
    >>>                    "Col":        "Officer",
    >>>                    "Major":      "Officer",
    >>>                    "Jonkheer":   "Royalty",
    >>>                    "Don":        "Royalty",
    >>>                    "Sir":        "Royalty",
    >>>                    "Dr":         "Officer",
    >>>                    "Rev":        "Officer",
    >>>                    "the Countess":"Royalty",
    >>>                    "Dona":       "Royalty",
    >>>                    "Mme":        "Mrs",
    >>>                    "Mlle":       "Miss",
    >>>                    "Ms":         "Mrs",
    >>>                    "Mr":         "Mr",
    >>>                    "Mrs":        "Mrs",
    >>>                    "Miss":       "Miss",
    >>>                    "Master":     "Master",
    >>>                    "Lady":       "Royalty"
    >>>                    }
    #map: apply the mapping to every element of the Series
    >>>titledf1['Title'] = titledf1['Title'].map(title_mapdict)
    #one-hot encode with get_dummies
    >>>titledf1 = pd.get_dummies(titledf1['Title'])
    >>>train_data = pd.concat([train_data, titledf1], axis=1)
    >>>train_data.drop('Name',axis=1, inplace=True)
    >>>train_data.head()
    >
    	PassengerId	Survived	Sex		Age		SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	Royalty
    0	1			0			male	22.0	1		0		A/5 21171			7.2500	S			0			0			1			0		0		1	0	0		0
    1	2	 		1			female	38.0	1		0		PC 17599			71.2833	C			1			0			0			0		0		0	1	0		0
    2	3			1			female	26.0	0		0		STON/O2. 3101282	7.9250	S			0			0			1			0		1		0	0	0		0
    3	4			1			female	35.0	1		0		113803				53.1000	S			1			0			0			0		0		0	1	0		0
    4	5			0			male	35.0	0		0		373450				8.0500	S			0			0			1			0		0		1	0	0		0
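
    The split-based gettitle works because every Name follows the pattern "Surname, Title. Given names". For reference, the same extraction can be done in one line with a regular expression (an equivalent sketch that would need to run before Name is dropped):

    # Equivalent sketch: extract the text between the comma and the first period
    titledf1['Title'] = train_data['Name'].str.extract(r',\s*([^.]+)\.', expand=False)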
    

    3.3.4 Processing Sex

    #process sex
    >>>sex_mapdict = {'male': 1, 'female': 0}
    #map: apply the mapping to every element of the Series
    >>>train_data['Sex'] = train_data['Sex'].map(sex_mapdict)
    >>>sexdf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Sex
    >>>sexdf1 = pd.get_dummies( train_data['Sex'], prefix='Sex')
    >>>train_data = pd.concat([train_data, sexdf1], axis=1)
    >>>train_data.drop('Sex',axis=1, inplace=True)
    >>>train_data.head()
    >
    
    	PassengerId	Survived	Age		SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	Royalty	Sex_0	Sex_1
    0	1			0			22.0	1		0		A/5 21171			7.2500	S			0			0			1			0		0		1	0	0		0		0		1
    1	2			1			38.0	1		0		PC 17599			71.2833	C			1			0			0			0		0		0	1	0		0		1		0
    2	3			1			26.0	0		0		STON/O2. 3101282	7.9250	S			0			0			1			0		1		0	0	0		0		1		0
    3	4			1			35.0	1		0		113803				53.1000	S			1			0			0			0		0		0	1	0		0		1		0
    4	5			0			35.0	0		0		373450				8.0500	S			0			0			1			0		0		1	0	0		0		0		1
    

    3.3.5 Processing Age

    >>>plt.figure(figsize=(18,4))
    >>>train_data['Age']=train_data['Age'].astype(int)
    >>>average_age=train_data[['Age','Survived']].groupby('Age',as_index=False).mean()
    >>>sns.barplot(x='Age',y='Survived',data=average_age,palette='BuPu')
    >>>plt.savefig(r"D:\data\python\exercise\test2\2.png")
    

    >>>agedf1 = pd.DataFrame(train_data['Age'])
    >>>agedf1['Age_baby'] = agedf1['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    >>>agedf1['Age_youth'] = agedf1['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    >>>agedf1['Age_middle'] = agedf1['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    >>>agedf1['Age_old'] = agedf1['Age'].map(lambda s: 1 if 60 <= s else 0)
    >>>train_data = pd.concat([train_data,agedf1], axis=1)
    >>>train_data.drop('Age',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	...	Mr	Mrs	Officer	Royalty	Sex_0	Sex_1	Age_baby	Age_youth	Age_middle	Age_old
    0	1			0			1		0		A/5 21171			7.2500	S			0			0			1			...	1	0	0		0		0		1		0			1			0			0
    1	2			1			1		0		PC 17599			71.2833	C			1			0			0			...	0	1	0		0		1		0		0			0			1			0
    2	3			1			0		0		STON/O2. 3101282	7.9250	S			0			0			1			...	0	0	0		0		1		0		0			1			0			0
    3	4			1			1		0		113803				53.1000	S			1			0			0			...	0	1	0		0		1		0		0			0			1			0
    4	5			0			0		0		373450				8.0500	S			0			0			1			...	1	0	0		0		0		1		0			0			1			0
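
    The four lambdas above define the age bands by hand. The same bands can be built more compactly with pd.cut, shown here only as a sketch (note one difference: with the lambdas an age of exactly 60 satisfies both the Age_middle and the Age_old condition, whereas these bins do not overlap):

    # Sketch: non-overlapping age bands via pd.cut, then one-hot encode them.
    # It would need to run before the Age column is dropped above.
    age_band = pd.cut(train_data['Age'], bins=[0, 15, 35, 60, 200],
                      right=False, labels=['baby', 'youth', 'middle', 'old'])
    age_dummies = pd.get_dummies(age_band, prefix='Age')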
    
    

    3.3.6 Processing Family

    #store family information
    >>>familydf1 = pd.DataFrame()
    #family size
    >>>familydf1['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1
    #family category
    #the expression before `if` is returned when the condition holds, otherwise 0
    >>>familydf1['Family_Single'] = familydf1['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    >>>familydf1['Family_Small'] = familydf1['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    >>>familydf1['Family_Large'] = familydf1['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    >>>familydf1.drop('FamilySize',axis=1, inplace=True)
    >>>train_data = pd.concat([train_data, familydf1], axis=1)
    >>>train_data.drop('Parch',axis=1, inplace=True)
    >>>train_data.drop('SibSp',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	...	Royalty	Sex_0	Sex_1	Age_baby	Age_youth	Age_middle	Age_old	Family_Single	Family_Small	Family_Large
    0	1			0			A/5 21171			7.2500	S			0			0			1			0		0		...	0		0		1		0			1			0			0		0				1				0
    1	2			1			PC 17599			71.2833	C			1			0			0			0		0		...	0		1		0		0			0			1			0		0				1				0
    2	3			1			STON/O2. 3101282	7.9250	S			0			0			1			0		1		...	0		1		0		0			1			0			0		1				0				0
    3	4			1			113803				53.1000	S			1			0			0			0		0		...	0		1		0		0			0			1			0		0				1				0
    4	5			0			373450				8.0500	S			0			0			1			0		0		...	0		0		1		0			0			1			0		1				0				0
    

    3.3.7 Processing Ticket and Fare

    The ticket number is of no use here, so we drop it:

    >>>train_data.drop('Ticket',axis=1, inplace=True)
    
    >>>faredf1=pd.DataFrame(train_data['Fare'])
    >>>faredf1['Fare_First']=faredf1['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    >>>faredf1['Fare_Second']=faredf1['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    >>>faredf1['Fare_Third']=faredf1['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    >>>train_data = pd.concat([train_data,faredf1], axis=1)
    >>>train_data.drop('Fare',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	...	Age_baby	Age_youth	Age_middle	Age_old	Family_Single	Family_Small	Family_Large	Fare_First	Fare_Second	Fare_Third
    0	1			0			S			0			0			1			0		0		1	0	...	0			1			0			0		0				1				0				0			0			1
    1	2			1			C			1			0			0			0		0		0	1	...	0			0			1			0		0				1				0				1			0			0
    2	3			1			S			0			0			1			0		1		0	0	...	0			1			0			0		1				0				0				0			0			1
    3	4			1			S			1			0			0			0		0		0	1	...	0			0			1			0		0				1				0				1			0			0
    4	5			0			S			0			0			1			0		0		1	0	...	0			0			1			0		1				0				0				0			0			1
    

    3.3.8 Processing the Embarkation Port

    #process the embarkation port
    >>>embarkeddf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Embarked
    >>>embarkeddf1 = pd.get_dummies(train_data['Embarked'], prefix='Embarked')
    >>>train_data = pd.concat([train_data, embarkeddf1], axis=1)
    >>>train_data.drop('Embarked',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	...	Age_old	Family_Single	Family_Small	Family_Large	Fare_First	Fare_Second	Fare_Third	Embarked_C	Embarked_Q	Embarked_S
    0	1			0			0			0			1			0		0		1	0	0		...	0		0				1				0				0			0			1			0			0			1
    1	2			1			1			0			0			0		0		0	1	0		...	0		0				1				0				1			0			0			1			0			0	
    2	3			1			0			0			1			0		1		0	0	0		...	0		1				0				0				0			0			1			0			0			1
    3	4			1			1			0			0			0		0		0	1	0		...	0		0				1				0				1			0			0			0			0			1
    4	5			0			0			0			1			0		0		1	0	0		...	0		1				0				0				0			0			1			0			0			1
    

    With that, all of the training-set features have been processed.

    3.4 Processing the Test Set

    Process the test set with the same operations as above; for prediction to work, the test set must end up with the same features as the training set.

    First, count the missing values:

    >>>test_data.isnull().sum()
    >PassengerId      0
     Pclass           0
     Name             0
     Sex              0
     Age             86
     SibSp            0
     Parch            0
     Ticket           0
     Fare             1
     Cabin          327
     Embarked         0
     dtype: int64
    
    #handle missing values in the test set
    >>>test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))
    >>>test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
    >>>test_data.drop(columns = 'Cabin', axis=1,inplace=True)
    >>>test_data.isnull().sum()
    >PassengerId    0
     Pclass         0
     Name           0
     Sex            0
     Age            0
     SibSp          0
     Parch          0
     Ticket         0
     Fare           0
     Embarked       0
     dtype: int64
    
    #process age
    >>>agedf2 = pd.DataFrame(test_data['Age'])
    >>>agedf2['Age_baby'] = agedf2['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    >>>agedf2['Age_youth'] = agedf2['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    >>>agedf2['Age_middle'] = agedf2['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    >>>agedf2['Age_old'] = agedf2['Age'].map(lambda s: 1 if 60 <= s else 0)
    >>>test_data = pd.concat([test_data, agedf2], axis=1)
    >>>test_data.drop('Age',axis=1, inplace=True)
    #process sex
    >>>test_data['Sex'] = test_data['Sex'].map(sex_mapdict)
    >>>sexdf2 = pd.DataFrame()
    >>>sexdf2 = pd.get_dummies( test_data['Sex'], prefix='Sex')
    >>>test_data = pd.concat([test_data, sexdf2], axis=1)
    >>>test_data.drop('Sex',axis=1, inplace=True)
    #process the embarkation port
    >>>embarkeddf2 = pd.DataFrame()
    >>>embarkeddf2 = pd.get_dummies(test_data['Embarked'], prefix='Embarked')
    >>>test_data = pd.concat([test_data, embarkeddf2], axis=1)
    >>>test_data.drop('Embarked',axis=1, inplace=True)
    #process passenger class
    >>>pclassdf2 = pd.DataFrame()
    >>>pclassdf2 = pd.get_dummies( test_data['Pclass'] , prefix='Pclass' )
    >>>test_data = pd.concat([test_data, pclassdf2], axis=1)
    >>>test_data.drop('Pclass',axis=1, inplace=True)
    #store the extracted title feature
    >>>titledf2 = pd.DataFrame()
    >>>titledf2['Title'] = test_data['Name'].map(gettitle)
    >>>titledf2['Title'] = titledf2['Title'].map(title_mapdict)
    >>>titledf2 = pd.get_dummies(titledf2['Title'])
    >>>test_data = pd.concat([test_data, titledf2], axis=1)
    >>>test_data.drop('Name',axis=1, inplace=True)
    #store family information
    >>>familydf2 = pd.DataFrame()
    >>>familydf2['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1
    >>>familydf2['Family_Single'] = familydf2['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    >>>familydf2['Family_Small'] = familydf2['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    >>>familydf2['Family_Large'] = familydf2['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    >>>familydf2.drop('FamilySize',axis=1, inplace=True)
    >>>test_data = pd.concat([test_data, familydf2], axis=1)
    >>>test_data.drop('Parch',axis=1, inplace=True)
    >>>test_data.drop('SibSp',axis=1, inplace=True)
    #process the fare (the ticket column is dropped as well)
    >>>faredf2=pd.DataFrame(test_data['Fare'])
    >>>faredf2['Fare_First']=faredf2['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    >>>faredf2['Fare_Second']=faredf2['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    >>>faredf2['Fare_Third']=faredf2['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    >>>test_data = pd.concat([test_data, faredf2], axis=1)
    >>>test_data.drop('Fare',axis=1, inplace=True)
    >>>test_data.drop('Ticket',axis=1, inplace=True)
    >>>test_data.head()
    >	
    	PassengerId	Age_baby	Age_youth	Age_middle	Age_old	Sex_0	Sex_1	Embarked_C	Embarked_Q	Embarked_S	...	Mr	Mrs	Officer	Royalty	Family_Single	Family_Small	Family_Large	Fare_First	Fare_Second	Fare_Third
    0	892			0			1			0			0		0		1		0			1			0			...	1	0	0		0		1				0				0				0			0			1
    1	893			0			0			1			0		1		0		0			0			1			...	0	1	0		0		0				1				0				0			0			1
    2	894			0			0			0			1		0		1		0			1			0			...	1	0	0		0		1				0				0				0			0			1
    3	895			0			1			0			0		0		1		0			0			1			...	1	0	0		0		1				0				0				0			0			1
    4	896			0			1			0			0		1		0		0			0			1			...	0	1	0		0		0				1				0				0			1			0
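
    One detail worth guarding against before prediction: after all this processing the test-set columns do not end up in the same order as the training features (and Survived exists only in the training set). Recent versions of scikit-learn will warn or refuse when the feature names seen at prediction time do not match those seen at fit time, so a small defensive step (a suggested sketch, not part of the original code) is:

    # Align the test-set columns with the training features; any one-hot column
    # that never occurs in the test set would be filled with 0.
    feature_cols = train_data.drop('Survived', axis=1).columns
    test_data = test_data.reindex(columns=feature_cols, fill_value=0)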
    

    3.5 Examining Correlations

    >>>corrdf = train_data.corr()
    >>>corrdf
    >(output: the 26 x 26 correlation matrix of every encoded feature, from PassengerId and Survived through Embarked_S; it is too wide to reproduce here, and the correlation of each feature with Survived is listed below)
    
    >>>corrdf['Survived'].round(4).abs().sort_values(ascending=False)
    >Survived         1.0000
     Mr               0.5492
     Sex_1            0.5434
     Sex_0            0.5434
     Mrs              0.3449
     Miss             0.3328
     Pclass_3         0.3223
     Pclass_1         0.2859
     Family_Small     0.2799
     Fare_Third       0.2703
     Fare_First       0.2543
     Family_Single    0.2034
     Embarked_C       0.1682
     Embarked_S       0.1497
     Family_Large     0.1251
     Age_baby         0.1230
     Pclass_2         0.0933
     Age_youth        0.0912
     Master           0.0852
     Fare_Second      0.0662
     Age_old          0.0409
     Age_middle       0.0392
     Royalty          0.0334
     Officer          0.0313
     PassengerId      0.0050
     Embarked_Q       0.0037
     Name: Survived, dtype: float64
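
    The full matrix is much easier to scan as a heatmap; a quick optional sketch with seaborn (the output path is just an example in the same folder as the earlier figures):

    # Optional: visualize the correlation matrix as a heatmap
    plt.figure(figsize=(16, 12))
    sns.heatmap(corrdf, cmap='coolwarm', center=0)
    plt.savefig(r"D:\data\python\exercise\test2\corr_heatmap.png")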
    

    3.6 Splitting into Training and Validation Sets

    >>>from sklearn.model_selection import train_test_split
    >>>source_y = train_data.Survived
    >>>source_x = train_data.drop(['Survived'],axis=1)
    >>>train_x, test_x, train_y, test_y = train_test_split(source_x, source_y , train_size=0.8,test_size=0.2)
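
    Two variants worth considering at this step, shown only as a sketch: passing random_state makes the split, and therefore the accuracy figures in section 5, reproducible; and dropping PassengerId removes a feature whose correlation with Survived is only 0.005 (see the list above) and which, ranging from 1 to 891 while every other feature is 0 or 1, dominates the distance calculations used by KNN and the RBF kernel of SVM.

    # Sketch of a reproducible split without the PassengerId pseudo-feature.
    # If you adopt this, drop PassengerId from the test features as well before
    # predicting, and keep it separately for the submission file.
    source_x = train_data.drop(['Survived', 'PassengerId'], axis=1)
    train_x, test_x, train_y, test_y = train_test_split(
        source_x, source_y, train_size=0.8, test_size=0.2,
        random_state=42, stratify=source_y)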
    

    4. Training the Models

    4.1 Logistic Regression

    >>>model_lr = LogisticRegression()
    >>>model_lr.fit(train_x, train_y)
    

    4.2 Random Forest

    >>>model_rfc = RandomForestClassifier()
    >>>model_rfc.fit(train_x, train_y)
    

    4.3 Support Vector Machine

    >>>model_svm = SVC()
    >>>model_svm.fit(train_x, train_y)
    

    4.4 K-Nearest Neighbors

    >>>model_knn = KNeighborsClassifier()
    >>>model_knn.fit(train_x, train_y)
    

    4.5 Decision Tree

    >>>model_dtree = DecisionTreeClassifier()
    >>>model_dtree.fit(train_x, train_y)
    

    5. Evaluating the Models

    5.1 Logistic Regression

    >>>accuracy_lr = model_lr.score(test_x,test_y)
    >>>print("Logistic regression test accuracy:", accuracy_lr)
    >Logistic regression test accuracy: 0.7988826815642458
    

    5.2 Random Forest

    >>>accuracy_rfc = model_rfc.score(test_x,test_y)
    >>>print("Random forest test accuracy:", accuracy_rfc)
    >Random forest test accuracy: 0.770949720670391
    

    5.3 Support Vector Machine

    >>>accuracy_svm = model_svm.score(test_x,test_y)
    >>>print("SVM test accuracy:", accuracy_svm)
    >SVM test accuracy: 0.5698324022346368
    

    5.4 K-Nearest Neighbors

    >>>accuracy_knn = model_knn.score(test_x,test_y)
    >>>print("KNN test accuracy:", accuracy_knn)
    >KNN test accuracy: 0.553072625698324
    

    5.5 Decision Tree

    >>>accuracy_dtree= model_dtree.score(test_x,test_y)
    >>>print("Decision tree test accuracy:", accuracy_dtree)
    >Decision tree test accuracy: 0.7541899441340782
    

    5.6 Comparing the 5 Models

    >>>import seaborn as sns
    >>>import matplotlib.pyplot as plt
    >>>sns.set(rc={'figure.figsize':(15,6)}) 
    >>>accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn, accuracy_dtree]
    >>>models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree']
    >>>bar = sns.barplot(x=models, y=accuracys)
    # show the value labels above the bars
    >>>for x, y in enumerate(accuracys):
    >>>    plt.text(x, y, '%s'% round(y,3), ha='center')
    >>>plt.xlabel("Model")
    >>>plt.ylabel("Accuracy")
    >>>plt.savefig(r"D:\data\python\exercise\test2\3.png")
    >>>plt.show()
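
    A single 80/20 split gives fairly noisy accuracy figures: rerunning the notebook can change the ordering of the bars. A more stable comparison, included here as an optional sketch, is k-fold cross-validation over the full training data:

    # Optional sketch: compare the five classifiers with 5-fold cross-validation
    from sklearn.model_selection import cross_val_score

    cv_models = {'Logistic': LogisticRegression(max_iter=1000),
                 'RF': RandomForestClassifier(),
                 'SVM': SVC(),
                 'KNN': KNeighborsClassifier(),
                 'Dtree': DecisionTreeClassifier()}
    for name, model in cv_models.items():
        scores = cross_val_score(model, source_x, source_y, cv=5)
        print(name, round(scores.mean(), 3))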
    

    6. Making Predictions

    6.1 Logistic Regression

    #the prediction input is the fully processed test set
    >>>pred_x = test_data
    >>>pred_lr = model_lr.predict(pred_x)
    >>>pred_lr = pred_lr.astype(int)
    >>>passenger_id = test_data.iloc[:, 0]
    #logistic regression predictions
    >>>preddf1 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_lr})
    >>>preddf1.to_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv', index=False)
    

    6.2 Random Forest

    >>>pred_rfc = model_rfc.predict(pred_x)
    >>>pred_rfc = pred_rfc.astype(int)
    #random forest predictions
    >>>preddf2 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_rfc})
    >>>preddf2.to_csv(r'D:\data\python\taitanic\titanic_pred_model_rfc.csv', index=False)
    

    6.3 Support Vector Machine

    >>>pred_svm = model_svm.predict(pred_x)
    >>>pred_svm = pred_svm.astype(int)
    #SVM predictions
    >>>preddf3 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_svm})
    >>>preddf3.to_csv(r'D:\data\python\taitanic\titanic_pred_model_svm.csv', index=False)
    

    6.4 K-Nearest Neighbors

    >>>pred_knn = model_knn.predict(pred_x)
    >>>pred_knn = pred_knn.astype(int)
    #KNN predictions
    >>>preddf4 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_knn})
    >>>preddf4.to_csv(r'D:\data\python\taitanic\titanic_pred_model_knn.csv', index=False)
    

    6.5 Decision Tree

    >>>pred_dtree= model_dtree.predict(pred_x)
    >>>pred_dtree = pred_dtree.astype(int)
    #decision tree predictions
    >>>preddf5 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_dtree})
    >>>preddf5.to_csv(r'D:\data\python\taitanic\titanic_pred_model_dtree.csv', index=False)
    

    6.6 Submitting the Results to Kaggle for Scoring

    The scores returned by Kaggle are as follows:

    As you can see, logistic regression again has the highest score.
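
    Before uploading, it is worth a quick check that each CSV has the 418 rows and the two columns Kaggle expects (a small sanity-check sketch):

    # Sanity-check one submission file before uploading
    sub = pd.read_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv')
    print(sub.shape)             # expected: (418, 2)
    print(sub.columns.tolist())  # expected: ['PassengerId', 'Survived']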

    7. Complete Code

    It is recommended to run everything below in a Jupyter notebook.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns 
    import random
    
    from sklearn.linear_model import LogisticRegression # logistic regression
    from sklearn.ensemble import RandomForestClassifier # random forest
    from sklearn.svm import SVC # support vector machine
    from sklearn.neighbors import KNeighborsClassifier # K-nearest neighbors
    from sklearn.tree import DecisionTreeClassifier # decision tree
    
    from sklearn.model_selection import train_test_split
    #training set
    train_data = pd.read_csv(r"D:\data\python\taitanic\train.csv")
    #test set
    test_data = pd.read_csv(r"D:\data\python\taitanic\test.csv")
    print('Training set:', train_data.shape, 'Test set:', test_data.shape)
    
    train_data.isnull().sum()
    
    plt.figure(figsize=(10,5),dpi=100)
    train_data['Embarked'].value_counts().plot(kind='bar')
    plt.savefig(r"D:\data\python\exercise\test2\1.png")
    
    #fill the blanks with the most frequent port
    train_data['Embarked'] = train_data['Embarked'].fillna('S')
    train_data.isnull().sum()
    
    #fill the missing ages (Age) with the mean
    train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
    train_data.isnull().sum()
    
    #all missing values except Cabin are now filled; Cabin has too many missing values to fill sensibly, so drop the column
    train_data.drop(columns = 'Cabin', axis=1,inplace=True)
    
    train_data.head()
    
    #process passenger class
    pclassdf1 = pd.DataFrame()
    #one-hot encode with get_dummies, column prefix Pclass
    pclassdf1 = pd.get_dummies( train_data['Pclass'] , prefix='Pclass' )
    train_data = pd.concat([train_data, pclassdf1], axis=1)
    train_data.drop('Pclass',axis=1, inplace=True)
    train_data.head()
    
    def gettitle(name):
        str1 = name.split(',')[1] #Mr. Owen Harris
        str2 = str1.split('.')[0]#Mr
        str3 = str2.strip()
        return str3
    #store the extracted title feature
    titledf1 = pd.DataFrame()
    #map: apply a function to every element of the Series
    titledf1['Title'] = train_data['Name'].map(gettitle)
    #check the title categories
    titledf1['Title'].value_counts()
    
    #mapping from the title string in the name to a title category
    title_mapdict = {
                        "Capt":       "Officer",
                        "Col":        "Officer",
                        "Major":      "Officer",
                        "Jonkheer":   "Royalty",
                        "Don":        "Royalty",
                        "Sir":        "Royalty",
                        "Dr":         "Officer",
                        "Rev":        "Officer",
                        "the Countess":"Royalty",
                        "Dona":       "Royalty",
                        "Mme":        "Mrs",
                        "Mlle":       "Miss",
                        "Ms":         "Mrs",
                        "Mr":         "Mr",
                        "Mrs":        "Mrs",
                        "Miss":       "Miss",
                        "Master":     "Master",
                        "Lady":       "Royalty"
                        }
    #map: apply the mapping to every element of the Series
    titledf1['Title'] = titledf1['Title'].map(title_mapdict)
    #one-hot encode with get_dummies
    titledf1 = pd.get_dummies(titledf1['Title'])
    train_data = pd.concat([train_data, titledf1], axis=1)
    train_data.drop('Name',axis=1, inplace=True)
    train_data.head()
    
    #process sex
    sex_mapdict = {'male': 1, 'female': 0}
    #map: apply the mapping to every element of the Series
    train_data['Sex'] = train_data['Sex'].map(sex_mapdict)
    sexdf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Sex
    sexdf1 = pd.get_dummies( train_data['Sex'], prefix='Sex')
    train_data = pd.concat([train_data, sexdf1], axis=1)
    train_data.drop('Sex',axis=1, inplace=True)
    train_data.head()
    
    plt.figure(figsize=(18,4))
    train_data['Age']=train_data['Age'].astype(int)
    average_age=train_data[['Age','Survived']].groupby('Age',as_index=False).mean()
    sns.barplot(x='Age',y='Survived',data=average_age,palette='BuPu')
    plt.savefig(r"D:\data\python\exercise\test2\2.png")
    
    agedf1 = pd.DataFrame(train_data['Age'])
    agedf1['Age_baby'] = agedf1['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    agedf1['Age_youth'] = agedf1['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    agedf1['Age_middle'] = agedf1['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    agedf1['Age_old'] = agedf1['Age'].map(lambda s: 1 if 60 <= s else 0)
    train_data = pd.concat([train_data,agedf1], axis=1)
    train_data.drop('Age',axis=1, inplace=True)
    train_data.head()
    
    #store family information
    familydf1 = pd.DataFrame()
    
    #family size
    familydf1['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1
    
    #family category
    #the expression before `if` is returned when the condition holds, otherwise 0
    familydf1['Family_Single'] = familydf1['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    
    familydf1['Family_Small'] = familydf1['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    
    familydf1['Family_Large'] = familydf1['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    familydf1.drop('FamilySize',axis=1, inplace=True)
    train_data = pd.concat([train_data, familydf1], axis=1)
    train_data.drop('Parch',axis=1, inplace=True)
    train_data.drop('SibSp',axis=1, inplace=True)
    train_data.head()
    
    #the ticket number is of no use here, so drop it
    train_data.drop('Ticket',axis=1, inplace=True)
    
    faredf1=pd.DataFrame(train_data['Fare'])
    faredf1['Fare_First']=faredf1['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    faredf1['Fare_Second']=faredf1['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    faredf1['Fare_Third']=faredf1['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    train_data = pd.concat([train_data,faredf1], axis=1)
    train_data.drop('Fare',axis=1, inplace=True)
    train_data.head()
    
    #process the embarkation port
    embarkeddf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Embarked
    embarkeddf1 = pd.get_dummies(train_data['Embarked'], prefix='Embarked')
    train_data = pd.concat([train_data, embarkeddf1], axis=1)
    train_data.drop('Embarked',axis=1, inplace=True)
    train_data.head()
    
    test_data.isnull().sum()
    
    #handle missing values in the test set
    test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))
    test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
    test_data.drop(columns = 'Cabin', axis=1,inplace=True)
    test_data.isnull().sum()
    
    agedf2 = pd.DataFrame(test_data['Age'])
    agedf2['Age_baby'] = agedf2['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    agedf2['Age_youth'] = agedf2['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    agedf2['Age_middle'] = agedf2['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    agedf2['Age_old'] = agedf2['Age'].map(lambda s: 1 if 60 <= s else 0)
    test_data = pd.concat([test_data, agedf2], axis=1)
    test_data.drop('Age',axis=1, inplace=True)
    #process sex
    test_data['Sex'] = test_data['Sex'].map(sex_mapdict)
    sexdf2 = pd.DataFrame()
    sexdf2 = pd.get_dummies( test_data['Sex'], prefix='Sex')
    test_data = pd.concat([test_data, sexdf2], axis=1)
    test_data.drop('Sex',axis=1, inplace=True)
    #process the embarkation port
    embarkeddf2 = pd.DataFrame()
    embarkeddf2 = pd.get_dummies(test_data['Embarked'], prefix='Embarked')
    test_data = pd.concat([test_data, embarkeddf2], axis=1)
    test_data.drop('Embarked',axis=1, inplace=True)
    #process passenger class
    pclassdf2 = pd.DataFrame()
    pclassdf2 = pd.get_dummies( test_data['Pclass'] , prefix='Pclass' )
    test_data = pd.concat([test_data, pclassdf2], axis=1)
    test_data.drop('Pclass',axis=1, inplace=True)
    #store the extracted title feature
    titledf2 = pd.DataFrame()
    titledf2['Title'] = test_data['Name'].map(gettitle)
    titledf2['Title'] = titledf2['Title'].map(title_mapdict)
    titledf2 = pd.get_dummies(titledf2['Title'])
    test_data = pd.concat([test_data, titledf2], axis=1)
    test_data.drop('Name',axis=1, inplace=True)
    #store family information
    familydf2 = pd.DataFrame()
    familydf2['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1
    familydf2['Family_Single'] = familydf2['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    familydf2['Family_Small'] = familydf2['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    familydf2['Family_Large'] = familydf2['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    familydf2.drop('FamilySize',axis=1, inplace=True)
    test_data = pd.concat([test_data, familydf2], axis=1)
    test_data.drop('Parch',axis=1, inplace=True)
    test_data.drop('SibSp',axis=1, inplace=True)
    #process the fare (the ticket column is dropped as well)
    faredf2=pd.DataFrame(test_data['Fare'])
    faredf2['Fare_First']=faredf2['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    faredf2['Fare_Second']=faredf2['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    faredf2['Fare_Third']=faredf2['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    test_data = pd.concat([test_data, faredf2], axis=1)
    test_data.drop('Fare',axis=1, inplace=True)
    test_data.drop('Ticket',axis=1, inplace=True)
    test_data.head()
    
    corrdf = train_data.corr()
    corrdf
    
    corrdf['Survived'].round(4).abs().sort_values(ascending=False)
    
    source_y = train_data.Survived
    source_x = train_data.drop(['Survived'],axis=1)
    train_x, test_x, train_y, test_y = train_test_split(source_x, 
               source_y , train_size=0.8,test_size=0.2)
    
    pred_x=test_data
    
    model_lr = LogisticRegression()
    model_lr.fit(train_x, train_y)
    pred_lr = model_lr.predict(pred_x)
    pred_lr = pred_lr.astype(int)
    accuracy_lr = model_lr.score(test_x,test_y)
    print("Logistic regression test accuracy:", accuracy_lr)
    
    model_rfc = RandomForestClassifier()
    model_rfc.fit(train_x, train_y)
    pred_rfc = model_rfc.predict(pred_x)
    pred_rfc = pred_rfc.astype(int)
    accuracy_rfc = model_rfc.score(test_x,test_y)
    print("Random forest test accuracy:", accuracy_rfc)
    
    model_svm = SVC()
    model_svm.fit(train_x, train_y)
    pred_svm = model_svm.predict(pred_x)
    pred_svm = pred_svm.astype(int)
    accuracy_svm = model_svm.score(test_x,test_y)
    print("SVM test accuracy:", accuracy_svm)
    
    model_knn = KNeighborsClassifier()
    model_knn.fit(train_x, train_y)
    pred_knn = model_knn.predict(pred_x)
    pred_knn = pred_knn.astype(int)
    accuracy_knn = model_knn.score(test_x,test_y)
    print("KNN test accuracy:", accuracy_knn)
    
    model_dtree = DecisionTreeClassifier()
    model_dtree.fit(train_x, train_y)
    pred_dtree= model_dtree.predict(pred_x)
    pred_dtree = pred_dtree.astype(int)
    accuracy_dtree= model_dtree.score(test_x,test_y)
    print("Decision tree test accuracy:", accuracy_dtree)
    
    import seaborn as sns
    import matplotlib.pyplot as plt
     
    sns.set(rc={'figure.figsize':(15,6)}) # set the figure size
    accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn, accuracy_dtree]
    models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree']
    bar = sns.barplot(x=models, y=accuracys)
     
    # show the value labels above the bars
    for x, y in enumerate(accuracys):
        plt.text(x, y, '%s'% round(y,3), ha='center')
    
    plt.xlabel("Model")
    plt.ylabel("Accuracy")
    plt.savefig(r"D:\data\python\exercise\test2\3.png")
    plt.show()
    
    #DataFrames: passenger ID plus the predicted survival value
    passenger_id = test_data.iloc[:, 0]
    #logistic regression predictions
    preddf1 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_lr})
    preddf1.to_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv', index=False)
    #random forest predictions
    preddf2 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_rfc})
    preddf2.to_csv(r'D:\data\python\taitanic\titanic_pred_model_rfc.csv', index=False)
    #SVM predictions
    preddf3 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_svm})
    preddf3.to_csv(r'D:\data\python\taitanic\titanic_pred_model_svm.csv', index=False)
    #KNN predictions
    preddf4 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_knn})
    preddf4.to_csv(r'D:\data\python\taitanic\titanic_pred_model_knn.csv', index=False)
    #decision tree predictions
    preddf5 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_dtree})
    preddf5.to_csv(r'D:\data\python\taitanic\titanic_pred_model_dtree.csv', index=False)
    

    8. Summary

    That concludes the survival prediction analysis for the Titanic.
    If you found this article useful, please consider liking and bookmarking it.
    Corrections, comments, and your own thoughts are very welcome in the comment section.
