Exploring Titanic Survival: Data Analysis and Prediction

Table of Contents

  • 1. Importing Libraries
  • 2. Importing Machine Learning Models
  • 2.1 Logistic Regression
  • 2.2 Random Forest
  • 2.3 Support Vector Machine
  • 2.4 K-Nearest Neighbors
  • 2.5 Decision Tree
  • 3. Data Processing
  • 3.1 Loading the Data
  • 3.2 Handling Missing Values in the Training Set
  • 3.2.1 Checking the Dimensions
  • 3.2.2 Counting Missing Values
  • 3.2.3 Handling the Missing Embarkation Port (Embarked)
  • 3.2.4 Handling the Missing Ages (Age)
  • 3.2.5 Handling the Missing Cabin Numbers (Cabin)
  • 3.3 Feature Engineering on the Training Set
  • 3.3.1 Printing the First 5 Rows
  • 3.3.2 Processing Pclass (Passenger Class)
  • 3.3.3 Processing Names
  • 3.3.4 Processing Sex
  • 3.3.5 Processing Age
  • 3.3.6 Processing Family
  • 3.3.7 Processing Ticket and Fare
  • 3.3.8 Processing the Embarkation Port
  • 3.4 Processing the Test Set
  • 3.5 Examining Correlations
  • 3.6 Splitting into Training and Validation Sets
  • 4. Training the Models
  • 4.1 Logistic Regression
  • 4.2 Random Forest
  • 4.3 Support Vector Machine
  • 4.4 K-Nearest Neighbors
  • 4.5 Decision Tree
  • 5. Evaluating the Models
  • 5.1 Logistic Regression
  • 5.2 Random Forest
  • 5.3 Support Vector Machine
  • 5.4 K-Nearest Neighbors
  • 5.5 Decision Tree
  • 5.6 Comparing the 5 Models
  • 6. Making Predictions
  • 6.1 Logistic Regression
  • 6.2 Random Forest
  • 6.3 Support Vector Machine
  • 6.4 K-Nearest Neighbors
  • 6.5 Decision Tree
  • 6.6 Submitting the Results to Kaggle for Scoring
  • 7. Complete Code
  • 8. Summary
    1. Importing Libraries

    I usually like to import every library I am going to use right at the start of the code. This lays out the overall approach clearly and keeps the code layout tidy.
    Below are the libraries needed for the Titanic project:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns 
    import random
    

    For how to use matplotlib, numpy, and pandas, you can refer to another post of mine:
    Essential tools for getting started with machine learning (a hand-holding tutorial): matplotlib, numpy, pandas, all in one post
    Here is the link to the Titanic dataset: https://pan.baidu.com/s/1C8mRRSkSdBsVRZ5zv_zCbA?pwd=w45i
    Extraction code: w45i

    2. Importing Machine Learning Models

    Five models are imported here so that we can compare how they differ in performance.

    2.1 Logistic Regression

    from sklearn.linear_model import LogisticRegression
    

    2.2 Random Forest

    from sklearn.ensemble import RandomForestClassifier 
    

    2.3 Support Vector Machine

    from sklearn.svm import SVC 
    

    2.4 K-Nearest Neighbors

    from sklearn.neighbors import KNeighborsClassifier 
    

    2.5 Decision Tree

    from sklearn.tree import DecisionTreeClassifier 
    

    3. Data Processing

    3.1 Loading the Data

    #training set
    train_data = pd.read_csv(r"D:\data\python\taitanic\train.csv")
    #test set
    test_data = pd.read_csv(r"D:\data\python\taitanic\test.csv")
    

    3.2 Handling Missing Values in the Training Set

    3.2.1 Checking the Dimensions

    >>>print('Training set:', train_data.shape, 'Test set:', test_data.shape)
    >Training set: (891, 12) Test set: (418, 11)
    

    3.2.2 Counting Missing Values

    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age            177
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Cabin          687
     Embarked         2
     dtype: int64
    

    3.2.3 Handling the Missing Embarkation Port (Embarked)

    Count how many passengers boarded at each port:

    >>>plt.figure(figsize=(10,5),dpi=100)
    >>>train_data['Embarked'].value_counts().plot(kind='bar')
    >>>plt.savefig(r"D:\data\python\exercise\test2\1.png")
    

    Port S has by far the most passengers, and only two values are missing, so we fill the blanks with the most frequent port.

    >>>train_data['Embarked'] = train_data['Embarked'].fillna('S')
    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age            177
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Cabin          687
     Embarked         0
     dtype: int64
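
    Rather than hardcoding 'S', the same fill can also be computed from the data itself; a small equivalent sketch using the column's mode:

    # Equivalent fill using the most frequent embarkation port instead of a hardcoded 'S'
    most_common_port = train_data['Embarked'].mode()[0]   # 'S' for this dataset
    train_data['Embarked'] = train_data['Embarked'].fillna(most_common_port)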
    

    3.2.4 Handling the Missing Ages (Age)

    Fill the missing ages (Age) with the mean:

    >>>train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age              0
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Cabin          687
     Embarked         0
     dtype: int64
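
    Filling every missing age with one overall mean is simple but coarse. A common refinement, shown only as a sketch (the rest of this article keeps the mean fill above), is to impute with the median age of each Sex/Pclass group:

    # Alternative sketch: impute Age with the median of each (Sex, Pclass) group.
    # It would have to run instead of the mean fill above, while Sex and Pclass
    # are still raw columns.
    train_data['Age'] = (train_data.groupby(['Sex', 'Pclass'])['Age']
                                   .transform(lambda s: s.fillna(s.median())))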
    

    3.2.5 Handling the Missing Cabin Numbers (Cabin)

    At this point every missing value except Cabin has been filled. The Cabin column has far too many missing values; filling them would make the resulting data unreliable, so we drop the column instead.

    >>>train_data.drop(columns = 'Cabin', axis=1,inplace=True)
    >>>train_data.isnull().sum()
    >PassengerId      0
     Survived         0
     Pclass           0
     Name             0
     Sex              0
     Age              0
     SibSp            0
     Parch            0
     Ticket           0
     Fare             0
     Embarked         0
     dtype: int64
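
    Dropping Cabin is the pragmatic choice with 687 of 891 values missing. For reference, another option sometimes used (a sketch only, not applied below) is to keep just the deck letter and treat unknown cabins as their own category:

    # Alternative sketch: keep the deck letter and mark missing cabins as 'U' (unknown).
    # It would need to run before the drop above, while Cabin still exists.
    deck = train_data['Cabin'].fillna('U').str[0]      # e.g. 'C85' -> 'C'
    deck_dummies = pd.get_dummies(deck, prefix='Deck')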
    

    With that, the missing values in the training set are fully handled.

    3.3 Feature Engineering on the Training Set

    3.3.1 Printing the First 5 Rows

    >>>train_data.head()
    >
    	PassengerId	Survived	Pclass					Name								Sex		Age		SibSp	Parch	Ticket				Fare	Embarked
    0	1			0			3		Braund, Mr. Owen Harris								male	22.0	1		0		A/5 21171			7.2500	S
    1	2			1			1		Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1		0		PC 17599			71.2833	C
    2	3			1			3		Heikkinen, Miss. Laina								female	26.0	0		0		STON/O2. 3101282	7.9250	S
    3	4			1			1		Futrelle, Mrs. Jacques Heath (Lily May Peel)		female	35.0	1		0		113803				53.1000	S
    4	5			0			3		Allen, Mr. William Henry							male	35.0	0		0		373450				8.0500	S
    

    3.3.2 Processing Pclass (Passenger Class)

    Use get_dummies for one-hot encoding, with column prefix Pclass:

    >>>pclassdf1 = pd.DataFrame()
    >>>pclassdf1 = pd.get_dummies(train_data['Pclass'] , prefix='Pclass' )
    >>>train_data = pd.concat([train_data, pclassdf1], axis=1)
    >>>train_data.drop('Pclass',axis=1, inplace=True)
    >>>train_data.head()
    	PassengerId	Survived					Name								Sex		Age		SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3
    0	1			0			Braund, Mr. Owen Harris								male	22.0	1		0		A/5 21171			7.2500	S			0			0			1
    1	2			1			Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1		0		PC 17599			71.2833	C			1			0			0
    2	3			1			Heikkinen, Miss. Laina								female	26.0	0		0		STON/O2. 3101282	7.9250	S			0			0			1
    3	4			1			Futrelle, Mrs. Jacques Heath (Lily May Peel)		female	35.0	1		0		113803				53.1000	S			1			0			0
    4	5			0			Allen, Mr. William Henry							male	35.0	0		0		373450				8.0500	S			0			0			1
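
    If get_dummies is new to you, here is what it does on a tiny stand-alone Series (an illustrative snippet, nothing to do with the Titanic data itself):

    # pd.get_dummies turns one categorical column into one indicator column per category
    demo = pd.Series([3, 1, 3, 2], name='Pclass')
    print(pd.get_dummies(demo, prefix='Pclass'))
    #    Pclass_1  Pclass_2  Pclass_3
    # 0         0         0         1
    # 1         1         0         0
    # 2         0         0         1
    # 3         0         1         0
    # (recent pandas versions print True/False instead of 1/0)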
    

    3.3.3 Processing Names

    >>>def gettitle(name):
    >>>    str1 = name.split(',')[1] #Mr. Owen Harris
    >>>    str2 = str1.split('.')[0]#Mr
    >>>    str3 = str2.strip()
    >>>    return str3
    #store the extracted title feature
    >>>titledf1 = pd.DataFrame()
    #map: apply a function to every element of the Series
    >>>titledf1['Title'] = train_data['Name'].map(gettitle)
    #check the title categories
    >>>titledf1['Title'].value_counts()
    >Mr              517
     Miss            182
     Mrs             125
     Master           40
     Dr                7
     Rev               6
     Col               2
     Major             2
     Mlle              2
     Capt              1
     Lady              1
     Jonkheer          1
     Mme               1
     Sir               1
     the Countess      1
     Ms                1
     Don               1
     Name: Title, dtype: int64
    
    #mapping from the title string in the name to a title category
    >>>title_mapdict = {
    >>>                    "Capt":       "Officer",
    >>>                    "Col":        "Officer",
    >>>                    "Major":      "Officer",
    >>>                    "Jonkheer":   "Royalty",
    >>>                    "Don":        "Royalty",
    >>>                    "Sir":        "Royalty",
    >>>                    "Dr":         "Officer",
    >>>                    "Rev":        "Officer",
    >>>                    "the Countess":"Royalty",
    >>>                    "Dona":       "Royalty",
    >>>                    "Mme":        "Mrs",
    >>>                    "Mlle":       "Miss",
    >>>                    "Ms":         "Mrs",
    >>>                    "Mr":         "Mr",
    >>>                    "Mrs":        "Mrs",
    >>>                    "Miss":       "Miss",
    >>>                    "Master":     "Master",
    >>>                    "Lady":       "Royalty"
    >>>                    }
    #map: apply the mapping to every element of the Series
    >>>titledf1['Title'] = titledf1['Title'].map(title_mapdict)
    #one-hot encode with get_dummies
    >>>titledf1 = pd.get_dummies(titledf1['Title'])
    >>>train_data = pd.concat([train_data, titledf1], axis=1)
    >>>train_data.drop('Name',axis=1, inplace=True)
    >>>train_data.head()
    >
    	PassengerId	Survived	Sex		Age		SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	Royalty
    0	1			0			male	22.0	1		0		A/5 21171			7.2500	S			0			0			1			0		0		1	0	0		0
    1	2	 		1			female	38.0	1		0		PC 17599			71.2833	C			1			0			0			0		0		0	1	0		0
    2	3			1			female	26.0	0		0		STON/O2. 3101282	7.9250	S			0			0			1			0		1		0	0	0		0
    3	4			1			female	35.0	1		0		113803				53.1000	S			1			0			0			0		0		0	1	0		0
    4	5			0			male	35.0	0		0		373450				8.0500	S			0			0			1			0		0		1	0	0		0
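
    The split-based gettitle works because every Name follows the pattern "Surname, Title. Given names". For reference, the same extraction can be done in one line with a regular expression (an equivalent sketch that would need to run before Name is dropped):

    # Equivalent sketch: extract the text between the comma and the first period
    titledf1['Title'] = train_data['Name'].str.extract(r',\s*([^.]+)\.', expand=False)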
    

    3.3.4 Processing Sex

    #process sex
    >>>sex_mapdict = {'male': 1, 'female': 0}
    #map: apply the mapping to every element of the Series
    >>>train_data['Sex'] = train_data['Sex'].map(sex_mapdict)
    >>>sexdf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Sex
    >>>sexdf1 = pd.get_dummies( train_data['Sex'], prefix='Sex')
    >>>train_data = pd.concat([train_data, sexdf1], axis=1)
    >>>train_data.drop('Sex',axis=1, inplace=True)
    >>>train_data.head()
    >
    
    	PassengerId	Survived	Age		SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	Royalty	Sex_0	Sex_1
    0	1			0			22.0	1		0		A/5 21171			7.2500	S			0			0			1			0		0		1	0	0		0		0		1
    1	2			1			38.0	1		0		PC 17599			71.2833	C			1			0			0			0		0		0	1	0		0		1		0
    2	3			1			26.0	0		0		STON/O2. 3101282	7.9250	S			0			0			1			0		1		0	0	0		0		1		0
    3	4			1			35.0	1		0		113803				53.1000	S			1			0			0			0		0		0	1	0		0		1		0
    4	5			0			35.0	0		0		373450				8.0500	S			0			0			1			0		0		1	0	0		0		0		1
    

    3.3.5 Processing Age

    >>>plt.figure(figsize=(18,4))
    >>>train_data['Age']=train_data['Age'].astype(int)
    >>>average_age=train_data[['Age','Survived']].groupby('Age',as_index=False).mean()
    >>>sns.barplot(x='Age',y='Survived',data=average_age,palette='BuPu')
    >>>plt.savefig(r"D:\data\python\exercise\test2\2.png")
    

    >>>agedf1 = pd.DataFrame(train_data['Age'])
    >>>agedf1['Age_baby'] = agedf1['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    >>>agedf1['Age_youth'] = agedf1['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    >>>agedf1['Age_middle'] = agedf1['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    >>>agedf1['Age_old'] = agedf1['Age'].map(lambda s: 1 if 60 <= s else 0)
    >>>train_data = pd.concat([train_data,agedf1], axis=1)
    >>>train_data.drop('Age',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	SibSp	Parch	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	...	Mr	Mrs	Officer	Royalty	Sex_0	Sex_1	Age_baby	Age_youth	Age_middle	Age_old
    0	1			0			1		0		A/5 21171			7.2500	S			0			0			1			...	1	0	0		0		0		1		0			1			0			0
    1	2			1			1		0		PC 17599			71.2833	C			1			0			0			...	0	1	0		0		1		0		0			0			1			0
    2	3			1			0		0		STON/O2. 3101282	7.9250	S			0			0			1			...	0	0	0		0		1		0		0			1			0			0
    3	4			1			1		0		113803				53.1000	S			1			0			0			...	0	1	0		0		1		0		0			0			1			0
    4	5			0			0		0		373450				8.0500	S			0			0			1			...	1	0	0		0		0		1		0			0			1			0
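
    The four lambdas above define the age bands by hand. The same bands can be built more compactly with pd.cut, shown here only as a sketch (note one difference: with the lambdas an age of exactly 60 satisfies both the Age_middle and the Age_old condition, whereas these bins do not overlap):

    # Sketch: non-overlapping age bands via pd.cut, then one-hot encode them.
    # It would need to run before the Age column is dropped above.
    age_band = pd.cut(train_data['Age'], bins=[0, 15, 35, 60, 200],
                      right=False, labels=['baby', 'youth', 'middle', 'old'])
    age_dummies = pd.get_dummies(age_band, prefix='Age')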
    
    

    3.3.6 Processing Family

    #store family information
    >>>familydf1 = pd.DataFrame()
    #family size
    >>>familydf1['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1
    #family category
    #the expression before `if` is returned when the condition holds, otherwise 0
    >>>familydf1['Family_Single'] = familydf1['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    >>>familydf1['Family_Small'] = familydf1['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    >>>familydf1['Family_Large'] = familydf1['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    >>>familydf1.drop('FamilySize',axis=1, inplace=True)
    >>>train_data = pd.concat([train_data, familydf1], axis=1)
    >>>train_data.drop('Parch',axis=1, inplace=True)
    >>>train_data.drop('SibSp',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	Ticket				Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	...	Royalty	Sex_0	Sex_1	Age_baby	Age_youth	Age_middle	Age_old	Family_Single	Family_Small	Family_Large
    0	1			0			A/5 21171			7.2500	S			0			0			1			0		0		...	0		0		1		0			1			0			0		0				1				0
    1	2			1			PC 17599			71.2833	C			1			0			0			0		0		...	0		1		0		0			0			1			0		0				1				0
    2	3			1			STON/O2. 3101282	7.9250	S			0			0			1			0		1		...	0		1		0		0			1			0			0		1				0				0
    3	4			1			113803				53.1000	S			1			0			0			0		0		...	0		1		0		0			0			1			0		0				1				0
    4	5			0			373450				8.0500	S			0			0			1			0		0		...	0		0		1		0			0			1			0		1				0				0
    

    3.3.7 Processing Ticket and Fare

    The ticket number is of no use here, so we drop it:

    >>>train_data.drop('Ticket',axis=1, inplace=True)
    
    >>>faredf1=pd.DataFrame(train_data['Fare'])
    >>>faredf1['Fare_First']=faredf1['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    >>>faredf1['Fare_Second']=faredf1['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    >>>faredf1['Fare_Third']=faredf1['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    >>>train_data = pd.concat([train_data,faredf1], axis=1)
    >>>train_data.drop('Fare',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	Embarked	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	...	Age_baby	Age_youth	Age_middle	Age_old	Family_Single	Family_Small	Family_Large	Fare_First	Fare_Second	Fare_Third
    0	1			0			S			0			0			1			0		0		1	0	...	0			1			0			0		0				1				0				0			0			1
    1	2			1			C			1			0			0			0		0		0	1	...	0			0			1			0		0				1				0				1			0			0
    2	3			1			S			0			0			1			0		1		0	0	...	0			1			0			0		1				0				0				0			0			1
    3	4			1			S			1			0			0			0		0		0	1	...	0			0			1			0		0				1				0				1			0			0
    4	5			0			S			0			0			1			0		0		1	0	...	0			0			1			0		1				0				0				0			0			1
    

    3.3.8 Processing the Embarkation Port

    #process the embarkation port
    >>>embarkeddf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Embarked
    >>>embarkeddf1 = pd.get_dummies(train_data['Embarked'], prefix='Embarked')
    >>>train_data = pd.concat([train_data, embarkeddf1], axis=1)
    >>>train_data.drop('Embarked',axis=1, inplace=True)
    >>>train_data.head()
    >	
    	PassengerId	Survived	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	...	Age_old	Family_Single	Family_Small	Family_Large	Fare_First	Fare_Second	Fare_Third	Embarked_C	Embarked_Q	Embarked_S
    0	1			0			0			0			1			0		0		1	0	0		...	0		0				1				0				0			0			1			0			0			1
    1	2			1			1			0			0			0		0		0	1	0		...	0		0				1				0				1			0			0			1			0			0	
    2	3			1			0			0			1			0		1		0	0	0		...	0		1				0				0				0			0			1			0			0			1
    3	4			1			1			0			0			0		0		0	1	0		...	0		0				1				0				1			0			0			0			0			1
    4	5			0			0			0			1			0		0		1	0	0		...	0		1				0				0				0			0			1			0			0			1
    

    With that, all of the training-set features have been processed.

    3.4 Processing the Test Set

    Process the test set with the same operations as above; for prediction to work, the test set must end up with the same features as the training set.

    First, count the missing values:

    >>>test_data.isnull().sum()
    >PassengerId      0
     Pclass           0
     Name             0
     Sex              0
     Age             86
     SibSp            0
     Parch            0
     Ticket           0
     Fare             1
     Cabin          327
     Embarked         0
     dtype: int64
    
    #handle missing values in the test set
    >>>test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))
    >>>test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
    >>>test_data.drop(columns = 'Cabin', axis=1,inplace=True)
    >>>test_data.isnull().sum()
    >PassengerId    0
     Pclass         0
     Name           0
     Sex            0
     Age            0
     SibSp          0
     Parch          0
     Ticket         0
     Fare           0
     Embarked       0
     dtype: int64
    
    #process age
    >>>agedf2 = pd.DataFrame(test_data['Age'])
    >>>agedf2['Age_baby'] = agedf2['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    >>>agedf2['Age_youth'] = agedf2['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    >>>agedf2['Age_middle'] = agedf2['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    >>>agedf2['Age_old'] = agedf2['Age'].map(lambda s: 1 if 60 <= s else 0)
    >>>test_data = pd.concat([test_data, agedf2], axis=1)
    >>>test_data.drop('Age',axis=1, inplace=True)
    #process sex
    >>>test_data['Sex'] = test_data['Sex'].map(sex_mapdict)
    >>>sexdf2 = pd.DataFrame()
    >>>sexdf2 = pd.get_dummies( test_data['Sex'], prefix='Sex')
    >>>test_data = pd.concat([test_data, sexdf2], axis=1)
    >>>test_data.drop('Sex',axis=1, inplace=True)
    #process the embarkation port
    >>>embarkeddf2 = pd.DataFrame()
    >>>embarkeddf2 = pd.get_dummies(test_data['Embarked'], prefix='Embarked')
    >>>test_data = pd.concat([test_data, embarkeddf2], axis=1)
    >>>test_data.drop('Embarked',axis=1, inplace=True)
    #process passenger class
    >>>pclassdf2 = pd.DataFrame()
    >>>pclassdf2 = pd.get_dummies( test_data['Pclass'] , prefix='Pclass' )
    >>>test_data = pd.concat([test_data, pclassdf2], axis=1)
    >>>test_data.drop('Pclass',axis=1, inplace=True)
    #store the extracted title feature
    >>>titledf2 = pd.DataFrame()
    >>>titledf2['Title'] = test_data['Name'].map(gettitle)
    >>>titledf2['Title'] = titledf2['Title'].map(title_mapdict)
    >>>titledf2 = pd.get_dummies(titledf2['Title'])
    >>>test_data = pd.concat([test_data, titledf2], axis=1)
    >>>test_data.drop('Name',axis=1, inplace=True)
    #store family information
    >>>familydf2 = pd.DataFrame()
    >>>familydf2['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1
    >>>familydf2['Family_Single'] = familydf2['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    >>>familydf2['Family_Small'] = familydf2['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    >>>familydf2['Family_Large'] = familydf2['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    >>>familydf2.drop('FamilySize',axis=1, inplace=True)
    >>>test_data = pd.concat([test_data, familydf2], axis=1)
    >>>test_data.drop('Parch',axis=1, inplace=True)
    >>>test_data.drop('SibSp',axis=1, inplace=True)
    #process the fare (the ticket column is dropped as well)
    >>>faredf2=pd.DataFrame(test_data['Fare'])
    >>>faredf2['Fare_First']=faredf2['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    >>>faredf2['Fare_Second']=faredf2['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    >>>faredf2['Fare_Third']=faredf2['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    >>>test_data = pd.concat([test_data, faredf2], axis=1)
    >>>test_data.drop('Fare',axis=1, inplace=True)
    >>>test_data.drop('Ticket',axis=1, inplace=True)
    >>>test_data.head()
    >	
    	PassengerId	Age_baby	Age_youth	Age_middle	Age_old	Sex_0	Sex_1	Embarked_C	Embarked_Q	Embarked_S	...	Mr	Mrs	Officer	Royalty	Family_Single	Family_Small	Family_Large	Fare_First	Fare_Second	Fare_Third
    0	892			0			1			0			0		0		1		0			1			0			...	1	0	0		0		1				0				0				0			0			1
    1	893			0			0			1			0		1		0		0			0			1			...	0	1	0		0		0				1				0				0			0			1
    2	894			0			0			0			1		0		1		0			1			0			...	1	0	0		0		1				0				0				0			0			1
    3	895			0			1			0			0		0		1		0			0			1			...	1	0	0		0		1				0				0				0			0			1
    4	896			0			1			0			0		1		0		0			0			1			...	0	1	0		0		0				1				0				0			1			0
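
    One detail worth guarding against before prediction: after all this processing the test-set columns do not end up in the same order as the training features (and Survived exists only in the training set). Recent versions of scikit-learn will warn or refuse when the feature names seen at prediction time do not match those seen at fit time, so a small defensive step (a suggested sketch, not part of the original code) is:

    # Align the test-set columns with the training features; any one-hot column
    # that never occurs in the test set would be filled with 0.
    feature_cols = train_data.drop('Survived', axis=1).columns
    test_data = test_data.reindex(columns=feature_cols, fill_value=0)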
    

    3.5 Examining Correlations

    >>>corrdf = train_data.corr()
    >>>corrdf
    >(output: the 26 x 26 correlation matrix of every encoded feature, from PassengerId and Survived through Embarked_S; it is too wide to reproduce here, and the correlation of each feature with Survived is listed below)
    
    >>>corrdf['Survived'].round(4).abs().sort_values(ascending=False)
    >Survived         1.0000
     Mr               0.5492
     Sex_1            0.5434
     Sex_0            0.5434
     Mrs              0.3449
     Miss             0.3328
     Pclass_3         0.3223
     Pclass_1         0.2859
     Family_Small     0.2799
     Fare_Third       0.2703
     Fare_First       0.2543
     Family_Single    0.2034
     Embarked_C       0.1682
     Embarked_S       0.1497
     Family_Large     0.1251
     Age_baby         0.1230
     Pclass_2         0.0933
     Age_youth        0.0912
     Master           0.0852
     Fare_Second      0.0662
     Age_old          0.0409
     Age_middle       0.0392
     Royalty          0.0334
     Officer          0.0313
     PassengerId      0.0050
     Embarked_Q       0.0037
     Name: Survived, dtype: float64
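
    The full matrix is much easier to scan as a heatmap; a quick optional sketch with seaborn (the output path is just an example in the same folder as the earlier figures):

    # Optional: visualize the correlation matrix as a heatmap
    plt.figure(figsize=(16, 12))
    sns.heatmap(corrdf, cmap='coolwarm', center=0)
    plt.savefig(r"D:\data\python\exercise\test2\corr_heatmap.png")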
    

    3.6 Splitting into Training and Validation Sets

    >>>from sklearn.model_selection import train_test_split
    >>>source_y = train_data.Survived
    >>>source_x = train_data.drop(['Survived'],axis=1)
    >>>train_x, test_x, train_y, test_y = train_test_split(source_x, source_y , train_size=0.8,test_size=0.2)
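
    Two variants worth considering at this step, shown only as a sketch: passing random_state makes the split, and therefore the accuracy figures in section 5, reproducible; and dropping PassengerId removes a feature whose correlation with Survived is only 0.005 (see the list above) and which, ranging from 1 to 891 while every other feature is 0 or 1, dominates the distance calculations used by KNN and the RBF kernel of SVM.

    # Sketch of a reproducible split without the PassengerId pseudo-feature.
    # If you adopt this, drop PassengerId from the test features as well before
    # predicting, and keep it separately for the submission file.
    source_x = train_data.drop(['Survived', 'PassengerId'], axis=1)
    train_x, test_x, train_y, test_y = train_test_split(
        source_x, source_y, train_size=0.8, test_size=0.2,
        random_state=42, stratify=source_y)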
    

    4. Training the Models

    4.1 Logistic Regression

    >>>model_lr = LogisticRegression()
    >>>model_lr.fit(train_x, train_y)
    

    4.2 Random Forest

    >>>model_rfc = RandomForestClassifier()
    >>>model_rfc.fit(train_x, train_y)
    

    4.3 Support Vector Machine

    >>>model_svm = SVC()
    >>>model_svm.fit(train_x, train_y)
    

    4.4 K-Nearest Neighbors

    >>>model_knn = KNeighborsClassifier()
    >>>model_knn.fit(train_x, train_y)
    

    4.5 Decision Tree

    >>>model_dtree = DecisionTreeClassifier()
    >>>model_dtree.fit(train_x, train_y)
    

    5. Evaluating the Models

    5.1 Logistic Regression

    >>>accuracy_lr = model_lr.score(test_x,test_y)
    >>>print("Logistic regression test accuracy:", accuracy_lr)
    >Logistic regression test accuracy: 0.7988826815642458
    

    5.2 Random Forest

    >>>accuracy_rfc = model_rfc.score(test_x,test_y)
    >>>print("Random forest test accuracy:", accuracy_rfc)
    >Random forest test accuracy: 0.770949720670391
    

    5.3 Support Vector Machine

    >>>accuracy_svm = model_svm.score(test_x,test_y)
    >>>print("SVM test accuracy:", accuracy_svm)
    >SVM test accuracy: 0.5698324022346368
    

    5.4 K-Nearest Neighbors

    >>>accuracy_knn = model_knn.score(test_x,test_y)
    >>>print("KNN test accuracy:", accuracy_knn)
    >KNN test accuracy: 0.553072625698324
    

    5.5 Decision Tree

    >>>accuracy_dtree= model_dtree.score(test_x,test_y)
    >>>print("Decision tree test accuracy:", accuracy_dtree)
    >Decision tree test accuracy: 0.7541899441340782
    

    5.6 Comparing the 5 Models

    >>>import seaborn as sns
    >>>import matplotlib.pyplot as plt
    >>>sns.set(rc={'figure.figsize':(15,6)}) 
    >>>accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn, accuracy_dtree]
    >>>models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree']
    >>>bar = sns.barplot(x=models, y=accuracys)
    # show the value labels above the bars
    >>>for x, y in enumerate(accuracys):
    >>>    plt.text(x, y, '%s'% round(y,3), ha='center')
    >>>plt.xlabel("Model")
    >>>plt.ylabel("Accuracy")
    >>>plt.savefig(r"D:\data\python\exercise\test2\3.png")
    >>>plt.show()
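
    A single 80/20 split gives fairly noisy accuracy figures: rerunning the notebook can change the ordering of the bars. A more stable comparison, included here as an optional sketch, is k-fold cross-validation over the full training data:

    # Optional sketch: compare the five classifiers with 5-fold cross-validation
    from sklearn.model_selection import cross_val_score

    cv_models = {'Logistic': LogisticRegression(max_iter=1000),
                 'RF': RandomForestClassifier(),
                 'SVM': SVC(),
                 'KNN': KNeighborsClassifier(),
                 'Dtree': DecisionTreeClassifier()}
    for name, model in cv_models.items():
        scores = cross_val_score(model, source_x, source_y, cv=5)
        print(name, round(scores.mean(), 3))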
    

    6. Making Predictions

    6.1 Logistic Regression

    #the prediction input is the fully processed test set
    >>>pred_x = test_data
    >>>pred_lr = model_lr.predict(pred_x)
    >>>pred_lr = pred_lr.astype(int)
    >>>passenger_id = test_data.iloc[:, 0]
    #logistic regression predictions
    >>>preddf1 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_lr})
    >>>preddf1.to_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv', index=False)
    

    6.2 Random Forest

    >>>pred_rfc = model_rfc.predict(pred_x)
    >>>pred_rfc = pred_rfc.astype(int)
    #random forest predictions
    >>>preddf2 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_rfc})
    >>>preddf2.to_csv(r'D:\data\python\taitanic\titanic_pred_model_rfc.csv', index=False)
    

    6.3 Support Vector Machine

    >>>pred_svm = model_svm.predict(pred_x)
    >>>pred_svm = pred_svm.astype(int)
    #SVM predictions
    >>>preddf3 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_svm})
    >>>preddf3.to_csv(r'D:\data\python\taitanic\titanic_pred_model_svm.csv', index=False)
    

    6.4 K-Nearest Neighbors

    >>>pred_knn = model_knn.predict(pred_x)
    >>>pred_knn = pred_knn.astype(int)
    #KNN predictions
    >>>preddf4 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_knn})
    >>>preddf4.to_csv(r'D:\data\python\taitanic\titanic_pred_model_knn.csv', index=False)
    

    6.5 Decision Tree

    >>>pred_dtree= model_dtree.predict(pred_x)
    >>>pred_dtree = pred_dtree.astype(int)
    #decision tree predictions
    >>>preddf5 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_dtree})
    >>>preddf5.to_csv(r'D:\data\python\taitanic\titanic_pred_model_dtree.csv', index=False)
    

    6.6 Submitting the Results to Kaggle for Scoring

    The scores returned by Kaggle are as follows:

    As you can see, logistic regression again has the highest score.
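
    Before uploading, it is worth a quick check that each CSV has the 418 rows and the two columns Kaggle expects (a small sanity-check sketch):

    # Sanity-check one submission file before uploading
    sub = pd.read_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv')
    print(sub.shape)             # expected: (418, 2)
    print(sub.columns.tolist())  # expected: ['PassengerId', 'Survived']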

    7. Complete Code

    It is recommended to run everything below in a Jupyter notebook.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns 
    import random
    
    from sklearn.linear_model import LogisticRegression # logistic regression
    from sklearn.ensemble import RandomForestClassifier # random forest
    from sklearn.svm import SVC # support vector machine
    from sklearn.neighbors import KNeighborsClassifier # K-nearest neighbors
    from sklearn.tree import DecisionTreeClassifier # decision tree
    
    from sklearn.model_selection import train_test_split
    #training set
    train_data = pd.read_csv(r"D:\data\python\taitanic\train.csv")
    #test set
    test_data = pd.read_csv(r"D:\data\python\taitanic\test.csv")
    print('Training set:', train_data.shape, 'Test set:', test_data.shape)
    
    train_data.isnull().sum()
    
    plt.figure(figsize=(10,5),dpi=100)
    train_data['Embarked'].value_counts().plot(kind='bar')
    plt.savefig(r"D:\data\python\exercise\test2\1.png")
    
    #fill the blanks with the most frequent port
    train_data['Embarked'] = train_data['Embarked'].fillna('S')
    train_data.isnull().sum()
    
    #fill the missing ages (Age) with the mean
    train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
    train_data.isnull().sum()
    
    #all missing values except Cabin are now filled; Cabin has too many missing values to fill sensibly, so drop the column
    train_data.drop(columns = 'Cabin', axis=1,inplace=True)
    
    train_data.head()
    
    #process passenger class
    pclassdf1 = pd.DataFrame()
    #one-hot encode with get_dummies, column prefix Pclass
    pclassdf1 = pd.get_dummies( train_data['Pclass'] , prefix='Pclass' )
    train_data = pd.concat([train_data, pclassdf1], axis=1)
    train_data.drop('Pclass',axis=1, inplace=True)
    train_data.head()
    
    def gettitle(name):
        str1 = name.split(',')[1] #Mr. Owen Harris
        str2 = str1.split('.')[0]#Mr
        str3 = str2.strip()
        return str3
    #store the extracted title feature
    titledf1 = pd.DataFrame()
    #map: apply a function to every element of the Series
    titledf1['Title'] = train_data['Name'].map(gettitle)
    #check the title categories
    titledf1['Title'].value_counts()
    
    #mapping from the title string in the name to a title category
    title_mapdict = {
                        "Capt":       "Officer",
                        "Col":        "Officer",
                        "Major":      "Officer",
                        "Jonkheer":   "Royalty",
                        "Don":        "Royalty",
                        "Sir":        "Royalty",
                        "Dr":         "Officer",
                        "Rev":        "Officer",
                        "the Countess":"Royalty",
                        "Dona":       "Royalty",
                        "Mme":        "Mrs",
                        "Mlle":       "Miss",
                        "Ms":         "Mrs",
                        "Mr":         "Mr",
                        "Mrs":        "Mrs",
                        "Miss":       "Miss",
                        "Master":     "Master",
                        "Lady":       "Royalty"
                        }
    #map: apply the mapping to every element of the Series
    titledf1['Title'] = titledf1['Title'].map(title_mapdict)
    #one-hot encode with get_dummies
    titledf1 = pd.get_dummies(titledf1['Title'])
    train_data = pd.concat([train_data, titledf1], axis=1)
    train_data.drop('Name',axis=1, inplace=True)
    train_data.head()
    
    #process sex
    sex_mapdict = {'male': 1, 'female': 0}
    #map: apply the mapping to every element of the Series
    train_data['Sex'] = train_data['Sex'].map(sex_mapdict)
    sexdf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Sex
    sexdf1 = pd.get_dummies( train_data['Sex'], prefix='Sex')
    train_data = pd.concat([train_data, sexdf1], axis=1)
    train_data.drop('Sex',axis=1, inplace=True)
    train_data.head()
    
    plt.figure(figsize=(18,4))
    train_data['Age']=train_data['Age'].astype(int)
    average_age=train_data[['Age','Survived']].groupby('Age',as_index=False).mean()
    sns.barplot(x='Age',y='Survived',data=average_age,palette='BuPu')
    plt.savefig(r"D:\data\python\exercise\test2\2.png")
    
    agedf1 = pd.DataFrame(train_data['Age'])
    agedf1['Age_baby'] = agedf1['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    agedf1['Age_youth'] = agedf1['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    agedf1['Age_middle'] = agedf1['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    agedf1['Age_old'] = agedf1['Age'].map(lambda s: 1 if 60 <= s else 0)
    train_data = pd.concat([train_data,agedf1], axis=1)
    train_data.drop('Age',axis=1, inplace=True)
    train_data.head()
    
    #store family information
    familydf1 = pd.DataFrame()
    
    #family size
    familydf1['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1
    
    #family category
    #the expression before `if` is returned when the condition holds, otherwise 0
    familydf1['Family_Single'] = familydf1['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    
    familydf1['Family_Small'] = familydf1['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    
    familydf1['Family_Large'] = familydf1['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    familydf1.drop('FamilySize',axis=1, inplace=True)
    train_data = pd.concat([train_data, familydf1], axis=1)
    train_data.drop('Parch',axis=1, inplace=True)
    train_data.drop('SibSp',axis=1, inplace=True)
    train_data.head()
    
    #the ticket number is of no use here, so drop it
    train_data.drop('Ticket',axis=1, inplace=True)
    
    faredf1=pd.DataFrame(train_data['Fare'])
    faredf1['Fare_First']=faredf1['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    faredf1['Fare_Second']=faredf1['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    faredf1['Fare_Third']=faredf1['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    train_data = pd.concat([train_data,faredf1], axis=1)
    train_data.drop('Fare',axis=1, inplace=True)
    train_data.head()
    
    #process the embarkation port
    embarkeddf1 = pd.DataFrame()
    #one-hot encode with get_dummies (dummy variables), column prefix Embarked
    embarkeddf1 = pd.get_dummies(train_data['Embarked'], prefix='Embarked')
    train_data = pd.concat([train_data, embarkeddf1], axis=1)
    train_data.drop('Embarked',axis=1, inplace=True)
    train_data.head()
    
    test_data.isnull().sum()
    
    #handle missing values in the test set
    test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))
    test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
    test_data.drop(columns = 'Cabin', axis=1,inplace=True)
    test_data.isnull().sum()
    
    agedf2 = pd.DataFrame(test_data['Age'])
    agedf2['Age_baby'] = agedf2['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
    agedf2['Age_youth'] = agedf2['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
    agedf2['Age_middle'] = agedf2['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
    agedf2['Age_old'] = agedf2['Age'].map(lambda s: 1 if 60 <= s else 0)
    test_data = pd.concat([test_data, agedf2], axis=1)
    test_data.drop('Age',axis=1, inplace=True)
    #process sex
    test_data['Sex'] = test_data['Sex'].map(sex_mapdict)
    sexdf2 = pd.DataFrame()
    sexdf2 = pd.get_dummies( test_data['Sex'], prefix='Sex')
    test_data = pd.concat([test_data, sexdf2], axis=1)
    test_data.drop('Sex',axis=1, inplace=True)
    #process the embarkation port
    embarkeddf2 = pd.DataFrame()
    embarkeddf2 = pd.get_dummies(test_data['Embarked'], prefix='Embarked')
    test_data = pd.concat([test_data, embarkeddf2], axis=1)
    test_data.drop('Embarked',axis=1, inplace=True)
    #process passenger class
    pclassdf2 = pd.DataFrame()
    pclassdf2 = pd.get_dummies( test_data['Pclass'] , prefix='Pclass' )
    test_data = pd.concat([test_data, pclassdf2], axis=1)
    test_data.drop('Pclass',axis=1, inplace=True)
    #store the extracted title feature
    titledf2 = pd.DataFrame()
    titledf2['Title'] = test_data['Name'].map(gettitle)
    titledf2['Title'] = titledf2['Title'].map(title_mapdict)
    titledf2 = pd.get_dummies(titledf2['Title'])
    test_data = pd.concat([test_data, titledf2], axis=1)
    test_data.drop('Name',axis=1, inplace=True)
    #store family information
    familydf2 = pd.DataFrame()
    familydf2['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1
    familydf2['Family_Single'] = familydf2['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    familydf2['Family_Small'] = familydf2['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    familydf2['Family_Large'] = familydf2['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    familydf2.drop('FamilySize',axis=1, inplace=True)
    test_data = pd.concat([test_data, familydf2], axis=1)
    test_data.drop('Parch',axis=1, inplace=True)
    test_data.drop('SibSp',axis=1, inplace=True)
    #process the fare (the ticket column is dropped as well)
    faredf2=pd.DataFrame(test_data['Fare'])
    faredf2['Fare_First']=faredf2['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
    faredf2['Fare_Second']=faredf2['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
    faredf2['Fare_Third']=faredf2['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
    test_data = pd.concat([test_data, faredf2], axis=1)
    test_data.drop('Fare',axis=1, inplace=True)
    test_data.drop('Ticket',axis=1, inplace=True)
    test_data.head()
    
    corrdf = train_data.corr()
    corrdf
    
    corrdf['Survived'].round(4).abs().sort_values(ascending=False)
    
    source_y = train_data.Survived
    source_x = train_data.drop(['Survived'],axis=1)
    train_x, test_x, train_y, test_y = train_test_split(source_x, 
               source_y , train_size=0.8,test_size=0.2)
    
    pred_x=test_data
    
    model_lr = LogisticRegression()
    model_lr.fit(train_x, train_y)
    pred_lr = model_lr.predict(pred_x)
    pred_lr = pred_lr.astype(int)
    accuracy_lr = model_lr.score(test_x,test_y)
    print("Logistic regression test accuracy:", accuracy_lr)
    
    model_rfc = RandomForestClassifier()
    model_rfc.fit(train_x, train_y)
    pred_rfc = model_rfc.predict(pred_x)
    pred_rfc = pred_rfc.astype(int)
    accuracy_rfc = model_rfc.score(test_x,test_y)
    print("Random forest test accuracy:", accuracy_rfc)
    
    model_svm = SVC()
    model_svm.fit(train_x, train_y)
    pred_svm = model_svm.predict(pred_x)
    pred_svm = pred_svm.astype(int)
    accuracy_svm = model_svm.score(test_x,test_y)
    print("SVM test accuracy:", accuracy_svm)
    
    model_knn = KNeighborsClassifier()
    model_knn.fit(train_x, train_y)
    pred_knn = model_knn.predict(pred_x)
    pred_knn = pred_knn.astype(int)
    accuracy_knn = model_knn.score(test_x,test_y)
    print("KNN test accuracy:", accuracy_knn)
    
    model_dtree = DecisionTreeClassifier()
    model_dtree.fit(train_x, train_y)
    pred_dtree= model_dtree.predict(pred_x)
    pred_dtree = pred_dtree.astype(int)
    accuracy_dtree= model_dtree.score(test_x,test_y)
    print("Decision tree test accuracy:", accuracy_dtree)
    
    import seaborn as sns
    import matplotlib.pyplot as plt
     
    sns.set(rc={'figure.figsize':(15,6)}) # set the figure size
    accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn, accuracy_dtree]
    models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree']
    bar = sns.barplot(x=models, y=accuracys)
     
    # show the value labels above the bars
    for x, y in enumerate(accuracys):
        plt.text(x, y, '%s'% round(y,3), ha='center')
    
    plt.xlabel("Model")
    plt.ylabel("Accuracy")
    plt.savefig(r"D:\data\python\exercise\test2\3.png")
    plt.show()
    
    #DataFrames: passenger ID plus the predicted survival value
    passenger_id = test_data.iloc[:, 0]
    #logistic regression predictions
    preddf1 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_lr})
    preddf1.to_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv', index=False)
    #random forest predictions
    preddf2 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_rfc})
    preddf2.to_csv(r'D:\data\python\taitanic\titanic_pred_model_rfc.csv', index=False)
    #SVM predictions
    preddf3 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_svm})
    preddf3.to_csv(r'D:\data\python\taitanic\titanic_pred_model_svm.csv', index=False)
    #KNN predictions
    preddf4 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_knn})
    preddf4.to_csv(r'D:\data\python\taitanic\titanic_pred_model_knn.csv', index=False)
    #decision tree predictions
    preddf5 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_dtree})
    preddf5.to_csv(r'D:\data\python\taitanic\titanic_pred_model_dtree.csv', index=False)
    

    8. Summary

    That concludes the survival prediction analysis for the Titanic.
    If you found this article useful, please consider liking and bookmarking it.
    Corrections, comments, and your own thoughts are very welcome in the comment section.
