Kaggle Titanic Survival Model: A 250-Feature Ensemble, Top 8%



This article draws on project code shared by other users in Kaggle's Kernels section.






1. Data Overview

The Titanic survival-prediction competition provides two datasets, train.csv and test.csv: the training set and the test set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train_data = pd.read_csv('I://model/titanic/train.csv')
test_data = pd.read_csv('I://model/titanic/test.csv')
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Overall survival proportions:

train_data['Survived'].value_counts().plot.pie(autopct='%1.1f%%')



[Figure: pie chart of the overall survival rate]

2. Exploring Relationships in the Data

(1) Sex and survival

train_data.groupby(['Sex','Survived'])['Survived'].count()

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()



[Figure: survival rate by sex]

(2) Passenger class and survival

train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar(color=['r','g','b'])



[Figure: survival rate by passenger class]

train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()



[Figure: survival rate by sex within each passenger class]

train_data.groupby(['Sex','Pclass','Survived'])['Survived'].count()

Sex     Pclass  Survived
female  1       0             3
                1            91
        2       0             6
                1            70
        3       0            72
                1            72
male    1       0            77
                1            45
        2       0            91
                1            17
        3       0           300
                1            47
Name: Survived, dtype: int64

The figures and the table above make it clear that, although the evacuation broadly followed "women and children first", it differed by class, and some first-class men used their social standing to force their way into the lifeboats. White Star Line chairman J. Bruce Ismay, for instance (who had vetoed the proposal to carry 48 lifeboats, reckoning fewer would do), abandoned his passengers, his crew, and his ship, jumping at the last moment into collapsible lifeboat C (which carried 39 passengers).

(3) Age and survival

f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train_data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x='Sex', y='Age', hue='Survived', data=train_data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()



[Figure: violin plots of Age vs. Survived, by Pclass (left) and by Sex (right)]

(4) Title and survival

The Name field contains each passenger's title, such as Mr, Miss, or Mrs. Titles encode age and sex, and some of them (Dr, Lady, Major, Master, and so on) may also encode social status.
This is awkward to show in a chart, but we will add it as a feature during feature engineering.
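
A quick peek at the title distribution is still possible with a one-liner (a sketch; the regex mirrors the extraction used later in the feature-engineering section):

train_data['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip().value_counts()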

(5) Port of embarkation and survival

The Titanic sailed from Southampton, England, calling at Cherbourg, France and Queenstown, Ireland; some of those who disembarked at Cherbourg or Queenstown escaped the disaster.

sns.countplot(x='Embarked', hue='Survived', data=train_data)
plt.title('Embarked and Survived')



[Figure: survival counts by port of embarkation]

(6) Relatives aboard and survival

f,ax=plt.subplots(1,2,figsize=(18,8))
train_data[['Parch','Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train_data[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')



[Figure: survival rate by Parch (left) and by SibSp (right)]

As the charts show, traveling alone meant a low survival rate; but with too many relatives aboard it was hard to look after everyone, which was also dangerous.
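
The same effect can be read off numerically by collapsing the two columns into a temporary family-size count (a quick sketch; the same idea becomes a proper feature in Section 3):

train_data.assign(Family_Size=train_data['Parch'] + train_data['SibSp'] + 1).groupby('Family_Size')['Survived'].mean()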

(7) Other factors

The remaining fields are fare, cabin number, and ticket number. All three may affect where a passenger was located on the ship, and hence the order of escape, but none shows an obvious pattern against survival on its own, so in the later ensembling stage we let the models decide how important they are.
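
One way to eyeball such a factor is the survival rate by fare quartile (a sketch using pd.qcut for equal-frequency bins); any signal here overlaps heavily with Pclass, which is partly why the models are left to judge its importance:

train_data.groupby(pd.qcut(train_data['Fare'], 4))['Survived'].mean()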

3. Feature Engineering

First, merge train and test so that feature engineering is applied to both at once:

    from sklearn import ensemble, model_selection
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import LabelEncoder

    train_data_org = pd.read_csv('train.csv')
    test_data_org = pd.read_csv('test.csv')
    test_data_org['Survived'] = 0
    combined_train_test = train_data_org.append(test_data_org)  # in pandas >= 2.0, use pd.concat([train_data_org, test_data_org])

Feature engineering means extracting, from the raw fields, the features likely to affect the outcome, which become the model's inputs. It is usually best to start with the fields that contain missing values (NaN); a quick overview is sketched below.
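
A one-liner over the merged frame shows which fields need attention:

    null_counts = combined_train_test.isnull().sum()
    print(null_counts[null_counts > 0])  # Age, Cabin, Embarked and Fare contain NaN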

(1) Embarked

Fill the missing values first, using the mode of Embarked:

    if combined_train_test['Embarked'].isnull().sum() != 0:
        combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0], inplace=True)

Then one-hot encode the three embarkation ports into three columns, each containing only 0s and 1s:

    emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix=combined_train_test[['Embarked']].columns[0])
    combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)

(2) Sex

No missing values; one-hot encode directly:

    sex_dummies_df = pd.get_dummies(combined_train_test['Sex'], prefix=combined_train_test[['Sex']].columns[0])
    combined_train_test = pd.concat([combined_train_test, sex_dummies_df], axis=1)

(3) Name

Extract the title from the name:

    combined_train_test['Title'] = combined_train_test['Name'].str.extract(r'.+,(.+)', expand=False).str.extract(r'^(.+?)\.', expand=False).str.strip()

Normalize the various titles:

    title_Dict = {}
    title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
    title_Dict.update(dict.fromkeys(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
    title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
    title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
    title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
    title_Dict.update(dict.fromkeys(['Master'], 'Master'))

    combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)

One-hot encode:

    title_dummies_df = pd.get_dummies(combined_train_test['Title'], prefix=combined_train_test[['Title']].columns[0])
    combined_train_test = pd.concat([combined_train_test, title_dummies_df], axis=1)

(4) Fare

Fill the NaN with the mean fare of the corresponding class:

    if combined_train_test['Fare'].isnull().sum() != 0:
        combined_train_test['Fare'] = combined_train_test['Fare'].fillna(
            combined_train_test.groupby('Pclass')['Fare'].transform('mean'))

The Titanic sold family/group tickets (you can tell by analyzing the Ticket numbers), so group fares need to be split per person:

    combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')
    combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test['Group_Ticket']
    combined_train_test.drop(['Group_Ticket'], axis=1, inplace=True)

Bin the fares into categories:

    def fare_category(fare):
        if fare <= 4:
            return 0
        elif fare <= 10:
            return 1
        elif fare <= 30:
            return 2
        elif fare <= 45:
            return 3
        else:
            return 4

    combined_train_test['Fare_Category'] = combined_train_test['Fare'].map(fare_category)

One-hot encode (optional for this feature):

    fare_cat_dummies_df = pd.get_dummies(combined_train_test['Fare_Category'], prefix=combined_train_test[['Fare_Category']].columns[0])
    combined_train_test = pd.concat([combined_train_test, fare_cat_dummies_df], axis=1)

(5) Pclass

Pclass itself needs no processing; but to get more out of it, we assume that within each class the fare also correlates with the way of escape, splitting passengers into high-fare first class, low-fare first class, and so on.
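
The post never shows pclass_fare_category itself; here is a minimal sketch, assuming each class is split into High/Low at its own mean fare (a hypothetical reconstruction; the label strings match the ones fed to LabelEncoder below):

    def pclass_fare_category(df, pclass_1_mean_fare, pclass_2_mean_fare, pclass_3_mean_fare):
        # hypothetical reconstruction: split each class at its mean fare
        if df['Pclass'] == 1:
            return 'Pclass_1_Low_Fare' if df['Fare'] <= pclass_1_mean_fare else 'Pclass_1_High_Fare'
        elif df['Pclass'] == 2:
            return 'Pclass_2_Low_Fare' if df['Fare'] <= pclass_2_mean_fare else 'Pclass_2_High_Fare'
        else:
            return 'Pclass_3_Low_Fare' if df['Fare'] <= pclass_3_mean_fare else 'Pclass_3_High_Fare'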

    Pclass_1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
    Pclass_2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
    Pclass_3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]
    # build the Pclass_Fare_Category feature
    combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(Pclass_1_mean_fare, Pclass_2_mean_fare, Pclass_3_mean_fare), axis=1)
    p_fare = LabelEncoder()
    p_fare.fit(np.array(['Pclass_1_Low_Fare', 'Pclass_1_High_Fare', 'Pclass_2_Low_Fare', 'Pclass_2_High_Fare', 'Pclass_3_Low_Fare', 'Pclass_3_High_Fare']))  # fit the label set
    combined_train_test['Pclass_Fare_Category'] = p_fare.transform(combined_train_test['Pclass_Fare_Category'])  # convert labels to integer codes

(6) Parch and SibSp

Both of these visibly affect Survived, but not in quite the same way, so we merge them into a Family_Size feature while also keeping the original columns.
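
The helper family_size_category is also undefined in the post; a minimal sketch, with the size thresholds as an assumption (the labels match the LabelEncoder call below):

    def family_size_category(family_size):
        # hypothetical reconstruction: thresholds chosen for illustration
        if family_size <= 1:
            return 'Single'
        elif family_size <= 4:
            return 'Small_Family'
        else:
            return 'Large_Family'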

    combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
    combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)
    le_family = LabelEncoder()
    le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
    combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])
    fam_size_cat_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
                                             prefix=combined_train_test[['Family_Size_Category']].columns[0])
    combined_train_test = pd.concat([combined_train_test, fam_size_cat_dummies_df], axis=1)

(7) Age

Age has too many missing values to simply fill with the mode or the mean. Two approaches are common: (a) fill with the average age for each Title (Mr, Master, Miss, ...), or for each combination of several fields (Sex, Title, Pclass); (b) predict Age from the other features with a machine-learning model. This post uses the second.
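
For reference, the first approach can be a one-liner (a sketch only; it is not run here, since the model-based fill below would make it redundant):

    # approach (a): fill Age with the median age of passengers sharing
    # the same Sex, Title and Pclass (not executed in this pipeline)
    combined_train_test['Age'] = combined_train_test['Age'].fillna(
        combined_train_test.groupby(['Sex', 'Title', 'Pclass'])['Age'].transform('median'))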
For the model-based approach, the rows where Age is present form the training set, and the rows where it is missing form the test set:

    missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Parch', 'Sex', 'SibSp', 'Family_Size', 'Family_Size_Category',
                                                       'Title', 'Fare', 'Fare_Category', 'Pclass', 'Embarked']])
    missing_age_df = pd.get_dummies(missing_age_df, columns=['Title', 'Family_Size_Category', 'Fare_Category', 'Sex', 'Pclass', 'Embarked'])
    missing_age_train = missing_age_df[missing_age_df['Age'].notnull()].copy()
    # .copy() so that adding prediction columns later does not trigger chained-assignment warnings
    missing_age_test = missing_age_df[missing_age_df['Age'].isnull()].copy()

Build the ensemble that predicts Age:
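
The helper drop_col_not_req called inside fill_missing_age is never defined in the post; a minimal sketch:

    def drop_col_not_req(df, cols):
        # hypothetical reconstruction: drop the listed columns in place
        df.drop(cols, axis=1, inplace=True)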

    def fill_missing_age(missing_age_train, missing_age_test):
        missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
        missing_age_Y_train = missing_age_train['Age']
        missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
        # model 1: gradient-boosted regressor
        gbm_reg = ensemble.GradientBoostingRegressor(random_state=42)
        gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [3], 'learning_rate': [0.01], 'max_features': [3]}
        gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
        gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
        print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
        print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
        print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
        missing_age_test['Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
        print(missing_age_test['Age_GB'][:4])
        # model 2: linear regression
        lrf_reg = LinearRegression()
        # note: the 'normalize' option was removed from LinearRegression in scikit-learn 1.2
        lrf_reg_param_grid = {'fit_intercept': [True], 'normalize': [True]}
        lrf_reg_grid = model_selection.GridSearchCV(lrf_reg, lrf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
        lrf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
        print('Age feature Best LR Params:' + str(lrf_reg_grid.best_params_))
        print('Age feature Best LR Score:' + str(lrf_reg_grid.best_score_))
        print('LR Train Error for "Age" Feature Regressor' + str(lrf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
        missing_age_test['Age_LRF'] = lrf_reg_grid.predict(missing_age_X_test)
        print(missing_age_test['Age_LRF'][:4])
        # take the element-wise mean of the two predictions as the final fill value
        # (np.mean without axis=0 would collapse everything into one scalar)
        missing_age_test['Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_LRF']], axis=0)
        print(missing_age_test['Age'][:4])
        drop_col_not_req(missing_age_test, ['Age_GB', 'Age_LRF'])

        return missing_age_test

Fill in Age:

    # .values avoids index-alignment problems (train and test share index labels after append)
    combined_train_test.loc[combined_train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age'].values

(8) Ticket

Split Ticket into its letter prefix and its numeric part, Ticket_Letter and Ticket_Number:

    combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
    combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: np.nan if x.isnumeric() else x)
    combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
    combined_train_test['Ticket_Number'].fillna(0, inplace=True)
    combined_train_test = pd.get_dummies(combined_train_test, columns=['Ticket', 'Ticket_Letter'])

(9) Cabin

Cabin is missing for most passengers, so the most we can extract is coarse information: whether a Cabin is recorded, and its first letter:

    combined_train_test['Cabin_Letter'] = combined_train_test['Cabin'].apply(lambda x: str(x)[0] if pd.notnull(x) else x)
    combined_train_test = pd.get_dummies(combined_train_test, columns=['Cabin', 'Cabin_Letter'])

Once done, split train and test apart again:

    train_data = combined_train_test[:891]
    test_data = combined_train_test[891:]
    titanic_train_data_X = train_data.drop(['Survived'], axis=1)
    titanic_train_data_Y = train_data['Survived']
    titanic_test_data_X = test_data.drop(['Survived'], axis=1)

4. Model Ensembling

Model ensembling happens in two stages, feature selection and then a voting model, carried out in the following steps:

(1) Use several models to screen for the more important features:

    def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):
        # Random Forest
        rf_est = ensemble.RandomForestClassifier(random_state=42)
        rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
        rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
        rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
        # sort features by importance
        feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
        features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
        print('Sample 25 Features from RF Classifier')
        print(str(features_top_n_rf[:25]))

        # AdaBoost
        ada_est = ensemble.AdaBoostClassifier(random_state=42)
        ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.5, 0.6]}
        ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
        ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
        # sort features by importance
        feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
        features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']

        # ExtraTrees
        et_est = ensemble.ExtraTreesClassifier(random_state=42)
        et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [15]}
        et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
        et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
        # sort features by importance
        feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
        features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
        print('Sample 25 Features from ET Classifier:')
        print(str(features_top_n_et[:25]))

        # merge the top features chosen by the three models and drop duplicates
        features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et], ignore_index=True).drop_duplicates()

        return features_top_n

(2) Build the training and test sets from the selected features:

    feature_to_pick = 250
    feature_top_n = get_top_n_features(titanic_train_data_X, titanic_train_data_Y, feature_to_pick)
    titanic_train_data_X = titanic_train_data_X[feature_top_n]
    del titanic_train_data_X['Ticket_Number']  # dropping Ticket_Number turned out to improve the score
    titanic_test_data_X = titanic_test_data_X[feature_top_n]
    del titanic_test_data_X['Ticket_Number']

(3) Build the final predictor with VotingClassifier:

    rf_est = ensemble.RandomForestClassifier(n_estimators=750, criterion='gini', max_features='sqrt',
                                             max_depth=3, min_samples_split=4, min_samples_leaf=2,
                                             n_jobs=50, random_state=42, verbose=1)
    gbm_est = ensemble.GradientBoostingClassifier(n_estimators=900, learning_rate=0.0008, loss='exponential',
                                                  min_samples_split=3, min_samples_leaf=2, max_features='sqrt',
                                                  max_depth=3, random_state=42, verbose=1)
    et_est = ensemble.ExtraTreesClassifier(n_estimators=750, max_features='sqrt', max_depth=35, n_jobs=50,
                                           criterion='entropy', random_state=42, verbose=1)
    voting_est = ensemble.VotingClassifier(estimators=[('rf', rf_est), ('gbm', gbm_est), ('et', et_est)],
                                           voting='soft', weights=[3, 5, 2],
                                           n_jobs=50)
    voting_est.fit(titanic_train_data_X, titanic_train_data_Y)

PS: if you would rather not use VotingClassifier, you can also assign your own weights to the models' outputs based on their test accuracy and take the weighted average as the prediction; in my own tests, custom weights performed no worse than VotingClassifier.
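
A minimal sketch of that manual weighting, assuming the three estimators defined above and illustrative, untuned weights:

    # fit each base model separately and average their class-1 probabilities
    weights = {'rf': 3, 'gbm': 5, 'et': 2}
    probas = {}
    for name, est in [('rf', rf_est), ('gbm', gbm_est), ('et', et_est)]:
        est.fit(titanic_train_data_X, titanic_train_data_Y)
        probas[name] = est.predict_proba(titanic_test_data_X)[:, 1]
    # weighted average of the probabilities, thresholded at 0.5
    weighted_proba = sum(w * probas[n] for n, w in weights.items()) / sum(weights.values())
    manual_pred = (weighted_proba >= 0.5).astype(int)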

(4) Predict and generate the submission file:

    titanic_test_data_X['Survived'] = voting_est.predict(titanic_test_data_X)
    submission = pd.DataFrame({'PassengerId': test_data_org.loc[:, 'PassengerId'],
                               'Survived': titanic_test_data_X.loc[:, 'Survived']})
    submission.to_csv('submission_result.csv', index=False, sep=',')

That completes the pipeline.
The code draws on what was shared in Kernels; the best submission scored 80.8%, landing in the top 8% of the leaderboard. The full script runs to over 400 lines, is quite thorough about the factors it considers, and its functions are written in a fairly standard style, which makes it good study material for newcomers.

My skill is limited; if you spot mistakes, corrections are welcome.

Published on 2017-10-29