[Technical Discussion] Machine Learning Series: Hyperparameters (completing this machine learning task)

Posted on 2018-6-10 14:00:01

As before, let's start by copying over the code from the earlier posts:
import pandas as pd
import os
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer   # note: renamed to SimpleImputer in scikit-learn 0.20+
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import FeatureUnion

# Load the California housing data.
HOUSING_PATH = "datasets/housing"
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing = load_housing_data()

# Bucket median income into 5 categories for stratified sampling.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

# Stratified train/test split on the income categories.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Drop the income_cat helper column from both sets.
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)

# Separate the features from the target, and numeric from categorical columns.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
housing_num = housing.drop("ocean_proximity", axis=1)

# Engineer three extra ratio features from the raw columns.
rooms_ix, bedroom_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedroom_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

# Wrapper so LabelBinarizer fits the pipeline's (X, y) calling convention.
class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)

# Select a subset of DataFrame columns and return them as a NumPy array.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Full preprocessing: impute medians, add ratio features and scale the
# numeric columns; one-hot encode the category; concatenate the results.
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
num_pipline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribus_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
cat_pipline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', MyLabelBinarizer()),
    ])
full_pipline = FeatureUnion(transformer_list=[
        ("num_pipline", num_pipline),
        ("cat_pipline", cat_pipline),
    ])
housing_prepared = full_pipline.fit_transform(housing)
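A quick sanity check I'm adding here (a sketch; the exact numbers assume the standard 20640-row housing.csv): the prepared matrix should have 16 columns, i.e. 8 numeric attributes, 3 engineered ratios, and 5 one-hot categories.

housing_prepared.shape   # expect (16512, 16) with the standard dataset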

Next, set up the hyperparameter search for our model (hyperparameters are settings such as the number of trees in the random forest, how many features to consider at each split, and so on):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Two sub-grids: one with bootstrapping (the default), one without.
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
# Scoring is negative MSE because sklearn always maximizes the score.
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
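Worth noting (my own aside, not from the original post): the first dict covers 3 × 4 = 12 combinations and the second 1 × 2 × 3 = 6, so the grid search evaluates 18 combinations, each with 5-fold cross-validation, i.e. 90 training runs in total. You can verify the count with sklearn's ParameterGrid:

from sklearn.model_selection import ParameterGrid
# 12 combinations from the first dict + 6 from the second = 18
print(len(list(ParameterGrid(param_grid))))   # 18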

Now let's see which hyperparameters the search found for us:
grid_search.best_params_

Running the line above prints the following:
{'max_features': 8, 'n_estimators': 30}
These are the optimal parameters we'll use: consider at most 8 features at each split and grow 30 trees, which should make our predictions more accurate. Note that both values sit at the top of the ranges we searched, so it may pay to try larger values, as in the sketch below.
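Purely as an illustration (these larger values are my own guesses, not something the original run tried), one might extend the search upward like this:

# Hypothetical follow-up grid -- the values are assumptions, not from the post.
param_grid_larger = [
    {'n_estimators': [30, 50, 100], 'max_features': [8, 10, 12]},
]
grid_search_larger = GridSearchCV(forest_reg, param_grid_larger, cv=5,
                                  scoring='neg_mean_squared_error')
# grid_search_larger.fit(housing_prepared, housing_labels)   # slow; run if curious

Now let's look at the comparison of the results for every combination that was tried: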
cvres = grid_search.cv_results_
# Convert each negative MSE back into an RMSE for readability.
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Running it prints the following results:
64539.9065965 {'n_estimators': 3, 'max_features': 2}
55401.3638545 {'n_estimators': 10, 'max_features': 2}
52849.5504873 {'n_estimators': 30, 'max_features': 2}
60488.5008218 {'n_estimators': 3, 'max_features': 4}
53228.6746741 {'n_estimators': 10, 'max_features': 4}
50746.7001414 {'n_estimators': 30, 'max_features': 4}
59350.5028588 {'n_estimators': 3, 'max_features': 6}
52568.7835547 {'n_estimators': 10, 'max_features': 6}
50013.8405865 {'n_estimators': 30, 'max_features': 6}
59451.5995633 {'n_estimators': 3, 'max_features': 8}
52092.3880668 {'n_estimators': 10, 'max_features': 8}
49857.1401575 {'n_estimators': 30, 'max_features': 8}
61897.0944086 {'n_estimators': 3, 'bootstrap': False, 'max_features': 2}
54548.6814397 {'n_estimators': 10, 'bootstrap': False, 'max_features': 2}
60637.0802455 {'n_estimators': 3, 'bootstrap': False, 'max_features': 3}
53100.3544235 {'n_estimators': 10, 'bootstrap': False, 'max_features': 3}
58487.3207429 {'n_estimators': 3, 'bootstrap': False, 'max_features': 4}
51741.4498429 {'n_estimators': 10, 'bootstrap': False, 'max_features': 4}
Sure enough, the best combination is the one we just saw, and the first one, with the largest RMSE, is the worst. Next, let's look at how important each feature is:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

That output is just a bare array of numbers and not very readable, so I won't paste it here; let's view it in a more intuitive way instead:
from sklearn.preprocessing import LabelEncoder
# Fit a LabelEncoder only to recover the category names of ocean_proximity.
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
extra_attribs = ["rooms_per_household", "pop_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
# Same column order as the pipeline output: numeric, engineered, one-hot.
attribus = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attribus), reverse=True)

The output is:
[(0.35216413184624645, 'median_income'),
(0.1600064373659322, 'INLAND'),
(0.10973914260172225, 'pop_per_household'),
(0.073196611077795654, 'longitude'),
(0.069993781986234183, 'bedrooms_per_room'),
(0.063532534714522193, 'latitude'),
(0.050927579668676295, 'rooms_per_household'),
(0.043936553336933998, 'housing_median_age'),
(0.015821739675359821, 'total_rooms'),
(0.014974729578331942, 'total_bedrooms'),
(0.014716810447382345, 'households'),
(0.014198191755119289, 'population'),
(0.01174446919617937, '<1H OCEAN'),
(0.0028095259721804206, 'NEAR OCEAN'),
(0.0021732837648423087, 'NEAR BAY'),
(6.4477012541252311e-05, 'ISLAND')]
The features scoring only a few thousandths here aren't good features and really ought to be dropped, but that's too much hassle so I'll leave them alone (a sketch of how one might do it is below).
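For what it's worth, here is a minimal sketch of dropping them (the 0.01 cutoff is a hypothetical threshold of mine; the post doesn't pick one):

# Keep only the columns whose importance clears a (hypothetical) threshold.
keep = feature_importances >= 0.01
housing_prepared_reduced = housing_prepared[:, keep]
kept_names = [name for name, k in zip(attribus, keep) if k]

Now let's have the machine learn with the parameters it predicted for us, and evaluate on the test set: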
from sklearn.metrics import mean_squared_error
# Evaluate the best model from the grid search on the held-out test set.
final_model = grid_search.best_estimator_
x_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

# transform (not fit_transform): reuse the statistics learned on the training set.
x_test_prepared = full_pipline.transform(x_test)
final_prediction = final_model.predict(x_test_prepared)

final_mse = mean_squared_error(y_test, final_prediction)
final_rmse = np.sqrt(final_mse)

Then let's look at the root mean squared error:
final_rmse

Mine shows 47568.875366719221, better again than what we got before.
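A single test-set number can be misleading, so as an optional extra (my addition, using scipy's t-interval, not something the original post did) one can put a 95% confidence interval around that RMSE:

from scipy import stats
# 95% confidence interval for the generalization RMSE.
squared_errors = (final_prediction - y_test) ** 2
interval = np.sqrt(stats.t.interval(0.95, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)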


Posted on 2019-1-11 13:17:35
I've noticed very few forum members are learning Python machine learning! There are a lot more doing web scraping.

Posted on 2019-2-16 14:06:11
Nice post, I got something out of it.