This post summarizes the analysis workflow, feature engineering, algorithm selection, and common problems.

Analysis workflow

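1. Data cleaning, transformation, and missing-value imputation (steps 1 and 2 are done together)
2. EDA (plot data distributions, box plots, correlation coefficients, etc.)
3. Feature engineering (select features to reduce their number, and combine features to create new ones)
4. Modeling
5. Hyperparameter tuning
6. Evaluation
7. Ensembling (combining models)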
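
The steps above can be sketched end to end with scikit-learn. This is only a minimal illustration under stated assumptions: the data is synthetic, the estimators (`LogisticRegression`, `RandomForestClassifier`) and the helper `make_model` are choices made for this example, and EDA (step 2) is skipped because it is mostly plotting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption), with 5% of values removed so there is something to impute.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def make_model(estimator):
    # Steps 1 and 3: imputation, scaling, and feature selection sit inside the
    # pipeline so they are re-fit on every cross-validation fold.
    return make_pipeline(SimpleImputer(strategy="median"), StandardScaler(),
                         SelectKBest(f_classif, k=5), estimator)

# Steps 4 and 5: fit a model and tune one hyperparameter with cross-validation.
grid = GridSearchCV(make_model(LogisticRegression(max_iter=1000)),
                    {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Step 7: combine the tuned model with a second, different model.
ensemble = VotingClassifier(
    [("lr", grid.best_estimator_),
     ("rf", make_model(RandomForestClassifier(random_state=0)))],
    voting="soft")
ensemble.fit(X_train, y_train)

# Step 6: evaluate on data that played no part in fitting or tuning.
test_accuracy = accuracy_score(y_test, ensemble.predict(X_test))
print(grid.best_params_, round(test_accuracy, 3))
```

Keeping the preprocessing inside the pipeline is a design choice: it prevents statistics from the validation folds from leaking into the imputer or scaler.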

The following is an example of parameter tuning with a grid search:

In [18]:

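import numpy as np
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV

# Search over the polynomial degree and the LinearRegression options.
# Note: LinearRegression's `normalize` parameter was removed in scikit-learn 1.2;
# drop that entry when running on 1.2 or later.
param_grid = {'polynomialfeatures__degree': np.arange(21),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}

# PolynomialRegression() is assumed to be defined earlier (a PolynomialFeatures + LinearRegression pipeline)
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)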

Notice that like a normal estimator, this has not yet been applied to any data. Calling the fit() method will fit the model at each grid point, keeping track of the scores along the way:

In [19]:

grid.fit(X, y);

Now that this is fit, we can ask for the best parameters as follows:

In [20]:

grid.best_params_

Out[20]:

{'linearregression__fit_intercept': False,
 'linearregression__normalize': True,
 'polynomialfeatures__degree': 4}

Finally, if we wish, we can use the best model and show the fit to our data using code from before:

In [21]:

model = grid.best_estimator_

plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
# the `hold` keyword was removed in Matplotlib 3.0; repeated plot calls draw on the same axes by default
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
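
For readers who want to run the cells above on a current scikit-learn, here is a self-contained sketch. It rests on several assumptions: `PolynomialRegression` is taken to be a `PolynomialFeatures` + `LinearRegression` pipeline (which is what the `polynomialfeatures__` and `linearregression__` parameter prefixes suggest), the data is synthetic and hypothetical, the `normalize` option is left out because it no longer exists in scikit-learn 1.2 and later, and plotting is omitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def PolynomialRegression(degree=2, **kwargs):
    # Assumed definition: polynomial feature expansion followed by least squares.
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

# Hypothetical synthetic data standing in for the X, y used in the cells above.
rng = np.random.RandomState(1)
X = rng.rand(40, 1) ** 2
y = 10 - 1.0 / (X.ravel() + 0.1) + rng.randn(40)
X_test = np.linspace(-0.1, 1.1, 500)[:, None]

param_grid = {"polynomialfeatures__degree": np.arange(21),
              "linearregression__fit_intercept": [True, False]}

# 21 degrees x 2 intercept settings x 7 folds = 294 fits.
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

# With the default refit=True, best_estimator_ has already been refit on all of X, y.
model = grid.best_estimator_
y_test = model.predict(X_test)
print(grid.best_params_)
```

Because of the default `refit=True`, the extra `model.fit(X, y)` call in the cell above is harmless but not strictly needed.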

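More data? Yes! The more data you have, the better it represents the population distribution. When a model overfits, first try to get more data, and reduce model complexity only if no more data is available. If the raw data already represents the population well, even a model that fits it very closely will still be accurate. More data also reduces the effect of regularization, but both serve the same goal of limiting overfitting.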
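
To make the point about data size concrete, the sketch below fits the same deliberately flexible model (a degree-10 polynomial) on 20 and on 2000 samples and compares training and test error. The population (a sine curve plus Gaussian noise), the sample sizes, and the helper names `sample` and `train_and_test_mse` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical population: a sine curve plus Gaussian noise.
    x = rng.uniform(0, 1, size=(n, 1))
    return x, np.sin(2 * np.pi * x.ravel()) + rng.normal(scale=0.3, size=n)

# A large held-out set approximates the error on the whole population.
X_test, y_test = sample(5000)

def train_and_test_mse(n_train, degree=10):
    # Same model complexity every time; only the amount of training data changes.
    X_train, y_train = sample(n_train)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    return (mean_squared_error(y_train, model.predict(X_train)),
            mean_squared_error(y_test, model.predict(X_test)))

train_small, test_small = train_and_test_mse(20)
train_large, test_large = train_and_test_mse(2000)

# The train/test gap is a rough measure of overfitting.
gap_small = test_small - train_small
gap_large = test_large - train_large
print(f"n=20:   train={train_small:.3f} test={test_small:.3f} gap={gap_small:.3f}")
print(f"n=2000: train={train_large:.3f} test={test_large:.3f} gap={gap_large:.3f}")
```

With the larger sample the gap between training and test error should shrink even though the model itself is unchanged.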
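Validation data: tuning parameters against the test data alone gives overly optimistic results. Instead, split a validation set off from the training data, use it to find the best parameters, refit the model on all the training data with those parameters, and then evaluate on test data the model has never seen.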
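
The validation procedure described above might look like the following sketch. The synthetic dataset, the choice of `LogisticRegression`, and the candidate values for `C` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Hold out the test set first and do not touch it until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Carve a validation set out of the training data.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Choose the hyperparameter using the validation score only.
candidates = [0.01, 0.1, 1.0, 10.0]
val_scores = {
    C: LogisticRegression(C=C, max_iter=1000).fit(X_fit, y_fit).score(X_val, y_val)
    for C in candidates
}
best_C = max(val_scores, key=val_scores.get)

# Refit on all the training data with the chosen value, then evaluate once.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_C, round(test_score, 3))
```

Cross-validation (for example `GridSearchCV`) applies the same idea with several validation splits instead of one.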
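Numeric features sometimes need to be binned or converted to one-hot encoding; use domain knowledge of the problem to decide where to put the cut points.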
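
As a small illustration of binning a numeric feature and then one-hot encoding it, the sketch below uses pandas. The `age` column, the bin edges (18 and 65), and the labels are hypothetical; in practice the edges would come from knowledge of the problem domain.

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 17, 18, 34, 65, 80]})

# Hypothetical cut points; right=False makes each bin include its left edge,
# so the bins are [0, 18), [18, 65), and [65, 120).
bins = [0, 18, 65, 120]
labels = ["minor", "adult", "senior"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# One indicator column per bin, in the order of the labels.
one_hot = pd.get_dummies(df["age_group"], prefix="age")
print(pd.concat([df, one_hot], axis=1))
```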
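Precision and recall, using police catching thieves as an analogy: with precision, every arrest must be the right person; with recall, anyone suspicious is arrested, because the goal is to catch every criminal even if some arrests are wrong.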
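
The trade-off in the analogy can be shown by moving a decision threshold. In this sketch the labels, the suspicion scores, and the `arrest` helper are made up for illustration: a strict threshold arrests only the most suspicious people, while a lenient one arrests anyone who looks suspicious.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 1 = thief, 0 = innocent; scores are hypothetical "suspicion" levels.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
suspicion = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1])

def arrest(threshold):
    # Arrest everyone whose suspicion score reaches the threshold.
    y_pred = (suspicion >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

# Strict: 2 arrests, both thieves -> precision 1.0, but only 2 of 4 thieves caught (recall 0.5).
strict = arrest(0.75)
# Lenient: 6 arrests, 4 of them thieves -> precision 4/6, and all 4 thieves caught (recall 1.0).
lenient = arrest(0.25)
print("strict :", strict)
print("lenient:", lenient)
```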

The following are training tips from practice.