Kaggle Learn Study Notes (2)

Machine Learning

Posted by Wenlong Shen on September 27, 2018

How Models Work

Prediction is one of the main tasks of machine learning: use training data to build and fit (train) a model, then use that model to predict on new data.

Explore Your Data


import pandas as pd
# read the data and store data in DataFrame
data = pd.read_csv(data_file_path)
# print a summary of the data
print(data.describe())

describe() reports count, mean, std, min, max, and the quartiles for each numeric column.

Your First Machine Learning Model



from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
model = DecisionTreeRegressor(random_state=1)
# Fit model
model.fit(X, y)
# Make predictions
predicted = model.predict(X)

Model Validation


from sklearn.metrics import mean_absolute_error
predicted = model.predict(X)
mean_absolute_error(y, predicted)

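The model was trained on the training data, so a metric computed on that same data says little about how well the model generalizes and tends to reward overfitting. We need separate validation data to reflect the model's real performance.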

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
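# Define a fresh model and fit it on the training split only
model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)
# Evaluate on the held-out validation split
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))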
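
To see why in-sample error is misleading, here is a minimal sketch on synthetic data (the data and the noise level are assumptions made purely for illustration). An unconstrained decision tree memorizes its training rows, so its training MAE is essentially zero while its validation MAE is not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): a noisy linear target
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=200)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on the rows the tree was fitted on: it has memorized them
train_mae = mean_absolute_error(train_y, model.predict(train_X))
# Error on rows the tree never saw: the more honest estimate
val_mae = mean_absolute_error(val_y, model.predict(val_X))

print("train MAE: %.3f, validation MAE: %.3f" % (train_mae, val_mae))
```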

Underfitting and Overfitting

Overfitting and underfitting are common problems, and the trade-off between them is a key consideration when tuning model parameters.

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
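
The loop above calls get_mae, which these notes do not define. A minimal sketch, assuming it fits a DecisionTreeRegressor capped at max_leaf_nodes on the training split and returns the MAE on the validation split (the fixed random_state is also an assumption). It would need to be defined before the loop runs.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with at most max_leaf_nodes leaves; return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    val_predictions = model.predict(val_X)
    return mean_absolute_error(val_y, val_predictions)
```

Too few leaves tends to underfit and too many tends to overfit, so the value with the lowest validation MAE is usually somewhere in between.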

Handling Missing Values


missing_val_count_by_column = (data.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])


data_without_missing_values = original_data.dropna(axis=1)
# another way
cols_with_missing = [col for col in original_data.columns if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)


from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)


# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()
# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()
# Imputation
# fit_transform returns a NumPy array, so keep the column names
# (including the new *_was_missing columns) and restore them afterwards
imputed_columns = new_data.columns
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data), columns=imputed_columns)
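
The snippets in this section appear to cover three strategies: dropping columns with missing values, imputing them, and imputing while recording what was imputed. A small sketch on a toy DataFrame (the column names and values are hypothetical) shows what each one produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical column names
original_data = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 4.0],
    "area": [50.0, np.nan, 80.0, 110.0],
    "price": [100.0, 150.0, 170.0, 240.0],
})

# Strategy 1: drop every column that contains a missing value
dropped = original_data.dropna(axis=1)

# Strategy 2: fill missing values with the column mean (SimpleImputer's default)
imputed = pd.DataFrame(
    SimpleImputer().fit_transform(original_data),
    columns=original_data.columns,
)

# Strategy 3: add indicator columns first, then impute
flagged = original_data.copy()
for col in [c for c in original_data.columns if original_data[c].isnull().any()]:
    flagged[col + "_was_missing"] = flagged[col].isnull()
flagged = pd.DataFrame(
    SimpleImputer().fit_transform(flagged),
    columns=flagged.columns,
)

print(list(dropped.columns))
print(imputed)
print(flagged)
```

Dropping is the simplest option but discards whole columns; imputation keeps them, and the indicator columns let the model learn from the fact that a value was missing.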

Using Categorical Data with One Hot Encoding

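Discrete categorical data cannot simply be fitted as if it were a numeric curve. One common way to handle it is One-Hot Encoding, which encodes each distinct category value as its own new binary feature.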

one_hot_encoded_train = pd.get_dummies(train)
# Ensure the test data is encoded in the same manner as the training data with the align command
one_hot_encoded_test = pd.get_dummies(test)
final_train, final_test = one_hot_encoded_train.align(one_hot_encoded_test, join='left', axis=1)
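
The effect of align is easiest to see on a toy example (the frames below are made up for illustration). With join='left', the test frame is forced onto the training columns: categories that appear only in test are dropped, and categories that appear only in train show up in test as NaN.

```python
import pandas as pd

# Toy frames: "green" appears only in train, "blue" only in test
train = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "red"]})
test = pd.DataFrame({"size": [4, 5], "color": ["red", "blue"]})

one_hot_encoded_train = pd.get_dummies(train)
one_hot_encoded_test = pd.get_dummies(test)

# join='left' keeps exactly the columns of the left (training) frame, in order
final_train, final_test = one_hot_encoded_train.align(
    one_hot_encoded_test, join="left", axis=1
)

print(list(final_train.columns))
print(list(final_test.columns))
print(final_test)
```

The NaN column in the aligned test frame would typically need to be filled (for example with 0) before prediction, unless the model handles missing values itself.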


XGBoost

XGBoost is a heavyweight model, and the one behind my first Kaggle silver medal. Its core algorithm is Gradient Boosted Decision Trees.

from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(train_X, train_y, verbose=False)
predictions = my_model.predict(test_X)
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(test_y, predictions)))


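# Recent xgboost releases (1.6+) take early_stopping_rounds in the constructor rather than in fit()
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
my_model.fit(train_X, train_y, eval_set=[(test_X, test_y)], verbose=False)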
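
To make the idea of Gradient Boosted Decision Trees concrete, here is a toy sketch for squared-error loss: each new tree is fitted to the residuals of the ensemble so far, and its predictions are added with a small learning rate. This is not how XGBoost itself is implemented (it omits regularization, second-order gradients, early stopping and much else); the function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared error: each tree fits current residuals."""
    base = float(np.mean(y))                 # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction           # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees


def predict_boosted_trees(model, X):
    """Sum the constant base and the shrunken contribution of every tree."""
    base, learning_rate, trees = model
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction


# Synthetic data for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

boosted = fit_boosted_trees(X, y)
pred = predict_boosted_trees(boosted, X)
print("baseline MSE:", np.mean((y - y.mean()) ** 2))
print("boosted MSE: ", np.mean((y - pred) ** 2))
```

In this picture, n_estimators is the number of boosting rounds and learning_rate shrinks each tree's contribution, which is why a smaller learning rate is usually paired with more estimators.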

Partial Dependence Plots


from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models
# (in sklearn.ensemble.partial_dependence); that module has since been removed, and
# sklearn.inspection works with any fitted estimator.
my_model = GradientBoostingRegressor()
# fit the model as usual
my_model.fit(X, y)
# Here we make the plot
my_plots = PartialDependenceDisplay.from_estimator(my_model,
    X,                              # raw predictors data
    features=[0, 2],                # column numbers of plots we want to show
    feature_names=['A', 'B', 'C'],  # labels on graphs
    grid_resolution=10)             # number of values to plot on x axis
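
A partial dependence curve shows how the model's average prediction changes as one feature is varied while the others are left as observed. The sketch below computes that average by brute force, under the assumption of synthetic data where only the first feature matters; manual_partial_dependence is a hypothetical helper, and scikit-learn's own implementation differs in details such as how the grid is chosen.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def manual_partial_dependence(model, X, feature, grid_resolution=10):
    """Average prediction as one feature is swept over a grid of values."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_resolution)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value            # force every row to this value
        averages.append(model.predict(X_mod).mean())
    return grid, np.array(averages)


# Synthetic data: the target depends on feature 0 only
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

my_model = GradientBoostingRegressor(random_state=0).fit(X, y)

grid0, pd0 = manual_partial_dependence(my_model, X, feature=0)
grid2, pd2 = manual_partial_dependence(my_model, X, feature=2)
print("feature 0:", np.round(pd0, 2))
print("feature 2:", np.round(pd2, 2))
```

The curve for feature 0 should rise steadily, while the curve for feature 2 should stay nearly flat, which is what a partial dependence plot would show visually.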



Pipelines

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# sklearn.preprocessing.Imputer has been removed; SimpleImputer is its replacement
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(test_X)
# Here is the code to do the same thing without pipelines
my_imputer = SimpleImputer()
my_model = RandomForestRegressor()
# fit the imputer on the training data only, then reuse it on the test data
imputed_train_X = my_imputer.fit_transform(train_X)
imputed_test_X = my_imputer.transform(test_X)
my_model.fit(imputed_train_X, train_y)
predictions = my_model.predict(imputed_test_X)



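Cross-Validation

Splitting off a test set with train_test_split is one approach, but not the best one: the result still depends on a random split, it is unclear how much data to hold out, and whatever is held out is lost to training. When the dataset is not particularly large, Cross-Validation is a good alternative. Continuing with the pipeline from the previous section: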

from sklearn.model_selection import cross_val_score
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
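
scikit-learn scorers follow a "higher is better" convention, so neg_mean_absolute_error returns negated MAE values, one per fold. A self-contained sketch on synthetic data (the data, the 5-fold setting and the forest size are assumptions for illustration) shows how the scores are usually turned back into an MAE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(120, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=120)
X[rng.rand(120) < 0.1, 1] = np.nan      # knock out some values for the imputer

my_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

# One negated MAE per fold; the pipeline is refitted from scratch on each fold
scores = cross_val_score(my_pipeline, X, y,
                         scoring="neg_mean_absolute_error", cv=5)
mae_per_fold = -scores

print("MAE per fold:", np.round(mae_per_fold, 3))
print("Mean MAE: %.3f" % mae_per_fold.mean())
```

Because the imputer sits inside the pipeline, it is fitted only on each fold's training portion, which keeps information from the held-out portion out of the preprocessing step.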

Data Leakage
