Getting Started with Kaggle

2023-05-14

Reference: "Kaggle入门，看这一篇就够了" (Getting started with Kaggle: this one article is enough), Zhihu (zhihu.com)

scikit-learn

Reference: https://scikit-learn.org/stable/index.html

A First Machine Learning Model

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like. The fitted model produces outputs for the data it is given.
Evaluate: Determine how accurate the model’s predictions are.

Define and Fit

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model (X is the feature table and y the target column; both are assumed to be prepared beforehand)
melbourne_model.fit(X, y)

Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won’t depend meaningfully on exactly which value you choose.
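As a quick check of the reproducibility claim, here is a minimal sketch that fits two trees with the same random_state and compares their predictions. The data is synthetic (drawn from NumPy's random generator) and stands in for the housing features and prices; it is an assumption for illustration, not the Melbourne dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=200)

# Two models defined with the same random_state and fitted on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X, y)
model_b = DecisionTreeRegressor(random_state=1).fit(X, y)

# With identical seeds and data, the two models should predict identically
same_predictions = np.array_equal(model_a.predict(X), model_b.predict(X))
print(same_predictions)
```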

Predict

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))  # predict for the first few rows

Model Validation

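Many people make a huge mistake when measuring predictive accuracy: they make predictions on the training data and compare those predictions with the target values in that same training data.

There are many metrics for summarizing model quality; one of them is mean absolute error (MAE).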

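The prediction error for a single house is

$error = actual\_val - predict\_val$

MAE takes the absolute value of each error and averages over all $n$ houses, so the sign convention (actual minus predicted, or predicted minus actual) does not matter:

$mean\_absolute\_error = \frac{1}{n} \sum_{i=1}^{n} |actual\_val_i - predict\_val_i|$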
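MAE is the mean of the absolute differences between actual and predicted values. The sketch below computes it by hand on four made-up prices (the numbers are purely illustrative) and compares the result with scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration only
actual = np.array([300.0, 250.0, 410.0, 180.0])
predicted = np.array([280.0, 265.0, 400.0, 200.0])

errors = actual - predicted              # 20, -15, 10, -20
mae_by_hand = np.mean(np.abs(errors))    # (20 + 15 + 10 + 20) / 4 = 16.25
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

Swapping the order to predicted - actual flips the sign of every error but leaves the absolute values, and therefore the MAE, unchanged.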

from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)  # in-sample score

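The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses both to build the model and to evaluate it. Here is why that is bad.

Imagine that, in the wider real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern and will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate on the training data.

But if the pattern does not hold when the model sees new data, the model will be very inaccurate when used in practice.

Since a model's practical value comes from its predictions on new data, we measure performance on data that was not used to build the model. The most straightforward way to do this is to exclude some data from the model-building process and then use it to test the model's accuracy on data it has never seen before. This data is called validation data.

The scikit-learn library has a function, train_test_split, that splits the data into two pieces. We use one piece as training data to fit the model and the other as validation data to compute mean_absolute_error.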

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))  # out-of-sample score
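To see how far apart the in-sample and validation scores can be, here is a self-contained sketch on synthetic data. The features, the noise level, and the variable names in_sample_mae and validation_mae are assumptions made for illustration; they are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one informative feature plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = 100 * X[:, 0] + rng.normal(0, 10, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=1).fit(train_X, train_y)

# Score on the data the model was fitted on, and on held-out data
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
validation_mae = mean_absolute_error(val_y, model.predict(val_X))
print(in_sample_mae, validation_mae)
```

An unconstrained tree can memorize the noise in its training rows, so the in-sample score looks far better than the score on rows it has not seen.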

Underfitting and Overfitting

Overfitting

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes’ actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses). This is a phenomenon called overfitting.

Underfitting

At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.
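One way to explore this trade-off in scikit-learn is to limit the number of leaves with the max_leaf_nodes parameter of DecisionTreeRegressor and compare training and validation MAE for several sizes. The sketch below does that on synthetic data; the helper name get_mae, the candidate sizes, and the data itself are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes; return (training MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = 100 * X[:, 0] + rng.normal(0, 10, size=500)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Very small trees tend to underfit; very large trees tend to overfit
results = {}
for n_leaves in [2, 10, 50, 300]:
    results[n_leaves] = get_mae(n_leaves, train_X, val_X, train_y, val_y)
    print(n_leaves, results[n_leaves])
```

Training MAE keeps falling as leaves are added, so the validation MAE is the number to compare when choosing a tree size.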

Coding Basics

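axis: specifying the dimension an operation runs along

When operating on data, you often need to work along either the horizontal or the vertical direction; the axis parameter selects which:

axis = 0 refers to axis 0 (the row index): the operation runs down the rows, giving one result per column;
axis = 1 refers to axis 1 (the column index): the operation runs across the columns, giving one result per row.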

In [1]: import numpy as np
# create an array with 3 rows and 4 columns
In [2]: a = np.arange(12).reshape(3,4)
In [3]: a
Out[3]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
# axis=0 collapses the rows of a: the sum runs vertically, giving one value per column
In [4]: a.sum(axis = 0)
Out[4]: array([12, 15, 18, 21])
# axis=1 collapses the columns of a: the sum runs horizontally, giving one value per row
In [5]: a.sum(axis = 1)
Out[5]: array([ 6, 22, 38])

axis=0: operate along the first dimension (rows)

axis=1: operate along the second dimension (columns)

axis=2: operate along the third dimension, and so on
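A handy way to remember this is that the axis you name is the one that disappears from the result's shape. The sketch below checks that on the 3x4 array from the session above and on a 3-D array; the array b and its shape are chosen only for illustration.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)       # shape (3, 4)
col_sums = a.sum(axis=0)              # axis 0 is collapsed -> shape (4,)
row_sums = a.sum(axis=1)              # axis 1 is collapsed -> shape (3,)
print(col_sums.shape, row_sums.shape)

b = np.arange(24).reshape(2, 3, 4)    # shape (2, 3, 4)
print(b.sum(axis=0).shape)            # remaining axes: (3, 4)
print(b.sum(axis=1).shape)            # remaining axes: (2, 4)
print(b.sum(axis=2).shape)            # remaining axes: (2, 3)
```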

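Handling specific rows or columns in pandas

Deleting or selecting rows or columns that contain a specific value

Reference: "pandas.DataFrame删除/选取含有特定数值的行或列" (deleting/selecting rows or columns containing a specific value in a pandas.DataFrame), luocheng7430's blog, CSDN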
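As a rough sketch of the idea (not taken from the referenced post), boolean masks are one common way to select or drop rows and columns by value. The DataFrame and its column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "rooms": [2, 3, 4, 3],
    "price": [100, 150, 200, 160],
})

# Select rows where a column equals a specific value
rooms_eq_3 = df[df["rooms"] == 3]

# Delete rows containing a specific value by keeping the complement
without_city_a = df[df["city"] != "A"]

# Select rows whose value is in a list of allowed values
in_b_or_c = df[df["city"].isin(["B", "C"])]

# Delete columns that contain the value 3 anywhere
has_3 = (df == 3).any(axis=0)         # one boolean per column
without_cols = df.loc[:, ~has_3]

print(rooms_eq_3, without_city_a, in_b_or_c, without_cols, sep="\n\n")
```

Each expression returns a new DataFrame and leaves df unchanged.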

Handling duplicate rows in pandas

Reference: "pandas: DataFrame 删除重复的行" (pandas: removing duplicate rows from a DataFrame), 大白羊的进阶之路's blog, CSDN
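A minimal sketch of duplicate handling with pandas' built-in duplicated and drop_duplicates methods follows. It is not taken from the referenced post, and the DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical example data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "name": ["a", "b", "a", "a"],
    "value": [1, 2, 1, 3],
})

# True for every row that repeats an earlier row in all columns
dup_mask = df.duplicated()

# Drop full-row duplicates, keeping the first occurrence
deduped = df.drop_duplicates()

# Treat rows as duplicates when "name" matches, keeping the last occurrence
deduped_by_name = df.drop_duplicates(subset=["name"], keep="last")

print(dup_mask.tolist())
print(deduped)
print(deduped_by_name)
```

Both calls return new DataFrames; the original row labels are preserved unless the index is reset afterwards.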

Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.