Code for normalizing data features with sklearn

Terence888 · Posted 2024-10-18 20:58:38

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
age_scale_param = scaler.fit(df['Age'])
df['Age_scaled'] = scaler.fit_transform(df['Age'], age_scale_param)
fare_scale_param = scaler.fit(df['Fare'])
df['Fare_scaled'] = scaler.fit_transform(df['Fare'], fare_scale_param)
df

In this code, what do

age_scale_param = scaler.fit(df['Age'])

and

scaler.fit_transform(df['Age'], age_scale_param)

mean?

FishC · Posted 2024-10-18 20:58:46

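Your code mixes up a few concepts, which leads to some incorrect usage. Here is how the pieces fit together.

Standardizing features with `sklearn.preprocessing` usually takes two steps:

1. **Create the scaler object**: `scaler = preprocessing.StandardScaler()`

2. **Fit and transform the data**: either call `fit_transform` once, or call `fit` and then `transform`. In detail:

- `age_scale_param = scaler.fit(df['Age'])`: this computes the mean and standard deviation of the column and stores them on the scaler. `fit` returns the fitted scaler itself, so `age_scale_param` is just another name for `scaler`, not a separate set of parameters, and the assignment is unnecessary.

- `scaler.fit_transform(df['Age'], age_scale_param)`: this fits the scaler and then standardizes the data with the fitted mean and standard deviation, combining `fit` and `transform` in one call. The second positional argument is `y`, which `StandardScaler` ignores, so passing `age_scale_param` has no effect.

In your code, the separate `scaler.fit(...)` calls are therefore redundant, because `fit_transform` refits the scaler anyway. In addition, recent versions of scikit-learn expect a 2-D input, so pass a DataFrame such as `df[['Age']]` rather than the 1-D Series `df['Age']`. The recommended form is to call `fit_transform` directly on the columns to scale:

import sklearn.preprocessing as preprocessing
# Create the scaler object
scaler = preprocessing.StandardScaler()
# Standardize the numeric columns (double brackets keep the input 2-D)
df_scaled = scaler.fit_transform(df[['Age', 'Fare']])
# Inspect the standardized data
print(df_scaled)

This standardizes the selected columns of `df`; `df_scaled` is a NumPy array in which each feature is shifted to zero mean and scaled to unit variance. Strictly speaking this is standardization; scaling to a fixed range such as [0, 1] is done with `MinMaxScaler`.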
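To make the difference between `fit`, `transform`, and `fit_transform` concrete, here is a minimal sketch. The small DataFrame is a made-up stand-in for the Titanic data in this thread, so the numbers are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Titanic DataFrame used in this thread
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0],
                   "Fare": [7.25, 71.28, 7.93, 53.10]})

scaler = StandardScaler()

# fit() learns the column statistics and returns the scaler itself
fitted = scaler.fit(df[["Age"]])
print(fitted is scaler)              # True: same object, not a parameter set
print(scaler.mean_, scaler.scale_)   # learned mean and standard deviation

# transform() applies the learned mean and standard deviation
age_scaled = scaler.transform(df[["Age"]])

# fit_transform() performs both steps in one call
age_scaled_again = StandardScaler().fit_transform(df[["Age"]])

# Scale two columns together and store each result under a new name
scaled = StandardScaler().fit_transform(df[["Age", "Fare"]])
df["Age_scaled"] = scaled[:, 0]
df["Fare_scaled"] = scaled[:, 1]
print(df)
```

Calling `fit` separately is mainly useful when the same statistics must later be applied to other data, for example fitting on the training set and then calling `transform` on the test set.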

The reply above is from the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up.

Terence888 · Posted 2024-10-18 21:03:15

In reply to FishC (2024-10-18 20:58):

data_train['Age_scaled'] = scaler.fit_transform(data_train['Age'])

The code above raises the following error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[62], line 6
3 import sklearn.preprocessing as preprocessing
4 scaler = preprocessing.StandardScaler()
----> 6 data_train['Age_scaled'] = scaler.fit_transform(data_train['Age'])
File /opt/conda/lib/python3.10/site-packages/sklearn/utils/_set_output.py:140, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
138 @wraps(f)
139 def wrapped(self, X, *args, **kwargs):
--> 140 data_to_wrap = f(self, X, *args, **kwargs)
141 if isinstance(data_to_wrap, tuple):
142 # only wrap the first output for cross decomposition
143 return (
144 _wrap_data_with_container(method, data_to_wrap[0], X, self),
145 *data_to_wrap[1:],
146 )
File /opt/conda/lib/python3.10/site-packages/sklearn/base.py:878, in TransformerMixin.fit_transform(self, X, y, **fit_params)
874 # non-optimized default implementation; override when a better
875 # method is possible for a given clustering algorithm
876 if y is None:
877 # fit method of arity 1 (unsupervised transformation)
--> 878 return self.fit(X, **fit_params).transform(X)
879 else:
880 # fit method of arity 2 (supervised transformation)
881 return self.fit(X, y, **fit_params).transform(X)
File /opt/conda/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:824, in StandardScaler.fit(self, X, y, sample_weight)
822 # Reset internal state before fitting
823 self._reset()
--> 824 return self.partial_fit(X, y, sample_weight)
File /opt/conda/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:861, in StandardScaler.partial_fit(self, X, y, sample_weight)
858 self._validate_params()
860 first_call = not hasattr(self, "n_samples_seen_")
--> 861 X = self._validate_data(
862 X,
863 accept_sparse=("csr", "csc"),
864 dtype=FLOAT_DTYPES,
865 force_all_finite="allow-nan",
866 reset=first_call,
867 )
868 n_features = X.shape[1]
870 if sample_weight is not None:
File /opt/conda/lib/python3.10/site-packages/sklearn/base.py:565, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
563 raise ValueError("Validation should be done on X, y or both.")
564 elif not no_val_X and no_val_y:
--> 565 X = check_array(X, input_name="X", **check_params)
566 out = X
567 elif no_val_X and not no_val_y:
File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:902, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
900 # If input is 1D raise error
901 if array.ndim == 1:
--> 902 raise ValueError(
903 "Expected 2D array, got 1D array instead:\narray={}.\n"
904 "Reshape your data either using array.reshape(-1, 1) if "
905 "your data has a single feature or array.reshape(1, -1) "
906 "if it contains a single sample.".format(array)
907 )
909 if dtype_numeric and array.dtype.kind in "USV":
910 raise ValueError(
911 "dtype='numeric' is not compatible with arrays of bytes/strings."
912 "Convert your data to numeric values explicitly instead."
913 )
ValueError: Expected 2D array, got 1D array instead:
array=[22. 38. 26. 35. 35. 54. 2. 27. 14. 4. 58. 20.
39. 14. 55. 2. 31. 35. 34. 15. 28. 8. 38. 19.
40. 66. 28. 42. 21. 18. 14. 40. 27. 3. 19. 18.
7. 21. 49. 29. 65. 21. 28.5 5. 11. 22. 45. 4.
29. 19. 17. 26. 32. 16. 21. 26. 32. 25. 0.83 30.
22. 29. 28. 17. 33. 16. 23. 24. 29. 20. 46. 26.
59. 71. 23. 34. 34. 28. 21. 33. 37. 28. 21. 38.
47. 14.5 22. 20. 17. 21. 70.5 29. 24. 2. 21. 32.5
32.5 54. 12. 24. 45. 33. 20. 47. 29. 25. 23. 19.
37. 16. 24. 22. 24. 19. 18. 19. 27. 9. 36.5 42.
51. 22. 55.5 40.5 51. 16. 30. 44. 40. 26. 17. 1.
9. 45. 28. 61. 4. 1. 21. 56. 18. 50. 30. 36.
9. 1. 4. 45. 40. 36. 32. 19. 19. 3. 44. 58.
42. 24. 28. 34. 45.5 18. 2. 32. 26. 16. 40. 24.
35. 22. 30. 31. 27. 42. 32. 30. 16. 27. 51. 38.
22. 19. 20.5 18. 35. 29. 59. 5. 24. 44. 8. 19.
33. 29. 22. 30. 44. 25. 24. 37. 54. 29. 62. 30.
41. 29. 30. 35. 50. 3. 52. 40. 36. 16. 25. 58.
35. 25. 41. 37. 63. 45. 7. 35. 65. 28. 16. 19.
33. 30. 22. 42. 22. 26. 19. 36. 24. 24. 23.5 2.
50. 19. 0.92 17. 30. 30. 24. 18. 26. 28. 43. 26.
24. 54. 31. 40. 22. 27. 30. 22. 36. 61. 36. 31.
16. 45.5 38. 16. 29. 41. 45. 45. 2. 24. 28. 25.
36. 24. 40. 3. 42. 23. 15. 25. 28. 22. 38. 40.
29. 45. 35. 30. 60. 24. 25. 18. 19. 22. 3. 22.
27. 20. 19. 42. 1. 32. 35. 18. 1. 36. 17. 36.
21. 28. 23. 24. 22. 31. 46. 23. 28. 39. 26. 21.
28. 20. 34. 51. 3. 21. 33. 44. 34. 18. 30. 10.
21. 29. 28. 18. 28. 19. 32. 28. 42. 17. 50. 14.
21. 24. 64. 31. 45. 20. 25. 28. 4. 13. 34. 5.
52. 36. 30. 49. 29. 65. 50. 48. 34. 47. 48. 38.
56. 0.75 38. 33. 23. 22. 34. 29. 22. 2. 9. 50.
63. 25. 35. 58. 30. 9. 21. 55. 71. 21. 54. 25.
24. 17. 21. 37. 16. 18. 33. 28. 26. 29. 36. 54.
24. 47. 34. 36. 32. 30. 22. 44. 40.5 50. 39. 23.
2. 17. 30. 7. 45. 30. 22. 36. 9. 11. 32. 50.
64. 19. 33. 8. 17. 27. 22. 22. 62. 48. 39. 36.
40. 28. 24. 19. 29. 32. 62. 53. 36. 16. 19. 34.
39. 32. 25. 39. 54. 36. 18. 47. 60. 22. 35. 52.
47. 37. 36. 49. 49. 24. 44. 35. 36. 30. 27. 22.
40. 39. 35. 24. 34. 26. 4. 26. 27. 42. 20. 21.
21. 61. 57. 21. 26. 80. 51. 32. 9. 28. 32. 31.
41. 20. 24. 2. 0.75 48. 19. 56. 23. 18. 21. 18.
24. 32. 23. 58. 50. 40. 47. 36. 20. 32. 25. 43.
40. 31. 70. 31. 18. 24.5 18. 43. 36. 27. 20. 14.
60. 25. 14. 19. 18. 15. 31. 4. 25. 60. 52. 44.
49. 42. 18. 35. 18. 25. 26. 39. 45. 42. 22. 24.
48. 29. 52. 19. 38. 27. 33. 6. 17. 34. 50. 27.
20. 30. 25. 25. 29. 11. 23. 23. 28.5 48. 35. 36.
21. 24. 31. 70. 16. 30. 19. 31. 4. 6. 33. 23.
48. 0.67 28. 18. 34. 33. 41. 20. 36. 16. 51. 30.5
32. 24. 48. 57. 54. 18. 5. 43. 13. 17. 29. 25.
25. 18. 8. 1. 46. 16. 25. 39. 49. 31. 30. 30.
34. 31. 11. 0.42 27. 31. 39. 18. 39. 33. 26. 39.
35. 6. 30.5 23. 31. 43. 10. 52. 27. 38. 27. 2.
1. 15. 0.83 23. 18. 39. 21. 32. 20. 16. 30. 34.5
17. 42. 35. 28. 4. 74. 9. 16. 44. 18. 45. 51.
24. 41. 21. 48. 24. 42. 27. 31. 4. 26. 47. 33.
47. 28. 15. 20. 19. 56. 25. 33. 22. 28. 25. 39.
27. 19. 26. 32. ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

FishC · Posted 2024-10-18 21:03:21

In reply to Terence888 (2024-10-18 21:03):

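This error occurs because the data passed to `StandardScaler` is a 1-D array, while `StandardScaler` expects a 2-D array of shape `(n_samples, n_features)`. `data_train['Age']` is a pandas Series, which is 1-D.

To fix it, make the input 2-D. For a single feature, reshape the column into one column with `n` rows, for example:

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# data_train['Age'] is a 1-D Series; reshape it into an (n_samples, 1) 2-D array
age_2d = data_train['Age'].to_numpy().reshape(-1, 1)
# Standardize, then flatten the result back to 1-D for the new column
data_train['Age_scaled'] = scaler.fit_transform(age_2d).ravel()
print(data_train['Age_scaled'])

`reshape(-1, 1)` turns the 1-D data into an `n x 1` 2-D array that `StandardScaler` accepts. The result is also `n x 1`, so `ravel()` flattens it back to 1-D before it is stored as a single DataFrame column.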
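An alternative to `reshape(-1, 1)` is to select the column with double brackets, which yields a one-column DataFrame that is already 2-D. The sketch below uses a made-up five-row `data_train`, so the values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of data_train from the thread
data_train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0]})

print(data_train["Age"].shape)     # (5,)   -> 1-D Series, rejected by StandardScaler
print(data_train[["Age"]].shape)   # (5, 1) -> 2-D DataFrame, accepted

scaler = StandardScaler()
# Double brackets select a one-column DataFrame, so no reshape is needed
scaled = scaler.fit_transform(data_train[["Age"]])
data_train["Age_scaled"] = scaled.ravel()
print(data_train)
```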

Terence888 · Posted 2024-10-19 10:25:41

In reply to FishC (2024-10-18 21:03):

train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')

How is the `regex` argument passed, and what does '|' mean?

train_np = train_df.as_matrix()

Why is this operation needed?

FishC · Posted 2024-10-19 10:28:59

In reply to Terence888 (2024-10-19 10:25):

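Here are the answers to your questions, one at a time:

1. **The `regex` argument and the `'|'` symbol**:

In `df.filter(regex=...)`, the `regex` argument takes a regular expression as a string, and the columns whose names match it are selected. For example:

- `'Survived|Age_.*|SibSp'`: a regular expression made of several alternatives.
- `'|'` is the regular-expression "or" operator: it matches the pattern before it or the pattern after it.

In this example:

- `Survived`: matches columns whose name contains `Survived`.
- `Age_.*`: matches columns whose name contains `Age_` followed by anything (`.` means any character, `*` means zero or more repetitions).
- `SibSp`: matches columns whose name contains `SibSp`.

The match is a search anywhere in the column name, not a full-name match; add `^` and `$` to anchor a pattern. The string passed to `regex` is thus a pattern over column names, which makes it possible to select columns by flexible rules.

2. **What `train_df.as_matrix()` does**:

`train_df.as_matrix()` is a method from early versions of pandas that converts a DataFrame into a NumPy array. It was deprecated in pandas 0.23.0 and removed in pandas 1.0, so on current versions use the `.to_numpy()` method (available since 0.24.0) or the `.values` attribute instead. For example: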

train_np = train_df.to_numpy()

or

train_np = train_df.values

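The reason for this conversion is that some model-training code expects NumPy arrays rather than pandas DataFrames. scikit-learn estimators also accept DataFrames directly, so the conversion is mainly needed when the code slices the data by position, as in `train_np[:, 1:]`. NumPy arrays also allow fast linear-algebra operations on the data.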
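Both points can be tried on a small example. The column names below are hypothetical and only resemble the one-hot encoded Titanic data; the sketch shows which columns the pattern keeps and how `to_numpy()` replaces `as_matrix()`.

```python
import pandas as pd

# Hypothetical columns resembling the preprocessed Titanic data
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Age_scaled": [-0.9, 1.3, -0.4],
    "SibSp": [1, 1, 0],
    "Sex_male": [1, 0, 0],
})

# '|' separates alternatives; a column is kept if any alternative is found in its name
train_df = df.filter(regex="Survived|Age_.*|SibSp|Sex_.*")
print(list(train_df.columns))   # ['Survived', 'Age_scaled', 'SibSp', 'Sex_male']

# to_numpy() is the replacement for the removed as_matrix()
train_np = train_df.to_numpy()
print(train_np.shape)           # (3, 4)
```

Note that the plain `Age` column is dropped because `Age_.*` requires the underscore, and that the kept columns stay in their original order.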

Terence888 · Posted 2024-10-19 10:43:39

In reply to FishC (2024-10-19 10:28):

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_33/2196286347.py in ?()
1 # Train a model with logistic regression
2 from sklearn import linear_model
----> 3 train = data_train.fiter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
4 LR_train = train.to_numpy()
5 X_train = LR_train[:,1:]
6 y_ttain = LR_train[:,0]
/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
6295 and name not in self._accessors
6296 and self._info_axis._can_hold_identifiers_and_holds_name(name)
6297 ):
6298 return self[name]
-> 6299 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'fiter'

The error is shown above. The code is as follows:

train = data_train.fiter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')

FishC · Posted 2024-10-19 10:43:44

In reply to Terence888 (2024-10-19 10:43):

The error says that the `DataFrame` object has no attribute named `fiter`. This is a typo: the correct method name is `filter`.

Change the code to:

train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')

The `filter` method selects the columns that match the regular expression; your pattern keeps every column whose name contains one of the alternatives. With the typo fixed, the code should run normally.

Terence888 · Posted 2024-10-19 10:45:45

In reply to FishC (2024-10-19 10:43):

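# Train a model with logistic regression
from sklearn import linear_model
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
LR_train = train.to_numpy()
X_train = LR_train[:,1:]
y_train = LR_train[:,0]
# Use k-fold cross-validation to choose the value of the regularization parameter C
model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the regularization parameter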
model.fit(X_train,y_train)
model

The code above raises an error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[27], line 10
8 # Use k-fold cross-validation to choose the value of the regularization parameter C
9 model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the regularization parameter
---> 10 model.fit(X_train,y_train)
11 model
File /opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1204, in LogisticRegression.fit(self, X, y, sample_weight)
1194 _dtype = [np.float64, np.float32]
1196 X, y = self._validate_data(
1197 X,
1198 y,
(...)
1202 accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
1203 )
-> 1204 check_classification_targets(y)
1205 self.classes_ = np.unique(y)
1207 multi_class = _check_multi_class(self.multi_class, solver, len(self.classes_))
File /opt/conda/lib/python3.10/site-packages/sklearn/utils/multiclass.py:218, in check_classification_targets(y)
210 y_type = type_of_target(y, input_name="y")
211 if y_type not in [
212 "binary",
213 "multiclass",
(...)
216 "multilabel-sequences",
217 ]:
--> 218 raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'

FishC · Posted 2024-10-19 10:52:04

In reply to Terence888 (2024-10-19 10:45):

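This error is most likely caused by the dtype of `y_train`, not by its shape. `LR_train[:,0]` is a 1-D array holding the first column, which is what logistic regression expects. However, when the columns of `train` have mixed dtypes (for example boolean dummy columns next to float columns, which is an assumption about your data), `train.to_numpy()` returns a single array of dtype `object`, and every slice of it is `object` too. scikit-learn cannot infer a label type from an `object` array of numbers, so it reports `Unknown label type: 'unknown'`.

Slicing by position also relies on `Survived` being the first column of `train`, which is easy to break.

### Fix:

1. **Separate the features from the labels by column name**:

- `X_train` should contain only the features.
- `y_train` should contain only the target variable, the `Survived` column.

Assuming the target column is `Survived` and the rest are features, build them like this: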

# Separate the target variable from the DataFrame by column name
X_train = train.drop(columns=['Survived']).to_numpy()
y_train = train['Survived'].to_numpy()
# Train the model
model = linear_model.LogisticRegression(C=1, penalty='l2', tol=1e-6)
model.fit(X_train, y_train)

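### Notes:

- Make sure the `Survived` column exists in your dataset, and that the dataset contains no extra columns.
- Logistic regression requires `y_train` to hold discrete class labels, which for binary classification look like `[0, 1, 1, 0, ...]`. If the labels end up in an `object` array, cast them first, for example with `y_train.astype(int)`.

With these conditions met, the error goes away and the model trains normally.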
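The dtype issue can be reproduced on a tiny frame. This is a sketch under the assumption that the real `train` mixes boolean and numeric columns; the column names and values here are invented for illustration. `type_of_target` is the scikit-learn helper that appears in the traceback.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical frame mixing int, float and bool columns
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age_scaled": [-0.5, 0.7, -0.2, 0.4],
    "Sex_male": [True, False, False, True],
})

# Mixing bool with numeric columns forces a single object-dtype array
mixed = train.to_numpy()
print(mixed.dtype)

# A slice of the object array keeps the object dtype
y_obj = mixed[:, 0]
print(type_of_target(y_obj))

# Selecting the column by name keeps its integer dtype
y_int = train["Survived"].to_numpy()
print(y_int.dtype, type_of_target(y_int))

# Casting the object slice is another way out
print(type_of_target(y_obj.astype(int)))
```

Checking `train.dtypes` and `LR_train.dtype` on the real data would confirm whether this is what happened in your notebook.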

Terence888 · Posted 2024-10-19 10:58:45

In reply to FishC (2024-10-19 10:52):

# Train a model with logistic regression
from sklearn import linear_model
from model.collection import train_test_split
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
X_train = train[:,1:].to_numpy()
y_train = train[:,0].to_numpy()
# Hold-out split into training and validation sets
model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the regularization parameter
model.fit(X_train,y_train)
model

This raises an error:

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[29], line 3
1 # Train a model with logistic regression
2 from sklearn import linear_model
----> 3 from model.collection import train_test_split
4 train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
5 X_train = train[:,1:].to_numpy()
ModuleNotFoundError: No module named 'model'

FishC · Posted 2024-10-19 10:58:56

In reply to Terence888 (2024-10-19 10:58):

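This error comes from importing from the wrong module. `train_test_split` is the `scikit-learn` function for splitting a dataset, and it must be imported from `sklearn.model_selection`, not from `model.collection`. Change the import statement to: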

from sklearn.model_selection import train_test_split

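In addition, `X_train` and `y_train` need to be built differently. `train[:,1:]` is not valid indexing on a DataFrame (positional slicing requires `.iloc`), and selecting by column name is more robust anyway. Assuming the target variable is `Survived`, separate it from the features: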

# Train a model with logistic regression
from sklearn import linear_model
from sklearn.model_selection import train_test_split
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')
# Separate the features from the target variable
X_train = train.drop(columns=['Survived']).to_numpy()
y_train = train['Survived'].to_numpy()
# Hold-out split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model = linear_model.LogisticRegression(C=1, penalty='l2', tol=1e-6)  # C is the inverse of the regularization strength
model.fit(X_train, y_train)
print(model)

The code above shows how to import the function correctly, split the dataset, and train the logistic regression model. Make sure the `train` dataset contains the `Survived` column, and set the features and target accordingly.

Terence888 · Posted 2024-10-19 11:28:02

In reply to FishC (2024-10-19 10:58):

from sklearn import linear_model
from sklearn.model_selection import train_test_split
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
X = train.drop(columns='Survived')
y = train['Survived']
# Hold-out split into training and validation sets
X_train,X_valid,y_train,y_valid = train_test_split(X,y,train_size=0.8,shuffle=True)
model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the regularization parameter
model.fit(X_train,y_train)
pre_valid = model.predict(X_valid)
pd.concat([X_valid,y_valid,pre_valid],axis=1)

The error is as follows:

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[41], line 16
14 model.fit(X_train,y_train)
15 pre_valid = model.predict(X_valid)
---> 16 pd.concat([X_valid,y_valid,pre_valid],axis=1)
17 #acc = model.score(X_train,y_train)
18 #print(acc)
File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
379 elif copy and using_copy_on_write():
380 copy = False
--> 382 op = _Concatenator(
383 objs,
384 axis=axis,
385 ignore_index=ignore_index,
386 join=join,
387 keys=keys,
388 levels=levels,
389 names=names,
390 verify_integrity=verify_integrity,
391 copy=copy,
392 sort=sort,
393 )
395 return op.get_result()
File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py:448, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
445 objs, keys = self._clean_keys_and_objs(objs, keys)
447 # figure out what our result ndim is going to be
--> 448 ndims = self._get_ndims(objs)
449 sample, objs = self._get_sample_object(objs, ndims, keys, names, levels)
451 # Standardize axis parameter to int
File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py:489, in _Concatenator._get_ndims(self, objs)
484 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
485 msg = (
486 f"cannot concatenate object of type '{type(obj)}'; "
487 "only Series and DataFrame objs are valid"
488 )
--> 489 raise TypeError(msg)
491 ndims.add(obj.ndim)
492 return ndims
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

FishC · Posted 2024-10-19 11:36:54

In reply to Terence888 (2024-10-19 11:28):

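In this code, `pd.concat()` fails because `pre_valid` is a NumPy array: `model.predict()` returns an `ndarray`, and `pd.concat()` accepts only `Series` and `DataFrame` objects in the list it is given, as the `TypeError` message states.

To fix it, wrap the predictions in a pandas object first. Since the goal here is usually to compare the model's predictions with the actual labels, you can build a DataFrame from both:

import pandas as pd
# Predict on X_valid with the trained model
pre_valid = model.predict(X_valid)
# Put the actual labels and the predictions side by side
result_df = pd.DataFrame({"Actual": y_valid.to_numpy().flatten(), "Predicted": pre_valid.flatten()})
# Show the combined DataFrame
print(result_df)

In this example:

1. **Build a DataFrame**: `y_valid` and `pre_valid` go into a dictionary, where the `Actual` column holds the true labels and the `Predicted` column holds the labels predicted by the model.

2. **Call `print(result_df)`**: print the result to inspect it and check the model's performance.

This makes it easier to compare predicted and actual values. `y_valid` and `pre_valid` must have the same length for the comparison to line up.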
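If you would rather keep the original `pd.concat([...], axis=1)` call and see the features next to the labels, one option is to wrap the predictions in a `Series` that reuses the index of `X_valid`. The data below is invented, and `pre_valid` is a hand-written stand-in for `model.predict(X_valid)`.

```python
import numpy as np
import pandas as pd

# Hypothetical validation data with a shuffled index, as train_test_split produces
X_valid = pd.DataFrame({"Age_scaled": [0.1, -1.2, 0.8]}, index=[7, 2, 5])
y_valid = pd.Series([1, 0, 1], index=[7, 2, 5], name="Survived")
pre_valid = np.array([1, 0, 0])  # stand-in for model.predict(X_valid)

# Wrap the ndarray in a Series that shares X_valid's index so the rows align
pre_series = pd.Series(pre_valid, index=X_valid.index, name="Predicted")

result = pd.concat([X_valid, y_valid, pre_series], axis=1)
print(result)
```

Reusing the index matters because `pd.concat` with `axis=1` aligns rows by index label; a `Series` with a default `0..n-1` index would not line up with the shuffled validation rows.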

Terence888 · Posted 2024-10-19 12:02:20

In reply to FishC (2024-10-19 11:36):

#Find relevant features
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.cv_results_) + 1), rfecv.cv_results_)
plt.show()

I used the CV scores to evaluate the result on the training set, and got the following error:

Optimal number of features : 8
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[44], line 30
28 plt.xlabel("Number of features selected")
29 plt.ylabel("Cross validation score (nb of correct classifications)")
---> 30 plt.plot(range(1, len(rfecv.cv_results_) + 1), rfecv.cv_results_)
31 plt.show()
File /opt/conda/lib/python3.10/site-packages/matplotlib/pyplot.py:2812, in plot(scalex, scaley, data, *args, **kwargs)
2810 @_copy_docstring_and_deprecators(Axes.plot)
2811 def plot(*args, scalex=True, scaley=True, data=None, **kwargs):
-> 2812 return gca().plot(
2813 *args, scalex=scalex, scaley=scaley,
2814 **({"data": data} if data is not None else {}), **kwargs)
File /opt/conda/lib/python3.10/site-packages/matplotlib/axes/_axes.py:1688, in Axes.plot(self, scalex, scaley, data, *args, **kwargs)
1445 """
1446 Plot y versus x as lines and/or markers.
1447
(...)
1685 (``'green'``) or hex strings (``'#008000'``).
1686 """
1687 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1688 lines = [*self._get_lines(*args, data=data, **kwargs)]
1689 for line in lines:
1690 self.add_line(line)
File /opt/conda/lib/python3.10/site-packages/matplotlib/axes/_base.py:311, in _process_plot_var_args.__call__(self, data, *args, **kwargs)
309 this += args[0],
310 args = args[1:]
--> 311 yield from self._plot_args(
312 this, kwargs, ambiguous_fmt_datakey=ambiguous_fmt_datakey)
File /opt/conda/lib/python3.10/site-packages/matplotlib/axes/_base.py:501, in _process_plot_var_args._plot_args(self, tup, kwargs, return_kwargs, ambiguous_fmt_datakey)
499 self.axes.xaxis.update_units(x)
500 if self.axes.yaxis is not None:
--> 501 self.axes.yaxis.update_units(y)
503 if x.shape[0] != y.shape[0]:
504 raise ValueError(f"x and y must have same first dimension, but "
505 f"have shapes {x.shape} and {y.shape}")
File /opt/conda/lib/python3.10/site-packages/matplotlib/axis.py:1713, in Axis.update_units(self, data)
1711 neednew = self.converter != converter
1712 self.converter = converter
-> 1713 default = self.converter.default_units(data, self)
1714 if default is not None and self.units is None:
1715 self.set_units(default)
File /opt/conda/lib/python3.10/site-packages/matplotlib/category.py:105, in StrCategoryConverter.default_units(data, axis)
103 # the conversion call stack is default_units -> axis_info -> convert
104 if axis.units is None:
--> 105 axis.set_units(UnitData(data))
106 else:
107 axis.units.update(data)
File /opt/conda/lib/python3.10/site-packages/matplotlib/category.py:181, in UnitData.__init__(self, data)
179 self._counter = itertools.count()
180 if data is not None:
--> 181 self.update(data)
File /opt/conda/lib/python3.10/site-packages/matplotlib/category.py:214, in UnitData.update(self, data)
212 # check if convertible to number:
213 convertible = True
--> 214 for val in OrderedDict.fromkeys(data):
215 # OrderedDict just iterates over unique values in data.
216 _api.check_isinstance((str, bytes), value=val)
217 if convertible:
218 # this will only be called so long as convertible is True.
TypeError: unhashable type: 'dict'

FishC · Posted 2024-10-19 12:02:28

In reply to Terence888 (2024-10-19 12:02):

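In your code, `rfecv.cv_results_` is a dictionary, not a sequence of numbers that `plt.plot()` can draw directly, so you need to pull the values you want to plot out of that dictionary.

The mean cross-validation score for each number of features is stored under the key `'mean_test_score'`. Older tutorials use the `grid_scores_` attribute instead, but it was deprecated in scikit-learn 1.0 and removed in 1.2, so it is not available on recent versions. You can change your code to:

import matplotlib.pyplot as plt
# Plot number of features VS. cross-validation scores
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()

### Explanation:

- `rfecv.cv_results_["mean_test_score"]`: an array with the mean cross-validation score for each number of features; it can be passed directly to `plt.plot()` as the `y` values.
- `range(1, len(mean_scores) + 1)`: generates the numbers used for the x-axis. This matches the feature counts because your `RFECV` uses `step=1`.

`plt.plot()` then shows the cross-validation score of the model for each number of features. Make sure `matplotlib.pyplot` is imported before running the code.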

Terence888 · Posted 2024-10-19 14:31:03

In reply to FishC (2024-10-19 12:02):

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_tr_arr, y_tr_arr)
C:\Users\saish\Anaconda2\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py:547: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
print (clf.intercept_, clf.coef_)
[-0.32958987] [[-0.65118738 1.21001434 -1.38924001 -1.46364162]]
pred = clf.predict(X_ts_arr)
print ('Accuracy from sk-learn: {0}'.format(clf.score(X_ts_arr, y_ts_arr)))
Accuracy from sk-learn: 1.0

How is the `score` function used?

FishC · Posted 2024-10-19 14:31:12

In reply to Terence888 (2024-10-19 14:31):

In `scikit-learn`, the `score` method of `LogisticRegression` evaluates the accuracy of the model on the given data. It is used like this:

accuracy = clf.score(X_test, y_test)

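### Parameters:
- `X_test`: the features of the dataset to evaluate on.
- `y_test`: the true labels of that dataset.

### Return value:
- `score` returns the mean accuracy of the model on `X_test`, a value between 0 and 1. 1 means every sample was classified correctly, and 0 means none were.

### About the `DataConversionWarning`:
Your output also contains a `DataConversionWarning` saying that the shape of `y` should be changed. It appears when `y` is passed as a 2-D column vector of shape `(n_samples, 1)`, while `LogisticRegression` expects a 1-D array of shape `(n_samples,)`. You can convert it with `.ravel()`: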

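y_tr_arr = y_tr_arr.ravel()  # convert y_tr_arr to a 1-D array

With the labels in the right shape, and the features and labels matching row by row, the model trains without the warning and `score` returns the accuracy.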
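As a rough illustration of what `score` computes, the sketch below compares it with `accuracy_score` applied to the model's own predictions. The data is synthetic and the variable names are chosen for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
y = y.reshape(-1, 1)  # a column vector, like the y that triggered the warning

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train.ravel())  # ravel() gives the expected shape (n_samples,)

# score() predicts on X_test and returns the fraction of correct predictions
acc = clf.score(X_test, y_test.ravel())
manual = accuracy_score(y_test.ravel(), clf.predict(X_test))
print(acc, manual)
```

The two numbers are the same, since for classifiers `score` is the mean accuracy of `predict(X)` against `y`. An accuracy of exactly 1.0, as in the pasted output, is worth double-checking on a held-out set that the model has not seen.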
