Code for normalizing data features with sklearn, Python Discussion, Programming Languages, FishC Forum

Terence888 posted on 2024-10-18 20:58:38

Code for normalizing data features with sklearn

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
age_scale_param = scaler.fit(df['Age'])
df['Age_scaled'] = scaler.fit_transform(df['Age'], age_scale_param)
fare_scale_param = scaler.fit(df['Fare'])
df['Fare_scaled'] = scaler.fit_transform(df['Fare'], fare_scale_param)
df

In this code, what do age_scale_param = scaler.fit(df['Age']) and scaler.fit_transform(df['Age'], age_scale_param) mean?

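FishC posted on 2024-10-18 20:58:46

Your code mixes up a few concepts, which leads to some incorrect usage. Here is how the pieces are meant to be used.

Standardizing features with `sklearn.preprocessing` usually takes the following steps:

1. **Create the scaler object**: `scaler = preprocessing.StandardScaler()`

2. **Fit the data and transform it**: usually done in one step with `fit_transform`, or in two steps, `fit` followed by `transform`. In detail:

- `age_scale_param = scaler.fit(df['Age'])`: this step computes the statistics of the data, namely the mean and standard deviation. The result does not need to be assigned to `age_scale_param`: `fit` returns the fitted scaler object itself rather than a separate set of parameters, so `age_scale_param` is just another name for `scaler`.

- `df_scaled = scaler.fit_transform(df)`: this step combines `fit` and `transform` in a single call. It computes the mean and standard deviation and then uses them to standardize the data. The second positional argument of `fit_transform` is `y`, which `StandardScaler` ignores.

In your code, `scaler.fit_transform(df['Age'], age_scale_param)` and `fare_scale_param = scaler.fit(df['Fare'])` repeat work that `fit_transform` already does, and the extra argument has no effect. The correct approach is to call `fit_transform` directly on the data, as follows: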

import sklearn.preprocessing as preprocessing

# Create the scaler object
scaler = preprocessing.StandardScaler()

# Standardize the data (df must contain only numeric columns)
df_scaled = scaler.fit_transform(df)

# Inspect the standardized data
print(df_scaled)

This completes the standardization of `df`; `df_scaled` holds the standardized data. Each feature is shifted to zero mean and scaled to unit variance.
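
To make the difference between `fit` and `transform` concrete, here is a minimal sketch. The toy values are hypothetical stand-ins for the `Age` and `Fare` columns from the question, not the real data set.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the Titanic columns in the question
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0],
                   "Fare": [7.25, 71.28, 7.93, 53.10]})

scaler = StandardScaler()

# fit() learns the per-column mean and standard deviation and returns the scaler itself
fitted = scaler.fit(df[["Age", "Fare"]])   # double brackets keep the input 2-D
print(fitted is scaler)
print(scaler.mean_, scaler.scale_)

# transform() applies (x - mean) / std using the statistics learned by fit()
scaled = scaler.transform(df[["Age", "Fare"]])
df["Age_scaled"] = scaled[:, 0]
df["Fare_scaled"] = scaled[:, 1]
print(df)
```

Keeping `fit` and `transform` separate is mainly useful when the same statistics must be reused later, for example to scale a test set with the mean and standard deviation learned from the training set.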

The reply above comes from the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up question.

Terence888 posted on 2024-10-18 21:03:15

data_train['Age_scaled'] = scaler.fit_transform(data_train['Age'])
The code above raises the following error:
---------------------------------------------------------------------------
ValueError                            Traceback (most recent call last)
Cell In, line 6
   3 import sklearn.preprocessing as preprocessing
   4 scaler = preprocessing.StandardScaler()
----> 6 data_train['Age_scaled'] = scaler.fit_transform(data_train['Age'])

File /opt/conda/lib/python3.10/site-packages/sklearn/utils/_set_output.py:140, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
138 @wraps(f)
139 def wrapped(self, X, *args, **kwargs):
--> 140 data_to_wrap = f(self, X, *args, **kwargs)
141 if isinstance(data_to_wrap, tuple):
142       # only wrap the first output for cross decomposition
143       return (
144          _wrap_data_with_container(method, data_to_wrap, X, self),
145          *data_to_wrap,
146       )

File /opt/conda/lib/python3.10/site-packages/sklearn/base.py:878, in TransformerMixin.fit_transform(self, X, y, **fit_params)
874 # non-optimized default implementation; override when a better
875 # method is possible for a given clustering algorithm
876 if y is None:
877 # fit method of arity 1 (unsupervised transformation)
--> 878 return self.fit(X, **fit_params).transform(X)
879 else:
880 # fit method of arity 2 (supervised transformation)
881 return self.fit(X, y, **fit_params).transform(X)

File /opt/conda/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:824, in StandardScaler.fit(self, X, y, sample_weight)
822 # Reset internal state before fitting
823 self._reset()
--> 824 return self.partial_fit(X, y, sample_weight)

File /opt/conda/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:861, in StandardScaler.partial_fit(self, X, y, sample_weight)
858 self._validate_params()
860 first_call = not hasattr(self, "n_samples_seen_")
--> 861 X = self._validate_data(
862 X,
863 accept_sparse=("csr", "csc"),
864 dtype=FLOAT_DTYPES,
865 force_all_finite="allow-nan",
866 reset=first_call,
867 )
868 n_features = X.shape
870 if sample_weight is not None:

File /opt/conda/lib/python3.10/site-packages/sklearn/base.py:565, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
563 raise ValueError("Validation should be done on X, y or both.")
564 elif not no_val_X and no_val_y:
--> 565 X = check_array(X, input_name="X", **check_params)
566 out = X
567 elif no_val_X and not no_val_y:

File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:902, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
900 # If input is 1D raise error
901 if array.ndim == 1:
--> 902       raise ValueError(
903          "Expected 2D array, got 1D array instead:\narray={}.\n"
904          "Reshape your data either using array.reshape(-1, 1) if "
905          "your data has a single feature or array.reshape(1, -1) "
906          "if it contains a single sample.".format(array)
907       )
909 if dtype_numeric and array.dtype.kind in "USV":
910 raise ValueError(
911       "dtype='numeric' is not compatible with arrays of bytes/strings."
912       "Convert your data to numeric values explicitly instead."
913 )

ValueError: Expected 2D array, got 1D array instead:
array=[22. 38. 26. 35. 35. 54. 2. 27. 14. 4. 58. 20. ... 27. 19. 26. 32.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

FishC posted on 2024-10-18 21:03:21

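This error occurs because the data passed to `StandardScaler` is a one-dimensional array, whereas `StandardScaler` expects a two-dimensional array, typically of shape `(n_samples, n_features)`.

To fix it, make sure the input is two-dimensional. `data_train['Age']` is a one-dimensional Series (a single feature), so it has to be reshaped first. For example:

import sklearn.preprocessing as preprocessing

scaler = preprocessing.StandardScaler()

# data_train['Age'] is one-dimensional; reshape its values to (n_samples, 1)
age_2d = data_train['Age'].to_numpy().reshape(-1, 1)

# Standardize, then flatten the result so it can be stored as a column
data_train['Age_scaled'] = scaler.fit_transform(age_2d).ravel()

print(data_train['Age_scaled'])

`reshape(-1, 1)` turns the one-dimensional data into a two-dimensional `n x 1` array, which `StandardScaler` can then process.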
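
An alternative that avoids the explicit reshape is to select the column with double brackets, which yields a one-column DataFrame instead of a Series. The sketch below uses a small hypothetical `data_train` to show the two shapes side by side.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the data_train frame from the thread
data_train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]})

print(data_train["Age"].shape)     # (4,)   -> 1-D Series, rejected by StandardScaler
print(data_train[["Age"]].shape)   # (4, 1) -> 2-D DataFrame, accepted

scaler = StandardScaler()
# fit_transform returns an (n_samples, 1) array; ravel() flattens it for column assignment
data_train["Age_scaled"] = scaler.fit_transform(data_train[["Age"]]).ravel()
print(data_train)
```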

Terence888 posted on 2024-10-19 10:25:41

FishC posted on 2024-10-19 10:28:59

Terence888 posted on 2024-10-19 10:25
train_df = df.filter(regex='Survived|Age_.*|SibSp| ...

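Let me answer your questions one by one:

1. **The `regex` parameter and the `'|'` symbol**:

In `df.filter(regex=...)`, the `regex` parameter takes a regular expression that is matched against the column names of the DataFrame to decide which columns to keep. For example:

- `'Survived|Age_.*|SibSp'`: a regular expression made up of several alternatives.
- `'|'` is the "or" operator in regular expressions: it matches either the expression before it or the one after it.

In this example:

- `Survived`: matches columns whose name contains `Survived`.
- `Age_.*`: matches columns whose name contains `Age_` followed by anything (`.` means any character, `*` means any number of repetitions).
- `SibSp`: matches columns whose name contains `SibSp`.

So the string passed to `regex` is the pattern for the column names to select. Because the pattern is searched for anywhere in the name, anchor it with `^` and `$` when an exact match is required.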
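
As a rough illustration of how the alternatives select columns, the following sketch filters an empty DataFrame whose column names are hypothetical, modelled on the ones in the question.

```python
import pandas as pd

# Hypothetical column names modelled on the ones in the question
df = pd.DataFrame(columns=["PassengerId", "Survived", "Age", "Age_scaled",
                           "SibSp", "Name", "Sex_male", "Sex_female"])

# The pattern is searched for inside each name, so a column is kept
# if any alternative matches anywhere in it
kept = df.filter(regex="Survived|Age_.*|SibSp|Sex_.*")
print(list(kept.columns))

# Anchoring with ^...$ demands a match on the whole column name
exact = df.filter(regex=r"^(Survived|SibSp)$")
print(list(exact.columns))
```

Note that the plain `Age` column is dropped by the first pattern because `Age_` requires the underscore.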

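2. **What `train_df.as_matrix()` does**:

`train_df.as_matrix()` is a method from early versions of pandas that converts a DataFrame into a NumPy array. It has been deprecated since pandas 0.23.0 and was removed in pandas 1.0; use the `.to_numpy()` method or the `.values` attribute instead. For example:

train_np = train_df.to_numpy()

or

train_np = train_df.values

The conversion is done because some code expects NumPy arrays rather than DataFrames, and positional slicing and linear algebra operations are straightforward on arrays. scikit-learn estimators also accept a DataFrame directly, so the conversion is a convenience rather than a requirement.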
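
A minimal sketch of the conversion followed by positional slicing, assuming (as in the thread's later code) that the label sits in the first column. The frame itself is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: label in the first column, features after it
train_df = pd.DataFrame({"Survived": [0, 1, 1],
                         "Age_scaled": [-0.5, 1.2, 0.1],
                         "SibSp": [1, 0, 0]})

train_np = train_df.to_numpy()
print(type(train_np), train_np.shape, train_np.dtype)

# Positional slicing: column 0 is the label, the rest are features
y = train_np[:, 0]
X = train_np[:, 1:]
print(X.shape, y.shape)
```

Because every column here is numeric, the resulting array has a single floating-point dtype; mixing in non-numeric or boolean columns changes that.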

Terence888 posted on 2024-10-19 10:43:39

---------------------------------------------------------------------------
AttributeError                         Traceback (most recent call last)
/tmp/ipykernel_33/2196286347.py in ?()
   1 # Train a logistic regression model
   2 from sklearn import linear_model
----> 3 train = data_train.fiter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
   4 LR_train = train.to_numpy()
   5 X_train = LR_train[:,1:]
   6 y_ttain = LR_train[:,0]

/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
6295          and name not in self._accessors
6296          and self._info_axis._can_hold_identifiers_and_holds_name(name)
6297       ):
6298          return self
-> 6299       return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'fiter'
The error is shown above.
The code is as follows:
train = data_train.fiter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')

FishC posted on 2024-10-19 10:43:44

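The error message says that the `DataFrame` object has no method named `fiter`. This is a spelling mistake: the correct method name is `filter`.

Change the code to:

train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')

The `filter` method selects the columns that match the regular expression; your pattern keeps every column whose name contains one of the alternatives. With the typo fixed, this line should run.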

Terence888 posted on 2024-10-19 10:45:45

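# Train a logistic regression model
from sklearn import linear_model
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
LR_train = train.to_numpy()
X_train = LR_train[:,1:]
y_train = LR_train[:,0]

# Use k-fold cross-validation to choose the value of the regularization parameter C
model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the inverse of the regularization strength
model.fit(X_train,y_train)
model

The code above raises an error: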
---------------------------------------------------------------------------
ValueError                            Traceback (most recent call last)
Cell In, line 10
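   8 # Use k-fold cross-validation to choose the value of the regularization parameter C
   9 model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the inverse of the regularization strength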
---> 10 model.fit(X_train,y_train)
11 model

File /opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1204, in LogisticRegression.fit(self, X, y, sample_weight)
1194 _dtype =
1196 X, y = self._validate_data(
1197 X,
1198 y,
(...)
1202 accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
1203 )
-> 1204 check_classification_targets(y)
1205 self.classes_ = np.unique(y)
1207 multi_class = _check_multi_class(self.multi_class, solver, len(self.classes_))

File /opt/conda/lib/python3.10/site-packages/sklearn/utils/multiclass.py:218, in check_classification_targets(y)
210 y_type = type_of_target(y, input_name="y")
211 if y_type not in [
212 "binary",
213 "multiclass",
(...)
216 "multilabel-sequences",
217 ]:
--> 218 raise ValueError("Unknown label type: %r" % y_type)

ValueError: Unknown label type: 'unknown'

FishC posted on 2024-10-19 10:52:04

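This error is caused by the format of `y_train`. Logistic regression expects `y_train` to be a one-dimensional array of class labels. In your code `y_train = LR_train[:,0]` is a slice of the array returned by `train.to_numpy()`. If the columns of `train` do not share a single numeric dtype (for example integers and floats mixed with booleans), that array has dtype `object`, and scikit-learn reports an object-dtype label array as type `'unknown'`. This is the likely cause here.

In a logistic regression model, `X_train` should hold the features and `y_train` the corresponding class labels, each with a proper numeric dtype.

### Steps to fix:

1. **Separate the features and the labels**:

- `X_train` should contain only the features.
- `y_train` should contain only the target variable, the `Survived` column.

Assuming the target column is `Survived` and the remaining columns are features, build them like this:

# Separate the target variable from the DataFrame
X_train = train.drop(columns=['Survived']).to_numpy()
y_train = train['Survived'].to_numpy()

# Train the model
model = linear_model.LogisticRegression(C=1, penalty='l2', tol=1e-6)
model.fit(X_train, y_train)

### Notes:

- Make sure the `Survived` column exists in your data set and that the data set contains no unwanted columns.
- Logistic regression requires `y_train` to hold class labels; in a binary problem this is typically an array of 0s and 1s. Selecting the column by name keeps its integer dtype; if the labels still end up as `object`, cast them with `y_train.astype(int)`.

With these conditions met, the error goes away and the model trains normally.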
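
The dtype explanation can be checked with a small sketch. The frame below is hypothetical; it only assumes that the real data mixes integer, float and boolean columns, which is common after one-hot encoding. `type_of_target` is the scikit-learn utility that appears in the traceback.

```python
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical frame mixing integer, float and boolean columns
train = pd.DataFrame({"Survived": [0, 1, 1, 0],
                      "Age_scaled": [-0.5, 1.2, 0.1, -0.8],
                      "Sex_male": [True, False, False, True]})

arr = train.to_numpy()
print(arr.dtype)                      # no common numeric dtype, so the array is object

y_obj = arr[:, 0]
print(type_of_target(y_obj))          # object-dtype labels are reported as 'unknown'

y_int = y_obj.astype(int)             # cast the labels back to integers
print(type_of_target(y_int))

# Selecting the column by name keeps its original integer dtype
print(type_of_target(train["Survived"]))
```

If the first `print` shows a numeric dtype on your own data, the cause of the error lies elsewhere and the labels themselves should be inspected.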

Terence888 posted on 2024-10-19 10:58:45

# Train a logistic regression model
from sklearn import linear_model
from model.collection import train_test_split
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
X_train = train[:,1:].to_numpy()
y_train = train[:,0].to_numpy()

# Hold-out validation: split into training and validation sets

model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the regularization parameter
model.fit(X_train,y_train)
model

This raises an error:
---------------------------------------------------------------------------
ModuleNotFoundError                   Traceback (most recent call last)
Cell In, line 3
   1 # Train a logistic regression model
   2 from sklearn import linear_model
----> 3 from model.collection import train_test_split
   4 train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
   5 X_train = train[:,1:].to_numpy()

ModuleNotFoundError: No module named 'model'

FishC posted on 2024-10-19 10:58:56

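This error comes from an incorrect import. `train_test_split` is the `scikit-learn` function for splitting a data set, and it lives in the `sklearn.model_selection` module, not `model.collection`. Change the import to:

from sklearn.model_selection import train_test_split

There is also an issue with how `X_train` and `y_train` are built: `train[:,1:]` is NumPy-style indexing and does not work on a DataFrame (positional indexing needs `.iloc`). If the target variable in `train` is `Survived`, separate it from the features by name:

# Train a logistic regression model
from sklearn import linear_model
from sklearn.model_selection import train_test_split

train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')

# Separate the features and the target variable
X_train = train.drop(columns=['Survived']).to_numpy()
y_train = train['Survived'].to_numpy()

# Hold-out validation: split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

model = linear_model.LogisticRegression(C=1, penalty='l2', tol=1e-6)  # C is the regularization parameter
model.fit(X_train, y_train)

print(model)

The code above shows how to import the function correctly, split the data set, and train the logistic regression model. Make sure `train` contains the `Survived` column, and set the features and target accordingly.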
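
Once the data is split, the held-out part can be used to check how well the model generalizes. The sketch below is one possible way to do that; it uses synthetic data from `make_classification` in place of the Titanic features, and the `stratify` and `max_iter` settings are choices made for this example rather than anything from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Distinct names for the split parts avoid overwriting the full data set;
# stratify=y keeps the class proportions the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The L2 penalty is the default, so only C and tol are set explicitly
model = LogisticRegression(C=1.0, tol=1e-6, max_iter=1000)
model.fit(X_tr, y_tr)

print("train accuracy:", model.score(X_tr, y_tr))
print("validation accuracy:", model.score(X_val, y_val))
```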

Terence888 posted on 2024-10-19 11:28:02

from sklearn import linear_model
from sklearn.model_selection import train_test_split
train = data_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Sex_.*|Embarked_.*|Pclass_.*')  # regex takes a regular expression with several alternatives; '|' is the "or" operator
X = train.drop(columns='Survived')
y = train['Survived']

# Hold-out validation: split into training and validation sets
X_train,X_valid,y_train,y_valid = train_test_split(X,y,train_size=0.8,shuffle=True)

model = linear_model.LogisticRegression(C=1,penalty='l2',tol=1e-6)  # C is the regularization parameter
model.fit(X_train,y_train)
pre_valid = model.predict(X_valid)
pd.concat(,axis=1)

The error is as follows:
---------------------------------------------------------------------------
TypeError                               Traceback (most recent call last)
Cell In, line 16
14 model.fit(X_train,y_train)
15 pre_valid = model.predict(X_valid)
---> 16 pd.concat(,axis=1)
17 #acc = model.score(X_train,y_train)
18 #print(acc)

File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
379 elif copy and using_copy_on_write():
380 copy = False
--> 382 op = _Concatenator(
383 objs,
384 axis=axis,
385 ignore_index=ignore_index,
386 join=join,
387 keys=keys,
388 levels=levels,
389 names=names,
390 verify_integrity=verify_integrity,
391 copy=copy,
392 sort=sort,
393 )
395 return op.get_result()

File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py:448, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
445 objs, keys = self._clean_keys_and_objs(objs, keys)
447 # figure out what our result ndim is going to be
--> 448 ndims = self._get_ndims(objs)
449 sample, objs = self._get_sample_object(objs, ndims, keys, names, levels)
451 # Standardize axis parameter to int

File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py:489, in _Concatenator._get_ndims(self, objs)
484 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
485       msg = (
486          f"cannot concatenate object of type '{type(obj)}'; "
487          "only Series and DataFrame objs are valid"
488       )
--> 489       raise TypeError(msg)
491 ndims.add(obj.ndim)
492 return ndims

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

FishC posted on 2024-10-19 11:36:54

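In this code, `pd.concat()` raises `TypeError` because one of the objects passed to it is a NumPy array: `model.predict()` returns a `numpy.ndarray`, while `pd.concat()` needs an iterable, such as a list or tuple, whose elements are all `DataFrame` or `Series` objects.

To fix it, pass pandas objects, or build the comparison table directly. Presumably you want to compare the model's predictions with the actual labels, which can be done as follows:

import pandas as pd

# Predict on X_valid with the model
pre_valid = model.predict(X_valid)

# Combine the predictions with the actual labels to inspect the results
result_df = pd.DataFrame({"Actual": y_valid.to_numpy().flatten(), "Predicted": pre_valid.flatten()})

# Show the combined DataFrame
print(result_df)

In this example:

1. **Create a DataFrame**: `y_valid` and `pre_valid` are put into a dictionary, where the `Actual` column holds the true labels and the `Predicted` column holds the labels predicted by the model.

2. **Use `print(result_df)`**: print the table to inspect it and check the model's predictions.

This makes it easier to compare predictions with actual values. Make sure `y_valid` and `pre_valid` have the same length so they can be compared row by row.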
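
If you would rather keep `pd.concat`, the predictions can be wrapped in a `Series` first. The sketch below uses made-up labels and predictions; the important assumption is that the new `Series` reuses the index of `y_valid`, since `train_test_split` leaves a shuffled index and `pd.concat` aligns rows by index.

```python
import numpy as np
import pandas as pd

# Hypothetical validation labels with a shuffled index, as train_test_split produces
y_valid = pd.Series([0, 1, 1, 0], index=[17, 3, 42, 8], name="Survived")
# Stand-in for model.predict(X_valid), which returns a plain NumPy array
pre_valid = np.array([0, 1, 0, 0])

# Wrap the array in a Series that reuses y_valid's index so rows line up
pred = pd.Series(pre_valid, index=y_valid.index, name="Predicted")
result = pd.concat([y_valid, pred], axis=1)
print(result)

# Share of rows where the prediction matches the label
print((result["Survived"] == result["Predicted"]).mean())
```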

Terence888 posted on 2024-10-19 12:02:20

#Find relevant features
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X_train, y_train)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.cv_results_) + 1), rfecv.cv_results_)
plt.show()

I evaluated the training-set results with the CV scores and got the following error:
Optimal number of features : 8
---------------------------------------------------------------------------
TypeError                               Traceback (most recent call last)
Cell In, line 30
28 plt.xlabel("Number of features selected")
29 plt.ylabel("Cross validation score (nb of correct classifications)")
---> 30 plt.plot(range(1, len(rfecv.cv_results_) + 1), rfecv.cv_results_)
31 plt.show()

File /opt/conda/lib/python3.10/site-packages/matplotlib/pyplot.py:2812, in plot(scalex, scaley, data, *args, **kwargs)
2810 @_copy_docstring_and_deprecators(Axes.plot)
2811 def plot(*args, scalex=True, scaley=True, data=None, **kwargs):
-> 2812 return gca().plot(
2813       *args, scalex=scalex, scaley=scaley,
2814       **({"data": data} if data is not None else {}), **kwargs)

File /opt/conda/lib/python3.10/site-packages/matplotlib/axes/_axes.py:1688, in Axes.plot(self, scalex, scaley, data, *args, **kwargs)
1445 """
1446 Plot y versus x as lines and/or markers.
1447
(...)
1685 (``'green'``) or hex strings (``'#008000'``).
1686 """
1687 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1688 lines = [*self._get_lines(*args, data=data, **kwargs)]
1689 for line in lines:
1690 self.add_line(line)

File /opt/conda/lib/python3.10/site-packages/matplotlib/axes/_base.py:311, in _process_plot_var_args.__call__(self, data, *args, **kwargs)
309 this += args,
310 args = args
--> 311 yield from self._plot_args(
312 this, kwargs, ambiguous_fmt_datakey=ambiguous_fmt_datakey)

File /opt/conda/lib/python3.10/site-packages/matplotlib/axes/_base.py:501, in _process_plot_var_args._plot_args(self, tup, kwargs, return_kwargs, ambiguous_fmt_datakey)
499 self.axes.xaxis.update_units(x)
500 if self.axes.yaxis is not None:
--> 501 self.axes.yaxis.update_units(y)
503 if x.shape != y.shape:
504 raise ValueError(f"x and y must have same first dimension, but "
505                   f"have shapes {x.shape} and {y.shape}")

File /opt/conda/lib/python3.10/site-packages/matplotlib/axis.py:1713, in Axis.update_units(self, data)
1711 neednew = self.converter != converter
1712 self.converter = converter
-> 1713 default = self.converter.default_units(data, self)
1714 if default is not None and self.units is None:
1715 self.set_units(default)

File /opt/conda/lib/python3.10/site-packages/matplotlib/category.py:105, in StrCategoryConverter.default_units(data, axis)
103 # the conversion call stack is default_units -> axis_info -> convert
104 if axis.units is None:
--> 105 axis.set_units(UnitData(data))
106 else:
107 axis.units.update(data)

File /opt/conda/lib/python3.10/site-packages/matplotlib/category.py:181, in UnitData.__init__(self, data)
179 self._counter = itertools.count()
180 if data is not None:
--> 181 self.update(data)

File /opt/conda/lib/python3.10/site-packages/matplotlib/category.py:214, in UnitData.update(self, data)
212 # check if convertible to number:
213 convertible = True
--> 214 for val in OrderedDict.fromkeys(data):
215 # OrderedDict just iterates over unique values in data.
216 _api.check_isinstance((str, bytes), value=val)
217 if convertible:
218       # this will only be called so long as convertible is True.

TypeError: unhashable type: 'dict'

FishC posted on 2024-10-19 12:02:28

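In your code, `rfecv.cv_results_` is a dictionary, not something that `plt.plot()` can plot directly. You need to extract the specific values you want to plot from that dictionary.

Older scikit-learn versions stored the cross-validation scores of `RFECV` in a `grid_scores_` attribute, but it was deprecated in version 1.0 and removed in 1.2. In current versions the mean score for each number of features is in `cv_results_["mean_test_score"]`. So you can change your code to:

import matplotlib.pyplot as plt

# Plot number of features VS. cross-validation scores
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()

### Explanation:

- `rfecv.cv_results_["mean_test_score"]`: an array with the mean cross-validation score for each number of features; it can be passed directly to `plt.plot()` as the `y` values.
- `range(1, len(mean_scores) + 1)`: generates the numbers used for the x-axis.

`plt.plot()` then shows the cross-validation score of the model for each number of features. Make sure `matplotlib.pyplot` is imported before running the code above.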

Terence888 posted on 2024-10-19 14:31:03

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_tr_arr, y_tr_arr)
C:\Users\saish\Anaconda2\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py:547: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
      intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
      penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
      verbose=0, warm_start=False)
print (clf.intercept_, clf.coef_)
[-0.32958987] [[-0.65118738 1.21001434 -1.38924001 -1.46364162]]
pred = clf.predict(X_ts_arr)
print ('Accuracy from sk-learn: {0}'.format(clf.score(X_ts_arr, y_ts_arr)))
Accuracy from sk-learn: 1.0
How do I use the score function?

FishC posted on 2024-10-19 14:31:12

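In `scikit-learn`, the `score` method of `LogisticRegression` evaluates the accuracy of the model on the given data. It is used like this:

accuracy = clf.score(X_test, y_test)

### Parameters:
- `X_test`: the features of the data set to evaluate.
- `y_test`: the true labels of the data set to evaluate.

### Return value:
- `score` returns the mean accuracy of the model on `X_test`, a value between 0 and 1. 1 means every test sample was classified correctly; 0 means none was.

### Handling the DataConversionWarning:
Your output also contains a `DataConversionWarning` saying that the shape of `y` should be changed. This usually happens when `y` is passed as a two-dimensional column vector while `LogisticRegression` expects a one-dimensional array. Use `.ravel()` to convert it:

y_tr_arr = y_tr_arr.ravel()  # flatten y_tr_arr to a one-dimensional array

With the labels in the right shape, and the features and labels matching in length, the model can be fitted without the warning and `score` returns the accuracy.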
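
To show what `score` computes, the sketch below compares it with a manual calculation and with `accuracy_score`. The data is synthetic, and the column-vector `y_col` is created only to mimic the shape that triggers the warning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data; the column-vector y mimics the shape that triggered the warning
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
y_col = y.reshape(-1, 1)
print(y_col.shape, y_col.ravel().shape)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y_col.ravel())   # ravel() gives the 1-D shape that fit() expects

# score(X, y) is the fraction of samples where predict(X) equals y
manual = np.mean(clf.predict(X) == y)
print(clf.score(X, y), manual, accuracy_score(y, clf.predict(X)))
```

All three printed numbers are the same quantity. Scoring on the data the model was trained on, as done here for brevity, tends to overstate accuracy, so in practice `score` is usually called on held-out data.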
