Why does the number of columns differ after one-hot encoding?

Terence888 · posted 2024-10-30 16:38:44


# One-hot encode the non-numerical columns of the combined dataset
#nonnumerical = full_dataset.select_dtypes(include='object').columns
pd.get_dummies(full_dataset,dummy_na=True)
#df_nonnumerical


# Standardize the numerical features (subtract the mean, divide by the standard deviation)
numeric_cols = full_dataset.select_dtypes(np.number).columns # Numerical column name
numeric_cols = numeric_cols.drop(['Id',"SalePrice"], errors='ignore') # Drop label
df_numerical = full_dataset[numeric_cols].apply(lambda x: (x - x.mean()) / (x.std())) # Normalize every numerical column
df_numerical = df_numerical.fillna(0) # Fill in missing value with mean 0
df_numerical.shape


full_df = pd.concat([df_nonnumerical,df_numerical],axis=1)
full_df.shape


This gives me 330 columns.

But someone else's code, shown here, gives 331 columns:

# Preprocessing function
def preprocess(df):
    df = df.copy() # Make copy so we don't ruin the original dataframe
    df = df.drop(columns=["Id"], errors='ignore') # Drop Id column
    numeric_cols = df.select_dtypes(np.number).columns # Numerical column name
    numeric_cols = numeric_cols.drop("SalePrice", errors='ignore') # Drop label
    categorical_cols = df.select_dtypes(include=['object']).columns # Categorical column name
    df_categorical = pd.get_dummies(df[categorical_cols], dummy_na=True) # OneHotEncoding, dummy_na create another column to indicate if the cell contain NA
    df_numerical = df[numeric_cols].apply(lambda x: (x - x.mean()) / (x.std())) # Normalize every numerical column
    df_numerical = df_numerical.fillna(0) # Fill in missing value with mean 0
    df = pd.concat([df_numerical, df_categorical], axis=1) # Bind numerical column and categorical column
    return df


FishC · posted 2024-10-30 16:39:00

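There are several possible reasons for the column counts to differ after one-hot encoding:

1. **Missing-value handling**: `dummy_na=True` adds one extra column per encoded categorical variable to flag missing values. If missing values were handled differently before encoding (for example, filled in one pipeline but not in the other), the column counts can differ.

2. **Dataset differences**: if the dataset was processed differently at some step (for example, certain columns or rows were dropped), the final column count can differ as well.

3. **Differences in the data before encoding**: any preprocessing applied before one-hot encoding that changes the set of categories (for example, removing or merging categories) also changes the number of columns produced.

4. **Mistakes in the code**: check for omissions or errors, such as `pd.concat()` not being given the DataFrames you intended to combine.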
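
To make points 1 and 2 concrete, here is a minimal sketch on a made-up toy frame (the data and the names `toy`, `whole`, `only_cat`, and `no_na` are illustrative assumptions, not taken from the question's dataset). It shows that `pd.get_dummies` applied to a whole DataFrame passes numeric columns through unchanged, and that `dummy_na=True` adds a NaN indicator column for each encoded column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric column and two categorical columns.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),   # no missing values
    "Alley": pd.Series(["Grvl", np.nan, np.nan], dtype=object),    # has missing values
})

# Encoding the whole frame: the numeric "Id" column passes through unchanged.
whole = pd.get_dummies(toy, dummy_na=True)

# Encoding only the object columns: numeric columns are left out entirely.
cat_part = toy.select_dtypes(include="object")
only_cat = pd.get_dummies(cat_part, dummy_na=True)

# Without dummy_na, no NaN indicator columns are created.
no_na = pd.get_dummies(cat_part, dummy_na=False)

print(list(whole.columns))
print(list(only_cat.columns))
print(list(no_na.columns))
```

Comparing the three printed column lists shows where extra columns come from: a numeric column that was or was not passed to `get_dummies`, and the `_nan` indicator columns.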

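To track this down, check the shape of the data after every step and make sure both pipelines process the data in the same way. Pay particular attention to how the shape changes before and after one-hot encoding.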
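
One way to do that check is to compare the column names of the two results directly rather than only their counts. The helper below is a hypothetical sketch (`diff_columns` is not a pandas function, and the two small frames are made-up stand-ins for the two pipelines' outputs).

```python
import pandas as pd

def diff_columns(mine: pd.DataFrame, theirs: pd.DataFrame) -> dict:
    """Report column names that appear in only one frame, plus duplicates in the first."""
    mine_cols, their_cols = set(mine.columns), set(theirs.columns)
    duplicated = set(mine.columns[mine.columns.duplicated()])
    return {
        "only_in_mine": sorted(mine_cols - their_cols, key=str),
        "only_in_theirs": sorted(their_cols - mine_cols, key=str),
        "duplicated_in_mine": sorted(duplicated, key=str),
    }

# Made-up stand-ins for the two pipelines' outputs (empty frames, columns only).
a = pd.DataFrame(columns=["LotArea", "Street_Pave", "Street_nan"])
b = pd.DataFrame(columns=["LotArea", "Street_Pave", "Street_nan", "Alley_nan"])

report = diff_columns(a, b)
print(report)
```

With the variables from the question, something like `diff_columns(full_df, preprocess(full_dataset))` should name the column that accounts for the 330 versus 331 difference, assuming both results are built from the same `full_dataset`.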

This reply was generated by the FishC AI assistant. If it does not answer your question, please ask a follow-up.
