Error when training a BERT model

Posted on 2024-11-2 23:24:52

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[50], line 4
      1 print('Training Start!')
      2 print('=' * 100)
----> 4 train(model,
      5     device,
      6     train_dataloader,
      7     valid_dataloader,
      8     CFG.epochs,
      9     loss_fn,
     10     optimizer,
     11     metric)
     13 del model,train_dataloader, valid_dataloader
     14 gc.collect()

Cell In[49], line 17, in train(model, device, train_dataloader, valid_dataloader, epochs, loss_fn, optimizer, metric)
     14 train_step = 0
     15 pbar = tqdm(train_dataloader)#tqdm参数是一个iterable
---> 17 for batch in pbar: # you can also write like "for batch in tqdm(train_dataloader"
     18     optimizer.zero_grad() # initialize
     19     train_step += 1

File /opt/conda/lib/python3.10/site-packages/tqdm/notebook.py:250, in tqdm_notebook.__iter__(self)
    248 try:
    249     it = super().__iter__()
--> 250     for obj in it:
    251         # return super(tqdm...) will not catch exception
    252         yield obj
    253 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File /opt/conda/lib/python3.10/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
   1178 time = self._time
   1180 try:
-> 1181     for obj in iterable:
   1182         yield obj
   1183         # Update and possibly print the progressbar.
   1184         # Note: does not call self.update(1) for speed optimisation.

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
    627 if self._sampler_iter is None:
    628     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    629     self._reset()  # type: ignore[call-arg]
--> 630 data = self._next_data()
    631 self._num_yielded += 1
    632 if self._dataset_kind == _DatasetKind.Iterable and \
    633         self._IterableDataset_len_called is not None and \
    634         self._num_yielded > self._IterableDataset_len_called:

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1344, in _MultiProcessingDataLoaderIter._next_data(self)
   1342 else:
   1343     del self._task_info[idx]
-> 1344     return self._process_data(data)

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1370, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1368 self._try_put_index()
   1369 if isinstance(data, ExceptionWrapper):
-> 1370     data.reraise()
   1371 return data

File /opt/conda/lib/python3.10/site-packages/torch/_utils.py:706, in ExceptionWrapper.reraise(self)
    702 except TypeError:
    703     # If the exception takes multiple arguments, don't try to
    704     # instantiate since we don't know how to
    705     raise RuntimeError(msg) from None
--> 706 raise exception

AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/tmp/ipykernel_31/3318300818.py", line 15, in __getitem__
    text = self.df.loc[idx]['text'] # extracting text from each row
AttributeError: 'list' object has no attribute 'loc'

The error is shown above; the code is as follows:
print('Training Start!')
print('=' * 100)

train(model,
    device,
    train_dataloader,
    valid_dataloader,
    CFG.epochs,
    loss_fn,
    optimizer,
    metric)

del model,train_dataloader, valid_dataloader
gc.collect()

Reply posted on 2024-11-2 23:25:01
Based on the error message, the problem is in the `__getitem__` method. Specifically, `self.df` is a plain `list` rather than an object that supports `.loc` (such as a `pandas.DataFrame`).

To fix this, check the implementation of `__getitem__` in your dataset class and make sure `self.df` is a `pandas.DataFrame`. A possible corrected sketch:
import pandas as pd
from torch.utils.data import Dataset

class YourDatasetClass(Dataset):
    def __init__(self, dataframe):
        self.df = dataframe

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # make sure self.df is a DataFrame before using .loc
        if isinstance(self.df, pd.DataFrame):
            text = self.df.loc[idx, 'text_column_name']  # replace 'text_column_name' with your actual column name
            # further processing ...
            return text
        else:
            raise TypeError("self.df should be a pandas DataFrame")

Make sure that what you pass in when constructing the dataset class is a `pandas.DataFrame`, not a `list`. If `self.df` comes from somewhere else, check that the object being passed in has the right type.
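
As a quick sanity check before constructing the dataset, something along these lines can catch the wrong type early (a minimal sketch; `texts` and `labels` are hypothetical stand-ins for your own preprocessed data):

import pandas as pd

# Hypothetical raw data; in the real notebook this comes from the preprocessing step.
texts = ['first example', 'second example']
labels = [0, 1]

# If the data currently lives in plain lists, wrap it in a DataFrame first.
df = pd.DataFrame({'text': texts, 'label': labels})

assert isinstance(df, pd.DataFrame), f'expected a pandas DataFrame, got {type(df)}'

# dataset = YourDatasetClass(df)  # .loc now works inside __getitem__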

The reply above is from the FishC AI assistant; if it does not fully answer your question, please follow up.

OP | Posted on 2024-11-3 10:33:10
FishC replied on 2024-11-2 23:25:
Based on the error message, the problem is in the `__getitem__` method. Specifically, `self.df` is being treated as a `list` rather than a ...
Training Start!
====================================================================================================
/tmp/ipykernel_31/152112840.py:32: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  'token_type_ids':torch.tensor(encoded_dict['token_type_ids'],dtype=torch.long).unsqueeze(dim=0)}
(the same UserWarning is repeated several more times, once per DataLoader worker/batch)
  0%
 0/229 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[58], line 4
      1 print('Training Start!')
      2 print('=' * 100)
----> 4 train(model,
      5     device,
      6     train_dataloader,
      7     valid_dataloader,
      8     CFG.epochs,
      9     loss_fn,
     10     optimizer,
     11     metric)
     13 del model,train_dataloader, valid_dataloader
     14 gc.collect()

Cell In[39], line 17, in train(model, device, train_dataloader, valid_dataloader, epochs, loss_fn, optimizer, metric)
     14 train_step = 0
     15 pbar = tqdm(train_dataloader)#tqdm参数是一个iterable
---> 17 for batch in pbar: # you can also write like "for batch in tqdm(train_dataloader"
     18     optimizer.zero_grad() # initialize
     19     train_step += 1

File /opt/conda/lib/python3.10/site-packages/tqdm/notebook.py:250, in tqdm_notebook.__iter__(self)
    248 try:
    249     it = super().__iter__()
--> 250     for obj in it:
    251         # return super(tqdm...) will not catch exception
    252         yield obj
    253 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File /opt/conda/lib/python3.10/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
   1178 time = self._time
   1180 try:
-> 1181     for obj in iterable:
   1182         yield obj
   1183         # Update and possibly print the progressbar.
   1184         # Note: does not call self.update(1) for speed optimisation.

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
    627 if self._sampler_iter is None:
    628     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    629     self._reset()  # type: ignore[call-arg]
--> 630 data = self._next_data()
    631 self._num_yielded += 1
    632 if self._dataset_kind == _DatasetKind.Iterable and \
    633         self._IterableDataset_len_called is not None and \
    634         self._num_yielded > self._IterableDataset_len_called:

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1344, in _MultiProcessingDataLoaderIter._next_data(self)
   1342 else:
   1343     del self._task_info[idx]
-> 1344     return self._process_data(data)

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1370, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1368 self._try_put_index()
   1369 if isinstance(data, ExceptionWrapper):
-> 1370     data.reraise()
   1371 return data

File /opt/conda/lib/python3.10/site-packages/torch/_utils.py:706, in ExceptionWrapper.reraise(self)
    702 except TypeError:
    703     # If the exception takes multiple arguments, don't try to
    704     # instantiate since we don't know how to
    705     raise RuntimeError(msg) from None
--> 706 raise exception

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2606, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2630, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1287

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/tmp/ipykernel_31/152112840.py", line 15, in __getitem__
    text = self.df.loc[idx]['cleaned'] # extracting text from each row
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/indexing.py", line 1431, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/indexing.py", line 1381, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py", line 4301, in xs
    loc = index.get_loc(key)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 1287

The error is shown above.

Reply posted on 2024-11-3 10:33:19
Terence888 posted on 2024-11-03 10:33:
Training Start!
================================= ...


Based on the error message, `KeyError: 1287` means that when `__getitem__` looked up index 1287 with `.loc`, that label does not exist in the DataFrame's index. This usually happens when the DataFrame's index no longer matches the 0..len-1 positions the DataLoader samples from, for example after a train/validation split that keeps the original row labels.

Here are some possible fixes:

1. **Check the dataset size**: make sure the dataset is as large as you expect. You can confirm this by printing its length:

print(len(self.df))

2. **Check the index range**: in `__getitem__`, make sure the index `idx` is within the bounds of the dataset. You can add a check inside the method:

def __getitem__(self, idx):
    if idx >= len(self.df):
        raise IndexError("Index out of range")
    # continue processing

3. **Dataset splitting**: if the DataFrame was split (e.g. into train and validation sets), each piece keeps its original row labels, so `.loc[idx]` with a positional `idx` can fail even though every split has data. Check the splitting code, make sure no split is empty, and reset the index after splitting (see the sketch after this list).

4. **DataLoader**: check the `batch_size` and `shuffle` arguments of the `DataLoader` and make sure they are compatible with the size of the dataset.

5. **Debug output**: add a debug print inside `__getitem__` to help locate the problem:

def __getitem__(self, idx):
    print(f"Accessing index: {idx}")
    # continue processing

These steps should help you locate and fix the problem. If it persists, check the rest of the data pipeline to make sure the data is complete and consistent.
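
As a concrete illustration of point 3, here is a minimal sketch of why the `KeyError` appears and two ways to avoid it (it assumes the split is done with scikit-learn's `train_test_split`; the `cleaned` column name comes from the dataset code later in this thread):

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy DataFrame standing in for the real training data.
df = pd.DataFrame({'cleaned': [f'text {i}' for i in range(10)], 'label': list(range(10))})

train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)

# After the split the row labels are shuffled, e.g. [5, 0, 7, ...],
# so self.df.loc[idx] with idx = 0..len(self.df)-1 can raise KeyError.
print(train_df.index.tolist())

# Fix 1: reset the index so the labels run 0..len-1 again.
train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)
print(train_df.loc[0, 'cleaned'])  # now safe

# Fix 2 (alternative): index by position inside __getitem__:
#     text = self.df.iloc[idx]['cleaned']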

The reply above is from the FishC AI assistant; if it does not fully answer your question, please follow up.

OP | Posted on 2024-11-3 10:38:44
FishC replied on 2024-11-3 10:33:
Based on the error message, `KeyError: 1287` means that when `__getitem__` looked up index 1287 with `.loc`, that label does not exist ...
from torch.utils.data import Dataset
import torch

# define the dataset
class LLMDataset(Dataset):
    def __init__(self,df,is_grad,tokenizer):
        self.df = df # Pandas.DataFrame
        self.is_grad = is_grad # True: train,valid / False: test
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df) # number of samples

    def __getitem__(self,idx):
        text = self.df.loc[idx]['cleaned'] # extracting text from each row

        encoded_dict = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,  # automatically add special tokens (CLS and SEP) around each text
            padding='max_length',  # pad with zeros up to max_length
            truncation=True,  # truncate sentences longer than max_length
            max_length=84, # given to the max_length of tokenized text
            return_tensors='pt', # PyTorch
            return_attention_mask=True, # passed to the model so attention ignores the padding positions
        )

        if self.is_grad:  # training / validation set
            
            # [batch,1,max_len(84)] -> [batch,max_len]  # squeeze removes the extra dimension
            return {'input_ids':encoded_dict['input_ids'].squeeze(),
                    'attention_mask':encoded_dict['attention_mask'].squeeze(),
                    'token_type_ids':torch.tensor(encoded_dict['token_type_ids'],dtype=torch.long).unsqueeze(dim=0)}
        else:  # test set
            # [batch,1,max_len(84)] -> [batch,max_len]
            return {'input_ids':encoded_dict['input_ids'].squeeze(),
                    'attention_mask':encoded_dict['attention_mask'].squeeze()}

What is wrong with the dataset-handling code above?

Reply posted on 2024-11-3 10:54:58
Terence888 posted on 2024-11-03 10:38:
from torch.utils.data import Dataset
import torch ...


There are a few issues to note in the code you posted:

1. **Use of `self.df.loc`**:
   - In `__getitem__`, `text = self.df.loc[idx]['cleaned']` looks the row up by *label*, not by position. If the DataFrame was split without resetting its index, label `idx` may simply not exist, which is exactly the `KeyError: 1287` above. Either reset the index (`df.reset_index(drop=True)`) and use `self.df.loc[idx, 'cleaned']`, or index positionally with `self.df.iloc[idx]['cleaned']`.

2. **Inconsistent shapes in the returned dict**:
   - `encoded_dict` is a dictionary with several keys (`input_ids`, `attention_mask`, `token_type_ids`). Each key must be handled individually, and all of them should be squeezed the same way: in your code `input_ids` and `attention_mask` are `.squeeze()`d to shape `(max_length,)`, but `token_type_ids` gets `.unsqueeze(dim=0)`, which leaves it with an extra dimension and breaks batch collation.

3. **Re-wrapping a tensor with `torch.tensor`**:
   - With `return_tensors='pt'`, `encoded_dict['token_type_ids']` is already a tensor, so wrapping it again with `torch.tensor(...)` copy-constructs it and triggers the UserWarning shown in your log. Use `encoded_dict['token_type_ids'].squeeze()` directly.

Here is a corrected version of the dataset class:
from torch.utils.data import Dataset
import torch

class LLMDataset(Dataset):
    def __init__(self, df, is_grad, tokenizer):
        self.df = df.reset_index(drop=True)  # reset the index so .loc[idx] matches positions 0..len-1 (avoids the KeyError above)
        self.is_grad = is_grad
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # 'cleaned' is the text column used in your dataset code above
        text = self.df.loc[idx, 'cleaned']

        encoded_dict = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            max_length=84,
            return_tensors='pt',
            return_attention_mask=True,
        )

        if self.is_grad:
            return {
                'input_ids': encoded_dict['input_ids'].squeeze(),
                'attention_mask': encoded_dict['attention_mask'].squeeze(),
                'token_type_ids': encoded_dict['token_type_ids'].squeeze()  # already a tensor; no need to re-wrap with torch.tensor
            }
        else:
            return {
                'input_ids': encoded_dict['input_ids'].squeeze(),
                'attention_mask': encoded_dict['attention_mask'].squeeze()
            }

Make sure `__getitem__` accesses the DataFrame column correctly and handles each key of `encoded_dict` individually.
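
A quick way to sanity-check the dataset before starting training, as a minimal sketch (it assumes `train_df` has already had `reset_index(drop=True)` applied and that `tokenizer` is the BERT tokenizer used elsewhere in the notebook; the `batch_size` here is only an example):

from torch.utils.data import DataLoader

train_dataset = LLMDataset(train_df, is_grad=True, tokenizer=tokenizer)

# Inspect a single item: every tensor should be 1-D of length max_length (84).
sample = train_dataset[0]
print({k: v.shape for k, v in sample.items()})
# expected: {'input_ids': torch.Size([84]), 'attention_mask': torch.Size([84]), 'token_type_ids': torch.Size([84])}

# The default collate_fn can then stack items into (batch_size, max_length) batches.
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2)
batch = next(iter(train_dataloader))
print(batch['input_ids'].shape)  # torch.Size([32, 84])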

The reply above is from the FishC AI assistant; if it does not fully answer your question, please follow up.

OP | Posted on 2024-11-3 11:54:19
FishC replied on 2024-11-3 10:54:
There are a few issues to note in the code you posted:

1. **Use of `self.df.loc`**: ...
'input_ids': encoded_dict['input_ids'].squeeze(),
                'attention_mask': encoded_dict['attention_mask'].squeeze(),
                'token_type_ids': torch.tensor(encoded_dict['token_type_ids'], dtype=torch.long).squeeze()

What are the shapes of `input_ids`, `attention_mask`, and `token_type_ids`?

Reply posted on 2024-11-3 11:54:28
Terence888 posted on 2024-11-03 11:54:
'input_ids': encoded_dict['input_ids'].squeeze(),
             ...


In the context of the BERT model and the `transformers` library, the tensors returned by `encode_plus` with `return_tensors='pt'` have the following shapes:

1. **`input_ids`**: the token IDs of the input text. Inside `__getitem__` its shape is `(1, max_length)` (here `max_length=84`); after `.squeeze()` it becomes `(max_length,)`.

2. **`attention_mask`**: marks which positions are real tokens and which are padding, so the model can ignore the padding. Same shape as `input_ids`.

3. **`token_type_ids`**: distinguishes the two segments in sentence-pair tasks (all zeros for a single sentence). Same shape as `input_ids`.

Once the DataLoader collates a batch, the squeezed `(max_length,)` tensors are stacked into `(batch_size, max_length)`, which is what the model receives.

In your code, the `encoded_dict` returned by `tokenizer.encode_plus` is a dictionary containing these keys; access them individually and squeeze them consistently (note that `token_type_ids` currently gets `unsqueeze(dim=0)` while the other two get `squeeze()`). Here's how the returned dictionary might look:
{
    'input_ids': encoded_dict['input_ids'].squeeze(),
    'attention_mask': encoded_dict['attention_mask'].squeeze(),
    'token_type_ids': encoded_dict['token_type_ids'].squeeze()
}

This assumes that `encoded_dict` contains these keys, which is typical when using `encode_plus` with `return_tensors='pt'`.
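
A minimal, self-contained sketch to see these shapes directly (it assumes the `bert-base-uncased` tokenizer purely for illustration; any BERT tokenizer behaves the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

encoded_dict = tokenizer.encode_plus(
    'a short example sentence',
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=84,
    return_tensors='pt',
    return_attention_mask=True,
)

print(encoded_dict['input_ids'].shape)            # torch.Size([1, 84])
print(encoded_dict['attention_mask'].shape)       # torch.Size([1, 84])
print(encoded_dict['token_type_ids'].shape)       # torch.Size([1, 84])
print(encoded_dict['input_ids'].squeeze().shape)  # torch.Size([84]) -- what __getitem__ should return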

The reply above is from the FishC AI assistant; if it does not fully answer your question, please follow up.
