About PyTorch's DistributedDataParallel

一只魈咸鱼 · Posted on 2022-2-12 20:23:37

How does this class manage the data? Say I have a training set of 8000 samples and 4 GPUs: does each GPU get 2000 of them?

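Also, how does the training process work? For example, if I set the number of epochs to 500, does each GPU train 500 times on its own 2000 samples, or does it work some other way?
I put print(f'{epoch}th train') at the start of every epoch, and this is what gets printed: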
0th train
0th train
0th train
0th train
1th train
1th train
1th train
1th train
.......

Charles未晞 · Posted on 2022-2-13 13:44:37

You can refer to a project I open-sourced earlier: https://github.com/SegmentationBLWX/sssegmentation

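In distributed training, the dataloader is what manages the data. The difference from ordinary training is in the parameter update: the gradients from all of the model replicas are averaged before the parameters are updated.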
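
A small sketch of both points, runnable on CPU in a single process. It assumes the dataloader is fed by a DistributedSampler (the usual pairing with DDP) and builds one sampler per rank by passing num_replicas and rank explicitly, so no process group is needed. The gradient averaging is simulated by hand here rather than going through DDP's real all-reduce. The toy nn.Linear model, the random data, and names such as grad_of are illustrative assumptions and are not taken from the linked project.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

WORLD_SIZE = 4  # the 4 GPUs (one process each) from the question

# --- Part 1: how 8000 samples are split across 4 ranks -------------------
dataset = TensorDataset(torch.arange(8000, dtype=torch.float32).unsqueeze(1))

shards = []
for rank in range(WORLD_SIZE):
    # Each process would normally build only its own sampler; here we build
    # all four in one process just to inspect them.
    sampler = DistributedSampler(
        dataset, num_replicas=WORLD_SIZE, rank=rank, shuffle=True, seed=0
    )
    sampler.set_epoch(0)  # same seed + same epoch -> same permutation on every rank
    shards.append(list(sampler))

shard_sizes = [len(s) for s in shards]  # samples each rank sees per epoch
covered = set().union(*shards)          # indices seen by all ranks together

print("samples per rank:", shard_sizes)
print("distinct samples covered in one epoch:", len(covered))

# --- Part 2: averaging per-rank gradients (hand-made simulation) ---------
torch.manual_seed(0)
x, y = torch.randn(8, 3), torch.randn(8, 1)
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()  # mean reduction


def grad_of(inputs, targets):
    """Gradient of the loss w.r.t. the weight for one batch."""
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    return model.weight.grad.clone()


full_grad = grad_of(x, y)  # one process seeing all 8 samples
per_rank = [grad_of(x[r::WORLD_SIZE], y[r::WORLD_SIZE]) for r in range(WORLD_SIZE)]
averaged = torch.stack(per_rank).mean(dim=0)  # what the all-reduce would produce

grads_match = torch.allclose(full_grad, averaged, atol=1e-5)
print("averaged shard gradients equal the full-batch gradient:", grads_match)
```

With equally sized shards, each epoch is one pass over the whole dataset split four ways, so 500 epochs means every rank walks through its own 2000-sample shard 500 times (a differently shuffled shard each time if set_epoch is called), not 500 independent trainings.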

一只魈咸鱼 · Posted on 2022-2-14 08:38:13

Charles未晞 posted on 2022-2-13 13:44
You can refer to a project I open-sourced earlier: https://github.com/SegmentationBLWX/sssegmentation

distributed training ...

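I tried adding a print() at the very start of the program and found that every process prints it, so it looks as if each GPU runs the entire program. In that case I seem to be creating a DistributedDataParallel on every GPU, but according to the DDP description it gets a replica of the model. Doesn't that mean every GPU makes a copy? Then which one is the original... And what about DistributedSampler: does every GPU carry out the dataset partitioning? Isn't that strange? Is passing gradients the only communication between the GPUs?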
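
The observation matches how DDP is designed: one process per GPU, each running the whole script. A minimal sketch of such a script follows, assuming CPU tensors and the gloo backend so that it also runs as a single process without GPUs. The environment-variable defaults, the port number, the toy model, and the random data are illustrative assumptions. The comments mark where the processes actually communicate.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main(num_epochs=2):
    # A launcher such as torchrun sets these variables for every process;
    # the defaults (an assumption for this sketch) allow a plain single-process run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Every process builds its own model, deliberately with a different seed.
    torch.manual_seed(rank)
    model = nn.Linear(4, 1)
    # Communication 1: the DDP constructor broadcasts rank 0's parameters, so
    # rank 0 is the "original" only at this moment; afterwards all replicas
    # are identical peers.
    ddp_model = DDP(model)

    # Every process builds the same dataset (fixed generator seed) ...
    g = torch.Generator().manual_seed(0)
    dataset = TensorDataset(
        torch.randn(8000, 4, generator=g), torch.randn(8000, 1, generator=g)
    )
    # ... and its own sampler. No communication is needed: each rank derives
    # the same permutation from the shared seed and epoch, then keeps its slice.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=100, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        if rank == 0:  # guard prints so they appear once, not once per process
            print(f"epoch {epoch}: {len(sampler)} samples on each of {world_size} rank(s)")
        for x, y in loader:
            optimizer.zero_grad()
            # Communication 2: backward() all-reduces (averages) the gradients.
            loss_fn(ddp_model(x), y).backward()
            optimizer.step()  # identical gradients -> identical update on every rank

    samples_per_rank = len(sampler)
    dist.destroy_process_group()
    return samples_per_rank


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as ddp_sketch.py, it could be launched with `torchrun --nproc_per_node=4 ddp_sketch.py`, in which case each of the four processes runs every line above and an unguarded print shows up four times, as in the output at the top of the thread. Besides the initial parameter broadcast and the gradient all-reduce, DDP by default also broadcasts module buffers (for example BatchNorm running statistics) from rank 0 before forward passes.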
