Unverified commit 4102fcc9, authored by yzchen, committed by GitHub

[WIP] Feature/ddp fixed (#401)

* Squashed commit of the following:

  commit d738487089e41c22b3b1cd73aa7c1c40320a6ebf | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 17:33:38 2020 +0700 | Adding world_size. Reduce calls to torch.distributed. For use in create_dataloader.
  commit e742dd9619d29306c7541821238d3d7cddcdc508 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 15:38:48 2020 +0800 | Make SyncBN a choice
  commit e90d4004387e6103fecad745f8cbc2edc918e906 | Merge: 5bf8beb cd90360 | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Tue Jul 14 15:32:10 2020 +0800 | Merge pull request #6 from NanoCode012/patch-5. Update train.py
  commit cd9036017e7f8bd519a8b62adab0f47ea67f4962 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 13:39:29 2020 +0700 | Update train.py. Remove redundant `opt.` prefix.
  commit 5bf8bebe8873afb18b762fe1f409aca116fac073 | Merge: c9558a9 a1c8406a | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 14:09:51 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit c9558a9b51547febb03d9c1ca42e2ef0fc15bb31 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 13:51:34 2020 +0800 | Add device allocation for loss compute
  commit 4f08c692fb5e943a89e0ee354ef6c80a50eeb28d | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 9 11:16:27 2020 +0800 | Revert drop_last
  commit 1dabe33a5a223b758cc761fc8741c6224205a34b | Merge: a1ce9b1 4b8450b | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 9 11:15:49 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit a1ce9b1e96b71d7fcb9d3e8143013eb8cebe5e27 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 9 11:15:21 2020 +0800 | fix lr warning
  commit 4b8450b46db76e5e58cd95df965d4736077cfb0e | Merge: b9a50ae 02c63ef | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Wed Jul 8 21:24:24 2020 +0800 | Merge pull request #4 from NanoCode012/patch-4. Add drop_last for multi gpu
  commit 02c63ef81cf98b28b10344fe2cce08a03b143941 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Wed Jul 8 10:08:30 2020 +0700 | Add drop_last for multi gpu
  commit b9a50aed48ab1536f94d49269977e2accd67748f | Merge: ec2dc6c 121d90b3 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 7 19:48:04 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit ec2dc6cc56de43ddff939e14c450672d0fbf9b3d | Merge: d0326e3 82a6182 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 7 19:34:31 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit d0326e398dfeeeac611ccc64198d4fe91b7aa969 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 7 19:31:24 2020 +0800 | Add SyncBN
  commit 82a6182b3ad0689a4432b631b438004e5acb3b74 | Merge: 96fa40a 050b2a5 | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Tue Jul 7 19:21:01 2020 +0800 | Merge pull request #1 from NanoCode012/patch-2. Convert BatchNorm to SyncBatchNorm
  commit 050b2a5a79a89c9405854d439a1f70f892139b1c | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 7 12:38:14 2020 +0700 | Add cleanup for process_group
  commit 2aa330139f3cc1237aeb3132245ed7e5d6da1683 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 7 12:07:40 2020 +0700 | Remove apex.parallel. Use torch.nn.parallel for future compatibility
  commit 77c8e27e603bea9a69e7647587ca8d509dc1990d | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 7 01:54:39 2020 +0700 | Convert BatchNorm to SyncBatchNorm
  commit 96fa40a3a925e4ffd815fe329e1b5181ec92adc8 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Mon Jul 6 21:53:56 2020 +0800 | Fix the dataset inconsistency problem
  commit 16e7c269d062c8d16c4d4ff70cc80fd87935dc95 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Mon Jul 6 11:34:03 2020 +0800 | Add loss multiplication to preserve the single-process performance
  commit e83805563065ffd2e38f85abe008fc662cc17909 | Merge: 625bb49 3bdea3f6 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Fri Jul 3 20:56:30 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit 625bb49f4e52d781143fea0af36d14e5be8b040c | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 2 22:45:15 2020 +0800 | DDP established

* Squashed commit of the following:

  commit 94147314e559a6bdd13cb9de62490d385c27596f | Merge: 65157e2 37acbdc0 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 16 14:00:17 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit 37acbdc0 | Author: Glenn Jocher <glenn.jocher@ultralytics.com> | Date: Wed Jul 15 20:03:41 2020 -0700 | update test.py --save-txt
  commit b8c2da4a | Author: Glenn Jocher <glenn.jocher@ultralytics.com> | Date: Wed Jul 15 20:00:48 2020 -0700 | update test.py --save-txt
  commit 65157e2fc97d371bc576e18b424e130eb3026917 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Wed Jul 15 16:44:13 2020 +0800 | Revert the README.md removal
  commit 1c802bfa503623661d8617ca3f259835d27c5345 | Merge: cd55b44 0f3b8bb | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Wed Jul 15 16:43:38 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit cd55b445c4dcd8003ff4b0b46b64adf7c16e5ce7 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Wed Jul 15 16:42:33 2020 +0800 | fix the DDP performance deterioration bug.
  commit 0f3b8bb1fae5885474ba861bbbd1924fb622ee93 | Author: Glenn Jocher <glenn.jocher@ultralytics.com> | Date: Wed Jul 15 00:28:53 2020 -0700 | Delete README.md
  commit f5921ba1e35475f24b062456a890238cb7a3cf94 | Merge: 85ab2f3 bd3fdbb | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Wed Jul 15 11:20:17 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit bd3fdbbf1b08ef87931eef49fa8340621caa7e87 | Author: Glenn Jocher <glenn.jocher@ultralytics.com> | Date: Tue Jul 14 18:38:20 2020 -0700 | Update README.md
  commit c1a97a7767ccb2aa9afc7a5e72fd159e7c62ec02 | Merge: 2bf86b8 f796708b | Author: Glenn Jocher <glenn.jocher@ultralytics.com> | Date: Tue Jul 14 18:36:53 2020 -0700 | Merge branch 'master' into feature/DDP_fixed
  commit 2bf86b892fa2fd712f6530903a0d9b8533d7447a | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 22:18:15 2020 +0700 | Fixed world_size not found when called from test
  commit 85ab2f38cdda28b61ad15a3a5a14c3aafb620dc8 | Merge: 5a19011 c8357ad | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 22:19:58 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit 5a19011949398d06e744d8d5521ab4e6dfa06ab7 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 22:19:15 2020 +0800 | Add assertion for <=2 gpus DDP
  commit c8357ad5b15a0e6aeef4d7fe67ca9637f7322a4d | Merge: e742dd9 787582f | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Tue Jul 14 22:10:02 2020 +0800 | Merge pull request #8 from MagicFrogSJTU/NanoCode012-patch-1. Modify number of dataloaders' workers
  commit 787582f97251834f955ef05a77072b8c673a8397 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 20:38:58 2020 +0700 | Fixed issue with single gpu not having world_size
  commit 63648925288d63a21174a4dd28f92dbfebfeb75a | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 19:16:15 2020 +0700 | Add assert message for clarification. Clarify why assertion was thrown to users
  commit 69364d6050e048d0d8834e0f30ce84da3f6a13f3 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 17:36:48 2020 +0700 | Changed number of workers check
  commit d738487089e41c22b3b1cd73aa7c1c40320a6ebf | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 17:33:38 2020 +0700 | Adding world_size. Reduce calls to torch.distributed. For use in create_dataloader.
  commit e742dd9619d29306c7541821238d3d7cddcdc508 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 15:38:48 2020 +0800 | Make SyncBN a choice
  commit e90d4004387e6103fecad745f8cbc2edc918e906 | Merge: 5bf8beb cd90360 | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Tue Jul 14 15:32:10 2020 +0800 | Merge pull request #6 from NanoCode012/patch-5. Update train.py
  commit cd9036017e7f8bd519a8b62adab0f47ea67f4962 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 14 13:39:29 2020 +0700 | Update train.py. Remove redundant `opt.` prefix.
  commit 5bf8bebe8873afb18b762fe1f409aca116fac073 | Merge: c9558a9 a1c8406a | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 14:09:51 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit c9558a9b51547febb03d9c1ca42e2ef0fc15bb31 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 14 13:51:34 2020 +0800 | Add device allocation for loss compute
  commit 4f08c692fb5e943a89e0ee354ef6c80a50eeb28d | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 9 11:16:27 2020 +0800 | Revert drop_last
  commit 1dabe33a5a223b758cc761fc8741c6224205a34b | Merge: a1ce9b1 4b8450b | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 9 11:15:49 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit a1ce9b1e96b71d7fcb9d3e8143013eb8cebe5e27 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 9 11:15:21 2020 +0800 | fix lr warning
  commit 4b8450b46db76e5e58cd95df965d4736077cfb0e | Merge: b9a50ae 02c63ef | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Wed Jul 8 21:24:24 2020 +0800 | Merge pull request #4 from NanoCode012/patch-4. Add drop_last for multi gpu
  commit 02c63ef81cf98b28b10344fe2cce08a03b143941 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Wed Jul 8 10:08:30 2020 +0700 | Add drop_last for multi gpu
  commit b9a50aed48ab1536f94d49269977e2accd67748f | Merge: ec2dc6c 121d90b3 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 7 19:48:04 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit ec2dc6cc56de43ddff939e14c450672d0fbf9b3d | Merge: d0326e3 82a6182 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 7 19:34:31 2020 +0800 | Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  commit d0326e398dfeeeac611ccc64198d4fe91b7aa969 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Tue Jul 7 19:31:24 2020 +0800 | Add SyncBN
  commit 82a6182b3ad0689a4432b631b438004e5acb3b74 | Merge: 96fa40a 050b2a5 | Author: yzchen <Chenyzsjtu@gmail.com> | Date: Tue Jul 7 19:21:01 2020 +0800 | Merge pull request #1 from NanoCode012/patch-2. Convert BatchNorm to SyncBatchNorm
  commit 050b2a5a79a89c9405854d439a1f70f892139b1c | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 7 12:38:14 2020 +0700 | Add cleanup for process_group
  commit 2aa330139f3cc1237aeb3132245ed7e5d6da1683 | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 7 12:07:40 2020 +0700 | Remove apex.parallel. Use torch.nn.parallel for future compatibility
  commit 77c8e27e603bea9a69e7647587ca8d509dc1990d | Author: NanoCode012 <kevinvong@rocketmail.com> | Date: Tue Jul 7 01:54:39 2020 +0700 | Convert BatchNorm to SyncBatchNorm
  commit 96fa40a3a925e4ffd815fe329e1b5181ec92adc8 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Mon Jul 6 21:53:56 2020 +0800 | Fix the dataset inconsistency problem
  commit 16e7c269d062c8d16c4d4ff70cc80fd87935dc95 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Mon Jul 6 11:34:03 2020 +0800 | Add loss multiplication to preserve the single-process performance
  commit e83805563065ffd2e38f85abe008fc662cc17909 | Merge: 625bb49 3bdea3f6 | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Fri Jul 3 20:56:30 2020 +0800 | Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  commit 625bb49f4e52d781143fea0af36d14e5be8b040c | Author: yizhi.chen <chenyzsjtu@outlook.com> | Date: Thu Jul 2 22:45:15 2020 +0800 | DDP established

* Fixed destroy_process_group in DP mode
* Update torch_utils.py
* Update utils.py. Revert build_targets() to current master.
* Update datasets.py
* Fixed world_size attribute not found

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Parent b6fe2e45
@@ -14,7 +14,7 @@ from PIL import Image, ExifTags
 from torch.utils.data import Dataset
 from tqdm import tqdm
-from utils.utils import xyxy2xywh, xywh2xyxy
+from utils.utils import xyxy2xywh, xywh2xyxy, torch_distributed_zero_first
 help_url = 'https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data'
 img_formats = ['.bmp', '.jpg', '.jpeg', '.png', '.tif', '.dng']
@@ -46,21 +46,25 @@ def exif_size(img):
     return s
-def create_dataloader(path, imgsz, batch_size, stride, opt, hyp=None, augment=False, cache=False, pad=0.0, rect=False):
-    dataset = LoadImagesAndLabels(path, imgsz, batch_size,
-                                  augment=augment,  # augment images
-                                  hyp=hyp,  # augmentation hyperparameters
-                                  rect=rect,  # rectangular training
-                                  cache_images=cache,
-                                  single_cls=opt.single_cls,
-                                  stride=int(stride),
-                                  pad=pad)
+def create_dataloader(path, imgsz, batch_size, stride, opt, hyp=None, augment=False, cache=False, pad=0.0, rect=False, local_rank=-1, world_size=1):
+    # Make sure only the first process in DDP processes the dataset first, so that the other processes can use the cache.
+    with torch_distributed_zero_first(local_rank):
+        dataset = LoadImagesAndLabels(path, imgsz, batch_size,
+                                      augment=augment,  # augment images
+                                      hyp=hyp,  # augmentation hyperparameters
+                                      rect=rect,  # rectangular training
+                                      cache_images=cache,
+                                      single_cls=opt.single_cls,
+                                      stride=int(stride),
+                                      pad=pad)
     batch_size = min(batch_size, len(dataset))
-    nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers
+    nw = min([os.cpu_count() // world_size, batch_size if batch_size > 1 else 0, 8])  # number of workers
+    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset) if local_rank != -1 else None
     dataloader = torch.utils.data.DataLoader(dataset,
                                              batch_size=batch_size,
                                              num_workers=nw,
+                                             sampler=train_sampler,
                                              pin_memory=True,
                                              collate_fn=LoadImagesAndLabels.collate_fn)
     return dataloader, dataset
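The worker-count and sampler changes in this hunk can be illustrated without PyTorch. The following is a minimal sketch, not code from this commit: `workers_per_process` mirrors the `nw` expression with an injectable `cpu_count` (a hypothetical parameter, added only so the result does not depend on the machine), and `shard_indices` is a hypothetical stand-in for the strided split a distributed sampler performs. It leaves out the shuffling and padding that `torch.utils.data.distributed.DistributedSampler` also does.

```python
import os


def workers_per_process(batch_size, world_size=1, cap=8, cpu_count=None):
    """Mirror of the `nw` expression: split the CPU cores between DDP processes."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return min([cpus // world_size, batch_size if batch_size > 1 else 0, cap])


def shard_indices(num_samples, world_size, rank):
    """Hypothetical stand-in for a distributed sampler: each rank takes every
    world_size-th index, so the per-rank subsets are disjoint."""
    return list(range(rank, num_samples, world_size))


# 16 cores shared by 4 processes: each process gets 4 loader workers, not 8.
nw_ddp = workers_per_process(batch_size=16, world_size=4, cpu_count=16)
nw_single = workers_per_process(batch_size=16, world_size=1, cpu_count=16)

# Two ranks split ten samples between them without overlap.
rank0 = shard_indices(10, world_size=2, rank=0)
rank1 = shard_indices(10, world_size=2, rank=1)
```

Without the division by `world_size`, every process would request up to eight workers, so a four-GPU run on this hypothetical 16-core machine could start 32 loader workers.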
@@ -301,7 +305,7 @@ class LoadImagesAndLabels(Dataset):  # for training/testing
                     f += glob.iglob(p + os.sep + '*.*')
                 else:
                     raise Exception('%s does not exist' % p)
-            self.img_files = [x.replace('/', os.sep) for x in f if os.path.splitext(x)[-1].lower() in img_formats]
+            self.img_files = sorted([x.replace('/', os.sep) for x in f if os.path.splitext(x)[-1].lower() in img_formats])
         except Exception as e:
             raise Exception('Error loading data from %s: %s\nSee %s' % (path, e, help_url))
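A likely reason for the `sorted` call, stated here as an assumption rather than something the commit message spells out: `glob` returns files in arbitrary order, so each DDP process could build its image list in a different order, and the same sampler index would then refer to different images on different ranks. A minimal sketch with made-up file names and a hypothetical helper `select_images`:

```python
import os

img_formats = ['.bmp', '.jpg', '.jpeg', '.png', '.tif', '.dng']


def select_images(listing):
    """Keep image files only and return them in a deterministic order."""
    return sorted(x.replace('/', os.sep) for x in listing
                  if os.path.splitext(x)[-1].lower() in img_formats)


# The same directory as two processes might happen to enumerate it.
listing_a = ['b.jpg', 'a.png', 'notes.txt', 'c.JPG']
listing_b = ['c.JPG', 'notes.txt', 'a.png', 'b.jpg']

files_a = select_images(listing_a)
files_b = select_images(listing_b)
```

After sorting, both processes agree on which file sits at each index, whatever order the file system returned.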
......
@@ -8,6 +8,7 @@ import time
 from copy import copy
 from pathlib import Path
 from sys import platform
+from contextlib import contextmanager
 import cv2
 import matplotlib
@@ -31,6 +32,18 @@ matplotlib.rc('font', **{'size': 11})
 cv2.setNumThreads(0)
+@contextmanager
+def torch_distributed_zero_first(local_rank: int):
+    """
+    Context manager that makes all processes in distributed training wait for each local master (local_rank 0) to run the wrapped block first.
+    """
+    if local_rank not in [-1, 0]:
+        torch.distributed.barrier()
+    yield
+    if local_rank == 0:
+        torch.distributed.barrier()
 def init_seeds(seed=0):
     random.seed(seed)
     np.random.seed(seed)
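The ordering that `torch_distributed_zero_first` enforces can be tried out without a process group. The sketch below is an analogy, not the commit's code: it uses threads instead of processes, and `threading.Barrier.wait()` stands in for `torch.distributed.barrier()`. The names `zero_first` and `run` are hypothetical.

```python
import threading
from contextlib import contextmanager


@contextmanager
def zero_first(rank, barrier):
    """Same control flow as torch_distributed_zero_first, with a thread barrier."""
    if rank not in (-1, 0):
        barrier.wait()  # non-zero ranks park here until rank 0 has run the body
    yield
    if rank == 0:
        barrier.wait()  # rank 0 arrives last, which releases the waiting ranks


def run(world_size=3):
    """Start one thread per rank and record the order in which bodies execute."""
    barrier = threading.Barrier(world_size)
    order, lock = [], threading.Lock()

    def worker(rank):
        with zero_first(rank, barrier):
            with lock:
                order.append(rank)  # stands in for "build or read the dataset cache"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


order = run(3)

# With rank -1 (no DDP) the barrier is never touched, so None is acceptable.
with zero_first(-1, None):
    single_process_ok = True
```

Rank 0 always appears first in `order`; the relative order of the remaining ranks is not fixed.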
@@ -424,15 +437,16 @@ class BCEBlurWithLogitsLoss(nn.Module):
 def compute_loss(p, targets, model):  # predictions, targets, model
+    device = targets.device
     ft = torch.cuda.FloatTensor if p[0].is_cuda else torch.Tensor
-    lcls, lbox, lobj = ft([0]), ft([0]), ft([0])
+    lcls, lbox, lobj = ft([0]).to(device), ft([0]).to(device), ft([0]).to(device)
     tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
     h = model.hyp  # hyperparameters
     red = 'mean'  # Loss reduction (sum or mean)
     # Define criteria
-    BCEcls = nn.BCEWithLogitsLoss(pos_weight=ft([h['cls_pw']]), reduction=red)
-    BCEobj = nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']]), reduction=red)
+    BCEcls = nn.BCEWithLogitsLoss(pos_weight=ft([h['cls_pw']]), reduction=red).to(device)
+    BCEobj = nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']]), reduction=red).to(device)
     # class label smoothing https://arxiv.org/pdf/1902.04103.pdf eqn 3
     cp, cn = smooth_BCE(eps=0.0)
@@ -448,7 +462,7 @@ def compute_loss(p, targets, model):  # predictions, targets, model
     balance = [1.0, 1.0, 1.0]
     for i, pi in enumerate(p):  # layer index, layer predictions
         b, a, gj, gi = indices[i]  # image, anchor, gridy, gridx
-        tobj = torch.zeros_like(pi[..., 0])  # target obj
+        tobj = torch.zeros_like(pi[..., 0]).to(device)  # target obj
         nb = b.shape[0]  # number of targets
         if nb:
@@ -458,7 +472,7 @@ def compute_loss(p, targets, model):  # predictions, targets, model
             # GIoU
             pxy = ps[:, :2].sigmoid() * 2. - 0.5
             pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
-            pbox = torch.cat((pxy, pwh), 1)  # predicted box
+            pbox = torch.cat((pxy, pwh), 1).to(device)  # predicted box
             giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # giou(prediction, target)
             lbox += (1.0 - giou).sum() if red == 'sum' else (1.0 - giou).mean()  # giou loss
@@ -467,7 +481,7 @@ def compute_loss(p, targets, model):  # predictions, targets, model
             # Class
             if model.nc > 1:  # cls loss (only if multiple classes)
-                t = torch.full_like(ps[:, 5:], cn)  # targets
+                t = torch.full_like(ps[:, 5:], cn).to(device)  # targets
                 t[range(nb), tcls[i]] = cp
                 lcls += BCEcls(ps[:, 5:], t)  # BCE
......