Code Collector (代码收藏家) Technical Tutorials 2022-07-23

YOLOv5 Source Code Analysis: Building the Network Model

In the YOLOv5 source code, the model is built by the functions and classes in yolo.py:

parse_model

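This function parses a yaml file to build the network model. The walkthrough below uses yolov5s.yaml (in the models folder) as the example:

depth_multiple: the network depth gain, applied to the "number" field of each row in the yaml file (only rows with number > 1 are scaled). It is computed as: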

$n = max(1, round(n \cdot d_m))$

In this file depth_multiple = 0.33, which reduces the network depth to roughly one third
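As a quick check of the formula, the sketch below applies it to the "number" values that appear in yolov5s.yaml. The helper name depth_gain is made up for this illustration; its body mirrors the expression used in parse_model, which leaves rows with number = 1 untouched.

```python
def depth_gain(n: int, gd: float) -> int:
    """Scale the yaml "number" field by the depth multiple gd."""
    # Rows with n == 1 are returned unchanged; larger values never drop below 1.
    return max(round(n * gd), 1) if n > 1 else n


# "number" values used in yolov5s.yaml, scaled with depth_multiple = 0.33
scaled = {n: depth_gain(n, 0.33) for n in (1, 3, 6, 9)}
print(scaled)  # {1: 1, 3: 1, 6: 2, 9: 3}
```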

width_multiple: the channel gain, applied to the output channel count of each network unit. It is computed as:

$c_2=8 \cdot ceil(\frac{c_2 \cdot w_m}{8})$

In this file width_multiple = 0.5, which halves the number of convolution channels
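The channel formula can be tried out in isolation. The function below is a small re-implementation written from the formula above (rounding up to a multiple of 8), not a copy of the library code. The value 100 is an arbitrary extra input chosen to show the rounding.

```python
import math


def make_divisible(x: float, divisor: int) -> int:
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


# Output channels written in yolov5s.yaml, scaled with width_multiple = 0.5
for c2 in (64, 128, 256, 512, 1024):
    print(c2, "->", make_divisible(c2 * 0.5, 8))

# A value that is not already a multiple of 8 after scaling is rounded up
print(100, "->", make_divisible(100 * 0.5, 8))  # 100 -> 56
```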

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2, 1st occurrence of s = 2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4, 2nd occurrence of s = 2
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8, 3rd occurrence of s = 2
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16, 4th occurrence of s = 2
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32, 5th occurrence of s = 2
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 1st upsampling
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']], # 2nd upsampling
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],  # 6th occurrence of s = 2
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],  # 7th occurrence of s = 2
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

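Under the backbone and head keys, each row corresponds to one network layer, and its four fields are:

from: the index of a network layer (one or several; -1 means the previous layer). The outputs of those layers are passed as the argument of the forward function

number: how many copies of the unit are chained in series or, for BottleneckCSP, C3, C3TR and C3Ghost, the value of the unit's own parameter n

module: the name of the unit in models.common or torch.nn

args: the constructor arguments, without c1 (input channels) and n (the number field above), which parse_model fills in. For Conv, DWConv, GhostConv, GhostBottleneck and Focus, the third argument is s (the convolution stride)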

(The arguments of the units in models/common.py are explained in the following article)

YOLOv5-6.0 Source Code Analysis: Convolutional Units https://hebitzj.blog.csdn.net/article/details/124437488

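While parsing the yaml file, this function logs information about every layer through the logging module. The arguments column differs from args in the yaml file (the missing arguments have been filled in), and params is the number of parameters of the layer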
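To make the difference between the yaml args and the logged arguments concrete, here is a minimal sketch of the completion step for conv-like rows. The helper complete_args and its takes_n flag are hypothetical names introduced for this illustration. It uses plain strings instead of real module classes and skips the check against the Detect output size that parse_model performs.

```python
import math


def complete_args(row, ch, gd=0.33, gw=0.50, takes_n=False):
    """Sketch of how one conv-like yaml row is completed (hypothetical helper)."""
    f, n, _name, args = row
    n = max(round(n * gd), 1) if n > 1 else n   # depth gain
    c1 = ch[f]                                  # input channels: output of the "from" layer
    c2 = math.ceil(args[0] * gw / 8) * 8        # width gain, rounded up to a multiple of 8
    full = [c1, c2, *args[1:]]
    if takes_n:                                 # units such as C3 take n as an argument
        full.insert(2, n)
        n = 1
    return n, full


# Row 0 of the backbone: the only known channel count is the RGB input.
print(complete_args([-1, 1, "Conv", [64, 6, 2, 2]], ch=[3]))
# (1, [3, 32, 6, 2, 2])

# Row 4 of the backbone: ch holds the output channels of rows 0-3.
print(complete_args([-1, 6, "C3", [256]], ch=[32, 64, 64, 128], takes_n=True))
# (1, [128, 128, 2])
```

In the second call the yaml row asks for 6 repeats and 256 channels, while the completed arguments show 128 input channels, 128 output channels and 2 repeats.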

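Function parameters:

d: the yaml configuration (a dict)

ch: a list holding the number of input image channels, e.g. [3] for an RGB image (the code reads ch[-1])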

def parse_model(d, ch):  # model_dict, input_channels(3)
    # Log the column headers through the logging module
    LOGGER.info('\n%3s%18s%3s%10s  %-40s%-30s' % ('', 'from', 'n', 'params', 'module', 'arguments'))
    # Read the yaml parameters: anchors, number of classes, depth gain, width gain
    anchors, nc, gd, gw = d['anchors'], d['nc'], d['depth_multiple'], d['width_multiple']
    # na: number of anchors in each anchor group
    na = (len(anchors[0]) // 2) if isinstance(anchors, list) else anchors  # number of anchors
    # no: na * number of attributes (5 + number of classes)
    no = na * (nc + 5)  # number of outputs = anchors * (classes + 5)

    # Layer list, list of layer indices whose outputs are referenced, current output channels
    layers, save, c2 = [], [], ch[-1]  # layers, savelist, ch out
    # Iterate over the units listed under backbone and head
    for i, (f, n, m, args) in enumerate(d['backbone'] + d['head']):  # from, number, module, args
        # Use eval to look up the class named by the module field
        m = eval(m) if isinstance(m, str) else m  # eval strings
        # Use eval to turn string arguments into values
        for j, a in enumerate(args):
            try:
                args[j] = eval(a) if isinstance(a, str) else a  # eval strings
            except NameError:
                pass

        # Apply gd (depth gain) to get the series depth / the unit's parameter n
        n = n_ = max(round(n * gd), 1) if n > 1 else n  # depth gain
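        # Units whose arguments include input and output channel counts
        if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost]:
            # ch (list of output channels): ch[f] is the output channel count of the "from" layer
            c1, c2 = ch[f], args[0]
            if c2 != no:  # if not output
                # Apply the width gain and round up to a multiple of 8
                c2 = make_divisible(c2 * gw, 8)

            # The yaml args omit the input channel count, so prepend it
            # and substitute the new output channel count
            args = [c1, c2, *args[1:]]
            if m in [BottleneckCSP, C3, C3TR, C3Ghost]:
                # Units that take n as an argument receive it here
                args.insert(2, n)  # number of repeats
                # and the series depth is reset to 1
                n = 1
        # BatchNorm2d takes a single argument: the output channel count of the "from" layer
        elif m is nn.BatchNorm2d:
            args = [ch[f]]
        # Concat: "from" is a list; the output channel count is the sum over the referenced layers
        elif m is Concat:
            c2 = sum([ch[x] for x in f])
        # Detect head
        elif m is Detect:
            # Look up the output channels of the "from" layers to form Detect's ch argument
            args.append([ch[x] for x in f])
            # If anchors is an int (the number of anchors), create placeholder values
            # [0, 1, ..., 2 * n - 1] for each detection layer
            if isinstance(args[1], int):  # number of anchors
                args[1] = [list(range(args[1] * 2))] * len(f)
        # Contract, Expand: compute the resulting output channel count
        elif m is Contract:
            c2 = ch[f] * args[0] ** 2
        elif m is Expand:
            c2 = ch[f] // args[0] ** 2
        else:
            c2 = ch[f]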

        # Layer: build n copies of the unit in series when n > 1
        m_ = nn.Sequential(*[m(*args) for _ in range(n)]) if n > 1 else m(*args)  # module
        # Class name of the unit
        t = str(m)[8:-2].replace('__main__.', '')  # module type
        # Number of parameters of the layer
        np = sum([x.numel() for x in m_.parameters()])  # number params
        # Attach attributes to the layer instance
        m_.i, m_.f, m_.type, m_.np = i, f, t, np  # attach index, 'from' index, type, number params
        # Log the layer information through the logging module
        LOGGER.info('%3s%18s%3s%10.0f  %-40s%-30s' % (i, f, n_, np, t, args))  # print
        # Add every "from" index other than -1 to the save list
        save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        # Store the layer
        layers.append(m_)
        if i == 0:
            ch = []
        # Record the output channel count of the layer
        ch.append(c2)
    return nn.Sequential(*layers), sorted(save)

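Return values:

nn.Sequential(*layers): the layers, wrapped in an nn.Sequential

sorted(save): for the yaml file above the result is [4, 6, 10, 14, 17, 20, 23], i.e. the sorted from indices other than -1 (the layers whose outputs must be kept)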
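The save list can be reproduced without building any layers. The sketch below copies only the from column of yolov5s.yaml into a list (the variable name froms is chosen for this example) and applies the same expression as parse_model.

```python
froms = [-1] * 10 + [          # backbone rows 0-9 all take the previous layer
    -1, -1, [-1, 6], -1,       # head rows 10-13
    -1, -1, [-1, 4], -1,       # head rows 14-17
    -1, [-1, 14], -1,          # head rows 18-20
    -1, [-1, 10], -1,          # head rows 21-23
    [17, 20, 23],              # row 24: Detect
]

save = []
for i, f in enumerate(froms):
    # x % i turns a negative relative index (other than -1) into an absolute one
    save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)

print(sorted(save))  # [4, 6, 10, 14, 17, 20, 23]
```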

Detect

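The Detect object is the last layer of the YOLO model (the last row of the yaml file). It is declared in the yaml file in the form:

[*from], 1, Detect, [nc, anchors]

where nc is the number of classes and anchors are the anchor boxes; both are set in the first few lines of the yaml file

In parse_model, the output channel counts of the layers named by the from field are looked up (ch, a list) and passed to Detect, which creates the corresponding Conv2d layers:

nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)

Here self.no is the number of classes (80) plus the number of box attributes (5), and self.na is the number of anchors per output feature map (3)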
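The quantities involved can be derived directly from the yaml values. The following sketch uses the anchors and nc from yolov5s.yaml; the variable out_channels is a name chosen here for the product self.no * self.na.

```python
anchors = [
    [10, 13, 16, 30, 33, 23],        # P3/8
    [30, 61, 62, 45, 59, 119],       # P4/16
    [116, 90, 156, 198, 373, 326],   # P5/32
]
nc = 80

nl = len(anchors)            # number of detection layers
na = len(anchors[0]) // 2    # anchors per layer: each anchor is a (w, h) pair
no = nc + 5                  # outputs per anchor: x, y, w, h, objectness + classes
out_channels = no * na       # output channels of each 1x1 Conv2d in Detect

print(nl, na, no, out_channels)  # 3 3 85 255
```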

class Detect(nn.Module):
    stride = None  # strides computed during build
    onnx_dynamic = False  # ONNX export parameter

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):...
    
    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            # Apply the Conv2d of this detection layer
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            # Reshape and permute to: bs, anchors per layer, grid rows, grid columns, attributes + classes
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:
                    # Build the grid-point coordinates and anchor sizes
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                y = x[i].sigmoid()
                # stride: downsampling factor relative to the input image
                if self.inplace:
                    y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
                    xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, y[..., 4:]), -1)
                # Store the decoded boxes
                z.append(y.view(bs, -1, self.no))
        return x if self.training else (torch.cat(z, 1), x)

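The forward function takes a list x (length 3, holding the outputs of layers 17, 20 and 23) and has two return values. Take x[0] with 80 classes, 5 box attributes and a 76 × 76 feature map as an example:

x: Conv2d changes the channel count of x[0] from c1 to (80 + 5) × 3 = 255, and the shape of x[0] is then changed from [bs, 255, 76, 76] to [bs, 3, 76, 76, 85]. The same is done for x[1] and x[2]

torch.cat(z, 1): after sigmoid is applied to the new x[0], the xywh of each box is decoded. The shape of the new x[0] is then changed from [bs, 3, 76, 76, 85] to [bs, 17328, 85], i.e. it holds the information of 17328 boxes. The same is done for x[1] and x[2], and the results are concatenated along dim = 1, giving box information of shape [bs, 22743, 85] (which is then passed directly to non-maximum suppression)

x is returned in both train and eval mode, whereas torch.cat(z, 1) is returned only in eval mode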
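The box counts quoted above follow from simple arithmetic. A 76 × 76 map at stride 8 implies a 608 × 608 input, so the other two maps are assumed here to be 38 × 38 and 19 × 19.

```python
na, no = 3, 85
sizes = [76, 38, 19]   # feature-map side lengths for strides 8, 16, 32

per_layer = [na * s * s for s in sizes]
print(per_layer)        # [17328, 4332, 1083]
print(sum(per_layer))   # 22743
```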

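In eval mode, let grid be the coordinates of the grid points of x[i], anchor the anchor size stored in self.anchor_grid[i], and stride the downsampling factor of x[i] relative to the input image. xywh are computed as:

$x=[grid+2 \cdot sigmoid(x)-0.5]\cdot stride$

$w=[2\cdot sigmoid(w)]^2\cdot anchor$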
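The decoding can be followed on a single prediction in plain Python. This is a minimal sketch that mirrors the inference branch of Detect.forward shown above. The function decode_box and its argument names are invented for the example, and anchor_wh plays the role of one entry of self.anchor_grid[i].

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(tx, ty, tw, th, grid_xy, anchor_wh, stride):
    """Decode one raw prediction the way the inference branch of Detect.forward does."""
    gx, gy = grid_xy
    aw, ah = anchor_wh                      # stands in for self.anchor_grid[i]
    x = (sigmoid(tx) * 2.0 - 0.5 + gx) * stride
    y = (sigmoid(ty) * 2.0 - 0.5 + gy) * stride
    w = (sigmoid(tw) * 2.0) ** 2 * aw
    h = (sigmoid(th) * 2.0) ** 2 * ah
    return x, y, w, h


# Raw outputs of 0 give sigmoid = 0.5: the centre sits in the middle of the cell
# and the box has exactly the anchor's size.
print(decode_box(0, 0, 0, 0, grid_xy=(10, 7), anchor_wh=(30, 61), stride=16))
# (168.0, 120.0, 30.0, 61.0)
```

Because (2 · sigmoid)² stays between 0 and 4, a predicted width or height can be at most four times the anchor's.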

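xy consists of two parts, a grid coordinate and an offset. In YOLOv3 the offset is $sigmoid(x)$, with range (0, 1); in YOLOv5 the offset is $2 \cdot sigmoid(x) - 0.5$, with range (-0.5, 1.5)

Plotting the two functions shows that the YOLOv5 offset has a smoother gradient over the range [0, 1], which may be one of the reasons for YOLOv5's good localization performance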
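One way to see the difference numerically: the plain sigmoid only approaches 0 and 1 asymptotically, whereas the YOLOv5 offset reaches both values at finite inputs (±ln 3). The sketch below compares the two; the function names offset_v3 and offset_v5 are chosen for this example and do not appear in the source code.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def offset_v3(t: float) -> float:
    return sigmoid(t)                 # range (0, 1)


def offset_v5(t: float) -> float:
    return 2.0 * sigmoid(t) - 0.5     # range (-0.5, 1.5)


# Inputs at which the YOLOv5 offset equals 0 and 1
t0 = math.log(0.25 / 0.75)   # sigmoid = 0.25
t1 = math.log(0.75 / 0.25)   # sigmoid = 0.75
print(round(t0, 4), round(t1, 4))   # -1.0986 1.0986

for t in (-6.0, t0, 0.0, t1, 6.0):
    print(f"t={t:7.3f}  v3={offset_v3(t):.4f}  v5={offset_v5(t):.4f}")
```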

Model

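A Model object is constructed as follows:

Model(cfg='yolov5s.yaml', ch=3, nc=None, anchors=None)

cfg: the yaml file, which is passed to parse_model to obtain the network structure

ch: the number of input image channels, e.g. ch = 3 for an RGB image

nc: the number of classes; if set, it overrides nc in the yaml dictionary

anchors: if set, it overrides anchors in the yaml dictionary. As parse_model shows, that entry may be either a list of anchor sizes (3 rows of 6 values in the yaml above) or an int giving the number of anchors per detection layer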

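Initialization:

Call parse_model to obtain the network structure
If the last layer is Detect, compute the factor by which the model downsamples the image, stored as stride, and divide the anchor sizes by the corresponding stride
Call self._initialize_biases to adjust the bias of the Conv2d layers in the Detect instance
Call utils.torch_utils.initialize_weights to initialize the network parameters
Call utils.torch_utils.model_info to print the number of layers, parameters and gradients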
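The stride step in the list above amounts to a few divisions. The sketch below assumes a 256 × 256 dummy input and the resulting feature-map heights of 32, 16 and 8; both are illustrative assumptions and are not taken from the code shown in this article.

```python
s = 256                        # side length of an assumed dummy input
feature_sizes = [32, 16, 8]    # assumed heights of the P3, P4, P5 outputs for that input
stride = [s / fs for fs in feature_sizes]

anchors = [
    [10, 13, 16, 30, 33, 23],        # P3/8
    [30, 61, 62, 45, 59, 119],       # P4/16
    [116, 90, 156, 198, 373, 326],   # P5/32
]
# Express each anchor in units of its own feature-map cells
scaled = [[v / st for v in row] for row, st in zip(anchors, stride)]

print(stride)      # [8.0, 16.0, 32.0]
print(scaled[0])   # [1.25, 1.625, 2.0, 3.75, 4.125, 2.875]
```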

Before turning to the forward function, here are the functions that it calls

_profile_one_layer

class Model(nn.Module):

    def __init__(self, cfg='yolov5s.yaml', ch=3, nc=None, anchors=None):...

    def _profile_one_layer(self, m, x, dt):
        # Whether this is the Detect layer
        c = isinstance(m, Detect)  # is final layer, copy input as inplace fix
        # Use the thop module to count floating-point operations
        o = thop.profile(m, inputs=(x.copy() if c else x,), verbose=False)[0] / 1E9 * 2 if thop else 0  # FLOPs
        # Wait for pending GPU work to finish before starting the timer
        t = time_sync()
        # Run inference 10 times
        for _ in range(10):
            m(x.copy() if c else x)
        # Record the inference time (ms): seconds for 10 runs * 100 = mean ms per run
        dt.append((time_sync() - t) * 100)
        if m == self.model[0]:
            # Log the column headers through the logging module
            LOGGER.info(f"{'time (ms)':>10s} {'GFLOPs':>10s} {'params':>10s}  {'module'}")
        # Log the profiling result through the logging module
        LOGGER.info(f'{dt[-1]:10.2f} {o:10.2f} {m.np:10.0f}  {m.type}')
        if c:
            # Total inference time of the whole model
            LOGGER.info(f"{sum(dt):10.2f} {'-':>10s} {'-':>10s}  Total")

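This function measures the performance of each layer. Its parameters are:

m: the layer

x: the outputs of the layers in this layer's from list

dt: the inference time of each layer (a list)

The following is logged through the logging module:

time (ms): forward inference time

GFLOPs: floating-point operations; requires the thop module

params: number of parameters of the layer

module: name of the layer

Because this function adds extra computation while timing, it is disabled in the three main scripts train, val and detect. Still, the per-layer forward time and floating-point operation count are important indicators of network performance, so it is worth understanding for development work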
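The timing pattern itself is independent of PyTorch. Below is a minimal sketch using only the standard library; profile_callable is a hypothetical helper, and time.perf_counter is a CPU-only stand-in for time_sync, so no GPU synchronization takes place.

```python
import time


def profile_callable(fn, x, runs=10):
    """Return the mean wall-clock time of fn(x) in milliseconds over `runs` calls."""
    t = time.perf_counter()
    for _ in range(runs):
        fn(x)
    # Total seconds / runs * 1000; with runs = 10 this equals seconds * 100
    return (time.perf_counter() - t) / runs * 1000


ms = profile_callable(lambda v: sum(i * i for i in range(v)), 10_000)
print(f"{ms:10.2f} ms")
```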

_forward_once

class Model(nn.Module):

    def __init__(self, cfg='yolov5s.yaml', ch=3, nc=None, anchors=None):...

    def _forward_once(self, x, profile=False, visualize=False):
        # Outputs of each layer, inference time of each layer
        y, dt = [], []  # outputs
        for m in self.model:
            # If "from" does not simply point to the previous layer
            if m.f != -1:  # if not from previous layer
                # Gather the output(s) of the layer(s) named by "from"
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            # Profile this layer
            if profile:
                self._profile_one_layer(m, x, dt)
            # Run the layer to obtain its output
            x = m(x)  # run
            # Keep the output only if the layer index is in the save list
            y.append(x if m.i in self.save else None)  # save output
            if visualize:
                # Plot the feature maps of the first image in the batch
                feature_visualization(x, m.type, m.i, save_dir=visualize)
        return x

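This function is one of the two branches of forward and performs a plain forward pass. Its parameters are:

x: the tensor holding the image

profile: whether to measure the performance of each layer; if so, self._profile_one_layer is called

visualize: whether to output the feature maps of each layer; if so, utils.plots.feature_visualization is called. That function takes the first image in the batch, treats the 2-D matrix of each channel as a grayscale image, and plots each one separately

The feature maps are plotted with matplotlib. The imshow call in the source does not specify cmap, so you can edit it to choose a colormap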
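The routing logic of _forward_once (the from index plus the save list) can be exercised with toy layers that operate on numbers instead of tensors. The Layer class and the three lambda "layers" below are stand-ins invented for this sketch; only the loop body follows the method shown above.

```python
class Layer:
    """Hypothetical stand-in for a layer carrying the i / f attributes set by parse_model."""

    def __init__(self, i, f, fn):
        self.i, self.f, self.fn = i, f, fn

    def __call__(self, x):
        return self.fn(x)


layers = [
    Layer(0, -1, lambda x: x + 1),
    Layer(1, -1, lambda x: x * 2),
    Layer(2, [-1, 0], lambda xs: sum(xs)),   # Concat-like: previous output and layer 0
]
save = [0]   # only layer 0 is referenced by a later layer


def forward_once(x):
    y = []
    for m in layers:
        if m.f != -1:   # input does not simply come from the previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        y.append(x if m.i in save else None)   # keep only outputs that are needed later
    return x, y


out, cache = forward_once(3)
print(out, cache)  # 12 [4, None, None]
```

Storing None for layers outside the save list keeps the indices of y aligned with the layer indices while avoiding holding every intermediate output.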

_forward_augment

class Model(nn.Module):

    def __init__(self, cfg='yolov5s.yaml', ch=3, nc=None, anchors=None):...

    def _forward_augment(self, x):
        img_size = x.shape[-2:]  # height, width
        # Scales: unchanged, reduced to 0.83, reduced to 0.67
        s = [1, 0.83, 0.67]  # scales
        # Flips: none, left-right, none
        f = [None, 3, None]  # flips (2-ud, 3-lr)
        y = []  # outputs
        for si, fi in zip(s, f):
            # Transform the image
            xi = scale_img(x.flip(fi) if fi else x, si, gs=int(self.stride.max()))
            # Run the whole model
            yi = self._forward_once(xi)[0]  # forward
            # cv2.imwrite(f'img_{si}.jpg', 255 * xi[0].cpu().numpy().transpose((1, 2, 0))[:, :, ::-1])  # save
            # Apply the inverse transform to the predictions
            yi = self._descale_pred(yi, fi, si, img_size)
            y.append(yi)
        # Clip off the large-object detections of y[0], keep all of y[1], clip off the small-object detections of y[2]
        y = self._clip_augmented(y)  # clip augmented tails
        # Concatenate all boxes to obtain the new detections
        return torch.cat(y, 1), None  # augmented inference, train

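Augmented inference. Its x parameter is the tensor holding the image. This function is used only in the val and detect main scripts, to improve inference accuracy

With 80 classes and 5 box attributes, the basic steps are:

Transform the image 3 times in total: [ original image ], [ size reduced to 0.83 of the original and flipped horizontally ], [ size reduced to 0.67 of the original ]
Run _forward_once on each image to obtain the model's output in eval mode. For the original image this is box information of shape [1, 22743, 85] (see the forward function of the Detect object)
Apply the inverse transform to the box information according to the scale factor and flip dimension, and append it to the list y
Clip off the large-object detections of y[0], keep all detections of y[1], clip off the small-object detections of y[2], and concatenate to obtain the new box information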
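The inverse transform in the third step can be illustrated on a single box. This is a sketch of the idea, not the body of _descale_pred: the function descale_box is hypothetical, it works on a plain (x, y, w, h) tuple, and the scale 0.5 is used instead of 0.83 only to keep the arithmetic exact.

```python
def descale_box(box, flip, scale, img_size):
    """Map one (x, y, w, h) box predicted on a transformed image back to the original image."""
    h_img, w_img = img_size
    x, y, w, h = (v / scale for v in box)   # undo the resize
    if flip == 2:                           # undo an up-down flip
        y = h_img - y
    elif flip == 3:                         # undo a left-right flip
        x = w_img - x
    return x, y, w, h


# A box found on a half-size, left-right flipped copy of a 608 x 608 image
print(descale_box((100.0, 50.0, 20.0, 10.0), flip=3, scale=0.5, img_size=(608, 608)))
# (408.0, 100.0, 40.0, 20.0)
```

Only the centre coordinate changes under a flip; the width and height are affected by the scale alone.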

forward

class Model(nn.Module):

    def __init__(self, cfg='yolov5s.yaml', ch=3, nc=None, anchors=None):...

    def forward(self, x, augment=False, profile=False, visualize=False):
        if augment:
            return self._forward_augment(x)  # augmented inference, None
        return self._forward_once(x, profile, visualize)  # single-scale inference, train

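augment: whether to use augmented inference; if so, self._forward_augment is called, otherwise self._forward_once

profile: whether to measure the performance of each layer

visualize: whether to output the feature maps of each layer

Source: 荷碧·TongZJ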