mmdet3d + waymo: pitfalls and a workflow for verifying that the environment is correct

Processing the new version of the Waymo dataset was already painful enough, and then the eval results and the training loss kept coming out bad. I assumed at first that it was only a model problem, but it turned out the environment had big pitfalls too. After fixing that, evaluate started throwing an error at the end of every run even though the results were actually correct. The whole chain of problems took 20 days to sort out, with basically no rest in between. Exhausting.

Earlier, when setting up mmdet3d, I used the latest mmdet3d v1.0.0rc2, and with the official configs and models both the eval and the train results on the nuScenes dataset were wrong. It only worked after I switched to the package versions from a labmate's environment, but at that point testing on Waymo started erroring out. After a long bug hunt, it turned out that the new CUDA, the older torch, and tensorflow conflict with each other to some degree, so problems appear once they all use the GPU at the same time.

Reproducing the waymo evaluate error

I should file an issue about this when I get the time.
Environment:

mmcv-full                 1.4.0            
mmdet                     2.19.1     
mmdet3d                   0.17.3 
tensorflow                2.6.0
torch                     1.10.2+cu113
waymo-open-dataset-tf-2-6-0 1.4.7

The problem shows up with bash tools/dist_train.sh or bash tools/dist_test.sh.
Once waymo_dataset.evaluate has been called, the program throws the following error when it terminates:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: unspecified launch failure
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27dd18ed62 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7f282083b5f3 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f282083c002 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f27dd178314 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29eb89 (0x7f28a3b62b89 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xadfbe1 (0x7f28a43a3be1 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f28a43a3ee2 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #61: PyRun_SimpleFileExFlags + 0x1bf (0x56231eaba54f in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #62: Py_RunMain + 0x3a9 (0x56231eabaa29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #63: Py_BytesMain + 0x39 (0x56231eabac29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57875 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 57874) of binary: /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Note that the CUDA work and the CPU run asynchronously, and the nasty part is that when CUDA hits an error the CPU does not stop; it keeps running until some later point where a synchronization happens. So the error report cannot be used to locate the failing code. Running CUDA_LAUNCH_BLOCKING=1 bash tools/dist_test.sh did not help either: the error still only appears after test.py has finished. So I had to comment the code out section by section to see where the problem actually was; after about an hour I found it, shown below:

mmdet3d/mmdet3d/core/evaluation/waymo_utils/prediction_kitti_to_waymo.py:

def convert_one(self, file_idx):
    file_pathname = self.waymo_tfrecord_pathnames[file_idx]
    file_data = tf.data.TFRecordDataset(file_pathname, compression_type='')

The error is triggered by the final call to tf.data.TFRecordDataset, which reads the .tfrecord file; if that line is commented out, the error does not occur. This method is executed by multiple parallel workers, launched with:

mmcv.track_parallel_progress(self.convert_one, range(len(self)), self.workers)

As soon as it runs in parallel, the error occurs; run serially, it does not:

for idx in range(len(self)):
    self.convert_one(idx)  # too slow

But on a large dataset the serial version fails too, apparently because the GPUs are held up for too long. The error:

[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808663 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808705 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801257 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 101573 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 101582) of binary: /home/zhenglt/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

With mmdet3d==v1.0.0rc2 and torch==1.11.0+cu113 this error does not occur, but setting the number of parallel workers too high causes a CUDA out of memory error.
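As a side note on the 1800000 ms figure in the log above: that is the default 30-minute collective timeout of the NCCL process group, which the watchdog enforces. A generic PyTorch-level knob (not something mmdet3d exposes as an option, so treat this as a sketch only) is to raise that timeout when the process group is created; in mmcv-based training the entries of cfg.dist_params other than the backend are forwarded to init_process_group:

# Sketch only: raise the NCCL collective timeout so that a long CPU-bound
# evaluation on one rank does not trip the 30-minute watchdog on the others.
# The 3-hour value is an arbitrary example, not a recommendation.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend='nccl', timeout=timedelta(hours=3))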

Error behavior with different version combinations

cuda11.3 + mmdet3d=0.17.3 + torch=1.10.2 + $waymo

waymo-open-dataset-tf-2-6-0: both the standalone eval and the eval run during training error out.
waymo-open-dataset-tf-2-5-0: after downgrading to 2-5-0 (i.e. installing tensorflow 2.5.0), a standalone dist_test.sh no longer errors, but with dist_train.sh the model has to evaluate after a few epochs of training, and that eval errors out once it finishes.
Even lower versions are unusable as well; specifically:

Successfully uninstalled waymo-open-dataset-tf-2-4-0-1.4.1
pip uninstall waymo-open-dataset-tf-2-3-0
Successfully uninstalled waymo-open-dataset-tf-2-2-0-1.3.1

tf-2-1-0 is even older and has even more problems.

cuda11.3 + mmdet3d=0.18.1 + torch = 1.11.0

mmdet3d will not even compile: although the matching mmcv supports torch 1.11, mmdet3d 0.18.1 does not.

cuda11.1 + mmdet3d=0.17.3 + torch = 1.9.0 + $waymo

Same situation as with torch 1.10.2.

The transfusion experiment environment

Never got it installed: the version differences are too large, there would be a lot of correctness-verification work to redo, and I lost the patience for switching environments. Gave up.

Solving the waymo evaluate error

Since installing different versions was a dead end, the only remaining option is to bypass the tensorflow read entirely: dump the fields that are needed from each .tfrecord into a corresponding .pkl, and from then on read the .pkl instead of the .tfrecord, as follows:

def convert_one_pkl_style(self, file_idx):
    """Convert action for a single file, reading a cached .pkl instead of the .tfrecord.

    Args:
        file_idx (int): Index of the file to be converted.
    """
    pkl_pathname = self.waymo_tfrecord_pathnames[file_idx].replace('.tfrecord', '.pkl')
    infos = mmcv.load(pkl_pathname)
    # return still got error here
    for info in infos:
        filename = info['filename']
        T_k2w = info['T_k2w']
        context_name = info['context_name']
        frame_timestamp_micros = info['frame_timestamp_micros']

        if filename in self.name2idx:
            kitti_result = \
                self.kitti_result_files[self.name2idx[filename]]
            objects = self.parse_objects(kitti_result, T_k2w, context_name,
                                         frame_timestamp_micros)
        else:
            print(filename, 'not found.')
            objects = metrics_pb2.Objects()

        with open(
                join(self.waymo_results_save_dir, f'{filename}.bin'),
                'wb') as f:
            f.write(objects.SerializeToString())
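
For completeness, those .pkl files have to be generated once per .tfrecord, offline and preferably in a plain CPU-only process so the tensorflow/torch CUDA conflict never comes into play. Below is a minimal sketch of such a one-time dump, assuming the same fields the method above reads and the same filename / T_k2w conventions as the original convert_one (prefix + file index + frame index, front-camera extrinsic multiplied by T_ref_to_front_cam); the helper name dump_one_tfrecord and its argument list are mine, not part of mmdet3d:

import mmcv
import numpy as np
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset


def dump_one_tfrecord(tfrecord_path, prefix, file_idx, T_ref_to_front_cam):
    """One-time, CPU-only dump of the fields convert_one_pkl_style reads (sketch)."""
    infos = []
    file_data = tf.data.TFRecordDataset(tfrecord_path, compression_type='')
    for frame_num, frame_data in enumerate(file_data):
        frame = open_dataset.Frame()
        frame.ParseFromString(bytearray(frame_data.numpy()))
        # the front camera has name == 1 (FRONT) in dataset.proto
        for camera in frame.context.camera_calibrations:
            if camera.name == 1:
                T_front_cam_to_vehicle = np.array(
                    camera.extrinsic.transform).reshape(4, 4)
        infos.append(dict(
            filename=f'{prefix}{file_idx:03d}{frame_num:03d}',
            T_k2w=T_front_cam_to_vehicle @ T_ref_to_front_cam,
            context_name=frame.context.name,
            frame_timestamp_micros=frame.timestamp_micros))
    # write next to the .tfrecord so convert_one_pkl_style can find it
    mmcv.dump(infos, tfrecord_path.replace('.tfrecord', '.pkl'))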

Other attempted fixes that failed

I also tried passing the information via the dataset's data_infos directly, with each convert_one worker reading a single frame's info from data_infos. Not only were the results wrong, it was also very slow, because the parallelism never kicked in: only one subprocess was active at any time, for reasons I never figured out.

waymo evaluation is too slow

The machines in Beijing keep getting stuck: reading files is very slow, and sometimes a reboot fixes it. For some reason, calling Waymo's official compute_metrics is sometimes also very slow, but only during detr3d's eval; pointpillars' eval is fine. Cause unknown. Building the gt_database while creating the Waymo data also hangs; it ran for days without finishing.
The machines in Shanghai are fast and never get stuck.

Steps to verify that the environment is correct

To be able to train with peace of mind, the environment has to be proven correct first (I have no idea why mmdet is such a minefield...).
After installing an environment, these are the things to do:

  1. evaluation check: download a checkpoint from the official model zoo and run it on the corresponding dataset, e.g. fcos3d + nuScenes: bash tools/dist_test.sh configs/fcos3d/fcos3d_r101_caffe_fpn_gn-head_dcn_2x8_1x_nus-mono3d.py ckpts/xxx.pth 4 --eval=bbox
  2. training procedure check: download the fcos3d pretrained weights provided by detr3d, then run bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask.py 4 and compare the results against the log published on GitHub.
  3. waymo eval check: bash tools/dist_test.sh configs/pointpillars/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car.py ckpts/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car_20200901_204315-302fc3e7.pth 4 --eval=waymo
  4. waymo training check: use the modified detr3d: bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py 4

I put together a debug dataset from a subset of Waymo (a sketch of one way to build such a subset follows the two checks below). After changing project code or installing a new environment, I run this subset first to quickly catch any obvious bugs:

  1. training procedure check (this also produces a checkpoint for the eval check): bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py 4
  2. evaluation check: bash tools/dist_test.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py work_dirs/detr3d_res101_gridmask_waymo_debug/epoch_1.pth 4 --eval=waymo
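
One plausible way to carve out such a debug subset is to slice the kitti-format info pickles that mmdet3d's Waymo converter generates; the paths, the 100-frame cut-off, and the _debug suffix below are placeholder assumptions, and the debug config's ann_file then has to point at the new .pkl:

# Sketch: build a tiny debug split by slicing the full Waymo info files.
# Paths and the 100-frame cut are illustrative, not the exact subset used here.
import mmcv

for split in ('train', 'val'):
    infos = mmcv.load(f'data/waymo/kitti_format/waymo_infos_{split}.pkl')
    mmcv.dump(infos[:100],
              f'data/waymo/kitti_format/waymo_infos_{split}_debug.pkl')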

If mmdet3d itself fails these checks, it is beyond saving; reinstall. If mmdet3d is fine and only detr3d fails, then the bug is in your own code and you need to go find it.

misc

  • Setting a sample stride for evaluation does not really make sense; Waymo's official compute_metrics does not seem to support doing that...

  • The eval results of pointpillars on waymo_subset are given below; they can be used for a fast check without running the full dataset:

  • {'Vehicle/L1 mAP': 0.00619617, 'Vehicle/L1 mAPH': 0.00615365, 'Vehicle/L2 mAP': 0.00528095, 'Vehicle/L2 mAPH': 0.00524471, 'Pedestrian/L1 mAP': 0.0, 'Pedestrian/L1 mAPH': 0.0, 'Pedestrian/L2 mAP': 0.0, 'Pedestrian/L2 mAPH': 0.0, 'Sign/L1 mAP': 0.0, 'Sign/L1 mAPH': 0.0, 'Sign/L2 mAP': 0.0, 'Sign/L2 mAPH': 0.0, 'Cyclist/L1 mAP': 0.0, 'Cyclist/L1 mAPH': 0.0, 'Cyclist/L2 mAP': 0.0, 'Cyclist/L2 mAPH': 0.0, 'Overall/L1 mAP': 0.00206539, 'Overall/L1 mAPH': 0.0020512166666666666, 'Overall/L2 mAP': 0.0017603166666666668, 'Overall/L2 mAPH': 0.0017482366666666666}

Source: ZLTJohn