Model Training Notes

Notes on tools I commonly use when training models.

TensorBoard

TensorBoard is a web UI for monitoring the training process.

Installation

pip install tensorboard

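Launching

Locate the training log directory, that is, the one containing a file named like events.out.tfevents.xxxx.xxx.xxx.x, and pass that directory to --logdir: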

tensorboard --logdir=log/xxxx --port=7861

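This starts a local server. Open the URL it prints (http://localhost:7861/ with the port above) to view the current training metrics.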
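When a log root holds many runs, finding the directory to pass to --logdir can be scripted. The following is a minimal sketch, assuming event files follow the events.out.tfevents.* naming shown above; the helper name find_event_dirs and the "log" root are hypothetical.

```python
from pathlib import Path


def find_event_dirs(log_root):
    """Return the sorted directories under log_root that contain TensorBoard event files."""
    root = Path(log_root)
    # Event files are named like events.out.tfevents.<...>; search recursively.
    dirs = {p.parent for p in root.rglob("events.out.tfevents.*") if p.is_file()}
    return sorted(dirs)


if __name__ == "__main__":
    # "log" is a placeholder for your own training log root.
    for d in find_event_dirs("log"):
        print(f"tensorboard --logdir={d} --port=7861")
```

Each printed line is a command that can be pasted into a shell to inspect a single run.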

Usage

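import numpy as np
from torch.utils.tensorboard import SummaryWriter

# save_tensorboard_path, valid_loss, loss_running, args, profile_times and
# train_step are defined by the surrounding training loop.
train_writer = SummaryWriter(log_dir=save_tensorboard_path)
# Tags of the form "group/name" are grouped into sections in the TensorBoard UI.
train_writer.add_scalar('valid/mse_loss', np.mean(valid_loss), train_step)
# Average over the last log_interval * 5 training losses to smooth the curve.
train_writer.add_scalar('train/mse_loss', np.mean(loss_running[-args.log_interval*5:]), train_step)
train_writer.add_scalar('profile/io_time', profile_times['io'], train_step)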
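The train/mse_loss scalar above is a mean over the most recent log_interval * 5 losses rather than a single step's loss, which smooths the plotted curve. Below is a stdlib-only sketch of that windowing; recent_mean is a hypothetical helper and the loss values are made up for illustration.

```python
from statistics import fmean


def recent_mean(losses, log_interval, num_intervals=5):
    """Average the last log_interval * num_intervals entries of losses."""
    if not losses:
        raise ValueError("losses is empty")
    window = log_interval * num_intervals
    # Negative-index slicing returns the whole list when it is shorter than the window.
    return fmean(losses[-window:])


if __name__ == "__main__":
    loss_running = [4.0, 3.0, 2.0, 1.0]  # illustrative values only
    # Window of 1 * 2 = 2 entries: the mean of [2.0, 1.0].
    print(recent_mean(loss_running, log_interval=1, num_intervals=2))
```

Early in training, when fewer losses than the window size have been recorded, the slice simply covers everything logged so far.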

accelerate

A multi-node, multi-GPU training framework from Hugging Face, similar to torchrun.

Launch script

train.sh

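# Single machine, multiple GPUs (4 GPUs)
export CUDA_VISIBLE_DEVICES='0,1,2,3'
# A value of 0 leaves InfiniBand and peer-to-peer transport enabled
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_DEBUG=INFO
# MLP_WORKER_NUM and MLP_WORKER_GPU are expected to be provided by the environment
export NUM_PROCESSES=${MLP_WORKER_NUM}
export NPROC_PER_NODE=${MLP_WORKER_GPU}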

accelerate launch \
--config_file deepspeed.config \
--multi_gpu \
train.py

deepspeed.config

distributed_type: DEEPSPEED
fsdp_config: {}
num_processes: 4
num_machines: 1
mixed_precision: 'fp16'
use_cpu: false
machine_rank: 0
main_training_function: main
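A mismatch between num_processes in the config and the number of GPUs exposed through CUDA_VISIBLE_DEVICES is an easy mistake to make. The check below is a sketch and not part of accelerate. It assumes one process per visible GPU and that num_processes counts processes across all machines; check_process_count is a hypothetical helper, and the values are copied by hand from the two files above.

```python
def visible_gpu_count(cuda_visible_devices):
    """Count the device entries in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_process_count(num_processes, num_machines, cuda_visible_devices):
    """Return True when each machine would run one process per visible GPU."""
    per_machine, remainder = divmod(num_processes, num_machines)
    return remainder == 0 and per_machine == visible_gpu_count(cuda_visible_devices)


if __name__ == "__main__":
    # num_processes: 4, num_machines: 1, CUDA_VISIBLE_DEVICES='0,1,2,3'
    print(check_process_count(4, 1, "0,1,2,3"))
```

Running a check like this before launching is cheaper than discovering the mismatch from NCCL logs.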
