Model Training Notes
A record of things I commonly use when training models.
TensorBoard
TensorBoard is a web UI for monitoring the training process.
Installation
Launch
Locate the training log folder, which should contain a file named like events.out.tfevents.xxxx.xxx.xxx.x, then run
tensorboard --logdir=log/xxxx --port=7861
This starts a local service; open the URL it prints to view the current training information.
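When the log folder holds several runs, it can help to list which subfolders actually contain event files before choosing a value for --logdir. The following is a minimal sketch using only the standard library; the function name find_event_dirs and the run names in the demo are made up for illustration and are not part of TensorBoard.

```python
import tempfile
from pathlib import Path


def find_event_dirs(log_root):
    """Return the sorted directories under log_root that hold TensorBoard event files."""
    root = Path(log_root)
    # Event files follow the events.out.tfevents.* naming pattern described above.
    dirs = {p.parent for p in root.rglob("events.out.tfevents.*") if p.is_file()}
    return sorted(str(d) for d in dirs)


if __name__ == "__main__":
    # Demo on a throwaway tree; "run_a" and the event file name are hypothetical.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "run_a"
        run_dir.mkdir()
        (run_dir / "events.out.tfevents.0000.host.0.0").touch()
        for d in find_event_dirs(tmp):
            print(d)  # each printed path is a candidate for --logdir
```

Any directory it returns can be passed to tensorboard --logdir.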
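Usage
import numpy as np
from torch.utils.tensorboard import SummaryWriter

# save_tensorboard_path, valid_loss, loss_running, args, profile_times and
# train_step all come from the surrounding training loop.
train_writer = SummaryWriter(log_dir=save_tensorboard_path)
# Mean validation loss.
train_writer.add_scalar('valid/mse_loss', np.mean(valid_loss), train_step)
# Training loss averaged over the most recent log_interval * 5 entries.
train_writer.add_scalar('train/mse_loss', np.mean(loss_running[-args.log_interval*5:]), train_step)
# I/O time taken from the profiling timers.
train_writer.add_scalar('profile/io_time', profile_times['io'], train_step)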
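The train/mse_loss scalar above is a windowed mean over the last log_interval * 5 recorded losses. The sketch below isolates that step so it can be run without PyTorch. RecordingWriter and log_train_loss are hypothetical names introduced here; RecordingWriter only mimics the add_scalar(tag, scalar_value, global_step) call, and in real training a SummaryWriter would be passed instead.

```python
from statistics import fmean


class RecordingWriter:
    """Hypothetical stand-in that records add_scalar calls instead of writing event files."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, scalar_value, global_step=None):
        self.records.append((tag, scalar_value, global_step))


def log_train_loss(writer, loss_running, log_interval, train_step):
    """Log the mean of the most recent log_interval * 5 losses under train/mse_loss."""
    # Negative slicing keeps only the tail; a shorter list is used in full.
    window = loss_running[-log_interval * 5:]
    smoothed = fmean(window)  # raises StatisticsError if no loss has been recorded yet
    writer.add_scalar("train/mse_loss", smoothed, train_step)
    return smoothed


if __name__ == "__main__":
    writer = RecordingWriter()  # in real training: SummaryWriter(log_dir=save_tensorboard_path)
    losses = [10.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    log_train_loss(writer, losses, log_interval=1, train_step=7)
    print(writer.records)
```

Averaging over a window rather than logging the latest loss alone gives a smoother curve in the TensorBoard UI, at the cost of lagging behind sudden changes.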
accelerate
A multi-node, multi-GPU training launcher from Hugging Face, similar to torchrun.
Launching from a script
train.sh
export CUDA_VISIBLE_DEVICES='0,1,2,3'
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_DEBUG=INFO
export NUM_PROCESSES=${MLP_WORKER_NUM}
export NPROC_PER_NODE=${MLP_WORKER_GPU}

accelerate launch \
  --config_file deepspeed.config \
  --multi_gpu \
  train.py
deepspeed.config
distributed_type: DEEPSPEED
fsdp_config: {}
num_processes: 4
num_machines: 1
mixed_precision: 'fp16'
use_cpu: false
machine_rank: 0
main_training_function: main
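In the setup above, CUDA_VISIBLE_DEVICES lists four GPUs and the config sets num_processes: 4 with num_machines: 1, so the two files agree. A small sanity check along these lines can catch a mismatch after editing one file but not the other. This is only a sketch: it assumes one process per GPU and the same GPU count on every machine, and the function names are hypothetical rather than part of accelerate.

```python
def visible_gpu_count(cuda_visible_devices):
    """Count the device ids in a CUDA_VISIBLE_DEVICES string such as '0,1,2,3'."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def processes_match_gpus(cuda_visible_devices, num_processes, num_machines=1):
    """Return True when num_processes equals visible GPUs per machine times num_machines.

    Assumes one process per GPU and identical GPU counts on every machine.
    """
    return num_processes == visible_gpu_count(cuda_visible_devices) * num_machines


if __name__ == "__main__":
    # Values copied from train.sh and deepspeed.config above.
    print(processes_match_gpus("0,1,2,3", num_processes=4, num_machines=1))
```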