Diffusion
Diffusion Principles
The goal of a generative model: given a set of data, build a distribution from which new data can be generated.
One idea is to start from a simple distribution (e.g. a Gaussian) and transform it into the target distribution.
Diffusion models are exactly such a framework: they break one hard sampling problem into a sequence of simple sampling steps. The core idea is that learning to reverse many small intermediate steps is easier than learning the whole transformation at once.
Gaussian Diffusion
Take a random variable $x_0$ that follows the target distribution (still unknown at this point) and add a sequence of independent Gaussian noise terms to it. This is called the forward process:
$$
x_{t+1}=x_t + \eta_t ,\quad \eta_t \sim N(0, \sigma^2)
$$
Observe that after a very large number of steps, the marginal distribution $p_t$ of $x_t$ becomes extremely close to a Gaussian. We approximate it as a Gaussian, which can be sampled directly.
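One way to see this (assuming the data distribution has bounded mean and variance): conditioned on $x_0$, the accumulated noise is itself Gaussian,
$$
x_T = x_0 + \sum_{t=0}^{T-1}\eta_t
\quad\Longrightarrow\quad
x_T \mid x_0 \;\sim\; N\!\left(x_0,\; T\sigma^2\right),
$$
so once $T\sigma^2$ is much larger than the scale of the data, the marginal $p_T$ is approximately $N(0,\,T\sigma^2)$.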
We then break the task into individual steps: given the marginal distribution $p_t$, generate the distribution $p_{t-1}$. Such a step is called a reverse sampler. With a reverse sampler, we can start from Gaussian noise and diffuse back step by step to the original distribution $p_0$.
DDPM
A common way to construct a reverse sampler is DDPM: at step $t$, take an input $z$ distributed according to $p_t$ and output a value drawn from the conditional distribution
$$
p(x_{t-1}|x_t=z)
$$
Learning a full conditional distribution for every value of $x_t$ would be far too complex.
We assume that when the per-step noise $\sigma$ is very small, the conditional distribution at each step is itself Gaussian, i.e.
$$
p(x_{t-1}|x_t=z) \approx N(x_{t-1};\mu, \sigma^2)
$$
By reducing the conditional distribution to a Gaussian, we have essentially fixed its shape; all that remains is to obtain its mean, and then we have the whole distribution.
This mean can be learned with a neural network via regression.
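Written out (a standard formulation, not tied to any particular implementation), the mean to be learned is a conditional expectation, so it is the solution of a plain least-squares regression on pairs produced by the forward process:
$$
\mu(z, t) = \mathbb{E}\left[\,x_{t-1} \mid x_t = z\,\right],
\qquad
\min_\theta \; \mathbb{E}_{x_0,\,\eta}\left\| \mu_\theta(x_t, t) - x_{t-1} \right\|^2 ,
$$
where the pairs $(x_{t-1}, x_t)$ come from running the forward process on training data. In practice DDPM usually parameterizes the network to predict the added noise rather than the mean, which is an equivalent regression target.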
Flow Matching
SD3 uses Flow Matching in place of DDPM.
Diffusion models can be viewed as a special case of Flow Matching.
Flow Matching trains the model by matching its vector field to a target vector field; the learned vector field then transports a simple distribution into the complex target distribution.
Flow: the time-indexed family of transformations obtained by integrating a time-dependent vector field.
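As a concrete instance (the linear-interpolation / rectified-flow path used by SD3; conventions for the direction of $t$ differ between papers, here $t=0$ is noise and $t=1$ is data):
$$
x_t = (1-t)\,x_0 + t\,x_1,\quad x_0 \sim N(0, I),\; x_1 \sim p_{\text{data}},
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 .
$$
After training, sampling integrates the ODE $\dot{x}_t = v_\theta(x_t, t)$ from noise at $t=0$ to data at $t=1$.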
SDXL
Architecture
Both SD1.5 and SDXL are UNet-based base models.
SDXL is a two-stage cascaded diffusion model consisting of a Base model and a Refiner model. The prompt passes through the Base model to produce image latents; the Refiner further denoises the latents and improves detail; finally the VAE Decoder decodes the latents into an image.
The Base model is a complete image-generation model on its own and supports T2I, I2I, and Inpainting.
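A minimal sketch of this Base → Refiner hand-off with diffusers (the refiner model id and the denoising_end / denoising_start split follow the public SDXL examples; the numbers are just reasonable defaults):

```python
import torch
from diffusers import DiffusionPipeline

# Base produces latents; the Refiner finishes denoising and its VAE decodes the result.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share the second text encoder and the VAE with the base
    vae=base.vae,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a castle on a hill, sunset"
# Base handles the first ~80% of the denoising schedule and returns latents instead of an image.
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
# Refiner picks up at the same point in the schedule and decodes the final image.
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("base_refiner.jpg")
```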
- UNet: on top of an Encoder-Decoder structure, it adds Time Embedding, Cross-Attention, and Self-Attention.
- VAE: the Encoder maps images to latents, the Decoder maps latents back to images.
- Two CLIP Text Encoders.
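To see these pieces concretely, you can load the Base pipeline and print its components (a quick check using the diffusers pipeline attributes):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

print(type(pipe.unet).__name__)            # UNet denoiser (time embedding + self-/cross-attention)
print(type(pipe.vae).__name__)             # VAE (encoder: image -> latents, decoder: latents -> image)
print(type(pipe.text_encoder).__name__)    # first CLIP text encoder
print(type(pipe.text_encoder_2).__name__)  # second CLIP text encoder
```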
Image Generation
T2I
```python
import torch
from diffusers import DiffusionPipeline

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")

prompt = "a blue hair girl"
image = pipe(prompt, num_inference_steps=45, guidance_scale=7.5, height=1024, width=1024).images[0]
image.save("output.jpg")
```
T2I LoRA
A LoRA can change the model's art style.
```python
import torch
from diffusers import DiffusionPipeline

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("sd-gbf-lora")

prompt = "a blue hair girl"
lora_scale = 0.9
image = pipe(
    prompt,
    num_inference_steps=45,
    guidance_scale=7.5,
    cross_attention_kwargs={"scale": lora_scale},
    height=1024,
    width=1024,
).images[0]
image.save("output.jpg")
```
I2I LoRA
Convert an image into the LoRA's art style.
```python
import torch
from PIL import Image
from diffusers import StableDiffusionXLImg2ImgPipeline

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("sd-gbf-lora")

input_image_path = "examples/lubi.jpg"
input_image = Image.open(input_image_path).convert("RGB")

prompt = "gbfhero"
negative_prompt = "low quality, bad quality"

with torch.no_grad():
    output_image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=7.5,
        cross_attention_kwargs={"scale": 0.9},
        height=1024,
        width=1024,
        image=input_image,
        strength=0.5,
    ).images[0]
output_image.save("outputs/1.jpg")
```
I2I LoRA ControlNet
Using I2I with a LoRA on its own does not work very well: control over the source image is weak. It works better combined with a ControlNet.
```python
import os
import cv2
import torch
import numpy as np
from PIL import Image
from diffusers import StableDiffusionXLControlNetImg2ImgPipeline, ControlNetModel

output_folder = "outputs"
os.makedirs(output_folder, exist_ok=True)

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_id = "diffusers/controlnet-canny-sdxl-1.0"
controlnet = ControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
    pipe_id, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("sd-gbf-lora3")

# Build the Canny edge map used as the ControlNet conditioning image
input_image_path = "examples/leishen.jpeg"
input_image = Image.open(input_image_path).convert("RGB")
np_image = np.array(input_image)
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)
canny_image.save(f"{output_folder}/tmp_edge.png")

prompt = "gbfhero, clean background"
negative_prompt = "low quality, bad quality"
lora_scale = 0.9
image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    cross_attention_kwargs={"scale": lora_scale},
    controlnet_conditioning_scale=0.5,
    image=input_image,
    strength=0.9,
    control_image=canny_image,
    height=1024,
    width=1024,
).images[0]
image.save(f"{output_folder}/5.jpg")
```
SDXL LoRA Training
See train_text_to_image_lora_sdxl.py.
It takes a few hundred to a few thousand images and over a thousand training steps to get a good LoRA.
TensorBoard
TensorBoard is a UI for monitoring the training process.
Find the training log folder; it contains a file named like events.out.tfevents.xxxx.xxx.xxx.x. Then run:
```bash
tensorboard --logdir=log/xxxx --port=7861
```
This starts a service; open the link it prints to see the current training status.
```python
import numpy as np
from torch.utils.tensorboard import SummaryWriter

train_writer = SummaryWriter(log_dir=save_tensorboard_path)
train_writer.add_scalar('valid/mse_loss', np.mean(valid_loss), train_step)
train_writer.add_scalar('train/mse_loss', np.mean(loss_running[-args.log_interval * 5:]), train_step)
train_writer.add_scalar('profile/io_time', profile_times['io'], train_step)
```
accelerate
A multi-node, multi-GPU training launcher from Hugging Face, similar to torchrun.
train.sh
```bash
export CUDA_VISIBLE_DEVICES='0,1,2,3'
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_DEBUG=INFO
export NUM_PROCESSES=${MLP_WORKER_NUM}
export NPROC_PER_NODE=${MLP_WORKER_GPU}

accelerate launch \
    --config_file deepspeed.config \
    --multi_gpu \
    train.py
```
deepspeed.config
```yaml
distributed_type: DEEPSPEED
fsdp_config: {}
num_processes: 4
num_machines: 1
mixed_precision: 'fp16'
use_cpu: false
machine_rank: 0
main_training_function: main
```
DreamBooth LoRA Training
This is the more recommended approach. DreamBooth uses rare-token identifiers: it maps the instance prompt to a rarer region of token space, for example by inserting extra characters into the English text so that "A dog" becomes "A [V] dog". Such a prompt lands on rarer tokenizer positions, is less affected by what ordinary prompts have already trained into the model, and therefore picks up the new concept more easily.
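To get a feel for this, you can compare how a CLIP tokenizer (the same family SDXL uses) splits an ordinary prompt versus one with a rare identifier; this quick check is not part of the training script:

```python
from transformers import CLIPTokenizer

# Any CLIP tokenizer works for this check; this is the one used by SD-family text encoders.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("a photo of a dog"))    # common words map to frequently used tokens
print(tokenizer.tokenize("a photo of sks dog"))  # the identifier lands on rarely used token(s)
```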
See train_dreambooth_lora_sdxl.py. With 3-5 images and a single shared prompt you can already get good results.
```bash
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch \
  --mixed_precision="fp16" \
  train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0"
```
- INSTANCE_DIR: the folder containing the training images; it should contain only images.
- instance_prompt: a description of the subject shown in these images.
Fixes
train_dreambooth_lora_sdxl.py has a few problems when training in fp16. For example, the .to(device) call in the log_validation function should be changed so that no dtype is set when using fp16:
```python
if args.mixed_precision == "fp16":
    pipeline = pipeline.to(accelerator.device)
else:
    pipeline = pipeline.to(accelerator.device, dtype=torch_dtype)
```
Also, when the Dataset loads the instance data folder, it loads files of every type; it should only load image files:
```python
filenames = sorted(os.listdir(instance_data_root))
filenames = list(filter(lambda file: file.endswith((".jpeg", ".png", ".jpg")), filenames))
filenames = [os.path.join(instance_data_root, name) for name in filenames]
instance_images = [Image.open(path) for path in filenames]
```
Delete this line:

```python
check_min_version("0.33.0.dev0")
```
Flux
Flux is currently the best image-generation model, though its ecosystem is not yet fully mature.
Flux and SD3 use the DiT (Diffusion Transformer) architecture.
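You can confirm this from the pipeline itself: in place of SDXL's UNet, the Flux pipeline exposes a transformer (a quick check; attribute names as in diffusers):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
print(type(pipe.transformer).__name__)     # the DiT backbone that replaces SDXL's UNet
print(type(pipe.text_encoder).__name__)    # CLIP text encoder
print(type(pipe.text_encoder_2).__name__)  # T5 text encoder
```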
Image Generation
```python
import os
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("trained-flux-lora-gbf")

prompt = "gbfhero, a blue hair girl with a sword, gorgeous background, swimwear"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
).images[0]
os.makedirs("outputs", exist_ok=True)
image.save("outputs/1.png")
```
ControlNet LoRA
```python
import torch
from diffusers import FluxControlNetImg2ImgPipeline, FluxControlNetModel

controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16)
pipe = FluxControlNetImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("trained-flux-lora-gbf")

...

prompt = "gbf hero"
image = pipe(
    prompt,
    guidance_scale=3.5,
    image=input_image,
    strength=0.99,
    control_image=canny_image,
    control_guidance_start=0.2,
    control_guidance_end=0.8,
    controlnet_conditioning_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save(f"{output_folder}/c1.png")
```
DreamBooth LoRA Training
The Flux model is very large, so training easily runs out of GPU memory and RAM.
See train_dreambooth_lora_flux.py.
```bash
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export INSTANCE_DIR="gbf"
export OUTPUT_DIR="trained-flux-lora"

accelerate launch \
  --mixed_precision="bf16" \
  train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --instance_prompt="gbf hero" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of gbf hero" \
  --validation_epochs=25 \
  --seed="0"
```
The training script also has problems: when training with bf16, the following line in log_validation needs to be replaced:

```python
autocast_ctx = torch.autocast(accelerator.device.type)
```