Using the Wanvideo Wrapper by Kijai - gordon123/learn2ComfyUI GitHub Wiki

----- Work in progress -----

WanVideo Block Swap

| Parameter | Recommended Value | Description |
| --- | --- | --- |
| `blocks_to_swap` | 20 – 30 (max 40) | Number of transformer blocks offloaded to CPU |
| `offload_img_emb` | `false` (or `true`) | Whether to offload the image embedding to CPU |

`blocks_to_swap` = 20 is the default; a lower value runs faster but uses more VRAM.
Increasing it reduces VRAM use but slows processing.
Set `offload_img_emb` = true if the image embedding is large.
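
To make the trade-off concrete, here is a minimal sketch of how a block-swap plan could be laid out. It is not the wrapper's implementation: `plan_block_placement`, the device labels, and the choice to offload the first blocks are assumptions made for illustration. Only the default of 20 and the ceiling of 40 come from the notes above.

```python
def plan_block_placement(total_blocks: int = 40, blocks_to_swap: int = 20) -> list[str]:
    """Return one device label per transformer block.

    Hypothetical helper: the first `blocks_to_swap` blocks live on the CPU and
    would be moved to the GPU only while they run; the rest stay on the GPU.
    """
    if not 0 <= blocks_to_swap <= total_blocks:
        raise ValueError("blocks_to_swap must be between 0 and total_blocks")
    return ["cpu"] * blocks_to_swap + ["gpu"] * (total_blocks - blocks_to_swap)


plan = plan_block_placement(total_blocks=40, blocks_to_swap=20)
resident = plan.count("gpu")
print(f"{resident} blocks stay in VRAM, {len(plan) - resident} are swapped in on demand")
```

Raising `blocks_to_swap` shrinks the resident set (less VRAM) but adds more CPU-to-GPU transfers per step, which is where the slowdown comes from.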

WanVideo Model Loader

| Parameter | Example Value | Description |
| --- | --- | --- |
| `model` | `Wan2_1-I2V-480P-14B_fp8_*` | Diffusion model file to load (match resolution and weight type) |
| `base_precision` | `fp16` or `bf16` | Base dtype (fp16 can be faster; bf16 uses the same memory but has a wider dynamic range, so it is less prone to overflow) |
| `quantization` | `fp8_e4m3fn` (optionally fp8_scaled) | Preferred quantization; may fall back to `base_precision` |
| `load_device` | `offload_device` | Where the model is initially loaded (GPU or CPU) |
| `attention_mode` | `sageattn` (or `sdpa`) | Attention implementation; use SageAttention if it is installed |
| `compile_args` | e.g. `{"dynamo_cache_size_limit":10000}` | Torch compile tuning parameters |
| `block_swap_args` | Via Block Swap node | Controls swapping of blocks between CPU and GPU during inference |
| `vram_management_args` | Flags `--lowvram`, `--highvram` | Adjusts automatic memory handling behavior |

💡 Considerations and recommended settings: if VRAM is limited, use base_precision=fp16 + quantization=fp8_e4m3fn + attention_mode=sageattn.

The quantization option is shown in the node, but according to a GitHub issue there may be a bug that makes it fall back on its own.

Switch between sdpa and sageattn as needed; if SageAttention is not installed, you must use SDPA instead.

Use compile args such as {"dynamo_cache_size_limit":10000} to tune performance (only if you use torch compile).

If you want a LoRA that lightens the model (LightX2V style), add a LoRA select node and connect it to the loader's input.
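
The low-VRAM recommendation and the SageAttention fallback can be sketched as a small helper. This is illustrative only: the dictionary just mirrors the loader's field names and is not an API of the wrapper, and the check assumes SageAttention is installed as the Python package `sageattention`.

```python
import importlib.util


def pick_attention_mode() -> str:
    """Prefer SageAttention when its package is importable, otherwise use SDPA."""
    # Assumption: SageAttention is distributed as the top-level module `sageattention`.
    has_sage = importlib.util.find_spec("sageattention") is not None
    return "sageattn" if has_sage else "sdpa"


def low_vram_loader_settings() -> dict:
    """Loader values suggested above for limited VRAM (hypothetical container)."""
    return {
        "base_precision": "fp16",
        "quantization": "fp8_e4m3fn",
        "load_device": "offload_device",
        "attention_mode": pick_attention_mode(),
    }


print(low_vram_loader_settings())
```

Because the quantization setting may fall back silently, it is worth checking the loader's console output to confirm which dtype was actually used.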

WanVideo Torch Compile

| Parameter | Value | Description |
| --- | --- | --- |
| `backend` | `"inductor"` | Uses TorchInductor + Triton; balanced speed and stability |
| `fullgraph` | `false` | Does not force a single whole graph, because video workflows often have graph breaks |
| `mode` | `"default"` | Balances performance against memory; safer than the more aggressive modes |
| `dynamic` | `true` | Supports inputs that change often, such as size or frame count, without recompiling too frequently |
| `dynamo_cache_size_limit` | `64` or `128` | Caps the compile cache so memory does not run over; start with `64` and adjust if needed |
| `compile_transformer_blocks_only` | `true` | Compiles only the transformer blocks, which cuts compile time and memory use |
| `dynamo_recompile_limit` | `128` | Number of recompiles allowed before falling back; improves reliability |

🧠 When to adjust which value, for video workflows built on WanVideo:

Place the Torch Compile Settings node before the Model Loader.

Set:

backend → "inductor"

fullgraph → false

dynamic → true if the resolution or length changes

cache size limit → 64 or 128

compile_transformer_blocks_only → true

recompile_limit → 128

The result is that the diffusion model is compiled block by block and then cached, which reduces recompiles on subsequent frames.
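
Outside ComfyUI, the same settings map roughly onto plain `torch.compile` calls. The sketch below is an approximation, not the node's code: `model.blocks` is a hypothetical attribute name for the transformer blocks (assumed to be a `torch.nn.ModuleList`), and mapping the node's cache size field onto `torch._dynamo.config.cache_size_limit` is an assumption.

```python
def compile_settings(variable_shapes: bool = True) -> dict:
    """Keyword arguments for torch.compile, mirroring the values listed above."""
    return {
        "backend": "inductor",
        "fullgraph": False,          # tolerate graph breaks
        "mode": "default",
        "dynamic": variable_shapes,  # True when resolution or length changes
    }


def compile_blocks_only(model, cache_size_limit: int = 64):
    """Compile each transformer block separately instead of the whole model.

    Assumes `model.blocks` is a torch.nn.ModuleList (hypothetical name).
    """
    import torch  # imported lazily so compile_settings() works without PyTorch
    import torch._dynamo

    torch._dynamo.config.cache_size_limit = cache_size_limit
    for i in range(len(model.blocks)):
        model.blocks[i] = torch.compile(model.blocks[i], **compile_settings())
    return model


print(compile_settings())
```

Compiling per block keeps each compiled graph small, so a graph break or a shape change in one place does not force the whole model to recompile.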

🧪 torch.compile backends: inductor vs cudagraphs
🔹 Backend: "inductor"
The default backend in PyTorch 2.x (TorchInductor).
Uses Triton to compile GPU kernels on the fly, so it runs fast with low overhead.
Handles dynamic code well and works for both inference and training (forward and backward).
Source: PyTorch Developer Mailing List

🔹 Backend: "cudagraphs"
Uses CUDA Graph capture and replay for static-graph execution.
Reduces the overhead of launching the same kernels repeatedly, but works only if the graph has no breaks.
In many cases, for models with few kernels, it ends up slower or runs out of memory (OOM) because of the high overhead of managing placeholder parameters.
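
The comparison above can be condensed into a rule of thumb. The helper below is a hypothetical sketch of that rule, not a PyTorch API; it only returns the backend name you would pass to `torch.compile(..., backend=...)`.

```python
def backend_to_try(has_graph_breaks: bool, static_shapes: bool) -> str:
    """Suggest a torch.compile backend name based on the notes above.

    "cudagraphs" is only worth trying when the graph is unbroken and the
    input shapes stay fixed; in every other case "inductor" is the safer pick.
    """
    if static_shapes and not has_graph_breaks:
        return "cudagraphs"
    return "inductor"


# Video workflows usually have graph breaks and varying frame counts.
print(backend_to_try(has_graph_breaks=True, static_shapes=False))
```

Even when the rule suggests "cudagraphs", benchmark it against "inductor" first, since it can be slower or hit OOM on some models.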