Log while running on Lambda Cloud - SoojungHong/Riding_LLaMA-and-Fine-Tuning GitHub Wiki
- First experiment

Instance I used:

1 x A10 (24 GB)
30 CPU cores
205.4 GB RAM
1.5 TB SSD
$0.75/hr
ubuntu@138-2-225-200:~$ python llama_fine_tuning_in_memory.py
2024-07-04 21:22:18.736996: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-07-04 21:22:18.779125: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX512F AVX512_VNNI, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.25.2)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
================================================================================
Your GPU supports bfloat16: accelerate training with bf16=True
================================================================================
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 635/635 [00:00<00:00, 9.58MB/s]
model.safetensors.index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66.7k/66.7k [00:00<00:00, 343MB/s]
model-00001-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.85G/9.85G [03:41<00:00, 44.4MB/s]
model-00002-of-00015.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [00:37<00:00, 258MB/s]
model-00003-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.97G/9.97G [03:14<00:00, 51.4MB/s]
model-00004-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [04:01<00:00, 40.5MB/s]
model-00005-of-00015.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [00:37<00:00, 260MB/s]
model-00006-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [03:36<00:00, 45.3MB/s]
model-00007-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.97G/9.97G [03:12<00:00, 51.8MB/s]
model-00008-of-00015.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [00:37<00:00, 260MB/s]
model-00009-of-00015.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [00:38<00:00, 257MB/s]
model-00010-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [03:07<00:00, 52.2MB/s]
model-00011-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.97G/9.97G [04:17<00:00, 38.7MB/s]
model-00012-of-00015.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [00:37<00:00, 260MB/s]
model-00013-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.80G/9.80G [03:27<00:00, 47.1MB/s]
model-00014-of-00015.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.50G/9.50G [04:20<00:00, 36.4MB/s]
model-00015-of-00015.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 524M/524M [00:09<00:00, 56.4MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [36:27<00:00, 145.87s/it]
Traceback (most recent call last):
  File "/home/ubuntu/llama_fine_tuning_in_memory.py", line 67, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3787, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 86, in validate_environment
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
ubuntu@138-2-225-200:~$
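The ValueError asks for a custom `device_map` that explicitly offloads some modules to CPU. A minimal sketch of how such a map could be built, assuming the LLaMA-style module names transformers uses (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`); the `num_layers`/`gpu_layers` values here are purely illustrative, not tuned for the A10:

```python
# Hypothetical sketch: keep the first gpu_layers decoder layers on GPU 0
# and offload the rest (plus the final norm and output head) to CPU.
def make_device_map(num_layers: int, gpu_layers: int) -> dict:
    device_map = {
        "model.embed_tokens": 0,  # embeddings stay on the GPU
        "model.norm": "cpu",      # final norm offloaded to CPU
        "lm_head": "cpu",         # output head offloaded to CPU
    }
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = 0 if i < gpu_layers else "cpu"
    return device_map

device_map = make_device_map(num_layers=80, gpu_layers=20)

# The map would then be passed to from_pretrained together with the offload
# flag named in the error message, roughly like:
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     load_in_8bit=True,
#     load_in_8bit_fp32_cpu_offload=True,
#     device_map=device_map,
# )
```

Offloaded layers run far slower than on-GPU layers, so this only makes the load succeed; it does not make a 24 GB card fast enough for a model of this size.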
- Second experiment

1 x A100 (40 GB SXM4), 30 CPU cores, 205.4 GB RAM

Even with this machine, I still ran out of memory.
- Third experiment

1 x H100 (80 GB PCIe) - $2.49/hour

Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 29.25 GiB (GPU 0; 79.11 GiB total capacity; 49.25 GiB already allocated; 8.31 GiB free; 70.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 0/250 [00:02<?, ?it/s]
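The OOM message itself suggests trying `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` to reduce allocator fragmentation (useful when reserved memory is much larger than allocated memory, as here). A minimal way to set it is at the top of the training script, before torch makes its first CUDA allocation; the 128 MB value below is an illustrative starting point, not a tuned recommendation:

```python
import os

# Must be set before the first CUDA allocation (i.e. before the model or
# any tensor touches the GPU); caps the size of splittable allocator blocks.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

The same can be done from the shell with `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` before running the script. This only mitigates fragmentation; it cannot help if the model genuinely needs more memory than the card has.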
- Fourth experiment
8 x A100 (80 GB) - $14/hour

Error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 29.25 GiB (GPU 0; 79.15 GiB total capacity; 47.66 GiB already allocated; 28.82 GiB free; 49.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I then tried launching the training across multiple GPUs with: python -m torch.distributed.launch --nproc_per_node=4 train.py

But then I got this error:
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example device_map={'':torch.cuda.current_device()} or device_map={'':torch.xpu.current_device()}
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12624 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12625) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/lib/python3/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/lib/python3/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
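The underlying ValueError says an 8-bit model must be loaded on the same device it is trained on, and suggests `device_map={'': torch.cuda.current_device()}`. Under `torch.distributed.launch` (or `torchrun`) each worker process receives a `LOCAL_RANK` environment variable, so one way to satisfy this is to build the per-rank map from that variable. A minimal sketch (the actual `from_pretrained` call is shown as a comment since it needs the GPUs and model weights):

```python
import os

# Each distributed worker loads the full 8-bit model onto its own GPU.
# The empty-string key "" means "place every module on this device".
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device_map = {"": local_rank}

# Each rank would then load its own copy, roughly:
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, load_in_8bit=True, device_map=device_map
# )
```

Note this is plain data parallelism: every rank holds a full copy of the model, so it fixes the device-mismatch error but does not reduce per-GPU memory use the way model sharding would.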