Log while running on Lambda Cloud - SoojungHong/Riding_LLaMA-and-Fine-Tuning GitHub Wiki

Machines I rented

1. First experiment

Instance I used:

1 X A10 (24GB)

30 CPU cores

205.4 GB RAM

1.5 TB SSD

$0.75/hr


Error I got:

```
ubuntu@138-2-225-200:~$ python llama_fine_tuning_in_memory.py
2024-07-04 21:22:18.736996: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-07-04 21:22:18.779125: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX512F AVX512_VNNI, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.25.2
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
================================================================================
Your GPU supports bfloat16: accelerate training with bf16=True
================================================================================
config.json: 100%|██████████| 635/635 [00:00<00:00, 9.58MB/s]
model.safetensors.index.json: 100%|██████████| 66.7k/66.7k [00:00<00:00, 343MB/s]
model-00001-of-00015.safetensors: 100%|██████████| 9.85G/9.85G [03:41<00:00, 44.4MB/s]
model-00002-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:37<00:00, 258MB/s]
model-00003-of-00015.safetensors: 100%|██████████| 9.97G/9.97G [03:14<00:00, 51.4MB/s]
model-00004-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [04:01<00:00, 40.5MB/s]
model-00005-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:37<00:00, 260MB/s]
model-00006-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [03:36<00:00, 45.3MB/s]
model-00007-of-00015.safetensors: 100%|██████████| 9.97G/9.97G [03:12<00:00, 51.8MB/s]
model-00008-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:37<00:00, 260MB/s]
model-00009-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:38<00:00, 257MB/s]
model-00010-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [03:07<00:00, 52.2MB/s]
model-00011-of-00015.safetensors: 100%|██████████| 9.97G/9.97G [04:17<00:00, 38.7MB/s]
model-00012-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:37<00:00, 260MB/s]
model-00013-of-00015.safetensors: 100%|██████████| 9.80G/9.80G [03:27<00:00, 47.1MB/s]
model-00014-of-00015.safetensors: 100%|██████████| 9.50G/9.50G [04:20<00:00, 36.4MB/s]
model-00015-of-00015.safetensors: 100%|██████████| 524M/524M [00:09<00:00, 56.4MB/s]
Downloading shards: 100%|██████████| 15/15 [36:27<00:00, 145.87s/it]
Traceback (most recent call last):
  File "/home/ubuntu/llama_fine_tuning_in_memory.py", line 67, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3787, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 86, in validate_environment
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
ubuntu@138-2-225-200:~$
```
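A back-of-envelope check explains this error. The 15 shards in the download log total roughly 140 GB of fp16 weights, which I take to mean a model of roughly 70B parameters (my estimate, not a number from the log). Even quantized to 4 bits, the weights alone need about 35 GB, more than the A10's 24 GB, so accelerate offloads some modules to CPU/disk and bitsandbytes refuses to proceed:

```python
# Rough weight-memory estimate from the shard sizes in the download log.
# The 70B parameter count is an estimate from the ~140 GB of fp16 shards.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the raw weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 70e9  # estimated: ~140 GB of shards / 2 bytes per fp16 param

for bits, label in [(16, "fp16/bf16"), (8, "int8"), (4, "4-bit (nf4)")]:
    print(f"{label:>11}: {weight_memory_gb(N_PARAMS, bits):6.1f} GB of weights")

# Even the 4-bit weights (~35 GB) exceed the A10's 24 GB, so some modules
# get dispatched to the CPU or disk, which triggers the ValueError above.
print(weight_memory_gb(N_PARAMS, 4) > 24)  # → True: does not fit
```

Note this counts only the weights; training additionally needs memory for activations, gradients, and optimizer state, which is why the later, larger GPUs still ran out of memory.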

2. Second experiment

1 X A100 (40 GB SXM4), 30 CPU cores, 205.4 GB RAM

With this machine I still hit the out-of-memory issue.

3. Third experiment

1 X H100 (80 GB PCIe) - $2.49/hour

Error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 29.25 GiB (GPU 0; 79.11 GiB total capacity; 49.25 GiB already allocated; 8.31 GiB free; 70.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|          | 0/250 [00:02<?, ?it/s]
```
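Here reserved memory (70.21 GiB) is far above allocated memory (49.25 GiB), which is the fragmentation pattern the error message mentions. One mitigation it suggests is capping the allocator's split size via `PYTORCH_CUDA_ALLOC_CONF`, set before the first CUDA allocation. A minimal sketch (the value 128 is just an example, not a tuned number):

```python
# Set the CUDA caching-allocator config before torch makes any CUDA
# allocation -- easiest is at the very top of the training script.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])

# Equivalently, from the shell:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py
```

This only helps with fragmentation; it cannot conjure memory the model genuinely needs.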

4. Fourth experiment

8 X A100 (80 GB) - $14/hour

Error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 29.25 GiB (GPU 0; 79.15 GiB total capacity; 47.66 GiB already allocated; 28.82 GiB free; 49.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Then I fixed it with the following launch command:

```
python -m torch.distributed.launch --nproc_per_node=4 train.py
```
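With `--nproc_per_node=4`, the launcher spawns four worker processes (so only four of the eight A100s are used), and each worker is told which GPU is its own through the `LOCAL_RANK` environment variable. A minimal sketch of how a training script picks that up:

```python
# Under torch.distributed.launch / torchrun, each spawned worker gets a
# LOCAL_RANK environment variable identifying its GPU on this node.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # 0 when run standalone
print(f"this worker should use cuda:{local_rank}")

# In the real script, the worker would then pin itself to that device,
# e.g. torch.cuda.set_device(local_rank).
```

Note that `torch.distributed.launch` is deprecated in recent PyTorch in favor of the `torchrun` entry point, which sets the same variables.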

But then I had this error:

```
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12624 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12625) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/lib/python3/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/lib/python3/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
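The ValueError says each distributed worker must load the quantized model entirely onto its own GPU rather than letting accelerate spread it across devices. A minimal sketch of that fix, following the `device_map` form the error message itself suggests (the model name and quantization config below are placeholders for whatever the actual script uses):

```python
# Pin the whole quantized model to this worker's GPU by keying the
# device_map on the worker's local rank.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device_map = {"": local_rank}  # "" = the entire model on one device
print(device_map)

# In the real script (placeholders, not the actual values):
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,                      # placeholder
#     quantization_config=bnb_config,  # placeholder 8-bit/4-bit config
#     device_map=device_map,
# )
```

With this, worker 0 loads onto `cuda:0`, worker 1 onto `cuda:1`, and so on, so no worker trains a model that lives on a different device.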
