Answered

Hello everyone,

Context:
I am currently facing a headache-inducing issue with integrating Flash Attention 2 into LLM training.
I launch a Python script locally, which then runs remotely. Without Flash Attention, the code runs fine: it fetches data, trains models, etc.
For the Flash Attention integration, I carefully followed the installation steps from the GitHub repo (and I am fairly confident that part is OK). The remote instance on which the code runs is an AWS EC2 instance. The venv is built with pip at /root/.clearml/venvs-builds/3.9 .

Issue:
At some point during the task run, it fails with the following error:

  File "/root/.clearml/venvs-builds/3.9/task_repository/....git/...", line 252, in fit
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3233, in from_pretrained
    config = cls._check_and_enable_flash_attn_2(config, torch_dtype=torch_dtype, device_map=device_map)
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1273, in _check_and_enable_flash_attn_2
    raise ImportError(
ImportError: Flash Attention 2 is not available. Please refer to the documentation of None for installing it. Make sure to have at least the version 2.1.0
2023-11-08 21:48:05
Process failed, exit code 1

However, the installation of the flash_attn package worked:

Successfully installed MarkupSafe-2.1.3 einops-0.7.0 filelock-3.13.1 flash-attn-2.3.3 fsspec-2023.10.0 jinja2-3.1.2 mpmath-1.3.0 networkx-3.2.1 ninja-1.11.1.1 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.52 nvidia-nvtx-cu12-12.1.105 packaging-23.2 sympy-1.12 torch-2.1.0 triton-2.1.0 typing-extensions-4.8.0

The package is installed after the Task initialization (and before the actual training script runs) using this small snippet of code:
import subprocess

# Upgrade pip, then install flash-attn with the venv's own interpreter.
install_command = (
    '/root/.clearml/venvs-builds/3.9/bin/python -m pip install --upgrade pip && /root/.clearml/venvs-builds/3.9/bin/python -m pip install flash-attn --no-build-isolation'
)
# shell=True is needed because the two commands are chained with &&.
subprocess.run(install_command, shell=True)
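
A slightly more defensive variant of this snippet, as a sketch only. It assumes the task process itself runs under the venv's interpreter (so that `sys.executable` points at it; otherwise pass `python=` explicitly), and the helper names `build_install_command` and `install_and_refresh` are made up for illustration. It raises if pip fails and refreshes Python's import caches after installing. It does not address any caching done by the agent.

```python
import importlib
import subprocess
import sys


def build_install_command(package, *extra_args, python=sys.executable):
    """Return the pip command as an argument list (no shell needed)."""
    return [python, "-m", "pip", "install", package, *extra_args]


def install_and_refresh(package, *extra_args, python=sys.executable):
    """Install `package` with the given interpreter, then refresh import caches."""
    command = build_install_command(package, *extra_args, python=python)
    # check=True raises CalledProcessError if pip exits with a non-zero status.
    subprocess.run(command, check=True)
    # Let the import system notice files created while this process was running.
    importlib.invalidate_caches()


# Example call (not executed here):
# install_and_refresh("flash-attn", "--no-build-isolation")
```

Passing the command as a list avoids `shell=True`, and `check=True` makes a failed install stop the task immediately instead of surfacing later as an ImportError.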

At that point, I was very confused. I then started another EC2 instance and got to the same point (loading an LLM with Flash Attention 2). I connected to the running Docker container using the Dev Containers VS Code extension. When I ran the same piece of code there, pointing the command at the same venv, it worked.

Conclusion:
I am thus extremely confused: the task fails at a specific part of my training script, while running that same portion of the script inside the Docker container works. Does anyone have any idea?

My first guess was that the package was installed into an incorrect location (another venv, etc.). However, when I uninstalled the package, the code running in the dev container failed too, which suggests to me that the installation itself was done correctly.
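
One way to test the "wrong location" guess directly is to print, from inside the failing task and from the dev container, which interpreter is running and where it would import the module from, then compare the two outputs. This is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper name, and the module and distribution names in the example call are assumed from the pip output in the question.

```python
import importlib.metadata
import importlib.util
import sys


def describe_environment(module_name, dist_name):
    """Collect where this process runs and where it would import the module from."""
    # find_spec locates the module without importing it; None means not found.
    spec = importlib.util.find_spec(module_name)
    try:
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        version = None
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "module_origin": spec.origin if spec else None,
        "dist_version": version,
    }


# Example call (run it in both places, then compare the results):
# print(describe_environment("flash_attn", "flash-attn"))
```

If `executable` or `prefix` differ between the two runs, or `module_origin` is `None` only in the task, the two processes are not seeing the same environment.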

I know the use case is very specific, but any help would be greatly appreciated 🙂
Thank you,

  
  
Posted 8 months ago

Answers 3


Hi @<1523701087100473344:profile|SuccessfulKoala55> , the EC2 instance is spun up by the AWS autoscaler provided by ClearML. I use the following Docker image: nvidia/cuda:11.8.0-devel-ubuntu20.0

So the EC2 instance runs a Docker container.

  
  
Posted 8 months ago

Hi @<1556812486840160256:profile|SuccessfulRaven86> , how exactly are you running the code remotely? Is this a daemon agent running on that EC2 instance?

  
  
Posted 8 months ago

It is due to ClearML's caching mechanism. Is there a Python command to update the venvs-cache?

  
  
Posted 8 months ago