Yes AgitatedDove14
Interesting - I can reproduce easily
mmmh good point actually, I didn't think about it
because at some point it introduces too much overhead I guess
AgitatedDove14 This seems to be consistent even if I specify the absolute path to /home/user/trains.conf
Thanks AgitatedDove14 !
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command? trains-agent daemon --gpus 0,1 --queue default &
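One way to check this directly is to print the variable from inside a task that the agent runs. A minimal sketch: the helper name `visible_gpus` is made up for illustration, and what the agent actually sets is not confirmed here.

```python
import os


def visible_gpus(env=None):
    """Return the GPU ids listed in NVIDIA_VISIBLE_DEVICES as a list of strings.

    Returns None when the variable is not set at all.
    """
    env = os.environ if env is None else env
    raw = env.get("NVIDIA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Split on commas and drop empty entries such as a trailing comma.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    # Run this inside the task to see what the process really receives.
    print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
    print("parsed:", visible_gpus())
```

Passing a plain dict instead of `os.environ` makes it easy to try the parsing without touching the real environment.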
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: When I passed the aws creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
They are, but this doesn't work - I guess it's because temp IAM accesses have an extra token, that should be passed as well, but there is no such option on the web UI, right?
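For reference, temporary credentials are usually a triple: key, secret and session token. A minimal sketch of collecting all three from the standard AWS environment variables; the helper name `aws_credentials_from_env` is hypothetical, and this only shows what boto3 itself accepts, not how the web UI or trains.conf handles it.

```python
import os


def aws_credentials_from_env(env=None):
    """Collect AWS credentials, including the optional session token."""
    env = os.environ if env is None else env
    creds = {
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
    }
    # Temporary (STS / assumed-role) credentials carry an extra token.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        creds["aws_session_token"] = token
    return creds


# Usage with boto3 (not imported here so the sketch runs without it):
#   import boto3
#   session = boto3.session.Session(**aws_credentials_from_env())
```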
` resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
}
}
queues {
aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libx...
Sorry, it's actually task.update_requirements(["."])
Ok so it seems that the single quote is the reason, using double quotes works
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
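For context, a sketch of why doubling matters, assuming the text in question goes through Python's `str.format` (I have not checked that this is what aws_auto_scaler.py does). The function `render_queue` and the template are invented for illustration.

```python
def render_queue(queue, resource, limit):
    """Fill a config-like template whose literal braces are doubled."""
    # "{{" and "}}" become literal "{" and "}" after format();
    # single braces mark the fields to substitute.
    template = 'queues {{\n    {queue} = [["{resource}", {limit}]]\n}}'
    return template.format(queue=queue, resource=resource, limit=limit)


if __name__ == "__main__":
    print(render_queue("aws_a100", "A100", 15))
```

With single braces around the block, `format()` would try to treat the whole body as a field name and fail.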
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheels were used. Is this correct? If not, then does that mean that one should use conda to install the correct cuda/cudnn cudatoolkit?
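A quick way to see which CUDA/cuDNN versions the installed torch build reports, independent of what is installed system-wide. A minimal sketch: `torch_cuda_info` is a made-up helper, and it only reports versions, it does not settle which libraries end up loaded at runtime.

```python
def torch_cuda_info():
    """Report the CUDA and cuDNN versions the installed torch build was made for."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return {"torch": None, "cuda": None, "cudnn": None}
    return {
        "torch": torch.__version__,
        # None for CPU-only builds.
        "cuda": torch.version.cuda,
        # None when cuDNN is not available to this build.
        "cudnn": torch.backends.cudnn.version(),
    }


if __name__ == "__main__":
    print(torch_cuda_info())
```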
and just run the same code I run in production
Sorry, I was actually able to fix it (using 1.1.3), not sure what the problem was
Awesome, thanks WackyRabbit7 , AgitatedDove14 !
Hi SuccessfulKoala55 , How can I know if I am logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field
Thanks a lot for the solution SuccessfulKoala55 ! I'll try that if the solution "delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back" fails
CostlyOstrich36 , this also happens with clearml-agent 1.1.1 on an aws instance...
I see 3 agents in the "Workers" tab
so most likely one hard requirement installs the version 2 of pyjwt while setting up the experiment
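To confirm which version actually ended up in the environment, something like this can be run inside the experiment. A minimal sketch using only the standard library; `installed_major` is a hypothetical helper name.

```python
from importlib import metadata


def installed_major(package):
    """Return the major version of an installed package, or None if absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    try:
        return int(version.split(".")[0])
    except ValueError:
        # Unusual version string; report it as unknown rather than guess.
        return None


if __name__ == "__main__":
    print("PyJWT major version:", installed_major("PyJWT"))
```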
I am trying to upload an artifact during the execution