because at some point it introduces too much overhead I guess
AgitatedDove14 This seems to be consistent even if I specify the absolute path to /home/user/trains.conf
Some more context: the second experiment finished, and now in the UI, in the Workers & Queues tab, I randomly see either trains-agent-1 | - | - | - | ... or (after refreshing the page) trains-agent-1 | long-experiment | 12h | 72000 |
Hi CostlyOstrich36, most of the time I want to compare two experiments in the DEBUG SAMPLES tab, so if I click on one sample to enlarge it, I cannot see the others. Also, once I close the panel, the iteration number is not updated
Thanks AgitatedDove14 !
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command?
trains-agent daemon --gpus 0,1 --queue default &
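One quick way to find out is to print the variable from inside a task that the agent picks up; a minimal sketch, with placeholder project/task names:

```python
import os

from clearml import Task

# Hypothetical task whose only purpose is to inspect the environment the agent sets up
task = Task.init(project_name="debug", task_name="check_visible_devices")

# When the agent daemon is started with --gpus 0,1, this is expected to show
# the GPU indices made visible to the process/container
print(os.environ.get("NVIDIA_VISIBLE_DEVICES"))
```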
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the AWS creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
They are, but this doesn't work - I guess it's because temp IAM accesses have an extra token, that should be passed as well, but there is no such option on the web UI, right?
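For reference, temporary IAM credentials come as a triplet (key, secret, session token), and with boto3 directly the token is passed alongside the key pair; a sketch with placeholder values:

```python
import boto3

# Temporary (STS) credentials require all three values; without the session
# token the requests are rejected even if key and secret are correct.
session = boto3.Session(
    aws_access_key_id="ASIA...",        # placeholder
    aws_secret_access_key="secret",     # placeholder
    aws_session_token="session-token",  # the extra token temp creds require
)
s3 = session.client("s3")
print(s3.list_buckets()["Buckets"])
```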
resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
}
}
queues {
aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libx...
Sorry, it's actually task.update_requirements(["."])
AgitatedDove14 So I'll just replace task = clearml.Task.get_task(clearml.config.get_remote_task_id()) with Task.init() and wait for your fix
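In other words, roughly this change; a minimal sketch, assuming the script is executed by an agent (the project/task names only matter for local runs):

```python
from clearml import Task

# Before: fetching the task the agent created explicitly
# task = clearml.Task.get_task(clearml.config.get_remote_task_id())

# After: Task.init() reuses the task the agent is already running,
# and simply creates a new one when the script is run locally
task = Task.init(project_name="my_project", task_name="my_task")
```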
Ok so it seems that the single quote is the reason, using double quotes works
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
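For context, in a Python str.format template a literal brace has to be doubled, which is presumably what "double the brackets" refers to here; a generic sketch, not the actual autoscaler code:

```python
# A format template that should emit literal braces (e.g. a config section)
# must escape them by doubling:
template = "queues {{ {queue_name} = [] }}"

print(template.format(queue_name="aws_a100"))
# -> queues { aws_a100 = [] }
```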
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included the CUDA/cuDNN libraries, so you don't need to care about the system CUDA/cuDNN version because eventually only the CUDA/cuDNN libraries extracted from the torch wheels are used. Is this correct? If not, does that mean that one should use conda to install the correct CUDA/cuDNN cudatoolkit?
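One way to see what a given torch build actually bundles is to query it at runtime; a sketch, and the version numbers will of course vary by wheel:

```python
import torch

# CUDA toolkit version the wheel was compiled against (None for CPU-only builds)
print(torch.version.cuda)

# cuDNN version used by this torch build
print(torch.backends.cudnn.version())

# Whether a GPU is actually visible and usable right now
print(torch.cuda.is_available())
```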
and just run the same code I run in production
Sorry, I was actually able to fix it (using 1.1.3), not sure what the problem was
Awesome, thanks WackyRabbit7 , AgitatedDove14 !
Hi SuccessfulKoala55, how can I know if I am logged in in this free-access mode? I assume I am, since on the login page I only see a login field, not a password field
Thanks a lot for the solution SuccessfulKoala55! I'll try that if the solution "delete the old bucket, wait for its name to be available, recreate it with the other AWS account, transfer the data back" fails
CostlyOstrich36 , this also happens with clearml-agent 1.1.1 on an AWS instance…
I see 3 agents in the "Workers" tab
so most likely one hard requirement installs version 2 of pyjwt while setting up the experiment
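If that turns out to be the case, one possible workaround is to pin pyjwt explicitly; a sketch, assuming clearml's Task.add_requirements accepts a package name plus version spec and is called before Task.init:

```python
from clearml import Task

# Ask the agent to install a 1.x pyjwt when it recreates the environment,
# regardless of what the other requirements would pull in.
# Must be called before Task.init() to be picked up.
Task.add_requirements("pyjwt", "<2.0.0")

task = Task.init(project_name="my_project", task_name="my_task")
```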
I am trying to upload an artifact during the execution
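For reference, this is roughly how an artifact is uploaded mid-run; a sketch, with placeholder names and a placeholder artifact object:

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")

# Can be called at any point during execution; the upload happens in the
# background unless wait_on_upload=True is passed.
task.upload_artifact(
    name="intermediate_results",
    artifact_object={"epoch": 3, "loss": 0.42},
)
```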
I mean, inside a parent project, do not show the [parent] entry if there is nothing inside it