Hi CostlyOstrich36, most of the time I want to compare two experiments in the DEBUG SAMPLE, so if I click on one sample to enlarge it I cannot see the others. Also, once I close the panel, the iteration number is not updated
Thanks AgitatedDove14!
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command? `trains-agent daemon --gpus 0,1 --queue default &`
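One way to check this empirically is to print the variable from inside a task that the agent runs. A minimal sketch, assuming the variable holds a comma-separated list (it can also hold other values such as `all`); `visible_devices` is a hypothetical helper name, not part of trains-agent:

```python
import os


def visible_devices(env=None):
    """Return the entries of NVIDIA_VISIBLE_DEVICES as a list of strings.

    Hypothetical helper: it only splits the raw value on commas and
    makes no claim about what the agent actually sets.
    """
    env = os.environ if env is None else env
    raw = env.get("NVIDIA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    # Run this inside the executed task to see what the process receives.
    print(visible_devices())
```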
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf. When I passed the AWS creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
They are, but this doesn't work - I guess it's because temp IAM accesses have an extra token that should be passed as well, but there is no such option on the web UI, right?
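To illustrate the extra token: temporary AWS credentials come as three values, not two. A minimal sketch, assuming the credentials are exposed through the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN` environment variables; `temporary_credential_kwargs` is a hypothetical helper name:

```python
import os


def temporary_credential_kwargs(env=None):
    """Collect AWS credentials, including the session token when present."""
    env = os.environ if env is None else env
    kwargs = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    # Temporary (STS) credentials are only valid together with this token.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        kwargs["aws_session_token"] = token
    return kwargs
```

The resulting dictionary uses the keyword names accepted by `boto3.Session(...)`, so it could be passed as `boto3.Session(**temporary_credential_kwargs())`.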
` resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
}
}
queues {
aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libx...
Sorry, it's actually `task.update_requirements(["."])`
AgitatedDove14 So I'll just replace `task = clearml.Task.get_task(clearml.config.get_remote_task_id())` with `Task.init()` and wait for your fix
Ok so it seems that the single quote is the reason, using double quotes works
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
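For context on doubling brackets: in a Python `str.format` template, literal curly braces have to be written as `{{` and `}}`, otherwise they are parsed as placeholders. A minimal sketch, assuming the typo is of this kind; the template and the `queue` field below are hypothetical and not taken from aws_auto_scaler.py:

```python
# Literal braces are doubled; {queue} is a real placeholder.
template = 'echo "${{HOME}}" && run --queue {queue}'
rendered = template.format(queue="aws_a100")

# With single braces, format() treats HOME as a field name and fails.
try:
    'echo "${HOME}" && run --queue {queue}'.format(queue="aws_a100")
    failed = False
except KeyError:
    failed = True

if __name__ == "__main__":
    print(rendered)
```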
AppetizingMouse58 the events_plot.json template is missing the plot_len declaration, could you please give me the definition of this field? (reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed")
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheels were used. Is this correct? If not, then does that mean that one should use conda to install the correct cuda/cudnn cudatoolkit?
and just run the same code I run in production
Sorry, I was actually able to fix it (using 1.1.3), not sure what the problem was
Awesome, thanks WackyRabbit7, AgitatedDove14!
Yes that's what I did initially, but eventually I decided that it's too much complexity added for nothing really, I'd rather drop omegaconf and, if one day clearml supports it out of the box, take advantage of it
Hi SuccessfulKoala55, how can I know if I am logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field
Thanks a lot for the solution SuccessfulKoala55! I'll try that if the solution "delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back" fails
CostlyOstrich36, this also happens with clearml-agent 1.1.1 on an AWS instance…
I see 3 agents in the "Workers" tab
so most likely one hard requirement installs the version 2 of pyjwt while setting up the experiment
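A quick way to confirm which version ended up in the environment is to query the installed distribution metadata from within the experiment. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and `PyJWT` is assumed to be the distribution name as published on PyPI:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version resolved during environment setup, or None.
    print(installed_version("PyJWT"))
```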
I am trying to upload an artifact during the execution
I mean, inside a parent, do not show the project [parent] if there is nothing inside
and then call task.connect_configuration probably
What is this cleanup service? where is it available?