Interestingly, I do see the 100GB volume in the AWS console:
I'll try to pass these values using the env vars
The rest of the configuration is set with env variables
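For reference, a minimal sketch of what I mean by passing the credentials through env vars, assuming the storage layer (boto3) picks them up from the standard AWS variable names; whether the SDK reads exactly these variables is an assumption on my part:

```python
import os

# Assumption: boto3's default credential chain is used, so the standard
# AWS_* environment variables are enough; set them before the SDK is imported.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"          # placeholder value
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"  # placeholder value
# Only needed for temporary (STS) credentials:
os.environ["AWS_SESSION_TOKEN"] = "<session-token>"          # placeholder value
```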
Alright, how can I then mount a volume of the disk?
Yes AgitatedDove14
Interesting - I can reproduce easily
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, this is the one I use
mmmh good point actually, I didn't think about it
because at some point it introduces too much overhead I guess
AgitatedDove14 This seems to be consistent even if I specify the absolute path to /home/user/trains.conf
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see trains-agent-1 | - | - | - | ... and, after refreshing the page, trains-agent-1 | long-experiment | 12h | 72000 |
Hi CostlyOstrich36, most of the time I want to compare two experiments in the DEBUG SAMPLES section, so if I click on one sample to enlarge it I cannot see the others. Also, once I close the panel, the iteration number is not updated
Thanks AgitatedDove14 !
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command?
trains-agent daemon --gpus 0,1 --queue default &
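(My assumption, not verified against the agent code: with --gpus 0,1 the agent would set NVIDIA_VISIBLE_DEVICES=0,1 for the task process/container. A quick way to check from inside a task picked up by that worker:)

```python
import os

# Print what the agent actually exposed to the task.
# Assumption: --gpus 0,1 is mapped to NVIDIA_VISIBLE_DEVICES=0,1; the agent may
# also (or instead) set CUDA_VISIBLE_DEVICES, so both are printed here.
print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
print("CUDA_VISIBLE_DEVICES   =", os.environ.get("CUDA_VISIBLE_DEVICES"))
```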
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the AWS creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
They are, but this doesn't work - I guess it's because temporary IAM credentials come with an extra session token that should be passed as well, but there is no such option in the web UI, right?
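Just to illustrate the extra-token point: with temporary (STS) credentials the session token has to travel with the key pair, which plain boto3 supports. A minimal sketch, only to show the third field (not how trains wires it internally):

```python
import boto3

# Temporary IAM credentials are a triplet; without aws_session_token the
# key/secret pair alone is rejected, which is why a key/secret-only UI form fails.
session = boto3.Session(
    aws_access_key_id="<temporary-access-key-id>",
    aws_secret_access_key="<temporary-secret-access-key>",
    aws_session_token="<session-token>",  # the extra token mentioned above
)
print(session.client("s3").list_buckets()["Buckets"])
```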
resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
}
}
queues {
aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libx...
Sorry, it's actually task.update_requirements(["."])
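A sketch of where that call would sit (my assumption about the ordering: it is called in the locally-run script right after Task.init, so the stored requirements used for remote execution are replaced; the project/task names below are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="install repo as package")
# Replace the auto-detected package list with "." so the agent effectively runs
# `pip install .` on the cloned repository instead of a frozen requirements list.
task.update_requirements(["."])
```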
AgitatedDove14 So I'll just replace task = clearml.Task.get_task(clearml.config.get_remote_task_id()) with Task.init() and wait for your fix
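i.e. something along these lines (a sketch of the swap; when the script is executed by an agent, Task.init() attaches to the already-created remote task, while a local run would still need project/task names):

```python
from clearml import Task

# Before: explicitly fetching the task created for this run
# task = clearml.Task.get_task(clearml.config.get_remote_task_id())

# After: under an agent, Task.init() reuses the current remote task
# (its arguments are ignored in that case).
task = Task.init()
```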
Ok so it seems that the single quote is the reason, using double quotes works
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
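For anyone hitting the same thing: in a Python str.format() template, literal braces must be doubled, otherwise they are parsed as replacement fields. A tiny illustration (the template content here is made up, only the brace escaping is the point):

```python
# Literal { } inside a str.format() template must be written as {{ }}.
template = 'queues {{ {queue_name} = [["{resource}", {budget}]] }}'
print(template.format(queue_name="aws_a100", resource="A100", budget=15))
# -> queues { aws_a100 = [["A100", 15]] }
```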
AppetizingMouse58 the events_plot.json template is missing the plot_len declaration, could you please give me the definition of this field? (reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed")
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries bundled in the torch wheels are used. Is this correct? If not, does that mean one should use conda to install the correct cuda/cudnn cudatoolkit?
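A quick way to see what a given torch wheel was built with (standard torch introspection; whether the agent ends up using the bundled libraries or the system ones is exactly the open question above):

```python
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)         # CUDA runtime the wheel targets
print("cuDNN:", torch.backends.cudnn.version())       # cuDNN version torch loads
print("cuda available:", torch.cuda.is_available())   # the GPU driver is still a system concern
```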
and just run the same code I run in production
Sorry, I was actually able to fix it (using 1.1.3), not sure what the problem was