
I still don't see why you would change the type of the cloned Task, I'm assuming the original Task had the correct type, no?
Because it is easier for me to create a training task out of the controller task by cloning it (so that parameters are prefilled and I can set the parent task id)
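(For anyone reading later, a minimal sketch of that clone-then-set-parent flow with the ClearML SDK; the task ID, parameter name and queue name are hypothetical:)
```python
from clearml import Task

# Clone the controller task so its parameters come prefilled,
# and record the controller as the parent of the new training task.
controller = Task.get_task(task_id="myControllerTaskId")  # hypothetical ID
training = Task.clone(
    source_task=controller,
    name="training (cloned from controller)",
    parent=controller.id,
)
training.set_parameters({"General/epochs": 10})  # override only what differs
Task.enqueue(training, queue_name="default")
```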
yea I just realized that you would also need to specify different subnets, etc… not sure how easy it is 😞 But it would be very valuable, on-demand GPU instances are so hard to spin up nowadays in AWS 😄
My bad, alpine is so light it doesn't have bash
(I am not part of the awesome ClearML team, just a happy user 🙂 )
I'll try to pass these values using the env vars
Should I open an issue in the clearml-agent GitHub repo?
From my experience, I only installed the CUDA drivers on my machines. I didn't use conda to install torch or cudatoolkit; I just let clearml-agent download the torch wheel file and install it
but I also make sure to write the trains.conf to the home directory in this bash script:
```
echo "
sdk.aws.s3.key = ***
sdk.aws.s3.secret = ***
" > ~/trains.conf
...
python3 -m trains_agent --config-file "~/trains.conf" ...
```
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the AWS creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
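(What I mean by passing them as env vars, roughly; these are the standard variable names boto3 reads, values redacted:)
```python
import os

# Set before clearml/boto3 builds its S3 client, so the credentials
# chain can pick them up from the environment.
os.environ["AWS_ACCESS_KEY_ID"] = "***"
os.environ["AWS_SECRET_ACCESS_KEY"] = "***"
os.environ["AWS_DEFAULT_REGION"] = "eu-central-1"
```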
Alright, so the steps would be:
```
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
```
That would create a base docker image base_env_services. Then how should I ensure that trains-agent uses that base image for the services queue? My guess is:
```
trains-agent daemon --services-mode --detached --queue services --create-queue --docker base_env_services --cpu-only
```
Would that work?
if I want to resume a training on multi-GPU, I will need to call this function on each process to send the weights to each GPU
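(A minimal sketch of what I mean, assuming a PyTorch DDP-style setup where the checkpoint was saved from cuda:0; the checkpoint path and rank handling are hypothetical:)
```python
import torch

def load_on_rank(model, rank, ckpt_path="checkpoint.pt"):
    # Each spawned process calls this so the weights land on its own GPU:
    # remap tensors saved from cuda:0 onto this process's device.
    map_location = {"cuda:0": f"cuda:{rank}"}
    state_dict = torch.load(ckpt_path, map_location=map_location)
    model.load_state_dict(state_dict)
    return model.to(rank)
```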
So the problem comes when I do my_task.output_uri = "s3://my-bucket": trains checks in the background if it has access to this bucket, and it is not able to find/read the creds
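(The same check can also be triggered at init time; a sketch, with hypothetical project/task names:)
```python
from clearml import Task

# Passing output_uri here makes trains/clearml run the same
# bucket-access check using the creds from trains.conf or env vars.
task = Task.init(
    project_name="my_project",
    task_name="my_task",
    output_uri="s3://my-bucket",
)
```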
AgitatedDove14 I see that the default is sample_frequency_per_sec=2., but in the UI I see that there isn't such resolution (i.e. it logs every ~120 iterations, corresponding to ~30 secs). What is the difference with report_frequency_sec=30.?
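(My reading of those two parameters, which may be wrong: the monitor *samples* machine stats sample_frequency_per_sec times per second, but averages the samples and *reports* once every report_frequency_sec, which would explain the ~30 s spacing. A sketch; the import path is an assumption and Task.init normally starts this monitor for you:)
```python
from clearml import Task
from clearml.utilities.resource_monitor import ResourceMonitor  # path is an assumption

task = Task.init(project_name="demo", task_name="monitor-freqs")
monitor = ResourceMonitor(
    task,
    sample_frequency_per_sec=2.0,  # how often stats are sampled
    report_frequency_sec=30.0,     # how often averaged samples are sent
)
```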
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
I am confused now because I see that in the master branch, the clearml.conf file has the following section:
```
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
```
So it states that IAM role using metadata service should be supported, right?
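(For reference, enabling it would presumably just be flipping that flag in clearml.conf; the flag name is copied from the section quoted above, but the surrounding nesting is my assumption:)
```
sdk {
    aws {
        s3 {
            # let Boto3 pick credentials from env vars, credential
            # file, or the IAM role via the metadata service
            use_credentials_chain: true
        }
    }
}
```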
Here I have to do it for each task, is there a way to do it for all tasks at once?
I specified a torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version: 1.6.0
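(One workaround I could imagine, hedged since I haven't verified that it handles direct-URL specs: forcing the requirement before Task.init via Task.add_requirements:)
```python
from clearml import Task

# Must be called before Task.init; whether the agent honors a
# direct-URL spec here is an assumption on my part.
Task.add_requirements(
    "torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl"
)
task = Task.init(project_name="my_project", task_name="my_task")
```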
I should also rename the /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool, right?
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
See my answer in the issue - I am not using docker
Sure! Here are the relevant parts:
```
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
Yes, I would like to update all references to the old bucket unfortunately… I think I'll simply delete the old s3 bucket, wait for its name to be available again, recreate it on the other AWS account and move the data there. This way I don't have to mess with the clearml data - I am afraid to do something wrong and lose data
even if I explicitly use previous_task.output_uri = "s3://my_bucket", it is ignored and still saves the json file locally
How about the overhead of running the training in docker on a VM?
mmmh probably yes, I can’t say for sure (because I don’t remember precisely when I upgraded to 0.17) but it looks like it
TimelyPenguin76, no, I’ve only set the sdk.aws.s3.region = eu-central-1 param