Seems like it is routing fine
I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good
For instance, if I wanted both the default queue and a gpu queue that I create, how would I do that?
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Could I simply reference the files by name and pass in a string such as `~/.clearml/my_file.json`?
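For context, something like this is what I had in mind, if `connect_configuration` can take a file path (a rough sketch; the project/task names are placeholders and the json path is just the example above):

```python
from pathlib import Path
from clearml import Task

task = Task.init(project_name="examples", task_name="config-example")

# attach the local json file to the task; when running remotely,
# connect_configuration should hand back a local copy of the stored file
config_path = task.connect_configuration(
    Path("~/.clearml/my_file.json").expanduser(), name="my_file"
)

with open(config_path) as f:
    cfg = f.read()
```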
After proving we can run our training, I would then advise we update our code base
Also, how do I associate that new queue with a worker?
No, I'm not tracking. I'm pretty new to k8s, so this might be beyond my current knowledge. Maybe if I rephrase my goals it will make more sense. Essentially, I want to enqueue an experiment, pick a queue (gpu), and have a GPU EC2 node provisioned in response; the experiment is then initialized and executed on that new GPU EC2 node. When the work is completed, I want the GPU EC2 node to terminate after x amount of time.
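Something like this is the flow I'm picturing on the experiment side (rough sketch; the "gpu" queue and project/task names are placeholders, and the node spin-up/teardown and idle timeout would be handled by whatever is watching that queue, e.g. the autoscaler or k8s glue):

```python
from clearml import Task

# placeholders: project/task names and the "gpu" queue name
task = Task.init(project_name="examples", task_name="train-on-gpu")

# stop executing locally and enqueue this task on the "gpu" queue;
# the autoscaler / k8s glue watching that queue provisions the GPU node,
# runs the task there, and tears the node down after its idle timeout
task.execute_remotely(queue_name="gpu", exit_process=True)

# training code below this point only runs on the remote node
```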
AgitatedDove14 Will I need sudo permissions if I add this script to `extra_docker_shell_script`?
`echo "192.241.xx.xx venus.example.com venus" >> /etc/hosts`
IMO, the dataset shouldn't be tied to the clearml.conf URL it was uploaded with, as that URL could change. It should respect the file server URL the agent has.
I learned helm a few days ago
That is the problem: the `if` condition is not evaluating to True.
When I deployed the webserver, I changed the value https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml/values.yaml#L36 to be the public file server URL. Then in the UI, I copied the blob from Settings > API keys, which had the public URLs. After that I did my data uploads, which worked fine as they used the public URLs. The problem is that, due to tight security on this k8s cluster, the k8s pod cannot reach the public file server URL associated with the dataset.
I think the best change would be to respect the value set at https://github.com/allegroai/clearml-helm-charts/blob/19a6785a03b780c2d22da1e79bcd69ac9ffcd839/charts/clearml-agent/values.yaml#L50 so you could change it down the road if infra/hosting changes. Also in this case, I'm uploading the data to the public file server URL, but my k8s pod can't reach that for security reasons.
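As a stopgap on my side, I could probably upload the dataset with an explicit output_url that the pods can reach, something like this (sketch only; the project/dataset names are placeholders and the in-cluster fileserver address is an assumption, not my real endpoint):

```python
from clearml import Dataset

# placeholders: project/dataset names; output_url is an assumed
# in-cluster fileserver address, not the real endpoint
ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
ds.add_files("data/")

# upload to a URL reachable from inside the k8s cluster so the stored
# file references resolve when the agent pod later fetches the dataset
ds.upload(output_url="http://clearml-fileserver:8081")
ds.finalize()
```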
I'm not familiar enough with helm to clone this, fix it, and then test it
On a somewhat related note to k8s, do you know where I can change this host name? I got this error when my task is fetching a dataset:
`2022-09-23 15:09:45,318 - clearml.storage - ERROR - Could not download`
AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto-connect to frameworks like TensorBoard?
```python
import logging
import subprocess

exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]
logging.info("Training classifier with command:\n%s", " ".join(cmd))
returncode = subprocess.Popen(cmd).wait()
```
Note `/home/npuser...`
It will then parse the above information from my local workstation?
Not yet, AgitatedDove14. Perhaps we can pair on this Monday.
If you look lower, it is there '/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py'
My next question: how do I add more queues?
How would I do the same with a new queue?
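From what I can tell, besides creating queues in the UI, they can also be created programmatically via the API client, and a worker picks a queue up once its name is listed in the agent's daemon arguments / helm values. A rough sketch, assuming credentials are already in `clearml.conf` and "gpu" is just an example name:

```python
from clearml.backend_api.session.client import APIClient

# uses the credentials from ~/clearml.conf; "gpu" is an example queue name
client = APIClient()

existing = {q.name for q in client.queues.get_all()}
if "gpu" not in existing:
    client.queues.create(name="gpu")
```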
You guys are the maintainers of this repo
Would be great if the `docker_bash_setup_script` had output I could see
It's a legacy code base. There were issues around GPU memory not being cleared when subprocesses were not used. At this point I've refactored out the subprocess, as it just adds more complexity.