Thank you for the help!
So it caches any files under the same project name to ~/.clearml/?
And I removed the duplicate Task.init() call.
Thanks for looking into this!
How would I use os.fork? I'm not familiar with it.
Hmm, that does look really helpful! Let me see if I can fix this using that info. Thank you!
AWS. I've set up the shared memory between the k8s nodes.
Yep, I updated those as well.
OK, yes, this is the problem.
In other words, I'd like to create three queues via helm install, each with its own podTemplate. Is this possible?
Figured this out: the value is parsed from my local clearml.conf file.
I think this is VPN-related now.
It seems like it is routing fine.
I got everything working using the default queue. I can submit an experiment and a new GPU node is provisioned; all good.
For instance, if I wanted both the default queue and a gpu queue that I create, how would I do that?
So I'd create the queue in the UI, then update the Helm YAML as above, and install? How would I add a third queue?
Could I simply reference the files by name and pass in a string such as ~/.clearml/my_file.json?
Once we've proven we can run our training, I would then advise that we update our codebase.
Also, how do I associate that new queue with a worker?
No, I'm not tracking. I'm pretty new to k8s, so this might be beyond my current knowledge. Maybe if I rephrase my goals it will make more sense. Essentially, I want to enqueue an experiment on a chosen queue (gpu), have a GPU EC2 node provisioned for it, and then have the experiment initialized and executed on that new GPU EC2 node. When the work is completed, I want the GPU EC2 node to terminate after x amount of time.
AgitatedDove14, will I need sudo permissions if I add the following script to extra_docker_shell_script?
echo "192.241.xx.xx venus.example.com venus" >> /etc/hosts
IMO, the dataset shouldn't be tied to the clearml.conf URL it was uploaded with, since that URL could change. It should respect the file server URL configured for the agent.
Also, how do I give the k8s glue agent permissions to spin EC2 nodes up and down?
I learned Helm a few days ago.