
Traceback (most recent call last):
  File "sfi/imagery/models/training/ldc_train_end_to_end.py", line 26, in <module>
    from sfi.imagery.models.chip_classifier.eval import eval_chip_classifier
ModuleNotFoundError: No module named 'sfi.imagery.models'
How do I set up the clearml k8s glue?
Also what is the base path where the git repo is cloned? So if my repo is called myProject.git, what would the full path be?
No, I'm not tracking. I'm pretty new to k8s, so this might be beyond my current knowledge. Maybe it will make more sense if I rephrase my goals. Essentially, I want to enqueue an experiment on a queue (gpu), have a GPU EC2 node provisioned in response, and then have the experiment initialized and executed on that new GPU EC2 node. When the work is completed, I want the GPU EC2 node to terminate after x amount of time.
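On my side the only code piece would be the enqueue step; something like this is what I have in mind (the "gpu" queue name and the project/task names are just placeholders, and the actual node provisioning/termination would come from the glue/autoscaler, not from this code):

from clearml import Task

# Create the task locally; the agent re-executes it remotely.
task = Task.init(project_name="my_project", task_name="train_gpu")

# Stop local execution here and enqueue the task on the "gpu" queue;
# whatever services that queue is expected to bring up the node and run it.
task.execute_remotely(queue_name="gpu", exit_process=True)

# Everything below this line only runs on the remote GPU node.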
In other words, I'd like to create 3 queues via helm install, each with its own podTemplate.
Is this possible?
When I run "from sfi.imagery import models" locally, it works fine, so the repo is set up for proper imports. But it fails in ClearML tasks.
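The workaround I'm trying for now is forcing the repo root onto sys.path at the top of the entry script (assuming the repo root is the parent of the sfi/ package); not sure if that's the intended fix:

import os
import sys

# ldc_train_end_to_end.py lives at <repo_root>/sfi/imagery/models/training/,
# so the repo root is four directories up from this file. Putting it first on
# sys.path lets "sfi.imagery.models" resolve even if the working directory differs.
repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

from sfi.imagery.models.chip_classifier.eval import eval_chip_classifier  # noqa: E402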
Basically, can I do local installs vs. supplying a requirements.txt?
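What I was picturing is something like this before Task.init (I may be reading the force_requirements_env_freeze / add_requirements docs wrong, so treat this as a guess):

from clearml import Task

# Both of these have to be called before Task.init().

# Option A: freeze whatever is pip-installed in my local environment,
# instead of having the agent resolve a requirements.txt on its own.
Task.force_requirements_env_freeze(force=True)

# Option B: pin individual packages explicitly.
# Task.add_requirements("opencv-python")

task = Task.init(project_name="my_project", task_name="train")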
Are you able to do a screenshare to discuss this? I'm not sure I understand the purpose of the k8s glue agent.
Also, how do I give the k8s glue agent permissions to spin up/down EC2 nodes?
Is this a config file on your side, or something I could change if we had the enterprise version?
Hmm, how would I add that to PYTHONPATH? Can that be done in the SETUP SHELL SCRIPT window?
So if my main script is called main.py, and in main.py I call a script called train.py via a subprocess.Popen()...
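Roughly like this in main.py (script names are placeholders; I'm passing the parent environment through so PYTHONPATH and the ClearML variables the agent sets reach the child process):

import os
import subprocess
import sys

# Copy the parent environment and make sure the repo root is on PYTHONPATH
# so train.py can import the same packages as main.py.
env = dict(os.environ)
repo_root = os.path.dirname(os.path.abspath(__file__))
env["PYTHONPATH"] = os.pathsep.join(filter(None, [repo_root, env.get("PYTHONPATH")]))

# Launch train.py with the same interpreter and wait for it to finish.
proc = subprocess.Popen([sys.executable, "train.py"], env=env)
if proc.wait() != 0:
    raise RuntimeError("train.py exited with a non-zero code")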
Yes, I see how to create a new queue in the UI. How do I associate that queue with a nodeSelector, though?
Yeah, does the enterprise version have more functionality like this?
Would I copy and paste this block to produce another queue and k8s glue agent?
The agents are Docker containers; how do I modify the startup script so it creates a queue? It seems like having additional queues beyond the default is not handled by helm installs?
I wouldn't be able to pass in ~/.clearml/cache/storage_manager/datasets/ds_{ds_id}/my_file.json as an argument?
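So instead of hard-coding the cache path, I'd resolve it at runtime and pass only the dataset ID as the argument? Something like this ("<ds_id>" being a placeholder for the real ID):

import os
from clearml import Dataset

# Look the dataset up by ID (passed in as the argument instead of a file path).
dataset = Dataset.get(dataset_id="<ds_id>")

# get_local_copy() downloads (or reuses) the cached copy and returns its folder,
# so the script never hard-codes ~/.clearml/cache/... itself.
local_dir = dataset.get_local_copy()
my_file = os.path.join(local_dir, "my_file.json")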
Made some progress getting the GPU nodes to provision, but got this error on my task: K8S glue status: Unschedulable (0/4 nodes are available: 1 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.)
I assumed I would need to upload it and then reference it somehow?
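i.e. something along these lines for the upload side (dataset/project names are placeholders):

from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files("data/my_file.json")   # local file(s) to include
ds.upload()                         # push the files to the configured storage / file server
ds.finalize()                       # lock this version so it can be fetched by ID later
print(ds.id)                        # the ID the training task would pass to Dataset.get()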
If I do both, everything works, except then I lose ClearML tracking (scalars, outputs, etc.).
AgitatedDove14 Will I need sudo permissions if I add this script to extra_docker_shell_script
echo "192.241.xx.xx venus.example.com venus" >> /etc/hosts
IMO, the dataset shouldn't be tied to the clearml.conf URLs it was uploaded with, as those URLs could change. It should respect the file server URL the agent has.
How would I do os.fork? I'm not familiar with that.
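Is it basically this pattern? (POSIX only; just going off the os module docs, so correct me if this isn't what you meant):

import os
import sys

# Minimal fork pattern: the child does the work, the parent waits for it.
pid = os.fork()
if pid == 0:
    # Child process.
    print(f"child pid={os.getpid()} doing the work")
    sys.exit(0)
else:
    # Parent process.
    _, status = os.waitpid(pid, 0)
    print(f"parent: child {pid} finished, raw status={status}")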