
If now I abort the experiment (which is in a pending state and not running), and re-enqueue it again -- no parameter modifications this time...
and I re-enqueue it to the CPU queue, I see that it is sent to the right queue, and after a few seconds the job enters a running state and it completes correctly
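(For reference, a minimal sketch of doing the same re-enqueue from code instead of the UI -- the task ID and the "cpu" queue name below are just placeholders:)
```python
from clearml import Task

# Grab the existing (pending/aborted) experiment and push it to the CPU queue again.
task = Task.get_task(task_id="<task-id>")   # placeholder ID
Task.enqueue(task, queue_name="cpu")        # assumes a queue literally named "cpu"
```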
I can ping it without issues, but I am not sure if the communications are set up correctly
Super!!! many thanks CostlyFox64 !
just to understand well the problems you helped me fix:
for Elasticsearch it looked like I wasn't running the cluster with enough memory
but what happened to the FileServer? And how can I prevent it from happening in a potential "production" deployment?
Oh I see... for some reason I thought that all the dependencies of the environment would be tracked by ClearML, but it's only the ones that actually get imported...
If ClearML locally detects that pandas is installed and can be used to read the CSV, wouldn't it be possible to store this information on the ClearML server so that it can be implicitly added to the requirements?
unfortunately I can't get info from the cluster
I see... because the problem would then be with permissions when creating artifacts to store in the "/shared" folder
Hi AgitatedDove14 I have spent some time going through the helm charts but I admit it's still not clear to me how things should work.
I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested to use).
OK, I'll report that, and see if we can get it fixed
that's what I wanted to ask: while the proper networking is being set up (I don't manage the cluster),
can I do tests using the .kube/config?
I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agents on their respective nodes
Hi Martin, thanks for the explanation! I work with Maggie and help with the ClearML setup.
Just to be sure, currently the PodTemplate contains:
```
resources:
  limits:
    nvidia.com/gpu: 1
```
you are suggesting to also add something like:
```
requests:
  memory: "100Mi"
limits:
  memory: "200Mi"
```
is that correct?
On a related note, I am a bit puzzled by the fact that all 4 GPUs are visible.
In the https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ , i...
well, there are already processes in place... we aim at migrating everything to ClearML, but we hoped we could do it gradually
Thanks, I'll try to understand how the default agent that comes with the helm chart is configured, and use that as a reference to set up a different one.
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
My local clearml.conf is:
```
# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: host
    web_server: host
    files_server: host
    # Credentials are generated using the webapp,
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "access_key", "secret_key": "secret_key"}
}
sdk {
    # ClearML - default SDK configuration
    storage {
        ...
```
when I run it on my laptop...
what I am trying to achieve is not having to worry about this setting, and have all the artifacts and models uploaded to the file server automatically
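Roughly what I have in mind, as a sketch (assuming the standard `output_uri` argument of `Task.init`; project/task names are placeholders, and the same default can also be set globally via `sdk.development.default_output_uri` in clearml.conf):
```python
from clearml import Task

# With output_uri=True, ClearML uploads output models/checkpoints to the
# files_server configured in clearml.conf instead of only recording local paths.
task = Task.init(
    project_name="examples",           # placeholder
    task_name="upload-to-fileserver",  # placeholder
    output_uri=True,
)
```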
I am not aware of how clearml-dataset works, but I'll have a look 🙂
but I can confirm that adding the requirement with Task.add_requirements()
does the trick
but I was a bit set off track seeing errors in the logs
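for completeness, a minimal sketch of that fix, with pandas as the example package from above (project/task names and the version pin are placeholders):
```python
from clearml import Task

# Explicitly register a package that is never imported directly in the script,
# so ClearML's automatic package detection misses it.
# Note: add_requirements() must be called before Task.init().
Task.add_requirements("pandas")  # optionally pin a version: Task.add_requirements("pandas", "1.5.3")

task = Task.init(project_name="examples", task_name="csv-loading")
```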
I see in bitnami's gh-pages branch a file https://github.com/bitnami-labs/sealed-secrets/blob/gh-pages/index.html to do the redirect that contains:
```
<html>
  <head>
    <meta http-equiv="refresh" content="0; url= ">
  </head>
  <p><a href=" ">Redirect to repo index.yaml</a></p>
</html>
```
A similar file is missing in the `clearml-helm-chart` `gh-pages` branch.
great! thanks a lot!
so I assume clearml moves them from one queue to the other?