What I still don't get is how you would create different queues targeting different nodes with different GPUs, and have them use the appropriate CUDA image.
Looking at the template, I don't understand how that's possible.
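What I have in mind (just a sketch of the behaviour I'd like, not something I know the chart supports out of the box) is one queue per GPU type, each backed by a pod template along these lines; the queue label, node label and CUDA image tag below are all made up:
```yaml
# Hypothetical pod template for a queue targeting A100 nodes.
# The label, nodeSelector key and image tag are placeholders, not chart conventions.
apiVersion: v1
kind: Pod
metadata:
  labels:
    clearml-queue: gpu-a100          # placeholder
spec:
  nodeSelector:
    gpu-type: a100                   # placeholder node label
  containers:
    - name: task
      image: nvidia/cuda:11.8.0-runtime-ubuntu22.04   # whichever CUDA image matches the node
      resources:
        limits:
          nvidia.com/gpu: 1
```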
Thanks, I'll try to understand how the default agent that comes with the Helm chart is configured, and then copy that setup to create a different one.
yes, the curl returned a 503 error
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
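For reference, a minimal sketch of how the token could be injected into the agent pod, assuming it is stored in a Kubernetes Secret (the Secret and key names below are made up):
```yaml
# Placeholder Secret/key names; CLEARML_AGENT_GIT_USER/PASS are the agent's git credential env vars.
env:
  - name: CLEARML_AGENT_GIT_USER
    value: "my-github-username"      # your GitHub user
  - name: CLEARML_AGENT_GIT_PASS
    valueFrom:
      secretKeyRef:
        name: clearml-git-credentials
        key: github-pat
```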
Hi AgitatedDove14, I have spent some time going through the Helm charts, but I admit it's still not clear to me how things should work.
I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested I use).
Not really 🙂
The files are clearly different, but if I understand correctly, is it enough to add
```
storage {
    cache {
        # Defaults to system temp folder / cache
        default_base_dir: "~/.clearml/cache"
        # default_cache_manager_size: 100
    }
    direct_access: [
        # Objects matching are considered to be available for direct access, i.e. they will not be downloaded
        # or cached, and any download request will return a di...
```
Just to understand the problems you helped me fix:
for Elasticsearch, it looked like I wasn't running the cluster with enough memory,
but what happened to the FileServer? And how can I prevent it from happening in a potential "production" deployment?
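For the Elasticsearch part, I assume the fix is simply giving it more memory through the Helm values; the exact keys depend on the chart version, so the paths below are an assumption:
```yaml
# Sketch only: the "elasticsearch" key layout is an assumption about the chart structure.
elasticsearch:
  resources:
    requests:
      memory: "4Gi"
    limits:
      memory: "4Gi"
```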
That's what I wanted to ask: while the proper networking is being set up (I don't manage the cluster),
can I do tests using the .kube/config?
The behaviour I'd like to achieve is that any artefact is automatically saved to an S3 bucket, ideally without the Data Scientists having to configure much on their side.
Right now we are storing artefacts in the fileserver, and we have to make sure we pass output_uri=True in the Task.init call to have artefacts uploaded to the ClearML fileserver.
What's the ideal setup to keep the boilerplate for DS code minimal?
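For context, this is roughly the per-script boilerplate today if we point a task straight at S3 (the bucket URI below is a placeholder):
```python
from clearml import Task

# Upload artefacts and models to S3 instead of the ClearML fileserver.
# "s3://my-company-bucket/clearml" is a placeholder URI.
task = Task.init(
    project_name="my-project",
    task_name="example-training",
    output_uri="s3://my-company-bucket/clearml",
)
```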
CostlyOstrich36 so I don't have to write the clearml.conf?
I would like to set up things so that a data scientist working on a project doesn't have to know about buckets and this sort of thing... Ideally the server and the agents are configured with a default bucket...
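Something like the following in the clearml.conf shipped to agents/users is what I have in mind; the bucket and region are placeholders, and I'm assuming sdk.development.default_output_uri is the right key for a default destination:
```
# Sketch: placeholder bucket/region; keys follow the standard clearml.conf layout.
sdk {
    development {
        # tasks created with Task.init() would upload artefacts here by default
        default_output_uri: "s3://my-company-bucket/clearml"
    }
    aws {
        s3 {
            key: ""        # left empty so credentials come from the environment / IAM role
            secret: ""
            region: "eu-west-1"
        }
    }
}
```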
Hi Martin, thanks for the explanation! I work with Maggie and help with the ClearML setup.
Just to be sure, currently the PodTemplate contains:
resources: limits: nvidia.com/gpu: 1
and you are suggesting to also add something like requests: memory: "100Mi" and limits: memory: "200Mi", is that correct? (See the sketch below for the combined resources block.)
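Putting the two together, I assume the resources section of the pod template would look roughly like this (the memory figures are just the example values above):
```yaml
# Sketch of the combined resources block; memory values are the example figures from above.
resources:
  requests:
    memory: "100Mi"
  limits:
    memory: "200Mi"
    nvidia.com/gpu: 1
```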
On a related note, I am a bit puzzled by the fact that all 4 GPUs are visible.
In the https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ , i...
But shouldn't the path of the artifacts be a setting of the file server, and not of the agent?
OK, it wasn't the clearml.conf settings...
In the deployment I was referring to the fileserver, apiserver, etc. with the internal Kubernetes DNS names.
I changed them to the ones exposed to the users (the same I have in my local clearml.conf) and things work.
But I can't really figure out why that would be the case...
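For the record, the working setup is the agent's clearml.conf pointing at the externally exposed endpoints, the same ones users have locally; the hostnames below are placeholders:
```
# Placeholder hostnames; the api section keys are the standard clearml.conf ones.
api {
    web_server: "https://app.clearml.example.com"
    api_server: "https://api.clearml.example.com"
    files_server: "https://files.clearml.example.com"
    # credentials are provided via environment variables, not stored here
}
```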
If you need to know the value of some of them, let me know CostlyOstrich36. I wanted to avoid leaking access keys etc., so I removed the values.
So in the k8s glue agent deployment, the clearml.conf is just:
```
sdk {}
agent {
    package_manager: {
        extra_index_url: ["host"]
    }
}
```
(the API keys are exposed through environment variables)
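Concretely, the keys are injected more or less like this (the Secret name is a placeholder, and I'm assuming the standard CLEARML_API_* variable names):
```yaml
# Sketch: placeholder Secret name; CLEARML_API_* names assumed from the standard agent env vars.
env:
  - name: CLEARML_API_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: clearml-agent-credentials
        key: access-key
  - name: CLEARML_API_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: clearml-agent-credentials
        key: secret-key
```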
ah I see, I'll give it a try then
Thanks a lot! So as long as we have the StorageClass in our Kubernetes cluster configured correctly, the new Helm chart should work out of the box?
OK. In the pod spawned by the K8s Glue Agent, clearml.conf is the same as in the K8s Glue Agent.
Thanks Alon
OK, thanks a lot, I'll try to get the networking thing sorted (and then I am sure I'll have many more doubts 😂)
Yes, I add it to the "default" queue (which is the one used in the config file for the k8s glue agent)
Yes, I still see those errors, but queues are working :)
but I don't understand the comment on GPUs, as the documentation makes a lot of references to GPU configuration for agents
Hi Martin, thanks. My doubt is:
if I manually configure the pods for the different nodes, how do I make the ClearML server aware that those agents exist? This step is really not clear to me from the documentation (it talks about users, and it uses interactive commands, which would mean entering the agents manually). I will also try the k8s glue, but I would first like to understand how to configure a fixed number of agents manually.
Hi folks, I think I found the issue: the documentation mentions setting NVIDIA_VISIBLE_DEVICES to "", when in reality it should be "none" according to the code:
```python
if Session.get_nvidia_visible_env() == 'none':
    # NVIDIA_VISIBLE_DEVICES set to none, marks cpu_only flag
    # active_gpus == False means no GPU reporting
    self._active_gpus = False
```
so I assume clearml moves them from one queue to the other?
but for SIGTERM you should be able to set cleanup steps, no?
Thanks CostlyOstrich36. I was thinking more of an environment setting; for example, the documentation mentions the "--cpu-only" flag (which I am not sure I can use, since I am using the Helm charts from AllegroAI and I don't think I can override the command), or setting the env var NVIDIA_VISIBLE_DEVICES to an empty string (which I did, but I can still see the message).
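Based on the code snippet above, what seems to be needed in the agent pod spec is this (plain Kubernetes env syntax; whether the chart lets me set it directly is something I still have to check):
```yaml
# Per the snippet above, "none" (not an empty string) is what marks the run as CPU-only.
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "none"
```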