Hi, I have the same question. Why would this be ignored if called remotely?
https://clear.ml/docs/latest/docs/references/sdk/task/#set_base_docker
What's the difference between --template-yaml and --overrides-yaml? I used the latter to ensure the GPU is passed in.
Thanks SuccessfulKoala55 . Just pm'ed him.
Thanks, this would be a good alternative until the enterprise version comes in. How is this different from argparser btw?
I think in general, the 'published' action can be considered an 'approval'. The question is, how do we control who has the authority to 'publish'? The Web UI today does not support any uploads from outside the coding environment; it would be nice if that were supported. For now, the only workaround is to include parameters that store document URLs in the user properties.
Thanks TimelyPenguin76 , is there an env var for the S3 connection as well?
Hi this is the log. I didn't see any attempt from the agent to install virtualenv on the base image.
1618369068169 clearml-gpu-id-b926b4b809f544c49e99625380a1534b:gpuGPU-4ad68290-0daf-4634-6768-16fad73d47a3 DEBUG Current configuration (clearml_agent v0.17.2, location: /tmp/.clearml_agent.wgsmv2t9.cfg):
agent.worker_id = clearml-gpu-id-b926b4b809f544c49e99625380a1534b:gpuGPU-4ad68290-0daf-4634-6768-16fad73d47a3
agent.worker_name = clearml-gpu-id-b926b4b809f544c49e99625...
Sorry AgitatedDove14 can you bump me to that thread?
And is there any roadmap on this? The organisation's stance on SSH auth is firm. This could make it impossible to use ClearML for remote execution.
Thanks SuccessfulKoala55 . I can try my hand at a patch. But the pod spinning is handled by the k8s glue, which has no link to the client side. How should the client pass the key over to the k8s glue at runtime via the ClearML server?
Ok that works. thanks.
Yeah.. the issue is that ClearML is unable to talk to the nodes, because PyTorch distributed needs to know their IPs. There is some sort of integration missing that would enable this.
Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION =
and nothing else.
Hi, yes, still getting the SSL errors. It looks like some incompatibility with the OS SSL libraries.
[root@2c7498711bef elasticsearch]# curl -XGET `
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_...
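For anyone else not familiar with Elastic: the listing above appears to be index-listing output where the first column is the index health (yellow or red here) and the third column is the index name. The following is only a minimal sketch under that assumption for pulling out the red indices from such a listing; the function name `red_indices` is hypothetical, and the sample reuses only the complete lines from the post.

```python
from typing import List


def red_indices(listing: str) -> List[str]:
    """Return the names of indices whose health column reads 'red'.

    Assumes whitespace-separated columns: health, status, index name, ...
    """
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank or malformed lines that lack the first three columns.
        if len(fields) >= 3 and fields[0] == "red":
            names.append(fields[2])
    return names


# Sample copied from the complete lines of the listing in the post.
sample = "\n".join([
    "yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb",
    "yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb",
    "red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1",
])

for name in red_indices(sample):
    print(name)
```

This only filters text that has already been fetched; it does not talk to Elasticsearch itself.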
Hi, it's a preference from my developers. They prefer to install the Python libraries into the images and load them into the registry. In other words, they prefer to have libraries installed at image build time.
running git diff on my terminal in this repo gave nothing. nothing at all.
I'm not familiar with elastic. What role does elastic play in ClearML?
I passed it through the yaml as follows.
apiVersion: v1
kind: Pod
spec:
  containers:
  - image: clearml-agent:latest
    env:
    - name: PIP_INDEX_URL
      value: " "
    - name: PIP_TRUSTED_HOST
      value: "192.168.56.253"
    - name: PIP_FIND_LINKS
      value: " "
    - name: GIT_SSL_NO_VERIFY
      value: true
    resources:
      requests:
        cpu: "2"
...
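One thing worth checking in a template like this: Kubernetes expects every env `value` to be a string, and an unquoted `true` in YAML is parsed as a boolean. Whether that is related to the problem in this thread is not confirmed. The following is a minimal sketch, using only the standard library, of building the env list so that every value ends up as a string; the helper name `env_entries` is hypothetical and not part of ClearML or the k8s glue.

```python
import json


def env_entries(variables):
    """Build a Kubernetes-style env list, coercing every value to a string."""
    entries = []
    for name, value in variables.items():
        if isinstance(value, bool):
            # Render booleans in lowercase, as they would be written by hand.
            value = "true" if value else "false"
        entries.append({"name": name, "value": str(value)})
    return entries


# Only the variables with complete values in the post are used here.
env = env_entries({
    "PIP_TRUSTED_HOST": "192.168.56.253",
    "GIT_SSL_NO_VERIFY": True,
})

# JSON is valid YAML, so this output can be pasted under `env:` in a pod spec.
print(json.dumps(env, indent=2))
```

Writing `value: "true"` by hand in the YAML achieves the same thing; the helper is only useful if the template is generated from code.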
I also think it makes sense that when you do certain definitive CI actions like publish, it would support running some custom scripts.
Hi, I can't seem to find the source. In what kinds of situations will it try to install torch outside of the user requirements?
The problem is resolved by doing a git push. Somehow the git diff didn't capture the difference in requirements.txt in the project. I can't reproduce the same issue after this as well.
I have since ruled out the apt and pypi repos. Both of them are installing properly on the pods.
Sorry, I take that back. I just realised that this argument only works when running the agent; when you enqueue a task to this agent, the argument is not passed on to the container that the agent spawns.
This is the same issue for the docker image. It reverts back to nvidia/cuda:10.1-runtime-ubuntu18.04 despite me setting something else.
Sorry, by 'dev end' I was referring to my developers.
I didn't think Horovod needs to be as complicated as you described. It can also work by running on multiple known nodes. How would I add a glue for multinode?
Horovod does also work with other similar products such as yours (E.g. Polyaxon).
It's a local deployment. I was only asked for a username, with no need to enter a password. Once I'm in, I don't see an option in my profile to set a password either, and there is no integration with LDAP, for example.
Hi, so you mean I need to install virtualenv in my base image?
Is there a way for k8s glue to pass on self signed cert information to the agent pods?
And yes, there is stuff in there. In fact it's been running for a few weeks with no issue. This appears to have happened after I added new workers, though I can't be sure this is the cause. Is there a limit to the number of workers that I can add for the community edition?