Thanks. That seems to work. I have a question: does it save the best model or the model from the last epoch?
Hi, clearml-agent==0.17.2rc3 did work. I'm on a 1.19 k8s cluster, and I get this error when a task is pulled. Is the glue not compatible with 1.19?
` Pulling task 3a90802d1dfa4ec09fbccba0beffbaa8 launching on kubernetes cluster
Pushing task 3a90802d1dfa4ec09fbccba0beffbaa8 into temporary pending queue
Kubernetes scheduling task id=3a90802d1dfa4ec09fbccba0beffbaa8
kubectl output:
Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been depre...
Hi AgitatedDove14, that's what I am trying to figure out as well. The task has nothing to do with torch, and the requirements.txt doesn't have any torch packages either.
Hi AgitatedDove14, do you mean the configuration tab in the UI? No, I don't see it.
Thanks. We set this configuration, and the client ran and submitted the job for remote execution (agent running k8s glue). However, when the job runs and tries to save into the model repo, this error comes up.
clearml.storage - ERROR - Failed creating storage object s3://ecs.ai Reason: Missing key and secret for S3 storage access (s3://ecs.ai).
I remember being told that the clearml.conf on the client will not be used in a remote execution like the above, so I think this was the problem. I also...
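For reference, this is the kind of credentials block I would expect the executing side to need in its clearml.conf (just a sketch following the usual sdk.aws.s3.credentials layout; the host port, key, and secret here are placeholders, not our real values):
` sdk {
  aws {
    s3 {
      credentials: [
        {
          # placeholder endpoint and keys for the on-prem S3 store
          host: "ecs.ai:443"
          key: "ACCESS_KEY"
          secret: "SECRET_KEY"
          secure: true
        }
      ]
    }
  }
}
`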
Hi, it makes sense to automate this part just like how you automate the rest of the MLOps flow. Especially since you already support Data Versioning/Lineage, Data Provenance (how a dataset relates to the experiment and serves as a model source) should be in too. Although I agree that technically it's probably not possible to tell if the users actually used the indicated datasets after they do a datasets.get_copy().
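To be concrete about the access pattern I mean (a sketch using the clearml Dataset API; the project/dataset names are made up, and by get_copy() I mean get_local_copy()):
` from clearml import Dataset

# fetch a registered dataset and materialise a local copy of it;
# nothing after this call proves the experiment actually read from local_path
ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = ds.get_local_copy()
`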
Hi, the idea is to load the git user and password into --env via an environment variable, so the client can access the resources without divulging the credentials in source code; they would also disappear after completion since the container is removed. It's actually working well with ClearML, except that the agent seems to print the content of docker_cmd when running the task.
I would like to note that this behaviour doesn't exist with the clearml-agent daemon though. It only exis...
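To be explicit about the mechanism (a sketch; GIT_USER/GIT_PASS are placeholder names, and agent.extra_docker_arguments is the setting I am using to add the env flags):
` agent {
  # placeholder credential values; in practice they are injected from the
  # agent machine's environment and only live inside the short-lived container
  extra_docker_arguments: ["-e", "GIT_USER=someuser", "-e", "GIT_PASS=somepass"]
}
`
Since these end up in the docker command the agent builds, the docker_cmd printout is exactly where they get exposed.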
The agent is running on a disconnected server in docker mode. I have a client that runs clearml-session, and I saw from the agent's logs that the installation of vscode fails.
Yeah, that sounds good. But from the user's perspective, especially untrained users, they wouldn't know what to point to. For example, some may think it's an exe, some think it's a zip bundle, and others think it's any GitHub repo with the word vscode.
Where should I indicate this in the configuration?
Any idea?
Hi CostlyOstrich36, what you described is a task. I was referring to the pipeline controller.
Thanks AgitatedDove14, will take a look.
Hi AgitatedDove14, I dug a bit deeper. I saw this in the installed packages
in the original completed task. When the task is cloned, this is copied over, and thus the problem. Can I ask, how does ClearML create the list of installed packages? Why is it that some of them (e.g. attrs) are being pulled from @ file:///tmp/build/80754af9/attrs_1604765588209/work?
` absl-py==0.11.0
alabaster==0.7.12
antlr4-python3-runtime==4.8
apex==0.1
appdirs==1.4.4
argon2-cffi==20.1.0
ascii-graph==1.5.1
async-gener...
I'm also beginning to think this is related to https://clearml.slack.com/archives/CTK20V944/p1620664770492400 . Previously, when I set force_repo_requirements_txt=true and system_site_packages: true, it seemed to work. Upgrading to v1.02 seems to change things.
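For reference, the two settings I mean sit under agent.package_manager in the agent's clearml.conf (a sketch of how I have them set):
` agent {
  package_manager: {
    # install from the repo's requirements.txt instead of the recorded packages
    force_repo_requirements_txt: true
    # let the task venv see packages already present in the system / container
    system_site_packages: true
  }
}
`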
Thanks 👍. Should I create an issue on GitHub?
Hi, please correct me if I am wrong; to use the glue, I need the following:
A k8s cluster
A kubectl that is connected to the k8s cluster
A pip install of clearml-agent 0.17.1
So I did all the above, but I'm not sure what it meant by running the entire thing on my own machine.
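For completeness, what I actually did was roughly this on my own workstation (a sketch; it assumes the k8s_glue_example.py example script from the clearml-agent repo and a queue name I picked myself):
` # on a machine whose kubectl already points at the cluster
pip install clearml-agent==0.17.1
python k8s_glue_example.py --queue k8s_glue
`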
This would be solved if --env GIT_SSL_NO_VERIFY=true is passed to the k8s pod that's spawned to run the job. Currently it's not.
Previously we had similar issues when we switched the images used in the agent. Might want to check on that.
ok thanks.
Then you pass the tolerations definition through a different pod template?
Yup.
Unfortunately, our security posture is so strict that we cannot have an agent git user with unfettered read access to all repos.
The apply.yaml template is not working (e.g. the env arguments are not passed to the container), which is why I tried the code approach instead.
AgitatedDove14, will these be fixed?
Passing env via the code
Passing env via template yaml
Yeah... the issue is that ClearML is unable to talk to the nodes, because pytorch distributed needs to know their IPs. There is some sort of integration missing that would enable this.
I passed it through the yaml as follows:
` apiVersion: v1
kind: Pod
spec:
  containers:
    - image: "clearml-agent:latest"
      env:
        - name: PIP_INDEX_URL
          value: ""
        - name: PIP_TRUSTED_HOST
          value: "192.168.56.253"
        - name: PIP_FIND_LINKS
          value: ""
        - name: GIT_SSL_NO_VERIFY
          value: true
      resources:
        requests:
          cpu: "2"
...
So these (PIP_INDEX_URL) weren't used when ClearML starts running pip.
Hi, by deployment strategies I meant canary, blue-green, etc. I figured this should be done by clearml-serving, and maybe Seldon as well.