They don't have the same version. I do notice that if the client is using Python 3.8, remote execution will try to use that same version, even though the docker image doesn't have it installed.
This is probably the whole script.
` kubectl get nodes
pip install clearml-agent
python k8s_glue_example.py `
Can you please verify that you have all the required packages installed locally?
It's not installed on the image that runs the experiment, but it's reflected in the requirements.txt.
What is the setting of agent.package_manager.system_site_packages?
True.
The apply.yaml template is not working (e.g. the arguments env is not passed to the container), which is why I tried the code approach instead.
Hi, I was reading this thread and wondered which versions of clearml-server and clearml-agent this took effect with?
In the ClearML config that's being run by the ClearML container?
I can't seem to find a fix for this. I ended up using an image that comes with torch installed.
I would say yes, otherwise the vscode feature is only available on internet-connected premises due to the hard-coded URL used to download vscode.
Here's my two cents worth.
I thought it was really nice to start off the topic highlighting 'pipelines'; it's unfortunately one of the most missed components when people start off with ML work. Your article mentioned drifts and how the MLOps process covers them. I thought there are two more components that are important and deserve some mention. Retraining pipelines: ML engineers tend not to give much thought to how they want to transition a training pipeline in development to an automated retraining pipe...
Yeah that'll cover the first two points, but I don't see how it'll end up as a dataset catalogue as advertised.
To note, the latest code has been pushed to the GitLab repo.
The doc also mentioned preconfigured services with selectors in the form of "ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022. Would you have any examples of how to do this?
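To make the question concrete, here's my guess (a sketch only, with hypothetical names and namespace, using the kubernetes Python client) at what one such per-pod service could look like:
` # Hypothetical sketch of one per-pod Service matching the selector/targetPort the doc describes.
from kubernetes import client, config

config.load_kube_config()

pod_number = 1  # presumably one Service per pod, pod-1 .. pod-N
svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="clearml-session-pod-%d" % pod_number),  # name is a guess
    spec=client.V1ServiceSpec(
        selector={"ai.allegro.agent.serial": "pod-%d" % pod_number},
        ports=[client.V1ServicePort(port=10022, target_port=10022)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="clearml", body=svc)  # namespace is a guess `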
No issues. I know it's hard to track open threads with Slack. I wish there were a plugin for this too. 🙂
Any idea where I can find the relevant API calls for this?
Yup, in this case it wasn't root. Removing that USER and the -u in pip solves the problem. However, in our production images, we are required to remove root access.
` FROM nvidia/cuda:10.1-cudnn7-devel
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
    python3-opencv ca-certificates python3-dev git wget sudo ninja-build
RUN ln -sv /usr/bin/python3 /usr/bin/python
# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} a...
I'm having the same problem. Are you using the latest clearml-agent? Does your docker image run as a root user by default?
After some churning, this is the answer: change it in the clearml.conf generated by clearml-agent init.
` default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.1-runtime-ubuntu18.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
arguments: ["--env GIT_SSL_NO_VERIFY=true",]
} `
This is strange then. Is it possible for the clearml logs to register a successful save into S3 storage when actually it isn't saved? For example, I've seen in past experience certain S3 clients that saved onto a local folder called 's3:/' instead of putting the data on S3 storage itself.
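In the past I've sanity-checked this with something like the following (a rough sketch, assuming boto3 is configured and using a placeholder artifact URI copied from what ClearML reports):
` # Sanity check that an object really landed on S3 and not in a local 's3:/' folder.
import boto3
from urllib.parse import urlparse

uri = "s3://my-bucket/artifacts/model.pt"  # placeholder: paste the URI ClearML logged
parsed = urlparse(uri)

s3 = boto3.client("s3")
# head_object raises botocore.exceptions.ClientError if the key does not exist on S3.
s3.head_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
print("object exists on S3:", uri) `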
Previously we had similar issues when we switched the images used in the agent. You might want to check on that.
Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Referencing your suggestion, we can configure output_uri on task.set_base_docker(), but how should we do this for the credentials?
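For reference, this is roughly what I had in mind (a sketch only, assuming the single-string task.set_base_docker() form and hypothetical project/bucket names; note that anything passed as docker env args is visible in the task's execution details in the UI):
` # Sketch: per-user credentials injected into the container the agent spawns,
# alongside a per-task output_uri. All names are placeholders.
import os
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="remote-run",
    output_uri="s3://my-bucket/artifacts",  # placeholder bucket
)

# Forward the submitting user's own credentials instead of the agent machine's.
task.set_base_docker(
    "nvidia/cuda:10.2-devel-ubuntu18.04"
    " -e AWS_ACCESS_KEY_ID=%s -e AWS_SECRET_ACCESS_KEY=%s"
    % (os.environ["AWS_ACCESS_KEY_ID"], os.environ["AWS_SECRET_ACCESS_KEY"])
) `
Not great security-wise, but at least the credentials stay per-user rather than per-agent.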
Hi, I changed it, but it still points to https://files.pythonhosted.org/packages.
OK, thanks. This would mean that increasing the disk space for my ClearML server is the only option, as we are not at liberty to delete anything.
Hi, clearml-agent==0.17.2rc3 did work. I'm on a 1.19 k8s cluster, and I get this error when a task is pulled. Is the glue not compatible with 1.19?
` Pulling task 3a90802d1dfa4ec09fbccba0beffbaa8 launching on kubernetes cluster
Pushing task 3a90802d1dfa4ec09fbccba0beffbaa8 into temporary pending queue
Kubernetes scheduling task id=3a90802d1dfa4ec09fbccba0beffbaa8
kubectl output:
Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been depre...
Hi AgitatedDove14, that's what I am trying to figure out as well. The task has nothing to do with torch, and the requirements.txt doesn't have any torch packages either.
AlertBlackbird30, actually the log says 10.2:
docker_cmd = nvidia/cuda:10.2-devel-ubuntu18.04 -e GIT_SSL_NO_VERIFY=true
I meant the dataset id.
Congrats on v1.0. 🎉
So the clearml-agent daemon needs higher privileges?
I managed to find out why. The docker image I'm using does not run as the root user, hence the error. But I'm wondering why this is the case, as docker best practices do indicate we should use a non-root user in production images.