Eureka! It's actually in your documentation. It was apparently removed in 0.17.
https://allegro.ai/clearml/docs/docs/release_notes/ver_0_17.html#clearml-agent-0-17-2
And these are my logs; it tried to install something and hit a permission-denied error. It wouldn't have if it had obeyed force_repo_requirements_txt.
1620664917916 Kahs-MacBook-Pro.local info ClearML Task: created new task id=024a421c0e174650a1c7ff64af756c26 ClearML results page:
1620664920359 Kahs-MacBook-Pro.local info ClearML Mon...
Hi. The upgrade seems to have gone well, but I'm seeing one weird output. When I run a task and look at the software installed under the Execution tab, I still see clearml=0.17. Is this expected?
Hi, it makes sense to automate this part just like how you automate the rest of the MLOps flow. Especially since you already support Data Versioning/Lineage, Data Provenance (how a dataset ties into the experiment and acts as a model source) should be in too. Although I agree that technically it's probably not possible to tell whether users actually used the indicated datasets after they do a datasets.get_copy().
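To illustrate what I mean by provenance, this is roughly the kind of manual bookkeeping we do today (a minimal sketch; the dataset and project names are placeholders, and task.connect() is just one way of surfacing it on the task):
from clearml import Task, Dataset

# Placeholder project/task/dataset names; adjust to your setup.
task = Task.init(project_name="my_project", task_name="training_with_lineage")

# Fetch the dataset and keep a read-only local copy of its files.
dataset = Dataset.get(dataset_name="my_dataset", dataset_project="my_project")
local_path = dataset.get_local_copy()

# Record which dataset this experiment consumed, so it shows up on the task.
task.connect({"dataset_id": dataset.id, "dataset_local_path": local_path})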
Thanks GrumpyPenguin23, I'll look deeper into that. This kind of fits what I am looking for, but it's for TRAINS and there's no technical how-to.
https://clear.ml/blog/stop-using-kubernetes-for-ml-ops/
Yes, it's on purpose; each user would have their own AWS credentials for default_output_uri.
Hi SuccessfulKoala55, thanks. Opened an issue on the ClearML-Agent GitHub at https://github.com/allegroai/clearml-agent/issues/67
I've been reading the documentation for a while and I'm not understanding the following very well.
Given an open-source codebase, say Hugging Face, I want to do some training and track my experiments using ClearML. The obvious choice would be to use Explicit Reporting in ClearML. But the part about sending my training job and letting ClearML orchestrate it is vague. I would appreciate being pointed to the right documentation on this.
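To make it concrete, this is the kind of flow I'm imagining (a rough sketch; the project, task and queue names are placeholders, and I'm assuming an agent is already listening on that queue):
from clearml import Task

task = Task.init(project_name="huggingface_experiments", task_name="finetune_run")

# Hand the job over to ClearML: the local process exits here and an agent
# clones the repo and re-runs this script from the queue.
task.execute_remotely(queue_name="gpu_queue", exit_process=True)

# Everything below runs on the agent. Explicit reporting of my own metrics:
logger = task.get_logger()
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # placeholder for the real training loop
    logger.report_scalar(title="loss", series="train", value=train_loss, iteration=epoch)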
I'm also beginning to think this is related to https://clearml.slack.com/archives/CTK20V944/p1620664770492400 . Previously, when I set force_repo_requirements_txt: true and system_site_packages: true, it seemed to work. Upgrading to v1.0.2 seems to change things.
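For reference, this is roughly the relevant part of my clearml.conf (paraphrased from memory, so treat it as a sketch rather than the exact file):
agent {
    package_manager {
        # Install only from the repo's requirements.txt instead of the detected packages.
        force_repo_requirements_txt: true
        # Reuse packages already installed in the system environment.
        system_site_packages: true
    }
}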
Hi CostlyOstrich36, nothing in particular. I was doing some research and noticed that ML Pipelines were mentioned not even once in the literature, so I wonder if one should be done. I'm looking at other aspects as well, but I'll gradually ask about those.
Hi, the latest k8sglue-example.py was last committed about 4 months ago. Are you referring to that version?
We are deploying the ClearML Server via docker-compose.
For ClearML-Agent, we have the choice of Docker or K8s, with K8s preferred (using the glue).
For K8S, we can't get the glue to work ( https://clearml.slack.com/archives/CTK20V944/p1614525898114200?thread_ts=1613923591.002100&cid=CTK20V944 ) so we can't make an assessment of whether it actually works for us.
OK, that worked. So every time I have changes in the code, I will have to rerun the experiment on my own machine, which doesn't have any GPUs?
That kinda defeats the purpose of using the ClearML Agent.
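What I was hoping for is something along the lines of clearml-task, which, as far as I understand, registers the script and sends it straight to a queue without running it locally first (the project, name and queue below are placeholders):
clearml-task --project my_project --name my_experiment --script train.py --queue gpu_queue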
Yup, in this case it wasn't root. Removing that USER and the -u in pip solves the problem. However, in our production images, we are required to remove root access.
FROM nvidia/cuda:10.1-cudnn7-devel
ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && apt-get install -y \
    python3-opencv ca-certificates python3-dev git wget sudo ninja-build
RUN ln -sv /usr/bin/python3 /usr/bin/python

# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} a...
Running docker exec clearml-elastic curl gives: zsh: no matches found:
Thanks SuccessfulKoala55, how might I do this cleanup? Does this increase with more use of ClearML? And to add, we save all artifacts onto a remote S3 server.
No issues. I know it's hard to track open threads with Slack. I wish there were a plugin for this too. 🙂
Yes of course, it's a long one.
Hi, the scenario is as follows:
1. client.py runs task.execute_remotely(queue='myqueue', exit_process=True).
2. The API section of clearml.conf on the client side is read in.
3. The client side calls the ClearML server and inserts the task into the queue.
4. The K8s glue retrieves the task from the queue.
5. It spawns a K8s pod.
6. The K8s pod performs a git clone.
7. Error: SSH keys not found.
Each individual has their own key in their GitLab profile, and GitLab is configured to only work via SSH.
We can't place the key in the image as this is as good as ...
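What we're considering instead is mounting each user's key from a Kubernetes Secret through the glue's pod template. A rough sketch of the idea (the secret name git-ssh-key, the container name and the mount path are all assumptions, not something taken from the docs):
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: clearml-agent
      volumeMounts:
        - name: ssh-key
          mountPath: /root/.ssh
          readOnly: true
  volumes:
    - name: ssh-key
      secret:
        secretName: git-ssh-key
        defaultMode: 0400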
Likely network. Can you run a curl against the ClearML API server from the Jenkins stage and see if that gets through?
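Something like this from the Jenkins stage, assuming the API server is exposed on the default port 8008 and that the debug.ping endpoint is available (adjust host and port to your deployment):
curl -sf http://clearml-server.example.com:8008/debug.ping && echo "API server reachable"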
Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION = and nothing else.
OK. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg an issue?
Hi SuccessfulKoala55, I was referring to Task.init() or any other SDK API that we use in our training code.
I'm using this feature. In this case I would create 2 agents, one on a CPU-only queue and the other on a GPU queue, and then at the code level decide which queue to send to.
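i.e. something along these lines, where the queue names are just placeholders and the GPU decision would come from the model config or CLI args:
from clearml import Task

NEEDS_GPU = True  # e.g. derived from the model config or command-line arguments

task = Task.init(project_name="my_project", task_name="auto_queue_routing")
queue_name = "gpu_queue" if NEEDS_GPU else "cpu_queue"

# Enqueue onto the matching agent; the local process exits here.
task.execute_remotely(queue_name=queue_name, exit_process=True)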
Hi, building a container with vscode bundled in is not possible for us. If I host vscode at an alternative location, where should I indicate that in the configuration?
That didn't work either...
What's the difference between --template-yaml and --overrides-yaml? I used the latter to ensure the GPU is passed in.
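For context, the overrides file I pass looks roughly like this (the exact structure is from memory, so treat it as a sketch; the resource value is illustrative):
spec:
  containers:
    - resources:
        limits:
          nvidia.com/gpu: 1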
Hi SuccessfulKoala55 , would they need the fileserver to route to minio then? E.g.
This will ensure that any actions by clearml-data and models are saved into the S3 object store.
api {
    files_server: s3://ecs.ai:80/clearml-data/default
}
aws {
    s3 {
        credentials {
            host: http://ecs.ai:80
            ## Insert the IAM credentials provided by your SAs here.
        }
    }
}
But if the user forgets to do the above, the artifacts will be saved on the ClearML server. If I switch off f...
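In other words, what I'd rather not depend on is every user remembering to set the destination in code, e.g. (the bucket path is the one from the config above):
from clearml import Task

# Forces artifacts and models for this task to the S3 store even if clearml.conf
# has no default output destination configured.
task = Task.init(
    project_name="my_project",
    task_name="my_experiment",
    output_uri="s3://ecs.ai:80/clearml-data/default",
)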