I've been reading the documentation for a while and I'm still not quite getting the following.
Given an open source codebase, say Hugging Face. I wanted to do some training and I wanted to track my experiments using ClearML. The obvious choice would be to use Explicit Reporting in ClearML. But the part on sending my training job and letting ClearML orchestrate it is vague. Would appreciate if I can be guided to the right documentation on this.
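For context, a minimal sketch of what I had in mind with Explicit Reporting, assuming made-up project/task names and a placeholder training loop (the execute_remotely line is only relevant once an agent queue is set up):
` from clearml import Task

# Register the run with the ClearML server (made-up project/task names)
task = Task.init(project_name="hf-experiments", task_name="bert-finetune")
logger = task.get_logger()

# Placeholder training loop with explicit scalar reporting per iteration
for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for the real training loss
    logger.report_scalar(title="train", series="loss", value=loss, iteration=step)

# Optionally hand the job over to an agent queue instead of running it locally
# task.execute_remotely(queue_name="default", exit_process=True) `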
ah... thanks!
Hi Jake, thanks for the suggestion, let me try it out.
Ok. The problem was resolved with the latest versions of clearml-agent and clearml.
This one can be solved with a shared cache + a pipeline step refreshing the cache on the shared-cache machine.
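A rough sketch of what that refresh step could look like as a ClearML pipeline, assuming a recent clearml version, a made-up refresh_shared_cache function, a shared mount at /mnt/shared_cache, and an agent on the shared-cache machine serving a `services` queue:
` from clearml.automation import PipelineController

def refresh_shared_cache(cache_dir):
    # Hypothetical refresh logic: re-sync the artifacts into the shared mount
    import pathlib
    pathlib.Path(cache_dir).mkdir(parents=True, exist_ok=True)
    # ... download / rsync the fresh data into cache_dir ...
    return cache_dir

pipe = PipelineController(name="shared-cache-refresh", project="examples", version="1.0.0")
pipe.add_function_step(
    name="refresh_cache",
    function=refresh_shared_cache,
    function_kwargs=dict(cache_dir="/mnt/shared_cache"),
    execution_queue="services",  # queue served by the agent on the shared-cache machine
)
pipe.start(queue="services") `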
Would you have an example of this in your code blogs to demonstrate this utilisation?
I'm using this feature. In this case I would create 2 agents, one with a CPU-only queue and the other with a GPU queue, and then at the code level decide which queue to send to.
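A minimal sketch of the code-level routing, assuming the two queues are simply named `cpu` and `gpu` (the queue names are just an assumption):
` from clearml import Task

task = Task.init(project_name="examples", task_name="queue-routing")

# Decide at the code level which agent queue should pick up the job
use_gpu = True  # e.g. based on a config flag or model size
task.execute_remotely(queue_name="gpu" if use_gpu else "cpu", exit_process=True) `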
I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.
Space is way above nominal. What created this folder that it's trying to process? What processing is this?
`Processing /tmp/build/80754af9/attrs_1604765588209/work`
Are there any paths on the agent machine that I can clear out to remove any possible issues from previous versions?
They don't have the same version. I do seem to notice that if the client is using version 3.8, remote execution will try to use that same Python version even though the docker image doesn't have it installed.
This is probably the whole script.
` kubectl get nodes
pip install clearml-agent
python k8s_glue_example.py `
Can you please verify that you have all the required packages installed locally?
It's not installed on the image that runs the experiment, but it's reflected in the requirements.txt.
What is the setting of `agent.package_manager.system_site_packages`?
True.
The apply.yaml template is not working (e.g. the arguments env is not passed to the container); this is why I tried the code approach instead.
Hi, I was reading this thread and wondered from which versions of clearml-server and clearml-agent this has taken effect?
In the ClearML config that's being run by the ClearML container?
Hi AgitatedDove14, I changed everything to CUDA 10.1 and tried again with the same error. The section is as follows. I made sure torch==1.6.0+cu101 and torchvision==0.8.2+cu101 are in the PyPI repo. But the same error still came up.
` # Python 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0]
boto3 == 1.14.56
clearml == 0.17.4
numpy == 1.19.1
torch == 1.6.0
torchvision == 0.7.0
Detailed import analysis
**************************
IMPORT PACKAGE boto3
clearml.storage: 0
IMPORT PACKAG...
I can't seem to find the fix to this. Ended up using an image that comes with torch installed.
I would say yes; otherwise the vscode feature is only available on internet-connected premises due to the hard-coded URL used to download vscode.
Here's my two cents worth.
I thought it's really nice to start off the topic highlighting 'pipelines'; it's unfortunately one of the most missed components when people start off with ML work. Your article mentioned drifts and how the MLOps process covers them. I thought there are 2 more components that are important and deserve some mention. Retraining pipelines: ML engineers tend not to give much thought to how they want to transition a training pipeline in development to an automated retraining pipe...
Yeah that'll cover the first two points, but I don't see how it'll end up as a dataset catalogue as advertised.
To note, the latest code has been pushed to the GitLab repo.
The doc also mentioned preconfigured services with selectors in the form of "ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022. Would you have any examples of how to do this?
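For reference, this is roughly what I was picturing, sketched with the kubernetes Python client; the pod index, namespace and service name are placeholders I made up:
` from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod_index = 0  # placeholder: one preconfigured service per pod index
service = client.V1Service(
    metadata=client.V1ObjectMeta(name=f"clearml-session-pod-{pod_index}"),
    spec=client.V1ServiceSpec(
        selector={"ai.allegro.agent.serial": f"pod-{pod_index}"},
        ports=[client.V1ServicePort(port=10022, target_port=10022)],
    ),
)
core.create_namespaced_service(namespace="clearml", body=service) `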
No issues. I know its hard to track open threads with Slack. I wish there's a plugin for this too. 🙂
Any idea where I can find the relevant API calls for this?
Alright, thanks. It's important we clarify it works before we migrate the infra.
Yup, in this case it wasn't root. Removing that USER and -u in pip solves the problem. However, in our production images, we are required to remove root access.
` FROM nvidia/cuda:10.1-cudnn7-devel
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
    python3-opencv ca-certificates python3-dev git wget sudo ninja-build
RUN ln -sv /usr/bin/python3 /usr/bin/python
# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} a...
I'm having the same problem. Are you using the latest clearml-agent? Is your docker image a root user by default?
After some churning, this is the answer. Change it in the clearml.conf generated by clearml-agent init.
` default_docker: {
    # default docker image to use when running in docker mode
    image: "nvidia/cuda:10.1-runtime-ubuntu18.04"
    # optional arguments to pass to docker image
    # arguments: ["--ipc=host", ]
    arguments: ["--env GIT_SSL_NO_VERIFY=true",]
  } `
This is strange then. Is it possible for ClearML logs to report successfully saving into S3 storage when actually nothing was saved there? For example, I've seen in past experiences with certain S3 clients that files were saved onto a local folder called 's3:/' instead of being put on S3 storage itself.
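A quick way I'd double-check on my side, sketched with boto3 (the bucket and key here are placeholders):
` import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # Confirm the object really exists in the bucket the logs claim it was uploaded to
    head = s3.head_object(Bucket="my-clearml-bucket", Key="project/task-id/artifacts/model.pt")
    print("Found on S3, size:", head["ContentLength"])
except ClientError as err:
    print("Not found on S3:", err) `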
Previously we had similar issues when we switched the images used by the agent. Might want to check on that.