I see. Can I take it that when the client uses task.execute_remotely(queue_name="1gpu", exit_process=True), none of the content in its clearml.conf will be used except for the api part, and ClearML simply uses whatever is on the Agent side?
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp,
    # Override with os environment: ...
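Roughly the client-side flow I mean, as a sketch (project and task names are just placeholders; the point is that once the task is enqueued, the agent machine's own clearml.conf governs execution, and the api credentials on each side are only used to reach the server):
```python
from clearml import Task

# Register the script, git info and captured requirements with the server.
task = Task.init(project_name="examples", task_name="remote run")

# Stop the local process and enqueue the task on the "1gpu" queue.
# From here on, the agent that pulls it runs under the agent machine's
# clearml.conf (docker, pip settings, etc.), not the client's.
task.execute_remotely(queue_name="1gpu", exit_process=True)

# Anything below this line executes only on the agent.
print("running on the agent")
```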
Yes! I definitely think this is important, and hopefully we will see something there
(or at least in the docs)
Hi AgitatedDove14 , any updates in the docs to demonstrate this yet?
It's hard to tell, but the agent change was a significant one. Unless Python versions have something to do with it.
I used the nvcr PyTorch image and instructed ClearML to inherit global dependencies. No need to install torch, and it works well.
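Roughly how I wired it up, as a sketch (the NGC image tag is only illustrative; agent.package_manager.system_site_packages: true in the agent's clearml.conf is what lets the task venv inherit the torch already inside the container):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="ngc pytorch base image")

# Ask an agent running in --docker mode to execute this task inside the
# NGC PyTorch container, where torch is already installed.
task.set_base_docker("nvcr.io/nvidia/pytorch:22.04-py3")

# With agent.package_manager.system_site_packages: true on the agent side,
# the task venv inherits the container's packages, so torch is not reinstalled.
```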
Do you mean this? Removing containers section: [{'image': 'clearml-agent:latest"', 'env': [{'name': 'PIP_INDEX_URL', 'value': ' '},
I'm also noticing a lot of this while the k8s glue is running. Ex: Expecting value: line 1 column 1 (char 0) K8S Glue pods monitor: Failed parsing kubectl output:
I see, I understand better now. Thanks.
Hi thanks.
So I suppose ClearML makes use of the information in the .git folder at the root of the script folder to gather that info.
I have yet to go through ClearML Agent thoroughly. TimelyPenguin76 , so if I run a training with uncommitted changes and don't commit/push afterwards, when I clone the task, won't the ClearML agent be unable to pull that script from the git repo?
Hi, I don't think the clearml agent actually ran at that point in time. All I can see in the pod is:
apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages, and pip install of clearml-agent. After the above are successful, the pod just hangs there.
Thanks. Gonna try that out. But I hit another snag. Strangely, the Agent is not creating the right venv. This is what the Agent created.
```
pip:
- asn1crypto==0.24.0
- attrs==20.3.0
- certifi==2020.12.5
- chardet==4.0.0
- cryptography==2.1.4
- Cython==0.29.22
- furl==2.1.0
- future==0.18.2
- humanfriendly==9.1
- idna==2.6
- importlib-metadata==3.7.0
- jsonschema==3.2.0
- keyring==10.6.0
- keyrings.alt==3.0
- orderedmultidict==1.0.1
- pathlib2==2.3.5
- psutil==5.8.0
- pycrypto==2.6.1
- pygobject...
```
Thanks. Which brings me to the question. How does ClearML deal with all the CVEs? What is your process for response?
Hi, so this means if I want to use Kubernetes, I would have to 'manually' install multiple agents on all the worker nodes?
Which clearml.conf is it referring to? I'm executing on my client, which is then remotely executed by the agent. Both of them have ~/clearml.conf.
Where should I indicate this in the configuration?
Any idea?
Is there any way to see an error log from that?
Got that, thanks. Just to understand better: when clearml-data uploads my recursive folder of image data, it converts it into a compressed form with a different folder structure from the original dataset.
When my software pulls the data, I'm returned a str. How would we manipulate the data from there?
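For reference, the access pattern I mean, as a sketch (the dataset id and the .jpg glob are placeholders; the str seems to be just a path to a local, already-extracted cached copy of the dataset):
```python
from pathlib import Path
from clearml import Dataset

dataset = Dataset.get(dataset_id="<dataset-id>")  # placeholder id
local_dir = dataset.get_local_copy()  # str path to a cached, read-only copy

# From there it is an ordinary folder: iterate, open files, or point a dataloader at it.
for img in Path(local_dir).rglob("*.jpg"):
    print(img)
```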
I'm also beginning to think this is related to https://clearml.slack.com/archives/CTK20V944/p1620664770492400 . Previously, when I set force_repo_requirements_txt=true and system_site_packages: true , it seemed to work. Upgrading to v1.02 seems to change things.
So the context I'm asking in is that I realise I'll need to catalogue all the dataset ids created by people separately on a spreadsheet, and for each experiment I'll need to go into the code commit to see which id is being used. On the other hand, I thought I'd seen advertised use cases where the experiment can be directly linked to the dataset id being used. The brain's a bit rusty to recall how it was done.
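One pattern that might be what I'm half-remembering (a sketch only, with placeholder names): recording the dataset id on the experiment itself so it shows up in the web UI instead of only in the code commit:
```python
from clearml import Task, Dataset

task = Task.init(project_name="examples", task_name="train with dataset")

# Connect the dataset id as an experiment parameter so it is visible
# (and overridable when cloning) in the web UI.
params = task.connect({"dataset_id": "<dataset-id>"})

dataset = Dataset.get(dataset_id=params["dataset_id"])
data_dir = dataset.get_local_copy()
```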
If we run all the rank 0 and rank n tasks individually, it defeats the purpose of using ClearML.
ok thanks.
Hi SuccessfulKoala55 , just to add, my clearml.conf (client) and clearml.agent.conf (agent) can have differing values. I'm not sure which one takes precedence and if this could be the cause.
Thanks SuccessfulKoala55 , how might I do this clean up? Does this increase with more use of ClearML? And to add, we save all artifacts onto a remote S3 server.
Yes it is! But ClearML doesn't support multi-node training out of the box in a way that streamlines the process, so we are trying to figure out a way to do it.
From an efficiency perspective, we should be pulling data as we feed it into training. That said, it's always a good idea to uncompress large zip files and store them as smaller ones that allow you to pull batches for training.
Hi, it is missing --docker on the agent. Thanks! The Dynamic GPU option is only available with the Enterprise version, right?