Ok. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg
an issue?
I want to rule out the glue being the problem. Does the glue play a part in initialising clearml-agent after the pod is spawned?
It's running as a long-running pod on K8s. I'm using logs -f
to track its stdout.
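For context, this is roughly the command I'm using (pod name and namespace are placeholders):

# follow the glue pod's stdout/stderr
kubectl logs -f clearml-k8s-glue -n clearml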
So the clearml-agent daemon needs higher privileges?
Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION =
and nothing else.
I have since ruled out the apt and pypi repos; packages from both install properly on the pods.
Does the enterprise version support this natively?
Hi, I don't think clearml-agent actually ran at that point in time. All I can see in the pod is
an apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages, and a pip install of clearml-agent. After those succeed, the pod just hangs there.
I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.
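For what it's worth, this is roughly how I'm checking inside the pod (pod name is a placeholder; ps may not exist in slim images):

# look for the config file clearml-agent normally drops in /tmp
kubectl exec my-task-pod -- ls -la /tmp
# check whether a clearml-agent process is running at all
kubectl exec my-task-pod -- ps aux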
Hi, thanks for the examples! I will look into them. Quite a few of my teams use tf datasets to pull data directly from object stores, so tfrecords and the like are heavily involved. I'm trying to figure out whether they should version the raw data or the tfrecords with ClearML, and whether downloading the entire dataset locally can be avoided, since tf datasets handle batch downloading quite well.
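As a rough sketch of what I have in mind, assuming the clearml-data CLI in our version supports external links (bucket path and names are placeholders):

# register the tfrecords as a dataset without copying them out of the object store,
# so training can keep streaming them with tf datasets
clearml-data create --project "my-project" --name "tfrecords-v1"
clearml-data add --links s3://my-bucket/datasets/tfrecords/
clearml-data close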
Space is way above nominal. What created this folder that it's trying to process? What processing is this?
Processing /tmp/build/80754af9/attrs_1604765588209/work
Are there any paths on the agent machine that I can clear out to remove any possible issues from previous versions?
They don't have the same version. I do notice that if the client is using Python 3.8, the remote execution will try to use that same version even though the docker image does not have it installed.
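A sketch of the workaround I'm considering, assuming agent.python_binary is still the right clearml.conf key for our agent version (the interpreter path is a placeholder):

# point the agent at the interpreter baked into the docker image instead of
# recreating the client's Python 3.8 environment (HOCON merges duplicate blocks)
cat >> ~/clearml.conf <<'EOF'
agent {
    python_binary: "/usr/bin/python3.6"
}
EOF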
Hi, so this means that if I want to use Kubernetes, I would have to 'manually' install multiple agents on all the worker nodes?
And out of curiosity, what did you think we were talking about? Because I didn't see anywhere else that might print the secrets.
Thought this looked familiar.
https://clearml.slack.com/archives/CTK20V944/p1635323823155700?thread_ts=1635323823.155700&cid=CTK20V944
I managed to find out why. The docker image I'm using is not set to run as the root user, hence the error. But I'm wondering why this is the case, as Docker best practices do indicate we should use a non-root user in production images.
Yes it is! But ClearML didn't support multi-node training out of the box in a way that streamlines the process, so we are trying to figure out a way to do it.
If we run all the rank 0 and rank n tasks individually, it defeats the purpose of using ClearML.
I can't seem to find the fix for this. I ended up using an image that comes with torch preinstalled.
Hi FriendlySquid61, AgitatedDove14, the issue and a possible fix are described in this GitHub issue: https://github.com/allegroai/clearml-agent/issues/51
I did another test by running
kubectl exec pod-name -- echo $PIP_INDEX_URL
and it returned nothing, so the env vars are not being passed to the container at all.
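One caveat I realised afterwards: in that command the local shell expands $PIP_INDEX_URL before kubectl runs, so it can print nothing even when the variable is set in the container. A safer check (pod name is a placeholder):

# quote the command so expansion happens inside the container, not locally
kubectl exec pod-name -- sh -c 'echo "$PIP_INDEX_URL"'
# or dump the container environment and filter on our side
kubectl exec pod-name -- env | grep PIP_INDEX_URL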
What feature on the paid roadmap are you referring to? I am indeed communicating with Noem about paid features.
I can't seem to find the version number on the clearml web app. Is there a specific place to look?
From an efficiency perspective, we should be pulling data as we feed it into training. That said, it's always a good idea to uncompress large zip files and store them as smaller ones so you can batch-pull them for training.
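As a rough illustration of the repacking I mean (file names and shard size are placeholders):

# unpack one big archive and re-pack it into smaller shards so training jobs
# can pull just the batches they need instead of the whole dataset
unzip big_dataset.zip -d raw/
find raw/ -type f | split -l 1000 - shard_
for s in shard_*; do tar czf "${s}.tar.gz" -T "$s"; done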
It's 0.17-63.
It doesn't appear on the profile page.
Hi. The upgrade seems to have gone well, but I'm seeing one weird output. When I run a task and look at the software installed under the execution tab, I still see clearml=0.17. Is this expected?