everything else is 200 except these two
thank you very much.
for remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
He's asking "what git credentials make sense to use for agents" - regardless of autoscaling or not. I had the same question earlier.
tldr: it depends on your security policies.
@<1719524650926477312:profile|EncouragingFish95> - if you have the ability to create a "service account" in your git provider, perhaps at the org-level, I would do that.
My org's cloud git provider does not enable this functionality, and so we have agreed that it is "acceptable" to have the agent's git credentials...
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameters connect and task create in seconds, and other times it takes 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
my operating theory is that some sort of backoff / timeout (eg 10s) is causing the hig...
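fwiw, a rough sketch of how I'd exercise that env var from a script, assuming (purely from its naming convention) that CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR maps onto the SDK's retry backoff setting; the project/task names are placeholders:
import os

# set before clearml is imported so the value is already in the environment
# when the SDK loads its configuration
os.environ["CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR"] = "0.1"

from clearml import Task

# placeholder names, just to exercise the api-server connection
task = Task.init(project_name="debug", task_name="retry-backoff-check")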
ugh. again. it launched all these tasks and then just died. logs go silent.


ah. a clue! it came right below that but i guess out of order...
that id is the pipeline that failed
here's how I'm establishing worker-server (and client-server) comms fwiw
enqueuing. pipe.start("default") but I think it's picking up on my local clearml install instead of what I told it to use.
my tasks have this in them... what's the equivalent for pipeline controllers?
the workers connect to the clearml server via ssh-tunnels, so they all talk to "localhost" despite being deployed in different places. each task creates artifacts and metrics that are used downstream
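(for anyone else wiring this up, a rough sketch of one way to point a client at that tunnel setup; the port numbers are the server defaults, assumed here, and the credentials are placeholders:)
from clearml import Task

# everything goes through ssh tunnels, so all hosts are localhost with
# forwarded ports (8008/8080/8081 are the default server ports, assumed here)
Task.set_credentials(
    api_host="http://localhost:8008",
    web_host="http://localhost:8080",
    files_host="http://localhost:8081",
    key="<access_key>",       # placeholder App Credentials
    secret="<secret_key>",
)

task = Task.init(project_name="debug", task_name="tunnel-connectivity-check")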
N/A (still shows as running despite Abort being sent)
i ended up pinning the Dockerfile instruction to 1.18, but before that I was letting the entrypoint script do the install (so, latest).
much appreciate the env var tip. that's more elegant than what i did.
since I've turned off caching I've had much better luck. is what I'm experiencing a bug? (neither bitbucket nor github private repos work on the second task per worker)
oh it's there, before running task.
from task pick-up to "git clone" is now ~30s, much better.
though as far as I understand, the recommendation is still to not run workers-in-docker like this:
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
(and fwiw I have this in my entrypoint.sh)
cat <<EOF > ~/clearml.conf
agent {
    vcs_cache {
        enabled: true
    }
    package_manager: {
        type: pip,
        ...
this is not about storage access tokens. it's about the App Credentials.
those are the things you set as CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY so that clients can talk to the api
trying to run the experiment that kept failing right now, watching logs (they go by fast)... will try to spot anything anomalous
For DigitalOcean:
host: "(region).digitaloceanspaces.com:443"
bucket: "(bucket name)"
key: "(key)"
secret: "(secret)"
multipart: false
secure: true
(verify commented out entirely)
So for you - make sure to add your creds that have the right scope (r/w), and try specifying the bucket.
Then in clearml tasks themselves you tell the task using output_uri="s3://(region).digitaloceanspaces.com:443/clearml/"
(I import this as a constant from a _constants.py file...
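roughly, the task side looks like this (a sketch only; the constant and project/task names are placeholders, and the endpoint just mirrors the format quoted above):
from clearml import Task

# placeholder constant; in my setup it lives in _constants.py
DO_SPACES_OUTPUT_URI = "s3://(region).digitaloceanspaces.com:443/clearml/"

task = Task.init(
    project_name="examples",
    task_name="train",
    output_uri=DO_SPACES_OUTPUT_URI,  # artifacts and models get uploaded to the Space
)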
for me, it was to set the log level higher up and reduce the number of prints my code was doing. since I was using a logger instead of prints, it was pretty easy.
If you're using some framework that spits out its own progress bars, then I'd look into disabling those via whatever options it exposes.
Turning off logs entirely I don't know, will let the clearml ppl respond to that.
For sure though, the comms of CPU monitoring and epoch monitoring will lead to a lot of calls... but i'll agree 80k seems exce...
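as a rough sketch, the change on my side amounted to something like this (the logger name is a placeholder):
import logging

# raise the threshold so per-batch INFO/DEBUG messages never hit the console,
# and therefore never get shipped to the server as console output
logging.getLogger().setLevel(logging.WARNING)
logging.getLogger("my_project").setLevel(logging.WARNING)  # placeholder logger name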
it's really frustrating, as I'm trying to debug server behavior (so I'm restarting often), and keep needing to re-create these.
that same pipeline with just 1 date input.
i have the flexibility from the UI to either run a single, a dozen, or a hundred experiments... in parallel.
pipelines are amazing 😃
ah I see. thank you very much!
trying export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
but I still see Environment setup completed successfully
(it is printed after Running task id)
it still takes a full 3 minutes from the task being pulled by the worker until Running task id
is this normal? What is happening in these few minutes (besides a git pull / switch)?
I would assume a lot of them are logs streaming? So you can try reducing printouts / progress bars. That seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
ah, I'm self-hosting.
progress bars could easily account for several thousand calls, as they advance with each batch.
would love to know if the # of API calls decreases substantially by turning off auto_connect_streams. please post an update when you have one 😃
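for reference, that switch is a Task.init argument - something like the sketch below (project/task names are placeholders; the dict form is there in case you only want to drop some streams):
from clearml import Task

# turn off stdout/stderr/logging capture entirely...
task = Task.init(
    project_name="debug",
    task_name="stream-capture-off",
    auto_connect_streams=False,
)

# ...or selectively, e.g. keep stderr but drop stdout and python logging:
# auto_connect_streams={"stdout": False, "stderr": True, "logging": False}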
so, i got around this with env vars
in my worker entrypoint script , I do
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
clearml-server-1.15.1, clearml-1.16.2
everything i just said comes from the screenshotted webpage and is regarding the CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY env vars.
when i restart the clearml server, the keys disappear. this was not the case before upgrading
is there a way for me to toggle CLEARML's log level? I'm doing some manual task-debugging in ipython and think it would be helpful to see network requests and timeouts if they're occurring.
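not sure about an official switch, but as far as I know clearml logs through standard python logging, so a sketch like this in the ipython session should surface requests/retries (the logger names are my guesses at the relevant emitters):
import logging

logging.basicConfig(level=logging.INFO)

# bump the libraries that actually do the talking; "clearml" and "urllib3"
# are assumptions about which loggers emit the request/retry messages
logging.getLogger("clearml").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)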
Nope, still dealing with it.
Oddly enough, when i spin up a new instance on the new version, it doesn't seem to happen.
yeah, this problem seems to happen on 1.15.1 and 1.16.2 as well; prior runs were on the same version even. It just feels like it happens absolutely randomly (but often).
just happened again to me.
The pipeline is constructed from tasks, it basically does map/reduce. prepare data -> model training + evaluation -> backtesting performance summary.
It figures out how wide to go by parsing the date range supplied as input parameter. Been running stuff like this for months but only recently did ...
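for anyone curious, the shape of it is roughly the sketch below - the project, task and parameter names are made up, not my actual code:
from datetime import date, timedelta
from clearml import PipelineController

# in the real pipeline the range comes from an input parameter; hard-coded here
start, end = date(2024, 1, 1), date(2024, 1, 7)
days = [(start + timedelta(days=i)).isoformat() for i in range((end - start).days + 1)]

pipe = PipelineController(name="backtest-pipeline", project="examples", version="1.0.0")

train_steps = []
for d in days:
    # map: one training+evaluation task per day, cloned from a template task
    pipe.add_step(
        name=f"train_{d}",
        base_task_project="examples",
        base_task_name="model training + evaluation",
        parameter_override={"General/date": d},
    )
    train_steps.append(f"train_{d}")

# reduce: a single summary task that waits for every per-day step
pipe.add_step(
    name="backtest_summary",
    base_task_project="examples",
    base_task_name="backtesting performance summary",
    parents=train_steps,
)

pipe.start("default")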
(the "magic" of the env detection is nice but man... it has its surprises)
