I think I've narrowed this down to the SSH connection approach.
regarding the container that runs the pipeline:
- when I stopped routing it through autossh tunnels and instead put it on the same machine as the ClearML server (using Docker host network mode), the problematic pipeline suddenly started completing.
It's just so odd that the pipeline controller task is the only one with an issue; the modeling / data-creation tasks all seem to complete consistently just fine.
so yeah, best guess n...
The pipeline is there to orchestrate tasks into more complex functionality and to take advantage of caching, yes.
Here I run backtesting (how well did I predict the future?) and can control the frequency: every week, every month, etc.
So if I increase the frequency, I don't need to rerun certain branches of the pipeline; they get served from cache. Another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.
the pictured pipeline is: "create data...
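To give a feel for the shape (a rough sketch, not my actual code - the step functions, names, and project are placeholders):

from clearml import PipelineController

# placeholder step functions standing in for the real layers
def create_data(frequency: str):
    return {"freq": frequency}

def train(dataset):
    return {"trained_on": dataset}

def backtest(model):
    return {"score": 0.0}

pipe = PipelineController(name="backtest-pipeline", project="examples", version="0.0.1")
pipe.add_parameter(name="frequency", default="monthly")

# layer 1: build the dataset
pipe.add_function_step(
    name="create_data",
    function=create_data,
    function_kwargs={"frequency": "${pipeline.frequency}"},
    function_return=["dataset"],
    cache_executed_step=True,  # unchanged code + inputs => reuse the cached run
)

# layer 2: depends only on layer 1's output
pipe.add_function_step(
    name="train",
    function=train,
    function_kwargs={"dataset": "${create_data.dataset}"},
    function_return=["model"],
    cache_executed_step=True,
)

# layer 3: changing only this layer leaves layers 1-2 cached
pipe.add_function_step(
    name="backtest",
    function=backtest,
    function_kwargs={"model": "${train.model}"},
    cache_executed_step=True,
)

# (the launch call - pipe.start / start_locally - is the part discussed below)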
I understood that part, but noticed that when I put in the code to start remotely, the DAG computation seems to happen twice: once on my machine as it runs, and then again remotely (this is at least part of why it's slower). If I put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
This is unlike tasks, which somehow are smart enough to register themselves in draft form when task.execute_remotely is at the top.
Do I just leave off pipe.start?
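For reference, the two launch patterns I'm comparing (continuing the sketch above; this is my understanding, so take it with a grain of salt):

from clearml import PipelineController

pipe = PipelineController(name="backtest-pipeline", project="examples", version="0.0.1")
# ... add_function_step calls as in the sketch above ...

# Option A: keep the controller logic on this machine and only enqueue the steps
# pipe.start_locally(run_pipeline_steps_locally=False)

# Option B: enqueue the controller task itself to the services queue,
# which (as observed) rebuilds the DAG remotely
pipe.start(queue="services")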
I did manage to figure this out with
docker compose stop agent-services
docker compose up --force-recreate --no-deps -d agent-services
and running an export for the newly generated key.
Still, though, I'm noticing that restarts cause the App Credentials to be lost.
I really can't provide a script that matches exactly (though I do plan to publish something like this soon enough), but here's one that's quite close / similar in style:
None where I tried function-steps out instead, but it's a similar architecture for the pipeline (the point of the example was to show how to do a dynamic pipeline)
It's odd... I really don't see any tasks dying except the controller one.
thank you!
out of curiosity: how come the clearml-webserver upgrades weren't included in this release? was it just to patch the api part of the codebase?
@<1523701868901961728:profile|ReassuredTiger98> I'd suggest trying to set up different queues for different repos then, each with a read-only token. The issue really only arises when you want a single token to grant access to many repos. A little inconvenient, but definitely possible.
I'll also say I've had an easy time forking and modifying the agent code for custom logic changes, so you can always consider that option as well. It's easy enough to read, to be honest.
I would assume a lot of them are logs streaming? So you can try reducing printouts / progress bars. That seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
ah, I'm self-hosting.
Progress bars could easily account for several thousand calls, since they update with each batch.
I'd love to know if the number of API calls decreases substantially when turning off auto_connect_streams. Please post an update when you have one!
For me, the fix was to raise the log level and reduce the number of prints my code was doing. Since I was using a logger instead of prints, it was pretty easy.
If you're using some framework that spits out its own progress bars, then I'd look into whatever options it has for disabling them.
As for turning off logs entirely, I don't know; I'll let the ClearML folks respond to that.
For sure, though, the CPU-monitoring and epoch-monitoring comms will lead to a lot of calls... but I'll agree 80k seems exce...
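For what it's worth, the two knobs being discussed look roughly like this (a sketch; as far as I know auto_connect_streams also accepts a dict if you only want to mute stdout):

import logging
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="quiet-run",
    # stop capturing stdout/stderr console streams (progress bars updating every
    # batch are a big source of the reported API traffic)
    auto_connect_streams=False,
)

# and/or raise the log level so routine progress messages are never emitted
logging.getLogger().setLevel(logging.WARNING)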
It sounds like you understand the limitations correctly.
As far as I know, it'd be up to you to write your own code that computes the delta between old and new and only re-process the new entries.
The API would let you search through prior experimental results.
So you could load up the prior task, check the IDs that showed up in its output (maybe save these as a separate artifact for faster load times), and only evaluate the new inputs. Perhaps you'd copy the old outputs over to the new task...
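Something in this direction (a sketch only - the project, task, and artifact names are made up):

from clearml import Task

# load the previous run and the set of input IDs it already processed,
# assuming that run saved them as an artifact named "processed_ids"
prev_task = Task.get_task(project_name="examples", task_name="evaluate-inputs")
seen_ids = set(prev_task.artifacts["processed_ids"].get())

current = Task.init(project_name="examples", task_name="evaluate-inputs")

new_inputs = {"id-1": "payload-1", "id-2": "payload-2"}  # whatever the current inputs are
todo = {k: v for k, v in new_inputs.items() if k not in seen_ids}

# ... evaluate only `todo` here, optionally copying the old outputs across ...

# save the updated ID set so the next run can diff against it
current.upload_artifact("processed_ids", list(seen_ids | set(todo)))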
FWIW, I'm starting to wonder if there's a difference between me "resetting" the task vs. cloning it.
Update: ever since turning off git caching, I've had much more stability. I can't tell whether it's causing a slowdown in task execution though - is the clone a shallow one by default?
Pipeline step caching matches on inputs and task status. If your task points to the latest commit, ClearML can't know what that is until runtime, so it can't cache. With a fixed tag or commit, it sees that no code has changed, and so if the inputs match (hashable, all parameters serializable), then it caches.
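Concretely, that means pinning the step's code reference, e.g. (assuming your clearml version exposes repo / repo_commit on add_function_step; the repo URL and tag here are placeholders):

from clearml import PipelineController

def create_data(frequency: str):
    return {"freq": frequency}

pipe = PipelineController(name="cached-pipeline", project="examples", version="0.0.1")

pipe.add_function_step(
    name="create_data",
    function=create_data,
    function_kwargs={"frequency": "monthly"},
    cache_executed_step=True,                           # a cache hit needs matching code + inputs
    repo="https://github.com/your-org/your-repo.git",   # placeholder
    repo_commit="v1.2.0",                               # fixed tag/commit => code hash known up front
)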
It's happening pretty reliably, but the logs are just not informative. It just stops midway.
For digitalocean:
host: "(region).digitaloceanspaces.com:443"
bucket: "(bucket name)"
key: "(key)"
secret: "(secret)"
multipart: false
secure: true
(verify commented out entirely)
So for you: make sure to add credentials that have the right scope (read/write), and try specifying the bucket.
Then in the ClearML tasks themselves, you tell the task where to write using output_uri="s3://(region).digitaloceanspaces.com:443/clearml/"
(I import this as a constant from a _constants.py file...
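Roughly like this (the constant name and project/task names are just my convention, nothing ClearML-specific):

from clearml import Task

# in my case this lives in _constants.py and gets imported wherever tasks are created
DO_SPACES_OUTPUT_URI = "s3://(region).digitaloceanspaces.com:443/clearml/"

task = Task.init(
    project_name="examples",
    task_name="train-model",
    output_uri=DO_SPACES_OUTPUT_URI,  # artifacts and models get uploaded to the bucket
)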
Basically the git hash of the executed experiment plus a hash of the inputs to the task.
But pretty reliably, some proportion of tasks still just take much longer. A variance of 1m to 10m is something I'd really like to understand.
Everything I just said comes from the screenshotted webpage and concerns the CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY env vars.
When I restart the ClearML server, the keys disappear. This was not the case before upgrading.
When I do a docker compose down; docker compose up -d ... these disappear.
To be clear... this was not happening before I upgraded to the latest version. That is why I am asking about this.

Damn, I can't believe it. It disappeared again, despite the task's clearml version being 1.15.1.
I'm going to try running the pipeline locally.
What if the preexisting venv is just the system Python? My base image is python:3.10.10 and I just pip install all requirements into that image. Does that still not avoid creating a venv?
It's good to know that in theory there's a path forward with almost zero overhead. That's what I want.
Is it reasonable to expect that, with sufficient workers, I can get 50 tasks to run in the same time it takes to run a single one? I can't imagine the apiserver being a noticeable bottleneck.
I ran into this recently.
It's a small thing, but double-check the port: it should be 443, not 433 as in the docs (typo?) - seems you got this right in the screenshot.
No region should be set.
I don't use Backblaze, but if it helps I can show my DigitalOcean Spaces config; it should be comparable.
So, I got around this with env vars.
In my worker entrypoint script, I do
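# skip building a venv and point the agent at the interpreter already in the image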
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
N/A (still shows as running despite Abort being sent)
It happens consistently with this one task, which really should be entirely cached.
I disabled cache in the final step and it seems to run now.
Mind-blowing... but somehow, later the same day, I got the same pipeline to create its DAG and start running in under a minute.
I don't know what exactly I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.
So there's hope. I'll keep stress-testing it and see what causes the differences. I was right to suspect that such a simple DAG should not take...
okay that's a similar setup to mine... that's interesting.
much more in line with my expectation.
