
of what task? I'm running lots of them and benchmarking execution times. would you like to see a best case or worst case scenario? (I've kept some experiments for each).
and yeah, in those docs you just linked, "boolean" vars like CLEARML_AGENT_GIT_CLONE_VERBOSE
explicitly say true
so I ended up trying that pattern. but originally I did try 1. let me go back to that now. thank you.
overall I've seen some improvements in execution time using the suggestions in this thread (tysm!) - th...
yeah locally it did run. I then ran another via UI spawned from the successful one, it showed cached steps and then refused to run the bottom one, disappearing again. No status message, no status reason. (not running... actually dead)
ah I see. thank you very much!
trying export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
but I still see Environment setup completed successfully
(it is printed after Running task id)
it still takes a full 3 minutes between task pulled by worker until Running task id
is this normal? What is happening in these few minutes (besides a git pull / switch)?
Pipeline step caching matches on inputs and task status. If your task points to the latest commit of a branch, ClearML can't know what that commit is until runtime, so it can't cache. On a fixed tag or commit it sees that no code has changed, and if the inputs also match (hashable, all parameters serializable), the step is cached.
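e.g. roughly what I mean by pinning a step (a sketch only; the repo URL, commit hash, function body and queue names are placeholders, and I'm assuming add_function_step with cache_executed_step, which is what I've been using):
```python
from clearml import PipelineController

def preprocess(date: str) -> str:
    # placeholder step body
    return f"processed-{date}"

pipe = PipelineController(name="demo-pipeline", project="demo")

pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    function_kwargs=dict(date="2024-01-01"),
    function_return=["result"],
    repo="https://github.com/acme/monorepo.git",  # hypothetical repo
    repo_commit="abc1234",                        # fixed commit, so the code hash is known up front
    cache_executed_step=True,                     # reuse a previous run when code + inputs match
    execution_queue="default",
)

pipe.start(queue="services")
```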
I'm just working on speeding up the time from "queue experiment" to "my code actually runs remotely" - as of yesterday things would sit for many minutes at a time. trying to see if venv is the culprit.
thank you!
i'll take that design into consideration.
re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode" I'm still not quite sure I understand correctly - since the agent is running in a container, as far as it is concerned it may as well be on bare metal.
is it just that there's no way for that worker to avoid venv? (i.e. the only way to bypass venv is to use docker-mode?)
yeah, still noticing that it can be multiple minutes before something starts...
like... what is happening in this time (besides a git clone), now that I set both
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
update: it's now been six mins and the task still isn't done. this should have run through in like a minute total end-to-end, with about a hundred tasks. I'm taking a look at them now to see what the logs show for runtimes.
I would love some advice on that though - should I be using services mode + docker and some max # of instances to be spinning up multiple tasks instead?
my thinking was to avoid some of the docker overhead. but I did try this approach previously and found that the container limit wasn't exactly respected.
For digitalocean:
host: "(region).digitaloceanspaces.com:443"
bucket: "(bucket name)"
key: "(key)"
secret: "(secret)"
multipart: false
secure: true
(verify commented out entirely)
So for you - make sure to add your creds that have the right scope (r/w), and try specifying the bucket.
Then in clearml tasks themselves you tell the task using output_uri="s3://(region).digitaloceanspaces.com:443/clearml/"
(I import this as a constant from a _constants.py file...
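On the task side it looks roughly like this (project/task names are placeholders; the host/key/secret above live in clearml.conf under sdk.aws.s3.credentials):
```python
from clearml import Task

# Placeholder region/bucket; this just tells the task where to upload
# artifacts and models.
OUTPUT_URI = "s3://(region).digitaloceanspaces.com:443/clearml/"

task = Task.init(
    project_name="demo",              # hypothetical project name
    task_name="spaces-upload-check",  # hypothetical task name
    output_uri=OUTPUT_URI,
)
```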
that same pipeline with just 1 date input.
i have the flexibility from the UI to either run a single, a dozen, or a hundred experiments... in parallel.
pipelines are amazing 😃
basically the git hash of the executed experiment + a hash on the inputs to the task.
starting to. thanks for your explanation.
would those containers best be started from something in services mode? or is it possible to get no-overhead with my approach of worker-inside-docker?
I designed my tasks as different functions, based mostly on what metrics to report and artifacts that are best cached (and how to best leverage comparisons of tasks). they do require CPU, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how this go...
but pretty reliably some proportion of tasks still just take a much longer time. 1m - 10m is a variance i'd really like to understand.
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameters connect and task create take seconds, and other times it takes 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
my operating theory is that some sort of backoff / timeout (eg 10s) is causing the hig...
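fwiw, this is roughly how I'm timing it (a minimal sketch; the timed helper and the project/task names are just placeholders):
```python
import time
from clearml import Task

def timed(label, fn, *args, **kwargs):
    # Print how long a single ClearML call takes.
    start = time.monotonic()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.monotonic() - start:.1f}s")
    return result

params = {"date": "2024-01-01"}  # placeholder parameters

task = timed("Task.init", Task.init, project_name="demo", task_name="timing-check")
timed("task.connect", task.connect, params)
timed("task.close", task.close)
```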
mind-blowing... but somehow just later in the same day I got the same pipeline to create its DAG and start running in under a minute.
I don't know what exactly I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.
so there's hope. I'll keep stress-testing it and see what causes differences. I was right to suspect that such a simple DAG should not take...
clearml-server-1.15.1, clearml-1.16.2
thanks for the clarification. is there any bypass? (a git diff + git rev-parse should take mere milliseconds)
I'm working out of a monorepo, and am beginning to suspect it's a cause of slowness. next week I'll try moving a pipeline over to a new repo to test if this theory holds any water.
and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or inputs to a task)
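for example, something like this with the decorator-style pipeline (a sketch; the names, date, and seed default are placeholders, the point is just that the seed is an explicit, hashable input):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(cache=True, return_values=["stats"])
def summarize(date: str, seed: int = 42):
    # Seed is an explicit input, so two runs with the same (date, seed)
    # produce the same cache key and the cached result is reused.
    import random
    random.seed(seed)
    return {"date": date, "sample": random.random()}

@PipelineDecorator.pipeline(name="demo-pipeline", project="demo", version="0.1")
def run(date: str = "2024-01-01"):
    return summarize(date=date)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # swap for set_default_execution_queue(...) to run remotely
    run()
```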
what if the preexisting venv is just the system python? my base image is python:3.10.10 and I just pip install all requirements in that image. Does that not avoid venv still?
it's good to know that in theory there's a path forward with almost zero overhead. that's what I want.
is it reasonable to expect that with sufficient workers, I can get 50 tasks to run in the same time it takes to run a single one? I can't imagine the apiserver being a noticeable bottleneck.
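for stress-testing that, something like this would spawn 50 copies for the comparison (a sketch; the template task id and queue name are placeholders):
```python
from clearml import Task

TEMPLATE_TASK_ID = "<template-task-id>"  # hypothetical template task
N_COPIES = 50

template = Task.get_task(task_id=TEMPLATE_TASK_ID)
for i in range(N_COPIES):
    # Clone the template and enqueue each copy so the available workers pick them up.
    clone = Task.clone(source_task=template, name=f"{template.name} [stress {i}]")
    Task.enqueue(clone, queue_name="default")
```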
ugh. again. it launched all these tasks and then just died. logs go silent.
a minute of silence between the first two msgs, then two more mins until a flood of logs. Basically 3 mins total before this task (which does almost nothing - just using it for testing) starts.
would it be on the pipeline task itself then, since that's what's disappearing?
I will do some experiment comparisons and see if there are package diffs. thanks for the tip.
took me a while to deliver enough functionality to my team to justify working on open source... but I finally got back around to investigating this to write a proper issue, and ended up figuring it out myself and opening a PR instead:
None
I ran into this recently.
it's a small thing but double-check the port. should be 443, not 433 as in the docs (typo?) - seems you got this in the screenshot.
no region should be set.
I don't use Backblaze but if it helps I can show my DigitalOcean Spaces config. should be comparable.
you can control how much memory elastic has via the compose stack, but in my experience I've been able to run on a 4-core / 16 GB RAM machine only up to a certain point. for things to feel snappy you really need a lot of memory available once you approach navigating over 100k tasks.
so far, under 500k tasks with 16 GB of RAM dedicated solely to elastic has been stable for us. concurrent execution of more than a couple hundred workers can bring the UI to its knees until complete, so arguably we...
oooh thank you, i was hoping for some sort of debugging tips like that. will do.
from a speed-of-clearing-a-queue perspective, is a services-mode queue better or worse than having many workers "always up"?