Hi guys, just curious here, what was the final issue?
Also out of curiosity, what does that mean: "1.12.2 because of some bug that makes fastai lag 2x"?
"regular" worker will run one job at a time, services worker will spin multiple tasks at the same time But their setup (i.e. before running the actual task) is one at a time..
in my case using self-hosted and agent inside a docker container:
47:45 : task foo pulled
[git clone, pip install, check that all requirements are satisfied, nothing new is downloaded]
48:16 : start training
a minute of silence between the first two messages, then two more minutes until a flood of logs. Basically 3 minutes total before this task (which does almost nothing; I'm just using it for testing) starts.
sometimes I get "lucky" and see something more like what I expect: total experiment time < 1 min (and I have evidence of this happening: logs start-to-finish in under a minute). But other times the same task will take 5-10 minutes.
same worker, same queue, just one worker serving it... I am so utterly perplexed by the variation in how long things take. my clearml API server is running on a beefy 32 core machine and not much else is happening right now...
apologies - just trying to keep sensitive data out of screenshot
are you on clearml agent 1.8.0?
(I'm noticing sometimes I'm just missing logs such as "Running task id.." entirely)
oooh thank you, i was hoping for some sort of debugging tips like that. will do.
from a speed-of-clearing-a-queue perspective, is a services-mode queue better or worse than having many workers "always up"?
from task pick-up to "git clone" is now ~30s, much better.
This is "spent" calling apt update && update install && pip install clearml-agent
if you have those preinstalled it should be quick
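Baking those into the image is straightforward; a sketch, assuming a plain Python base image (the image name/tag are made up):
```
# preinstall git + clearml-agent so the container entrypoint
# doesn't have to run apt/pip on every task start
docker build -t my-preloaded-image - <<'EOF'
FROM python:3.10-slim
RUN apt-get update && apt-get install -y git \
 && pip install clearml-agent
EOF
```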
though as far as I understand, the recommendation is still to not run workers-in-docker like this:
if you do not want it to install anything and just use the existing venv (leaving the venv as is), and if something is missing then so be it, then yes, sure, that's the way to go
I think a proper screenshot of the full log with some information redacted is the way to go. Otherwise we are just guessing in the dark
i would love some advice on that though - should I be using services mode + docker with some max # of instances to spin up multiple tasks instead?
my thinking was to avoid some of the docker overhead. but i did try this approach previously and found that the container limit wasn't exactly respected.
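fwiw, if you do try services mode again, the concurrency cap is an optional argument to --services-mode; a sketch (the limit of 5 and the queue name are just examples):
```
# serve the queue with at most 5 concurrent task containers
clearml-agent daemon --queue services --services-mode 5 --docker
```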
1.12.2 because of some bug that makes fastai lag 2x
1.8.1rc2 because it fixes an annoying git clone bug
fwiw - i'm starting to wonder if there's a difference between me "resetting the task" vs cloning it.
ha! yup. that was it exactly. I posted about it too lol
BTW: you can also just add -e "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1" to the docker args (under the Execution tab) to override the setting of the docker.
you can also add "export;" to the docker startup bash script section (do not add "#!/bin/bash", just the actual script) to get a list of all the environment variables inside the docker, just in case
I know that the git clone and the pip verify-all-installed steps are normal. But for some reason, in Michael's screenshot, I don't see those steps ...
normally when a new package needs to be installed, it shows up in the Console tab
yeah... still seeing variance from 1m to 10m for the same task. been testing parallel execution for hours.
starting to make sense. thanks for your explanation.
would those containers best be started from something in services mode? or is it possible to get no-overhead with my approach of worker-inside-docker?
i designed my tasks as different functions, based mostly on what metrics to report and which artifacts are best cached (and how to best leverage comparisons of tasks). they do require cpu, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how this goes instead. i have to be more selective in the reporting of metrics and plots though.
there is almost zero overhead if your docker container already has everything (including the agent) preinstalled and you set it with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
it should then basically just run the code.
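concretely, something like this (a sketch; the image name and queue are assumptions, and clearml.conf is assumed to hold your credentials):
```
# worker-inside-docker with everything preinstalled:
# the env var tells the agent to skip venv creation / pip install entirely
docker run -d \
  -e CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 \
  -v "$HOME/clearml.conf:/root/clearml.conf" \
  my-preloaded-image \
  clearml-agent daemon --queue default
```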
i just need to understand what I should be expecting. I thought going from putting it into the queue in the UI to "running my code remotely" (esp. with packages preloaded) should be a fairly fast turnaround - certainly not three minutes... (i'll have to change my whole pipeline design if this is the case)
I can see all the steps like git clone,
git clone has nothing to do with "env setup"; this is bringing in the code, and you cannot skip that one. That said, this is why the git repo itself is cached on the host machine, so it is fast
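you can see that cache yourself; a sketch, assuming default clearml.conf settings (the location is configurable via agent.vcs_cache.path):
```
# default location of the agent's git clone cache on the host
ls ~/.clearml/vcs-cache
```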
... There may be some odd package that needs to be installed because one of our DS is experimenting ... But in all those cases we can see what is happening.
even if everything is preinstalled, it verifies the packages match, and this might take a long time. It's just pip being pip (if you want the extreme case, try the same with conda; that one is even slower)
the output of that verification stage is that no new packages are installed (otherwise, good thing we checked 🙂)
bottom line: if you want to skip the pip verification/installation, pass CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
btw: i'm checking regarding the GH issue
okay that's a similar setup to mine... that's interesting.
much more in line with my expectation.
hard to see with your crop-outs here and there ...