i would love some advice on that though - should I be using services mode + docker and some max # of instances to be spinning up multiple tasks instead?
my thinking was to avoid some of the docker overhead. but i did try this approach previously and found that the container limit wasn't exactly respected.
im not running in docker mode though
hmmm that might be the first issue. it cannot skip venv creation, it can however use a pre-existing venv (but it will change it every time it installs a missing package)
so setting CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 in non docker mode has no affect
im not running in docker mode though - im running a clearml worker in a docker container (and then multiplying the container)
Please refer to here None
The doc need to be a bit clearer: one require a path and not just true/false
what if the preexisting venv is just the system python ? my base image is python:3.10.10 and i just pip install all requirements in that image . Does that not avoid venv still?
it's good to know that in theory there's a path forward with almost zero overhead . that's what I want .
is it reasonable to expect that with sufficient workers, I can get 50 tasks to run in the same time it takes to run a single one? i cant imagine the apiserver being a noticeable bottleneck .
We need to focus first on Why is it taking minutes to reach Using env.
In our case, we have a container that have all packages installed straight in the system, no venv in the container. Thus we don't use CLEARML_AGENT_SKIP_PIP_VENV_INSTALL
But then when a task is pulled, I can see all the steps like git clone, a bunch of Requirement already satisfied
.... There may be some odd package that need to be installed because one of our DS is experimenting ... But all that we can see what is happening.
In @<1689446563463565312:profile|SmallTurkey79> case, are you saying the log don't show anything at all ? After it pull the task: 5 minutes pass and no explanation of what those 5min been used for ?
of what task? i'm running lots of them and benchmarking execution times. would you like to see a best case or worst case scenario? (ive kept some experiments for each).
and yeah, in those docs you just linked, "boolean" vars like CLEARML_AGENT_GIT_CLONE_VERBOSE
explicitly say true
so I ended up trying that pattern. but originally i did try 1. let me go back to that now. thank you.
overall I've seen some improvements in execution time using the suggestions in this thread (tysm!) - the preinstalled libs seem to be helping, though some things are still just unbearably slow (one of my larger pipelines took > 1 h to generate a DAG before even starting...).
is there a way for me to toggle CLEARML's log level? I'm doing some manual task-debugging in ipython and think it would be helpful to see network requests and timeouts if they're occurring.
i was having a ton of git clone issues - disabled caching entirely... wonder if that may help too.
tysm for your help! will report back soon.
@<1523701205467926528:profile|AgitatedDove14> About why we stay on 1.12.2 : None
i really dont see how this provides any additional context that the timestamps + crops dont but okay.
- try with the latest RC
1.8.1rc2
, it feels like after git clone, it spend minutes without outputting anything
yeah that is odd , can you run the agent with --debug (add before the daemon
command) , and then at the end of the command add --foreground
Now launch the same task on that queue, you will have a verbose log in the console.
Let us know what you see
okay that's a similar setup to mine... that's interesting.
much more in line with my expectation.
I think a proper screenshot of the full log with some information redacted is the way to go. Otherwise we are just guessing in the dark
I can see all the steps like git clone,
git clone has nothing to do with "env setup" this is brining the code, you cannot skip that one, that said, this is why the git itself is cached on the host machine, so it is fast
... There may be some odd package that need to be installed because one of our DS is experimenting ... But all that we can see what is happening.
even if everything is preinstalled, it Verifies the packages match, this might take a long time. It's just pip being pip (if you want the extreme try to do the same with conda, that one is even slower)
the output of that verification stage is no new packages are installed (otherwise good thing we checked 🙂 )
bottom line, if you want to skip the pip verification/installation pass CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
btw: i'm checking regrading the GH issue
the timestamps were all that mattered in those.
So "Using env ..." take minutes without any output ?
ah I see. thank you very much!
trying export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
but I still see Environment setup completed successfully
(it is printed after Running task id
)
it still takes a full 3 minutes between task pulled by worker until Running task id
is this normal? What is happening in these few minutes (besides a git pull / switch)?
oh yes. Using env
until the next message is 2 minutes.
Hi Guys, just curious here, what's was the final issue?
Also out of curiosity, what does that mean? "1.12.2 because some bug that make fastai lag 2x" ?
in my case using self-hosted and agent inside a docker container:
47:45 : taks foo pulled
[ git clone, pip install, check that all requirements satisfied, and nothing is downloaded]
48:16 : start training
apologies - just trying to keep sensitive data out of screenshot
normally when new package need to be install, it shows up in the Console tab
are you on clearml agent 1.8.0?
(im noticing sometimes im just missing logs such as "Running task id.." entirely)