definitely seeing some that took 7-8 mins whereas others took 2-3...
I was having a ton of git clone issues, so I disabled caching entirely... wonder if that may help too.
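(for context, in case it helps anyone: I believe the clone cache is the vcs_cache section in the agent's clearml.conf - going from memory here, so treat this as a sketch:)
agent {
    # assumption: this is the cache I disabled; the agent keeps a local
    # clone per repo and fetches into it instead of re-cloning every task
    vcs_cache: {
        enabled: false
    }
}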
tysm for your help! will report back soon.
thank you!
i'll take that design into consideration.
re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode" I'm still not quite sure I understand correctly - since the agent is running in a container, as far as it is concerned it may as well be on bare-metal.
is it just that there's no way for that worker to avoid venv? (i.e. the only way to bypass venv is to use docker-mode?)
oh yes. From "Using env" until the next message is 2 minutes.
is there a way for me to toggle CLEARML's log level? I'm doing some manual task-debugging in ipython and think it would be helpful to see network requests and timeouts if they're occurring.
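(answering my own question in case anyone else needs it: what seemed to work for me - hedged, I'm not certain this is the official knob - was bumping the SDK log level via an env var before launching ipython:)
# assumption: CLEARML_LOG_LEVEL is read by the clearml SDK at import time,
# so set it in the shell before starting the ipython session
export CLEARML_LOG_LEVEL=DEBUG
ipython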
are you on clearml agent 1.8.0?
(I'm noticing sometimes I'm just missing logs such as "Running task id.." entirely)
I'm just working on speeding up the time from "queue experiment" to "my code actually runs remotely" - as of yesterday things would sit for many minutes at a time. Trying to see if venv is the culprit.
of what task? I'm running lots of them and benchmarking execution times. Would you like to see a best case or worst case scenario? (I've kept some experiments for each.)
and yeah, in those docs you just linked, "boolean" vars like CLEARML_AGENT_GIT_CLONE_VERBOSE explicitly say true
so I ended up trying that pattern. But originally I did try 1. Let me go back to that now. Thank you.
overall I've seen some improvements in execution time using the suggestions in this thread (tysm!) - the preinstalled libs seem to be helping, though some things are still just unbearably slow (one of my larger pipelines took > 1 h to generate a DAG before even starting...).
from the logs, it feels like after git clone, it spends minutes without outputting anything. AgitatedDove14 do you know what the agent is supposed to do after git clone?
I guess a check that all packages are installed? But then with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1, what is the agent doing??
ha! yup. that was it exactly. I posted about it too None lol
I just ran a pipeline that took about 2h (more than half this time was just the DAG), with about a hundred tasks. I'm taking a look at them now to see what the logs show for runtimes.
you should be able to see in the Console tab what is happening
ah I see. thank you very much!
trying export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
but I still see "Environment setup completed successfully" (it is printed after "Running task id")
it still takes a full 3 minutes from "task pulled by worker" until "Running task id"
is this normal? What is happening in these few minutes (besides a git pull / switch)?
yeah, still noticing that it can be multiple minutes before something starts...
like... what is happening in this time (besides a git clone), now that I set both
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
update: it's now been six mins and the task still isn't done. this should have run through in like a minute total end-to-end
yeah... still seeing variances from 1m to 10m for the same task. been testing parallel execution for hours.
normally when a new package needs to be installed, it shows up in the Console tab
I really don't see how this provides any additional context that the timestamps + crops don't, but okay.
SmallTurkey79 could you attach the full log of the Task?
also I would recommend "export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1" (not true)
Usually binary env vars are 0/1
(I can see that the docs here: None never mention it, I'll ask them to add that)
I know that git clone and pip verifying everything is installed are normal. But for some reason in Michael's screenshot, I don't see those steps...
I can see all the steps like git clone,
git clone has nothing to do with "env setup"; this is bringing in the code, so you cannot skip that one. That said, this is why the git repo itself is cached on the host machine, so it is fast
... There may be some odd packages that need to be installed because one of our DS is experimenting... but through all that, we can see what is happening.
even if everything is preinstalled, it verifies the packages match, and this might take a long time. It's just pip being pip (if you want the extreme case, try to do the same with conda; that one is even slower)
the output of that verification stage is that no new packages are installed (otherwise, good thing we checked 🙂)
bottom line, if you want to skip the pip verification/installation, pass CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
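e.g. something like this when spinning up the worker (a sketch only; "default" is a placeholder queue name):
# skip the pip verify/install pass entirely; the existing interpreter is used as-is
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
clearml-agent daemon --queue default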
btw: I'm checking regarding the GH issue
We need to focus first on why it is taking minutes to reach "Using env".
In our case, we have a container that has all packages installed straight in the system, no venv in the container. Thus we don't use CLEARML_AGENT_SKIP_PIP_VENV_INSTALL
But then when a task is pulled, I can see all the steps like git clone, a bunch of "Requirement already satisfied"... There may be some odd packages that need to be installed because one of our DS is experimenting... but through all that, we can see what is happening.
In SmallTurkey79's case, are you saying the log doesn't show anything at all? After it pulls the task: 5 minutes pass with no explanation of what those 5 min were used for?
- try with the latest RC, 1.8.1rc2
it feels like after git clone, it spends minutes without outputting anything
yeah that is odd, can you run the agent with --debug (add it before the daemon command), and then at the end of the command add --foreground
Now launch the same task on that queue, you will have a verbose log in the console.
Let us know what you see
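i.e. roughly (a sketch; "default" is a placeholder queue name):
# --debug is a global flag, so it goes before the daemon sub-command;
# --foreground keeps the agent attached so the verbose log prints to your console
clearml-agent --debug daemon --queue default --foreground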
in my case using self-hosted and agent inside a docker container:
47:45: task foo pulled
[git clone, pip install, check that all requirements are satisfied, and nothing is downloaded]
48:16: start training
"regular" worker will run one job at a time, services worker will spin multiple tasks at the same time But their setup (i.e. before running the actual task) is one at a time..
BTW: you can also just add -e "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1" to the docker args (under the Execution tab) to override the setting of the docker.
you can also add "export;" to the docker startup bash script section (do not add "#!/bin/bash", just the actual script) to get a list of all the environment variables inside the docker, just in case
starting to understand. thanks for your explanation.
would those containers best be started from something in services mode? or is it possible to get no-overhead with my approach of worker-inside-docker?
I designed my tasks as different functions, based mostly on what metrics to report and which artifacts are best cached (and how to best leverage comparisons of tasks). They do require CPU, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how this goes instead. I have to be more selective in the reporting of metrics and plots though.
clearml==1.12.2
clearml_agent v1.8.1rc2
of what task? I'm running lots of them and benchmarking
If you are skipping every installation it should be the same, because if you set CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 it will not install anything at all
This is why it's odd to me...
wdyt?
Hi guys, just curious here, what was the final issue?
Also out of curiosity, what does that mean: "1.12.2 because some bug that make fastai lag 2x"?