i just ran a pipeline that took about 2h (more than half this time was just the DAG), with about a hundred tasks. i'm taking a look at them now to see what the logs show for runtimes.
I'm just working on speeding up the time from "queue experiment" to "my code actually runs remotely" - as of yesterday things would sit for many minutes at a time. trying to see if venv is the culprit.
yeah, still noticing that it can be multiple minutes before something starts...
like... what is happening in this time (besides a git clone), now that I set both
update: it's now been six mins and the task still isn't done. this should have run through in like a minute total end-to-end
thank you!
i'll take that design into consideration.
re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode" im still not quite sure I understand correctly - since the agent is running in a container, as far as it is concerned it may as well be on bare-metal.
is it just that there's no way for that worker to avoid venv? (i.e. the only way to bypass venv is to use docker-mode?)
im not running in docker mode though - im running a clearml worker in a docker container (and then multiplying the container)
what if the preexisting venv is just the system python? my base image is python:3.10.10 and i just pip install all requirements in that image. Does that not avoid venv still?
it's good to know that in theory there's a path forward with almost zero overhead. that's what I want.
is it reasonable to expect that with sufficient workers, I can get 50 tasks to run in the same time it takes to run a single one? i cant imagine the apiserver being a noticeable bottleneck.
of what task? i'm running lots of them and benchmarking execution times. would you like to see a best case or worst case scenario? (ive kept some experiments for each).
and yeah, in those docs you just linked, "boolean" vars like CLEARML_AGENT_GIT_CLONE_VERBOSE explicitly say true so I ended up trying that pattern. but originally i did try 1. let me go back to that now. thank you.
overall I've seen some improvements in execution time using the suggestions in this thread (tysm!) - th...
is there a way for me to toggle CLEARML's log level? I'm doing some manual task-debugging in ipython and think it would be helpful to see network requests and timeouts if they're occurring.
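One way to get more detail while debugging in ipython is plain Python logging. The snippet below is a minimal sketch, not a ClearML feature: it assumes the SDK's HTTP traffic goes through urllib3 and that its own loggers live under the name "clearml", neither of which is confirmed in this thread. The helper name enable_debug_logging is made up for illustration.

```python
import logging


def enable_debug_logging(names=("clearml", "urllib3")):
    """Set the named loggers to DEBUG and make sure root has a handler."""
    root = logging.getLogger()
    if not root.handlers:
        # Attach a stream handler with timestamps so gaps between requests show up.
        logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    loggers = [logging.getLogger(name) for name in names]
    for logger in loggers:
        # Logger names are assumptions; adjust to whatever actually emits records.
        logger.setLevel(logging.DEBUG)
    return loggers


loggers = enable_debug_logging()
```

If the assumption about urllib3 holds, connection and request lines should then appear with timestamps, which makes long silent stretches easier to spot.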
i was having a ton of git clone issues - disabled caching entirely... wonder if that may help too.
tysm for your help! will report back soon.
i really dont see how this provides any additional context that the timestamps + crops dont but okay.
but maybe here's a clue. after hanging like that for a while... it seems like the agent restarts (the container it runs in does not)
ah I see. thank you very much!
trying export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python) but I still see "Environment setup completed successfully" (it is printed after "Running task id")
it still takes a full 3 minutes between "task pulled by worker" until "Running task id". is this normal? What is happening in these few minutes (besides a git pull / switch)?
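For anyone launching the worker from a Python entrypoint rather than a shell, the same export can be sketched as below. This is an illustration under assumptions: the helper name agent_env is hypothetical, the queue name "default" is the one mentioned in this thread, and the launch itself is left commented out. Whether the variable removes the whole delay is exactly the open question here.

```python
import os
import sys


def agent_env(python_bin=sys.executable, base_env=None):
    """Return a copy of the environment with the venv-skip variable set."""
    env = dict(os.environ if base_env is None else base_env)
    # Intended to point the agent at an existing interpreter instead of a fresh venv.
    env["CLEARML_AGENT_SKIP_PIP_VENV_INSTALL"] = python_bin
    return env


env = agent_env()
# Hypothetical launch command; uncomment the run call to actually start a worker.
cmd = ["clearml-agent", "daemon", "--queue", "default", "--foreground"]
# import subprocess; subprocess.run(cmd, env=env, check=True)
```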
default queue is served with (containerized + custom entrypoint) venv workers (agent services just wasn't working great for me, gave up)
clearml-server-1.15.1, clearml-1.16.2
hoping this really is a 1.16.2 issue. fingers crossed. at this point more pipes are failing than not.
here's how I'm establishing worker-server (and client-server) comms fwiw
when i run the pipe locally, im using the same script as the workers are in order to poll the apiserver via the ssh tunnel.
so far it seems that turning off cache like this is my "best option"
is it? I can't tell if these delays (DAG-computation) are pipeline-specific (i get that pipeline is just a type of task), but it felt like a different question as I'm asking "are pipelines like this appropriate?"
is there something fundamentally slower about using pipe.start() at the end of a pipeline vs pipe.run_locally()?
perfect. thank you. I verified that this was indeed reproducible on 1.16.0 with a fresh deployment.
minute of silence between first two msgs and then two more mins until a flood of logs. Basically 3 mins total before this task (which does almost nothing - just using it for testing) starts.
sometimes I get "lucky" and see something more like what I expect... total experiment time < 1 min (and I have evidence of this happening. logs start-to-finish in sub-minute). But then other times the same task will take 5-10 minutes.
same worker, same queue, just one worker serving it... I am so utterly perplexed by the variation in how long things take. my clearml API server is running on a beefy 32 core machine and not much else is happening right now...
hello @SuccessfulKoala55
I appreciate your help. Thank you. Do you happen to have any updates? We had another restart and lost the creds again. So our deployment is in a brittle state on this latest upgrade, and I'm going back to 1.15.1 until I hear back.
yeah... still seeing variances from 1m to 10m for the same task. been testing parallel execution for hours.
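A quick way to put numbers on that variance is to collect end-to-end durations per run and summarize them. This is a minimal sketch: the sample values are hypothetical placeholders within the 1m to 10m range described here, not measurements from this deployment, and summarize is an illustrative helper name.

```python
import statistics


def summarize(durations_s):
    """Return count, min, median, max and max/min ratio for run durations in seconds."""
    ordered = sorted(durations_s)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "spread": ordered[-1] / ordered[0],
    }


# Hypothetical durations in seconds for the same task on the same worker.
samples = [60, 75, 180, 420, 600]
summary = summarize(samples)
```

Tracking the spread per worker and per queue makes it easier to tell whether the slow runs cluster around particular conditions.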
are you on clearml agent 1.8.0?
(im noticing sometimes im just missing logs such as "Running task id.." entirely)
worker thinks its in venv mode but is containerized.
apiserver is docker compose stack
ill check logs next time i see it.
currently rushing to ship a model out, so I've just been running smaller experiments slowly hoping to avoid the situation. fingers crossed.
I do this a lot. pipeline params spawn K number of nodes, that collect just like you drew. No decorator being used here, just referencing tasks by id or name/project. I do not use continue on fail at all.
I do this with functions that have the contract f(pipe: PipelineController, **kwargs) -> PipelineController and a for-loop.
just be aware DAG creation slows down pretty quickly after a dozen or so such loops.
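The contract-plus-loop pattern described above can be sketched roughly like this. To keep it runnable without a server, it uses StubPipeline, a hypothetical stand-in that only records add_step calls; with the real PipelineController the builder would call its add_step with the same keyword names. The step names, fan_out, and the "demo"/"train" project and task names are all made up for illustration.

```python
from typing import Any, List, Optional


class StubPipeline:
    """Hypothetical stand-in for PipelineController; it only records steps."""

    def __init__(self) -> None:
        self.steps: List[dict] = []

    def add_step(self, name: str, parents: Optional[List[str]] = None, **kwargs: Any) -> bool:
        self.steps.append({"name": name, "parents": list(parents or []), **kwargs})
        return True


def fan_out(pipe, k: int = 3, **kwargs):
    """Add k parallel nodes plus one node that collects them, then return the pipe."""
    names = [f"worker_{i}" for i in range(k)]
    for name in names:
        pipe.add_step(name=name, parents=[], **kwargs)
    # The collecting node lists every parallel node as a parent.
    pipe.add_step(name="collect", parents=names, **kwargs)
    return pipe


pipe = StubPipeline()
# Each builder takes the pipe and returns it, so builders chain in a plain loop.
for builder in (fan_out,):
    pipe = builder(pipe, k=3, base_task_project="demo", base_task_name="train")
```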
All the images below were made with the same pipeline (just evolved some n...