i was having a ton of git clone issues - disabled caching entirely... wonder if that may help too.
tysm for your help! will report back soon.
A minute of silence between the first two msgs, then two more mins until a flood of logs. Basically 3 mins total before this task (which does almost nothing - I'm just using it for testing) starts.
fwiw - i'm starting to wonder if there's a difference between me "resetting the task" vs cloning it.
clearml-server-1.15.1, clearml-1.16.2
mind-blowing... but somehow just later in the same day I got the same pipeline to create its DAG and start running in under a minute.
I don't know what exactly I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.
so there's hope. i'll keep stress-testing it and see what causes differences. I was right to suspect that such a simple DAG should not take...
so far it seems that turning off cache like this is my "best option"
oooh thank you, i was hoping for some sort of debugging tips like that. will do.
from a speed-of-clearing-a-queue perspective, is a services-mode queue better or worse than having many workers "always up"?
I'm just working on speeding up the time from "queue experiment" to "my code actually runs remotely" - as of yesterday, things would sit for many minutes at a time. Trying to see if the venv setup is the culprit.
thank you!
i'll take that design into consideration.
re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode" - I'm still not quite sure I understand correctly: since the agent is running in a container, as far as it is concerned it may as well be on bare metal.
is it just that there's no way for that worker to avoid venv? (i.e. the only way to bypass venv is to use docker-mode?)
I can see agent.vcs_cache.enabled = true as a printout in the Console, but cannot find docs on how to set this via environment variable, since I'm trying to keep these containers from needing a clearml.conf file (though I can generate one in the entrypoint script if need be with a <<EOF heredoc).
# imports
from clearml.automation import PipelineController
...

if __name__ == "__main__":
    pipe = PipelineController(...)
    # after instantiation, before "the code" that creates the pipeline.
    # normal tasks can handle task.execute_remotely() at this stage...
    pipe = add_steps_to_pipe(pipe)
    ...
    # after the pipeline is defined. best I can tell, *has* to be last thing in code.
    pipe.start_locally()  # or just .start()
is there a way for me to toggle CLEARML's log level? I'm doing some manual task-debugging in ipython and think it would be helpful to see network requests and timeouts if they're occurring.
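one thing that might help in the meantime - a rough sketch, assuming the SDK and its HTTP stack log through standard Python logging (the logger names below are guesses on my part):

import logging

# bump everything to DEBUG so retries / timeouts show up in the console
logging.basicConfig(level=logging.DEBUG)

# or narrow it down if the root logger gets too noisy (names are assumptions)
for name in ("clearml", "urllib3"):
    logging.getLogger(name).setLevel(logging.DEBUG)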
I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.
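roughly how I picture that in code - the task id, names, and queue below are just placeholders:

from clearml import Task

# the draft task is the "class definition"
base = Task.get_task(task_id="<draft_task_id>")           # placeholder id
# each run/step is a clone of it - a new "object" created in draft state
step = Task.clone(source_task=base, name="step-run-001")  # placeholder name
Task.enqueue(step, queue_name="default")                   # hand the instance to a worker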
oh it's there, before running task.
from task pick-up to "git clone" is now ~30s, much better.
though as far as I understand, the recommendation is still to not run workers-in-docker like this:
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
(and fwiw I have this in my entrypoint.sh)
cat <<EOF > ~/clearml.conf
agent {
    vcs_cache {
        enabled: true
    }
    package_manager: {
        type: pip,
        ...
    }
}
EOF
yes, i actually have been able to turn on caching after rc2 of the agent! it's been working much better.
It sounds like you understand the limitations correctly.
As far as I know, it'd be up to you to write your own code that computes the delta between old and new and only re-process the new entries.
The API would let you search through prior experimental results.
so you could load up the prior task, check the ids that showed up in output (maybe you save these as a separate artifact for faster load times), and only evaluate the new inputs. perhaps you copy over the old outputs to the new task...
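something along these lines, as a sketch - the task id, the artifact name, and evaluate() are all placeholders for whatever your pipeline actually produces:

from clearml import Task

# load the previous run and the ids it already processed
# (assumes they were saved as an artifact, e.g. "processed_ids")
prev = Task.get_task(task_id="<previous_task_id>")  # placeholder id
old_ids = set(prev.artifacts["processed_ids"].get())

# only evaluate inputs that have not been seen before
current_inputs = [...]  # however you enumerate this run's inputs
new_inputs = [i for i in current_inputs if i not in old_ids]
results = {i: evaluate(i) for i in new_inputs}  # evaluate() = your own processing step

# save the union so the next run can diff against it
task = Task.current_task()
task.upload_artifact("processed_ids", list(old_ids | set(new_inputs)))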
i understood that part, but noticed that when putting in the code to start remotely, the consequence seems to be that the DAG computation happens twice - once on my machine as it runs, and then again remotely (this is at least part of why it's slower). if i put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
this is unlike tasks, which somehow are smart enough to publish in draft form when task.execute_remotely is up top.
do i just leave off pipe.start?
when I do a docker compose down; docker compose up -d
... these disappear.
to be clear... this was not happening before I upgraded to the latest version. That is why I am asking about this.
odd bc I thought I was controlling this... maybe I'm wrong and the env is mis-set.
when i run the pipe locally, I'm using the same connect.sh script as the workers, in order to poll the apiserver via the ssh tunnel.
yeah, locally it did run. I then ran another via the UI, spawned from the successful one; it showed cached steps and then refused to run the bottom one, disappearing again. No status message, no status reason. (not running... actually dead)
default queue is served with (containerized + custom entrypoint) venv workers (agent services just wasn't working great for me, gave up)
and for what it's worth, it seems I don't have anything special for agent cloning
i did find agent.vcs_cache.clone_on_pull_fail to be helpful. but yeah, updating the agent was the biggest fix
hoping this really is a 1.16.2 issue. fingers crossed. at this point more pipes are failing than not.
(the "magic" of the env detection is nice but man... it has its surprises)
that's the final screenshot. it just shows a bunch of normal "launching ..." steps, and then stops all of a sudden.
I would assume a lot of them are log streaming? So you could try reducing printouts / progress bars; that seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
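one concrete thing that has helped me is gating bars / prints on whether there is an actual terminal - a rough sketch, with tqdm just as an example of a progress bar:

import sys
from tqdm import tqdm

# every console line becomes a log event shipped to the server, and progress
# bars redraw constantly, so only show them when attached to a real terminal
interactive = sys.stdout.isatty()
for i in tqdm(range(100_000), disable=not interactive):
    if not interactive and i % 10_000 == 0:
        print(f"processed {i} items")  # a handful of lines instead of thousands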