
yup, but you can modify them after task creation in the UI (if it's in draft state)
it's upon runtime instantiation of the PipelineController class.
the pipeline is there to orchestrate tasks into more complex functionality, and to take advantage of caching, yes.
here I run backtesting (how well did I predict the future), and can control the frequency: "every week", "every month", etc.
so if I increase the frequency, certain branches of the pipeline are cached and don't need to be rerun. another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.
the pictured pipeline is: "create data...
basically the git hash of the executed experiment + a hash on the inputs to the task.
and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or inputs to a task)
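A minimal sketch of what that looks like in practice (the project, task names, and the seed parameter below are made up for illustration): with cache_executed_step=True a step is reused when its code hash and resolved inputs match a previous run, so anything that should invalidate the cache, like a seed, has to be passed in as an input.
from clearml import PipelineController

pipe = PipelineController(name="backtest-pipeline", project="examples", version="1.0.0")
# the seed is an explicit pipeline parameter, so it becomes part of the input hash
pipe.add_parameter(name="seed", default=42)
pipe.add_step(
    name="create_data",
    base_task_project="examples",        # hypothetical project/task names
    base_task_name="create data",
    parameter_override={"General/seed": "${pipeline.seed}"},
    cache_executed_step=True,            # same code hash + same inputs => reuse previous output
)
pipe.start_locally()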
perfect. thank you. I verified that this was indeed reproducible on 1.16.0 with a fresh deployment.
yup! that's the timing I was hoping you could help me change. Is there an option I can override to make the retries more aggressive?
I've definitely narrowed it down to the reverse proxy I'm behind. when I switch to a cloudflare tunnel, the overhead of the network is <1s compared to localhost, everything feels snappy!
But for security reasons, I need to keep using the reverse proxy, hence my question about configuring the silent clearml retries.
I can confirm that simply switching back to 1.15.1 results in persistent "App Credentials" across restarts.
Literally just did :%s/1.16.0/1.15.1/g, restarted the stack under the older version, created creds, and restarted again... and found them sitting there. So I know my volume mounts and all are good. It's something about the upgrade that caused this.
There's an issue on github that seems to be related, but the discussion under it seems to have digressed. Should I open a new is...
yeah. that's how I've been generating credentials for agents as well as for my dev environment.
thank you very much.
for remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
everything else is 200 except these two
if there's a process I'm not understanding, please clarify...
but
(a) I start up the compose stack and log in via web browser as a user. this is on a remote server.
(b) I go to settings and generate a credential.
(c) I use that credential to set up my local dev env, editing my clearml.conf.
(d) I repeat (b) and use that credential to start up remote workers to serve queues.
am I misunderstanding something? if there's another way to generate credentials, I'm not familiar with it.
I do this a lot. pipeline params spawn K nodes that collect just like you drew. No decorator being used here, just referencing tasks by id or name/project. I do not use continue on fail at all.
I do this with functions that have the contract f(pipe: PipelineController, **kwargs) -> PipelineController and a for-loop.
just be aware DAG creation slows down pretty quickly after a dozen or so such loops.
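Roughly, the pattern being described looks like this (step counts, task names, and parameter keys are placeholders): a helper with the f(pipe, **kwargs) -> pipe contract fans out K nodes in a for-loop and adds a collector step that depends on all of them.
from clearml import PipelineController

def add_steps_to_pipe(pipe: PipelineController, n_branches: int = 12, **kwargs) -> PipelineController:
    # fan out K parallel nodes, each a clone of the same base task with a different parameter
    branch_names = []
    for i in range(n_branches):
        name = f"branch_{i}"
        pipe.add_step(
            name=name,
            base_task_project="examples",           # hypothetical base task
            base_task_name="branch task",
            parameter_override={"General/branch_id": i},
            cache_executed_step=True,
        )
        branch_names.append(name)
    # collector node waits on every branch before running
    pipe.add_step(
        name="collect",
        base_task_project="examples",
        base_task_name="collect task",
        parents=branch_names,
    )
    return pipe

if __name__ == "__main__":
    pipe = PipelineController(name="fan-out-example", project="examples", version="1.0.0")
    pipe = add_steps_to_pipe(pipe, n_branches=12)
    pipe.start_locally()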
All the images below were made with the same pipeline (just evolved some n...
I would assume a lot of them are log streaming? So you can try reducing printouts / progress bars. That seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
# imports
from clearml import PipelineController
...

if __name__ == "__main__":
    pipe = PipelineController(...)
    # after instantiation, before "the code" that creates the pipeline.
    # normal tasks can handle task.execute_remotely() at this stage...
    pipe = add_steps_to_pipe(pipe)
    ...
    # after the pipeline is defined. best I can tell, *has* to be the last thing in the code.
    pipe.start_locally()  # or just .start()
is it? I can't tell if these delays (DAG computation) are pipeline-specific (I get that a pipeline is just a type of task), but it felt like a different question, as I'm asking "are pipelines like this appropriate?"
is there something fundamentally slower about using pipe.start() at the end of a pipeline vs pipe.start_locally()?
i understood that part, but noticed that when I put in the code to start remotely, the DAG computation seems to happen twice - once on my machine as it runs, and then again remotely (this is at least part of why it's slower). if I put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
this is unlike tasks, which somehow are smart enough to publish in draft form when task.execute_remotely is up top.
do I just leave off pipe.start?
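For contrast, a minimal sketch of the plain-task behavior described above (project, task, and queue names are placeholders): execute_remotely() near the top serializes the task as a draft, enqueues it, and by default exits the local process, which is why it can sit "up top", whereas a pipeline's DAG has to be fully defined before start()/start_locally() is called.
from clearml import Task

task = Task.init(project_name="examples", task_name="plain task")
params = task.connect({"seed": 42})  # connected params are editable in the draft

# enqueue as a draft and (by default) exit the local process;
# everything below runs on the agent that picks the task up
task.execute_remotely(queue_name="default")

# ... the actual work, executed remotely ...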
mind-blowing... but somehow just later in the same day I got the same pipeline to create its DAG and start running in under a minute.
I don't know what exactly I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.
so there's hope. I'll keep stress-testing it and see what causes differences. I was right to suspect that such a simple DAG should not take...
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameters connect and tasks create in seconds; other times it takes 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
my operating theory is that some sort of backoff / timeout (e.g. 10s) is causing the hig...
starting to. thanks for your explanation.
would those containers best be started from something in services mode? or is it possible to get no-overhead with my approach of worker-inside-docker?
I designed my tasks as different functions, based mostly on what metrics to report and which artifacts are best cached (and how to best leverage comparisons of tasks). they do require CPU, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how this go...
for now I'm just avoiding restarts of the service, but I do want to get to the bottom of it using a fresh instance.
as a backup plan: is there a way to have an API key set up prior to running docker compose up? Like, I need at least one set of credentials that I can reliably have remote agents use, one that I know persists across restarts and upgrades.
I can see agent.vcs_cache.enabled = true as a printout in the Console, but cannot find docs on how to set this via environment variable, since I'm trying to keep these containers from needing a clearml.conf file (though I can generate one in the entrypoint script if need be with <EOF>).
update: ever since turning off git caching, I've had much more stability. I cannot tell whether it's causing a slowdown in task execution though - is the clone a shallow one by default?
and for what it's worth, it seems I don't have anything special configured for agent cloning
I did find agent.vcs_cache.clone_on_pull_fail to be helpful. but yah, updating the agent was the biggest fix
yeah, I ended up figuring it out. I think we are in similar situations (private git repo w/ token). I'll take a look at my config tomorrow, but from memory, you have to set your env variables and have an option in your config to force the https protocol if you're using a token.
yeah. it's using what you see in the UI here.
so if you made a change to a task used in a pipeline (my pipelines are from tasks, not functions... can't speak to that, but I think it just generates a hidden task under the hood), point the (draft) task to that commit (assuming it's pushed), or re-run the task. the pipeline picks up the tasks the API is aware of (by id, or by name, in which case it uses the latest updated one) under the specified project, not from local code.
that part was confusing...
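A small sketch of the two ways a step can point at its base task (the id and names here are placeholders): an explicit base_task_id pins one exact task (and therefore its recorded commit), while project + name resolves to the most recently updated task the server knows about.
from clearml import PipelineController

pipe = PipelineController(name="example-pipeline", project="examples", version="1.0.0")

# pinned: clone this exact task, whatever commit it recorded
pipe.add_step(
    name="step_pinned",
    base_task_id="aabbccddee00112233445566778899aa",  # placeholder id
)

# floating: clone the latest updated task with this name under the project
pipe.add_step(
    name="step_latest",
    base_task_project="examples",
    base_task_name="create data",
    parents=["step_pinned"],
)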
that same pipeline with just 1 date input.
i have the flexibility from the UI to either run a single, a dozen, or a hundred experiments... in parallel.
pipelines are amazing 😃
so far it seems that turning off cache like this is my "best option"
are you on clearml agent 1.8.0?
(I'm noticing that sometimes I'm just missing logs such as "Running task id.." entirely)
so, I tried this on a fresh deployment, and for some reason that stack allows me to restart without losing App Credentials.
It's just the one that I performed an update on.