He's asking "what git credentials make sense to use for agents" - regardless of autoscaling or not. I had the same question earlier.
tldr: it depends on your security policies.
@EncouragingFish95 - if you have the ability to create a "service account" in your git provider, perhaps at the org-level, I would do that.
My org's cloud git provider does not enable this functionality, and so we have agreed that it is "acceptable" to have the agent's git credentials...
thank you!
out of curiosity: how come the clearml-webserver upgrades weren't included in this release? was it just to patch the api part of the codebase?
you can control how much memory elastic has via the compose stack, but in my experience I've only been able to run on a 4-core machine with 16GB of RAM up to a certain point. for things to feel snappy you really need a lot of memory available once you approach navigating over 100k tasks.
so far, under 500k tasks on 16GB of RAM dedicated solely to elastic has been stable for us. concurrent execution of more than a couple hundred workers can bring the UI to its knees until complete, so arguably we...
it's really frustrating, as I'm trying to debug server behavior (so I'm restarting often), and keep needing to re-create these.
if there's a process I'm not understanding please clarify...
but
(a) I start up the compose stack, log in via web browser as a user. this is on a remote server.
(b) I go to settings and generate a credential
(c) I use that credential to set up my local dev env, editing my clearml.conf (a sketch of doing this step from code is below)
(d) I repeat (b) and use that credential to start up remote workers to serve queues.
am I misunderstanding something? if there's another way to generate credentials I'm not familiar with it.
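for reference, here's a minimal sketch of how I could do step (c) from code instead of hand-editing clearml.conf — the hosts and keys below are placeholders, not real values:

```python
from clearml import Task

# Placeholders: paste the key/secret generated in the web UI (step b)
# and point the hosts at your own server.
Task.set_credentials(
    api_host="http://my-clearml-server:8008",
    web_host="http://my-clearml-server:8080",
    files_host="http://my-clearml-server:8081",
    key="GENERATED_ACCESS_KEY",
    secret="GENERATED_SECRET_KEY",
)

# Must run before the first Task.init in the process.
task = Task.init(project_name="debug", task_name="credential-smoke-test")
```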
hello @SuccessfulKoala55
I appreciate your help. Thank you. Do you happen to have any updates? We had another restart and lost the creds again. So our deployment is in a brittle state on this latest upgrade, and I'm going back to 1.15.1 until I hear back.
so, I tried this on a fresh deployment, and for some reason that stack allows me to restart without losing App Credentials.
It's just the one that I performed an update on.
for me, it was to set the log level higher and reduce the number of prints my code was doing. since I was using a logger instead of prints, it was pretty easy.
If you're using some framework that spits out its own progress bars, then I'd look into disabling those via whatever options it exposes.
Turning off logs entirely I don't know, I'll let the clearml people respond to that.
For sure though, the comms of CPU monitoring and epoch monitoring will lead to a lot of calls... but I'll agree 80k seems exce...
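roughly what I mean, assuming a logging-based setup and tqdm-style progress bars (both are just examples of the idea):

```python
import logging
from tqdm import tqdm

# Raise the root log level so routine debug/info chatter never hits the
# console stream that gets captured and shipped to the server.
logging.basicConfig(level=logging.WARNING)

# If a framework emits its own progress bar (tqdm here as a stand-in),
# disabling it removes thousands of tiny console updates per epoch.
for _batch in tqdm(range(1000), disable=True):
    pass  # training step would go here
```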
thank you very much.
for remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
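in case it helps anyone else, this is how I'd try to get it into a remote worker — it assumes the agent runs tasks in docker mode, and the image/queue names are placeholders:

```python
from clearml import Task

task = Task.init(project_name="debug", task_name="retry-backoff-test")

# Assumption: in docker mode the agent passes these arguments to the container,
# so the variable is visible to the SDK running inside the worker.
task.set_base_docker(
    docker_image="python:3.10",  # placeholder image
    docker_arguments="--env CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1",
)
task.execute_remotely(queue_name="default")
```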
App Credentials now persist (I upgraded 1.15.1 -> 1.16.1 and the same keys exist!)
thanks!
It sounds like you understand the limitations correctly.
As far as I know, it'd be up to you to write your own code that computes the delta between old and new and only re-process the new entries.
The API would let you search through prior experimental results.
so you could load up the prior task, check the ids that showed up in output (maybe you save these as a separate artifact for faster load times), and only evaluate the new inputs. perhaps you copy over the old outputs to the new task...
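a rough sketch of that idea — the project name, task id, and artifact keys are all made up for illustration:

```python
from clearml import Task

def load_input_ids() -> list:
    # Stand-in for however you enumerate the current inputs.
    return ["a", "b", "c", "d"]

def process(item_id):
    # Stand-in for the real per-item work.
    return item_id.upper()

current = Task.init(project_name="etl", task_name="process-inputs")

# Load the previous run (placeholder id) and the ids it already handled.
prev = Task.get_task(task_id="PREVIOUS_TASK_ID")
already_done = set(prev.artifacts["processed_ids"].get() or [])

new_ids = [i for i in load_input_ids() if i not in already_done]
outputs = {i: process(i) for i in new_ids}

current.upload_artifact("outputs_delta", outputs)
# Save the union so the next run can diff against it.
current.upload_artifact("processed_ids", sorted(already_done | set(new_ids)))
```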
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameter connects and task creation finish in seconds, and other times take 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
my operating theory is that some sort of backoff / timeout (e.g. 10s) is causing the hig...
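to test that theory, I'm planning something like the following — assuming the CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR override is actually honored when it's set before clearml is imported (project/task names are just for the test):

```python
import os

# Assumption: the backoff override has to be in the environment before the
# ClearML session is created, i.e. before the first clearml import.
os.environ["CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR"] = "0.1"

import time
from clearml import Task

t0 = time.perf_counter()
task = Task.init(project_name="debug", task_name="init-latency")
print(f"Task.init took {time.perf_counter() - t0:.1f}s")
```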
I would assume a lot of them are logs streaming? So you can try reducing printouts / progress bars. That seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
I can confirm that simply switching back to 1.15.1 results in persistent "App Credentials" across restarts.
Literally just did :%s/1.16.0/1.15.1/g, restarted the stack under the older version, created creds, and restarted again... and found them sitting there. So I know my volume mounts and all are good. It's something about the upgrade that caused this.
There's an issue on github that seems to be related, but the discussion under it seems to have digressed. Should I open a new is...
thanks for the clarification. is there any bypass? (a git diff + git rev-parse should take mere milliseconds)
I'm working out of a monorepo, and am beginning to suspect it's a cause of slowness. next week I'll try moving a pipeline over to a new repo to test if this theory holds any water.
I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.
and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or inputs to a task)
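a tiny sketch of what I mean by making the seed an explicit input (names are arbitrary):

```python
import random
from clearml import Task

task = Task.init(project_name="pipelines", task_name="deterministic-step")

# Same connected params are assumed to produce the same outputs - the contract
# step caching relies on - so the seed is an input rather than hidden state.
params = task.connect({"seed": 42, "n_samples": 1000})

random.seed(params["seed"])
samples = [random.random() for _ in range(params["n_samples"])]
task.upload_artifact("samples", samples)
```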
I do this a lot. pipeline params spawn K number of nodes, that collect just like you drew. No decorator being used here, just referencing tasks by id or name/project. I do not use continue on fail at all.
I do this with functions that have the contract (f(pipe: PipelineController, **kwargs) -> PipelineController) and a for-loop.
just be aware DAG creation slows down pretty quickly after a dozen or so such loops.
All the images below were made with the same pipeline (just evolved some n...
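roughly the pattern, with made-up project/task/queue names — treat it as a sketch rather than the exact code I run (in practice K comes from a pipeline parameter):

```python
from clearml import PipelineController

def add_fanout(pipe: PipelineController, k: int) -> PipelineController:
    # Each fan-out node clones the same draft task (the "class definition").
    workers = []
    for i in range(k):
        name = f"process_{i}"
        pipe.add_step(
            name=name,
            base_task_project="my-project",
            base_task_name="process-chunk",
            parameter_override={"General/chunk_id": i},
            parents=["prepare"],
        )
        workers.append(name)
    # One collection step that waits on every fan-out node.
    pipe.add_step(
        name="collect",
        base_task_project="my-project",
        base_task_name="collect-results",
        parents=workers,
    )
    return pipe

pipe = PipelineController(name="fanout-demo", project="my-project", version="0.0.1")
pipe.add_step(name="prepare", base_task_project="my-project", base_task_name="prepare-data")
pipe = add_fanout(pipe, k=4)
pipe.start(queue="services")
```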
thanks!
I've been experiencing enough weird behavior on my new deployment that I need to stick to 1.15.1 for a bit to get work done. The graphs show up there just fine, and it feels like (since I no longer need auth) it's the more stable choice right now.
When clearml-web receives the updates that are on the main branch now, I'll definitely be rushing to upgrade our images and test the latest again. (for now I'm still running a sidecar container hosting the built version of the web app o...
yeah. it's using what you see in the UI here.
so if you made a change to a task used in a pipeline (my pipelines are from tasks, not functions... can't speak to that, but I think it just generates a hidden task under the hood), point the (draft) task to that commit (assuming it's pushed), or re-run the task. the pipeline picks up from the tasks the API is aware of (by id or by name, in which case it uses the latest updated) under the specified project, not from local code.
that part was confusing...
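for what it's worth, a quick way I'd sanity-check which repo/commit a draft task actually points to (the id is a placeholder; field names follow the tasks API):

```python
from clearml import Task

t = Task.get_task(task_id="DRAFT_TASK_ID")
script = t.export_task().get("script", {}) or {}
print(script.get("repository"), script.get("branch"), script.get("version_num"))
```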
ah, I'm self-hosting.
progress bars could easily take up several thousand calls, as it moves with each batch.
would love to know if the # of API calls decreases substantially by turning off auto_connect_streams. please post an update when you have one 😃
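in case it's useful, this is the kind of thing I'd test — a minimal sketch assuming you mostly want to silence stdout while keeping errors:

```python
from clearml import Task

# Keep stderr (errors) but stop mirroring stdout / logging records to the server,
# which is where most progress-bar chatter ends up.
task = Task.init(
    project_name="debug",
    task_name="quiet-console",
    auto_connect_streams={"stdout": False, "stderr": True, "logging": False},
)
```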
mind-blowing... but somehow just later in the same day I got the same pipeline to create its DAG and start running in under a minute.
I don't know what exactly I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.
so there's hope. I'll keep stress-testing it and see what causes differences. I was right to suspect that such a simple DAG should not take...
is it? I can't tell if these delays (DAG computation) are pipeline-specific (I get that a pipeline is just a type of task), but it felt like a different question, as I'm asking "are pipelines like this appropriate?"
is there something fundamentally slower about using pipe.start() at the end of a pipeline vs pipe.run_locally()?
ah I see. thank you very much!
trying export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
but I still see "Environment setup completed successfully"
(it is printed after "Running task id")
it still takes a full 3 minutes between task pulled by worker until Running task id
is this normal? What is happening in these few minutes (besides a git pull / switch)?
Nope, still dealing with it.
Oddly enough, when I spin up a new instance on the new version, it doesn't seem to happen
are you on clearml agent 1.8.0?
(I'm noticing sometimes I'm just missing logs such as "Running task id.." entirely)
starting to. thanks for your explanation.
would those containers best be started from something in services mode? or is it possible to get no-overhead with my approach of worker-inside-docker?
I designed my tasks as different functions, based mostly on what metrics to report and which artifacts are best cached (and how to best leverage comparisons of tasks). they do require CPU, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how this go...
I really don't see how this provides any additional context that the timestamps + crops don't, but okay.