I would assume a lot of them are logs streaming? So you can try reducing printouts / progress bars. That seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
everything else is 200 except these two
thanks for the clarification. is there any bypass? (a git diff + git rev-parse should take mere milliseconds)
I'm working out of a monorepo, and am beginning to suspect it's a cause of slowness. next week I'll try moving a pipeline over to a new repo to test if this theory holds any water.
still no graphs showing up, and still seeing this error in the console logs.
(deployment is localhost)
and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or inputs to a task)
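As a tiny sketch of what I mean (names are illustrative, not from a real pipeline): make the seed a connected parameter, so two runs with identical inputs really are identical as far as the cache is concerned.

import random
from clearml import Task

# hypothetical step script: the seed is an explicit, connected parameter
task = Task.init(project_name="examples", task_name="train_step")

params = {"seed": 42, "learning_rate": 1e-3}   # same inputs -> same outputs is on you
params = task.connect(params)                  # UI / pipeline overrides land back in this dict

random.seed(params["seed"])                    # all randomness derives from the connected seed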
thanks!
I've been experiencing enough weird behavior on my new deployment that I need to stick to 1.15.1 for a bit to get work done. The graphs show up there just fine, and it feels like (since I no longer need auth) it's the more stable choice right now.
When clearml-web receives the updates that are on the main branch now, I'll definitely be rushing to upgrade our images and test the latest again. (for now I'm still running a sidecar container hosting the built version of the web app o...
no helm, just docker compose, but yes, community edition
took me a while to deliver enough functionality to my team to justify working on open source... but I finally got back around to investigating this to write a proper issue, and ended up figuring it out myself and opening a PR:
None
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameter connect and task creation take seconds, and other times it takes 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
my operating theory is that some sort of backoff / timeout (e.g. 10s) is causing the hig...
hm. yeah i do see something like what you have in the screenshot.
{"meta":{"id":"d7d059b69fc14cba9ba6ff52307c9f67","trx":"d7d059b69fc14cba9ba6ff52307c9f67","endpoint":{"name":"queues.get_queue_metrics","requested_version":"2.30","actual_version":"2.4"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":"","error_data":{}},"data":{"queues":[{"avg_waiting_times":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0...
I do this a lot. pipeline params spawn K nodes, which then collect just like you drew. No decorator being used here, just referencing tasks by id or name/project. I do not use continue-on-fail at all.
I do this with functions that have the contract f(pipe: PipelineController, **kwargs) -> PipelineController and a for-loop (rough sketch below).
just be aware DAG creation slows down pretty quickly after a dozen or so such loops.
All the images below were made with the same pipeline (just evolved some n...
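Roughly what that pattern looks like, with made-up project/template-task names and a fixed K instead of my real pipeline parameter:

from clearml import PipelineController

def add_training_branch(pipe: PipelineController, *, index: int, parent: str) -> PipelineController:
    # hypothetical helper following the f(pipe, **kwargs) -> PipelineController contract
    pipe.add_step(
        name=f"train_{index}",
        parents=[parent],
        base_task_project="examples",          # assumed project / template task names
        base_task_name="train_template",
        parameter_override={"General/split": index},
    )
    return pipe

pipe = PipelineController(name="fan_out_example", project="examples", version="0.0.1")
pipe.add_step(name="prepare_data", base_task_project="examples", base_task_name="prepare_template")

K = 5  # in my case K comes from a pipeline parameter
for i in range(K):
    pipe = add_training_branch(pipe, index=i, parent="prepare_data")

# collect step that fans back in, exactly like the drawing
pipe.add_step(
    name="collect",
    parents=[f"train_{i}" for i in range(K)],
    base_task_project="examples",
    base_task_name="collect_template",
)
pipe.start_locally(run_pipeline_steps_locally=True)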
yes, I actually have been able to turn on caching after rc2 of the agent! been working much better.
the pipeline is to orchestrate tasks to create more complex functionality, and take advantage of caching, yes.
here I run backtesting (how well did I predict the future), and can control the frequency: "every week", "every month", etc.
so if I increase the frequency, I don't need to rerun certain branches of the pipeline and therefore they are cached. another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.
the pictured pipeline is: "create data...
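For reference, the per-step cache flag this relies on; the step and parameter names here are placeholders, and pipe is assumed to be a PipelineController with a "frequency" pipeline parameter:

# cache_executed_step reuses a previous completed run of the step when the base task
# and its parameter overrides are identical, which is what keeps layers 1-2 cached
pipe.add_step(
    name="build_features",                                      # placeholder step name
    parents=["create_data"],
    base_task_project="examples",
    base_task_name="feature_template",
    parameter_override={"General/frequency": "${pipeline.frequency}"},
    cache_executed_step=True,
)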
yup, but you can modify them after task creation in the UI (if it's in draft state)
it's upon runtime instantiation of the PipelineController class.
not quite seeing that one. hoping these views help
yup! that's exactly what I was hoping you could help me find a way to change the timing of. Is there an option I can override to make the retry more aggressive?
I've definitely narrowed it down to the reverse proxy I'm behind. when I switch to a cloudflare tunnel, the overhead of the network is <1s compared to localhost, everything feels snappy!
But for security reasons, I need to keep using the reverse proxy, hence my question about configuring the silent clearml retries.
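To be concrete about what I'm hoping to tune: the HTTP retry block in clearml.conf. I'm not certain these key names are exactly right, so treat this as a guess to check against the shipped clearml.conf reference:

api {
    http {
        retries {
            total: 5             # fewer total attempts before failing fast
            connect: 3           # retries on connection errors specifically
            backoff_factor: 0.3  # shorter exponential backoff between attempts
        }
    }
}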
yeah, I ended up figuring it out. I think we are in similar situations (private git repo w/ token). I'll take a look at my config tomorrow, but from memory, you have to set your env variables and have an option in your config to force the https protocol if you're using a token.
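from memory, the pieces on my side were roughly these (the literal values are just examples, and the config key is the one I think I used, so verify before copying):

export CLEARML_AGENT_GIT_USER=oauth2              # or your git username, depending on the git host
export CLEARML_AGENT_GIT_PASS=<personal-access-token>
# and in clearml.conf, keep the agent from rewriting the repo URL to ssh,
# so the token is actually used over https:
# agent { force_git_ssh_protocol: false }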
so, I got around this with env vars.
in my worker entrypoint script, I do
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
and for what it's worth, it seems I don't have anything special for agent cloning
I did find agent.vcs_cache.clone_on_pull_fail to be helpful. but yeah, updating the agent was the biggest fix
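for context, the relevant block in my agent's clearml.conf looks something like this (the path is just the default I happen to use):

agent {
    vcs_cache {
        enabled: true
        path: ~/.clearml/vcs-cache
        # fall back to a fresh clone when pulling into the cached copy fails
        clone_on_pull_fail: true
    }
}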
ah, I'm self-hosting.
progress bars could easily take up several thousand calls, as they update with each batch.
would love to know if the # of API calls decreases substantially by turning off auto_connect_streams. please post an update when you have one 😃
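if you do try it, I believe it's just an argument to Task.init, something along these lines (the dict form is how I remember it; double-check the docs):

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="quiet_run",
    # keep stdout/stderr capture but drop python logging capture; pass False to disable all streams
    auto_connect_streams={"stdout": True, "stderr": True, "logging": False},
)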
N/A (still shows as running despite Abort being sent)
for me, the fix was to set the log level higher and reduce the number of prints my code was doing. since I was using a logger instead of prints, it was pretty easy.
If you're using some framework that spits out its own progress bars, then I'd look into disabling those from options available.
Turning off logs entirely, I don't know; I'll let the clearml ppl respond to that.
For sure though, the comms of CPU monitoring and epoch monitoring will lead to a lot of calls... but I'll agree 80k seems exce...
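concretely, the kind of change I made (generic sketch, adapt to your own logger/framework):

import logging

# raise the root log level so per-batch debug/info messages never hit the console
# (and therefore never get shipped to the server as console events)
logging.getLogger().setLevel(logging.WARNING)

# frameworks that draw their own progress bars usually have a switch for it,
# e.g. a verbose=0 or disable-progress-bar style option on the relevant call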
nothing came up in the logs. all 200's
I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.
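which is also, roughly, what you can do by hand; a small sketch with invented names:

from clearml import Task

template = Task.get_task(project_name="examples", task_name="train_template")  # the draft "class"
clone = Task.clone(source_task=template, name="train_template clone")          # a new draft "object"
clone.set_parameters({"General/seed": 1234})                                   # tweak it before running
Task.enqueue(clone, queue_name="default")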
odd, because I thought I was controlling this... maybe I'm wrong and the env var is mis-set.
clearml-server-1.15.1, clearml-1.16.2
hello @SuccessfulKoala55
I appreciate your help. Thank you. Do you happen to have any updates? We had another restart and lost the creds again. So our deployment is in a brittle state on this latest upgrade, and I'm going back to 1.15.1 until I hear back.
the worker thinks it's in venv mode but is containerized.
apiserver is docker compose stack
I'll check logs next time I see it.
currently rushing to ship a model out, so I've just been running smaller experiments slowly, hoping to avoid the situation. fingers crossed.