
worker thinks it's in venv mode but is containerized.
the apiserver is a docker compose stack
I'll check the logs next time I see it.
currently rushing to ship a model out, so I've just been running smaller experiments slowly, hoping to avoid the situation. fingers crossed.
He's asking "what git credentials make sense to use for agents" - regardless of autoscaling or not. I had the same question earlier.
tldr: it depends on your security policies.
@<1719524650926477312:profile|EncouragingFish95> - if you have the ability to create a "service account" in your git provider, perhaps at the org-level, I would do that.
My org's cloud git provider does not enable this functionality, and so we have agreed that it is "acceptable" to have the agent's git credentials...
thank you very much.
for remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
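(to illustrate what I mean, a minimal sketch of how I'd sanity-check the value, assuming the SDK reads it straight from the worker's environment at session init; the assert just confirms the string parses as the float I expect)
```python
import os

# assumption: the ClearML SDK picks this up from the environment when the
# session is created, so set it before anything imports clearml
os.environ["CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR"] = "0.1"

# sanity check that the value round-trips as the float a config parser would see
assert float(os.environ["CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR"]) == 0.1
```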
yeah, this problem seems to happen on 1.15.1 and 1.16.2 as well; prior runs were on the same version, even. It just feels like it happens absolutely randomly (but often).
just happened again to me.
The pipeline is constructed from tasks; it basically does map/reduce: prepare data -> model training + evaluation -> backtesting performance summary.
It figures out how wide to go by parsing the date range supplied as input parameter. Been running stuff like this for months but only recently did ...
here's how I'm establishing worker-server (and client-server) comms fwiw
let me downgrade my install of clearml and try again.
yeah, it just shows what I see in the Console, but then immediately goes back to polling for more work (so... instead of running backtest, it exits, no completion message)
ah, a clue! it came right below that, but I guess out of order...
that id is the pipeline that failed
you can control how much memory elastic has via the compose stack, but in my experience I've been able to run on a 4-core machine with 16GB of RAM only up to a certain point. for things to feel snappy you really need a lot of memory available once you approach navigating over 100k tasks.
so far, under 500k tasks on 16GB of RAM dedicated solely to elastic has been stable for us. concurrent execution of more than a couple hundred workers can bring the UI to its knees until complete, so arguably we...
enqueuing: pipe.start("default")
but I think it's picking up on my local clearml install instead of what I told it to use.
my tasks have this in them... what's the equivalent for pipeline controllers?
N/A (still shows as running despite Abort being sent)
it happens consistently with this one task that really should be all cache.
I disabled cache in the final step and it seems to run now.
yup! that's exactly what I was hoping you'd help me find a way to change the timing of. Is there an option I can override to make the retry more aggressive?
I've definitely narrowed it down to the reverse proxy I'm behind. when I switch to a cloudflare tunnel, the overhead of the network is <1s compared to localhost, and everything feels snappy!
But for security reasons, I need to keep using the reverse proxy, hence my question about configuring the silent clearml retries.
yeah, locally it did run. I then ran another via the UI, spawned from the successful one; it showed cached steps and then refused to run the bottom one, disappearing again. No status message, no status reason. (not running... actually dead)
damn. I can't believe it. It disappeared again despite having 1.15.1 be the task's clearml version.
I'm going to try running the pipeline locally.
thank you!
i'll take that design into consideration.
re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode" I'm still not quite sure I understand correctly: since the agent is running in a container, as far as it's concerned it may as well be on bare-metal.
is it just that there's no way for that worker to avoid venv? (i.e. the only way to bypass venv is to use docker-mode?)
it's happening pretty reliably, but the logs are just not informative. it just stops midway.
odd bc I thought I was controlling this... maybe I'm wrong and the env is mis-set.
I have tried other queues, they're all running the same container.
so far the only thing reliable is pipe.start_locally()
is there a way for me to toggle CLEARML's log level? I'm doing some manual task-debugging in ipython and think it would be helpful to see network requests and timeouts if they're occurring.
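(for context, something like this is what I'm imagining: a rough sketch assuming clearml logs through the stdlib logging module, so the exact logger names here are a guess on my part)
```python
import logging

# assumption: clearml uses the standard logging module, so bumping these
# loggers to DEBUG should surface more detail in ipython
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("clearml").setLevel(logging.DEBUG)

# urllib3 debug logs show the individual HTTP requests/retries against the
# API server, which helps spot timeouts as they happen
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```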
I really can't provide a script that matches exactly (though I do plan to publish something like this soon enough), but here's one that's quite close / similar in style:
where I tried function-steps out instead, but it's a similar architecture for the pipeline (the point of the example was to show how to do a dynamic pipeline)
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameter connects and task creation take seconds, and other times it takes 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
my operating theory is that some sort of backoff / timeout (eg 10s) is causing the hig...
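(not my actual script, but a minimal sketch of the kind of timing I mean, with placeholder project/task names and parameters: wrap the calls that swing between seconds and minutes and print the wall-clock time)
```python
import time
from clearml import Task

t0 = time.perf_counter()
task = Task.init(project_name="examples", task_name="timing-probe")  # placeholder names
print(f"Task.init took {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
task.connect({"lr": 0.001, "batch_size": 32})  # placeholder parameters
print(f"connect() took {time.perf_counter() - t0:.2f}s")
```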
it's odd... I really don't see any tasks dying except the controller one
is it? I can't tell if these delays (DAG computation) are pipeline-specific (I get that a pipeline is just a type of task), but it felt like a different question, as I'm asking "are pipelines like this appropriate?"
is there something fundamentally slower about using pipe.start() at the end of a pipeline vs pipe.start_locally()?
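(to be concrete, this is the difference I mean; everything except the two start calls is a placeholder)
```python
from clearml import PipelineController

# placeholder controller and step, just to show where the start call goes
pipe = PipelineController(name="example", project="examples", version="1.0.0")

def step_one():
    print("step one")

pipe.add_function_step(name="step_one", function=step_one)

# option A: enqueue the controller task so a remote agent picks it up
# pipe.start(queue="default")

# option B: run the controller in the current process
# (the only mode that's been reliable for me so far)
pipe.start_locally(run_pipeline_steps_locally=False)
```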
thanks for the clarification. is there any bypass? (a git diff + git rev-parse should take mere milliseconds)
I'm working out of a mono repo, and am beginning to suspect it's a cause of slowness. next week I'll try moving a pipeline over to a new repo to test whether this theory holds any water.
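(in the meantime, a quick sketch of how I'd time the raw git calls from the repo root to see whether the mono repo itself is the slow part; the two commands are only my guess at roughly what gets run when the repo state is captured)
```python
import subprocess
import time

# guess at roughly the git calls involved in capturing repo state
for cmd in (["git", "rev-parse", "HEAD"], ["git", "diff", "--stat"]):
    t0 = time.perf_counter()
    subprocess.run(cmd, capture_output=True, check=True)
    print(cmd, f"{(time.perf_counter() - t0) * 1000:.1f} ms")
```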
damn, it just happened again... "queued" steps in the viz are actually complete. the pipeline task disappeared again without completing, with the logs cut off mid-stream.
the default queue is served with (containerized + custom entrypoint) venv workers (agent services just wasn't working great for me, so I gave up on it)