
that's the final screenshot. it just shows a bunch of normal "launching ..." steps, and then stops all of a sudden.
here's how I'm establishing worker-server (and client-server) comms, fwiw
yeah, locally it did run. I then ran another via the UI, spawned from the successful one; it showed cached steps and then refused to run the bottom one, disappearing again. no status message, no status reason. (not running... actually dead)
damn. I can't believe it. It disappeared again despite the task's clearml version being 1.15.1.
I'm going to try running the pipeline locally.
it happens consistently with this one task that really should be entirely cached.
I disabled cache in the final step and it seems to run now.
damn, it just happened again... "queued" steps in the viz are actually complete. the pipeline task disappeared again without completing, logs cut off mid-stream.
I think i've narrowed this down to the ssh connection approach.
regarding the container that runs the pipeline:
- when I stopped using autossh tunnels and instead put it on the same machine as the clearml server + used docker host network mode, the problematic pipeline suddenly started completing.
it's just so odd that the pipeline controller task is the only one with an issue. the modeling / data-creation tasks all seem to complete consistently just fine.
so yeah, best guess n...
N/A (still shows as running despite Abort being sent)
the default queue is served by (containerized + custom entrypoint) venv workers (agent services just wasn't working well for me, so I gave up on it)
thanks for the clarification. is there any bypass? (a git diff + git rev-parse should take mere milliseconds)
I'm working out of a monorepo, and am beginning to suspect it's a cause of slowness. next week I'll try moving a pipeline over to a new repo to test whether this theory holds any water.
it's happening pretty reliably, but the logs are just not informative. it just stops midway.
i would love some advice on that though - should I instead be using services mode + docker with some max # of instances to spin up multiple tasks?
my thinking was to avoid some of the docker overhead. but i did try this approach previously and found that the container limit wasn't exactly respected.
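fwiw, the earlier attempt looked roughly like this - a sketch, assuming I'm remembering the flag right (queue name and the cap of 5 are just examples):

    # one agent launches multiple tasks concurrently inside containers,
    # with an (optional) cap on simultaneous tasks
    clearml-agent daemon --queue default --docker --services-mode 5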
what if the preexisting venv is just the system python? my base image is python:3.10.10 and i just pip install all requirements into that image. does that not still avoid venv creation?
it's good to know that in theory there's a path forward with almost zero overhead. that's what I want.
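a minimal sketch of what I mean, assuming the agent's venv-skip environment variables behave as documented (the interpreter path is an assumption for a python:3.10.10 image):

    # run tasks with the image's preinstalled interpreter; skip venv creation entirely
    export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
    # or: skip pip/venv setup but point the agent at a specific python binary
    export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/usr/local/bin/python3.10
    clearml-agent daemon --queue default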
is it reasonable to expect that with sufficient workers, I can get 50 tasks to run in the same time it takes to run a single one? i can't imagine the apiserver being a noticeable bottleneck.
I'm just working on speeding up the time from "queue experiment" to "my code actually runs remotely" - as of yesterday, things would sit for many minutes at a time. trying to see if venv creation is the culprit.
i just need to understand what I should be expecting. I thought the turnaround from putting a task into the queue in the UI to "running my code remotely" (esp. with packages preloaded) should be fairly fast - certainly not three minutes... i'll have to change my whole pipeline design if this is the case.
oooh thank you, i was hoping for some sort of debugging tips like that. will do.
from a speed-of-clearing-a-queue perspective, is a services-mode queue better or worse than having many workers "always up"?
I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.
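to make the analogy concrete, a minimal sketch of "instantiating" a draft task (project/task names here are hypothetical):

    from clearml import Task

    # the draft task is the "class definition"
    template = Task.get_task(project_name="examples", task_name="train_template")

    # each clone is an "object" created from it
    clone = Task.clone(source_task=template, name="train_clone_01")
    Task.enqueue(clone, queue_name="default")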
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameter connect and task creation finish in seconds; other times they take 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic for connecting to the API server?
my operating theory is that some sort of backoff / timeout (eg 10s) is causing the hig...
For digitalocean (this goes in an sdk.aws.s3.credentials entry in clearml.conf):

    host: "(region).digitaloceanspaces.com:443"
    bucket: "(bucket name)"
    key: "(key)"
    secret: "(secret)"
    multipart: false
    secure: true
    (verify commented out entirely)
So for you - make sure to add your creds with the right scope (r/w), and try specifying the bucket.
Then in the clearml tasks themselves, you tell the task where to write using output_uri="s3://(region).digitaloceanspaces.com:443/clearml/"
(I import this as a constant from a _constants.py file...
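a minimal sketch of that wiring (project/task names and the constant are just illustrative):

    from clearml import Task

    # e.g. imported from _constants.py
    OUTPUT_URI = "s3://(region).digitaloceanspaces.com:443/clearml/"

    task = Task.init(
        project_name="examples",   # hypothetical
        task_name="upload-demo",   # hypothetical
        output_uri=OUTPUT_URI,     # artifacts/models get uploaded to the Space
    )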
yes, i actually have been able to turn on caching since rc2 of the agent! it's been working much better.
yup, but you can modify them after task creation in the UI (if it's in draft state)
it's upon runtime instantiation of the PipelineController class.
that same pipeline with just 1 date input.
i have the flexibility from the UI to run a single experiment, a dozen, or a hundred... in parallel.
pipelines are amazing 😃
basically the git hash of the executed experiment + a hash on the inputs to the task.
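roughly this idea - an illustrative sketch, not ClearML's exact internals:

    import hashlib
    import json
    import subprocess

    def cache_key(task_inputs: dict) -> str:
        # git hash of the code being executed
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        # deterministic hash of the task's inputs
        input_hash = hashlib.sha256(
            json.dumps(task_inputs, sort_keys=True).encode()
        ).hexdigest()
        return f"{git_hash}:{input_hash}"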
App Credentials now persist (I upgraded 1.15.1 -> 1.16.1 and the same keys exist!)
thanks!
if there's a process I'm not understanding, please clarify...
but:
(a) i start up the compose stack, log in via web browser as a user. this is on a remote server.
(b) i go to settings and generate a credential
(c) i use that credential to set up my local dev env, editing my clearml.conf
(d) i repeat (b) and use that credential to start up remote workers to serve queues.
am i misunderstanding something? if there's another way to generate credentials, I'm not familiar with it.
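for reference, the credential from (c) just lands in the api section of my clearml.conf, something like this (hosts and keys are placeholders):

    api {
        web_server: http://<server>:8080
        api_server: http://<server>:8008
        files_server: http://<server>:8081
        credentials {
            "access_key" = "GENERATED_KEY"
            "secret_key" = "GENERATED_SECRET"
        }
    }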
I do this a lot. pipeline params spawn K nodes, which collect just like you drew. no decorator being used here, just referencing tasks by id or name/project. I don't use continue-on-fail at all.
I do this with functions that have the contract f(pipe: PipelineController, **kwargs) -> PipelineController, plus a for-loop (see the sketch below).
just be aware DAG creation slows down pretty quickly after a dozen or so such loops.
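a minimal sketch of the pattern (project/task names are hypothetical):

    from clearml import PipelineController

    def add_date_steps(pipe: PipelineController, dates: list, **kwargs) -> PipelineController:
        # spawn one node per date, each cloned from an existing draft task
        for date in dates:
            pipe.add_step(
                name=f"train_{date}",
                base_task_project="examples",     # hypothetical project
                base_task_name="train_template",  # hypothetical draft task
                parameter_override={"General/date": date},
            )
        return pipe

    pipe = PipelineController(name="daily_training", project="examples", version="1.0.0")
    pipe = add_date_steps(pipe, dates=["2024-01-01", "2024-01-02"])
    pipe.start(queue="default")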
All the images below were made with the same pipeline (just evolved some n...