
that's the final screenshot. it just shows a bunch of normal "launching ..." steps, and then stops all of a sudden.
thank you very much.
for remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
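(what I have in mind, as a sketch: the variable just sits in the worker's environment before the agent starts, e.g. in its entrypoint, so the agent and the tasks it runs inherit it. the queue name here is a placeholder:)
export CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
clearml-agent daemon --queue default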
sometimes I get "lucky" and see something more like what I expect: total experiment time < 1 min (and I have evidence of this happening: logs start-to-finish in under a minute). but then other times the same task will take 5-10 minutes.
same worker, same queue, just one worker serving it... I am so utterly perplexed by the variation in how long things take. my clearml API server is running on a beefy 32-core machine and not much else is happening right now...

and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or inputs to a task)
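(a minimal sketch of what I mean by a seed being an input to a task; the names and values are just illustrative:)
from clearml import Task
import random

task = Task.init(project_name="examples", task_name="deterministic-step")
# expose the seed as a task parameter, so the same inputs give the same outputs
params = task.connect({"seed": 42})
random.seed(params["seed"])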
yeah. it's using what you see in the UI here.
so if you made a change to a task used in a pipeline (my pipelines are built from tasks, not functions... can't speak to the latter, but i think it just generates a hidden task under the hood), point the (draft) task to that commit (assuming it's pushed), or re-run the task. the pipeline picks up the tasks the API is aware of (by id, or by name, in which case it uses the latest updated one) under the specified project, not from local code.
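(a sketch of the task-based flavor I mean, with placeholder project/task names; since the step is referenced by name, the latest updated task under that project is what gets cloned:)
from clearml import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0")
# reference an existing task by project + name; the pipeline clones whatever
# task the API currently knows under that name, not my local code
pipe.add_step(
    name="create_data",
    base_task_project="examples",
    base_task_name="create data",
)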
that part was confusing...
that same pipeline with just 1 date input.
i have the flexibility from the UI to either run a single, a dozen, or a hundred experiments... in parallel.
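(continuing the controller sketch above, as an illustration: the date is a single pipeline parameter, and from the UI each cloned run can override it:)
# one date input drives the whole pipeline
pipe.add_parameter(name="date", default="2024-01-01")
pipe.add_step(
    name="backtest",
    base_task_project="examples",
    base_task_name="backtest",
    parameter_override={"General/date": "${pipeline.date}"},
)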
pipelines are amazing 😃
the pipeline is to orchestrate tasks to create more complex functionality, and take advantage of caching, yes.
here I run backtesting (how well did i predict the future), and can control frequency "every week, every month" etc.
so if I increase frequency, I dont need to rerun certain branches of the pipeline and therefore they are cached. another example: if I change something that impacts layer 3 but not layer 1-2, then about half my tasks are cached.
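(the caching is opt-in per step; a sketch, continuing from above. with cache_executed_step set, a step whose inputs haven't changed reuses the previous run's outputs instead of re-executing:)
pipe.add_step(
    name="layer_1",
    base_task_project="examples",
    base_task_name="create data",
    cache_executed_step=True,  # same inputs -> reuse the cached outputs
)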
the pictured pipeline is: "create data...
what if the preexisting venv is just the system python? my base image is python:3.10.10 and i just pip install all requirements in that image. does that not still avoid the venv?
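(roughly what my image looks like, sketched with an assumed requirements.txt:)
FROM python:3.10.10
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt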
it's good to know that in theory there's a path forward with almost zero overhead. that's what I want.
is it reasonable to expect that with sufficient workers, I can get 50 tasks to run in the same time it takes to run a single one? I can't imagine the apiserver being a noticeable bottleneck.
oh it's there, before running the task.
from task pick-up to "git clone" is now ~30s, much better.
though as far as I understand, the recommendation is still to not run workers-in-docker like this:
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
(and fwiw I have this in my entrypoint.sh)
cat <<EOF > ~/clearml.conf
agent {
    vcs_cache {
        enabled: true
    }
    package_manager: {
        type: pip,
        ...
    }
}
EOF
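(and then the tail of the entrypoint just launches the agent; a sketch, the queue name is a placeholder:)
clearml-agent daemon --queue default --foreground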
i just need to understand what I should be expecting. I thought going from putting it into the queue in the UI to "running my code remotely" (esp with packages preloaded) should be a fairly fast turnaround - certainly not three minutes... (i'll have to change my whole pipeline design if this is the case)
i really don't see how this provides any additional context that the timestamps + crops don't, but okay.
but pretty reliably some proportion of tasks still just take a much longer time. 1m - 10m is a variance i'd really like to understand.
i understood that part, but noticed that when putting in the code to start remotely, the consequence seems to be that the DAG computation happens twice: once on my machine as it runs, and then again remotely (this is at least part of why it's slower). if i put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
this is unlike tasks, which somehow are smart enough to publish in draft form when task.execute_remotely is up top.
do i just leave off pipe.start?
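(the pattern I'm working with, as a sketch: start() enqueues the controller to run remotely - "services" is the usual default queue - while start_locally can keep everything on my machine for debugging:)
# run the controller remotely via a queue
pipe.start(queue="services")

# ...or, for debugging, run controller + steps locally instead:
# pipe.start_locally(run_pipeline_steps_locally=True)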
you can control how much memory elastic has via the compose stack, but in my experience, i've been able to run on a 4-core machine w/ 16gb of ram only up to a certain point. for things to feel snappy you really need a lot of memory available once you approach navigating over 100k tasks.
so far, under 500k tasks on 16gb of ram dedicated solely to elastic has been stable for us. concurrent execution of more than a couple hundred workers can bring the UI to its knees until complete, so arguably we...
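(the knob I mean, sketched as a docker-compose override on the elasticsearch service; the heap sizes are placeholders, and the usual rule of thumb is heap = about half the memory you give the container:)
services:
  elasticsearch:
    environment:
      - ES_JAVA_OPTS=-Xms8g -Xmx8g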
i ran into this recently.
it's a small thing, but double check the port. should be 443, not 433 as in the docs (typo?) - seems you got this in the screenshot.
no region should be set.
i don't use backblaze, but if it helps i can show my digitalocean spaces config. should be comparable.
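(here's the shape of mine in clearml.conf, keys redacted and the endpoint a placeholder; note the explicit :443 and no region anywhere:)
sdk {
    aws {
        s3 {
            credentials: [
                {
                    host: "nyc3.digitaloceanspaces.com:443"
                    key: "REDACTED"
                    secret: "REDACTED"
                    multipart: false
                    secure: true
                }
            ]
        }
    }
}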
fwiw - i'm starting to wonder if there's a difference between me "resetting the task" vs cloning it.
I can confirm that simply switching back to 1.15.1 results in persistent "App Credentials" across restarts. Literally just did :%s/1.16.0/1.15.1/g, restarted the stack under the older version, created creds, and restarted again... and found them sitting there. So I know my volume mounts and all are good. It's something about the upgrade that caused this.
There's an issue on github that seems to be related, but the discussion under it seems to have digressed. Should I open a new is...
thank you!
by any chance do you have insights into github.com/allegroai/clearml-server/issues/248? don't know if it's related to this at all or not, but it is an issue I experienced after upgrading.
perfect. thank you. I verified that this was indeed reproducible on 1.16.0 with a fresh deployment.
this is not about storage access tokens. it's about the App Credentials.
those things you set as CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY so that clients can talk to the api
it happens consistently with this one task that really should be all cache.
I disabled cache in the final step and it seems to run now.
if there's a process I'm not understanding please clarify...
but
(a) i start up the compose stack, log in via web browser as a user. this is on a remote server.
(b) i go to settings and generate a credential
(c) i use that credential to set up my local dev env, editing my clearml.conf
(d) i repeat (b) and use that credential to start up remote workers to serve queues.
am i misunderstanding something? if there's another way to generate credentials, I'm not familiar with it.
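(concretely, step (c) just means dropping the generated pair into the api section of my local clearml.conf; a sketch with placeholder URLs and keys:)
api {
    web_server: "https://app.example.com"
    api_server: "https://api.example.com"
    files_server: "https://files.example.com"
    credentials {
        "access_key" = "GENERATED_ACCESS_KEY"
        "secret_key" = "GENERATED_SECRET_KEY"
    }
}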
it's really frustrating, as I'm trying to debug server behavior (so I'm restarting often), and keep needing to re-create these.
Nope, still dealing with it.
Oddly enough, when i spin up a new instance on the new version, it doesn't seem to happen
so, I tried this on a fresh deployment, and for some reason that stack allows me to restart without losing App Credentials.
It's just the one that I performed an update on.