
It's starting to make sense. Thanks for your explanation.
Would those containers best be started from something in services mode? Or is it possible to get no overhead with my approach of running the worker inside Docker?
I designed my tasks as separate functions, split mostly by which metrics they report, which artifacts are best cached, and how to best leverage comparisons between tasks. They do require CPU, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how that goes...
yeah, still noticing that it can be multiple minutes before something starts...
like... what is happening during this time (besides the git clone), now that I've set both
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
Update: it's now been six minutes and the task still isn't done. This should have run end-to-end in about a minute.

So, I tried this on a fresh deployment, and for some reason that stack lets me restart without losing App Credentials.
It's only the stack I performed an update on that loses them.
Nothing came up in the logs; all 200s.
If there's a process I'm not understanding, please clarify... but here's my flow:
(a) I start up the compose stack and log in via the web browser as a user. This is on a remote server.
(b) I go to Settings and generate a credential.
(c) I use that credential to set up my local dev env, editing my clearml.conf (sketched below).
(d) I repeat (b) and use that credential to start up remote workers to serve queues.
Am I misunderstanding something? If there's another way to generate credentials, I'm not familiar with it.
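For concreteness, step (c) ends up looking roughly like this in my clearml.conf (the host is my deployment's address; the key pair is whatever the Settings page generated):
api {
    web_server: http://my-server:8080
    api_server: http://my-server:8008
    files_server: http://my-server:8081
    credentials {
        "access_key" = "GENERATED_ACCESS_KEY"
        "secret_key" = "GENERATED_SECRET_KEY"
    }
}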
N/A (still shows as running despite Abort being sent)
thank you!
i'll take that design into consideration.
Re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode": I'm still not quite sure I understand correctly. Since the agent is running inside a container, as far as it's concerned it may as well be on bare metal.
Is it just that there's no way for that worker to avoid a venv? (i.e., is the only way to bypass the venv to use docker mode?)
let me downgrade my install of clearml and try again.
It's odd... I really don't see any tasks dying except the controller one.
Enqueuing with pipe.start("default").
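(For context, the controller is built and enqueued roughly like this; the names and the trivial step are placeholders from my test setup:)
from clearml import PipelineController

def step_one():
    # trivial placeholder step; my real steps just report a few metrics
    return 1

pipe = PipelineController(name="test-pipeline", project="debug", version="0.0.1")
pipe.add_function_step(name="step_one", function=step_one)
pipe.start("default")  # enqueue the controller itself on the "default" queue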
But I think it's picking up my local clearml install instead of the one I told it to use.
my tasks have this in them... what's the equivalent for pipeline controllers?
for now I'm just avoiding restarts of the service, but I do want to get to the bottom of it using a fresh instance.
As a backup plan: is there a way to have an API key set up prior to running docker compose up? I need at least one set of credentials that remote agents can reliably use, one that I know persists across restarts and upgrades.
A minute of silence between the first two messages, then two more minutes until a flood of logs. Basically three minutes total before this task (which does almost nothing; I'm just using it for testing) starts.
This is not about storage access tokens; it's about the App Credentials,
the things you set as CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY so that clients can talk to the API.
Yeah, that's how I've been generating credentials for agents as well as for my dev environment.
You can control how much memory Elasticsearch gets via the compose stack, but in my experience a 4-core box with 16 GB of RAM only gets you so far; for things to feel snappy you really need a lot of memory available once you're navigating over 100k tasks.
So far, staying under 500k tasks with 16 GB of RAM dedicated solely to Elasticsearch has been stable for us. Concurrent execution of more than a couple hundred workers can bring the UI to its knees until they complete, so arguably we...
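For reference, the knob I mean is the Elasticsearch heap in the server's docker-compose (service name as in the stock clearml-server compose file; the heap value is just an example, size it to your box):
  elasticsearch:
    environment:
      # bigger JVM heap for ES; keep it well below the memory available to the container
      ES_JAVA_OPTS: "-Xms8g -Xmx8g"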
Damn, it just happened again... "queued" steps in the visualization are actually complete. The pipeline task disappeared again without completing, logs cut off mid-stream.
I would love some advice on that though: should I be using services mode + docker, with some max number of instances, to spin up multiple tasks instead (something like the command below)?
My thinking was to avoid some of the Docker overhead, but I did try this approach previously and found that the container limit wasn't exactly respected.
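For reference, I mean launching the agent roughly like this (flags as I understand them; the queue name and max-instances value are just examples):
clearml-agent daemon --queue default --docker --services-mode 10  # cap of 10 concurrent task containers, if I'm reading the flag right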
Oh, it's there, before the task runs.
from task pick-up to "git clone" is now ~30s, much better.
Though as far as I understand, the recommendation is still not to run workers-in-docker like this:
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
(and FWIW I have this in my entrypoint.sh:)
cat <<EOF > ~/clearml.conf
agent {
    vcs_cache {
        enabled: true
    }
    package_manager: {
        type: pip,
        ...
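(After the heredoc closes with EOF, the rest of the entrypoint boils down to launching the agent; roughly like this, with the queue name being whatever that worker serves:)
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
# hand the container over to the agent process
exec clearml-agent daemon --queue default --foreground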
yup. once again, rebooted and lost my credentials.
perfect. thank you. I verified that this was indeed reproducible on 1.16.0 with a fresh deployment.
What if the preexisting venv is just the system Python? My base image is python:3.10.10 and I just pip install all the requirements into that image. Doesn't that avoid the venv?
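(For context, the worker image I'm describing is built roughly like this; the file names are mine, nothing ClearML-specific about them:)
FROM python:3.10.10
# bake all task requirements, plus the agent, into the system site-packages
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt clearml-agent
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]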
It's good to know that in theory there's a path forward with almost zero overhead. That's what I want.
Is it reasonable to expect that with sufficient workers I can get 50 tasks to run in the same time it takes to run a single one? I can't imagine the apiserver being a noticeable bottleneck.
Damn, I can't believe it. It disappeared again despite 1.15.1 being the task's clearml version.
I'm going to try running the pipeline locally.
I really don't see how this provides any additional context that the timestamps + crops don't, but okay.
Yeah, locally it did run. I then ran another via the UI, spawned from the successful one; it showed cached steps and then refused to run the bottom one, disappearing again. No status message, no status reason. (Not running... actually dead.)
would it be on the pipeline task itself then, since that's what's disappearing?
I will do some experiment comparisons and see if there are package diffs. thanks for the tip.
Trying to run the experiment that kept failing right now, watching the logs (they go by fast)... I'll try to spot anything anomalous.