I would love some advice on that though - should I be using services mode + Docker with some max number of instances to spin up multiple tasks instead?
My thinking was to avoid some of the Docker overhead, but I did try this approach previously and found that the container limit wasn't exactly respected.
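For reference, the earlier attempt looked roughly like this - a sketch only; the queue name, docker image, and the numeric cap after --services-mode are placeholders, so worth checking against clearml-agent daemon --help for your version:

```bash
# sketch of the earlier attempt: one agent in services mode, capped at N concurrent containers
# (queue name, docker image, and the numeric cap are placeholders)
clearml-agent daemon \
  --queue services \
  --docker python:3.10 \
  --services-mode 4 \
  --detached
```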
I am definitely not seeing it persist after upgrading; previously this wasn't a problem on other upgrades.
thank you!
out of curiosity: how come the clearml-webserver upgrades weren't included in this release? was it just to patch the api part of the codebase?
yeah, that's how I've been generating credentials for agents as well as for my dev environment.
App Credentials now persist (I upgraded 1.15.1 -> 1.16.1 and the same keys exist!)
thanks!
damn, it just happened again... steps shown as "queued" in the viz are actually complete. the pipeline task disappeared again without completing, and the logs cut off mid-stream.
thanks!
I've been experiencing enough weird behavior on my new deployment that I need to stick to 1.15.1 for a bit to get work done. The graphs show up there just fine, and it feels like (since I no longer need auth) it's the more stable choice right now.
When clearml-web receives the updates that are on the main branch now, I'll definitely be rushing to upgrade our images and test the latest again. (for now I'm still running a sidecar container hosting the built version of the web app o...
it happens consistently with this one task that really should be fully cached.
I disabled cache in the final step and it seems to run now.
trying to run the experiment that kept failing right now, watching the logs (they go by fast)... will try to spot anything anomalous
nothing came up in the logs. all 200s
it's happening pretty reliably, but the logs are just not informative; it just stops midway
N/A (still shows as running despite Abort being sent)
I have tried other queues, they're all running the same container.
so far the only thing reliable is pipe.start_locally()
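To spell out the launch modes I've been comparing (illustrative only; pipe is the PipelineController instance):

```python
# illustrative comparison of the launch modes
pipe.start(queue="default")    # enqueue the controller itself to run on an agent -- this is where it dies on me
pipe.start_locally()           # run the controller in the local process, steps still go to agents -- reliable so far
# pipe.start_locally(run_pipeline_steps_locally=True)  # also runs the steps locally, handy for debugging
```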
that's the final screenshot. it just shows a bunch of normal "launching ..." steps, and then stops all of a sudden.
let me downgrade my install of clearml and try again.
yeah, this problem seems to happen on 1.15.1 and 1.16.2 as well; prior runs were even on the same version. It just feels like it happens absolutely randomly (but often).
just happened again to me.
The pipeline is constructed from tasks; it basically does map/reduce: prepare data -> model training + evaluation -> backtesting performance summary.
It figures out how wide to fan out by parsing the date range supplied as an input parameter. Been running stuff like this for months but only recently did ...
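A rough sketch of the shape of the pipeline, in case it helps - project/task names and the hard-coded month list are made up; the real controller derives the fan-out from the date-range parameter:

```python
from clearml import PipelineController

# rough sketch of the structure (names are placeholders)
pipe = PipelineController(name="backtest pipeline", project="examples", version="0.0.1")

months = ["2024-01", "2024-02", "2024-03"]  # in reality parsed from the input date range

train_steps = []
for m in months:
    prep = f"prepare_{m}"
    train = f"train_eval_{m}"
    # "map": clone a template task per month for data prep, then training + evaluation
    pipe.add_step(
        name=prep,
        base_task_project="examples",
        base_task_name="prepare data",
        parameter_override={"Args/month": m},
        execution_queue="default",
    )
    pipe.add_step(
        name=train,
        parents=[prep],
        base_task_project="examples",
        base_task_name="train + evaluate",
        parameter_override={"Args/month": m},
        execution_queue="default",
    )
    train_steps.append(train)

# "reduce": summarize backtesting performance across all branches
pipe.add_step(
    name="backtest_summary",
    parents=train_steps,
    base_task_project="examples",
    base_task_name="backtest summary",
    execution_queue="default",
)

pipe.start(queue="default")  # enqueue the controller -- this is the task that keeps disappearing
```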
ugh. again. it launched all these tasks and then just died. logs go silent.
the workers connect to the clearml server via ssh-tunnels, so they all talk to "localhost" despite being deployed in different places. each task creates artifacts and metrics that are used downstream
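Roughly what the tunneling amounts to on each worker - a sketch; the host alias is a placeholder and the ports assume the default docker-compose deployment:

```bash
# sketch of the connect.sh idea (ports: 8008 api, 8080 web, 8081 fileserver by default)
ssh -N -f \
  -L 8008:localhost:8008 \
  -L 8080:localhost:8080 \
  -L 8081:localhost:8081 \
  clearml-server-host

# point the SDK / agent at the tunneled endpoints
export CLEARML_API_HOST=http://localhost:8008
export CLEARML_WEB_HOST=http://localhost:8080
export CLEARML_FILES_HOST=http://localhost:8081
```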
I really can't provide a script that matches exactly (though I do plan to publish something like this soon enough), but here's one that's quite close / similar in style:
None where I tried function-steps out instead, but it's a similar architecture for the pipeline (the point of the example was to show how to do a dynamic pipeline)
it's odd... I really don't see any tasks dying except the controller one
enqueuing it: pipe.start("default")
but I think it's picking up on my local clearml install instead of what I told it to use.
my tasks have this in them... what's the equivalent for pipeline controllers?
did you take a look at my connect.sh script? I don't think it's the culprit, since only the controller task is affected.
Is there some sort of culling procedure that kills tasks by any chance? the lack of logs makes me think it's something like that.
I can also try different agent versions.
would it be on the pipeline task itself then, since that's what's disappearing?
I will do some experiment comparisons and see if there are package diffs. thanks for the tip.
thank you very much.
for remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
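If the raw env var isn't picked up, I assume the same knob can be set in clearml.conf on the worker - the key path below is my guess, so worth double-checking against the SDK defaults:

```
# clearml.conf on the worker -- assumed equivalent of the env variable (key path unverified)
api {
    http {
        retries {
            backoff_factor: 0.1
        }
    }
}
```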
so, I got around this with env vars; in my worker entrypoint script, I do
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
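For completeness, that part of the entrypoint looks roughly like this (simplified sketch; the queue name is a placeholder):

```bash
#!/bin/bash
# worker entrypoint (relevant excerpt, simplified)

# don't build a fresh python environment -- reuse what's already in the image
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
# point the agent at that interpreter instead of creating a pip venv
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)

exec clearml-agent daemon --queue default --foreground
```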
and for what it's worth, it seems I don't have anything special for agent cloning
I did find agent.vcs_cache.clone_on_pull_fail to be helpful, but yeah, updating the agent was the biggest fix
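In clearml.conf terms that's something like this (sketch: enabled is the default, clone_on_pull_fail is the bit I found helpful):

```
# clearml.conf on the agent machine
agent {
    vcs_cache {
        enabled: true
        # fall back to a fresh clone when a cached repo fails to pull
        clone_on_pull_fail: true
    }
}
```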
not quite seeing that one. hoping these views help
took me a while to deliver enough functionality to my team to justify working on open source... but I finally got back around to investigating this to write a proper issue, and ended up figuring it out myself and opening a PR:
None