
So far it seems that turning off cache like this is my "best option".
Yeah, that's how I've been generating credentials for agents as well as for my dev environment.
# imports
...
if __name__ == "__main__":
    pipe = PipelineController(...)
    # after instantiation, before "the code" that creates the pipeline.
    # normal tasks can handle task.execute_remotely() at this stage...
    pipe = add_steps_to_pipe(pipe)
    ...
    # after the pipeline is defined. Best I can tell, this *has* to be the last thing in the code.
    pipe.start_locally()  # or just .start()
thank you very much.
For remote workers, would this env variable get parsed correctly? CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
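For context, this is roughly what I mean on the worker side (just a sketch, assuming the variable is parsed like the other CLEARML_API_* overrides; the queue name is made up):
# exported in the shell that launches the agent on the worker box
export CLEARML_API_HTTP_RETRIES_BACKOFF_FACTOR=0.1
clearml-agent daemon --queue default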
Is it? I can't tell if these delays (DAG computation) are pipeline-specific (I get that a pipeline is just a type of task), but it felt like a different question, as I'm asking "are pipelines like this appropriate?"
Is there something fundamentally slower about using pipe.start() at the end of a pipeline vs pipe.start_locally()?
Perfect, thank you. I verified that this was indeed reproducible on 1.16.0 with a fresh deployment.
So, I tried this on a fresh deployment, and for some reason that stack allows me to restart without losing App Credentials.
It's just the one that I performed an update on.
App Credentials now persist (I upgraded 1.15.1 -> 1.16.1 and the same keys exist!)
thanks!
For now I'm just avoiding restarts of the service, but I do want to get to the bottom of it using a fresh instance.
As a backup plan: is there a way to have an API key set up prior to running docker compose up? I need at least one set of credentials that remote agents can reliably use, one that I know persists across restarts and upgrades.
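Something along these lines is what I'm imagining (a sketch only, assuming the docker-compose.yml forwards these env vars into agent-services; the key values are placeholders):
# .env next to docker-compose.yml, read automatically by docker compose
cat > .env <<'EOF'
CLEARML_API_ACCESS_KEY=<fixed-access-key>
CLEARML_API_SECRET_KEY=<fixed-secret-key>
EOF
docker compose up -d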
I did manage to figure this out with
docker compose stop agent-services
docker compose up --force-recreate --no-deps -d agent-services
and running an export for the newly generated key.
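Putting that together, the recovery sequence was roughly this (key values are placeholders for the freshly generated pair from the Settings page):
export CLEARML_API_ACCESS_KEY=<new-access-key>
export CLEARML_API_SECRET_KEY=<new-secret-key>
docker compose stop agent-services
docker compose up --force-recreate --no-deps -d agent-services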
Still, though, I'm noticing that restarts cause App Credentials to be lost.
Everything I just said comes from the screenshotted webpage and is regarding the CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY env vars.
When I restart the ClearML server, the keys start disappearing. This was not the case before upgrading.
If there's a process I'm not understanding, please clarify...
but
(a) I start up the compose stack and log in via web browser as a user. This is on a remote server.
(b) I go to Settings and generate a credential.
(c) I use that credential to set up my local dev env, editing my clearml.conf.
(d) I repeat (b) and use that credential to start up remote workers to serve queues (roughly as sketched below).
Am I misunderstanding something? If there's another way to generate credentials, I'm not familiar with it.
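Concretely, for (d) this is roughly what I run on each worker box (a sketch; the server URL, queue name, and key values are placeholders):
# credential from step (b); URL and queue name are made up
export CLEARML_API_HOST=<https://api.my-server.example>
export CLEARML_API_ACCESS_KEY=<access-key-from-settings>
export CLEARML_API_SECRET_KEY=<secret-key-from-settings>
clearml-agent daemon --queue default --docker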
It's really frustrating, as I'm trying to debug server behavior (so I'm restarting often) and keep needing to re-create these credentials.
When I do a docker compose down; docker compose up -d ... these disappear.
To be clear... this was not happening before I upgraded to the latest version. That is why I am asking about this.
Nope, still dealing with it.
Oddly enough, when I spin up a new instance on the new version, it doesn't seem to happen.
I am definitely not seeing it persist after upgrading. Previously it wasn't a problem on other upgrades.
hello @<1523701087100473344:profile|SuccessfulKoala55>
I appreciate your help. Thank you. Do you happen to have any updates? We had another restart and lost the creds again. So our deployment is in a brittle state on this latest upgrade, and I'm going back to 1.15.1 until I hear back.
Nothing came up in the logs; all 200s.
clearml-server-1.15.1, clearml-1.16.2
Yup! Those are the timings I was wondering if you'd help me find a way to change. Is there an option I can override to make the retries more aggressive?
I've definitely narrowed it down to the reverse proxy I'm behind. When I switch to a Cloudflare tunnel, the network overhead is <1s compared to localhost, and everything feels snappy!
But for security reasons I need to keep using the reverse proxy, hence my question about configuring the silent ClearML retries.
When I run the pipe locally, I'm using the same connect.sh script as the workers in order to poll the apiserver via the SSH tunnel.
Let me downgrade my install of clearml and try again.
(the "magic" of the env detection is nice but man... it has its surprises)
Hoping this really is a 1.16.2 issue. Fingers crossed. At this point more pipes are failing than not.
Yup. Once again, I rebooted and lost my credentials.
Thanks for the clarification. Is there any bypass? (A git diff + git rev-parse should take mere milliseconds.)
I'm working out of a monorepo and am beginning to suspect it's a cause of the slowness. Next week I'll try moving a pipeline over to a new repo to test whether this theory holds any water.
N/A (still shows as running despite Abort being sent)
I have tried other queues; they're all running the same container.
So far the only reliable thing is pipe.start_locally().
Ah, a clue! It came right below that, but I guess out of order...
That ID is the pipeline that failed.