ah, thank you for the clarity. A quarterly release schedule makes sense, it's about what I've observed.
Let me know if I can be of any assistance in early testing!
Yup if you scroll through the logs in the console, near the top (post config dump), you’ll see a git clone and checkout to the specific hash.
PS You can actually change this parameter in an experiment’s configuration if it is in draft mode.
Oh yes. I see. Yeah, no ML here actually (doing the testing infra of endpoints), but certainly when there is, it's an issue.
How does clearml-session avoid it? I guess only if autoscaling is used (one worker, one machine)?
I think you’d have to run the cleanup service. That seems to be what controls deletion, based on archived status and some other temporal filters.
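To make the "archived status plus temporal filter" idea concrete, here's a rough sketch in plain Python. It is not the cleanup service's actual code: the task records, the `archived` tag name, the `last_update` field, and the 30-day threshold are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def select_for_deletion(tasks, now, max_age_days=30):
    """Return ids of tasks that are archived AND not updated within max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        t["id"]
        for t in tasks
        # Both conditions must hold: archived tag present, and older than the cutoff.
        if "archived" in t.get("system_tags", []) and t["last_update"] < cutoff
    ]


# Hypothetical task records, just to exercise the filter.
now = datetime(2024, 6, 1)
tasks = [
    {"id": "a", "system_tags": ["archived"], "last_update": datetime(2024, 1, 1)},
    {"id": "b", "system_tags": ["archived"], "last_update": datetime(2024, 5, 25)},
    {"id": "c", "system_tags": [], "last_update": datetime(2024, 1, 1)},
]
print(select_for_deletion(tasks, now))  # ['a']
```

Only task "a" qualifies: "b" is archived but too recent, and "c" is old but not archived.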
I believe pipe.connect_configuration is what you're looking for?
waiting now to see if they disappear.
any problems you may have spotted with the versions used?
project hasn't disappeared just yet. but it's happened twice now
the dataset, task, and pipeline were under the same project name. i'm seeing what happens if the dataset project name was different (f"{project_name}_data"). which project would get deleted... the dataset one or the project of the task that kicked it off?
and the answer is...
the project is preserved, the dataset's project hidden.
so ... empty dataset names due to a small typo in parameter override + the choice for the dataset to have the same project name as the task that created it (...
@<1798162804348293120:profile|FlutteringSeahorse49> wants to start HPO though, so the desire is to deploy agents to listen to queues on the slurm cluster (perhaps the controller runs on his laptop).
would that still make sense?
one note is that it happened after I tried deploying a set of workers to a new queue, which she tried to use to run the tasks in parallel instead of our default queue which is only serviced by one worker (a container i built)
i will attempt to start that now.
the project wasn't hidden before. I'm aware of the pipeline tasks being hidden, that makes sense for organization. but the actual project itself as an entirety has a ghost icon.
she created a new project and started working in there, it was visible in the UI... and just now it disappeared again. it's kind of like running the pipeline makes it disappear.
then back to CLI, updated the pipeline to point the tasks to the new queue. run it, shows up in the UI (same container as default worker, just replicated w a new docker-compose and CMD to point to the new queue).
dug deeper. if i'm to make a guess... /root/clearml.conf -> used on startup of agent-services as a template of sorts to create .clearml_agent.<id>.cfg on demand -> this task-specific file is mounted to /tmp/clearml_default.conf in a new container (docker in docker bc of the socket mounted to the agent-services) -> used to execute the task
you can put task.execute_remotely() to create it in draft mode. I've taken to configuring defaults to run things very quickly just in case i forget though (e.g. placeholder string for dataset, bail out early if not changed… or just do one epoch on a small subset of samples, etc).
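A minimal sketch of the "safe defaults" habit described above, in plain Python with no ClearML calls. The names `PLACEHOLDER`, `plan_run`, and the parameter names are made up for illustration; the idea is just that an un-overridden placeholder bails out early, and the defaults keep any accidental run tiny.

```python
PLACEHOLDER = "CHANGE_ME"  # hypothetical sentinel default for the dataset parameter


def plan_run(dataset_name=PLACEHOLDER, epochs=1, max_samples=32):
    """Decide what a run should do given its (possibly un-overridden) parameters."""
    if not dataset_name or dataset_name == PLACEHOLDER:
        # The parameter was never overridden: bail out before doing any real work.
        return {"action": "skip", "reason": "dataset_name not set"}
    # Defaults are deliberately small (one epoch, a few samples) so a forgotten
    # override costs seconds rather than hours.
    return {
        "action": "train",
        "dataset": dataset_name,
        "epochs": epochs,
        "max_samples": max_samples,
    }


print(plan_run())                       # skipped: placeholder still in place
print(plan_run(dataset_name="my_set"))  # trains with the tiny defaults
```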
probably, but the syntax would be that of a git diff, so it’d be a touch clunky if you ask me
Are you trying to avoid local development?
but isn't that just the same as running the agent in daemon mode? that's what i was hoping James could do
i think he's saying you'd want an intermediary layer that acts like the daemon.
why not run the daemon directly, i'm not sure, but i suspect it's bc it doesn't have an "end time" for execution (stays up)
I opened github.com/allegroai/clearml/pull/1083 as an attempt to help catch this.
so when the task completed successfully (changed the queue to default and let it finish instead of aborting), the project disappeared.
maybe an important note: I mounted the same cache directory for the agents.
i think we may have found the frankenbug?
the argument to the dataset name was not being overridden correctly (mistyped), so the default value of an empty string (instead of a placeholder like "CHANGE_ME") in the parent task caused the dataset to basically get created with an empty name, and somehow that hid the whole project, despite hundreds of existing tasks in it.
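For what it's worth, here's a small sketch of one way to catch this class of mistake before anything gets created. It's plain Python and not ClearML's own override mechanism; `DEFAULTS`, `apply_overrides`, and the parameter names are hypothetical. It rejects override keys that don't match a known parameter (the typo case) and refuses empty values.

```python
# Hypothetical parameter names and defaults, for illustration only.
DEFAULTS = {"dataset_name": "CHANGE_ME", "dataset_project": "CHANGE_ME"}


def apply_overrides(defaults, overrides):
    """Merge overrides into defaults, failing loudly on typos and empty values."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # A mistyped key would otherwise be ignored and the default silently kept.
        raise KeyError(f"unknown parameter override(s): {unknown}")
    merged = {**defaults, **overrides}
    empty = sorted(k for k, v in merged.items() if not str(v).strip())
    if empty:
        raise ValueError(f"empty value for parameter(s): {empty}")
    return merged


try:
    apply_overrides(DEFAULTS, {"datset_name": "images_v2"})  # note the typo
except KeyError as err:
    print("caught:", err)

print(apply_overrides(DEFAULTS, {"dataset_name": "images_v2"}))
```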
and no way to un-hide it as far as I can tell?
@<1541954607595393024:profile|BattyCrocodile47> put together None
Can vouch, this works well. Had my server hard reboot (maybe bc of clearml? maybe bc of hardware, maybe both… haven’t figured it out), and busy remote workers still managed to update the backend once it came back up.
Re: backups… what would happen if zipped while running but no work was being performed? Still an issue potentially?
and what happens if docker compose down is run while there’s work in the services queue? Will it be restored? What are the implications if a backup is perform...
thank you!
I'll add a volume mount to the services-agent container, and from what I understand that will become the template it uses?
is this the structure of the file?
None
or is it the "dot" syntax (like what shows up in the console when the task executes / your snippet)?
credentials for the server to do things with s3 will be in /opt/clearml/apiserver.conf.
I'm guessing this is done through code-server?
I'm currently rolling a JupyterHub instance (multiuser, with code-server inside) on the same machine as clearml-server. That’s where tasks are executed etc., so it's an all-browser dev env.
It sounds like there’s an option to basically bypass this latter step and just use clearml’s credentialing to accomplish much the same thing? Am I understanding clearml-session correctly?