
Oh yes. I see. Yeah, no ML here actually (doing the testing infra of endpoints), but certainly when there is, it's an issue.
How does clearml session avoid it? I guess only if autoscaling is used (one worker one machine)?
I'm guessing this is done through code-server?
I'm currently rolling a JupyterHub instance (multiuser, with codeserver inside) on the same machine as clearml-server. That’s where tasks are executed etc. so, all browser dev env.
It sounds like there’s an option to basically bypass this latter step and just use clearml’s credentialing to accomplish much the same thing? Am I understanding clearml-session correctly?
waiting now to see if they disappear.
any problems you may have spotted with the versions used?
project hasn't disappeared just yet. but it's happened twice now
the dataset, task, and pipeline were under the same project name. i'm seeing what happens if the dataset project name was different (f"{project_name}_data"). which project would get deleted... the dataset one or the project of the task that kicked it off?
and the answer is...
the project is preserved, the dataset's project hidden.
so ... empty dataset names due to a small typo in parameter override + the choice for the dataset to have the same project name as the task that created it (...
so when the task completed successfully (changed the queue to default and let it finish instead of aborting), the project disappeared.
i think we may have found the frankenbug?
the argument to the dataset name was not being overridden correctly (mistyped), so the default value of an empty string (instead of a placeholder like "CHANGE_ME") in the parent task caused the dataset to basically get created with an empty name, and somehow that hid the whole project, despite hundreds of existing tasks in it.
and no way to un-hide it as far as I can tell?
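fwiw, a guard like the one below would have caught the empty name before the dataset got created. this is only a minimal sketch: `require_dataset_name` and the `PLACEHOLDERS` set are hypothetical names, not part of clearml, and it assumes the name arrives as a plain string parameter.

```python
# Hypothetical guard, not part of clearml: reject dataset names that were
# never overridden (empty string or an obvious placeholder).
PLACEHOLDERS = {"", "CHANGE_ME"}


def require_dataset_name(name):
    """Return the stripped dataset name, or raise if it was never set."""
    cleaned = (name or "").strip()
    if cleaned in PLACEHOLDERS:
        raise ValueError(f"dataset name was not overridden (got {name!r})")
    return cleaned
```

calling this on the parameter right before the dataset is created makes a mistyped override fail loudly instead of producing an unnamed dataset.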
Oh neat! I want to take a look at this. Only a few more weeks at the client but it’d be nice to reduce the complexity of the software stack if I can before handoff.
Can you please elaborate on the latter point? My jupyterhub’s fully containerized and allows users to select their own containers (from a list i built) at launch, and launch multiple containers at the same time, not sure I follow how toes are stepped on.
then back to CLI, updated the pipeline to point the tasks to the new queue. run it, shows up in the UI (same container as default worker, just replicated w a new docker-compose and CMD to point to the new queue).
the clearml github, search for a file named cleanup service dot py (or something to that effect)
I opened github.com/allegroai/clearml/pull/1083 as an attempt to help catch this.
Weird. I recently implemented a function that talked to this exact endpoint and found it had to exclude the version and api paths. Is there some sort of redirect that happens?
i will attempt to start that now.
@BattyCrocodile47 put together None
Yup if you scroll through the logs in the console, near the top (post config dump), you’ll see a git clone and checkout to the specific hash.
PS You can actually change this parameter in an experiment’s configuration if it is in draft mode.
credentials for the server to do things with s3 will be in /opt/clearml/apiserver.conf.
oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.
you're basically asking to sample from a distribution where not all parameters are mutually independent.
the short answer is no, this is not directly supported. optuna needs each hyperparam to be independent, so it's up to you to handle the dependencies between parameters yourself unfortunately.
your solution of defining them independently and then using num_layers to potentially ignore other parameters is a valid one.
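roughly what that looks like on the training side, as a sketch. the parameter names (`layer_1_units` etc.) are made up for illustration; the optimizer suggests all of them independently and the code only reads the first `num_layers`.

```python
def active_layer_sizes(params):
    """Keep only the layer widths that num_layers says are in use.

    `params` is assumed to be a flat dict of independently sampled values,
    e.g. num_layers plus layer_1_units, layer_2_units, ... (hypothetical names).
    """
    num_layers = int(params["num_layers"])
    # Widths beyond num_layers were sampled but are simply ignored.
    return [int(params[f"layer_{i}_units"]) for i in range(1, num_layers + 1)]
```

the ignored parameters still get sampled, so some trials will differ only in values that had no effect on the model.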
maybe an important note: I mounted the same cache directory for the agents.
yeah let's step through this, i'm having her execute these steps as we speak.
create a task with the new project name. its created as a draft. can see it in the UI under the new project.
pipeline script is updated with the new project name. execute script to create pipeline. now see it in the UI under this new project name. nothing hidden.
the pipeline is running. when the queue is default (only serviced by one container with an agent in it, clearml-agent==1.5.2). abort it. everything is still ...
tasks that create pipelines feel like a hack and i found they don't show up in the UI (have to use the link in the console).
I've found that sometimes i need to right click "Run" a couple of times before the parameters are filled in properly.
one note is that it happened after I tried deploying a set of workers to a new queue, which she tried to use to run the tasks in parallel instead of our default queue which is only serviced by one worker (a container i built)
I think you’d have to run the cleanup service. That seems to be what controls deletion, based on archived status and some other temporal filters.
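to make the "archived + older than some threshold" idea concrete, here's a small sketch of that kind of filter. this is not the cleanup service's actual code; the dict fields (`system_tags`, `status_changed`) and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def cleanup_candidates(tasks, max_age_days, now=None):
    """Return ids of tasks that are archived and unchanged for max_age_days.

    `tasks` is assumed to be a list of dicts with an "id", a "system_tags"
    list, and a timezone-aware "status_changed" datetime (hypothetical shape).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        t["id"]
        for t in tasks
        if "archived" in t.get("system_tags", []) and t["status_changed"] < cutoff
    ]
```

anything not archived, or archived only recently, is left alone by a filter like this.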
the project wasn't hidden before. I'm aware of the pipeline tasks being hidden, that makes sense for organization. but the actual project itself as an entirety has a ghost icon.
she created a new project and started working in there, it was visible in the UI... and just now it disappeared again. it's kind of like running the pipeline makes it disappear.
you could also take the route of NOT specifying num_layers, and instead write your own code to create a set of viable layer designs to choose from and pass that as a parameter, so optuna selects from a countable set instead of suggesting integer values.
the downside of this is the lack of gradient information in the optimization process
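a sketch of the "countable set of designs" route, in plain python. the viability rule (widths never increase with depth) and the string encoding are just example choices, and the function names are hypothetical. the resulting list is what you'd hand to a categorical/discrete parameter (for instance optuna's `trial.suggest_categorical`).

```python
from itertools import product


def viable_designs(unit_choices, max_layers):
    """Enumerate layer designs as strings like "64-32".

    Example viability rule (an assumption): widths must not increase
    from one layer to the next.
    """
    designs = []
    for depth in range(1, max_layers + 1):
        for sizes in product(unit_choices, repeat=depth):
            if all(a >= b for a, b in zip(sizes, sizes[1:])):
                designs.append("-".join(str(s) for s in sizes))
    return designs


def parse_design(design):
    """Turn "64-32" back into [64, 32] inside the training code."""
    return [int(s) for s in design.split("-")]
```

encoding each design as a string keeps the parameter a single flat value, which is easy to log and to override from the UI.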
if you commit but do not push, the metadata tells clearml that it needs to pull a nonexistent commit. any changes you made on top may be saved as a diff, but they'd fail to apply.
for clearml to work on un-pushed commits, it'd have to wait for a push to register a new diff target, which can become a problem (what if you have multiple remotes? which one will it wait for?) so rather, it assumes it can access the most recent commit from your remote repo, and records this as the "base" upon whi...
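one way to catch this before launching is to check whether HEAD is already on some remote branch. a minimal sketch, assuming `git` is on the PATH and you run it inside the repo; the helper names are made up, and this is a local pre-flight check, not something clearml does for you.

```python
import subprocess


def parse_remote_branches(git_output):
    """Extract remote branch names from `git branch -r` style output."""
    return [
        line.strip()
        for line in git_output.splitlines()
        # Skip blanks and symbolic refs such as "origin/HEAD -> origin/main".
        if line.strip() and "->" not in line
    ]


def head_is_pushed(repo_dir="."):
    """True if at least one remote-tracking branch contains the current HEAD."""
    result = subprocess.run(
        ["git", "branch", "-r", "--contains", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return bool(parse_remote_branches(result.stdout))
```

note it only looks at remote-tracking refs as of your last fetch/push, so it says nothing about uncommitted changes.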