
i think he's saying you'd want an intermediary layer that acts like the daemon.
I'm not sure why you wouldn't run the daemon directly, but I suspect it's because it doesn't have an "end time" for execution (it stays up).
in the clearml GitHub repo, search for a file named cleanup_service.py (or something to that effect)
credentials for the server to do things with s3 will be in /opt/clearml/apiserver.conf.
Yup, if you scroll through the logs in the console, near the top (post config dump), you’ll see a git clone and a checkout to the specific hash.
PS You can actually change this parameter in an experiment’s configuration if it is in draft mode.
I think you’d have to run the cleanup service. That seems to be what controls deletion, based on archived status and some other temporal filters.
I opened github.com/allegroai/clearml/pull/1083 as an attempt to help catch this.
thank you!
I'll add a volume mount to the services-agent container, and from what I understand that will become the template it uses?
is this the structure of the file?
None
or is it the "dot" syntax (like what shows up in the console when the task executes / your snippet)?
maybe an important note: I mounted the same cache directory for the agents.
Can vouch, this works well. Had my server hard reboot (maybe bc of clearml? maybe bc of hardware, maybe both… haven’t figured it out), and busy remote workers still managed to update the backend once it came back up.
Re: backups… what would happen if zipped while running but no work was being performed? Still an issue potentially?
and what happens if docker compose down is run while there’s work in the services queue? Will it be restored? What are the implications if a backup is perform...
so when the task completed successfully (changed the queue to default and let it finish instead of aborting), the project disappeared.
you're basically asking to sample from a distribution where not all parameters are mutually independent.
the short answer is no, this is not directly supported. optuna needs each hyperparam to be independent, so unfortunately it's up to you to handle the dependencies between parameters yourself.
your solution of defining them independently and then using num_layers to potentially ignore other parameters is a valid one.
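A minimal sketch of that workaround, assuming every per-layer width is sampled independently and num_layers decides which ones are actually used. The parameter names (units_layer_0 and so on) and the plain dict are hypothetical stand-ins for whatever the optimizer hands to the task:

```python
def build_layer_sizes(params, max_layers=4):
    """Keep only the widths of the layers that num_layers says exist."""
    num_layers = params["num_layers"]
    if not 1 <= num_layers <= max_layers:
        raise ValueError(f"num_layers must be between 1 and {max_layers}")
    # Widths for layers beyond num_layers were still sampled, but are ignored.
    return [params[f"units_layer_{i}"] for i in range(num_layers)]


# Hypothetical sample: four widths were suggested, only the first two are used.
sampled = {
    "num_layers": 2,
    "units_layer_0": 128,
    "units_layer_1": 64,
    "units_layer_2": 32,
    "units_layer_3": 16,
}
layer_sizes = build_layer_sizes(sampled)
```

The cost of this approach is that the optimizer still spends trials varying widths that have no effect on the model whenever num_layers is small.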
you can add a task.execute_remotely() call to create it in draft mode. I've taken to configuring defaults to run things very quickly just in case I forget though (e.g. placeholder string for dataset, bail out early if not changed… or just do one epoch on a small subset of samples, etc).
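A rough sketch of the "fast defaults" habit described above, in plain Python so it does not depend on any particular ClearML call. The placeholder value, the parameter names, and the plan_run helper are all hypothetical:

```python
PLACEHOLDER_DATASET = "REPLACE_ME"  # hypothetical default left in the script


def plan_run(params):
    """Decide how much work to do based on the task's parameters."""
    # Bail out early if nobody replaced the placeholder dataset.
    if params.get("dataset", PLACEHOLDER_DATASET) == PLACEHOLDER_DATASET:
        return {"action": "skip", "reason": "dataset placeholder was not changed"}
    # Default to a quick smoke test: one epoch on a small subset of samples.
    if params.get("smoke_test", True):
        return {"action": "train", "epochs": 1, "max_samples": 100}
    return {"action": "train", "epochs": params.get("epochs", 10), "max_samples": None}
```

With defaults like these, a run that was enqueued by accident either exits immediately or finishes in one short epoch.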
but isn't that just the same as running the agent in daemon mode? that's what I was hoping James could do
oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.
I ran into something similar during deployment. Hopefully this helps with your debugging: if the agent was launched separately from the rest of the stack, it may not have proper docker-DNS resolution to None (e.g. if in the same docker-compose, perhaps you didn't add the backend network field, or if it was launched separately through docker run without an explicit external network defined).
if the agent's on the same machine, try docker network connect to add...
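One way to check whether name resolution is the problem is to run a small lookup from inside the agent container. This is only a sketch: the can_resolve helper is hypothetical, and the hostname you pass in should be whatever name your compose file gives the server:

```python
import socket


def can_resolve(hostname):
    """Return True if this machine or container can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return True


# "localhost" is used here only so the snippet runs anywhere; inside the agent
# container you would pass the server's service name from your compose file.
print(can_resolve("localhost"))
```

If the lookup fails inside the agent container but works from a container in the main stack, the agent is most likely not attached to the same docker network.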
probably, but the syntax would be that of a git diff, so it’d be a touch clunky if you ask me
Are you trying to avoid local development?
ah, thank you for the clarity. A quarterly release schedule makes sense, it's about what I've observed.
Let me know if I can be of any assistance in early testing!
ah. that's a shame it's Enterprise only. no wonder I missed it.
I'm helping train my friend @<1798162804348293120:profile|FlutteringSeahorse49> on clearml to assist with his astrophysics research, and his university has a slurm cluster. So we're trying to figure out if we can launch an agent process on the cluster to pull work from the clearml queue (fwiw: containers are not supported on their cluster).
one note is that it happened after I tried deploying a set of workers to a new queue, which she tried to use to run the tasks in parallel instead of our default queue which is only serviced by one worker (a container i built)
For reproducibility, it kind of makes sense though. The existence of the file is contingent on the worker cloning the source code. I'm sure things can be done to maintain state differently but I personally adapted to the git-based workflow for managing files pretty quickly.
though yes I will admit I had the same thought first: why must I run it each time?
Beware: squash merges will ruin the ability to reproduce the experiment at that time since the git commit will be lost (presuming th...
@<1798162804348293120:profile|FlutteringSeahorse49> wants to start HPO though, so the desire is to deploy agents to listen to queues on the slurm cluster (perhaps the controller runs on his laptop).
would that still make sense?
@<1541954607595393024:profile|BattyCrocodile47> put together None
i will attempt to start that now.
you could also take the route of NOT specifying num_layers, and instead write your own code to create a set of viable layer designs to choose from and pass that as a parameter, so optuna selects from a countable set instead of suggesting integer values.
the downside of this is the lack of gradient information in the optimization process
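A sketch of what building that countable set could look like. The width choices, the non-increasing rule for what counts as "viable", and the string encoding are all assumptions made for illustration; the strings are there so each design can be passed around as a single categorical value:

```python
from itertools import product


def enumerate_layer_designs(widths=(32, 64, 128), max_layers=3):
    """List every design of 1..max_layers layers with non-increasing widths."""
    designs = []
    for depth in range(1, max_layers + 1):
        for combo in product(widths, repeat=depth):
            # Assumed viability rule: each layer is no wider than the previous one.
            if all(a >= b for a, b in zip(combo, combo[1:])):
                designs.append("-".join(str(w) for w in combo))
    return designs


def decode_design(design):
    """Turn an encoded design such as "128-64" back into a list of widths."""
    return [int(w) for w in design.split("-")]


designs = enumerate_layer_designs()
```

The optimizer then picks one entry from designs as a single parameter, and the training code calls decode_design on the chosen value.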
waiting now to see if they disappear.
any problems you may have spotted with the versions used?
project hasn't disappeared just yet. but it's happened twice now
If you can hit the endpoint with curl, you for sure can hook it up to many frontend frameworks.
Personal recs: gradio, streamlit
Abstract the interaction into a function call, and wrap it all in some UI elements using python.
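For example, a minimal sketch of wrapping the endpoint in a function, assuming it accepts a JSON POST and returns JSON. The URL, payload shape, and the call_endpoint name are hypothetical; the opener argument exists only so the function can be exercised without a live server:

```python
import json
import urllib.request


def call_endpoint(url, payload, opener=urllib.request.urlopen, timeout=10):
    """POST a JSON payload to the endpoint and return the decoded JSON reply."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the interaction is a plain function like this, it can be handed to whichever UI framework you pick as the callback behind a form or button.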
the project wasn't hidden before. I'm aware of the pipeline tasks being hidden, that makes sense for organization. but the actual project itself as an entirety has a ghost icon.
she created a new project and started working in there, it was visible in the UI... and just now it disappeared again. it's kind of like running the pipeline makes it disappear.
tasks that create pipelines feel like a hack, and I found they don't show up in the UI (have to use the link in the console).
I've found that sometimes I need to right click "Run" a couple of times before the parameters are filled in properly.