with what I shared above, I now get: docker: Error response from daemon: network 'host' not found.
Hi TimelyPenguin76, I guess it tries to spin them down a second time, hence the double print
I am now trying with agent.extra_docker_arguments: ["--network='host'", ] instead of what I shared above
So I changed ebs_device_name = "/dev/sda1", and now I correctly get the 100GB EBS volume mounted on /. All good 🙂
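For reference, this is roughly what my resource configuration looks like, written here as a Python dict; the key names follow the AWS autoscaler example and some values are placeholders, so adjust to your setup:
` resource_configurations = {
    "v100_spot": {                        # the instance type name the autoscaler spins up
        "instance_type": "p3.2xlarge",    # placeholder V100 instance type
        "is_spot": True,
        "availability_zone": "eu-west-1a",
        "ami_id": "ami-08e9a0e4210f38cb6",
        "ebs_device_name": "/dev/sda1",   # must match the AMI's root device for the volume to be mounted on /
        "ebs_volume_size": 100,           # in GB
        "ebs_volume_type": "gp3",         # placeholder volume type
    },
} `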
Yes AgitatedDove14 🙂
AMI ami-08e9a0e4210f38cb6, region: eu-west-1a
there is no error from this side, I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the userdata script fails
Interestingly, I do see the 100GB volume in the AWS console:
It seems that around here, a Task that is created using Task.init remotely in the main process gets its output_uri parameter ignored
On the cloned experiment, which by default is created in draft mode, you can change the commit to point either to a specific commit or to the latest commit of the branch
Hi AnxiousSeal95, I hope you had nice holidays! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml or h2o as standalone if ClearML won't support deploying apps (which is totally fine, no offense there 🙂)
Setting it after the training correctly updated the task and I was able to store artifacts remotely
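For reference, a minimal sketch of that workaround; the project/task names and the bucket URI below are placeholders:
` from clearml import Task

task = Task.init(project_name="examples", task_name="train")  # output_uri passed here was being ignored when running remotely
task.output_uri = "s3://my-bucket/clearml"  # setting the default output destination after init did work
task.upload_artifact("results", artifact_object={"accuracy": 0.9})  # artifacts/models now go to the remote storage `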
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time
from clearml import Task

TRIGGER_TASK_INTERVAL_SECS = 60 * 60  # how often to schedule a new run
template_task_id = "<template-task-id>"  # ID of the task to clone periodically

controller_task = Task.init(project_name="automation", task_name="periodic scheduler")  # placeholder names
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)
while True:
    periodic_task = Task.clone(source_task=template_task_id)
    # Change parameters of periodic_task here if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run
I also tried setting ebs_device_name = "/dev/sdf" - didn't work
I think waiting for the apt locks to be released with something like this would work:
` startup_bash_script = [
    "#!/bin/bash",
    "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done",
    "sudo apt-get update",
    ...
Weirdly this throws an error in the autoscaler:
` Spinning new instance type=v100_spot
Error: Failed to start new instance, unexpected '{' in field...
That's how I would do it, maybe the guys from allegro-ai can come up with a better approach 🙂
Oh, the object is actually available in previous_task.artifacts
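Something like this works; the task ID and artifact name are placeholders:
` from clearml import Task

previous_task = Task.get_task(task_id="<previous-task-id>")
# artifacts behaves like a dict of Artifact objects; .get() downloads and deserializes the stored object
obj = previous_task.artifacts["my_object"].get() `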
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch
2. Have a separate thread that periodically pulls the instance metadata and checks whether the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (see the sketch below)
3. Have a service that periodically pulls failed experiments with the tag TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra parameter
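Here is a rough sketch of the watcher thread from step 2; it assumes an EC2 spot instance and uses the plain IMDSv1 metadata endpoint (with IMDSv2 enforced you would need to fetch a token first), and the TO_RESUME tag name is just my convention:
` import threading
import time

import requests
from clearml import Task

def watch_for_spot_interruption(task: Task, poll_secs: int = 30):
    # EC2 publishes a spot interruption notice on the instance metadata endpoint;
    # the URL returns 404 until the instance is actually marked for stop
    url = "http://169.254.169.254/latest/meta-data/spot/instance-action"
    while True:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                task.add_tags(["TO_RESUME"])  # picked up later by the rescheduling service
                return
        except requests.RequestException:
            pass  # metadata endpoint unreachable, keep polling
        time.sleep(poll_secs)

task = Task.current_task()
threading.Thread(target=watch_for_spot_interruption, args=(task,), daemon=True).start() `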
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
interestingly, it works on one machine, but not on another one
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist