Eureka! With what I shared above, I now get: docker: Error response from daemon: network 'host' not found.
Hi TimelyPenguin76, I guess it tries to spin them down a second time, hence the double print.
I am now trying with agent.extra_docker_arguments: ["--network='host'"] instead of what I shared above.
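For what it's worth, my read of that error: nothing strips the quotes before the argument reaches docker, so docker looks for a network literally named 'host', quotes included. A minimal clearml.conf sketch with the bare network name (assuming the standard agent section):
` agent {
    # no quotes around host: the argument is passed to docker verbatim
    extra_docker_arguments: ["--network=host"]
} `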
So I changed ebs_device_name = "/dev/sda1", and now I correctly get the 100GB EBS volume mounted on /. All good 🙂
Yes AgitatedDove14 🙂
AMI ami-08e9a0e4210f38cb6, availability zone eu-west-1a (region eu-west-1)
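For reference, a rough sketch of how those values fit together in the autoscaler's resource configuration (field names as in the aws_auto_scaler.py example; the instance type and volume type here are assumptions):
` resource_configurations = {
    "v100_spot": {
        "instance_type": "p3.2xlarge",      # assumption: any GPU instance type
        "is_spot": True,
        "availability_zone": "eu-west-1a",  # from this thread
        "ami_id": "ami-08e9a0e4210f38cb6",  # from this thread
        "ebs_device_name": "/dev/sda1",     # the AMI's root device, so the volume is mounted on /
        "ebs_volume_size": 100,             # GB
        "ebs_volume_type": "gp3",           # assumption
    }
} `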
so what worked for me was the following startup userscript:
` #!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential... `
there is no error from this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the userdata script fails
As you can see: more hard waiting (the initial sleep), and then, before each apt action, a check that no other process holds the apt locks
the instances take so much time to start, around 5 minutes
edited the aws_auto_scaler.py; actually I think it's just a typo, I just need to double the brackets
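For context, a minimal illustration, assuming the user-data script is passed through Python's str.format() somewhere in aws_auto_scaler.py (which would also explain the "unexpected '{' in field" error later in this thread):
` # literal braces in a format string must be doubled to escape them
bad = "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock; do sleep 5; done"
good = "while sudo fuser /var/{{lib/{{dpkg,apt/lists}},cache/apt/archives}}/lock; do sleep 5; done"

# bad.format() raises ValueError: unexpected '{' in field name
print(good.format())
# -> while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock; do sleep 5; done `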
Interestingly, I do see the 100GB volume in the AWS console:
It seems that around here, a Task that is created using Task.init remotely in the main process gets its output_uri parameter ignored.
On the cloned experiment, which by default is created in draft mode, you can change the commit to point to either a specific commit or the latest commit of the branch,
and this works. However, without the trick from UnevenDolphin73, the following won't work (Task.current_task() returns None):
` from clearml import Task
Task.init()

if __name__ == "__main__":
    task = Task.current_task()  # returns None here when running remotely
    task.connect(config)
    run() `
UnevenDolphin73, task = clearml.Task.get_task(clearml.config.get_remote_task_id()) worked, thanks!
Hi AnxiousSeal95, I hope you had nice holidays! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml, or h2o as standalone if ClearML won't support deploying apps (which is totally fine, no offense there 🙂)
Setting output_uri after the training correctly updated the task, and I was able to store artifacts remotely.
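Putting the pieces together, a rough sketch of the workaround (the S3 destination is a made-up example; config and get_remote_task_id() are from the snippets above):
` import clearml

# fetch the task the agent already created, instead of relying on
# Task.init() / Task.current_task() in the remote process:
task = clearml.Task.get_task(clearml.config.get_remote_task_id())

# setting output_uri after the fact updates the task correctly:
task.output_uri = "s3://my-bucket/artifacts"  # made-up destination
task.connect(config)  # config as in the snippet above `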
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time
from clearml import Task

controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)

while True:
    # template_task_id is the id of the task to re-run periodically
    periodic_task = Task.clone(template_task_id)
    # change parameters of periodic_task here if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run.
I also tried setting ebs_device_name = "/dev/sdf" - didn't work.
I think waiting for the apt locks to be released with something like this would work:
` startup_bash_script = [
    "#!/bin/bash",
    "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done",
    "sudo apt-get update",
    ...
Weirdly this throws an error in the autoscaler:
` Spinning new instance type=v100_spot
Error: Failed to start new instance, unexpected '{' in field... `