With what I shared above, I now get: docker: Error response from daemon: network 'host' not found.
Hi TimelyPenguin76, I guess it tries to spin them down a second time, hence the double print
I am now trying with agent.extra_docker_arguments: ["--network='host'", ]
instead of what I shared above
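For reference, a minimal sketch of the relevant clearml.conf section, assuming the agent passes these arguments verbatim to docker run; the inner quotes around host may be exactly what makes Docker look for a network literally named 'host':
`
agent {
    # passed as-is to the docker run command line
    extra_docker_arguments: ["--network=host"]
}
`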
So I changed ebs_device_name = "/dev/sda1", and now I correctly get the 100 GB EBS volume mounted on /. All good 🙂
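For context, a minimal sketch of the matching resource entry, using the field names from the aws_auto_scaler.py example; the resource key and instance type are placeholders, while the AMI, zone, device name and volume size are the values mentioned above:
`
# Sketch of a resource entry for the ClearML AWS autoscaler example script.
RESOURCE_CONFIGURATIONS = {
    "gpu_on_demand": {                       # placeholder resource name
        "instance_type": "g4dn.xlarge",      # placeholder instance type
        "is_spot": False,
        "availability_zone": "eu-west-1a",
        "ami_id": "ami-08e9a0e4210f38cb6",
        "ebs_device_name": "/dev/sda1",      # root device of this Ubuntu 18.04 AMI
        "ebs_volume_size": 100,              # GB
    },
}
`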
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
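For reference, a minimal sketch of uploading a table while the task is still running, assuming it is a pandas DataFrame (the ValueError above is what you get once the task is already closed or completed):
`
import pandas as pd
from clearml import Task

# The task must still be in a created/in_progress state for updates to succeed
task = Task.current_task()
table = pd.DataFrame({"epoch": [1, 2], "metric": [0.1, 0.2]})  # placeholder data
task.upload_artifact(name="results_table", artifact_object=table)
`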
Yes AgitatedDove14 🙂
AMI ami-08e9a0e4210f38cb6, region eu-west-1 (availability zone eu-west-1a)
the Deep Learning AMI from NVIDIA (Ubuntu 18.04)
So what worked for me was the following startup user-data script:
`
#!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
`
There is no error from this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the user-data script fails.
Now it starts, I'll see if this solves the issue
As you can see, more hard waiting (the initial sleep), and then before each apt action, making sure there is no lock
the instances take so much time to start, like 5 minutes
I edited aws_auto_scaler.py; actually I think it's just a typo, I just need to double the brackets
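For illustration, a minimal sketch of the brace doubling, assuming the user-data script is built with str.format() inside aws_auto_scaler.py (the agent_version placeholder is hypothetical):
`
# Literal bash braces must be doubled so str.format() does not treat them as
# placeholders; only the real placeholders keep single braces.
user_data_template = (
    "#!/bin/bash\n"
    "while sudo fuser /var/{{lib/{{dpkg,apt/lists}},cache/apt/archives}}/lock "
    ">/dev/null 2>&1; do sleep 5; done\n"
    "python3 -m pip install clearml-agent=={agent_version}\n"
)
print(user_data_template.format(agent_version="1.5.0"))
`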
Interestingly, I do see the 100 GB volume in the AWS console:
ok, what is your problem then?
Could you please share the stacktrace?
It seems that around here, a Task that is created using Task.init remotely in the main process gets its output_uri parameter ignored.
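As a workaround sketch (the project name, task name and bucket below are placeholders, not from the original thread), the destination can be re-applied on the task object after init:
`
from clearml import Task

task = Task.init(
    project_name="examples",              # placeholder
    task_name="output-uri-check",         # placeholder
    output_uri="s3://my-bucket/clearml",  # placeholder destination
)
# Re-apply explicitly, in case the init argument is ignored when the task
# was created remotely, as described above.
task.output_uri = "s3://my-bucket/clearml"
`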
I killed both trains-agent instances and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So probably the bug appears when an error occurs while setting up a task: it cannot go back to the main task. I would need to do some tests to validate that hypothesis though.
On the cloned experiment, which by default is created in draft mode, you can change the commit to point either to a specific commit or to the latest commit of the branch.
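The same change can also be scripted; a minimal sketch, assuming a recent enough SDK where Task.set_script() is available (the task id, branch and commit below are placeholders):
`
from clearml import Task

# Clone the experiment; the clone is created in draft mode
template = Task.get_task(task_id="<source-task-id>")       # placeholder id
cloned = Task.clone(source_task=template, name="my clone")

# Point the clone at a specific commit, or give only the branch to use its
# latest commit (set_script availability is an assumption; otherwise edit
# the commit field in the UI as described above)
cloned.set_script(branch="main", commit="<commit-sha>")
`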
and this works. However, without the trick from UnevenDolphin73, the following won't work (Task.current_task() returns None):
`
from clearml import Task

Task.init()

if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()
`
AgitatedDove14, my "uncommitted changes" ends with:
`
from clearml import Task

Task.init()

if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()
`
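Putting the two together, a minimal sketch that should work both locally and under the agent, assuming clearml.config.get_remote_task_id() returns None when not running remotely (config and run() are placeholders):
`
import clearml
from clearml import Task

Task.init()  # module level, so the agent attaches to the existing remote task

config = {"lr": 0.001}  # placeholder configuration


def run():
    pass  # placeholder for the actual training code


if __name__ == "__main__":
    # Under an agent, fetch the task by its remote id (the trick above);
    # Task.current_task() can otherwise come back as None in a fresh __main__.
    remote_task_id = clearml.config.get_remote_task_id()
    task = (
        clearml.Task.get_task(task_id=remote_task_id)
        if remote_task_id
        else Task.current_task()
    )
    task.connect(config)
    run()
`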