With what I shared above, I now get: docker: Error response from daemon: network 'host' not found.
Mmh, it looks like what I was looking for, I will give it a try 🙂
Hi TimelyPenguin76, I guess it tries to spin them down a second time, hence the double print
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
The parent task is a data_processing task, so I retrieve it in order to then do data_processed = parent_task.artifacts["data_processed"]
I am looking for a way to gracefully stop the task (clean up artifacts, shut down the backend service) on the agent
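To make it concrete, a minimal sketch of the kind of hook I have in mind, assuming the cleanup can be triggered from a SIGTERM handler inside the task script (remove_temporary_artifacts and stop_backend_service are hypothetical helpers for my setup):
```python
import signal
import sys

_cleaned_up = False

def cleanup():
    # run the cleanup only once, even if called from several places
    global _cleaned_up
    if _cleaned_up:
        return
    _cleaned_up = True
    remove_temporary_artifacts()   # hypothetical: delete the temporary artifacts
    stop_backend_service()         # hypothetical: kill the detached backend subprocess

def handle_sigterm(signum, frame):
    # clean up when the agent terminates the task, then exit
    cleanup()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```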
GrumpyPenguin23 yes, it is the latest
AgitatedDove14, what I was looking for was: parent_task = Task.get_task(task.parent)
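Putting the two pieces together, roughly what I do (a sketch; the .get() call on the artifact is how I understand the clearml/trains API, and "data_processed" is just my artifact name):
```python
from clearml import Task  # with the older package name: from trains import Task

task = Task.current_task()
# the parent is the data_processing task that produced the artifact
parent_task = Task.get_task(task.parent)
# download the artifact and load it locally
data_processed = parent_task.artifacts["data_processed"].get()
```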
(I use trains-agent 0.16.1 and trains 0.16.2)
I am now trying with agent.extra_docker_arguments: ["--network='host'", ]
instead of what I shared above
AgitatedDove14 I finally solved it: The problem was --network='host'
should be --network=host
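For reference, the working value ends up looking roughly like this in the agent section of the configuration file (a sketch, using the agent.extra_docker_arguments key mentioned above):
```
agent {
    # no quotes around host, otherwise docker reports: network 'host' not found
    extra_docker_arguments: ["--network=host"]
}
```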
I am doing:
```python
try:
    score = get_score_for_task(subtask)
except Exception:
    score = pd.NA
finally:
    # one row per subtask, with NA when the score could not be computed
    df_scores = df_scores.append(
        dict(task=subtask.id, score=score), ignore_index=True
    )
    task.upload_artifact("metric_summary", df_scores)
```
Also, maybe we are not on the same page: by clean up, I mean killing a detached subprocess on the machine executing the agent
So I changed ebs_device_name = "/dev/sda1", and now I correctly get the 100 GB EBS volume mounted on /. All good 🙂
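For anyone hitting the same thing, a sketch of the relevant resource entry in the autoscaler configuration (field names other than ebs_device_name are from memory of the aws_autoscaler example, so treat them as assumptions; the resource name and instance type are just examples):
```
resource_configurations {
    gpu_machine {
        instance_type = "g4dn.xlarge"      # example instance type
        ami_id = "ami-08e9a0e4210f38cb6"
        availability_zone = "eu-west-1a"
        ebs_device_name = "/dev/sda1"      # root device name for this AMI, so the size below applies to /
        ebs_volume_size = 100
        ebs_volume_type = "gp2"
    }
}
```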
Interesting! Something like that would be cool, yes! I just realized that custom plugins for Mattermost are written in Go; that could be a good hackday for me 🙂 to learn Go
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
Yes AgitatedDove14 🙂
AMI ami-08e9a0e4210f38cb6, region: eu-west-1a (the Deep Learning AMI from NVIDIA, Ubuntu 18.04)
so what worked for me was the following startup user-data script:
```bash
#!/bin/bash
# give cloud-init / unattended-upgrades time to finish before touching apt
sleep 120
# wait until no other process holds the dpkg/apt locks
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
# check the locks again before installing
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
```
there is no error from this side, I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the user-data script fails
Now it starts, I'll see if this solves the issue
Awesome, thanks WackyRabbit7, AgitatedDove14!
Ok thanks! And for this?
Would it be possible to support such a use case? (i.e. have the clearml-agent set up a different Python version when a task requires it?)
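The closest thing I have found so far is pointing the whole agent at a specific interpreter in its configuration, which is per-agent rather than per-task (key name from memory, so take it as an assumption):
```
agent {
    # interpreter used for the virtualenvs the agent creates
    python_binary: "/usr/bin/python3.8"
}
```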
As you can see, more hard waiting (the initial sleep), and then, before each apt action, making sure there is no lock