Here is the console with some errors
Yes, I set:
```
auth {
  cookies {
    httponly: true
    secure: true
    domain: ".clearml.xyz.com"
    max_age: 99999999999
  }
}
```
It always worked for me this way
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn't update the apiserver.auth.cookies.domain setting - I did it, restarted and now it works 🙂 Thanks!
with what I shared above, I now get: `docker: Error response from daemon: network 'host' not found.`
mmh it looks like what I was looking for, I will give it a try 🙂
Hi TimelyPenguin76 , I guess it tries to spin them down a second time, hence the double print
with the CLI, on a conda env located in /data
Ok, but that means this cleanup code should live somewhere else than inside the task itself right? Otherwise it won't be executed since the task will be killed
Hi AgitatedDove14 , I don't see any in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could overwrite it and add one?
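The idea of subclassing a handler to add a stop hook could look like the sketch below. `SimpleEarlyStopper` is a hypothetical stand-in for ignite's `EarlyStopping` (the real class takes a `score_function` and a trainer engine); the `on_stop` callback is the addition being discussed, not part of ignite's API.

```python
class SimpleEarlyStopper:
    """Hypothetical early-stopping sketch with a user-supplied stop hook."""

    def __init__(self, patience, on_stop=None):
        self.patience = patience      # rounds without improvement before stopping
        self.on_stop = on_stop        # hypothetical cleanup hook, not in ignite
        self.best = None
        self.counter = 0
        self.stopped = False

    def step(self, score):
        # higher score is better; reset the counter on any improvement
        if self.best is None or score > self.best:
            self.best = score
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.stopped = True
                if self.on_stop is not None:
                    self.on_stop()  # e.g. clean up artifacts, stop services
```

With ignite's real `EarlyStopping`, the equivalent would be overriding `__call__` in a subclass and invoking the hook just before the stop is triggered.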
The parent task is a data_processing task, therefore I retrieve it so that I can then `data_processed = parent_task.artifacts["data_processed"]`
I am looking for a way to gracefully stop the task (clean up artifacts, shutdown backend service) on the agent
GrumpyPenguin23 yes, it is the latest
AgitatedDove14 , what I was looking for was: `parent_task = Task.get_task(task.parent)`
(I use trains-agent 0.16.1 and trains 0.16.2)
I am now trying with `agent.extra_docker_arguments: ["--network='host'", ]`
instead of what I shared above
AgitatedDove14 I finally solved it: the problem was `--network='host'`, which should be `--network=host`
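For reference, a sketch of how the corrected flag could sit in clearml.conf (assuming the standard `agent` section layout; the surrounding file may differ):

```
agent {
  # pass extra arguments to `docker run`; note the unquoted value
  extra_docker_arguments: ["--network=host"]
}
```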
I am doing:
```
try:
    score = get_score_for_task(subtask)
except Exception:
    score = pd.NA
finally:
    df_scores = df_scores.append(
        dict(task=subtask.id, score=score), ignore_index=True
    )
task.upload_artifact("metric_summary", df_scores)
```
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
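Killing a detached subprocess when the task process exits could be sketched with the standard library alone. This is a minimal sketch, assuming the agent delivers SIGTERM on abort (the sleeping child below stands in for the backend service):

```python
import atexit
import signal
import subprocess
import sys

# stand-in for the detached backend service started by the task
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def cleanup():
    # terminate the child only if it is still running
    if proc.poll() is None:
        proc.terminate()
        proc.wait(timeout=10)

# run cleanup on normal interpreter exit
atexit.register(cleanup)

def handle_sigterm(signum, frame):
    # assumption: the agent sends SIGTERM when aborting the task
    cleanup()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```

The caveat from the message stands: if the agent kills the task with SIGKILL rather than SIGTERM, neither hook runs, so the cleanup would indeed need to live outside the task itself.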
So I changed `ebs_device_name = "/dev/sda1"`, and now I correctly get the 100GB EBS volume mounted on `/`. All good 🙂
Interesting! Something like that would be cool, yes! I just realized that custom plugins in Mattermost are written in Go - could be a good hackday for me 🙂 to learn Go
Ok, I got the following error when uploading the table as an artifact: `ValueError('Task object can only be updated if created or in_progress')`
Yes AgitatedDove14 🙂
Thanks for your input TenseOstrich47 , I was considering using a secret manager now, I guess that's the best option. I can move the secrets wherever I need them to be to make it work 🙂
AMI `ami-08e9a0e4210f38cb6`, region: eu-west-1a - the deep learning AMI from nvidia (Ubuntu 18.04)
so what worked for me was the following startup userscript:
```
#!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
```
there is no error from this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the userdata script fails