Ok, so what worked for me in the end was:
```python
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
```
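For reference, the full pattern looks roughly like this (a sketch - read_yaml is my own helper that loads the YAML file, the project/task names are placeholders, and _to_dict() is a private method on the proxy object returned by connect_configuration, so it may change between versions):
```python
from clearml import Task
from omegaconf import OmegaConf
import yaml

def read_yaml(path):
    # my helper: load the YAML file into a plain dict
    with open(path) as f:
        return yaml.safe_load(f)

task = Task.init(project_name="my-project", task_name="my-task")  # placeholder names

# connect_configuration logs the config in the UI and, when running remotely,
# returns a proxy reflecting any values overridden from the UI
config = task.connect_configuration(read_yaml("conf.yaml"))

# rebuild an OmegaConf object from the connected configuration
# (_to_dict() is private API on the returned proxy object)
cfg = OmegaConf.create(config._to_dict())
```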
The clean up service is awesome, but it would require having another agent running in services mode on the same machine, which I would rather avoid
Also maybe we are not on the same page - by clean up, I mean killing a detached subprocess on the machine executing the agent
SuccessfulKoala55 I want to avoid writing creds in plain text in the config file
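i.e. I'd rather pass the credentials through the environment, something like this sketch (the values are placeholders - in practice they would be injected by a secret manager, not hardcoded):
```python
import os

# placeholders - in practice these would come from a secret store
os.environ["CLEARML_API_HOST"] = "https://api.my-server.com"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access-key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret-key>"

# only import/initialize after the environment is set up
from clearml import Task

task = Task.init(project_name="my-project", task_name="my-task")
```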
Something was triggered - you can see the CPU usage spiking right when the instance became unresponsive. Maybe a merge operation from ES?
my agents are all on 0.16 and I install trains 0.16rc2 in each Task being executed by the agent
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server (such as Task.init) are called?
How exactly is the clearml-agent killing the task?
Ok AgitatedDove14 SuccessfulKoala55 I made some progress in my investigation:
I can pinpoint exactly the change that introduced the bug: it is the one changing the endpoint "events.get_task_log" to min_version="2.9"
In the Firefox console > Network tab, I can edit an events.get_task_log request and change the URL from …/api/v2.9/events.get_task_log to …/api/v2.8/events.get_task_log (to use the endpoint "events.get_task_log" with min_version="1.7"), and then all the logs are ...
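In other words, replaying the same request by hand against the older endpoint version - roughly like this sketch (server URL, auth token, and task id are all placeholders):
```python
import requests

# placeholders: server URL, auth token, and task id
resp = requests.post(
    "https://my-server/api/v2.8/events.get_task_log",
    json={"task": "<task-id>"},
    headers={"Authorization": "Bearer <token>"},
)
print(resp.json())
```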
Hi SuccessfulKoala55, super - that’s what I was looking for
SuccessfulKoala55 I found the issue thanks to you: I had changed the domain a bit but didn’t update the apiserver.auth.cookies.domain setting - I did that, restarted, and now it works 🙂 Thanks!
The task is created using Task.clone(), yes
AgitatedDove14, my “uncommitted changes” ends with...
```python
if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()

from clearml import Task

Task.init()
```
Because it cannot locate libcudart, or because cudnn_version = 0?
I came up with the same code, thanks for the fast answer (yes having a setter for that would be cool!)
no, one worker (trains-agent-1) "forgets" the current experiment it is running from time to time and picks up another experiment on top of the one it is currently running
So it seems like it doesn't copy /root/clearml.conf and it doesn't pass the environment variables (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY)
I am using an old version of the AWS autoscaler, so the instance has the following user data executed:
```
echo "{clearml_conf}" >> /root/clearml.conf
...
python -m clearml_agent --config-file '/root/clearml.conf' daemon --detached --queue '{queue}' --docker --cpu-only
```
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
The task requires this service, so the task starts it on the machine - then I want to make sure the service is closed by the task upon completion, failure, or abortion
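What I have in mind inside the task is roughly this sketch (my-service and the timeout are placeholders, and it assumes the agent stops the task with SIGTERM - it won't help if the process is hard-killed with SIGKILL, which is why I'm asking how exactly the agent kills the task):
```python
import atexit
import signal
import subprocess
import sys

# start the helper service the task depends on (placeholder command)
service = subprocess.Popen(["my-service", "--port", "8080"])

def shutdown_service():
    if service.poll() is None:  # still running
        service.terminate()
        try:
            service.wait(timeout=10)
        except subprocess.TimeoutExpired:
            service.kill()

# normal completion or unhandled exception
atexit.register(shutdown_service)

# graceful termination (e.g. task aborted with SIGTERM)
def on_sigterm(signum, frame):
    shutdown_service()
    sys.exit(1)

signal.signal(signal.SIGTERM, on_sigterm)
```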
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125