might be worth documenting 😄
So it seems like it doesn't copy /root/clearml.conf and it doesn't pass the environment variables (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY)
I am using an old version of the AWS autoscaler, so the instance has the following user data executed:
echo "{clearml_conf}" >> /root/clearml.conf
...
python -m clearml_agent --config-file '/root/clearml.conf' daemon --detached --queue '{queue}' --docker --cpu-only
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
The task requires this service, so the task starts it on the machine. Then I want to make sure the service is closed by the task upon completion/failure/abort
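For illustration, a minimal sketch of one way a task could do that, assuming the service is launched as a detached subprocess (the command and arguments are placeholders):

```python
import atexit
import subprocess

# Start the helper service the task depends on (hypothetical command).
service = subprocess.Popen(["my_service", "--port", "8080"])

def _stop_service():
    # Terminate the service when the task's Python process exits.
    if service.poll() is None:
        service.terminate()
        try:
            service.wait(timeout=10)
        except subprocess.TimeoutExpired:
            service.kill()

atexit.register(_stop_service)
```

Note this only covers the case where the task process gets to exit on its own; if the process is killed outright, the cleanup would indeed have to live somewhere else, as discussed above.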
Ah, I see, it is not supported by the autoscaler: https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
I ended up dropping omegaconf altogether
as for disk space: I have 21 GB available (8 GB used), the /opt/trains/data folder is about 600 MB
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
btw SuccessfulKoala55 the parameter is not documented in https://allegro.ai/clearml/docs/docs/references/clearml_ref.html#sdk-development-worker
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run
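For illustration, a rough sketch of such a controller, assuming the project, task, and queue names are placeholders:

```python
import time
from clearml import Task

# Controller task, meant to be executed by an agent listening on the services queue.
controller = Task.init(
    project_name="examples",
    task_name="periodic scheduler",
    task_type=Task.TaskTypes.controller,
)

# Template task to re-run periodically.
template = Task.get_task(project_name="examples", task_name="my periodic job")

while True:
    # Clone the template and enqueue the clone for an agent to execute.
    cloned = Task.clone(source_task=template, name="my periodic job (scheduled)")
    Task.enqueue(cloned, queue_name="default")
    time.sleep(60 * 60)  # schedule roughly once an hour
```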
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
yes, exactly: I run python my_script.py, the script executes, creates the task, calls task.execute_remotely(exit_process=True) and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml
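For reference, a minimal sketch of that flow (the queue name is a placeholder):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="my_script")

# Stop executing locally, enqueue the task for an agent, and exit the local process,
# returning control to the bash console.
task.execute_remotely(queue_name="default", exit_process=True)

# Anything below this line only runs when the agent executes the task.
print("running on the agent")
```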
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
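Assuming it works like the other sdk settings, it would go in clearml.conf roughly like this (the value is illustrative):

```
sdk {
    development {
        worker {
            # flush carriage-return (\r) console output every N seconds
            console_cr_flush_period: 10
        }
    }
}
```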
The task I cloned from is not the one I thought
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
, causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down, or that the server infers the agent is down after a timeout?
The cleanup service is awesome, but it would require having another agent running in services mode on the same machine, which I would rather avoid
Also, maybe we are not on the same page: by cleanup, I mean killing a detached subprocess on the machine executing the agent
SuccessfulKoala55 I want to avoid writing creds in plain text in the config file
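One option along those lines might be to pass the credentials through the environment variables mentioned above instead of writing them into clearml.conf (values are placeholders):

```bash
export CLEARML_API_HOST="<api-server-url>"
export CLEARML_API_ACCESS_KEY="<access-key>"
export CLEARML_API_SECRET_KEY="<secret-key>"
```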
Something was triggered: you can see the CPU usage rising right when the instance became unresponsive - maybe a merge operation from ES?
my agents are all 0.16 and I install trains 0.16rc2 in each Task being executed by the agent
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server (such as Task.init) are called?