Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
The task requires this service, so the task starts it on the machine. Then I want to make sure the service is closed by the task upon completion/failure/abort
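For illustration, a minimal sketch of that pattern; the service command and task body are placeholders, and the try/finally covers completion and failure but not a hard kill of the process:
import subprocess

# Placeholder for whatever service the task needs
SERVICE_CMD = ["my_service", "--port", "8080"]

def run_task_body():
    ...  # the actual work of the task

service = subprocess.Popen(SERVICE_CMD)
try:
    run_task_body()
finally:
    # Runs on normal completion and on exceptions (failure),
    # but not if the process is hard-killed (e.g. an abort sending SIGKILL)
    service.terminate()
    service.wait(timeout=30)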
Ah, I see, it is not supported by the autoscaler: https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
I ended up dropping omegaconf altogether
As for disk space: I have 21GB available (8GB used), and the /opt/trains/data folder is about 600MB
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
btw SuccessfulKoala55 the parameter is not documented in https://allegro.ai/clearml/docs/docs/references/clearml_ref.html#sdk-development-worker
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run
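As a minimal sketch of that idea (the template task ID, project/queue names, and interval are placeholders):
from time import sleep
from clearml import Task

TEMPLATE_TASK_ID = "<task-id-to-schedule>"  # placeholder
TARGET_QUEUE = "default"                    # placeholder
INTERVAL_SEC = 60 * 60                      # e.g. once an hour

# The controller itself is meant to run on the "services" queue
Task.init(project_name="controllers", task_name="periodic scheduler")

while True:
    cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name="scheduled run")
    Task.enqueue(cloned, queue_name=TARGET_QUEUE)
    sleep(INTERVAL_SEC)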
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
Yes, exactly: I run python my_script.py, the script executes, creates the task, calls task.execute_remotely(exit_process=True), and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml
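For reference, a minimal sketch of that flow (project and queue names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="my_script")
# Stops local execution here, enqueues the task on the given queue,
# and exits the local process; an agent then runs the rest remotely
task.execute_remotely(queue_name="default", exit_process=True)

# Everything below only runs on the agent
...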
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
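For reference, a sketch of where that setting would sit in clearml.conf (the value shown is only illustrative):
sdk {
    development {
        worker {
            # flush period (seconds) for carriage-return (progress bar) console lines
            console_cr_flush_period: 10
        }
    }
}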
The task I cloned from is not the one I thought
, causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down? Or does the server infer that the agent is down after a timeout?
SuccessfulKoala55 They do have the right file path, e.g.: https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
For some reason the configuration object gets updated at runtime to:
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
I will let the team answer you on that one 🙂
ProxyDictPostWrite._to_dict() will recursively convert it to a plain dict, and then OmegaConf will not complain
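A minimal sketch of that workaround; note that _to_dict() is an internal helper of the proxy object returned by task.connect(), and the parameter names here are just examples:
from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="examples", task_name="omegaconf demo")

# task.connect() returns a ProxyDictPostWrite, not a plain dict
params = task.connect({"lr": 0.001, "batch_size": 32})

# Converting back to a plain dict first keeps OmegaConf happy
cfg = OmegaConf.create(params._to_dict())
print(cfg.lr)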
I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure, and it ends up accumulating. I don't know where yet
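One way to hunt for the lingering reference is to ask the garbage collector which figures are still reachable and who is holding them, e.g.:
import gc
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

plt.close("all")
gc.collect()

# Any Figure still reachable here is being kept alive by something else
for obj in gc.get_objects():
    if isinstance(obj, Figure):
        print("live figure:", obj)
        print("held by:", gc.get_referrers(obj))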
/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
That gave me
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3
Building Task 94jfk2479851047c18f1fa60c1364b871 inside docker: ubuntu:18.04
Starting docker build
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
Probably 6. I think that for some reason it did not go back to the main trains-agent. Nevertheless, I am not sure, because a second task could start. It could also be that the second task was aborted for some reason while installing the task requirements (not the system requirements, i.e. while executing the trains-agent setup within the docker container), and therefore again it couldn't go back to the main trains-agent. But ps -aux shows that the trains-agent is stuck running the first experiment, not the second...
So most likely trains was masking the original error; it might be worth investigating to help other users in the future
Thanks! I would like to use this opportunity to split the indices into multiple shards, as explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#indices-split-index
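A rough sketch of that split call (the endpoint, index names, and shard count are assumptions; the source index has to be made read-only first):
import requests

ES = "http://localhost:9200"                   # assumed Elasticsearch endpoint
SOURCE, TARGET = "events-old", "events-split"  # hypothetical index names

# The source index must block writes before it can be split
requests.put(f"{ES}/{SOURCE}/_settings",
             json={"settings": {"index.blocks.write": True}}).raise_for_status()

# Split into more primary shards (must be a multiple of the source's shard count)
requests.post(f"{ES}/{SOURCE}/_split/{TARGET}",
              json={"settings": {"index.number_of_shards": 2}}).raise_for_status()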
AgitatedDove14 So I’ll just replace task = clearml.Task.get_task(clearml.config.get_remote_task_id()) with Task.init() and wait for your fix 🙂
resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
    }
}
queues {
    aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
    agent.package_manager.system_site_packages = true
    agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
    sudo apt-get install -y libsm6 libxext6 libx...