
I execute the clearml-agent this way:
```
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
```
Sorry both of you, my problem was actually lying somewhere else (both buckets are in the same region) - thanks for your time!
with my hack yes, without, no
The cleanup service is awesome, but it would require having another agent running in services mode on the same machine, which I would rather avoid
Indeed, I actually had the old configuration that was not JSON - I converted it to JSON, now it works 🙂
Not really, because this is difficult to control: I use the AWS autoscaler with an Ubuntu AMI, and when an instance is created, packages are updated and I don't know which Python version I will get. Plus, changing the Python version of the OS is not really recommended
Never mind, the nvidia-smi command fails in that instance; the problem lies somewhere else
AgitatedDove14 If I explicitly call `task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0)`, this logs one value per process as expected, so reporting works
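For context, a minimal sketch of that per-process call, assuming a DDP-style script where `--local_rank` is passed by the launcher (the argument parsing and project/task names below are illustrative, not the original script):
```python
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parse_args = parser.parse_args()

# Get/create the task in every process (in spawned worker processes this
# typically resolves to the main process's task)
task = Task.init(project_name="debug", task_name="per-process reporting")

# One scalar series per process: the series name is the local rank,
# so each process shows up as its own curve under the "test" title
task.get_logger().report_scalar(
    title="test",
    series=str(parse_args.local_rank),
    value=1.0,
    iteration=0,
)
```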
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified `number_of_shards: 4` for the `events-log-d1bd92a3b039400cbafc60a7a5b1e52b` index. I then copied the documents from the old ES using the `_reindex` API. The index is 7.5GB on one shard.
Now I see that this index on the new ES cluster is ~19.4GB 🤔 The index is divided into the 4 shards, but each shard is between 4.7GB and 5GB!
I was expecting to have the same index size as in the previous e...
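For reference, roughly what that recreate-and-reindex sequence looks like against the REST API; the index name and new-cluster host are from this thread, while the old-cluster host and the use of the Python `requests` client are just illustrative:
```python
import requests

NEW_ES = "http://internal-aws-host-name:9200"
OLD_ES = "http://old-es-host:9200"  # placeholder for the old cluster
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Create the destination index with 4 primary shards and no replicas
requests.put(
    f"{NEW_ES}/{INDEX}",
    json={"settings": {"number_of_shards": 4, "number_of_replicas": 0}},
).raise_for_status()

# Copy the documents from the old cluster with the remote reindex API
resp = requests.post(
    f"{NEW_ES}/_reindex?wait_for_completion=false",
    json={
        "source": {"remote": {"host": OLD_ES}, "index": INDEX},
        "dest": {"index": INDEX},
    },
)
print(resp.json())  # returns a task id that can be polled for progress
```
Note that remote reindex also requires the old host to be whitelisted in `reindex.remote.whitelist` on the new cluster.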
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats 🎉 🎉 Was this bug fixed in this new version?
the api-server shows when starting:
```
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic host
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic port 9200
...
clearml-apiserver | [2021-07-13 11:09:38,407] [9] [WARNING] [clearml.initialize] Could not connect to ElasticSearch Service. Retry 1 of 4. Waiting for 30sec
clearml-apiserver | [2021-07-13 11:10:08,414] [9] [WARNING] [clearml.initia...
```
and with this setup I can use the GPU without any problem, meaning that the wheel does contain the CUDA runtime
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
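If the ClearMLLogger in question is pytorch-ignite's, a minimal sketch of the rank-0-only variant could look like this (the trainer engine, project and task names are placeholders, and this is just one possible pattern, not a recommendation from the thread):
```python
import torch.distributed as dist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger, OutputHandler
from ignite.engine import Events

def attach_clearml_logger(trainer):
    # Only the main process (rank 0) creates the ClearML task and attaches
    # the logger; the other ranks skip reporting entirely.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return None

    clearml_logger = ClearMLLogger(project_name="my_project", task_name="ddp_run")
    clearml_logger.attach(
        trainer,
        log_handler=OutputHandler(
            tag="training",
            output_transform=lambda loss: {"loss": loss},
        ),
        event_name=Events.ITERATION_COMPLETED,
    )
    return clearml_logger
```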
The host is accessible, I can ping it and even run `curl "http://internal-aws-host-name:9200/_cat/shards"` and get results from the local machine
The number of documents in the old and the new env is the same though 🤔 I really don't understand where this extra used space comes from
I made sure before deleting the old index that the number of docs matched
I will let the team answer you on that one 🙂
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version is used by the agent to decide which torch wheel to download
Regarding the CUDA check with `nvcc`, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the `nvidia-smi` interface, worth checking though ...
Ok, but when nvcc is not ava...
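For the record, a rough sketch of the two checks being compared here, parsing the CUDA version from `nvidia-smi` output versus from `nvcc --version`; the parsing below is illustrative and not ClearML's actual detection code (note the two can legitimately differ, since nvidia-smi reports the driver's supported CUDA version while nvcc reports the installed toolkit):
```python
import re
import subprocess
from typing import Optional

def cuda_version_from_nvidia_smi() -> Optional[str]:
    # nvidia-smi prints a header line such as "CUDA Version: 11.4"
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"CUDA Version:\s*([\d.]+)", out)
    return match.group(1) if match else None

def cuda_version_from_nvcc() -> Optional[str]:
    # nvcc prints e.g. "Cuda compilation tools, release 11.1, V11.1.105"
    try:
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"release\s*([\d.]+)", out)
    return match.group(1) if match else None

print(cuda_version_from_nvidia_smi(), cuda_version_from_nvcc())
```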
But I can do:
```
$ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.backends.cudnn.version()
8005
```
the first problem I had, which didn't give useful info, was that docker was not installed on the agent machine x)
ha nice, where can I find the mapping template of the original clearml so that I can copy and adapt it?
You mean it will resolve by itself in the following days or should I do something? Or there is nothing to do and it will stay this way?
I did change the replica setting on the same index, yes; I reverted it back from 1 to 0 afterwards
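That toggle is just the standard index-settings call; a sketch, with the host and index name taken from this thread and the Python `requests` client used purely for illustration:
```python
import requests

ES = "http://internal-aws-host-name:9200"
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Temporarily enable one replica per shard ...
requests.put(f"{ES}/{INDEX}/_settings", json={"index": {"number_of_replicas": 1}})

# ... and revert back to zero replicas afterwards
requests.put(f"{ES}/{INDEX}/_settings", json={"index": {"number_of_replicas": 0}})
```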
That said, you might have accessed the artifacts before any of them were registered
I called `task.wait_for_status()` to make sure the task is done
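A minimal sketch of that wait-then-read pattern (the task ID and artifact name are placeholders):
```python
from clearml import Task

task = Task.get_task(task_id="<producer-task-id>")

# Block until the producing task has reached a final state, so that all
# artifacts are registered before we try to read them
task.wait_for_status(status=(Task.TaskStatusEnum.completed,))
task.reload()  # refresh the cached task data

artifact = task.artifacts["my_artifact"].get()
print(artifact)
```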
I also did run `sudo apt install nvidia-cuda-toolkit`