It could be that the clearml-server misbehaves while the cleanup is ongoing, or even after it has finished.
Okay, it seems like the deletion just takes some time to complete and to be reflected in the WebUI. So when I try to delete again, a deletion process is apparently already running in the background.
I created a GitHub issue because the problem with the slow deletion still exists: https://github.com/allegroai/clearml/issues/586#issue-1142916619
SuccessfulKoala55 So what happens is that whenever the cleanup_service runs (and afterwards), clearml throws this kind of error:
```
[root@dc01deffca35 elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_nu...
```
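For reference, Elasticsearch can be asked directly why shards are unassigned (with a single node, the 10 unassigned shards are most likely just replicas that have nowhere to go). A minimal sketch, assuming the clearml-server's Elasticsearch is reachable on localhost:9200:

```python
import requests

# Assumption: the clearml-server Elasticsearch instance listens on localhost:9200
ES = "http://localhost:9200"

# Explain why the first unassigned shard cannot be allocated
print(requests.get(f"{ES}/_cluster/allocation/explain?pretty").text)

# List all shards with their state and, if unassigned, the reason
print(requests.get(f"{ES}/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason").text)
```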
I will try again tomorrow. It's getting late! Thank you for helping so far!
And how do I specify this in the output_uri? The default file server is specified by passing True. How would I specify to use the second one?
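If I understand the Task.init API correctly, output_uri also accepts an explicit URL, so a second file server could be targeted roughly like this (the server address below is made up):

```python
from clearml import Task

# output_uri=True would use the default file server from clearml.conf;
# an explicit URL (hypothetical address below) points at a different one.
task = Task.init(
    project_name="examples",
    task_name="use-second-fileserver",
    output_uri="http://second-fileserver.example.com:8081",
)
```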
One more thing: The cuda_version that clearml finds automatically is wrong.
I just tried the environment setup steps that clearml-agent does locally, but with my environment.yml instead of the one that clearml generates.
By host you mean the machine on which the agent is running? How does clearml-agent find the cuda_version?
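For reference, a quick way to see which CUDA versions the host itself reports (driver side via nvidia-smi, toolkit side via nvcc); just a sketch that assumes both tools are on the PATH:

```python
import subprocess

# Driver-side CUDA version, shown in the nvidia-smi header
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# Toolkit-side CUDA version, if a CUDA toolkit is installed on the host
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```

If I remember correctly, the auto-detected value can also be overridden (e.g. agent.cuda_version in clearml.conf or a CUDA_VERSION environment variable), but that is worth double-checking against the clearml-agent docs.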
AnxiousSeal95 Thanks a lot. Seems to be working fine for me. I see the clearml-agent version that pip installs in the docker is now fixed to the host version 🙂 PyTorch Nightly is also installed correctly now!
Well, I guess the trade-off between no hurdles and safety is inherently not solvable. I am all for hurdles, as long as it is clear how to overcome them. And in my opinion, referring to clearml-init
is something that makes sense from both a developer and a user perspective.
Yeah, and the script ends with `clearml.Task - INFO - Waiting to finish uploads`
So my network seems to be fine. Downloading artifacts from the server to the agents is around 100 MB/s, while uploading from the agent to the server is slow.
An upload of 11 GB took around 20 hours, which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not, I am going to start debugging the hardware/network.
But it does not seem to be related to network speed, rather to clearml: a simple file transfer test gives me approximately 1 Gbit/s between the server and the agent, which is what I would expect from the 1 Gbit/s network.
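To narrow down whether the slowdown happens on the clearml upload path, one can time a single artifact upload directly; a minimal sketch, assuming a throwaway project and a large local test file (names and the file server address are made up):

```python
import time
from clearml import Task

# Hypothetical project/task names and file server URL, only for the timing test
task = Task.init(
    project_name="debug",
    task_name="upload-speed-test",
    output_uri="http://clearml-fileserver.example.com:8081",
)

start = time.time()
task.upload_artifact("big_file", artifact_object="/path/to/large_test_file.bin")
task.flush(wait_for_uploads=True)  # block until the upload has actually finished
print(f"upload took {time.time() - start:.1f} s")
```

Comparing that number against a plain scp/rsync of the same file to the server should show whether the bottleneck is the SDK/fileserver path or the link itself.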