SuccessfulKoala55 So what happens is that whenever the cleanup_service runs, clearml throws these kinds of errors afterwards:
```
[root@dc01deffca35 elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_nu...
```
I restarted it after I got the errors, because as everyone knows, turning it off and on usually works 😄
I will try again tomorrow. It's getting late! Thank you for helping so far!
Mhhm, then maybe it is not clear 😂 to me how clearml.Task is meant to be used. I thought of it as a container for all the information regarding a single experiment, reflected on the server side and thereby in the WebUI. Now I init() a Task and it shows up in the WebUI. I thought that after initialization I could still update the task to my liking, i.e. it serves as documentation of my experiment.
Then I could also do this:

```python
# My custom very special use case
task = Task()
task = task.load_statedict(await Task.load_or_create(task_name))
await task.synchronize()
await run_code_analysis()
task.add_requirement("myreq")
await task.synchronize()
```
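For comparison, a rough sketch of how a similar flow looks with the existing synchronous Task API; the project name, parameter names, and artifact contents here are made up for illustration:

```python
from clearml import Task

# Requirements are registered before Task.init() so they end up
# in the task's recorded package list.
Task.add_requirements("myreq")

task = Task.init(project_name="examples", task_name="my_experiment")

# The task object can still be updated after initialization;
# changes are reflected on the server and in the WebUI.
task.set_parameters({"General/learning_rate": 0.001})
task.upload_artifact("analysis_result", artifact_object={"files_checked": 42})
```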
And how do I specify this in the output_uri? The default file server is specified by passing True. How would I specify the second one?
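For reference, output_uri also accepts an explicit URL instead of True, so, assuming the second fileserver from the compose snippet below is reachable on port 8082 (the host name here is a placeholder), something like this should point a task at it:

```python
from clearml import Task

# Hypothetical host name; replace with the address of the machine running
# the second fileserver (exposed on port 8082 in the compose snippet below).
task = Task.init(
    project_name="examples",
    task_name="uses_second_fileserver",
    output_uri="http://my-clearml-host:8082",
)
```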
Any idea why deletion of artifacts on my second fileserver does not work?
```yaml
fileserver_datasets:
  networks:
    - backend
    - frontend
  command:
    - fileserver
  container_name: clearml-fileserver-datasets
  image: allegroai/clearml:latest
  restart: unless-stopped
  volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver-datasets:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
  ports:
    - "8082:8081"
```
ClearML successfu...
One more thing: The cuda_version that clearml finds automatically is wrong.
I will debug this myself a little more.
Locally it works fine.
Let me check again.
I just tried the environment setup steps that clearml-agent does, locally, but with my environment.yml instead of the one that clearml generates.
By host you mean the machine on which the agent is running? How does clearml-agent find the cuda_version?
AnxiousSeal95 Thanks a lot. Seems to be working fine for me. I see the clearml-agent version that pip installs in the docker is now fixed to the host version 🙂 PyTorch Nightly is also installed correctly now!
Hey Martin, thank you for answering!
I see your point, however in my opinion this is really unexpected behavior. Sure, I can do some work to make it "safe", but shouldn't that be the default? So throw an error without a clearml.conf and expect CLEARML_USE_DEFAULT_SERVER=1.
Well, I guess "no hurdles" vs. safety is inherently not solvable. I am all for hurdles if it is clear how to overcome them, and in my opinion referring to clearml-init is something that makes sense from both a developer and a user perspective.
Wouldn't it be enough to just require a call to clearml-init and throw an error when running without a clearml.conf, telling the user to run clearml-init first?
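A minimal sketch of the behavior proposed here, purely illustrative and not how clearml currently works; the config path and the env-var check are assumptions reusing the names mentioned above:

```python
import os
from pathlib import Path

# Illustrative guard only: refuse to start without a clearml.conf unless the
# user explicitly opts into the default server. Assumes the config lives at
# ~/clearml.conf and reuses the CLEARML_USE_DEFAULT_SERVER flag from above.
conf_path = Path.home() / "clearml.conf"
if not conf_path.exists() and os.environ.get("CLEARML_USE_DEFAULT_SERVER") != "1":
    raise RuntimeError(
        "No clearml.conf found. Run `clearml-init` first, "
        "or set CLEARML_USE_DEFAULT_SERVER=1 to use the default server."
    )
```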