Hi, I'M Getting This Long Error When Running

Answered

Hi, i'm getting this long error when running task.execute_remotely(queue_name="1gpu", exit_process=True) . I also notices an error Failed to fetching activity worker statistics when i clicked on the worker in the ClearML UI. I do see a task created in the UI but it would be 'aborted'.

python runremotetrain.py ClearML Task: created new task id=78f6fe7a591947539cb7a4fb2b6a3b91 ClearML results page: Traceback (most recent call last): File "runremotetrain.py", line 4, in <module> task.execute_remotely(queue_name="1gpu", exit_process=True) File "/mnt/DATA/projects/clearml/clearml-usage-examples/detectron2/codes/venv/lib/python3.7/site-packages/clearml/task.py", line 1942, in execute_remotely Task.enqueue(task, queue_name=queue_name) File "/mnt/DATA/projects/clearml/clearml-usage-examples/detectron2/codes/venv/lib/python3.7/site-packages/clearml/task.py", line 989, in enqueue res = cls._send(session=session, req=req) File "/mnt/DATA/projects/clearml/clearml-usage-examples/detectron2/codes/venv/lib/python3.7/site-packages/clearml/backend_interface/base.py", line 89, in _send raise SendError(res, error_msg) clearml.backend_interface.session.SendError: Action failed <500/100: tasks.enqueue/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05', '_type': '_doc', '_id': '6Bvxk3kBGPc6t8mGAJ9R', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05][_doc][6Bvxk3kBGPc6t8mGAJ9R], source[{"timestamp":1621684715598,"queue":"f03539fca75f461ab3e6297186bdb045","average_waiting_time":0,"queue_length":0}]}]])> (queue=f03539fca75f461ab3e6297186bdb045, task=78f6fe7a591947539cb7a4fb2b6a3b91)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

Votes Newest

Answers 30

Wait

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Well, Elastic is used to store and index all experiment and worker metrics

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

yes, previously run experiments. I will just kill clearml-elastic container if that may solve the problem.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

Hi SubstantialElk6 , another thing that can be checked is the health of the particular ES indices. Can you please run the below command in the clearml-elastic container and post the results here?
curl -XGET

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AppetizingMouse58
				
					0

Alright thanks, i will work on that.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

docker exec clearml-elastic curl

zsh: no matches found:

No no, do:
docker exec -it clearml-elastic /bin/bashAnd than from the bash inside the container, do:
curl

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

What about
curl grep UNASSIGNED

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Can you do:
curlfrom inside the elastic container?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

I'm starting to suspect Docker Desktop 🙂

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

If you're running your own server - no, there is no limitation

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

docker exec clearml-elastic curl zsh: no matches found:

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

(trying to get as much info as possible first)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Well, assuming your server data is properly mounted outside of the dockers, restarting the server will be just fine. Can you verify the externally mapped folders actually contain data? (in /Users/jax/clearml/data/mongo , /Users/jax/clearml/data/elastic_7 etc...)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Can you try:
curl | grep UNASSIGNED

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

BTW - spotted some issues with your docker-compose, namely:
` apiserver:
...
volumes:
...
- /Users/jax/clearml/config:/Users/jax/clearml/config

...
environment:
  ...
  CLEARML__apiserver__pre_populate__zip_files: "/Users/jax/clearml/db-pre-populate" `Specifically, you can't have internal docker-image mounted directories to be identical to your external host directories - the server looks for the configuration (and other stuff) in fixed internal directories - your changes, among others,  effectively make the server ignore the additional configuration files you might add (this is repeated in the  ` fileserver `  section as well)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

It seems both the queue_metrics and worker_stats indices are in red status

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

(which is very rare - we're running lots of instances of the clearml server for a very long time and I've never encountered this issue 😞 )

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

As for backup - this is usually done by backing up the mounted clearml/data/ folder, but it is usually done when the server is down (otherwise the data maybe not be backed-up correctly)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

[root@2c7498711bef elasticsearch]# curl { "cluster_name" : "clearml", "status" : "red", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 4, "active_shards" : 4, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 8, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 33.33333333333333 }

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

[root@2c7498711bef elasticsearch]# curl { "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b", "shard" : 0, "primary" : false, "current_state" : "unassigned", "unassigned_info" : { "reason" : "CLUSTER_RECOVERED", "at" : "2021-05-22T11:33:38.932Z", "last_allocation_status" : "no_attempt" }, "can_allocate" : "no", "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions" : [ { "node_id" : "yHcROdoNRWGfm0Ry062NGQ", "node_name" : "clearml", "transport_address" : "172.18.0.3:9300", "node_attributes" : { "ml.machine_memory" : "8349540352", "xpack.installed" : "true", "ml.max_open_jobs" : "20" }, "node_decision" : "no", "deciders" : [ { "decider" : "same_shard", "decision" : "NO", "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0], node[yHcROdoNRWGfm0Ry062NGQ], [P], s[STARTED], a[id=J-nHYEXBSymWrgpo7J2bLw]]" } ] } ] }

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

and yes, there are stuff in there. In fact its been running for a few weeks with no issue. This appears to have happened after i added new workers, though i can't be sure this is the cause. Is there a limit to the number of workers that i can add for community edition?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

Ok, i guess i will have to kill the whole thing and refresh it.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

SubstantialElk6 Both indices that are red are not critical for the ClearML functioning and can be deleted like this:
curl -XDELETE ' ' curl -XDELETE ' 'For the analysis of the possible reasons that lead to it can you please collect the full ES logs to the file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AppetizingMouse58
				
					0

sorry,

curl

Did you try it again?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Can i somehow perform an export or backup?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

I'm not familiar with elastic. What role does elastic play in ClearML?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

Also, I doubt the reason was that you added workers, more likely something happened in the Elastic - be it some disk issue or something else

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Thanks that did solve the problem, the tasks are running again.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

[root@2c7498711bef elasticsearch]# curl -XGET yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1 red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1 yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b CP9YJDjMQLOJ3KNJ9ydobA 1 1 16 0 24.7kb 24.7kb yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b FCwT4QMkR1G1aaqxvEzEnQ 1 1 3 0 12.7kb 12.7kb

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SubstantialElk6
				
					0
					 × 1

Write your answer

2K Views

30 Answers

4 years ago

2 years ago