Answered
Hi, I'm Getting This Long Error When Running

Hi, I'm getting this long error when running task.execute_remotely(queue_name="1gpu", exit_process=True). I also noticed an error, Failed to fetching activity worker statistics, when I clicked on the worker in the ClearML UI. I do see a task created in the UI, but it shows as 'aborted'.

python runremotetrain.py
ClearML Task: created new task id=78f6fe7a591947539cb7a4fb2b6a3b91
ClearML results page:
Traceback (most recent call last):
  File "runremotetrain.py", line 4, in <module>
    task.execute_remotely(queue_name="1gpu", exit_process=True)
  File "/mnt/DATA/projects/clearml/clearml-usage-examples/detectron2/codes/venv/lib/python3.7/site-packages/clearml/task.py", line 1942, in execute_remotely
    Task.enqueue(task, queue_name=queue_name)
  File "/mnt/DATA/projects/clearml/clearml-usage-examples/detectron2/codes/venv/lib/python3.7/site-packages/clearml/task.py", line 989, in enqueue
    res = cls._send(session=session, req=req)
  File "/mnt/DATA/projects/clearml/clearml-usage-examples/detectron2/codes/venv/lib/python3.7/site-packages/clearml/backend_interface/base.py", line 89, in _send
    raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <500/100: tasks.enqueue/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05', '_type': '_doc', '_id': '6Bvxk3kBGPc6t8mGAJ9R', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05][_doc][6Bvxk3kBGPc6t8mGAJ9R], source[{"timestamp":1621684715598,"queue":"f03539fca75f461ab3e6297186bdb045","average_waiting_time":0,"queue_length":0}]}]])> (queue=f03539fca75f461ab3e6297186bdb045, task=78f6fe7a591947539cb7a4fb2b6a3b91)

  
  
Posted 3 years ago

Answers 30


As for backup - this is usually done by backing up the mounted clearml/data/ folder, and it is usually done while the server is down (otherwise the data may not be backed up correctly)
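A minimal sketch of the folder backup described above, assuming the server has already been stopped. The function name `backup_data_dir`, the archive naming scheme, and the example paths are illustrative assumptions, not part of ClearML.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir, dest_dir):
    """Archive data_dir into a timestamped .tar.gz under dest_dir and return its path.

    Assumption: the ClearML server is stopped before this runs, so the
    files on disk are not changing while they are being archived.
    """
    data_dir = Path(data_dir)
    dest_dir = Path(dest_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data folder not found: {data_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"clearml-data-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the folder name rather than the absolute host path.
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```

Usage would look like `backup_data_dir("/Users/jax/clearml/data", "/Users/jax/clearml-backups")`, where the destination folder is a hypothetical location outside the data folder.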

  
  
Posted 3 years ago

Can you do:
curl
from inside the elastic container?

  
  
Posted 3 years ago

Alright, thanks, I will work on that.

  
  
Posted 3 years ago

Can I somehow perform an export or backup?

  
  
Posted 3 years ago

?

  
  
Posted 3 years ago

Ok, I guess I will have to kill the whole thing and refresh it.

  
  
Posted 3 years ago

Can you try:
curl | grep UNASSIGNED

  
  
Posted 3 years ago

I'm not familiar with elastic. What role does elastic play in ClearML?

  
  
Posted 3 years ago

It seems both the queue_metrics and worker_stats indices are in red status

  
  
Posted 3 years ago

docker exec clearml-elastic curl
zsh: no matches found:

  
  
Posted 3 years ago

Thanks that did solve the problem, the tasks are running again.

  
  
Posted 3 years ago

[root@2c7498711bef elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 8,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 33.33333333333333
}
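For readers who want to check a response like this programmatically, here is a small sketch that reduces the cluster health JSON to the fields that matter in this thread. The function name `summarize_health` is an assumption for illustration; the sample string is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the cluster health response posted above.
SAMPLE_HEALTH = """
{
  "cluster_name": "clearml",
  "status": "red",
  "number_of_nodes": 1,
  "active_primary_shards": 4,
  "active_shards": 4,
  "unassigned_shards": 8,
  "active_shards_percent_as_number": 33.33333333333333
}
"""


def summarize_health(health_json):
    """Return the status, unassigned shard count, and whether any primary is missing."""
    health = json.loads(health_json)
    return {
        "status": health["status"],
        "unassigned_shards": health["unassigned_shards"],
        "active_percent": round(health["active_shards_percent_as_number"], 1),
        # "red" means at least one primary shard is not allocated.
        "primary_missing": health["status"] == "red",
    }


if __name__ == "__main__":
    print(summarize_health(SAMPLE_HEALTH))
```

A "yellow" status on a single-node cluster generally only means replicas cannot be placed; "red" is the state that blocks indexing into the affected indices.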

  
  
Posted 3 years ago

sorry, 

curl 

Did you try it again?

  
  
Posted 3 years ago

Also, I doubt the reason was that you added workers, more likely something happened in the Elastic - be it some disk issue or something else

  
  
Posted 3 years ago

Hi SubstantialElk6, another thing that can be checked is the health of the particular ES indices. Can you please run the command below in the clearml-elastic container and post the results here?
curl -XGET

  
  
Posted 3 years ago

I'm starting to suspect Docker Desktop 🙂

  
  
Posted 3 years ago

Well, assuming your server data is properly mounted outside of the dockers, restarting the server will be just fine. Can you verify the externally mapped folders actually contain data? (in /Users/jax/clearml/data/mongo , /Users/jax/clearml/data/elastic_7 etc...)

  
  
Posted 3 years ago

docker exec clearml-elastic curl

zsh: no matches found:

No no, do:
docker exec -it clearml-elastic /bin/bash
And then, from the bash inside the container, do:
curl

  
  
Posted 3 years ago

Wait

  
  
Posted 3 years ago

(which is very rare - we're running lots of instances of the clearml server for a very long time and I've never encountered this issue 😞 )

  
  
Posted 3 years ago

SubstantialElk6 Both indices that are red are not critical for the ClearML functioning and can be deleted like this:
curl -XDELETE ' '
curl -XDELETE ' '
To analyze the possible reasons that led to this, can you please collect the full ES logs into a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1

  
  
Posted 3 years ago

[root@2c7498711bef elasticsearch]# curl
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-05-22T11:33:38.932Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "yHcROdoNRWGfm0Ry062NGQ",
      "node_name" : "clearml",
      "transport_address" : "172.18.0.3:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8349540352",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0], node[yHcROdoNRWGfm0Ry062NGQ], [P], s[STARTED], a[id=J-nHYEXBSymWrgpo7J2bLw]]"
        }
      ]
    }
  ]
}
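A rough sketch of how the explanation above can be boiled down to "which shard, and what is blocking it". The helper name `summarize_explain` is hypothetical, and the sample is a trimmed copy of the posted response with the node attributes omitted.

```python
import json

# Trimmed copy of the allocation explanation posted above.
SAMPLE_EXPLAIN = """
{
  "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {"reason": "CLUSTER_RECOVERED"},
  "node_allocation_decisions": [
    {
      "node_name": "clearml",
      "node_decision": "no",
      "deciders": [{"decider": "same_shard", "decision": "NO"}]
    }
  ]
}
"""


def summarize_explain(explain_json):
    """List the (node, decider) pairs that refused the shard, plus basic shard info."""
    data = json.loads(explain_json)
    blockers = [
        (node["node_name"], decider["decider"])
        for node in data.get("node_allocation_decisions", [])
        for decider in node.get("deciders", [])
        if decider.get("decision") == "NO"
    ]
    return {
        "index": data["index"],
        "is_primary": data["primary"],
        "reason": data["unassigned_info"]["reason"],
        "blockers": blockers,
    }


if __name__ == "__main__":
    print(summarize_explain(SAMPLE_EXPLAIN))
```

Read this way, the posted explanation concerns a replica (`"primary" : false`) refused by the `same_shard` decider, which is the normal situation on a one-node cluster. It probably does not describe the shards behind the red status, so the red indices still need to be looked at separately.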

  
  
Posted 3 years ago

What about
curl grep UNASSIGNED

  
  
Posted 3 years ago

Yes, previously run experiments. I will just kill the clearml-elastic container if that may solve the problem.

  
  
Posted 3 years ago

Well, Elastic is used to store and index all experiment and worker metrics

  
  
Posted 3 years ago

[root@2c7498711bef elasticsearch]# curl -XGET
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b CP9YJDjMQLOJ3KNJ9ydobA 1 1 16 0 24.7kb 24.7kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b FCwT4QMkR1G1aaqxvEzEnQ 1 1 3 0 12.7kb 12.7kb
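As a sketch of how to pick the problem indices out of a listing like this, the snippet below assumes the column order shown above (health, state, index name, ...) with whitespace-separated fields. The function name `red_indices` is illustrative only.

```python
# Rows copied from the listing above (health, state, index name, uuid, ...).
SAMPLE_LISTING = """\
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
"""


def red_indices(listing):
    """Return the names of indices whose health column is 'red'."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank or malformed rows; the index name is the third column.
        if len(fields) >= 3 and fields[0] == "red":
            names.append(fields[2])
    return names


if __name__ == "__main__":
    for name in red_indices(SAMPLE_LISTING):
        print(name)
```

On the rows above this picks out the worker_stats and queue_metrics indices for 2021-05, which matches the index named in the original enqueue error.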

  
  
Posted 3 years ago

BTW - spotted some issues with your docker-compose, namely:
` apiserver:
    ...
    volumes:
      ...
      - /Users/jax/clearml/config:/Users/jax/clearml/config
    ...
    environment:
      ...
      CLEARML__apiserver__pre_populate__zip_files: "/Users/jax/clearml/db-pre-populate" `
Specifically, the internal (container-side) mount directories can't be changed to match your external host directories. The server looks for the configuration (and other stuff) in fixed internal directories, so your changes, among other things, effectively make the server ignore any additional configuration files you might add (this is repeated in the ` fileserver ` section as well)
  
  
Posted 3 years ago

If you're running your own server - no, there is no limitation

  
  
Posted 3 years ago

And yes, there is stuff in there. In fact, it's been running for a few weeks with no issue. This appears to have happened after I added new workers, though I can't be sure this is the cause. Is there a limit to the number of workers that I can add for the community edition?

  
  
Posted 3 years ago

(trying to get as much info as possible first)

  
  
Posted 3 years ago