Hi ResponsiveCamel97
What's the clearml-server version? How do you spin up the server on your k8s cluster, helm?
[2021-06-11 15:24:36,885] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 60007ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'PkGr-3kBBPcUBw4n5Acx', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][_doc][PkGr-3kBBPcUBw4n5Acx], source[_na_]}]]

[2021-06-11 15:24:39,424] [9] [ERROR] [clearml.__init__] Failed processing worker status report
Traceback (most recent call last):
  File "/opt/clearml/apiserver/bll/workers/__init__.py", line 149, in status_report
    machine_stats=report.machine_stats,
  File "/opt/clearml/apiserver/bll/workers/__init__.py", line 416, in _log_stats_to_es
    es_res = elasticsearch.helpers.bulk(self.es_client, actions)
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 396, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 326, in streaming_bulk
    **kwargs
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 246, in _process_bulk_chunk
    for item in gen:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 185, in _process_bulk_chunk_success
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
elasticsearch.helpers.errors.BulkIndexError: ('8 document(s) failed to index.', [
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'P0Gr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'cpu', 'metric': 'cpu_temperature', 'variant': '0', 'value': 43.0}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'QEGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'cpu', 'metric': 'cpu_usage', 'variant': '0', 'value': 3.334}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'QUGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'disk', 'metric': 'disk_free_home', 'variant': 'total', 'value': 58.1}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'QkGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'disk', 'metric': 'disk_write', 'variant': 'total', 'value': 0.009}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'Q0Gr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'memory', 'metric': 'memory_free', 'variant': 'total', 'value': 113848.816}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'REGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'memory', 'metric': 'memory_used', 'variant': 'total', 'value': 13401.186}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'RUGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'network', 'metric': 'network_rx', 'variant': 'total', 'value': 0.001}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'RkGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'network', 'metric': 'network_tx', 'variant': 'total', 'value': 0.001}}}])

[2021-06-11 15:24:39,426] [9] [ERROR] [clearml.service_repo] Returned 500 for workers.status_report in 60008ms, msg=General data error (Failed processing worker status report): err=8 document(s) failed to index.
AgitatedDove14 I can try but are you sure this will help?
ResponsiveCamel97
could you attach the full log?
Error 101 : Inconsistent data encountered in document: document=Output, field=model
Okay, this points to a migration issue from 0.17 to 1.0.
First try to upgrade to 1.0, then to 1.0.2.
(I would also upgrade a single apiserver instance first; once it is done, you can spin up the rest.)
Make sense?
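For example, something along these lines (just a rough sketch, not the official procedure; the deployment/container names and image tags here are assumptions, adjust them to whatever is in your actual yaml):

# Run a single apiserver instance while the migration happens (assumed deployment name)
kubectl scale deployment clearml-apiserver --replicas=1

# Bump that one instance to 1.0 first and wait for it to come up cleanly
kubectl set image deployment/clearml-apiserver apiserver=allegroai/clearml:1.0.0
kubectl rollout status deployment/clearml-apiserver

# Once the 1.0 migration looks good, move to 1.0.2 and scale back to your normal replica count
kubectl set image deployment/clearml-apiserver apiserver=allegroai/clearml:1.0.2
kubectl scale deployment clearml-apiserver --replicas=2   # placeholder, use your original count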
old 0.17
new 1.0.2
We partly used the helm charts: we use the yaml files generated from helm, but we rewrote the part about the PVCs, and our ClearML is spread across several nodes.
Can you share the modified helm/yaml?
Yep, see the attachment: clearml and pvc.
Did you run any specific migration script after the upgrade?
Nope, I copied the data from the fileservers and Elasticsearch, plus made a mongodump.
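Roughly like this (the hostnames and paths below are placeholders, not our real ones):

# Dump MongoDB from the old 0.17 server and restore it into the new one
mongodump --host old-mongo --port 27017 --out /backup/mongo
mongorestore --host new-mongo --port 27017 /backup/mongo

# Copy the fileserver and Elasticsearch data directories onto the new PVCs
rsync -a /opt/clearml/data/fileserver/ new-node:/opt/clearml/data/fileserver/
rsync -a /opt/clearml/data/elastic_7/ new-node:/opt/clearml/data/elastic_7/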
How many apiserver instances do you have?
1 apiserver container
How did you configure the elastic container? Is it booting?
Standard configuration (clearml.yaml). Elastic works
webserver 127.0.0.1 - - [11/Jun/2021:14:32:02 +0000] "GET /version.json HTTP/1.1" 304 0 "*/projects/cbe22f65c9b74898b5496c48fffda75b/experiments/3fc89b411cf14240bf1017f17c58916b/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=last_update" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
For example, the webserver log.
ResponsiveCamel97 it looks like one of the shards in ES is not active. I suggest using the ES API to query the cluster status and the reason for the shard status.
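Something like this (assuming you can reach Elasticsearch on port 9200, e.g. from inside the elastic pod or via kubectl port-forward):

# Overall cluster health (red/yellow means some shards are not allocated)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-shard state, shows which shards are UNASSIGNED and for which indices
curl -s 'http://localhost:9200/_cat/shards?v'

# Elasticsearch's own explanation of why an unassigned shard is not being allocated
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'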
Many thanks
2 indexes didn’t work. I deleted them and new ones were created automatically.
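In case it helps anyone else, this can be done through the ES API, roughly along these lines (the index name below is just an example, taken from the error above):

# List the broken indices; they show up with red health
curl -s 'http://localhost:9200/_cat/indices?v&health=red'

# Delete a broken index; ClearML recreates it automatically on the next write
curl -s -X DELETE 'http://localhost:9200/queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06'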