ResponsiveCamel97
could you attach the full log?
webserver 127.0.0.1 - - [11/Jun/2021:14:32:02 +0000] "GET /version.json HTTP/1.1" 304 0 "*/projects/cbe22f65c9b74898b5496c48fffda75b/experiments/3fc89b411cf14240bf1017f17c58916b/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=last_update" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
for example, that's from the webserver
Hi ResponsiveCamel97
What's the clearml-server version? How do you spin up the server on your k8s cluster, Helm?
old 0.17
new 1.0.2
We partly used the Helm charts: we took the yaml files from Helm but rewrote the part about the PVCs, and our ClearML is spread across several nodes.
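Just to illustrate the kind of PVC override we mean (a sketch only; the name, storage class and size below are made-up placeholders, not the real values):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clearml-elastic-data        # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard        # placeholder storage class
  resources:
    requests:
      storage: 50Gi                 # placeholder size
EOF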
Error 101 : Inconsistent data encountered in document: document=Output, field=model
Okay, this points to a migration issue from 0.17 to 1.0.
First try upgrading to 1.0, then to 1.0.2.
(I would also upgrade a single apiserver instance first; once it is done, you can spin up the rest.)
Makes sense?
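Something along these lines, as a rough sketch; the Deployment name, container name, image repo and tags below are assumptions, adjust them to your yaml:

# scale the apiserver down to a single instance (deployment name is an assumption)
kubectl scale deployment clearml-apiserver --replicas=1
# move that instance to 1.0 first and let the data migration finish
kubectl set image deployment/clearml-apiserver apiserver=allegroai/clearml:1.0.0
kubectl rollout status deployment/clearml-apiserver
# then repeat with 1.0.2 and scale the rest back up
kubectl set image deployment/clearml-apiserver apiserver=allegroai/clearml:1.0.2
kubectl rollout status deployment/clearml-apiserver
kubectl scale deployment clearml-apiserver --replicas=3   # original replica count is an assumption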
[2021-06-11 15:24:36,885] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 60007ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'PkGr-3kBBPcUBw4n5Acx', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][_doc][PkGr-3kBBPcUBw4n5Acx], source[_na_]}]]
[2021-06-11 15:24:39,424] [9] [ERROR] [clearml.__init__] Failed processing worker status report
Traceback (most recent call last):
  File "/opt/clearml/apiserver/bll/workers/__init__.py", line 149, in status_report
    machine_stats=report.machine_stats,
  File "/opt/clearml/apiserver/bll/workers/__init__.py", line 416, in _log_stats_to_es
    es_res = elasticsearch.helpers.bulk(self.es_client, actions)
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 396, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 326, in streaming_bulk
    **kwargs
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 246, in _process_bulk_chunk
    for item in gen:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 185, in _process_bulk_chunk_success
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
elasticsearch.helpers.errors.BulkIndexError: ('8 document(s) failed to index.', [
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'P0Gr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'cpu', 'metric': 'cpu_temperature', 'variant': '0', 'value': 43.0}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'QEGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'cpu', 'metric': 'cpu_usage', 'variant': '0', 'value': 3.334}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'QUGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'disk', 'metric': 'disk_free_home', 'variant': 'total', 'value': 58.1}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'QkGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'disk', 'metric': 'disk_write', 'variant': 'total', 'value': 0.009}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'Q0Gr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'memory', 'metric': 'memory_free', 'variant': 'total', 'value': 113848.816}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'REGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'memory', 'metric': 'memory_used', 'variant': 'total', 'value': 13401.186}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'RUGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'network', 'metric': 'network_rx', 'variant': 'total', 'value': 0.001}}},
  {'index': {'_index': 'worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'RkGr-3kBBPcUBw4n7gce', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0]] containing [8] requests]'}, 'data': {'timestamp': 1623417920000, 'worker': 'test:bd28:cpu:2', 'company': 'clearml', 'task': None, 'category': 'network', 'metric': 'network_tx', 'variant': 'total', 'value': 0.001}}}])
[2021-06-11 15:24:39,426] [9] [ERROR] [clearml.service_repo] Returned 500 for workers.status_report in 60008ms, msg=General data error (Failed processing worker status report): err=8 document(s) failed to index.
AgitatedDove14 I can try, but are you sure this will help?
Can you share the modified helm/yaml?
Did you run any specific migration script after the upgrade?
How many apiserver instances do you have?
How did you configure the elastic container? Is it booting?
Can you share the modified helm/yaml?
Yep, see the attachment: clearml and pvc.
Did you run any specific migration script after the upgrade?
Nope. I copied the data from the fileservers and Elasticsearch, and made a mongodump.
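Roughly like this, just as a sketch; the hosts and paths below are placeholders, not our real ones:

# dump mongo on the old server and restore it on the new one
mongodump --host old-mongo --port 27017 --out /backup/mongo
mongorestore --host new-mongo --port 27017 /backup/mongo
# copy the fileserver and elasticsearch data directories as-is
rsync -a /opt/clearml/data/fileserver/ new-node:/opt/clearml/data/fileserver/
rsync -a /opt/clearml/data/elastic_7/ new-node:/opt/clearml/data/elastic_7/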
How many apiserver instances do you have?
1 apiserver container
How did you configure the elastic container? Is it booting?
Standard configuration (clearml.yaml). Elastic works.
ResponsiveCamel97 it looks like one of the shards in ES is not active. I suggest using the ES API to query the cluster status and the reason for the shards' status.
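For example, something like this, assuming ES is reachable on localhost:9200 (adjust host/port to your cluster):

# overall cluster status (green / yellow / red)
curl -s 'http://localhost:9200/_cluster/health?pretty'
# per-shard state; look for UNASSIGNED or INITIALIZING shards
curl -s 'http://localhost:9200/_cat/shards?v'
# ES explains why the first unassigned shard is not allocated
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'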
Many thanks
2 indexes didn’t work. I deleted them and new ones were created automatically.
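For reference, finding and removing them was along these lines (same localhost:9200 assumption; the index name is just the one from the log above):

# list only the problematic (red) indices
curl -s 'http://localhost:9200/_cat/indices?v&health=red'
# delete a broken index; the server recreates it on the next write
curl -X DELETE 'http://localhost:9200/queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06'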