Infrastructure in k8s
but when I check the health of the cluster, I get a green status:
curl localhost:9200/_cluster/health
` {"cluster_name":"clearml","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":41,"active_shards":41,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_nu...
What's interesting is that ClearML can delete new experiments without any problems,
but it doesn't want to remove old archived experiments.
sure
First command output:
curl -XGET
`
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 xjVdUpdDReCv5g11c4IGFw 1 0 10248782 0 536.6mb 536.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 YuxjrptlTh2MlOCU7ykMkA 1 0 13177592 0 695....
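For reference, a listing like the one above is what Elasticsearch's `_cat/indices` API returns; a minimal sketch of the same query from Python, assuming the cluster is reachable on `localhost:9200` as in the health check earlier:
```
import requests

# List all indices with health, shard counts, doc counts and sizes
# (equivalent to `curl -XGET 'localhost:9200/_cat/indices?v'`).
resp = requests.get("http://localhost:9200/_cat/indices", params={"v": "true"})
print(resp.text)
```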
Nothing)
I’ll talk to the developers. Also, I think I’ve figured out how to solve this problem
I’ve tried with these two
` >>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 378c8e80c3dd4ff8901f04f00824acbd | ab-ai-767-easy |
| c575db3f302441c6a977f52c...
ClearML in Kubernetes
worker nodes are bare metal and they are not in k8s yet :(
` - env:
- name: bootstrap.memory_lock
value: "true"
- name: cluster.name
value: clearml
- name: cluster.routing.allocation.node_initial_primaries_recoveries
value: "500"
- name: cluster.routing.allocation.disk.watermark.low
value: 500mb
- name: cluster.routing.allocation.disk.watermark.high
value: 500mb
- name: cluster.routing.allocation.disk.watermark.flood_stage
value: 500mb
...
` [2021-06-11 15:24:36,885] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 60007ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'PkGr-3kBBPcUBw4n5Acx', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics...
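A "primary shard is not active" error means the shard for that index is unassigned, so bulk writes time out. One way to see which shards are unassigned and why (a minimal sketch, assuming the same `localhost:9200` endpoint as above):
```
import requests

ES = "http://localhost:9200"

# Per-shard state; UNASSIGNED shards are the ones blocking indexing.
print(requests.get(f"{ES}/_cat/shards", params={"v": "true"}).text)

# Ask Elasticsearch why the first unassigned shard cannot be allocated
# (e.g. disk watermarks exceeded); returns an error if nothing is unassigned.
print(requests.get(f"{ES}/_cluster/allocation/explain").json())
```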
Recently the free space on the PV ran out and the cluster switched to read_only_allow_delete. I tried removing old experiments, but it didn’t help and I got the same error.
Then I changed the size of the PV and added an extra 50Gb
Looks like it helped and now the service is working, but I still get this bug.
Anyway, it would be very cool if there were some additional information about troubleshooting or backups on the site.
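For what it's worth: when the flood-stage watermark is hit, Elasticsearch puts the `index.blocks.read_only_allow_delete` block on the indices, and depending on the ES version the block may not be lifted automatically after space is freed. A minimal sketch of clearing it by hand (the `localhost:9200` endpoint is an assumption, same as above):
```
import requests

ES = "http://localhost:9200"

# Remove the read_only_allow_delete block from all indices so writes are allowed again.
resp = requests.put(
    f"{ES}/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
print(resp.json())
```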
Yet the experiments have stopped normally: the experiment's details say it was aborted, but at the same time I still see it on the dashboard
and I still see this error in the logs:
` [2022-06-20 13:24:27,777] [9] [WARNING] [elasticsearch] POST [status:N/A request:60.060s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/...
AgitatedDove14 I can try but are you sure this will help?
Developers complain that experiments hang in the Pending status for a long time
more than 10 minutes
When I load http://app.clearml.my.domain.com I get Status Code: 426 (Upgrade Required) at http://app.clearml.my.domain.com/v2.13/login.supported_modes (for example)
For now I’ve downloaded the helm chart and added proxy_http_version 1.1; to the nginx config, since 426 means the backend wants HTTP/1.1. After that everything works
Thank you, I understand, but the developers want all packages to be in one place
I recovered the ES data from the backup
It helped.
In our case some packages are taken from /usr/lib/python3/dist-packages, others from the local environment, and this causes a conflict when importing the attr module
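A quick way to see which copy of a package actually wins on import (a minimal sketch, nothing ClearML-specific):
```
import sys
import attr

# Shows whether `attr` resolves to /usr/lib/python3/dist-packages or to the venv's site-packages.
print(attr.__file__)
print(getattr(attr, "__version__", "unknown"))

# The directories Python searches, in order; the first match wins.
for path in sys.path:
    print(path)
```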
old 0.17
new 1.0.2
We partly used the helm charts: we took the YAML files from helm, but we rewrote the PVC part, and our ClearML is spread across several nodes
Can you share the modified helm/yaml?
Yep, here in the attachment: clearml and pvc
Did you run any specific migration script after the upgrade ?
nope, I copied the data from the fileservers and elasticsearch, plus made a mongodump
How many apiserver instances do you have ?
1 apiserver container
How did you configure the elastic container? is it booting?
Standard configuration (clearml.yaml). Elastic works
But what if I don’t want the new venv to inherit everything? I prepared my own image and want to use this venv
at the moment ES has the following resources:
Limits: cpu: 2, memory: 10G
Requests: cpu: 2, memory: 10G
We launched ES with these parameters at the time of the problems
Also I tried to delete tasks via the API, like this:
` >>> from clearml_agent import APIClient
client = APIClient()
client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 41cb804da24747abb362fb5ca0414fe6 | 15....
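The delete call itself got cut off above; a sketch of how it might look with the same `APIClient` (the `tasks.delete` call and its `force` flag are assumptions based on the server API, not copied from the original snippet):
```
from clearml_agent import APIClient

client = APIClient()

# Enumerate archived tasks, then delete them one by one.
# tasks.delete / force=True are assumed from the server's API reference.
archived = client.tasks.get_all(system_tags=["archived"])
for task in archived:
    print("deleting", task.id, task.name)
    client.tasks.delete(task=task.id, force=True)
```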
webserver 127.0.0.1 - - [11/Jun/2021:14:32:02 +0000] “GET /version.json HTTP/1.1” 304 0 “*/projects/cbe22f65c9b74898b5496c48fffda75b/experiments/3fc89b411cf14240bf1017f17c58916b/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=last_update” “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
for example webserver
Delete, reset
looks like something is wrong with the index
` index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 2.4h existing_store done n/a n/a 10.18.13.96 cle...
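The listing above looks like `_cat/recovery` output; a sketch of re-running that query and watching only recoveries that are still in progress (again assuming the `localhost:9200` endpoint):
```
import requests

ES = "http://localhost:9200"

# Per-shard recovery progress; active_only=true hides recoveries that already finished.
resp = requests.get(
    f"{ES}/_cat/recovery",
    params={"v": "true", "active_only": "true"},
)
print(resp.text)
```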
And developers complain to me that they can’t start experiments
` APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)))
Failed deleting old session ffaa2192fb9045359e7c9827ff5e1e55
APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeo...