
Looks like something is wrong with the index:
` index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 2.4h existing_store done n/a n/a 10.18.13.96 cle...
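Judging by the column names, that looks like output from the Elasticsearch _cat/recovery API; a minimal sketch of the query, assuming ES is reachable on localhost:9200:
# hypothetical sketch: list per-shard recovery status (host and port are assumptions)
curl -XGET 'http://localhost:9200/_cat/recovery?v'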
OK, let's try
but it’s a lot of resources
sure
First command output: curl -XGET
`
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 xjVdUpdDReCv5g11c4IGFw 1 0 10248782 0 536.6mb 536.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 YuxjrptlTh2MlOCU7ykMkA 1 0 13177592 0 695....
What's interesting is that ClearML can delete new experiments without any problems,
but it didn't want to remove old archived experiments
Developers complain that experiments hang in Pending status for a long time
more than 10 minutes
Anyway, it would be very cool if the site had some additional information about troubleshooting or backups.
ClearML in Kubernetes
worker nodes are bare metal and they are not in k8s yet :(
At the moment ES has the following resources: Limits: cpu: 2, memory: 10G; Requests: cpu: 2, memory: 10G
We launched ES with these parameters at the time of the problems
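For reference, a hedged sketch of how those values could be applied to the ES workload with kubectl; the namespace and statefulset name are assumptions:
# hypothetical sketch: set the CPU/memory limits and requests quoted above
# (namespace "clearml" and statefulset name "elasticsearch" are assumptions)
kubectl -n clearml set resources statefulset/elasticsearch \
  --limits=cpu=2,memory=10G --requests=cpu=2,memory=10G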
And one more question
Can I provide an argument for Docker not in clearml.conf, but when starting the daemon?
for example
clearml-agent --config-file ~/clearml.conf daemon --docker agent-image-test "-v /home/trains/clearml-agent-data/3/.cache:/root/.cache" --queue test --create-queue --foreground --gpus=3
Or can I do it only in clearml.conf?
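If the clearml.conf route is preferred instead, a hedged sketch of the equivalent setting, assuming the agent section's extra_docker_arguments key is used (the mount path is copied from the command above):
# sketch of the clearml.conf alternative (assumes agent.extra_docker_arguments)
agent {
    extra_docker_arguments: ["-v", "/home/trains/clearml-agent-data/3/.cache:/root/.cache"]
}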
When I load http://app.clearml.my.domain.com I get Status Code: 426 at http://app.clearml.my.domain.com/v2.13/login.supported_modes (for example)
At the moment I've downloaded the helm chart and added proxy_http_version 1.1; to the nginx config. Then everything works
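The kind of nginx change meant here, as a minimal sketch; the location block and upstream address are assumptions, not the chart's actual values:
# hypothetical sketch: force HTTP/1.1 on proxied requests so the 426 goes away
location / {
    proxy_http_version 1.1;
    proxy_pass http://clearml-apiserver:8008;  # upstream name and port are assumptions
}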
But what if I don't want that new venv to inherit everything? I prepared my own image and want to use its venv
Also I tried to delete tasks via the API, like this:
` >>> from clearml_agent import APIClient
>>> client = APIClient()
>>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 41cb804da24747abb362fb5ca0414fe6 | 15....
I’ve tried with these two
` >>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 378c8e80c3dd4ff8901f04f00824acbd | ab-ai-767-easy |
| c575db3f302441c6a977f52c...
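Assuming the goal is to remove those archived tasks, a sketch of continuing the same session with the tasks.delete call; the loop and the force flag are assumptions, not a confirmed recipe:
# hypothetical sketch: delete the archived tasks returned by get_all
from clearml_agent import APIClient
client = APIClient()
for task in client.tasks.get_all(system_tags=["archived"]):
    client.tasks.delete(task=task.id, force=True)  # force=True is an assumption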
Yet the experiments have stopped normally. The body of the experiment says aborted, but at the same time I still see it on the dashboard
In our case some packages are taken from /usr/lib/python3/dist-packages and others from the local environment, and this causes a conflict when importing the attr module
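A quick way to see which copy wins, as a one-liner run inside the same environment the task uses:
# print where the attr module is actually imported from, plus the search path
python3 -c "import attr, sys; print(attr.__file__); print(sys.path)"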
old 0.17
new 1.0.2
We partly used the helm charts: we used the yaml files from helm, but we rewrote the part about PVCs, and our ClearML is located on several nodes
I recovered the ES data from the backup
It helped.
` - env:
- name: bootstrap.memory_lock
value: "true"
- name: cluster.name
value: clearml
- name: cluster.routing.allocation.node_initial_primaries_recoveries
value: "500"
- name: cluster.routing.allocation.disk.watermark.low
value: 500mb
- name: cluster.routing.allocation.disk.watermark.high
value: 500mb
- name: cluster.routing.allocation.disk.watermark.flood_stage
value: 500mb
...
Recently, the free space on the PV ran out and the cluster switched to read_only_allow_delete. I tried to remove old experiments, but it didn't help and I got the same error.
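Once space is freed, ES may still need the read-only block cleared explicitly; a minimal sketch, assuming ES on localhost:9200:
# clear the read_only_allow_delete block on all indices after freeing disk space
curl -XPUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'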
I think per task we would use clearml-task? But yes, this is needed permanently, like the clearml.conf config. We have 4 GPUs, and for each we have a separate cache
I don't want to make 4 clearml.conf files
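One hedged alternative, building on the daemon command above: keep a single clearml.conf and vary only the cache mount and GPU index per agent (--detached in place of --foreground is an assumption):
# hypothetical sketch: one detached agent per GPU, each with its own cache directory
for GPU in 0 1 2 3; do
  clearml-agent --config-file ~/clearml.conf daemon \
    --docker agent-image-test "-v /home/trains/clearml-agent-data/${GPU}/.cache:/root/.cache" \
    --queue test --create-queue --detached --gpus=${GPU}
done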
Yes, it's the same. I realized my mistake and now everything works) Many thanks
Many thanks
2 indexes didn’t work. I deleted them and new ones were created automatically.
and I still see this error in the logs:
` [2022-06-20 13:24:27,777] [9] [WARNING] [elasticsearch] POST [status:N/A request:60.060s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/...
Can you share the modified helm/yaml?
Yep, here in attachment, clearml and pvc
Did you run any specific migration script after the upgrade?
Nope, I copied the data from the fileservers and Elasticsearch, plus made a mongodump
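The kind of dump/restore meant here, as a minimal sketch; host, port and paths are assumptions:
# hypothetical sketch: dump the ClearML MongoDB and restore it on the new cluster
mongodump --host mongo --port 27017 --out /backup/clearml-mongo
mongorestore --host mongo --port 27017 /backup/clearml-mongo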
How many apiserver instances do you have?
1 apiserver container
How did you configure the elastic container? is it booting?
Standard configuration (clearml.yaml). Elastic works
webserver 127.0.0.1 - - [11/Jun/2021:14:32:02 +0000] “GET /version.json HTTP/1.1” 304 0 “*/projects/cbe22f65c9b74898b5496c48fffda75b/experiments/3fc89b411cf14240bf1017f17c58916b/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=last_update” “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
for example webserver
Infrastructure in k8s
But when I check the health of the cluster, I get green status: curl localhost:9200/_cluster/health
` {"cluster_name":"clearml","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":41,"active_shards":41,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_nu...
I just hid the elastic IP in the second output
` [2021-06-11 15:24:36,885] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 60007ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'PkGr-3kBBPcUBw4n5Acx', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics...