I’ll talk to the developers and I think I figured out how to solve this problem
webserver 127.0.0.1 - - [11/Jun/2021:14:32:02 +0000] "GET /version.json HTTP/1.1" 304 0 "*/projects/cbe22f65c9b74898b5496c48fffda75b/experiments/3fc89b411cf14240bf1017f17c58916b/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=last_update" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
For example, the webserver.
Many thanks
2 indexes didn’t work. I deleted them and new ones were created automatically.
Can you share the modified helm/yaml?
Yep, attached: clearml and pvc.
Did you run any specific migration script after the upgrade?
Nope, I copied the data from the fileservers and Elasticsearch, and made a mongodump.
How many apiserver instances do you have?
1 apiserver container
How did you configure the elastic container? Is it booting?
Standard configuration (clearml.yaml). Elastic works.
` [2021-06-11 15:24:36,885] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 60007ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'PkGr-3kBBPcUBw4n5Acx', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics...
In our case, some packages are taken from /usr/lib/python3/dist-packages and others from the local environment, and this causes a conflict when importing the attr module.
ClearML in Kubernetes
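A quick way to see which copy wins is to ask Python where it would load a module from. This is only a minimal sketch using the standard library; the helper names `module_origin` and `is_system_dist_package` are made up for illustration, and it has to be run with the same interpreter the task uses.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def is_system_dist_package(path):
    """True if the path points into a system-wide dist-packages directory."""
    return path is not None and "/dist-packages/" in path.replace("\\", "/")


if __name__ == "__main__":
    # "attr" is the module from the message above; add other suspects as needed.
    for name in ("attr",):
        origin = module_origin(name)
        where = "system dist-packages" if is_system_dist_package(origin) else "elsewhere"
        print(f"{name}: {origin} ({where})")
    # The search order decides which copy is picked when both exist.
    for entry in sys.path:
        print("sys.path entry:", entry)
```

If `attr` resolves to /usr/lib/python3/dist-packages while its dependents come from the local environment, that would match the conflict described.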
worker nodes are bare metal and they are not in k8s yet :(
I think per task we would use clearml-task? But yes, we need this permanently, like a setting in clearml.conf. We have 4 GPUs, and each one has a separate cache.
I don’t want to make 4 clearml.conf files.
` - env:
      - name: bootstrap.memory_lock
        value: "true"
      - name: cluster.name
        value: clearml
      - name: cluster.routing.allocation.node_initial_primaries_recoveries
        value: "500"
      - name: cluster.routing.allocation.disk.watermark.low
        value: 500mb
      - name: cluster.routing.allocation.disk.watermark.high
        value: 500mb
      - name: cluster.routing.allocation.disk.watermark.flood_stage
        value: 500mb
      ...
Thank you, I understand, but the developers want all packages to be in one place
Yes, it’s the same. I realized my mistake and now everything works. Many thanks.
old 0.17
new 1.0.2
We partly used the helm charts: we took the yaml files from helm, but rewrote the part about pvc, and our clearml is spread over several nodes.
But what if I don’t want the new venv to inherit everything? I prepared my own image and want to use this venv.
Recently the pv ran out of free space and the cluster switched to read_only_allow_delete. I tried removing old experiments, but it didn’t help and I got the same error.
And one more question.
Could I provide a docker argument not in clearml.conf, but when starting the daemon?
For example:
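For reference, the read_only_allow_delete block is an index setting, and it can be reset through the Elasticsearch settings API once there is free space again. Below is a minimal sketch of that request, assuming ES is reachable at http://localhost:9200 without authentication; the function names are hypothetical. Some Elasticsearch versions release the block on their own once disk usage drops below the high watermark, so this is only for the case where it stays set.

```python
import json
import urllib.request


def build_unblock_request(es_url="http://localhost:9200", index="_all"):
    """Build a PUT <es_url>/<index>/_settings request that resets the read-only block."""
    # Setting the value to null removes the block from the index settings.
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def clear_read_only_block(es_url="http://localhost:9200", index="_all"):
    """Send the request and return the parsed JSON response from Elasticsearch."""
    request = build_unblock_request(es_url, index)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `clear_read_only_block()` only makes sense after space has been freed on the volume, otherwise the block is expected to come back.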
clearml-agent --config-file ~/clearml.conf daemon --docker agent-image-test "-v /home/trains/clearml-agent-data/3/.cache:/root/.cache" --queue test --create-queue --foreground --gpus=3
Or can I do it only in clearml.conf?
I also tried deleting tasks via the API, like this:
` >>> from clearml_agent import APIClient
client = APIClient()
client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 41cb804da24747abb362fb5ca0414fe6 | 15....
Infrastructure in k8s
But when I check the health of the cluster, I get a green status: curl localhost:9200/_cluster/health
` {"cluster_name":"clearml","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":41,"active_shards":41,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_nu...
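The health output is easier to read when only the suspicious fields are pulled out. Here is a small sketch that does this for a response from `_cluster/health`; the helper name `health_problems` and the sample JSON are hypothetical, not real output from this cluster. Note that a green cluster status does not show per-index blocks, so the per-index view (`_cluster/health?level=indices`) or the index settings may still be worth checking.

```python
import json


def health_problems(health):
    """Return a list of human-readable problems found in a _cluster/health response."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"status is {status!r}, expected 'green'")
    if health.get("timed_out"):
        problems.append("health request timed out")
    # Any shard that is not active and settled is worth a closer look.
    for key in ("unassigned_shards", "initializing_shards", "relocating_shards"):
        count = health.get(key, 0)
        if count:
            problems.append(f"{key} = {count}")
    return problems


# Hypothetical sample response, for illustration only.
sample_text = '{"cluster_name": "clearml", "status": "yellow", "timed_out": false, "unassigned_shards": 2}'
print(health_problems(json.loads(sample_text)))
```

An empty list means nothing in the cluster-level summary looks wrong, which is what the green output above corresponds to.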
What’s interesting is that clearml can delete new experiments without any problems,
but it wouldn’t remove old archived experiments.
Anyway, it would be very cool if the site had additional information on troubleshooting or backups.
I recovered the ES data from the backup
It helped.
AgitatedDove14 I can try but are you sure this will help?
Yet the experiments stopped normally. The body of the experiment says aborted, but at the same time I still see it on the dashboard.
OK, let’s try.
but it’s a lot of resources
I just hid the elastic IP in the second output.
Delete, reset
Looks like something is wrong with the index.
` index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 2.4h existing_store done n/a n/a 10.18.13.96 cle...
At the moment ES has the following resources: Limits: cpu: 2, memory: 10G; Requests: cpu: 2, memory: 10G
We launched ES with these parameters at the time of the problems
When I load http://app.clearml.my.domain.com I get Status Code: 426 at http://app.clearml.my.domain.com/v2.13/login.supported_modes (for example)
For now I’ve downloaded the helm chart and added proxy_http_version 1.1; to the nginx config. After that everything works.