And one more question
Can I pass a docker argument not in clearml.conf, but when starting the daemon?
for example
clearml-agent --config-file ~/clearml.conf daemon --docker agent-image-test "-v /home/trains/clearml-agent-data/3/.cache:/root/.cache" --queue test --create-queue --foreground --gpus=3
Or can I only do it in clearml.conf?
Also, I tried to delete tasks via the API, like this:
` >>> from clearml_agent import APIClient
client = APIClient()
client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 41cb804da24747abb362fb5ca0414fe6 | 15....
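For reference, a minimal sketch of deleting those archived tasks with the same APIClient; it assumes the apiserver's tasks.delete endpoint accepts a task id and a force flag:

```python
from clearml_agent import APIClient  # same client as in the snippet above

client = APIClient()

# Sketch only: delete every task returned by the archived query above.
# Assumes tasks.delete accepts a task id and a force flag.
for task in client.tasks.get_all(system_tags=["archived"]):
    client.tasks.delete(task=task.id, force=True)
    print("deleted", task.id)
```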
Recently the PV ran out of free space and the cluster switched to read_only_allow_delete. I tried removing old experiments, but it didn’t help and I got the same error.
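For what it's worth, on older Elasticsearch versions the read_only_allow_delete block has to be cleared explicitly even after space is freed. A minimal sketch, assuming the elasticsearch-service host that shows up in the errors later in this thread:

```python
import requests

# Clear the read-only block Elasticsearch put on all indices when the disk filled up.
# Host and port are assumptions based on the errors quoted below.
resp = requests.put(
    "http://elasticsearch-service:9200/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
print(resp.status_code, resp.json())
```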
AgitatedDove14 I can try but are you sure this will help?
Thank you, I understand, but the developers want all packages to be in one place
Can you share the modified helm/yaml?
Yep, here in the attachment: clearml and pvc
Did you run any specific migration script after the upgrade?
Nope, I copied the data from the fileservers and Elasticsearch, plus made a mongodump
How many apiserver instances do you have ?
1 apiserver container
How did you configure the elastic container? is it booting?
Standard configuration (clearml.yaml). Elastic works
In our case some packages are taken from /usr/lib/python3/dist-packages and others from the local environment, and this causes a conflict when importing the attr module
Yes, it’s the same. I realized my mistake and now everything works :) Many thanks
When I load http://app.clearml.my.domain.com I get Status Code: 426 at http://app.clearml.my.domain.com/v2.13/login.supported_modes (for example)
For now I’ve downloaded the helm chart and added proxy_http_version 1.1; support in nginx. After that, everything works
old 0.17
new 1.0.2
We partly used the helm charts: we use the yaml files from helm, but we rewrote the PVC part, and our ClearML is spread across several nodes
Then I changed the size of the PV and added an extra 50Gb
Looks like it helped and now the service is working, but I still get this bug.
Nothing :)
I’ll talk to the developers; I think I’ve figured out how to solve this problem
` [2021-06-11 15:24:36,885] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 60007ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06', '_type': '_doc', '_id': 'PkGr-3kBBPcUBw4n5Acx', 'status': 503, 'error': {'type':..., extra_info=[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics...
sure
First command output (curl -XGET):
`
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 xjVdUpdDReCv5g11c4IGFw 1 0 10248782 0 536.6mb 536.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 YuxjrptlTh2MlOCU7ykMkA 1 0 13177592 0 695....
Delete, reset
Looks like something is wrong with the index
` index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 2.4h existing_store done n/a n/a 10.18.13.96 cle...
I just hid the Elastic IP in the second output
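For what it's worth, when a primary shard is reported as not active, Elasticsearch can explain why it refuses to allocate it. A minimal sketch, assuming the same elasticsearch-service host as above:

```python
import requests

# Ask Elasticsearch why the first unassigned shard is not being allocated.
# The host name is an assumption.
resp = requests.get("http://elasticsearch-service:9200/_cluster/allocation/explain")
print(resp.json())
```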
and I still see this error in the logs:
` [2022-06-20 13:24:27,777] [9] [WARNING] [elasticsearch] POST [status:N/A request:60.060s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/...
` - env:
  - name: bootstrap.memory_lock
    value: "true"
  - name: cluster.name
    value: clearml
  - name: cluster.routing.allocation.node_initial_primaries_recoveries
    value: "500"
  - name: cluster.routing.allocation.disk.watermark.low
    value: 500mb
  - name: cluster.routing.allocation.disk.watermark.high
    value: 500mb
  - name: cluster.routing.allocation.disk.watermark.flood_stage
    value: 500mb
  ...
OK, let’s try
but it’s a lot of resources
I’ve tried with these two
` >>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id | name |
+----------------------------------+------------------------------------------------------------+
| 378c8e80c3dd4ff8901f04f00824acbd | ab-ai-767-easy |
| c575db3f302441c6a977f52c...
And developers complain to me that they can’t start experiments
` APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)))
Failed deleting old session ffaa2192fb9045359e7c9827ff5e1e55
APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeo...
At the moment ES has the following resources: Limits: cpu: 2, memory: 10G; Requests: cpu: 2, memory: 10G
We launched ES with these parameters at the time of the problems
Developers complain that experiments hang in the Pending status for a long time
more than 10 minutes
What’s interesting is that ClearML can delete new experiments without any problems,
but it didn’t want to remove the old archived experiments
Anyway, it would be very cool if the site had additional information on troubleshooting and backups.
I recovered the ES data from the backup
It helped.
I think per task we use clearml-task? But yes, this is needed permanently, like the clearml.conf config: we have 4 GPUs, and each one has a separate cache
I don’t want to make 4 clearml.conf files
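One possible way to avoid four clearml.conf files is to keep a single config and vary only the cache mount per daemon, reusing the daemon command shown earlier. A sketch, with hypothetical queue names:

```python
import os
import subprocess

# Sketch: one daemon per GPU, all sharing a single clearml.conf, each with its own
# cache directory mounted via the --docker arguments. Queue names are hypothetical;
# the rest mirrors the daemon command quoted earlier in the thread.
conf = os.path.expanduser("~/clearml.conf")
for gpu in range(4):
    subprocess.Popen([
        "clearml-agent", "--config-file", conf,
        "daemon",
        "--docker", "agent-image-test",
        f"-v /home/trains/clearml-agent-data/{gpu}/.cache:/root/.cache",
        "--queue", f"gpu{gpu}", "--create-queue",
        f"--gpus={gpu}",
    ])
```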