Hi ImmenseMole52, did you make any changes to the docker compose file? If so, can you please send your version of the file?
Hi SubstantialElk6, another thing that can be checked is the health of the particular ES indices. Can you please run the below command in the clearml-elastic container and post the results here?
curl -XGET
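The command should look something like this (assuming the default localhost:9200 binding inside the container):
curl -XGET 'localhost:9200/_cat/indices?v'
The health column in the output shows green/yellow/red per index.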
SubstantialElk6 Both indices that are red are not critical for ClearML functioning and can be deleted like this:
curl -XDELETE '…'
curl -XDELETE '…'
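For reference, the general shape of the command (the index name here is just a placeholder for the red index reported by ES):
curl -XDELETE 'localhost:9200/<red_index_name>'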
For the analysis of the possible reasons that led to it, can you please collect the full ES logs to a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
The tasks themselves will stay until you succeed in deleting them from the client. Here we tried to see why deleting their data from ES timed out. From what I can see, no data was actually deleted (most likely because the previous delete attempts had already deleted the data, though they caused a timeout in the apiserver). What seems problematic is the amount of time that each operation took (19 and 16 seconds). It may be due to insufficient memory/CPU allocation for the ES container, or due to the 50Gb inde...
With what memory settings do you run ES? How much memory and CPU are currently occupied by the ES container?
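A quick way to check the current usage, assuming the container is named clearml-elastic:
sudo docker stats clearml-elastic --no-stream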
Hi CooperativeFox72, how much free space do you have on your disk now? If you run du on your /opt/trains/data/elastic_7 folder at, say, 5-minute intervals, do you see the folder size growing?
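For example (the 300-second interval is just an illustration):
watch -n 300 du -sh /opt/trains/data/elastic_7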
IdealPanda97 Is your user id 1000? If not, this may be the reason, and chown -R 1000:1000 may help. Elasticsearch in the docker runs with user 1000. Another possible reason is some other elasticsearch process or docker container running on your machine and holding the lock on the data folder. If there are any, please try stopping them. If neither of the above helps, there is the option of manually deleting the .lock files from the elastic data folder. Of course, the data should be backed up before this....
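Assuming the default data location from the docker compose, the ownership fix would be something like this (back up the folder first):
sudo chown -R 1000:1000 /opt/trains/data/elastic_7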
@<1585078752969232384:profile|FantasticDuck7> What volume mappings do you have for the async_delete service in the docker-compose.yaml file?
Ok, so there is no mapping for the whole config folder or for the specific config file that you changed. That's why async_delete does not get your updated configuration. You can do one of the following: either add a mapping here for the specific file, as you did earlier, or map the whole config folder as the apiserver service does:
- /opt/clearml/config:/opt/clearml/config
The second way is probably more flexible.
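A sketch of how the async_delete service could look with the folder mapping added (all other settings omitted):
  async_delete:
    volumes:
      - /opt/clearml/config:/opt/clearml/config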
@<1523701868901961728:profile|ReassuredTiger98> Strange :( In 1.10 we already had the code for clearing ES scrolls created during task deletion. I would recommend upgrading to the latest release v1.12.1 anyway. In addition, you can instruct ES to allow more open scrolls, as shown below. By default it is limited to 500.
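For example, raising the limit to 1000 via the dynamic search.max_open_scroll_context cluster setting (the value here is just an example):
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "persistent": {"search.max_open_scroll_context": 1000}
}'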
Hi @<1585078752969232384:profile|FantasticDuck7>, there is an apiserver configuration file: apiserver->config->default->services->storage_credentials.conf
It contains the parameters for accessing files on external storage like s3, google or azure. Please provide the same minio server access parameters there as you do for the SDK configuration.
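A minimal sketch of what the minio credentials might look like there, assuming the file mirrors the SDK's aws.s3 layout (the host, key and secret values are placeholders):
aws {
    s3 {
        credentials: [
            {
                host: "my-minio-host:9000"
                key: "minio-access-key"
                secret: "minio-secret-key"
                secure: false
            }
        ]
    }
}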
The actual deletion is performed by the async_delete service. You can inspect its logs with the "sudo docker logs async_delete" command. Before configuring...
Can you run 'ls -al' in the /opt/trains/data folder and also in the /opt/trains/data/elastic_7 folder and send the output?
Hi UnevenDolphin73, how many artifacts do you have on this task? We store task metadata in Mongo, and there is a limit of 16Mb per single document. While the artifact itself is not stored under the task, there is some metadata (notably the uri and display_data/preview) that is stored for each artifact.
Hi ExasperatedCrocodile76, what version of the clearml server are you using? You can see it in the bottom right corner of the Settings screen.
Hi IdealPanda97, can you please check your available disk space and available RAM? According to the logs, all the services (Elastic, Mongo, Redis) fail to start.
Hi Elior, chances are that you do not have enough space for Elasticsearch on your storage. Please check the ES logs and increase the available disk space.
Actually, the task logs will be lost. The tasks themselves and their reported metrics and plots will stay. The command is the following:
curl -XDELETE localhost:9200/events-log-d1bd92a3b039400cbafc60a7a5b1e52b
It seems that index events-log-d1bd92a3b039400cbafc60a7a5b1e52b got corrupted. In case there are no backups, the only choice would be to delete this index from Elasticsearch.
Hi RattyFish27, it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here?
curl -XGET
curl -XGET
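The usual cluster diagnostics would be something like this (assuming the default localhost:9200 binding):
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/nodes?v'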
Are you running them on the computer that hosts the server docker containers? What is the port binding for elasticsearch in your docker compose?
Ok, it seems that the elasticsearch ports are open for internal communication but not for the host. Can you please add the following section to the elasticsearch service in the docker compose and restart the dockers?
ports:
  - "9200:9200"
After that the commands should work from the host.
👍 I would say either deploying an elasticsearch cluster consisting of several nodes with replication, or doing daily backups:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/snapshot-restore.html
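As an illustration, registering a filesystem snapshot repository looks like this (the repository name and path are placeholders, and the path must also be listed under path.repo in elasticsearch.yml):
curl -XPUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {"location": "/mnt/es_backups"}
}'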
Apart from that, it is worth making sure that ES is running in a stable environment (no abrupt restarts) and with enough RAM.
No, there was a problem with that particular version migration. The temporary index creation allowed this and all subsequent migrations to run successfully. So for now your DB is properly aligned with the latest ClearML, and future upgrades should work fine.
The data that you sent looks fine. It seems that you actually have these iterations in Elasticsearch. To check whether this is the case, please run the following command in the shell on your host. You should get the first 10 task events with the smallest iterations:
curl -XGET -H "Content-Type: application/json" 'localhost:9200/events-training_stats_scalar*/_search?pretty' -d'
{
  "query": {"term": {"task": "d45ecb5ad7084175bd83dd39777b10c5"}},
  "sort": {"iter": "asc"}
}'
Hi MassiveHippopotamus56
Can you please open the browser developer tools, navigate to the scalars tab for one of the experiments that shows the wrong iteration, and copy here the request payload and response for the events.scalar_metrics_iter_histogram call?
Hi VexedPeacock35, I suspect that Elasticsearch is working too hard and periodically misses timeouts when recording events. How much memory and CPU is it using? Can you increase the memory allocated to it and see whether this helps?
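For example, the ES heap can be raised through the ES_JAVA_OPTS environment variable in the docker compose (2g is just an example; a common rule of thumb is to keep the heap at no more than half of the memory available to the container):
    environment:
      - ES_JAVA_OPTS=-Xms2g -Xmx2g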
Hi CooperativeFox72, there was a typo in the index creation instructions ("comapny" instead of "company"). Please try the following sequence in the mongo shell and then start the apiserver:
use auth
db.user.createIndex({"name": 1, "company": 1})
Are you sure that it was performed fully according to the suggested sequence? The error that you posted says that v3.6 data is incompatible with v4.4 and suggests version 4.2 or earlier. Step 3 starts with mongo 4.0, which should be able to open v3.6 data. Then a number of gradual upgrades through versions 4.0->4.2->4.4 is performed.
I mean it is not possible to open v3.6 data directly in version 4.4. That's why steps 3-10 are there.