Would adding an ILM (index lifecycle management) policy be an appropriate solution?
ha sorry it’s actually the number of shards that increased
Here are the data disk (/opt/clearml) on the left and the OS disk on the right
I cannot ssh into the machine
That's very strange - since the server runs in docker, I don't see how it can cause the EC2 instance to be unavailable - can you check the EC2 panel to see what might be the problem?
If I understood correctly, setting
index.number_of_shards = 2
(instead of 1) would create a second shard for the large index, splitting it into two shards? This seems to say that it's not possible to change this value after the index is created - is that true?
Well, as long as you're using a single node, it should indeed alleviate the shard disk size limit, but I'm not sure ES will handle that too well. In any case, you can't change that for existing indices, but you can modify the mapping template and reindex the existing index (you'll need to reindex to another name, delete the original, and create an alias with the original name, since the new index can't be renamed...)
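The reindex-and-alias flow described above can be sketched as a sequence of ES REST calls. This only builds and prints the request bodies rather than sending them, and the index names are placeholders (ClearML's actual event index names differ - check yours with GET /_cat/indices first):

```python
import json

# Placeholder index names - substitute your actual large ClearML index
OLD_INDEX = "events-log-old"
NEW_INDEX = "events-log-new"

# 1. Create the new index with two primary shards
create_body = {"settings": {"index": {"number_of_shards": 2,
                                      "number_of_replicas": 0}}}

# 2. Copy all documents into it with the _reindex API
reindex_body = {"source": {"index": OLD_INDEX},
                "dest": {"index": NEW_INDEX}}

# 3. After deleting the old index, alias the original name to the new index
#    so the apiserver keeps finding it under the old name
alias_body = {"actions": [{"add": {"index": NEW_INDEX, "alias": OLD_INDEX}}]}

for step, body in [("PUT    /" + NEW_INDEX, create_body),
                   ("POST   /_reindex", reindex_body),
                   ("DELETE /" + OLD_INDEX, None),
                   ("POST   /_aliases", alias_body)]:
    print(step, "" if body is None else json.dumps(body))
```

Note the ordering: the alias can only be created after the original index is deleted, because an alias can't share a name with an existing index.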
Also what is the benefit of having by default
index.number_of_shards = 1
for the metrics and the logs indices? Having more would allow scaling, and later moving them to separate nodes if needed - the default heap size being 2GB, it should be possible, no?
Well, as long as you use a single node, multiple shards offer no scale improvement
Well, currently the open source clearml apiserver depends on the index names in order to retrieve information for experiments, so manipulating these indices externally won't work. What you can do is use an ES cluster and change the index mappings to more than one shard, which would split the index into multiple parts (the size limit is actually per shard, not per index). This is all stuff addressed by the paid version, I think
how would it interact with the clearml-server api service? would it be completely transparent?
There's a reason for the ES index max size 😞
Also what is the benefit of having by default index.number_of_shards = 1
for the metrics and the logs indices? Having more would allow scaling, and later moving them to separate nodes if needed - the default heap size being 2GB, it should be possible, no?
Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?
The open-source version doesn't enforce any max size - it just keeps indexing
maybe a merge operation from ES
Might be
Again, it's possible - it's also possible that it takes so much memory the system is relying heavily on the swap
Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
but according to the disk graphs, the OS disk is being used, but not the data disk
Thanks for the hint, I'll check the paid version, but I'd first like to understand how much effort it would be to fix the current situation by myself 🙂
In any case, restarting the instance without shutting down the server in an orderly fashion always has the risk of damaging the database storage (mongo/elastic etc.)
Seems like it just went unresponsive at some point
SuccessfulKoala55 For the past 2 hours I've been getting 504 errors and I cannot ssh into the machine. AWS reports that instance health checks fail. Is it safe to restart the instance?
There’s a reason for the ES index max size
Does ClearML enforce a max index size? what typically happens when that limit is reached?
Is there a way to break down all the documents to identify the biggest ones?
In the case of scalars, they're all roughly the same size - it's only a matter of which task reported more, so an aggregation by task_id would help you figure out which tasks are more costly
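A minimal sketch of such an aggregation body, assuming the task id is stored in a field named task (inspect a single document in your events index to confirm the exact field name). It only prints the query; send it to your events index's _search endpoint:

```python
import json

# Terms aggregation: document count per task id, largest buckets first
agg_query = {
    "size": 0,  # we only want the aggregation buckets, not the documents
    "aggs": {
        "docs_per_task": {
            # "task" is an assumed field name - verify against a real document
            "terms": {"field": "task", "size": 50, "order": {"_count": "desc"}}
        }
    },
}
print(json.dumps(agg_query, indent=2))
```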
Is there a way to delete several :monitor:gpu and :monitor:machine time series?
Yes, these contain specific metric and variant document fields (you can look at a single document to figure out what they are), so an ES _delete_by_query request can be used to remove all documents containing these scalars. Remember, however, that _delete_by_query is performance-intensive, so it will probably take much more time than simply deleting documents.
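A sketch of such a _delete_by_query body, assuming the series name lives in a keyword field named metric - verify the field name on a real document and dry-run the same query through _search first, since the deletion is irreversible:

```python
import json

# Match every document belonging to the machine-monitoring scalar series
delete_query = {
    "query": {
        # "metric" is an assumed field name - confirm it before deleting
        "terms": {"metric": [":monitor:gpu", ":monitor:machine"]}
    }
}
# POST this to <your-events-index>/_delete_by_query
print(json.dumps(delete_query))
```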
Is there a way to downsample some time series (e.g. loss)?
Well, in this context, down-sampling a specific time-series is either:
- Removing specific documents from that series, OR
- Reading all series documents in a script, down-sampling in memory, writing new documents for the new values, and deleting the old documents (either by query or by ID)
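The second option can be sketched in memory like this; the document shape ({"iter": ..., "value": ...}) is a simplified stand-in for real ClearML scalar events, and in practice you'd page through the documents with the ES scroll or search_after API rather than holding a toy list:

```python
def downsample(docs, keep_every=10):
    """Sort by iteration, keep every Nth document, return (kept, to_delete)."""
    docs = sorted(docs, key=lambda d: d["iter"])
    kept = docs[::keep_every]
    to_delete = [d for d in docs if d not in kept]
    return kept, to_delete

# Toy series of 100 loss values standing in for fetched ES documents
series = [{"iter": i, "value": 1.0 / (i + 1)} for i in range(100)]
kept, to_delete = downsample(series, keep_every=10)
print(len(kept), len(to_delete))  # 10 kept, 90 queued for deletion by ID
```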
SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
Is there a way to break down all the documents to identify the biggest ones? Is there a way to delete several :monitor:gpu and :monitor:machine time series? Is there a way to downsample some time series (e.g. loss)?
Well, as long as you're using a single node, it should indeed alleviate the shard disk size limit, but I'm not sure ES will handle that too well. In any case, you can't change that for existing indices, but you can modify the mapping template and reindex the existing index (you'll need to reindex to another name, delete the original, and create an alias with the original name, since the new index can't be renamed...)
Ok thanks!
Well, as long as you use a single node, multiple shards offer no scale improvement
But you can move shards from one node to another much more easily than you can change the number of shards of an index, so that's an advantage
can it be that the merge op takes so much filesystem cache that the rest of the system becomes unresponsive?
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2
(instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it's not possible to change this value after the index is created - is that true?
And the ES behavior always depends on the machine and memory - I've seen machines that could handle 100GB indices 🙂