Answered
Hi, I Am Getting The Following Errors In The Experiments I Am Currently Running:

Hi, I am getting the following errors in the experiments I am currently running:
` 2021-06-25 17:11:47,911 - clearml.Metrics - ERROR - Action failed <504/0: events.add_batch (<html> <head><title>504 Gateway Time-out</title></head> <body> <center><h1>504 Gateway Time-out</h1></center> </body> </html> )> `
I haven’t changed anything on the server side. Do you have any idea when such an error can appear?

  
  
Posted 2 years ago

Answers 30


4 cpus, 8Gb

  
  
Posted 2 years ago

I cannot ssh into the machine

That's very strange - since the server runs in docker, I don't see how it can cause the EC2 instance to be unavailable - can you check the EC2 panel to see what might be the problem?

  
  
Posted 2 years ago

In any case, restarting the instance without shutting down the server in an orderly fashion always has the risk of damaging the database storage (mongo/elastic etc.)

  
  
Posted 2 years ago

Seems like it just went unresponsive at some point

  
  
Posted 2 years ago

Would adding a ILM (index lifecycle management) be an appropriate solution?

  
  
Posted 2 years ago

maybe a merge operation from ES

Might be

  
  
Posted 2 years ago

but according to the disks graphs, the OS disk is being used, but not the data disk

  
  
Posted 2 years ago

There's a reason for the ES index max size 😞

  
  
Posted 2 years ago

Thanks for the hint, I’ll check the paid version, but first I’d like to understand how much effort it would take to fix the current situation myself 🙂

  
  
Posted 2 years ago

SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
Is there a way to break down all the documents to identify the biggest ones?
Is there a way to delete several :monitor:gpu and :monitor:machine time series?
Is there a way to downsample some time series (e.g. loss)?

  
  
Posted 2 years ago

Well, currently the open source clearml apiserver depends on the index names in order to retrieve information for experiments, so manipulating these indices externally won't work. What you can do is use an ES cluster and change the index mappings to more than one shard, which would split the index into multiple parts (the size limit is actually per shard, not index). This is all stuff addressed by the paid version, I think

  
  
Posted 2 years ago

SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it’s not possible to change this value after index creation. Is that true?

  
  
Posted 2 years ago

Again, it's possible - it's also possible that it takes so much memory the system is relying heavily on the swap

  
  
Posted 2 years ago

SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot ssh into the machine. AWS reports that the instance health checks are failing. Is it safe to restart the instance?

  
  
Posted 2 years ago

from 10 to 11

  
  
Posted 2 years ago

Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and the logs indices? Having more shards would allow scaling and later moving them to separate nodes if needed - with the default heap size being 2Gb, that should be possible, shouldn't it?

  
  
Posted 2 years ago

Well, as long as you’re using a single node, it should indeed alleviate the shard disk size limit, but I’m not sure ES will handle that too well. In any case, you can’t change that for existing indices. Instead, you can modify the mapping template and reindex the existing index (you’ll need to reindex to another name, delete the original and create an alias with the original name, as the new index can’t be renamed...)

Ok thanks!

Well, as long as you use a single node, multiple shards offer no scale improvement

But moving shards from one node to another is much easier than changing the number of shards of an index, so that is an advantage

  
  
Posted 2 years ago

Is there a way to break down all the documents to identify the biggest ones?

In case of scalars, they're all roughly the same, it's only a matter of which task reported more, so an aggregation by task_id would help you in figuring out which tasks are more costly
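As a rough sketch of the aggregation mentioned above, the request body for an ES terms aggregation could be built like this. The field name `task` and the placeholder index name are assumptions (look at a single document to confirm which field holds the task ID), and nothing here contacts a server; it only prints the JSON body.

```python
import json


def docs_per_task_query(task_field="task", top_n=20):
    """Build an ES search body that counts event documents per task.

    `task_field` is an assumed field name; check a real document first.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "by_task": {
                # terms buckets are ordered by document count, largest first
                "terms": {"field": task_field, "size": top_n}
            }
        },
    }


body = docs_per_task_query()
print(json.dumps(body, indent=2))
```

The body would be sent as POST <events-index>/_search; the buckets with the highest doc_count point to the tasks that reported the most scalars.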

Is there a way to delete several :monitor:gpu and :monitor:machine time series?

Yes, these contain specific metric and variant document fields (you can look at a single document to figure out what they are), so an ES _delete_by_query request can be used to remove all documents containing these scalars. Remember however, that _delete_by_query is performance-intensive, so it will probably take much more time than simply deleting documents.
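A minimal sketch of such a _delete_by_query body, assuming the documents carry a keyword field named `metric` holding values like :monitor:gpu, and a `task` field for the task ID (both names are assumptions to verify against a real document). The code only builds and prints the body, it does not send anything.

```python
import json


def monitor_delete_query(metrics=(":monitor:gpu", ":monitor:machine"), task_id=None):
    """Build a _delete_by_query body matching the given metric names.

    Field names `metric` and `task` are assumptions, not confirmed API.
    """
    filters = [{"terms": {"metric": list(metrics)}}]
    if task_id is not None:
        # optionally restrict the deletion to a single task
        filters.append({"term": {"task": task_id}})
    return {"query": {"bool": {"filter": filters}}}


delete_body = monitor_delete_query()
print(json.dumps(delete_body, indent=2))
```

The body would be sent as POST <events-index>/_delete_by_query. Running the same body against _search or _count first is a cheap way to check what would be removed, and a backup of the data directory beforehand is probably wise.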

Is there a way to downsample some time series (e.g. loss)?

Well, in this context, down-sampling a specific time-series is either:
Removing specific documents from that series, OR
Reading all series documents in a script, down-sampling in memory, writing new documents for the new values and deleting old documents (either by query or by ID)
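The first option (removing specific documents) could look roughly like the sketch below: keep every N-th point of a series and collect the IDs of the rest for deletion. The document shape (`_id`, `iter`, `value`) is hypothetical and only used for illustration; reading the documents from ES and deleting them afterwards is left out.

```python
def downsample(docs, keep_every=10):
    """Split a series into documents to keep and IDs to delete.

    Keeps every `keep_every`-th point (ordered by iteration) plus the
    last point, so the end of the curve is preserved.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    ordered = sorted(docs, key=lambda d: d["iter"])
    last = len(ordered) - 1
    kept, dropped_ids = [], []
    for i, doc in enumerate(ordered):
        if i % keep_every == 0 or i == last:
            kept.append(doc)
        else:
            dropped_ids.append(doc["_id"])
    return kept, dropped_ids


# Hypothetical series of 25 scalar documents
series = [{"_id": f"doc{i}", "iter": i, "value": 1.0 / (i + 1)} for i in range(25)]
kept, dropped_ids = downsample(series, keep_every=10)
print([d["iter"] for d in kept])  # [0, 10, 20, 24]
print(len(dropped_ids))           # 21
```

The returned IDs could then be deleted by ID or by query, as described above.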

  
  
Posted 2 years ago

Here is the data disk (/opt/clearml) on the left and the OS disk on the right

  
  
Posted 2 years ago

11

  
  
Posted 2 years ago

The open-source version doesn't enforce any max size - it just keeps indexing

  
  
Posted 2 years ago

There’s a reason for the ES index max size

Does ClearML enforce a max index size? What typically happens when that limit is reached?

  
  
Posted 2 years ago

Can it be that the merge operation takes up so much filesystem cache that the rest of the system becomes unresponsive?

  
  
Posted 2 years ago

Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?

  
  
Posted 2 years ago

And the ES behavior always depends on the machine and memory - I've seen machines that could handle 100GB indices 🙂

  
  
Posted 2 years ago

ha sorry it’s actually the number of shards that increased

  
  
Posted 2 years ago

If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it’s not possible to change this value after the index creation, is it true?

Well, as long as you're using a single node, it should indeed alleviate the shard disk size limit, but I'm not sure ES will handle that too well. In any case, you can't change that for existing indices. Instead, you can modify the mapping template and reindex the existing index (you'll need to reindex to another name, delete the original and create an alias with the original name, as the new index can't be renamed...)

Also what is the benefit of having by default index.number_of_shards = 1 for the metrics and the logs indices? Having more allows to scale and later move them in separate nodes if needed - the default heap size being 2Gb, it should be possible, or?

Well, as long as you use a single node, multiple shards offer no scale improvement
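To make the reindex-and-alias procedure described above more concrete, here is a hedged sketch that only lists the ES REST calls in order, without sending them. The index names are hypothetical placeholders, and it assumes the new index name still matches the modified template so that the mappings are applied on creation.

```python
import json


def reindex_plan(old_index, new_index, shards=2):
    """Return the ordered (method, path, body) calls for the procedure."""
    return [
        # 1. Create the target index with the desired number of shards
        ("PUT", f"/{new_index}",
         {"settings": {"index": {"number_of_shards": shards}}}),
        # 2. Copy all documents from the old index into the new one
        ("POST", "/_reindex",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # 3. Delete the original index (only after verifying the copy)
        ("DELETE", f"/{old_index}", None),
        # 4. Expose the new index under the original name via an alias
        ("POST", "/_aliases",
         {"actions": [{"add": {"index": new_index, "alias": old_index}}]}),
    ]


# Hypothetical index names, for illustration only
plan = reindex_plan("events-old", "events-old-v2", shards=2)
for method, path, body in plan:
    print(method, path, json.dumps(body) if body is not None else "")
```

The alias can only be created once the original index is gone, since an alias cannot share a name with an existing index. The server should probably be stopped and the data backed up before trying this.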

  
  
Posted 2 years ago

Number of shards?

  
  
Posted 2 years ago

Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)

  
  
Posted 2 years ago

how would it interact with the clearml-server api service? would it be completely transparent?

  
  
Posted 2 years ago