
Ha sorry, it's actually the number of shards that increased
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
Sure, it's because of a very annoying bug that I shared in this https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , which I haven't been able to solve so far.
I'm not sure you can downgrade that easily ...
Yea, that's what I thought. That's a bit of a pain for me now; I hope I can find a way to fix the bug somehow
So I updated the config with:
`
resource_configurations {
  A100 {
    instance_type = "p3.2xlarge"
    is_spot = false
    availability_zone = "us-east-1b"
    ami_id = "ami-04c0416d6bd8e4b1f"
    ebs_device_name = "/dev/xvda"
    ebs_volume_size = 100
    ebs_volume_type = "gp3"
    key_name = "<my-key-name>"
    security_group_ids = ["<my-sg-id>"]
    subnet_id = "<my-subnet-id>"
  }
}
`
but I get this in the autoscaler logs:
` Warning! exception occurred: An error occurred (InvalidParam...
in the UI the value is the correct one (not empty, a string)
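In case it helps with debugging: one way to find out which parameter EC2 rejects is a dry-run launch from the shell with the same values as the config above. Just a sketch; the <...> placeholders are the same ones as in my config:
`
# Dry-run an instance launch with the same parameters as the autoscaler config.
# A dry run validates everything server-side without starting an instance:
# if the parameters are fine it fails with "DryRunOperation", otherwise it
# prints the same InvalidParameter* error with the offending field spelled out.
aws ec2 run-instances \
  --dry-run \
  --image-id ami-04c0416d6bd8e4b1f \
  --instance-type p3.2xlarge \
  --placement AvailabilityZone=us-east-1b \
  --key-name <my-key-name> \
  --security-group-ids <my-sg-id> \
  --subnet-id <my-subnet-id> \
  --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=100,VolumeType=gp3}'
`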
In the comparison the problem will be the same, right? If I choose last/min/max values, it won't show me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of compared experiments grows
Ok, so after updating to trains==0.16.2rc0, my problem is different: when I clone a task, update its script and enqueue it, it does not have any Hyper-parameters/argv section in the UI
Should I try to roll back to clearml-server 1.0.2? I am very anxious now…
I've reindexed the data for the logs; the mappings are now correct, but I am missing one month of data. I have literally no idea where this data went or how it disappeared
I've set dynamic: "strict" in the template of the logs index and I was able to keep the same mapping after doing the reindex
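For anyone hitting the same thing, a minimal sketch of that template change through the REST API; the template name, index pattern and field here are placeholders, not the actual ClearML ones:
`
# ES 7.6 still uses the legacy _template endpoint.
# "dynamic": "strict" makes ES reject documents that introduce unmapped
# fields instead of silently widening the mapping.
curl -X PUT "localhost:9200/_template/<logs-template>" \
  -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["<logs-index-pattern>-*"],
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}'
`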
the reindexing operation showed no error and copied everything
Afterwards I reindexed only the logs to a new index; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
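For reference, the reindex itself is one REST call (index names here are placeholders):
`
# Kick off the copy asynchronously and get a task id back instead of blocking.
curl -X POST "localhost:9200/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "source": { "index": "<old-metrics-index>" },
  "dest":   { "index": "<new-metrics-index>" }
}'

# Follow the progress of the running reindex task.
curl "localhost:9200/_tasks?detailed=true&actions=*reindex"
`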
SuccessfulKoala55 I am using ES 7.6.2
I made sure before deleting the old index that the number of docs matched
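A quick way to do that check (placeholder index names):
`
# _cat/count prints the document count per index; run it for both and compare.
curl "localhost:9200/_cat/count/<old-index>?v"
curl "localhost:9200/_cat/count/<new-index>?v"
`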
I see that I have several volumes:
`
$ docker volume ls
DRIVER    VOLUME NAME
local     5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local     43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local     1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local     7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local     7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local     a55cb71edff48c2138a5da9d8d1e26df3b...
Ok I have a very different problem now: I did the following to restart the ES cluster:
`
docker-compose down
docker-compose up -d
`
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, as it had always done so far.
AppetizingMouse58 btw I had to delete the old logs index before creating the alias, otherwise ES won't let me create an alias with the same name as an existing index
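Concretely, something like this, with placeholder index names:
`
# The alias can only be created once no index with the same name exists.
curl -X DELETE "localhost:9200/<old-logs-index>"

# Point the old name at the new index so the apiserver keeps working unchanged.
curl -X POST "localhost:9200/_aliases" \
  -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "<new-logs-index>", "alias": "<old-logs-index>" } }
  ]
}'
`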
Amazon Linux
Now I am trying to restart the cluster with docker-compose while specifying the last volume; how can I do that?
So it could be that when restarting with docker-compose, it used another volume, hence the loss of data
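To figure out which of the anonymous volumes actually holds the ES data, I think inspecting creation times and contents is the way to go; a sketch, assuming the default Docker data root:
`
# Show name, creation time and host mountpoint for every volume,
# to spot which anonymous volume was in use before the restart.
docker volume inspect --format '{{.Name}} {{.CreatedAt}} {{.Mountpoint}}' \
  $(docker volume ls -q)

# An Elasticsearch 7.x data volume contains a nodes/0/indices directory;
# peeking inside a candidate confirms it.
sudo ls /var/lib/docker/volumes/<volume-id>/_data/nodes/0/indices
`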
my docker-compose for the master node of the ES cluster is the following:
`
version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process
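If it is safe, I guess the change itself would just be a dynamic settings update like this (placeholder index name), reverted once the reindex is done:
`
# Both settings are dynamic, so no restart is needed.
# number_of_replicas=0 skips replicating every indexed document;
# refresh_interval=-1 disables refreshes while bulk-loading.
curl -X PUT "localhost:9200/<dest-index>/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index": { "number_of_replicas": 0, "refresh_interval": "-1" }
}'

# Restore the defaults once the reindex has finished.
curl -X PUT "localhost:9200/<dest-index>/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index": { "number_of_replicas": 1, "refresh_interval": "1s" }
}'
`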
well I still see some ES errors in the logs
` clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not...
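To see which primaries are unassigned and why, these are the two calls I'd check (the index pattern matches the one from the error):
`
# List shard states for the failing index; UNASSIGNED primaries explain
# why events.add_batch returns 500.
curl "localhost:9200/_cat/shards/events-training_stats_scalar-*?v"

# Ask the cluster why a shard is not being allocated.
curl -X GET "localhost:9200/_cluster/allocation/explain" \
  -H 'Content-Type: application/json'
`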
Maybe the agent could be adapted to have a max_batch_size parameter?
but not as much as the ELB reports
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only deleted ~20M documents out of 320M in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, so I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
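Something like a terms aggregation is what I have in mind, assuming each event document carries a task field with the experiment id (I'm not sure that's the actual field name):
`
# Count documents per experiment (task id) in the debug image events index.
# size: 0 skips returning hits; the terms agg returns the top 50 task ids
# by document count.
curl -X GET "localhost:9200/events-training_debug_image-xyz/_search" \
  -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "docs_per_task": { "terms": { "field": "task", "size": 50 } }
  }
}'

# Then the heaviest experiments could be cleaned up one by one.
curl -X POST "localhost:9200/events-training_debug_image-xyz/_delete_by_query" \
  -H 'Content-Type: application/json' -d'
{
  "query": { "term": { "task": "<task-id>" } }
}'
`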