What would be the names of these vars?
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified number_of_shards: 4 for the events-log-d1bd92a3b039400cbafc60a7a5b1e52b index. I then copied the documents from the old ES cluster using the _reindex API. On the old cluster, the index is 7.5GB on a single shard.
Now I see that this index on the new ES cluster is ~19.4GB 🤔 The index is divided across the 4 shards, but each shard is between 4.7GB and 5GB!
I was expecting to have the same index size as in the previous e...
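For reference, here is roughly how I did the copy (a sketch only: the index name and number_of_shards: 4 are the real ones, but the cluster URLs, number_of_replicas and wait_for_completion are placeholders/assumptions):
` import requests

OLD_ES = "http://old-es:9200"   # placeholder address of the old cluster
NEW_ES = "http://new-es:9200"   # placeholder address of the new cluster
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# 1. create the destination index with 4 primary shards
requests.put(
    f"{NEW_ES}/{INDEX}",
    json={"settings": {"number_of_shards": 4, "number_of_replicas": 0}},
).raise_for_status()

# 2. copy the documents from the old cluster with the _reindex API
#    (reindex-from-remote also requires reindex.remote.whitelist on the new cluster)
requests.post(
    f"{NEW_ES}/_reindex",
    params={"wait_for_completion": "false"},
    json={
        "source": {"remote": {"host": OLD_ES}, "index": INDEX},
        "dest": {"index": INDEX},
    },
).raise_for_status() `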
I managed to do it by using logger.report_scalar, thanks!
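In case it helps anyone else, this is more or less the call that did it (a minimal sketch; the project/task names and values are made up):
` from clearml import Task

task = Task.init(project_name="examples", task_name="report-scalar-demo")
logger = task.get_logger()

# report_scalar(title, series, value, iteration) adds one point to a scalar plot
for i in range(10):
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i) `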
Would you like me to open an issue for that or will you fix it?
And I can verify that ~/trains.conf exists in the su home folder
But you might want to double check
Actually I think I am approaching the problem from the wrong angle
Doing it the other way around works:
` from omegaconf import OmegaConf

cfg = OmegaConf.create(read_yaml(conf_yaml_path))  # read_yaml / conf_yaml_path are defined elsewhere
config = task.connect(cfg)  # connect the OmegaConf object to the ClearML task
type(config)
# -> <class 'omegaconf.dictconfig.DictConfig'> `
That said, v1.3.1 is already out, with what seems like a fix:
So you mean 1.3.1 should fix this bug?
Erf, I have the same problem with ProxyDictPreWrite 😄 What is the use case of this one?
CostlyOstrich36 super, thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied), or just referenced?
This works well when I run the agent in virtualenv mode (removing --docker).
This one doesn't have _to_dict, unfortunately.
Not really, because this is difficult to control: I use the AWS autoscaler with an Ubuntu AMI, and when an instance is created its packages are updated, so I don't know which Python version I'll get. Also, changing the Python version of the OS is not really recommended.
I mean that I have a taskA (controller) that is in charge of creating a taskB with the same argv parameters (I just change the entry point of taskB)
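Roughly what I mean, as a sketch (project/task names, entry point and queue are placeholders, and I'm assuming Task.clone / set_script / enqueue for this):
` from clearml import Task

controller = Task.init(project_name="examples", task_name="taskA-controller")

# clone the controller so taskB inherits its configuration, including the argv/Args section
task_b = Task.clone(source_task=controller, name="taskB")

# only the entry point changes; repository, branch and parameters stay as they were
task_b.set_script(entry_point="task_b.py")

# hand taskB to an agent through a queue
Task.enqueue(task_b, queue_name="default") `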
on /data or /opt/clearml? these are two different disks
I created a snapshot of both disks
line 13 is empty 🤔
my docker-compose for the master node of the ES cluster is the following:
` version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
Does the agent install the NVIDIA Container Toolkit, so that the GPUs of the instance can be accessed from inside the Docker container running JupyterLab?
Sorry, I meant the clearml-session. The error is the one I shared at the beginning of this thread.
Yes I did, and I found the problem: docker-compose was still using trains-server 0.15 because it didn't pick up the new trains-server version, hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly 😄
continue_last_task is almost what I want; the only problem with it is that it will continue the task even if that task has already completed.
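What I'm after is something like this sketch (project/task names are placeholders, and I'm assuming Task.get_tasks / get_status here): only continue the last run if it hasn't completed yet:
` from clearml import Task

# look up previous runs of the same task (ordering isn't guaranteed, so this is only a sketch)
previous = Task.get_tasks(project_name="examples", task_name="my-task")
last = previous[-1] if previous else None

if last is not None and last.get_status() != "completed":
    # the last run didn't finish: pick up where it left off
    task = Task.init(project_name="examples", task_name="my-task", continue_last_task=True)
else:
    # otherwise start a fresh task
    task = Task.init(project_name="examples", task_name="my-task") `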
Nothing wrong on the ClearML side 😄
Alright, thanks for the answer! Seems legit then 😄
TimelyPenguin76, no, I've only set the sdk.aws.s3.region = eu-central-1 param.
, causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down? Or does the server infer that the agent is down after a timeout?