Well, as long as you’re using a single node, it should indeed alleviate the shard disk size limit, but I’m not sure ES will handle that too well. In any case, you can’t change that for existing indices - you can modify the mapping template and reindex the existing index (you’ll need to index into another name, delete the original, and create an alias with the original name, since the new index can’t be renamed...)
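For reference, a minimal sketch of that reindex-and-alias flow with the Python elasticsearch client - the endpoint and index names here are placeholders, not the actual ClearML index names:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder ES endpoint

# Copy all documents from the old index into a new one created from the updated template
es.reindex(
    body={"source": {"index": "my_old_index"}, "dest": {"index": "my_new_index"}},
    wait_for_completion=True,
)

# Delete the original and point an alias with the original name at the new index,
# since the new index itself can't be renamed
es.indices.delete(index="my_old_index")
es.indices.put_alias(index="my_new_index", name="my_old_index")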
Ok thanks!
Well, as long as you use a single node, multiple shards offer no sca...
Could it be that the merge op takes up so much filesystem cache that the rest of the system becomes unresponsive?
The number of documents in the old and the new env is the same though 🤔 I really don’t understand where this extra used space comes from
Here is the data disk (/opt/clearml) on the left, and the OS disk on the right
it also happens without hitting F5 after some time (~hours)
Here is the console with some errors
Yes, I set:
auth {
    cookies {
        httponly: true
        secure: true
        domain: ".clearml.xyz.com"
        max_age: 99999999999
    }
}
It always worked for me this way
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
If I want to resume a training on multiple GPUs, I will need to call this function in each process to send the weights to each GPU
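Something like this generic PyTorch DDP resume sketch is what I have in mind (just an illustration, not the exact function mentioned above - names are made up):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def resume_on_each_rank(model, checkpoint_path, local_rank):
    # Runs in every process: load the checkpoint onto this process's own GPU
    dist.init_process_group(backend="nccl")
    device = torch.device(f"cuda:{local_rank}")
    state = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state["model"])
    model.to(device)
    # Wrapping in DDP broadcasts rank 0's weights so all replicas start in sync
    return DDP(model, device_ids=[local_rank])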
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2
(instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it’s not possible to change this value after the index has been created - is that true?
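(For what it’s worth, ES does have a _split API for after-the-fact splitting - a rough sketch with the Python client, index names being placeholders: block writes on the source, then split into a new index with more shards.)

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# The source index must be made read-only before it can be split
es.indices.put_settings(index="my_big_index", body={"index.blocks.write": True})

# Split into a new index with 2 primary shards (must be a multiple of the original count)
es.indices.split(
    index="my_big_index",
    target="my_big_index_split",
    body={"settings": {"index.number_of_shards": 2}},
)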
Would adding an ILM (index lifecycle management) policy be an appropriate solution?
Ha nice, makes perfect sense thanks AgitatedDove14 !
AgitatedDove14 I made some progress:
In the clearml.conf of the agent, I set: sdk.development.report_use_subprocess = false
(because I had the feeling that Task._report_subprocess_enabled = False wasn’t taken into account). I’ve also set task.set_initial_iteration(0)
Now I was able to get the following graph after resuming -
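In code it boils down to something like this (project/task names are placeholders, and continue_last_task is just how I’m resuming the run):

from clearml import Task

# Resume the previously created task instead of starting a new one
task = Task.init(
    project_name="my_project",
    task_name="my_training",
    continue_last_task=True,
)

# Report scalars starting from iteration 0 rather than offsetting by the previous run
task.set_initial_iteration(0)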
SuccessfulKoala55
In the docker-compose file, the apiserver service has environment settings for the Elasticsearch host and port (CLEARML_ELASTIC_SERVICE_HOST and CLEARML_ELASTIC_SERVICE_PORT) - changing those will allow you to point the server to another ES service
The ES cluster is running on another machine - how can I set its IP in CLEARML_ELASTIC_SERVICE_HOST? Would I need to somehow add the host to the networks of the apiserver service? How can I do that?
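Something like this in the apiserver service of the docker-compose file is what I’m picturing (the IP is a placeholder for the external ES machine):

  apiserver:
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: 10.0.0.12   # placeholder IP of the external ES machine
      CLEARML_ELASTIC_SERVICE_PORT: 9200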
ha sorry it’s actually the number of shards that increased
I am not sure I can do both operations at the same time (migration + splitting) - do you think it’s better to do the splitting first or the migration first?
Changing redis from version 6.2 to 6.2.11 fixed it, but I have new issues now 😄
Nevermind, the nvidia-smi command fails in that instance - the problem lies somewhere else
Still failing with the same error 😞
I now have a different question: when installing torch from wheel files, I am guaranteed to get the corresponding CUDA libraries and cuDNN bundled with it, right?
I am still confused though - on the Get Started page of the PyTorch website, when choosing "conda" the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (the CUDA runtime)?
alright I am starting to get a better picture of this puzzle
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch
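A quick way to double-check what the pip wheel actually ships with (only the NVIDIA driver still has to come from the host):

import torch

print(torch.version.cuda)               # CUDA runtime version bundled with the wheel
print(torch.backends.cudnn.version())   # cuDNN version bundled with the wheel
print(torch.cuda.is_available())        # True only if a compatible NVIDIA driver is present on the host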
Thanks SuccessfulKoala55 for the answer! One follow-up question:
When I specify:
agent.package_manager.pip_version: '==20.2.3'
in the trains.conf, I get:
trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
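For reference, this is the nested form I would expect in trains.conf (just my assumption on placement and quoting, using double quotes like the rest of the file):

agent {
    package_manager {
        # pin the pip version the agent installs into its virtualenvs
        pip_version: "==20.2.3"
    }
}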