Would adding ILM (index lifecycle management) be an appropriate solution?
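To make the question concrete, an ILM policy that rolls indices over and eventually deletes old ones could look roughly like this (the policy name, host, and thresholds below are placeholders, not something ClearML ships):

```bash
# Hypothetical rollover + retention policy (name and thresholds are placeholders)
curl -X PUT "http://localhost:9200/_ilm/policy/clearml-events-policy" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot":    {"actions": {"rollover": {"max_size": "50gb"}}},
        "delete": {"min_age": "90d", "actions": {"delete": {}}}
      }
    }
  }'
```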
Ha nice, that makes perfect sense, thanks AgitatedDove14!
AgitatedDove14 I made some progress:
- In the clearml.conf of the agent, I set sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn’t taken into account)
- I’ve set task.set_initial_iteration(0)
Now I was able to get the following graph after resuming:
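For reference, a minimal sketch of how the two pieces fit together when resuming; the project/task names are placeholders, and I'm assuming the task is resumed via continue_last_task:

```python
from clearml import Task

# In the agent's clearml.conf:
#   sdk.development.report_use_subprocess = false

# Resume reporting into the existing task instead of creating a new one
task = Task.init(
    project_name="my_project",   # placeholder
    task_name="my_experiment",   # placeholder
    continue_last_task=True,
)

# Restart the iteration counter so resumed scalars line up from 0
task.set_initial_iteration(0)
```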
Opened an issue with the logs here > None
Sure yes! As you can see, I just added the block

```yaml
logging:
  driver: "json-file"
  options:
    max-size: "200k"
    max-file: "10"
```

to all services. Also, in this docker-compose I removed the external binding of the ports for mongo/redis/es.
SuccessfulKoala55
In the docker-compose file, you have an environment setting for the apiserver service host and port (CLEARML_ELASTIC_SERVICE_HOST and CLEARML_ELASTIC_SERVICE_PORT) - changing those will allow you to point the server to another ES service
The ES cluster is running on another machine; how can I set its IP in CLEARML_ELASTIC_SERVICE_HOST? Would I need to add the host to the networks of the apiserver service somehow? How can I do that?
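Based on the answer above, I'd guess it would look something like this in the docker-compose file (host/port values are placeholders for the external ES machine; since it is reachable by plain DNS/IP, no extra network entry should be needed):

```yaml
apiserver:
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: internal-aws-host-name   # external ES host (placeholder)
    CLEARML_ELASTIC_SERVICE_PORT: "9200"
```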
Ha sorry, it’s actually the number of shards that increased.
I am not sure I can do both operations at the same time (migration + splitting). Do you think it’s better to do the splitting first or the migration first?
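If the splitting part ends up using Elasticsearch's _split API, my understanding of the flow is roughly this (index names and the target shard count are placeholders; the target must be a multiple of the source's shard count, and the source has to be made read-only first):

```bash
# Block writes on the source index (required by _split)
curl -X PUT "http://internal-aws-host-name:9200/old-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.blocks.write": true}}'

# Split into a new index with more primary shards
curl -X POST "http://internal-aws-host-name:9200/old-index/_split/new-index" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.number_of_shards": 4}}'
```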
Setting the redis version from 6.2 to 6.2.11 fixed it, but I have new issues now 😄
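For the record, the version change was just a matter of pinning the image tag in the docker-compose file (assuming the standard redis service entry):

```yaml
redis:
  image: redis:6.2.11   # pinned patch version instead of the floating 6.2 tag
```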
Never mind, the nvidia-smi command fails in that instance, so the problem lies somewhere else.
Still failing with the same error 😞
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding CUDA libraries and cuDNN along with it?
I am still confused though: on the Get Started page of the PyTorch website, when choosing "conda", the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (the CUDA runtime)?
Alright, I am starting to get a better picture of this puzzle.
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch.
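That also matches how the pip wheels are published: each one is built against a specific CUDA version and bundles the runtime, selected via the +cuXXX tag (versions below are just an example):

```bash
# The +cu111 suffix picks a wheel with the CUDA 11.1 runtime and cuDNN bundled in
pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```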
Thanks SuccessfulKoala55 for the answer! One followup question:
When I specify:

```
agent.package_manager.pip_version: '==20.2.3'
```

in the trains.conf, I get:

```
trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
```
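My guess (unconfirmed) is the single quotes: the conf format is HOCON, which only accepts double-quoted strings, so this might parse:

```
agent.package_manager.pip_version: "==20.2.3"
```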
Sure, just sent you a screenshot in PM
The host is accessible: I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" and get results from the local machine.
Yes, because it won’t install the local package whose setup.py has the problem in its install_requires that I described in my previous message.
Ha nice, where can I find the mapping template of the original ClearML so that I can copy and adapt it?
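In case it's useful, one way to grab them would be to dump the index templates straight from the original server's ES (the host is a placeholder; this assumes it uses the legacy _template endpoint):

```bash
# List all index templates on the original ClearML ES
curl "http://localhost:9200/_template?pretty"
```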
I am using pip as the package manager, but I start the trains-agent inside a conda env 😄
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.