SuccessfulKoala55:
In the docker-compose file, you have an environment setting for the apiserver service host and port (CLEARML_ELASTIC_SERVICE_HOST and CLEARML_ELASTIC_SERVICE_PORT) - changing those will allow you to point the server to another ES service
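For example, a minimal override could look like this (sketch only - the service and key names assume a standard clearml-server docker-compose.yml, adjust to match your deployment):
```yaml
# docker-compose.override.yml (sketch) - point the apiserver at an external ES cluster
services:
  apiserver:
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: internal-aws-host-name   # external ES host/IP
      CLEARML_ELASTIC_SERVICE_PORT: "9200"
```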
The ES cluster is running on another machine - how can I set its IP in CLEARML_ELASTIC_SERVICE_HOST? Would I need to somehow add the host to the networks of the apiserver service? How can I do that?
Ha, sorry - it's actually the number of shards that increased
I am not sure I can do both operations at the same time (migration + splitting), do you think it’s better to do splitting first or migration first?
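If the splitting is done on its own, one option is the Elasticsearch split API - roughly like this (sketch only; the index names are placeholders, the target shard count must be a multiple of the source's, and the source index has to be made read-only first):
```
# 1. Block writes on the source index (placeholder name)
curl -X PUT "http://internal-aws-host-name:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.blocks.write": true}}'

# 2. Split it into a new index with more primary shards
curl -X POST "http://internal-aws-host-name:9200/my-index/_split/my-index-split" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.number_of_shards": 4}}'
```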
Setting redis to version 6.2.11 (from 6.2) fixed it, but I have new issues now 😄
Never mind - the nvidia-smi command fails on that instance, so the problem lies somewhere else
Still failing with the same error 😞
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding CUDA and cuDNN libraries along with it?
I am still confused though - on the Get Started page of the PyTorch website, when choosing "conda", the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (the CUDA runtime)?
alright I am starting to get a better picture of this puzzle
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch
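A quick way to confirm what the installed wheel actually bundles (minimal sketch, run in the environment where the wheel was installed):
```python
# Check the CUDA / cuDNN runtime shipped inside the installed torch wheel
import torch

print(torch.__version__)               # CUDA wheels usually carry a "+cuXXX" suffix
print(torch.version.cuda)              # CUDA runtime the wheel was built against (None for CPU wheels)
print(torch.backends.cudnn.version())  # bundled cuDNN version
print(torch.cuda.is_available())       # still requires a compatible NVIDIA driver + GPU
```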
Thanks SuccessfulKoala55 for the answer! One followup question:
When I specify:
agent.package_manager.pip_version: '==20.2.3'
in the trains.conf, I get:
trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
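In case it is the cause: the conf file is parsed as HOCON, which only accepts double-quoted strings, so the single quotes may be what trips the parser. A sketch of the relevant line with double quotes (this is an assumption about the error, not a confirmed fix):
```
# trains.conf (sketch) - HOCON string values use double quotes
agent.package_manager.pip_version: "==20.2.3"
```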
Sure, just sent you a screenshot in PM
The host is accessible - I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" and get results from the local machine
Yes, because it won't install the local package whose setup.py has the install_requires problem I described in my previous message
Ha nice - where can I find the mapping templates of the original clearml so that I can copy and adapt them?
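One way to grab them is straight from the ES instance backing the original server, e.g. (sketch; the host is the one from above and the "events*" pattern is an assumption about how the templates are named):
```
# List all index templates on the original cluster
curl "http://internal-aws-host-name:9200/_template?pretty"

# Or only the event-related ones, assuming they match "events*"
curl "http://internal-aws-host-name:9200/_template/events*?pretty"
```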
I am using pip as a package manager, but I start the trains-agent inside a conda env 😄
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
CostlyOstrich36 super thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied), or just referenced?
So I want to be able to visualise it quickly as a table in the UI and to download it as a dataframe - which is better for that, report_media or an artifact?
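For illustration, a minimal sketch of both routes with a pandas DataFrame - the method names are from the ClearML SDK (report_table is used here for the in-UI table instead of report_media), but treat the exact arguments as assumptions and check them against your SDK version:
```python
import pandas as pd
from clearml import Task

# Placeholder project/task names
task = Task.init(project_name="examples", task_name="dataframe-demo")
df = pd.DataFrame({"metric": ["loss", "acc"], "value": [0.12, 0.98]})

# Rendered as an interactive table in the task's results in the UI
task.get_logger().report_table(title="results", series="summary", iteration=0, table_plot=df)

# Stored as an artifact - downloadable and reloadable as a DataFrame
task.upload_artifact(name="results_df", artifact_object=df)
```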
```
# Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""

# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: ...
```
I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel to the main thread?
That would be awesome, yes - it's just that on my side I have zero knowledge of the pip codebase 😄
Still investigating - task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training
You mean it will resolve by itself in the coming days, or should I do something? Or is there nothing to do and it will stay this way?