AgitatedDove14 Yes exactly! It is shown in the recording above
But I would need to reindex everything, right? Is that an expensive operation?
SuccessfulKoala55
In the docker-compose file, you have an environment setting for the apiserver service host and port (CLEARML_ELASTIC_SERVICE_HOST and CLEARML_ELASTIC_SERVICE_PORT) - changing those will allow you to point the server to another ES service
The ES cluster is running on another machine, so how can I set its IP in CLEARML_ELASTIC_SERVICE_HOST? Would I need to add the host to the networks of the apiserver service somehow? How can I do that?
I don’t think it is; I was rather wondering how you handled it, to understand potential sources of slowdown in the training code
I can also access these files directly if I enter the URL in the browser
There is no error on this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won’t start because the userdata script fails
Ok, so what worked for me in the end was:
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
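For reference, a more self-contained version of that round trip could look like the sketch below (just an illustration: read_yaml is swapped for OmegaConf.load, and the project/task names and config path are placeholders, not from my actual setup):
```
from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="examples", task_name="omegaconf-config")

# Load the YAML with OmegaConf, hand a plain dict to ClearML so it can be
# edited in the UI, then rebuild an OmegaConf object from what comes back
cfg = OmegaConf.load("conf/config.yaml")
connected = task.connect_configuration(OmegaConf.to_container(cfg, resolve=True))
cfg = OmegaConf.create(connected)
```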
but most likely I need to update the perms of /data as well
Without the envs, I had this error:
ValueError: Could not get access credentials for 's3://my-bucket', check configuration file ~/trains.conf
After using the envs, I got this error:
ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
The host is accessible: I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" from the local machine and get results
Ok, in that case it probably doesn’t work, because if the default value is 10 secs it doesn’t match what I see in the logs of the experiment: tqdm adds a new line every second
In the controller, I want to upload an artifact and then start a task that will query that artifact, and I want to make sure the artifact already exists when the task tries to retrieve it
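Concretely, something like the sketch below is what I have in mind (just a rough illustration; the task ids, queue name, uploaded object and the wait_on_upload flag are assumptions on my side, not something I have verified):
```
from clearml import Task

# --- controller side ---
controller = Task.current_task()
controller.upload_artifact(
    name="shared_data",
    artifact_object={"foo": "bar"},   # placeholder for the real object
    wait_on_upload=True,              # block until the upload has finished
)
child = Task.clone(source_task="<template-task-id>", name="consumer")
Task.enqueue(child, queue_name="default")

# --- inside the started task ---
data = Task.get_task(task_id="<controller-task-id>").artifacts["shared_data"].get()
```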
I execute the clearml-agent this way:
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
Hi CostlyOstrich36, I am not using Hydra, only OmegaConf, so do you mean just calling OmegaConf.load should be enough?
Yea, that's what I thought, I do have trains server 0.15
I think clearml-agent tries to execute /usr/bin/python3.6 to start the task, instead of using the python that was used to start clearml-agent
My bad, Alpine is so light it doesn't have bash
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
AgitatedDove14 It was only on comparison as far as I remember
but I also make sure to write the trains.conf to the root directory in this bash script:
echo "
sdk.aws.s3.key = ***
sdk.aws.s3.secret = ***
" > ~/trains.conf
...
python3 -m trains_agent --config-file "~/trains.conf" ...
File "devops/valid.py", line 80, in valid(parse_args) File "devops/valid.py", line 41, in valid valid_task.output_uri = args.artifacts File "/data/.trains/venvs-builds/3.6/lib/python3.6/site-packages/trains/task.py", line 695, in output_uri ", check configuration file ~/trains.conf".format(value)) ValueError: Could not get access credentials for 's3://ml-artefacts' , check configuration file ~/trains.conf
oh seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 on x86, they use the default pip one
Yes, so indeed they don't provide support for earlier cuda versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117. Before, that is what the clearml-agent was doing
Would be very cool if you could include this use case!
` # Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: ...
Sure, it’s because of a very annoying bug that I shared in this message: https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , which I haven’t been able to solve so far.
I’m not sure you can downgrade that easily ...
Yea, that’s what I thought. That’s a bit of a pain for me now; I hope I can find a way to fix the bug somehow
What would be the names of these vars?