AgitatedDove14 Yes exactly! It is shown in the recording above
But I would need to reindex everything, right? Is that an expensive operation?
SuccessfulKoala55
In the docker-compose file, you have an environment setting for the apiserver service host and port (CLEARML_ELASTIC_SERVICE_HOST and CLEARML_ELASTIC_SERVICE_PORT) - changing those will allow you to point the server to another ES service
The ES cluster is running on another machine, so how can I set its IP in CLEARML_ELASTIC_SERVICE_HOST? Would I need to add the host to the networks of the apiserver service somehow? How can I do that?
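I guess something like this in the apiserver service of the docker-compose (a sketch; the address is a placeholder for the external ES machine, and since that machine sits outside the compose network anyway I assume no extra entry under networks is needed as long as its address resolves from inside the container):
    apiserver:
      environment:
        - CLEARML_ELASTIC_SERVICE_HOST=10.1.2.3   # placeholder: IP/hostname of the external ES cluster
        - CLEARML_ELASTIC_SERVICE_PORT=9200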
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
I don't think it is; I was rather wondering how you handled it, to understand potential sources of slowdown in the training code
I can also access these files directly if I enter the url in the browser
There is no error on this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the userdata script fails
Ok, so what worked for me in the end was:
    config = task.connect_configuration(read_yaml(conf_path))
    cfg = OmegaConf.create(config._to_dict())
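For context, roughly what that looks like end to end (read_yaml is just a small helper that loads the YAML into a dict; project/task names and the path below are placeholders, and in my snippet above the returned object needed _to_dict(), here I just copy it into a plain dict before handing it to OmegaConf):
    from clearml import Task
    from omegaconf import OmegaConf
    import yaml

    def read_yaml(path):
        # small helper: load the YAML file into a plain dict
        with open(path) as f:
            return yaml.safe_load(f)

    task = Task.init(project_name="examples", task_name="omegaconf config")  # placeholders
    conf_path = "conf/train.yaml"  # placeholder
    config = task.connect_configuration(read_yaml(conf_path))  # logged and editable in the UI
    cfg = OmegaConf.create(dict(config))  # back to an OmegaConf object for the rest of the code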
but most likely I need to update the perms of /data as well
Without the envs, I had this error:
    ValueError: Could not get access credentials for 's3://my-bucket', check configuration file ~/trains.conf
After using the envs, I got this error:
    ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'
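(The envs in question should be the standard AWS credential variables; the values and region below are placeholders:)
    export AWS_ACCESS_KEY_ID="***"
    export AWS_SECRET_ACCESS_KEY="***"
    export AWS_DEFAULT_REGION="eu-west-1"   # placeholder region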
They indeed do auto-rotate when you limit the size of the logs
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
The host is accessible, I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" and get results from the local machine
Ok, in that case it probably doesn't work, because if the default value is 10 secs, it doesn't match what I get in the logs of the experiment: every second tqdm adds a new line
In the controller, I want to upload an artifact and start a task that will query that artifact, and I want to make sure the artifact exists by the time the task tries to retrieve it
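Roughly what I have in mind (a sketch; names are placeholders, and I'm assuming upload_artifact's wait_on_upload flag can be used to block until the upload has actually finished):
    from clearml import Task

    # controller side
    controller = Task.init(project_name="examples", task_name="controller")  # placeholders
    controller.upload_artifact(
        name="dataset",
        artifact_object="data.csv",  # placeholder local file
        wait_on_upload=True,         # block until the artifact is really stored (assumption)
    )

    # inside the spawned task (the controller's task id would be passed along, e.g. as a parameter)
    parent = Task.get_task(task_id="<controller-task-id>")
    local_copy = parent.artifacts["dataset"].get_local_copy()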
I execute the clearml-agent this way:
    /home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
Hi CostlyOstrich36 , I am not using Hydra, only OmegaConf, so you mean just calling OmegaConf.load should be enough?
Yeah, that's what I thought, I do have Trains server 0.15
I think clearml-agent tries to execute /usr/bin/python3.6 to start the task, instead of using the python used to start clearml-agent
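If it helps, what I'm looking at now (an assumption on my side) is pinning the interpreter via agent.python_binary in the agent's config, something like:
    # ~/clearml.conf (sketch; the path is the one from my setup)
    agent {
        # make the agent use this interpreter instead of the system python
        python_binary: "/home/machine/miniconda3/envs/py36/bin/python3"
    }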
My bad, Alpine is so light it doesn't have bash
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
AgitatedDove14 It was only on comparison as far as I remember
Ping CostlyOstrich36 AgitatedDove14 SuccessfulKoala55 Just making sure this wasn't missed 🙂
but I also make sure to write the trains.conf to the root directory in this bash script:
    echo "
    sdk.aws.s3.key = ***
    sdk.aws.s3.secret = ***
    " > ~/trains.conf
    ...
    python3 -m trains_agent --config-file "~/trains.conf" ...
File "devops/valid.py", line 80, in valid(parse_args) File "devops/valid.py", line 41, in valid valid_task.output_uri = args.artifacts File "/data/.trains/venvs-builds/3.6/lib/python3.6/site-packages/trains/task.py", line 695, in output_uri ", check configuration file ~/trains.conf".format(value)) ValueError: Could not get access credentials for 's3://ml-artefacts' , check configuration file ~/trains.conf
Oh, seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 on x86, they use the default pip one
Yes, so indeed they don't provide wheels for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117; that is what the clearml-agent was doing before
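i.e. something along these lines (a sketch; the extra index URL is the per-CUDA PyTorch wheel index):
    # install the cu115 build of torch 1.11 even though the machine reports cu117
    pip install torch==1.11.0+cu115 --extra-index-url https://download.pytorch.org/whl/cu115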
Would be very cool if you could include this use case!