
Hi SuccessfulKoala55 ,
I took the server down:
```
[ec2-user@ip-172-31-26-41 ~]$ sudo docker-compose -f /opt/clearml/docker-compose.yml down
WARNING: The CLEARML_HOST_IP variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_AGENT_GIT_USER variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_AGENT_GIT_PASS variable is not set. Defaulting to a blank string.
Stopping clearml-webserver ... done
Stopping clearml-agent-services ... done
Stopping clearml-apiserver...
```
```
[2021-01-24 17:02:25,660] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_all in 2ms
[2021-01-24 17:02:25,674] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_next_task in 8ms
[2021-01-24 17:02:26,696] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 36ms
[2021-01-24 17:02:26,742] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 78ms
[2021-01-24 17:02:27,169] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_al...
```
The index creation:
```
[ec2-user@ip-172-31-26-41 ~]$ sudo docker exec -it clearml-mongo /bin/bash
root@3fc365193ed0:/# mongo
MongoDB shell version v3.6.5
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.5
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
Questions? Try the support group
Server has startup warnings:
2021-01-25T05:58:37.309+0000 I CONTROL  [initandlisten]
2021-01-25T05:58:37.309+0000 I C...
```
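In case it's useful, this is roughly how to see which operations (e.g. index builds) are still running from the same container (a sketch; db.currentOp() is standard MongoDB, the container name is the one from my setup):
```
# list the operations MongoDB is currently running inside the clearml-mongo container
sudo docker exec -it clearml-mongo mongo --eval 'printjson(db.currentOp())'
```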
SuccessfulKoala55 and AppetizingMouse58, thank you very much!!
I have a question about the future:
Could this fix cause problems in a future clearml-server upgrade?
Or what is the best practice for upgrading after doing it?
Is it OK that it looks for files in /opt/trains? Since we moved everything to /opt/clearml, no?
File "/opt/trains/apiserver/mongo/initialize/migration.py"
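For context, this is roughly how we moved the data over (a sketch of our own steps, not the official procedure; the paths are the defaults from our install):
```
# with the server stopped, copy the old trains data directory into the new clearml location
sudo mkdir -p /opt/clearml/data
sudo cp -R /opt/trains/data/. /opt/clearml/data/
```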
Someone in my company started a training run 😥, I will do it after it finishes and will update.
Thanks, you are the best 🙏
I reproduced the hang with this code.
But for now only with my env; when I tried to create a new env with only the packages this code needs, it doesn't hang.
So maybe the problem is a conflict between packages?
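If it helps narrow it down, this is one way to compare the two environments (a sketch; the file names are placeholders):
```
# run in the env where the code hangs
pip freeze > env_hangs.txt
# run in the fresh env where it works
pip freeze > env_ok.txt
# diff the installed package versions
diff env_hangs.txt env_ok.txt
```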
Hey... Thanks for checking with me.
I haven't had time yet but will check it and let you know.
I don't have time to debug it yet; will update when I have more time.
Thanks 🙏
The hang is still happening in trains==0.15.2rc0
AgitatedDove14 Thanks, I am trying it..
Hi CumbersomeCormorant74 ,
This is a server we installed.
The server version is: 0.17
We checked with Chrome and Firefox.
Thanks, ophir
For now we are using AWS Batch to run those experiments,
because this way we don't have to keep machines waiting for the jobs.
Thanks!! You are the best.
I will give it a try when the runs finish.
how long? 😅
I am now stuck in "Copying index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"
for more than 40 min 😥
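In case it helps, this is how the copy progress can be checked on the Elasticsearch side (a sketch; it assumes ES is reachable on localhost:9200 and that the copy uses the reindex API):
```
# doc counts / sizes of the indices copied so far
curl -s 'http://localhost:9200/_cat/indices?v'
# any reindex tasks that are still running
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*reindex&pretty'
```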
Thanks for the quick reply.
This will allow more time before the timeout, right?
Maybe there is a way to do something like: task.freeze_monitor() download() task.defrost_monitor()
OK, thanks for the answer. I will use task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
as you suggested for now.
If you add something like I suggested, can you notify me?
I am sure you added this timeout for a reason,
probably since increasing it can affect other functionality.
Am I wrong?
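For reference, this is how I plan to call it (a minimal sketch; the project and task names are placeholders, and the import is from clearml, or trains on older installs):
```python
from clearml import Task  # or: from trains import Task, on older installs

task = Task.init(project_name="examples", task_name="long-download")

# allow 30 minutes before the resource monitor's timeout kicks in,
# so the long download at the start of the run doesn't trip it
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
```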
Hi SuccessfulKoala55 , yes, for now I would like to start by moving what is inside /opt/trains/data/fileserver,
because as I understand it, the logs and graphs are saved in Elasticsearch, so I think it won't be easy to move them as well, right?
Thanks for the reply,
I saw that it is preferred to change the fileserver in trains.conf to s3://XXX
So I changed this as I wrote before.
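Roughly what the change looks like in trains.conf (a sketch; the bucket is the placeholder from above and the credentials section is optional):
```
api {
    # point the default files server at the bucket instead of the built-in fileserver
    files_server: "s3://XXX"
}
sdk {
    aws {
        s3 {
            # bucket credentials, if they are not taken from the environment / IAM role
            key: ""
            secret: ""
            region: ""
        }
    }
}
```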
Thanks, I just want to avoid giving the credentials to every user.
If it's not possible, I will do it.