
Is it possible to know in advance where the Agent will clone the code?
Or can I run a link command just before the code executes?
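To illustrate the idea, a hypothetical pre-run link command (the clone path below is a placeholder for illustration only, not a documented location; verify where your agent actually puts the working copy):
`
# hypothetical: link a shared data folder into the agent's working copy before the task runs
REPO_DIR=~/.clearml/venvs-builds/3.7/task_repository/my_repo   # placeholder path, check on your agent
ln -sfn /mnt/shared/data "$REPO_DIR/data"
`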
the index creation:
`
[ec2-user@ip-172-31-26-41 ~]$ sudo docker exec -it clearml-mongo /bin/bash
root@3fc365193ed0:/# mongo
MongoDB shell version v3.6.5
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.5
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
Questions? Try the support group
Server has startup warnings:
2021-01-25T05:58:37.309+0000 I CONTROL [initandlisten]
2021-01-25T05:58:37.309+0000 I C...
`
Thanks!! You are the best..
I will give it a try when the runs finish
Hi SuccessfulKoala55 Thanks for the reply..
So for now, if I'd like to upgrade to the latest trains-server
but on another machine and keep all the data,
what is the best practice?
Thanks again 🙂
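For what it's worth, a minimal sketch of such a migration, assuming the default /opt/trains layout from the trains-server docs (the hostname and paths are placeholders):
`
# on the old machine: stop the server, then archive its data folder
sudo docker-compose -f /opt/trains/docker-compose.yml down
sudo tar czf trains-data.tar.gz -C /opt/trains data

# copy to the new machine and unpack there before starting the new server
scp trains-data.tar.gz new-host:~
ssh new-host "sudo mkdir -p /opt/trains && sudo tar xzf trains-data.tar.gz -C /opt/trains"
`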
So for now I am leaving this issue...
Thanks a lot 🙏 🙌
I didn't try trains-agent yet, does it support using AWS Batch?
For now we are using AWS Batch for running those experiments,
because that way we don't have to hold machines that just wait for jobs.
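For comparison, the usual trains-agent pattern is a daemon that pulls tasks from a queue, so a pool of machines serves many jobs (the queue name below is a placeholder):
`
# one-time credentials setup, then serve the "default" queue in docker mode
trains-agent init
trains-agent daemon --queue default --docker
`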
how long? 😅
I am now stuck in Copying index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
for more than 40 min 😥
I reproduced the hang with this code..
But for now only with my env; when I tried to create a new env with only the packages this code needs, it doesn't hang.
So maybe the problem is a conflict between packages?
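One quick way to test the package-conflict theory is to diff the two environments (a generic sketch; the file names are arbitrary):
`
pip freeze > full_env.txt      # run inside the env that hangs
pip freeze > minimal_env.txt   # run inside the env that doesn't
diff full_env.txt minimal_env.txt
`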
Hey... Thanks for checking with me.
I didn't have time yet but will check it and let you know..
AgitatedDove14 Hi, sorry for the long delay.
I tried to use 0.16 instead of 0.13.1.
I didn't have time to debug it (I am overwhelmed with work right now).
But it doesn't work the same as 0.13.1. I am still getting some hanging in my eval process.
I don't know if it is just slower or really stuck, since I killed it and moved back to 0.13.1 until my busy time passes.
Thanks
Sure, love to do it when I have more time 🙂
The hang is still happening in trains==0.15.2rc0
I don't have time to debug it yet.. will update more when I have more time..
Thanks 🙏
AgitatedDove14 Thanks, I am trying it..
Thanks AgitatedDove14 ,
I need to check with my boss that it is OK to share more code, will let you know..
But I will give 0.16 a try when it is released.
🙏
hey, I tested it and it looks like it works, but it still takes a lot of time (mainly in the second run of the code, which is part of my eval process)
I am trying to reproduce it with a small example
SuccessfulKoala55 it is still stuck on the same line.. should it be like this?
I just need it to run the docker and run the command inside it, no?
Ok, looks like it is starting the training...
Thanks 💯
It is now stuck after:
` 2021-03-09 14:54:07
task 609a976a889748d6a6e4baf360ef93b4 pulled from 8e47f5b0694e426e814f0855186f560e by worker ov-01:gpu1
2021-03-09 14:54:08
running Task 609a976a889748d6a6e4baf360ef93b4 inside default docker image: MyDockerImage:v0
2021-03-09 14:54:08
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '-e', 'CLEARML_WORKER_ID=ov-01:gpu1', '-e', 'CLEARML_DOCKER_IMAGE=MyDockerImage:v0', '-v', '/tmp/.clearml_agent.jvxowhq4.cfg:/root/clearml.conf', '-v', '/...
`
Hi AppetizingMouse58, I had around 200GB when I started the migration, now I have 169GB.
And yes, it looks like it is growing: it was 9.4GB and is now 9.5GB.
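For reference, this is the kind of check behind those numbers (the data path is the default from the clearml-server docs; adjust if yours differs):
`
df -h /                                  # overall free space on the volume
sudo du -sh /opt/clearml/data/elastic_7  # size of the Elasticsearch data dir
`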