/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
And I can verify that ~/trains.conf exists in the su home folder
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists
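A minimal sketch of that get-or-create pattern, assuming the clearml SDK is doing the lookup; the helper name and its arguments are illustrative, not taken from the original message:

from clearml import Task

# Look up an existing task by project/name and only create a new one when none is found.
def get_or_create_subtask(project_name, task_name):
    existing = Task.get_tasks(project_name=project_name, task_name=task_name)
    if existing:
        return existing[0]
    return Task.create(project_name=project_name, task_name=task_name)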
AnxiousSeal95 The main reason for me to not use clearml-serving triton is the lack of documentation tbh. I am not sure how to make my pytorch model run there
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
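A small illustration of what doubling the brackets means here, assuming the autoscaler startup script is a template rendered with Python's str.format() (the template string below is made up):

# {{ and }} render as literal braces, so the shell still sees ${HOME}
template = "docker run -e QUEUE={queue} bash -c 'echo ${{HOME}}'"
print(template.format(queue="services"))
# -> docker run -e QUEUE=services bash -c 'echo ${HOME}'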
Alright, how can I then mount a volume of the disk?
Yea that's what I thought, I do have trains server 0.15
Task.get_project_object().default_output_destination = None
Would you like me to open an issue for that or will you fix it?
That would be amazing!
AgitatedDove14 WOW, thanks a lot! I will dig into that
It broke holding Shift to select multiple experiments, btw
I am also interested in the clearml-serving part
meaning the RestAPI returns nothing, is that correct?
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats! Was this bug fixed with this new version?
Hi TimelyPenguin76 ,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
v0.17.5rc2
This is what I get with mprof
on this snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49>, the issue above seemed to be a memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Disclaimer: I didn't check that this reproduces the bug, but those are all the components that should reproduce it: a for loop creating figures and clearml logging them
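Roughly, such a repro could look like the sketch below (a reconstruction from that description, not the exact snippet; the project/task names and loop size are placeholders):

import matplotlib.pyplot as plt
from clearml import Task, Logger

task = Task.init(project_name="debug", task_name="figure-leak-repro")
logger = Logger.current_logger()

for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot(range(100))
    # report each figure to clearml, one per iteration
    logger.report_matplotlib_figure(title="debug", series="figures", figure=fig, iteration=i)
    plt.close(fig)  # release the matplotlib side so only the logging path can accumulate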
mmmh it fails, but if I connect to the instance and execute ulimit -n, I do see 65535
while the tasks I send to this agent fail with:
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
which gives me in the logs of the task: b'1024'
So the nofile limit is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
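One way to check, and where the hard limit allows, raise, the limit from inside the task itself instead of shelling out to ulimit, using only the standard library (the 65535 target mirrors the instance setting above):

import resource

# (soft, hard) limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile soft={soft} hard={hard}")
if soft < 65535 <= hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (65535, hard))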
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why