And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists.
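Roughly what I mean, as a sketch (the helper name is mine, and I am not 100% sure how Task.get_task signals a missing task, so I handle that defensively):
```python
from clearml import Task

def get_or_create_subtask(project_name: str, task_name: str) -> Task:
    # Try to retrieve the task first, so re-running does not create duplicates.
    try:
        existing = Task.get_task(project_name=project_name, task_name=task_name)
    except ValueError:
        existing = None  # assuming get_task raises when nothing matches
    if existing is not None:
        return existing
    # Nothing found: create a fresh draft task.
    return Task.create(project_name=project_name, task_name=task_name)
```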
AnxiousSeal95 The main reason for me not to use clearml-serving triton is the lack of documentation, tbh. I am not sure how to make my PyTorch model run there
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
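To illustrate what I mean by "double the brackets" (the template string below is made up, not the actual aws_auto_scaler.py content): literal braces in a Python str.format() template have to be written as {{ and }} so format() does not treat them as placeholders.
```python
# {instance_type} is a real placeholder, ${{HOME}} should stay a literal ${HOME}.
template = "echo 'starting {instance_type} worker in ${{HOME}}'"
print(template.format(instance_type="g4dn.xlarge"))
# -> echo 'starting g4dn.xlarge worker in ${HOME}'
```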
Alright, how can I then mount a volume of the disk?
Yeah, that's what I thought, I do have trains server 0.15
Task.get_project_object().default_output_destination = None
Would you like me to open an issue for that or will you fix it?
That would be amazing!
AgitatedDove14 WOW, thanks a lot! I will dig into that
It broke holding shift to select multiple experiments, btw
I am also interested in the clearml-serving part
meaning the REST API returns nothing, is that correct?
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats! Was this bug fixed in this new version?
Hi TimelyPenguin76 ,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
v0.17.5rc2
This is what I get with mprof
on the snippet above (I killed the program after the progress bar reached 100%, otherwise it hangs trying to upload all the figures)
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Disclaimer: I didn't check that this reproduces the bug, but those are all the components that should trigger it: a for loop creating figures and clearml logging them
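Something like this minimal sketch is what I have in mind (project/task names and the number of iterations are placeholders, and I haven't verified it leaks):
```python
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task

task = Task.init(project_name="debug", task_name="figure-leak-repro")  # placeholder names
logger = task.get_logger()

for i in range(500):  # arbitrary number of iterations
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(100))
    # Reporting the figure is the step that seems to keep a reference alive.
    logger.report_matplotlib_figure(title="figure", series="leak", iteration=i, figure=fig)
    plt.close(fig)
```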
mmmh it fails, but if I connect to the instance and execute ulimit -n, I do see 65535, while the tasks I send to this agent fail with:
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
which gives me in the logs of the task: b'1024'
So the nofile limit is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
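Another thing I could try, as a sketch: check (and possibly raise) the limit from inside the task itself, assuming the hard limit allows it.
```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")  # e.g. soft=1024 inside the task
# Raise the soft limit up to the hard limit (no effect if the hard limit is also 1024).
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65535, hard), hard))
```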
Ok, to be fair, I get the same curve even when I remove clearml from the snippet, not sure why
I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free it, and the figures end up accumulating. I don't know where yet
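A debugging sketch I could try (my own idea, not verified): ask the garbage collector which Figure objects survive and what still references them.
```python
import gc
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

plt.close("all")
gc.collect()
alive = [obj for obj in gc.get_objects() if isinstance(obj, Figure)]
print(f"{len(alive)} Figure objects still alive")
if alive:
    # Show a few of the objects holding on to the first surviving figure.
    print(gc.get_referrers(alive[0])[:3])
```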
If I manually call report_matplotlib_figure, yes. If I don't (I just create the figure), there is no memory leak
Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"