Math checks out: if I was generating around 140K API calls a day and this had been running for 9 days, that's roughly 1.26M, and it had 1.2M when I caught it. So I think the day after I shut it down I was still seeing the previous day's numbers getting added. Another 24 hours later it barely changed, so ya, it was 100% the stdout logging.
I am running this on a 3090 GPU locally, and I've just been letting it run for about two weeks now, I think. Just have the one GPU, ha ha. It's at epoch 368 out of the 1,000 I have it set to cap out at (if it does not hit the default YOLO "patience" limit of 50 before then and self-terminate).
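For context, a minimal sketch of the kind of Ultralytics training call being described here (the weights file and dataset path are placeholders, not the actual ones used in this run):

```python
from ultralytics import YOLO

# Start from a pretrained checkpoint (placeholder weights file)
model = YOLO("yolov8n.pt")

# Cap at 1,000 epochs; early-stop if no improvement for 50 epochs ("patience")
model.train(data="data.yaml", epochs=1000, patience=50)
```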
Came to ClearML since it had a slick dashboard and showed me the info that mattered. Loved that I could share the results of each epoch so we could make sure things were headed in the correct direction.
I guess last followup question, is there a way to cap costs?
Scale tier? (I know it is not per usage, but it is probably more than $15 per user 🙂)
Welp, it's been a day with the new settings, and the API call count went up 140K 😢 ... going to check again tomorrow to see if any of that was spillover from yesterday.
Is there a place in ClearML that shows Platform Usage? Like, what's actually taking up the API calls?
Ya, sorry, I meant that if you needed more info on what was being run, it was in that screenshot (it showed instances/epochs/batch size, etc.). But yes, it's since been disabled.
I had no idea it was going to do that and sent your servers over 1.4M API hits unintentionally
Yeah, that is way too much, I think it relates to the frequency it updates the console 😞
I did notice that the last 24 hours dropped quite a bit, so my theory that the 140K might have included some spillover from the previous day might have been correct. In the last 24 hours it went from 1.24M to 1.32M, so about half as much as the day before, with the same training running.
Thanks, will do. Heck, for my use case, I only need like once every 10 minutes.
It'd be great if it just posted to ClearML after each epoch is completed and the CSV with the results gets updated. I only care about using the dashboard to track completed progress. I can use my local computer's terminal window to monitor the current epoch's training. No need to send that to ClearML every second ;) Results once an hour or so, after each epoch completes, is fine :)
I think we're good now :) Appreciate the help !!!
I appreciate your help @<1523701205467926528:profile|AgitatedDove14> 🙂
Maybe ClearML is using TensorBoard in ways that I can fine-tune? I saw there was a manual way to send data over if you were not using TensorBoard, but the videos I saw from your team used this solution when demoing YOLOv8 on YouTube (there were a few collab videos your team did with theirs, so I just followed their instructions). But my gut is telling me that might be the issue for the remaining data being sent over that I have no insight into.
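For reference, the manual reporting route mentioned above would look roughly like this with the ClearML SDK (project/task names and metric values are placeholders); this is a sketch, not what the YOLOv8 integration actually does under the hood:

```python
from clearml import Logger, Task

task = Task.init(project_name="yolo-demo", task_name="manual-reporting")
logger = Logger.current_logger()

# Report one scalar point per epoch instead of streaming every console line
for epoch in range(3):
    val_map = 0.5 + 0.1 * epoch  # placeholder metric value
    logger.report_scalar(title="val", series="mAP50", value=val_map, iteration=epoch)
```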
So, I might be in the minority here, but it seems like capturing stdout and sending it over to ClearML via the API should be disabled by default. Like, I get maybe capturing stderr, but stdout? In a training scenario, that's MILLIONS of API calls just for progress bar updates, right? It might actually be better for the ClearML servers in general to make the user turn that on if they want it; otherwise we're just blasting your servers. In my case, I did not even know it was sending that over until I started digging into where these API calls were coming from and saw the CONSOLE tab in ClearML that had every single line of stdout captured.
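If I understand the SDK correctly, stdout capture can be turned off per task at init time; a minimal sketch, assuming the `auto_connect_streams` argument to `Task.init` behaves as documented and that the task is created in your own script:

```python
from clearml import Task

# Keep stderr and Python logging, but stop capturing every stdout line
task = Task.init(
    project_name="yolo-demo",
    task_name="train",
    auto_connect_streams={"stdout": False, "stderr": True, "logging": True},
)
```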
FYI, I did not even know to look into this until I logged in and saw that I was being throttled because I had hit my monthly API call limit (on my very first use of your platform), and my last dozen or so epochs were just not even logged (also a bummer). I only had that one model in training and thought there was no way I had sent over a million API requests, so I had to figure out where those were coming from. I tracked it down to that stdout, and was like ... wait, what?!?! Found that Console tab, which I had not even used before, saw that screenshot I posted, and was like ... well, there's your problem, ha ha
well, from 2 to 30 sec is a factor of 15, I think this is a good start 🙂
This one, right? report_period_sec in ~/clearml.conf, correct?
Hi @<1572395184505753600:profile|GleamingSeagull15>
Try adjusting report_period_sec (yes, the one in ~/clearml.conf) to 30 sec
It will reduce the number of log reports (i.e. API calls)
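For anyone following along, the corresponding entry in ~/clearml.conf would look something like this (30 is just the value suggested above; the default is 2 seconds):

```
sdk {
  development {
    worker {
      # seconds between batched console/metric report flushes
      report_period_sec: 30
    }
  }
}
```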
Scary to think how common that might be. Could be an interesting way to optimize your platform: detect excessive console logging and prompt the user to confirm continued usage (or link to docs on how to disable it if they want to stop it).