
Hi. I'm currently fine-tuning a model from Segment Anything. I'm using remote training in ClearML, and I have three servers: two with an A30 and one with an A100. Interestingly, when training on the A30, the IOU is quite good, around 0.9. However, when I train on the A100, the score is significantly lower, around 0.6.
I've conducted several tests to troubleshoot the issue:

I tried remote training on the CPU, but the scores on both A30 and A100 remained the same (good on A30 and bad on A100).
I also attempted training directly on the A30 and the A100 without using remote training. Surprisingly, the scores on both cards were the same and good (IOU ~0.9).

Any insights or suggestions on this matter would be greatly appreciated. Is there any issue with how ClearML utilizes the A100 card? Thank you.
[Image: plot of the IOU metric over training on the A30 and the A100]

  				
Posted one year ago by SilkyHawk58


Answers 2

Yeah, but why is there such a notable difference in IOU when training remotely on a server with an A30 card compared to another server with an A100 card? I simply enqueued the task to the agents.

  				
Posted one year ago by SilkyHawk58

Hi @SilkyHawk58, ClearML doesn't "utilize" the cards directly per se. ClearML enables your code to execute on remote machines (among many other things). However, the one actually utilizing the card is your code.

Makes sense?
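
Since the same code gives good scores when run directly on both cards, one thing that might be worth checking is whether the remote run ends up with a different software environment than the direct run. Below is a minimal sketch of that comparison, not a confirmed diagnosis. It assumes the training script uses PyTorch (Segment Anything is a PyTorch model); the helper names `environment_fingerprint` and `diff_fingerprints` are made up for illustration. The idea is to print the fingerprint at the start of training in both the direct run and the remote run and compare the two.

```python
import json
import platform
import sys


def environment_fingerprint():
    """Collect version and GPU settings that can differ between machines."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch  # assumption: the training code is PyTorch-based
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    # TF32 flags change matmul/convolution precision on Ampere GPUs
    info["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    info["tf32_cudnn"] = torch.backends.cudnn.allow_tf32
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


def diff_fingerprints(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Print this in both runs, then compare the two outputs.
    print(json.dumps(environment_fingerprint(), indent=2))
```

If the two printouts are saved as dictionaries, `diff_fingerprints(direct_run, remote_run)` lists only the entries that differ, for example a different PyTorch or CUDA build on the A100 machine.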

  				
Posted one year ago by CostlyOstrich36
