Hi. I'm fine-tuning a model from Segment Anything using remote training in ClearML. I have three servers: two with an A30 and one with an A100. Interestingly, when I train on the A30, the IOU is quite good (~0.9). However, when I train on the A100, the IOU is significantly lower (~0.6).
I've conducted several tests to troubleshoot the issue:
- I tried remote training on the CPU, but the scores were unchanged on both machines (good on the A30, bad on the A100).
- I also tried training directly on the A30 and A100 without remote training. Surprisingly, the scores on both cards were the same, and both were good (IOU ~0.9).

Any insights or suggestions would be greatly appreciated. Is there an issue with how ClearML uses the A100 card? Thank you.
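
Since the direct runs are fine and only the remote run on the A100 is bad, one thing worth comparing is the environment each run actually executes in. Below is a minimal sketch of how that could be logged, not a confirmed diagnosis. The helper names `environment_snapshot` and `diff_snapshots` and the package list are my own placeholders, and the PyTorch part only runs if `torch` is installed.

```python
import json
import os
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=("torch", "torchvision", "numpy", "clearml")):
    """Collect settings that may differ between a remote and a direct run.

    The package list is an assumption; adjust it to the project's requirements.
    """
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    for name in packages:
        try:
            snapshot[f"pkg:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[f"pkg:{name}"] = None

    try:
        import torch
    except ImportError:
        return snapshot  # no PyTorch here, so skip the GPU-related fields

    snapshot["torch.cuda"] = torch.version.cuda
    snapshot["cudnn"] = torch.backends.cudnn.version()
    # TF32 switches change matmul/convolution precision on Ampere GPUs.
    snapshot["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    snapshot["tf32_cudnn"] = torch.backends.cudnn.allow_tf32
    if torch.cuda.is_available():
        snapshot["gpu"] = torch.cuda.get_device_name(0)
    return snapshot


def diff_snapshots(left, right):
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


if __name__ == "__main__":
    # Print this at the start of every run, then compare the saved outputs.
    print(json.dumps(environment_snapshot(), indent=2, sort_keys=True))
```

Running this at the start of the remote and the direct run on the A100, then passing the two saved dictionaries to `diff_snapshots`, would show whether the two runs differ in package versions, CUDA/cuDNN versions, or visible devices.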
"This image depicts the plot of IOU metrics over training in A30 and A100"

