Hi, I'm Using An AWS EC2 Instance To Train My Models With ClearML Autoscaler, But It Says CUDA Device Is Not Available. The Code Runs Well On My Local PC And It Ran Well On ClearML With EC2 Yesterday, But It Suddenly Doesn't Work Today. Is There Anyway To S

Answered

Hi, I'm using an AWS EC2 instance to train my models with the ClearML autoscaler, but it says the CUDA device is not available. The code runs well on my local PC, and it ran well on ClearML with EC2 yesterday, but it suddenly doesn't work today. Is there any way to solve this?

  				
Posted 8 months ago by EnchantingPenguin77


Answers 10

Hi @CostlyOstrich36, any idea why this happens?

  				
Posted 8 months ago by EnchantingPenguin77

Hi @EnchantingPenguin77, I don't see any errors related to CUDA in the log.

  				
Posted 8 months ago by CostlyOstrich36

One thing I've changed is the AMI for the autoscaler: I switched it from Amazon Linux to Ubuntu Linux, since my Docker file size exceeded the limit of the Amazon Linux AMI. I'm not sure whether this has anything to do with the issue.

  				
Posted 8 months ago by EnchantingPenguin77
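Since the AMI was swapped, one thing worth ruling out is whether the NVIDIA driver is visible at all on the new image (or inside the container the task runs in). The thread does not confirm that this is the cause; the snippet below is only a minimal diagnostic sketch using the Python standard library. The function name `nvidia_driver_status` and its return keys are made up for illustration; `nvidia-smi -L` is the standard command that lists the GPUs the driver can see.

```python
import shutil
import subprocess


def nvidia_driver_status(timeout_s: float = 30.0) -> dict:
    """Report whether nvidia-smi is present and which GPUs it can list."""
    status = {"nvidia_smi_found": False, "gpus": [], "error": None}

    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Typical when the driver (or the NVIDIA container runtime) is missing.
        status["error"] = "nvidia-smi not found on PATH"
        return status
    status["nvidia_smi_found"] = True

    try:
        # `nvidia-smi -L` prints one line per GPU visible to the driver.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        status["error"] = f"nvidia-smi could not be run: {exc}"
        return status

    if result.returncode != 0:
        status["error"] = result.stderr.strip() or result.stdout.strip()
        return status

    status["gpus"] = [ln.strip() for ln in result.stdout.splitlines() if ln.strip()]
    return status


if __name__ == "__main__":
    print(nvidia_driver_status())
```

Running this at the very start of the task, on both a failing and a successful run, would show whether the difference is at the driver level or only inside PyTorch.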

Can you add the autoscaler configuration here?

  				
Posted 8 months ago by CostlyOstrich36

@CostlyOstrich36 yes, it's at the end of the new file.

  				
Posted 8 months ago by EnchantingPenguin77

@EnchantingPenguin77, are you sure you added the correct log? I don't see any errors related to CUDA.

  				
Posted 8 months ago by CostlyOstrich36

@CostlyOstrich36 sorry, I uploaded the wrong log. Here is the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

  				
Posted 8 months ago by EnchantingPenguin77
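The error above is raised by `torch.load` when a checkpoint saved on a GPU is loaded while `torch.cuda.is_available()` returns False. A rough sketch of how the loading step could log what PyTorch sees and fail early with a clearer message is shown below. The names `choose_map_location`, `load_checkpoint`, and `require_gpu` are hypothetical and not part of the original code; falling back to CPU only hides the underlying problem, so the sketch raises by default when no GPU is visible.

```python
def choose_map_location(cuda_available: bool) -> str:
    """Return the device string to use as torch.load's map_location."""
    return "cuda" if cuda_available else "cpu"


def load_checkpoint(path: str, require_gpu: bool = True):
    """Load a checkpoint, logging GPU visibility before deserializing."""
    import torch  # imported lazily so the helper above works without PyTorch

    cuda_ok = torch.cuda.is_available()
    # Log what PyTorch sees so failing and successful runs can be compared.
    print(
        f"torch={torch.__version__} cuda_build={torch.version.cuda} "
        f"cuda_available={cuda_ok} device_count={torch.cuda.device_count()}"
    )
    if require_gpu and not cuda_ok:
        raise RuntimeError("No CUDA device is visible to PyTorch on this instance")

    # An explicit map_location avoids the deserialization error on CPU-only hosts.
    device = torch.device(choose_map_location(cuda_ok))
    return torch.load(path, map_location=device)
```

If `cuda_build` prints `None`, the installed PyTorch is a CPU-only build; if it prints a version but `cuda_available` is False, the driver or container GPU access is the more likely place to look.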

And this issue happens randomly: I was able to run it again last night, but it failed again this morning.

  				
Posted 8 months ago by EnchantingPenguin77

Screenshot of the AWS Autoscaler setup; CPU mode is NOT enabled.

  				
Posted 8 months ago by EnchantingPenguin77

Hi @CostlyOstrich36, here is the configuration. The GPU is sometimes found when I clone the previous successful run, but only at random. Also, I am unable to run multiple tasks at the same time, even when cloning the previous run.

  				
Posted 8 months ago by EnchantingPenguin77
