Hi EnchantingPenguin77, I don't see any errors related to CUDA in the log
CostlyOstrich36 sorry, wrong log uploaded; here is the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
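In case it's useful, a minimal sketch of the CPU-safe loading pattern the error message itself suggests (the checkpoint path "model.pt" is just a placeholder, not from the actual setup):
```python
import torch

# Pick a device that actually exists on this machine; on a CPU-only
# instance torch.cuda.is_available() returns False.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# map_location remaps tensors saved on a CUDA device so the checkpoint
# can be deserialized even when no GPU is visible.
checkpoint = torch.load("model.pt", map_location=device)  # placeholder path
```
That said, if the autoscaler is expected to provide a GPU instance, the real question is why torch.cuda.is_available() comes back False there.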
Screenshot of the AWS Autoscaler setup; CPU mode is NOT enabled
EnchantingPenguin77, are you sure you added the correct log? I don't see any errors related to CUDA
Hi CostlyOstrich36, any idea why this happens?
Hi CostlyOstrich36, here is the configuration. The GPU is sometimes found when I clone the previous successful run, but only randomly. I am also unable to run multiple tasks at the same time, even when cloning the previous run
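In case it helps narrow this down, a small diagnostic sketch that could go at the top of the task script to log what the instance actually exposes (nothing ClearML-specific assumed, just standard PyTorch and nvidia-smi):
```python
import subprocess
import torch

# Log what the container/instance exposes before any torch.load call.
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())

# nvidia-smi failing here usually means the task landed on a CPU-only
# instance, or the NVIDIA driver/runtime is missing from the image/AMI.
try:
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
except FileNotFoundError:
    print("nvidia-smi not found on PATH")
```
Comparing this output between a run that finds the GPU and one that doesn't should show whether the problem is the instance type or the driver setup in the image.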
CostlyOstrich36 yes, at the end of the new file
And this issue happens randomly; I was able to run it again last night, but it failed again this morning
Can you add the autoscaler configuration here?
One thing I've changed is the AMI for the autoscaler: I switched it from Amazon Linux to Ubuntu Linux since my Docker image size exceeded the Amazon Linux limit. Not sure if this has anything to do with the issue