Hi, I Would Like To Bring Awareness

Answered

Hi, I would like to raise awareness of this issue. It impacts my work because I cannot install the older version of torch (1.11.0).

  				
Posted 2 years ago by JitteryCoyote63


Answers 23

I think you can set the CUDA version in clearml.conf. Alternatively, you can have the agent use a docker image with your required version of CUDA instead of setting up the environment directly on the machine.

  				
Posted 2 years ago by EnthusiasticShrimp49

Hi @JitteryCoyote63
Thank you for bringing this up! Can you verify with the latest clearml-agent, 1.5.3rc2?

  				
Posted 2 years ago by AgitatedDove14

@EnthusiasticShrimp49 I'll try setting the CUDA version in clearml.conf, thanks for the tip!
@AgitatedDove14 Could you please push the code for that version on github?

  				
Posted 2 years ago by JitteryCoyote63

Hi @JitteryCoyote63

> Could you please push the code for that version on github?

Oh, it seems it is not synced, thank you for noticing (it will be taken care of immediately).
Regarding the issue:
Look at the attached images.
[link] does not contain a specific wheel for cuda117 on x86; they use the pip default one.

  				
Posted 2 years ago by AgitatedDove14

> oh seems like it is not synced, thank you for noticing (it will be taken care of immediately)

Thank you!

> does not contain a specific wheel for cuda117 on x86, they use the pip default one

Yes, so indeed they don't provide support for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117. That is what the clearml-agent was doing before.

  				
Posted 2 years ago by JitteryCoyote63

No, I think the default version already supports cuda 117

  				
Posted 2 years ago by AgitatedDove14

Could you please clarify? I don't get it

  				
Posted 2 years ago by JitteryCoyote63

The wheel you download from pip, for example torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl,
is actually both CPU and cuda 117

  				
Posted 2 years ago by AgitatedDove14

This is not the case: I downloaded it and got a CUDA error at runtime.

  				
Posted 2 years ago by JitteryCoyote63

If this is the case, pytorch really messed things up; it means they removed packages.
Let me check something.

  				
Posted 2 years ago by AgitatedDove14

@JitteryCoyote63
I just created a new venv and ran

pip install "torch==1.11.0.*" --extra-index-url

Then started python:

import torch
torch.cuda.is_available()

And I get True

What are you getting?

  				
Posted 2 years ago by AgitatedDove14
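A side note on the check above: `torch.cuda.is_available()` returning True only indicates that a CUDA device and driver were found, not that the installed wheel ships kernels for that GPU. A minimal sketch of a stricter smoke test follows; it actually launches a kernel. The function name `cuda_smoke_test` and its status strings are made up for illustration.

```python
def cuda_smoke_test() -> str:
    """Return a short status string saying whether a real CUDA kernel runs.

    Hypothetical helper, not part of torch or clearml-agent.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "no-cuda"
    try:
        # Allocate on the GPU and run a tiny computation.
        x = torch.ones(8, device="cuda")
        # .item() copies the result back and forces synchronization,
        # so a kernel launch failure surfaces here.
        total = (x * 2).sum().item()
    except RuntimeError as err:
        return f"kernel-failed: {err}"
    return "ok" if total == 16.0 else f"unexpected-result: {total}"


if __name__ == "__main__":
    print(cuda_smoke_test())
```

On a machine where the wheel lacks kernels for the card, this would be expected to report a `kernel-failed` status even though `is_available()` is True.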

RuntimeError: CUDA error: no kernel image is available for execution on the device

  				
Posted 2 years ago by JitteryCoyote63

When running my training code

  				
Posted 2 years ago by JitteryCoyote63

Ha I just saw in the logs:

WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at

  				
Posted 2 years ago by JitteryCoyote63
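The warning in the log above can be reproduced ahead of time by comparing the GPU's compute capability with the architectures the wheel was built for. This is a rough sketch under the assumption that an `sm_XY` build is usable on any device with the same major version; the helper name `is_capability_supported` is hypothetical, while `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing torch calls.

```python
from typing import Iterable, Tuple


def is_capability_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Heuristic: supported if some compiled sm_XY shares the device's major version."""
    major, _minor = capability
    # Keep only binary targets such as "sm_70"; skip PTX entries like "compute_70".
    supported_sm = [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]
    return any(sm // 10 == major for sm in supported_sm)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(cap, is_capability_supported(cap, torch.cuda.get_arch_list()))
    else:
        # Values taken from the log: A10G is sm_86, wheel supports sm_37/50/60/70.
        print(is_capability_supported((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"]))
```

With the values from the log, the check reports that the A10G (sm_86) is not covered by that wheel.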

So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)

  				
Posted 2 years ago by JitteryCoyote63

So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that this wheel doesn't work on EC2 g5 instances, surprisingly. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0.

  				
Posted 2 years ago by JitteryCoyote63
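For the "hardcode the correct wheel" option, the direct URL can be assembled from its parts. The sketch below generalizes from the single cu115 link quoted earlier in the thread, so the URL layout is an assumption inferred from that one example, and `torch_wheel_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def torch_wheel_url(version: str, cuda_tag: str, py_tag: str,
                    platform: str = "linux_x86_64") -> str:
    """Build a direct download URL for a CUDA-specific torch wheel.

    Assumes the layout seen in the cu115 link from this thread.
    """
    filename = f"torch-{version}+{cuda_tag}-{py_tag}-{py_tag}-{platform}.whl"
    # quote() percent-encodes the "+" in the local version as %2B.
    return f"https://download.pytorch.org/whl/{cuda_tag}/{quote(filename)}"


if __name__ == "__main__":
    print(torch_wheel_url("1.11.0", "cu115", "cp38"))
```

The resulting URL can then be placed in the requirements instead of a plain `torch==1.11.0` pin.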

And I didn't have this problem before because, when cu117 wheels were not available, the agent tried to get the wheel with the closest cu version and fell back to 1.11.0+cu115, which worked.

  				
Posted 2 years ago by JitteryCoyote63
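The fallback described above (pick the closest available CUDA build when the exact one is missing) can be illustrated with a small sketch. This is not the agent's actual code; `closest_cuda_tag` is a hypothetical function, and it assumes "closest" means the highest available version that does not exceed the installed one.

```python
from typing import Iterable, Optional


def closest_cuda_tag(requested: int, available: Iterable[int]) -> Optional[int]:
    """Pick the highest available CUDA tag that is <= the requested one.

    Tags are integers in the wheel naming style, e.g. 115 for cu115.
    Returns None when no available tag is low enough.
    """
    candidates = [tag for tag in available if tag <= requested]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    # Machine has CUDA 11.7; assume wheels exist only for cu102, cu113 and cu115.
    print(closest_cuda_tag(117, [102, 113, 115]))
```

In the scenario from this thread, a machine on cu117 with no cu117 wheel published would resolve to the cu115 build.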

> So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117.

The thing is, the agent used to do all the heavy parsing because pytorch never actually had a pip-compatible artifactory.
But now they do, so the agent basically passes the parsing to pip and just adds the correct additional pytorch pip repo.
It seems we need to switch back... wdyt?

  				
Posted 2 years ago by AgitatedDove14

I wouldn't do it. This is less code to maintain on your side, and honestly too much auto-magic makes it difficult for the user to control the environment (i.e. to understand what happens behind the scenes). I am not sure what switching back will solve: here the wheel should have been correct, it's just the architecture of the card that is incompatible.

  				
Posted 2 years ago by JitteryCoyote63

> I am not sure what switching back will solve, here the wheel should have been correct, it's just the architecture of the card that is incompatible

So I tested the "old" code that did the parsing and matching, and it did resolve to the correct wheel (i.e. it found that there is no 117, only 115, and installed that one).
I think we should switch back and have a configuration option to control which mechanism the agent uses, wdyt?

  				
Posted 2 years ago by AgitatedDove14

> I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt?

That sounds great!

  				
Posted 2 years ago by JitteryCoyote63

Hi @JitteryCoyote63
The RC is out:

pip3 install clearml-agent==1.5.3rc3

Then set pytorch_resolve: "direct" in clearml.conf
[link]

Let me know if it worked.

  				
Posted 2 years ago by AgitatedDove14

🚀 Thanks @AgitatedDove14 !

  				
Posted 2 years ago by JitteryCoyote63
