Answered

Hi all, I hope someone can help. I am using ClearML agents with Docker containers to train RL models with Stable Baselines 3 on an on-premise server. I am having issues with saving and loading models.

If I don't specify output_uri in Task.init, artifacts appear to be saved, but the download option is greyed out when right-clicking an output model. The model is saved, but its path points to the Docker container that ran the training, not to a folder on the server itself.

If I set output_uri=True, no model artifacts are saved at all. I feel like I am missing something simple but can't seem to find a solution in the docs.
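For readers comparing the two cases, here is a minimal sketch of how they differ at the call site, assuming the standard `clearml` Python SDK. The project and task names are placeholders, and `build_init_kwargs` and `start_task` are hypothetical helpers that only make the difference explicit. As far as the ClearML docs describe it, `output_uri=True` asks the SDK to upload models to the default files server from `clearml.conf`, while leaving it unset registers only the local path.

```python
from typing import Any, Dict, Union


def build_init_kwargs(project: str, name: str,
                      output_uri: Union[bool, str, None] = None) -> Dict[str, Any]:
    """Hypothetical helper: collect the keyword arguments passed to Task.init."""
    kwargs: Dict[str, Any] = {"project_name": project, "task_name": name}
    if output_uri is not None:
        # True -> default files server; a string -> explicit storage URI
        kwargs["output_uri"] = output_uri
    return kwargs


def start_task(**kwargs: Any):
    """Create the ClearML task; requires the clearml package and a configured server."""
    from clearml import Task  # imported lazily so the helper above has no dependencies
    return Task.init(**kwargs)


if __name__ == "__main__":
    # Placeholder names; swap output_uri=True for None to reproduce the first case.
    task = start_task(**build_init_kwargs("rl-experiments", "ppo-baseline", output_uri=True))
```

Passing a string such as a shared-folder or bucket URI instead of `True` is the other documented option if the default files server is not reachable from inside the container.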

  				
Posted one year ago by MysteriousSeahorse54

Answers 15

Yeah, I guess that's the culprit. I'm not sure ClearML and wandb were designed to work together, and they are probably interfering with each other. Can you try removing the wandb model save callback and try again with output_uri=True?

Also, I'd be happy to learn about your use case that uses both ClearML and wandb. Is it for eval purposes or something else?

  				
Posted one year ago by AnxiousSeal95

Sure, I will try that. Does ClearML have a specific Stable Baselines 3 framework tag, or should I try with just PyTorch?

  				
Posted one year ago by MysteriousSeahorse54

I don't see SB3 listed here, so PyTorch would be best: https://clear.ml/docs/latest/docs/integrations/libraries

  				
Posted one year ago by TimelyMouse69

This run with no output_uri specified produces artifacts.

  				
Posted one year ago by MysteriousSeahorse54

I don't explicitly call torch.save().

  				
Posted one year ago by MysteriousSeahorse54

Hi AnxiousSeal95, the models are saved both with a Weights & Biases callback and through Stable Baselines 3's model.save. Yes, it makes sense to me that files local to the Docker container can't be downloaded. But yes, when setting output_uri to True, no models appear in the UI at all, which seems strange.

  				
Posted one year ago by MysteriousSeahorse54

Hi MysteriousSeahorse54, how are you saving the models? With torch.save()? If you're not specifying output_uri=True, it makes sense that you can't download them, as they are local files 🙂
And when you set output_uri=True, does no model appear in the UI at all?

  				
Posted one year ago by AnxiousSeal95

Manual logging has the same behavior. When the output destination is not set, the model artifacts are saved but can't be downloaded. They are saved inside the Docker container in which the task ran, not to the fileserver. When output_uri is set, the artifacts don't appear at all.
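One way to confirm where each registered model actually lives is to read the model URLs back through the SDK. This is a rough sketch, assuming the `clearml` SDK's `Task.get_task` and the `task.models["output"]` list; `is_local_uri` and `print_output_model_locations` are hypothetical helper names, and the task ID has to be supplied by the reader.

```python
from urllib.parse import urlparse


def is_local_uri(uri: str) -> bool:
    """Treat bare paths and file:// URIs as local to the machine that wrote them."""
    return urlparse(uri).scheme in ("", "file")


def print_output_model_locations(task_id: str) -> None:
    """List each output model of a task and whether its URL is local or remote."""
    from clearml import Task  # imported lazily; requires a configured ClearML server
    task = Task.get_task(task_id=task_id)
    for model in task.models["output"]:
        url = model.url or ""
        where = "local to the worker/container" if is_local_uri(url) else "remote storage"
        print(f"{model.name}: {url} ({where})")
```

A `file://` URL here would match the greyed-out download option described above, since the UI cannot reach a path inside a container that has already exited.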

  				
Posted one year ago by MysteriousSeahorse54

Sure, I am just trying to get the saved model weights. Logging scalars works fine. I am using Stable Baselines 3 and PyTorch.

  				
Posted one year ago by MysteriousSeahorse54

Thanks! I couldn't find it either, but better to ask and be sure. Trying the run with manual logging now.

  				
Posted one year ago by MysteriousSeahorse54

Sure, here is a snippet.
```python
import wandb
from stable_baselines3 import PPO
from wandb.integration.sb3 import WandbCallback

# `env` and `args` are defined elsewhere in the script
run = wandb.init(project="rsTest", sync_tensorboard=True)

# add tensorboard logging to the model
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=f"runs/{run.id}",
            learning_rate=args.learning_rate,
            batch_size=args.batch_size,
            n_steps=args.n_steps,
            n_epochs=args.n_epochs,
            device='cpu')

# create wandb callback
wandb_callback = WandbCallback(model_save_freq=1000,
                               model_save_path=f"models/{run.id}",
                               verbose=2)

# variable for how often to save the model
time_steps = 100000
for i in range(25):
    # add the reset_num_timesteps=False argument to the learn function to prevent the model from resetting the timestep counter
    # add the tb_log_name argument to the learn function to log the tensorboard data to the correct folder
    model.learn(total_timesteps=time_steps, callback=wandb_callback, progress_bar=True,
                reset_num_timesteps=False, tb_log_name=f"runs/{run.id}")
    # save the model to the models folder with the run id and the current timestep
    model.save(f"models/{run.id}/{time_steps*(i+1)}")
```
The part I don't understand is that when output_uri is not set, model artifacts show up, but when it is set, they don't.

  				
Posted one year ago by MysteriousSeahorse54

Hmm, can you give a small code snippet of the save code? Are you using wandb-specific code? If so, it makes sense that we don't save the model, as we only intercept torch.save() and not wandb function calls.

  				
Posted one year ago by AnxiousSeal95

Can you give me a bit more info on what exactly you're trying to log and which framework you're using?

  				
Posted one year ago by TimelyMouse69

Could you try to see if it does work when you log those manually?
https://clear.ml/docs/latest/docs/clearml_sdk/model_sdk#manually-logging-models
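As a rough illustration of what manual logging could look like for an SB3 checkpoint, the sketch below uses `OutputModel` and `update_weights` from the `clearml` SDK. The helper names, the model name, and the paths are hypothetical, and it assumes SB3's `model.save()` has already written a `.zip` file (SB3 appends `.zip` when the given path has no extension).

```python
from pathlib import Path


def checkpoint_path(save_stem: str) -> Path:
    """Return the file SB3 is expected to write for a given model.save() path."""
    p = Path(save_stem)
    return p if p.suffix else p.with_suffix(".zip")


def log_model_manually(task, weights_file: Path, name: str = "ppo-checkpoint"):
    """Register an existing checkpoint file as an output model of `task`."""
    from clearml import OutputModel  # imported lazily; requires the clearml package
    output_model = OutputModel(task=task, name=name, framework="PyTorch")
    # Uploads to the task's output destination when one is configured
    output_model.update_weights(weights_filename=str(weights_file))
    return output_model
```

In the training loop this would be called right after `model.save(...)`, for example `log_model_manually(task, checkpoint_path(f"models/{run.id}/{time_steps*(i+1)}"))`, where `task` is the object returned by `Task.init`.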

  				
Posted one year ago by TimelyMouse69

But you can't download them.

  				
Posted one year ago by MysteriousSeahorse54
