Answered
Hi all, I hope someone can help. I am using ClearML agents with Docker containers to train RL models with Stable Baselines 3 on an on-premise server. I am having issues with saving and loading models.

If I don't specify output_uri in Task.init, artifacts appear to be saved, but the download option is greyed out when I right-click an output model. The model is saved, but its path points to the Docker container that ran the training, not to a folder on the server itself.

If I set output_uri=True, no model artifacts are saved at all. I feel like I am missing something simple but can't seem to find a solution in the docs.

  
  
Posted one year ago

Answers 15


Can you give me a bit more info on what exactly you're trying to log and which framework you're using?

  
  
Posted one year ago

Sure, I am just trying to get the saved model weights. Logging scalars works fine. I am using Stable Baselines 3 and PyTorch.

  
  
Posted one year ago

This run with no output_uri specified produces artifacts.

  
  
Posted one year ago

But you can't download them.

  
  
Posted one year ago

Could you try to see if it does work when you log those manually?
https://clear.ml/docs/latest/docs/clearml_sdk/model_sdk#manually-logging-models
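
For reference, manual logging along the lines of the linked docs might look like the sketch below. It assumes a task already created with Task.init and a checkpoint written by SB3's model.save; the helper names sb3_zip_path and log_sb3_checkpoint and the default model name are made up for illustration, not part of ClearML or SB3.

```python
from pathlib import Path


def sb3_zip_path(save_path: str) -> Path:
    """Return the file SB3's model.save() writes for `save_path`.

    SB3 appends '.zip' when the given path has no file extension.
    """
    path = Path(save_path)
    return path if path.suffix else path.with_name(path.name + ".zip")


def log_sb3_checkpoint(task, save_path: str, name: str = "ppo-checkpoint"):
    """Register an SB3 checkpoint as an output model of `task` (hypothetical helper)."""
    # Imported here so the path helper above works without clearml installed.
    from clearml import OutputModel

    output_model = OutputModel(task=task, name=name, framework="PyTorch")
    # Upload the checkpoint file; unless upload_uri is passed, the destination
    # is expected to follow the task's output_uri setting.
    output_model.update_weights(weights_filename=str(sb3_zip_path(save_path)))
    return output_model
```

With the training loop from this thread, the call would be something like log_sb3_checkpoint(task, f"models/{run.id}/{time_steps*(i+1)}") right after model.save.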

  
  
Posted one year ago

Sure, I will try that. Does ClearML have a specific Stable Baselines 3 framework tag, or should I try with just PyTorch?

  
  
Posted one year ago

I don't see SB3 here so PyTorch would be best: https://clear.ml/docs/latest/docs/integrations/libraries

  
  
Posted one year ago

Thanks! I couldn't find it either, but better to ask and be sure. Trying the run with manual logging now

  
  
Posted one year ago

Manual logging has the same behavior. When the output destination is not set, the model artifacts are saved but can't be downloaded. They are saved inside the Docker container in which the task ran, not on the fileserver. When output_uri is set, the artifacts don't appear at all.

  
  
Posted one year ago

Hi MysteriousSeahorse54, how are you saving the models? With torch.save()? If you're not specifying output_uri=True, it makes sense that you can't download them, as they are local files 🙂
And when you put output_uri = True, does no model appear in the UI at all?
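
A minimal sketch of the two ways output_uri is usually set, in case it helps compare setups. The helper names resolve_output_uri and start_task are hypothetical; only Task.init and its project_name, task_name and output_uri arguments come from ClearML.

```python
from typing import Optional, Union


def resolve_output_uri(destination: Optional[str] = None) -> Union[bool, str]:
    """Pick the value to pass as Task.init(output_uri=...).

    True asks ClearML to upload to the default files server; a string names
    an explicit destination (a shared folder or an object-storage URI).
    """
    return destination if destination else True


def start_task(project_name: str, task_name: str, destination: Optional[str] = None):
    """Create a task whose models are uploaded instead of staying container-local."""
    # Imported lazily so resolve_output_uri can be used without clearml installed.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        output_uri=resolve_output_uri(destination),
    )
```

Calling start_task("rl-experiments", "ppo-run") would use the default files server, while passing destination="/mnt/shared/models" would point uploads at that folder; both the project name and the path are placeholders.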

  
  
Posted one year ago

Hi AnxiousSeal95, the models are saved both with a Weights & Biases callback and through Stable Baselines 3's model.save. Yes, it makes sense to me that files local to the Docker container can't be downloaded. But when output_uri is set to True, no models appear in the UI at all, which seems strange.

  
  
Posted one year ago

Hmm, can you share a small snippet of the save code? Are you using wandb-specific code? If so, it makes sense that we don't save it, as we only intercept torch.save() and not wandb function calls.
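
If only torch.save() is intercepted, one thing to try is saving the policy weights through torch.save directly. The sketch below assumes an SB3 model whose policy attribute is a torch module; save_policy_weights and its save_fn parameter are hypothetical names used for illustration, and whether the resulting file shows up in the UI still depends on the task setup.

```python
import os
from typing import Any, Callable, Optional


def save_policy_weights(
    model: Any,
    path: str,
    save_fn: Optional[Callable[[Any, str], None]] = None,
) -> str:
    """Save model.policy.state_dict() to `path` via torch.save.

    `save_fn` can be injected for testing; by default torch.save is used,
    which is the call ClearML is described as intercepting.
    """
    if save_fn is None:
        import torch  # imported lazily so the helper loads without torch

        save_fn = torch.save
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    save_fn(model.policy.state_dict(), path)
    return path
```

Note that this stores only the policy network parameters, not the full archive that model.save produces, so it complements model.save rather than replacing it.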

  
  
Posted one year ago

Sure, here is a snippet.
```python
import wandb
from stable_baselines3 import PPO
from wandb.integration.sb3 import WandbCallback

# `env` and `args` are defined earlier in the script
run = wandb.init(project="rsTest", sync_tensorboard=True)

# add tensorboard logging to the model
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=f"runs/{run.id}",
            learning_rate=args.learning_rate,
            batch_size=args.batch_size,
            n_steps=args.n_steps,
            n_epochs=args.n_epochs,
            device='cpu')

# create wandb callback
wandb_callback = WandbCallback(model_save_freq=1000,
                               model_save_path=f"models/{run.id}",
                               verbose=2,
                               )

# variable for how often to save the model
time_steps = 100000
for i in range(25):
    # add the reset_num_timesteps=False argument to the learn function to prevent the model from resetting the timestep counter
    # add the tb_log_name argument to the learn function to log the tensorboard data to the correct folder
    model.learn(total_timesteps=time_steps, callback=wandb_callback, progress_bar=True,
                reset_num_timesteps=False, tb_log_name=f"runs/{run.id}")
    # save the model to the models folder with the run id and the current timestep
    model.save(f"models/{run.id}/{time_steps*(i+1)}")
```
The part I don't understand is that when output_uri is not set, model artifacts show up, but when it is set, they don't.

  
  
Posted one year ago

I don't explicitly call torch.save().

  
  
Posted one year ago

Yeah, I guess that's the culprit. I'm not sure ClearML and wandb were planned to work together, and we are probably interfering with each other. Can you try removing the wandb model-save callback and running again with output_uri=True?

Also, I'd be happy to learn about your use case that uses both ClearML and wandb. Is it for evaluation purposes or something else?

  
  
Posted one year ago