Hi MysteriousSeahorse54, how are you saving the models? torch.save()? If you're not specifying output_uri=True, it makes sense that you can't download them, as they are local files 🙂
And when you put output_uri = True, does no model appear in the UI at all?
@<1590514584836378624:profile|AmiableSeaturtle81> yeah I can see what you mean. So you reuploaded everything from the ClearML file server into S3 and just changed the links?
Hi ZanyPig66 ,
I assume you're using torch.save() to save your model? A good place to start is with David's suggestion of specifying output_uri=True in the Task.init() call.
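A minimal sketch of what that looks like (project/task names and the model are placeholders):

    from clearml import Task
    import torch

    # output_uri=True uploads saved checkpoints to the ClearML file server
    # (you can also pass an S3/GS/Azure URI string instead of True)
    task = Task.init(project_name="examples", task_name="model upload demo", output_uri=True)

    model = torch.nn.Linear(10, 2)  # placeholder model
    torch.save(model.state_dict(), "model.pt")  # intercepted and uploaded automatically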
And some real pipeline (As real as our tests get 😄 )
Hmm, seems like there is a problem. Let me check 🙂
Hi GrittyHawk31 , you can use Dataset.get(). If you're using a file you can call Dataset.get_local_copy() to download it.
You can check out the data ingestion documentation at https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_python#data-ingestion or the example script at https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py that uses it
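For example, a minimal sketch (dataset/project names are placeholders):

    from clearml import Dataset

    # fetch the dataset object by project and name (placeholders)
    dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
    # download a local cached copy and get its path
    local_path = dataset.get_local_copy()
    print(local_path)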
Well...I'll make sure we do something about it 🙂
GiganticTurtle0 Got it, makes a lot of sense!
Hi Alejandro, could you elaborate on the use case? Do you basically want to save models and some "info" on them, but remove all the experiments? Are you removing experiments to reduce clutter, or for another reason?
Will you later use the models for something (Retraining \ deployment)?
Hi DeliciousBluewhale87 , can you try adding --cpu-only flag?
This gets me the artifact that I return in step1
I think this is what you wanted
You can use pre \ post step callbacks.
And in the pre_execute_callback, I can access this: a_pipeline._nodes[a_node.parents[0]].job.task.artifacts['data_frame']
pipe.add_step(
    name='stage_process',
    parents=['stage_data', ],
    base_task_project='examples',
    base_task_name='pipeline step 2 process dataset',
    parameter_override={'General/dataset_url': '${stage_data.artifacts.dataset.url}',
                        'General/test_size': 0.25},
    pre_execute_callback=pre_execute_callback_example,
    post_execute_callback=post_execute_callback_example)
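For reference, a hedged sketch of what such a callback could look like (it mirrors the names in the snippet above; treat it as illustrative rather than the exact example code):

    def pre_execute_callback_example(a_pipeline, a_node, current_param_override):
        # grab the 'data_frame' artifact from the parent step's task
        parent_node = a_pipeline._nodes[a_node.parents[0]]
        data_frame = parent_node.job.task.artifacts['data_frame']
        print('parent artifact url:', data_frame.url)
        # returning False would skip this step; True/None continues as usual
        return True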
Yup indeed! Let us know how it goes!
GiganticTurtle0 What about modifying the cleanup service to put all experiments with a specific tag into a subfolder? Then you'll have a subfolder for published experiments (or production models or whatever criteria you want to have 🙂 ). That would declutter your workspace automatically and still retain everything.
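A rough sketch of that idea, assuming a recent clearml version where Task.get_tasks() accepts tags and Task.move_to_project() is available (project and tag names are placeholders):

    from clearml import Task

    # find all experiments carrying the "keep" tag (tag/project names are placeholders)
    tasks = Task.get_tasks(project_name="my_project", tags=["production"])
    for t in tasks:
        # move them into a dedicated subfolder so the main project stays decluttered
        t.move_to_project(new_project_name="my_project/production")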
We are 😄 We have 3 talks in the upcoming GTC
LOL Love this Thread and sorry I didn't answer earlier!
VivaciousPenguin66 EnviousStarfish54 I totally agree with you. We do have answers to "how do you do X or Y" but we don't have workflows really.
What would be a logical place to start? Would something like "training a YOLO v3 person detector on the COCO dataset and enabling continuous training (let's say adding the PASCAL dataset afterwards)" be something interesting?
The only problem is the friction between atomic and big picture. In...
Hmm, can you give a small snippet of the save code? Are you using wandb-specific code? If so, it makes sense that we don't save it, as we only intercept torch.save() and not wandb function calls
BTW, just talked to the devs, what happens is that your metrics \ logs are saved locally, then once a task is closed, it's zipped. If you are afraid the instance might be taken from you, first, we are planning to release a solution for these situations 🙂 and second, your code needs to be aware of the risk and be able to "resume" training from a specific model snapshot \ iteration.
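As an illustration of the "resume from a snapshot" part, a hedged sketch (the task ID and checkpoint handling are placeholders):

    import torch
    from clearml import Task

    # fetch the previous (interrupted) task by ID -- placeholder ID
    previous_task = Task.get_task(task_id="abc123")
    # take its last output model snapshot and download the weights locally
    last_snapshot = previous_task.models["output"][-1]
    checkpoint_path = last_snapshot.get_local_copy()

    state_dict = torch.load(checkpoint_path)
    # ...load state_dict into your model/optimizer and continue training from there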
GiganticTurtle0 So 🙂 had a short chat with one of our R&D guys. ATM, what you're looking for isn't there. What you can do is use OutputModel().update_weights_package(folder_here) and a folder will be saved with EVERYTHING in it. Now I don't think it would work for you (I assume you want to download the model all the time, but artifacts only some of the time, and don't want to download everything every time), but it's a hack.
Another option is to use the model design field to save links to a...
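A minimal sketch of that hack (the folder path and project/task names are placeholders):

    from clearml import Task, OutputModel

    task = Task.init(project_name="examples", task_name="package model + extras")
    output_model = OutputModel(task=task)
    # uploads the whole folder (weights plus any extra files) as a single model package
    output_model.update_weights_package(weights_path="model_and_artifacts/")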
I think the best model name is person_detector_lr0.001_batchsz32_accuracy0.63.pkl 😄
Hi FierceHamster54 can you try another instance type? I just tried with n1 and it works. We are looking to see if it's instance type related
If you want you can just upload them manually to s3 as the last "line" of the script, or write a pipeline step that does that. Just remember you'll have to import it somehow later on
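For the "last line of the script" option, a hedged sketch using boto3 (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")
    # upload the local file as the final step of the script (bucket/key are placeholders)
    s3.upload_file("outputs/model.pkl", "my-bucket", "models/model.pkl")
    # later on you would fetch it again yourself, e.g.:
    # s3.download_file("my-bucket", "models/model.pkl", "model.pkl")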
Hadrien, just making sure I get the terminology: a stopped instance means you don't pay for it, just its storage, right? Or is it up and idling (in which case Martin's suggestion is valid)? Do you get stopped instances instantly when you ask for them?
Seems like it. Is that an issue?