I think for the time being it's not possible to upload automatically to S3. It shouldn't be a problem to add support for that, but I don't think it's supported ATM (will double check)
We can use VPC (which we use, but then the entire bringup of the training would be different)
It's automatically set in the user's .clearml - I think that /opt/ml is persistent (this is where you are supposed to save checkpoints in sagemaker)
So I think it's necessary to code defensively and, once training is done, upload to a remote location (S3 in your case). If the disk is persistent this shouldn't be a problem, as the logs will be saved. Makes sense?
makes sense.. I currently run aws s3 sync every n iterations, and then I saw that there is an option to load a dir rather than a zip
Sure - the problem is that many of our trainings in sagemaker are not exposed to the company's server
thanks for the quick response! and also - your library/product is really cool and impressive
So all training machines will be exposed to the server?
Can you elaborate on the use-case a bit more? Why not report directly to the server?
BTW, just talked to the devs, what happens is that your metrics/logs are saved locally, then once a task is closed, it's zipped. If you are afraid the instance might be taken from you, first we are planning to release a solution for these situations 🙂 and second your code needs to be aware of the risk and to be able to "resume" training from a specific model snapshot/iteration.
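A minimal sketch of the "resume from a snapshot" pattern mentioned above, using only the Python stdlib. All names here (`CKPT_PATH`, `save_checkpoint`, the toy train loop) are illustrative, not ClearML or SageMaker API; on SageMaker you would point the checkpoint path at /opt/ml.

```python
import json
import os
import tempfile

# Stand-in for a persistent checkpoint location (e.g. somewhere under /opt/ml).
CKPT_PATH = os.path.join(tempfile.mkdtemp(), "ckpt.json")

def save_checkpoint(iteration, weights):
    # Write to a temp file and rename, so a mid-write spot termination
    # never leaves a corrupt checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"iteration": iteration, "weights": weights}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    # Resume from the last snapshot if one exists, else start fresh.
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            return json.load(f)
    return {"iteration": 0, "weights": [0.0]}

state = load_checkpoint()
resumed_from = state["iteration"]
weights = state["weights"]
for it in range(resumed_from, 10):
    weights = [w + 0.1 for w in weights]  # stand-in for a real training step
    if it % 5 == 0:                       # checkpoint every N iterations
        save_checkpoint(it + 1, weights)
```

If the script is killed and relaunched, `load_checkpoint()` picks up the last saved iteration instead of restarting from zero.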
Cool and impressive are 2 adjectives we like to hear 😄
this should explain how to do it. You get the offline session path once you init the task
If you want you can just upload them manually to s3 as the last "line" of the script, or write a pipeline step that does that. Just remember you'll have to import it somehow later on
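A hedged sketch of that last "line" of the script: archive the offline session folder and push the zip somewhere durable. The `session_dir` here is a throwaway stand-in; in ClearML offline mode you would use the session path the task reports after init. The boto3 upload step is shown only as a comment (assumed bucket name and credentials).

```python
import os
import shutil
import tempfile

# Stand-in for the offline session folder -- replace with the real path
# printed/returned once the task is initialized in offline mode.
session_dir = tempfile.mkdtemp()
with open(os.path.join(session_dir, "metrics.json"), "w") as f:
    f.write("{}")  # pretend the task wrote some logs here

# Zip the whole session folder next to it: <session_dir>.zip
archive = shutil.make_archive(session_dir, "zip", session_dir)

# Assumed upload step (requires boto3 and AWS credentials; bucket name is made up):
# import boto3
# boto3.client("s3").upload_file(archive, "my-bucket", os.path.basename(archive))
```

Later you'd download that zip and import it into the server, which is the "import it somehow later on" part.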
maybe I missed it in the documentation - but I could also use something like set_offline_dir() (to make sure it's pointing at /opt/ml or something) and then get_offline_file() and upload it myself
If spot is taken from you then yes. It will be. (unless there's some drive persistence)
but what happens if the script is terminated? maybe a spot termination, ctrl+c - does this mean I lose track of the training?