Sure - the problem is that many of our trainings in SageMaker can't reach the company's server
We could use a VPC (which we do use, but then the entire bring-up of the training would be different)
Maybe I missed it in the documentation, but I could also use something like set_offline_dir() (to make sure it's pointing at /opt/ml or something) and then get_offline_file() and upload it myself
But what happens if the script is terminated? Maybe a spot termination, or Ctrl+C - does that mean I lose track of the training?
It's automatically set in the user's .clearml - I think /opt/ml is persistent (this is where you are supposed to save checkpoints in SageMaker)
thanks for the quick response! and also - your library/product is really cool and impressive
Makes sense. I currently run aws s3 sync every n iterations, and then I saw that there is an option to load a directory rather than a zip
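The every-n-iterations sync can be a small wrapper in the training loop. A rough sketch - the bucket path and directory are hypothetical, and `sync_to_s3` just shells out to `aws s3 sync`:

```python
import subprocess

SYNC_EVERY = 100  # sync cadence in iterations

def sync_to_s3(local_dir, s3_uri, dry_run=True):
    # Stand-in for: aws s3 sync <local_dir> <s3_uri>
    cmd = ["aws", "s3", "sync", local_dir, s3_uri]
    if dry_run:
        return cmd  # just show what would run
    return subprocess.run(cmd, check=True)

def maybe_sync(iteration, local_dir="/opt/ml/checkpoints",
               s3_uri="s3://my-bucket/run-1"):
    # Hypothetical bucket/path; sync only on iteration boundaries.
    if iteration > 0 and iteration % SYNC_EVERY == 0:
        return sync_to_s3(local_dir, s3_uri)
    return None
```

`aws s3 sync` only copies changed files, so calling it frequently is cheap as long as checkpoints are written incrementally.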
AgitatedDove14 , thanks for the quick response!
Yes - for upgrading this is what we do. The thing is, the first server was a pilot for my team, and now we have a proper server for the company (but it was a pilot too, so we didn't migrate all the old experiments). To make sure I understand the paid service - would you be up for a short phone call to explain the different options? Is it complete support, or can we pay for specific support tasks? It's worth mentioning that I'm not the decision maker in the company, but I would like to have some ...
In the version we have, I don't see that the plots are resizable - we are running 1.1.0, I believe
Hi AgitatedDove14 - I think it's the one before 1.1.1; the client is the latest, 1.0.5. Testing now on tensorflow-cpu 2.5.0
Hi, I mean something like what runai are doing - how would you work together with http://run.ai ?
Ok - thanks AgitatedDove14
We want to have many people working on a cluster of machines, and we want to be able to allocate a fraction of a GPU to specific jobs, to avoid starvation
Is it supported in other versions? 🙂
You can also replace
    image_open = Image.open(os.path.join('..', '..', 'reporting', 'data_samples', 'picasso.jpg'))
    image = np.asarray(image_open)
with
    image = np.random.random(size=[512, 512, 3])
Just add description="test" to one of the tf.summary.image calls and see that it silently doesn't get logged
Great! I'll update to this version and verify that the issue is solved