Yes - for upgrading this is what we do. The thing is, the first server was a pilot for my team and now we have a proper server for the company (but it was a pilot too, so we didn't migrate all the old experiments). To make sure I understand the paid service - do you want to have a short phone call to explain the different options? Is it complete support, or can we pay for specific support tasks? It's worth mentioning that I'm not the decision maker in the company but I would like to have some ...
It's automatically set in the user's .clearml - I think that /opt/ml is persistent (this is where you are supposed to save checkpoints in sagemaker)
In the version we have I don't see that the plots are resizable - we are running 1.1.0, I believe
We want to have many people working on a cluster of machines, and we want to be able to allocate a fraction of a GPU to specific jobs, to avoid starvation
maybe I missed it in the documentation - but I could also use something like set_offline_dir() (to make sure it's pointing to /opt/ml or something) and then get_offline_file() and upload it myself
Sure - the problem is that many of our trainings in sagemaker are not exposed to the company's server
but what happens if the script is terminated? maybe a spot termination or ctrl+c - does this mean I lose track of the training?
Hi I mean something like what runai are doing, or how would you work together with http://run.ai ?
Great! I'll update to this version and will verify the issue is solved
makes sense.. I currently run aws s3 sync every n iterations, and then I saw that there is an option to load a dir rather than a zip
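A minimal sketch of the "aws s3 sync every n iterations" approach mentioned above, assuming the AWS CLI is installed and configured on the training machine. The helper names, the local directory, and the bucket URI are hypothetical placeholders, not part of ClearML or SageMaker:

```python
import subprocess


def build_sync_command(local_dir, s3_uri):
    """Build the argument list for `aws s3 sync <local_dir> <s3_uri>`."""
    return ["aws", "s3", "sync", str(local_dir), s3_uri]


def should_sync(iteration, every_n):
    """Return True on every n-th iteration, skipping iteration 0."""
    return every_n > 0 and iteration > 0 and iteration % every_n == 0


def maybe_sync(iteration, every_n, local_dir, s3_uri, runner=subprocess.run):
    """Run the sync when the iteration is due; return True if it ran.

    `runner` is injectable so the logic can be exercised without the AWS CLI.
    """
    if not should_sync(iteration, every_n):
        return False
    # check=True raises CalledProcessError if the CLI exits with a non-zero code
    runner(build_sync_command(local_dir, s3_uri), check=True)
    return True


# Hypothetical usage inside a training loop:
# for iteration in range(1, total_iterations + 1):
#     train_step()
#     maybe_sync(iteration, 100, "/opt/ml/checkpoints", "s3://example-bucket/run-1")
```

Since a spot termination can happen between syncs, a smaller n trades extra upload traffic for less lost progress.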
thanks for the quick response! and also - your library/product is really cool and impressive
Is it supported in other versions? 🙂
just add description="test" to one of the tf.summary.image calls and see that it silently doesn't get logged
Ok - thanks AgitatedDove14
Hi AgitatedDove14 - I think it's the one before 1.1.1, client is latest 1.0.5. Testing now on tensorflow-cpu 2.5.0
AgitatedDove14, thanks for the quick response!
We can use VPC (which we use, but then the entire bringup of the training would be different)
You can also replace
    image_open = Image.open(os.path.join('..', '..', 'reporting', 'data_samples', 'picasso.jpg'))
    image = np.asarray(image_open)
with
    image = np.random.random(size=[512, 512, 3])