Reputation
Badges 1
119 × Eureka!@<1523701070390366208:profile|CostlyOstrich36> Thanks for the help! It ended being a mistake on my side. Misconfigured the VM's memory and it had only 3.75 G. Failed when installing torch.
Side note: When running src.train
as a module the server gets the command as src
and has to be modified to be src.train
Basically one points to an hdf5 and the other one has no extensiion
Hey CostlyOstrich36 sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task
flag would help in this case.
Hey AgitatedDove14 does this work for you?
` from argparse import ArgumentParser
from tensorflow.keras import utils as np_utils
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow as tf
from clearml import Task
class Linear(tf.keras.Model):
def init(self, in_shape=(784,), num_classes=10):
super().init()
self.l...