Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
I Am Using Clearml Pro And Pretty Regularly I Will Restart An Experiment And Nothing Will Get Logged To Clearml. It Shows The Experiment Running (For Days) And It'S Running Fine On The Pc But No Scalers Or Debug Samples Are Shown. How Do We Troubleshoot T

I am using ClearML Pro and pretty regularly I will restart an experiment and nothing will get logged to ClearML. It shows the experiment running (for days) and it's running fine on the PC but no scalers or debug samples are shown.
How do we troubleshoot this?

  
  
Posted 4 months ago
Votes Newest

Answers 69


I am still having this issue. An update is that the "abort" does not work. Even though the state is correctly tracked in ClearML when I try to abort the experiment through the UI it says it does it but the experiment remains running on the computer.

  
  
Posted 3 months ago

I am on 1.16.2

    task = Task.init(project_name=model_config['ClearML']['project_name'],
                     task_name=model_config['ClearML']['task_name'],
                     continue_last_task=False,
                     auto_connect_streams=True)
  
  
Posted 3 months ago

So I was able to repeat the same behavior on a machine running this example None

by adding the following callback

class TensorBoardImage(TensorBoard):
    @staticmethod
    def make_image(tensor):
        from PIL import Image
        import io
        tensor = np.stack((tensor, tensor, tensor), axis=2)
        height, width, channels = tensor.shape
        image = Image.fromarray(tensor)
        output = io.BytesIO()
        image.save(output, format='PNG')
        image_string = output.getvalue()
        output.close()
        return tf.Summary.Image(height=height,
                                width=width,
                                colorspace=channels,
                                encoded_image_string=image_string)

    def on_epoch_end(self, epoch, logs=None):
        if logs is None:
            logs = {}
        super(TensorBoardImage, self).on_epoch_end(epoch, logs)
        images = self.validation_data[0]  # 0 - data; 1 - labels
        img = (255 * images[0].reshape(28, 28)).astype('uint8')

        image = self.make_image(img)
        summary = tf.Summary(value=[tf.Summary.Value(tag='image', image=image)])
        self.writer.add_summary(summary, epoch)

So it seems like there is some bug in the how ClearML is logging tensorbaord images that causes everything to fail

  
  
Posted 3 months ago

It is still getting stuck. I think the issue might have something to do with the iterations versus epochs. I notice that one of the scalars that gets logged early is logging the epoch while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26

  
  
Posted 3 months ago

task.connect(model_config)
task.connect(DataAugConfig)

If these are separate dictionaries , you should probably use two sections:

    task.connect(model_config, name="model config")
    task.connect(DataAugConfig, name="data aug")

It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26

wait so you are seeing Some scalars ?

while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26

what are you seeing in your TB?

  
  
Posted 3 months ago

I just created a new virtual environment and the problem persists. There are only two dependencies clearml and tensorflow. @<1523701070390366208:profile|CostlyOstrich36> what logs are you referring to?

  
  
Posted 3 months ago

So even if you abort it on the start of the experiment it will keep running and reporting logs?

  
  
Posted 3 months ago

When the script is hung at the end the experiment says failed in ClearML

  
  
Posted 3 months ago

Another thing I notice is that aborting the experiment does not work when this is happening. It just continues to run

  
  
Posted 3 months ago
8K Views
69 Answers
4 months ago
3 months ago
Tags