
AgitatedDove14 Downloading a dataset would not be possible using this, right? I want to be able to access the data and just avoid reporting the experiment results.
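Something along these lines is what I'm after (a rough sketch; the project/dataset names are placeholders):
` from clearml import Dataset

# Fetch a read-only local copy of the dataset without reporting anything
ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = ds.get_local_copy()
print(local_path) `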
I'll show you what I have through PM!
I am using pytorch_lightning
, I'll try to create a snippet I can share! Thanks 🙂
I am using the code inside the on_train_epoch_end hook, inside a metric. So the important part is:
` import matplotlib.pyplot as plt

fig = plt.figure()
# ... code that draws the plot ...
# logger is the Lightning logger; .experiment is the SummaryWriter
logger.experiment.add_figure("fig", fig)
plt.close(fig) `
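For completeness, here is a minimal sketch of how that sits inside the hook (the module and the plot itself are placeholders):
` import matplotlib.pyplot as plt
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def on_train_epoch_end(self):
        fig = plt.figure()
        plt.plot([0, 1, 2], [0.9, 0.5, 0.2])  # placeholder plot
        # self.logger.experiment is the underlying TensorBoard SummaryWriter
        self.logger.experiment.add_figure("fig", fig, global_step=self.current_epoch)
        plt.close(fig) `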
AgitatedDove14 Thanks! I'm trying to figure out how to create a minimum working example! I am also working with Hydra, so that may be a factor. The extension is what's causing it to fail (haven't figured out why).
Hey AgitatedDove14 does this work for you?
` from argparse import ArgumentParser
from tensorflow.keras import utils as np_utils
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow as tf
from clearml import Task
class Linear(tf.keras.Model):
    def __init__(self, in_shape=(784,), num_classes=10):
        super().__init__()
        self.l...
Hi CostlyOstrich36! The message is the following: ` clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868fa `
They are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
Yes! I think that's what I will do 🙂 Let me know if there is a way to contribute a mode to keep logging off. We just don't want to pollute the server when debugging.
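For reference, this is roughly what I had in mind, assuming auto_connect_frameworks can be used this way:
` from clearml import Task

# Keep scalars/console reporting but turn off automatic PyTorch
# model logging (placeholder project/task names)
task = Task.init(
    project_name="my_project",
    task_name="debug_run",
    auto_connect_frameworks={"pytorch": False},
) `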
Is this caused by running the script with the arguments?
Hey AgitatedDove14, after playing around it seems that if the callback filepath points to an HDF5 file, it is not uploaded. Basically, one points to an hdf5 file and the other one has no extension.
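To make the difference concrete, a sketch of the two callbacks (paths are illustrative, and the comments describe the behavior I am seeing, not the documented one):
` from tensorflow.keras.callbacks import ModelCheckpoint

# This checkpoint shows up as an output model on the server
checkpoint_no_ext = ModelCheckpoint(filepath="weights/best")

# This one is written locally but is never uploaded
checkpoint_hdf5 = ModelCheckpoint(filepath="weights/best.hdf5") `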
It is failing exactly when the download finishes. Not sure if it is relevant, but in ~/.clearml/pip-download-cache only an empty cu120 folder appears. Should the torch wheel be saved there?
So should I set them all with a default value? The working dir is the project one, the one that contains the module package.
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause? Or should it also detect the iteration?
Sure! I enqueue the experiment from my local machine:
` python -m src.train model=my_model loss=my_loss dataset=my_dataset `
Then I go to the server, run the experiment, and create a copy to run with a new model. On the copy, I go to the script path and modify it to be:
` -m src.train model=my_other_model loss=my_loss dataset=my_dataset `
The new experiment, even though the script path has my_other_model, starts training using my_model.
I can also see ...
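For context, the entry point is a standard Hydra app, roughly like this (the config path/name and project name are placeholders):
` import hydra
from omegaconf import DictConfig
from clearml import Task

@hydra.main(config_path="configs", config_name="config")
def main(cfg: DictConfig) -> None:
    # model / loss / dataset are Hydra config groups, selected on the
    # command line, e.g. model=my_model loss=my_loss dataset=my_dataset
    task = Task.init(project_name="my_project", task_name="train")
    ...

if __name__ == "__main__":
    main() `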
CostlyOstrich36 PyTorch Lightning exposes the current_epoch attribute on the trainer; not sure if that is what you mean.
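For example (a minimal sketch):
` import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def on_train_epoch_end(self):
        # the trainer tracks the epoch counter; it is also mirrored
        # on the LightningModule as self.current_epoch
        print(f"finished epoch {self.trainer.current_epoch}") `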
No, I have all the packages with a version. I just want to know if there is a way to override the requirement versions detected by Pigar when using detect_with_pip_freeze: false. Locally I have cloudpickle==1.4.1, but when running the code and sending the task to the node, the environment uses cloudpickle==1.6.0. I have to manually change the version in the UI. Is there a way to force this single package to have a version? Maybe in the requirements.txt or something similar?
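For the record, something like this is what I was hoping for, assuming Task.add_requirements can pin a single package (it has to be called before Task.init):
` from clearml import Task

# Pin one package version, overriding whatever the automatic
# requirements detection finds
Task.add_requirements("cloudpickle", "==1.4.1")

task = Task.init(project_name="my_project", task_name="my_task") `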
SuccessfulKoala55 just to let you know: since I opened the link straight from the GCP console, it was using https in the address instead of http, hence the error. Thanks a lot for your help!