It is failing exactly when the download finishes. Not sure if it is something but on the ~/.clearml/pip-download-cache
only a cu120
empty folder appears. Should the torch wheel be saved there?
Sure! For torch I have:
torch==2.0.1
# via
# monai
# pytorch-lightning
# torchio
# torchmetrics
AgitatedDove14 Well I have a loss function which is something like:class MyLoss(...): def forward(...): weights = self.compute_weights(...) return (weights * (target-preds)).mean()
There seems to be a problem on certain batch when computing the weights. What would be the best way to log the batch that causes the problem, along with the weights being computed.
Awesome AgitatedDove14 Thanks a lot ๐
Iโll show you what I have through PM!
So should I set them all with a default value? The working dir is the project one, the one that contains the module
package
` File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/trains/backend_api/session/token_manager.py", line 72, in _get_token_exp
return jwt.decode(token, verify=False).get('exp', sys.maxsize)
File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/jwt/api_jwt.py", line 113, in decode
decoded = self.decode_complete(jwt, key, algorithms, options, **kwargs)
File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/jwt/api_jwt.py", line 80, in decode_c...
Yes AgitatedDove14 , I am not sure what they use by default. Here is a simple working example:
` from typing import Optional
import torch
from clearml import Task
from pytorch_lightning import LightningDataModule, LightningModule
from pytorch_lightning.utilities.cli import LightningCLI
from torch.utils.data import DataLoader, Dataset, Subset
class RandomDataset(Dataset):
def init(self, size, length):
self.len = length
self.data = torch.randn(length, size)
def ...
Thanks AgitatedDove14 ! seems to be subclassed model + extension
Basically one points to an hdf5 and the other one has no extensiion
This works:filepath = self.log_dir + os.sep + "checkpoint" self.callbacks.append( ModelCheckpoint( filepath, monitor="val_loss", mode="min", save_best_only=True, save_weights_only=True, ) )
And this doesnโt:
` filepath = self.log_dir + os.sep + "checkpoint.hdf5"
self.callbacks.append(
ModelCheckpoint(
filepath,
...
Hey AgitatedDove14 after playing around seems that if the callback filepath points to an hdf5 file it is not uploaded.
Hey CostlyOstrich36 sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task
flag would help in this case.
AgitatedDove14 Thanks! Im trying to figure out how to create a minimum working example! I am also working with Hydra so that may be a thing. The extension is whats causing it to fail (havenโt figured out why).
Hi CostlyOstrich36 ! The message is the following:clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868fa
They are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
Yes! I will take a look at it!
It works perfectly! AgitatedDove14 There is something weird on my side ๐ข
There are also ways to override the parameters as stated https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments .