Yes AgitatedDove14, I am not sure what they use by default. Here is a simple working example:
```python
from typing import Optional

import torch
from clearml import Task
from pytorch_lightning import LightningDataModule, LightningModule
from pytorch_lightning.utilities.cli import LightningCLI
from torch.utils.data import DataLoader, Dataset, Subset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def ...
```
There are also ways to override the parameters from the command line, as described in https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments .
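For reference, the truncated snippet above might continue roughly like this; the dataset completion and the BoringModel-style module below are my guesses, not the original code:

```python
from clearml import Task
from pytorch_lightning import LightningDataModule, LightningModule
from pytorch_lightning.utilities.cli import LightningCLI
from torch.utils.data import DataLoader, Dataset
import torch


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class RandomDataModule(LightningDataModule):
    def __init__(self, size: int = 32, length: int = 64, batch_size: int = 8):
        super().__init__()
        self.dataset = RandomDataset(size, length)
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(self.dataset, batch_size=self.batch_size)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Toy loss just to have something to optimize
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Create the ClearML task first so the CLI arguments get logged with it
    task = Task.init(project_name="examples", task_name="lightning-cli")
    cli = LightningCLI(BoringModel, RandomDataModule)
```

With a script like this you can then override parameters on the command line, e.g. `--trainer.max_epochs=5`, as per the docs linked above.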
```
[package_manager.force_repo_requirements_txt=true] Skipping requirements, using repository "requirements.txt"
Using base prefix '/opt/conda'
New python executable in /home/ramon/.clearml/venvs-builds/3.7/bin/python3.7
Also creating executable in /home/ramon/.clearml/venvs-builds/3.7/bin/python
Installing setuptools, pip, wheel...
2021-06-10 09:57:56
done.
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: p...
```
I'll give that a try! Thanks CostlyOstrich36
CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea what could be causing this? If the resource monitor is the first plot, will the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot! 🙂
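For anyone hitting the same thing, a minimal sketch of how the resource monitoring can be turned off at task creation; I'm assuming the `auto_resource_monitoring` flag of `Task.init` is the knob involved here:

```python
from clearml import Task

# Disable ClearML's automatic CPU/GPU usage reporting so the monitoring
# plot is never the first thing reported (at the cost of losing it entirely)
task = Task.init(
    project_name="examples",
    task_name="lightning-no-monitoring",
    auto_resource_monitoring=False,
)
```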
Sure! Could you point me to how it's done?
So I would have to disconnect PyTorch, and then upload the model at the end?
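A minimal sketch of that flow, assuming `auto_connect_frameworks` is the right way to disconnect the PyTorch bindings and that an artifact upload is acceptable for the final model:

```python
import torch
from clearml import Task

# Don't let ClearML hook torch.save / torch.load automatically
task = Task.init(
    project_name="examples",
    task_name="manual-model-upload",
    auto_connect_frameworks={"pytorch": False},
)

model = torch.nn.Linear(32, 2)
# ... training loop ...

# Upload the final weights explicitly once training is done
torch.save(model.state_dict(), "model.pt")
task.upload_artifact(name="final_model", artifact_object="model.pt")
```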
I am using pytorch_lightning
I'll try to create a snippet I can share! Thanks 🙂
Yes Martin! I have a package installed from GitHub, but it's using the PyPI version.
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it will still fail when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause, or should it also detect the iteration?
Oh, I think I am wrong! Then it must be the ClearML monitoring. Still, it fails way before the timer ends.
I set the number to a crazy value and it fails around the same iteration
I set it to 200000
But the problem stems from when the first plot is the ClearML CPU and GPU monitoring. Were you able to reproduce it? Even if I set the number fairly large, the message appeared when the monitoring plot was reported.
I feel it's easier not to report than to clean up afterwards, but please correct me if I am overthinking it. I'll check if I can wrap the code in something that calls Task.delete when debugging.
AgitatedDove14 Downloading a dataset would not be possible using this, right? I want to be able to access the data and just avoid reporting the experiment results.
AgitatedDove14 task.set_archived(True) + the cleanup service should do it 🙂 If we run in debug mode, the experiment goes directly to the archive, gets cleaned up, and we don't pollute the main experiment page.
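A minimal sketch of that debug flow; the DEBUG switch is hypothetical, and I'm assuming the cleanup service later removes archived tasks:

```python
import os
from clearml import Task

DEBUG = os.getenv("DEBUG", "0") == "1"  # hypothetical debug switch

task = Task.init(project_name="examples", task_name="debug-run")
try:
    pass  # ... run the experiment ...
finally:
    if DEBUG:
        # Send the run straight to the archive; the cleanup service
        # (or an explicit task.delete()) removes it later
        task.set_archived(True)
    task.close()
```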
I need to fetch a dataset for some simple tests, but since it doesn't have credentials for the self-hosted server it won't find the dataset.
Yes! What env variables should I pass?
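For reference, a sketch of the standard ClearML credential variables, set from Python before the SDK is used; hosts, keys, and dataset names below are placeholders:

```python
import os

# Point the SDK at the self-hosted server; values are placeholders
os.environ["CLEARML_API_HOST"] = "https://api.my-clearml-server.example"
os.environ["CLEARML_WEB_HOST"] = "https://app.my-clearml-server.example"
os.environ["CLEARML_FILES_HOST"] = "https://files.my-clearml-server.example"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access-key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret-key>"

# Import after the environment is set so the SDK picks the values up
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my-dataset")  # hypothetical names
local_copy = ds.get_local_copy()
```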
Thanks SuccessfulKoala55 !
This works:
```python
filepath = self.log_dir + os.sep + "checkpoint"
self.callbacks.append(
    ModelCheckpoint(
        filepath,
        monitor="val_loss",
        mode="min",
        save_best_only=True,
        save_weights_only=True,
    )
)
```
And this doesn't:
```python
filepath = self.log_dir + os.sep + "checkpoint.hdf5"
self.callbacks.append(
    ModelCheckpoint(
        filepath,
        ...
```
Hey AgitatedDove14, after playing around it seems that if the callback filepath points to an .hdf5 file, it is not uploaded.
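A possible workaround until that's sorted out, assuming an explicit artifact upload is acceptable; the path and artifact name are illustrative:

```python
from clearml import Task

task = Task.current_task()
# Explicitly register the checkpoint that the auto-logger missed
task.upload_artifact(
    name="best_checkpoint",
    artifact_object="logs/checkpoint.hdf5",  # illustrative path
)
```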