Yes AgitatedDove14 , I am not sure what they use by default. Here is a simple working example:
` from typing import Optional
import torch
from clearml import Task
from pytorch_lightning import LightningDataModule, LightningModule
from pytorch_lightning.utilities.cli import LightningCLI
from torch.utils.data import DataLoader, Dataset, Subset
class RandomDataset(Dataset):
def init(self, size, length):
self.len = length
self.data = torch.randn(length, size)
def ...
If you try:ModelCheckpoint('best_model.hdf5', save_best_only=True)does it work too?
I need to fetch a dataset for some simple tests but since it doesn’t have credentials to the self-hosted server it wont find the dataset
Yes! What env variables should I pass
Managed to get:
clearml_agent: ERROR: Command '['/home/ramon/.clearml/venvs-builds/3.9/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/var/tmp/requirements_tb0x2i3j.txt', '--extra-index-url', '
died with <Signals.SIGKILL: 9>.
while building the task with the id on the agent
Basically one points to an hdf5 and the other one has no extensiion
Just to make sure get everything right AgitatedDove14 :
We have to define the Task inside the function decorated with the @hydra.main We can modify the parameters that are overridden on UI on : configuration tab -> Args -> overrides -> modify the listAdditional question:
Will the sweep functionality work?
This works:filepath = self.log_dir + os.sep + "checkpoint" self.callbacks.append( ModelCheckpoint( filepath, monitor="val_loss", mode="min", save_best_only=True, save_weights_only=True, ) )And this doesn’t:
` filepath = self.log_dir + os.sep + "checkpoint.hdf5"
self.callbacks.append(
ModelCheckpoint(
filepath,
...
I’ll open the PR!
AgitatedDove14 Thanks! I’ll give it a try! Makes sense 👌
I feel it’s easier not to report than cleaning after but please correct me if I am overthinking it. I’ll check if I could wrap the code in something that calls the Task.delete if debugging
AgitatedDove14 from this thread I understand hydra is not supported and therefore overriding the parameters from the UI wont work, but is there still a way to track and add the parameters to the experiment? Will task.connect_configuration work with the yaml files?
TimelyPenguin76 I found out its just one package that is causing the error ( cloudpickle breaks everything). Is there a way to use Pigar but force a single package to have a version?
Thanks AgitatedDove14 ! seems to be subclassed model + extension
Oh I think I am wrong! Then it must be the clearml monitoring. Still it fails way before the timer ends.
Hi CostlyOstrich36 ! The message is the following:clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868faThey are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
Yes AgitatedDove14 , I added git user name and password on the trains.conf file. On the results tab of the UI the logs clone command shows the SSH command instead of the HTTPS :Repository cloning failed: Command ['clone', mailto:'git@gitlab.com : ...
Yes, it’s similar; somewhat more automatic since it detects the classes of functions arguments and generates the CLI. What do you mean by that AgitatedDove14 get all the parameters and use task.connect ?
Yes, exactly! Unfortunately I am not so familiar with the internals of the library but I could take a look and figure that out.
I am about to try everything AgitatedDove14 but ran into a gitlab error from the agent, I added the username and password to the configuration file but still get a Host key verification failed . Is it common that the cloning message shows the SSH link instead of the HTTPS when username and password are provided?
Sure, I’ll share It through a private message!
AgitatedDove14 I filed an issue of fire for them to point us to the argument parsing method https://github.com/google/python-fire/issues/291
Hey AgitatedDove14 do you have an implementation for gcloud? this is awesome