Yes! I think that's what I will do. Let me know if there is a way to contribute a mode to keep logging off. We just don't want to pollute the server when debugging.
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause? Or should it also detect the iteration?
Hey CostlyOstrich36 sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task flag would help in this case.
That's really cool! But I would still prefer to avoid using pip_freeze, is there a way?
Yes AgitatedDove14, I added the git user name and password to the trains.conf file. On the results tab of the UI, the clone command in the logs shows the SSH URL instead of the HTTPS one: Repository cloning failed: Command ['clone', 'git@gitlab.com: ...
AgitatedDove14 Thanks! I'm trying to figure out how to create a minimum working example! I am also working with Hydra, so that may be a factor. The extension is what's causing it to fail (haven't figured out why).
I am about to try everything AgitatedDove14, but I ran into a GitLab error from the agent. I added the username and password to the configuration file but still get "Host key verification failed". Is it common for the cloning message to show the SSH link instead of the HTTPS one when a username and password are provided?
Hey AgitatedDove14 does this work for you?
```
from argparse import ArgumentParser

from tensorflow.keras import utils as np_utils
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow as tf
from clearml import Task


class Linear(tf.keras.Model):
    def __init__(self, in_shape=(784,), num_classes=10):
        super().__init__()
        self.l...
```
Best thing ever, thanks AgitatedDove14 !
Hi AgitatedDove14, thanks for your reply. By the dashboard I meant the Web-App (UI). I am trying to access http://<External IP>:8080 but unfortunately nothing shows up.
CostlyOstrich36 PyTorch Lightning exposes current_epoch in the trainer, not sure if that is what you mean.
I have the agent configured to force install requirements.txt
I am still getting the error even with the v0.16.3 agent, is there something else we have to do other than updating it?
I set the number to a crazy value and it fails around the same iteration
There are also ways to override the parameters, as described at https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments
AgitatedDove14 update here! Something like this should work:
```
from trains import StorageManager
from trains.storage.helper import StorageHelper

bucket = 'gs://bucket'
helper = StorageHelper.get(bucket)
remote_files = helper.list('folder')
for f in remote_files:
    StorageManager.get_local_copy(bucket + "/" + f)
```
The * gives [] results since the list method uses startswith, which treats it as a literal string and not as a wildcard.
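To illustrate the wildcard point with plain Python only (no trains calls), here is a minimal sketch. The file names and the two helper functions are made up for the illustration; it just contrasts a prefix match, which treats `*` as a literal character, with a glob-style match from the standard library.

```python
from fnmatch import fnmatchcase

# Hypothetical object names, as a bucket listing might return them.
names = ["folder/a.csv", "folder/b.csv", "other/c.csv"]


def list_by_prefix(names, prefix):
    """Prefix filtering: '*' is compared literally, like str.startswith does."""
    return [n for n in names if n.startswith(prefix)]


def list_by_pattern(names, pattern):
    """Glob-style filtering: '*' is expanded as a wildcard."""
    return [n for n in names if fnmatchcase(n, pattern)]


print(list_by_prefix(names, "folder/*"))   # no name literally starts with "folder/*"
print(list_by_prefix(names, "folder/"))    # plain prefix works
print(list_by_pattern(names, "folder/*"))  # wildcard expanded by fnmatch
```

So with a prefix-based listing, passing the folder name without `*` and filtering afterwards (if needed) seems to be the practical route.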
Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. Stopped is not the right word, more like frozen: it just froze at an epoch. The console on the agent shows epoch 33, first batch, and the one on the server shows epoch 32, last batch. The experiment had been running for ~6 hours.
Using detect_with_pip_freeze: true runs into "package version not found" for some of the packages I have locally.
AgitatedDove14 task.set_archived(True) + the cleanup service should do it. If we run in debug mode the experiment goes directly to the archive and gets cleaned, and we don't pollute the main experiment page.
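A minimal sketch of the debug-mode idea above, in case it helps someone else. The helper name `archive_if_debug` and the `DEBUG_RUN` environment variable are hypothetical, invented for this example; the only ClearML call it relies on is `task.set_archived(True)` from the message above.

```python
import os


def archive_if_debug(task, debug=None):
    """Archive `task` when running in debug mode; return whether it was archived.

    `task` is expected to be a clearml Task (anything with set_archived works).
    """
    if debug is None:
        # DEBUG_RUN is a hypothetical variable name used only in this sketch.
        debug = os.environ.get("DEBUG_RUN", "0") == "1"
    if debug:
        task.set_archived(True)
    return debug


# Intended usage (needs clearml installed and a configured server):
# from clearml import Task
# task = Task.init(project_name="examples", task_name="debug run")
# archive_if_debug(task)
```

The cleanup service then takes care of deleting the archived debug runs, as mentioned above.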
AgitatedDove14 I am not sure why the packages get different versions. Maybe, since the package is not directly imported in my code, it is possible to get a different version from the one I have locally (?). Should all the library versions match exactly between my local environment and the code that runs in the agent? The Task.add_requirements(package_name, package_version=None) workaround works perfectly! I just add the previous version that doesn't break the code. Yes, definitely a force flag could help ...
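For reference, a rough sketch of how the pinning workaround can be wrapped. The helper `pin_requirements` and the package name and version in the usage comment are hypothetical; the only ClearML API assumed is `Task.add_requirements(package_name, package_version=None)` as quoted above, which as far as I know has to be called before `Task.init`.

```python
def pin_requirements(task_cls, pins):
    """Register each (package, version) pair through task_cls.add_requirements.

    `task_cls` is expected to be clearml.Task; `pins` maps package names to
    version strings (or None to leave the version unpinned).
    """
    for name, version in pins.items():
        task_cls.add_requirements(name, version)


# Intended usage (needs clearml installed; names below are placeholders):
# from clearml import Task
# pin_requirements(Task, {"some-package": "1.2.3"})
# task = Task.init(project_name="examples", task_name="pinned run")
```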