CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea of what could be causing this? If the resource monitor is the first plot, will the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot! 🙂
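For context, this is roughly how the monitoring got turned off — a minimal sketch, assuming the auto_resource_monitoring argument of Task.init is the switch in question (project and task names are placeholders):

from clearml import Task

# Minimal sketch: disable the automatic CPU/GPU resource monitor at init time.
# This is the caveat mentioned above -- the machine-metrics plots are lost.
task = Task.init(
    project_name='debug',                  # placeholder project name
    task_name='iteration-detection-test',  # placeholder task name
    auto_resource_monitoring=False,
)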
Yes! I think that's what I will do 🙂 Let me know if there is a way to contribute a mode to keep logging off. We just don't want to pollute the server when debugging.
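One way to keep things off the server entirely while debugging is offline mode — a rough sketch, assuming Task.set_offline is available in the installed clearml version:

from clearml import Task

# Rough sketch: offline mode stores everything locally instead of reporting
# to the server; nothing is sent unless the session is imported later.
Task.set_offline(offline_mode=True)

task = Task.init(project_name='debug', task_name='local-only-run')  # placeholder names
# ... run the experiment as usual ...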
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time it will still fail when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause, or should it also detect the iteration?
Hey CostlyOstrich36 sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task flag would help in this case.
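For reference, the flag in question is an argument to Task.init — a hedged sketch of how a resumed run might set it (loading the actual checkpoint file would still be up to the training code):

from clearml import Task

# Sketch only: continue_last_task=True reuses the previous task with the same
# project/name instead of creating a new one, so reporting continues from the
# last iteration. It does not restore model weights by itself.
task = Task.init(
    project_name='experiments',   # placeholder
    task_name='resnet-baseline',  # placeholder
    continue_last_task=True,
)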
That's really cool! But I would still prefer to avoid using pip_freeze, is there a way?
Yes AgitatedDove14, I added the git user name and password in the trains.conf file. On the Results tab of the UI, the clone command in the logs shows the SSH command instead of the HTTPS one: Repository cloning failed: Command ['clone', 'git@gitlab.com : ...
AgitatedDove14 Thanks! I'm trying to figure out how to create a minimum working example! I am also working with Hydra so that may be a thing. The extension is what's causing it to fail (haven't figured out why).
Managed to get:
clearml_agent: ERROR: Command '['/home/ramon/.clearml/venvs-builds/3.9/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/var/tmp/requirements_tb0x2i3j.txt', '--extra-index-url', '
died with <Signals.SIGKILL: 9>.
while building the task with that ID on the agent
I am about to try everything AgitatedDove14, but I ran into a GitLab error from the agent. I added the username and password to the configuration file but still get a Host key verification failed. Is it common that the cloning message shows the SSH link instead of the HTTPS one when username and password are provided?
Hey AgitatedDove14 does this work for you?
from argparse import ArgumentParser
from tensorflow.keras import utils as np_utils
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow as tf
from clearml import Task
class Linear(tf.keras.Model):
    def __init__(self, in_shape=(784,), num_classes=10):
        super().__init__()
        self.l...
Best thing ever, thanks AgitatedDove14 !
Hi AgitatedDove14, thanks for your reply. With the dashboard I meant the Web-App (UI). I am trying to access http://<External IP>:8080 but unfortunately nothing shows up.
CostlyOstrich36 PyTorch Lightning exposes the current_epoch in the trainer, not sure if that is what you mean.
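A minimal sketch of where that value is accessible inside a LightningModule (hook names vary a bit between Lightning versions, so treat this as illustrative):

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    # Sketch: self.current_epoch proxies the trainer's epoch counter,
    # so both of these report the same number during training.
    def on_train_epoch_end(self):
        print(f"finished epoch {self.current_epoch} "
              f"(trainer reports {self.trainer.current_epoch})")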
I have the agent configured to force install requirements.txt
I am still getting the error even with the v0.16.3 agent, is there something else we have to do other than updating it?
If you try: ModelCheckpoint('best_model.hdf5', save_best_only=True) does it work too?
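For completeness, a minimal sketch of how that callback would be wired into training (file name and data are placeholders):

from tensorflow.keras.callbacks import ModelCheckpoint

# Sketch: keep only the best checkpoint seen so far (monitors val_loss by default).
checkpoint = ModelCheckpoint('best_model.hdf5', save_best_only=True)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=10,
#           callbacks=[checkpoint])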
I set the number to a crazy value and it fails around the same iteration
There are also ways to override the parameters, as described in https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments .
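A rough sketch of that override mechanism (the import path differs between Lightning versions — pytorch_lightning.utilities.cli in older releases, pytorch_lightning.cli in newer ones — and MyModel/MyDataModule are placeholders):

from pytorch_lightning.cli import LightningCLI

from my_project import MyModel, MyDataModule  # placeholder imports

# Any __init__ argument of the model/datamodule becomes a CLI flag, e.g.:
#   python train.py fit --model.learning_rate 0.01 --trainer.max_epochs 5
if __name__ == '__main__':
    LightningCLI(MyModel, MyDataModule)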
AgitatedDove14 update here! Something like this should work:

from trains import StorageManager
from trains.storage.helper import StorageHelper

bucket = 'gs://bucket'
helper = StorageHelper.get(bucket)
remote_files = helper.list('folder')
for f in remote_files:
    StorageManager.get_local_copy(bucket + "/" + f)

The * gives [] results since in the list method startswith is used, which treats the pattern as a plain string and not as a wildcard.
I need to fetch a dataset for some simple tests, but since it doesn't have credentials for the self-hosted server it won't find the dataset.
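For reference, this is roughly the call that fails without credentials — a sketch assuming the clearml Dataset API, with placeholder names; the machine needs a clearml.conf (or CLEARML_API_* environment variables) pointing at the self-hosted server for the lookup to succeed:

from clearml import Dataset

# Sketch only: Dataset.get authenticates against the server, so valid
# credentials for the self-hosted deployment are required for this to work.
dataset = Dataset.get(
    dataset_project='my_project',  # placeholder
    dataset_name='my_dataset',     # placeholder
)
local_path = dataset.get_local_copy()
print(local_path)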