It works perfectly! AgitatedDove14 There is something weird on my side
Also, should I allow 8080, 8008, and 8081 on ingress and egress on GCP, or is only egress enough?
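One way to tell a firewall problem from a server problem is to test whether each port accepts TCP connections from the machine that runs the browser or the SDK. The sketch below uses only the Python standard library. The helper names `port_is_open` and `report` are illustrative, and the role noted for each port (8080 web UI, 8008 API server, 8081 file server) assumes the default ClearML Server layout.

```python
import socket

# Assumed default ClearML Server ports; adjust if your deployment differs.
CLEARML_PORTS = {8080: "web UI", 8008: "API server", 8081: "file server"}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(host: str) -> dict:
    """Map each assumed ClearML port to whether it is reachable on host."""
    return {port: port_is_open(host, port) for port in CLEARML_PORTS}
```

Running `report("<External IP>")` from the local machine shows which of the three ports are reachable from outside. A port that is open when checked on the server itself but closed from outside points at the firewall rules.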
Hi AgitatedDove14, thanks for your reply. By the dashboard I meant the Web-App (UI). I am trying to access http://<External IP>:8080 but unfortunately nothing shows up.
Hey CostlyOstrich36, sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task flag would help in this case.
Thanks AgitatedDove14 ! seems to be subclassed model + extension
Thanks SuccessfulKoala55 !
Yes! What env variables should I pass?
AgitatedDove14 Well, I have a loss function which is something like:
class MyLoss(...):
    def forward(...):
        weights = self.compute_weights(...)
        return (weights * (target - preds)).mean()
There seems to be a problem on certain batches when computing the weights. What would be the best way to log the batch that causes the problem, along with the weights being computed?
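A possible way to capture the offending batch is to check the weights right after they are computed and log the inputs only when the check fails. The sketch below assumes the "problem" is NaN or infinite weights, which the message does not confirm. The helpers `non_finite_positions` and `log_if_bad` are hypothetical names, and they take plain Python lists so the check does not depend on the framework. With PyTorch tensors, something like `weights.detach().flatten().tolist()` would produce a suitable list.

```python
import logging
import math

logger = logging.getLogger("loss_debug")


def non_finite_positions(values):
    """Return the indices of NaN or infinite entries in a flat sequence of floats."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def log_if_bad(batch_idx, weights, preds, target):
    """Log the whole batch when any weight is non-finite; return True if logged."""
    bad = non_finite_positions(weights)
    if not bad:
        return False
    logger.warning(
        "batch %d: %d non-finite weight(s) at %s | weights=%s preds=%s target=%s",
        batch_idx, len(bad), bad, weights, preds, target,
    )
    return True
```

Calling `log_if_bad(...)` inside `forward`, just before the `return`, keeps the normal path cheap and records the full batch only when something goes wrong. If the problem is something other than non-finite values (for example, weights outside an expected range), the predicate in `non_finite_positions` is the only part that needs to change.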
AgitatedDove14 Thanks! I'm trying to figure out how to create a minimal working example! I am also working with Hydra, so that may be a factor. The extension is what's causing it to fail (haven't figured out why).
Hi CostlyOstrich36! The message is the following: clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868fa
They are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
Yes Martin! I have a package installed from GitHub, but it's using the PyPI version
Side note: When running src.train as a module, the server gets the command as src and it has to be modified to be src.train
AgitatedDove14 Thanks! I'll give it a try! Makes sense
AgitatedDove14 I am not sure why the packages get different versions. Maybe, since the package is not directly imported in my code, it is possible to get a different version from what I have locally (?). Should all the library versions match exactly between my local environment and the code that runs in the agent? The Task.add_requirements(package_name, package_version=None) workaround works perfectly! I just add the previous version that doesn't break the code. Yes, definitely a force flag could help ...
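The workaround above can be sketched as a small helper that registers the pinned versions before the task is created. The function name `init_with_pins`, the `pins` mapping, and the `task_cls` parameter are hypothetical; only `Task.add_requirements(package_name, package_version=None)` comes from the message itself. The sketch assumes the pins have to be registered before `Task.init` is called so that they end up in the task's recorded requirements.

```python
def init_with_pins(project_name, task_name, pins, task_cls=None):
    """Pin package versions, then create the task.

    pins maps package name -> version string (or None to leave it unpinned).
    task_cls defaults to clearml.Task; it is a parameter only so the call
    order can be exercised without a ClearML server.
    """
    if task_cls is None:
        # Imported lazily so the helper can be loaded without clearml installed.
        from clearml import Task as task_cls
    # Register every pin first, so the requirements exist when the task is created.
    for package_name, package_version in pins.items():
        task_cls.add_requirements(package_name, package_version=package_version)
    return task_cls.init(project_name=project_name, task_name=task_name)
```

For example, `init_with_pins("my_project", "train", {"some_package": "1.2.3"})` would pin the placeholder `some_package` to the last version known to work; the project, task, and package names here are illustrative only.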
Not yet AgitatedDove14. Does the agent use by default the Python version the command is run with? I installed conda and tried using package_manager.type=conda but then I get an error: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'
With pip I get the first error I showed. I tried conda and it starts running, but at some point it crashes with: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'
Awesome AgitatedDove14 Thanks a lot
Sure! I enqueue the experiment from my local machine: python -m src.train model=my_model loss=my_loss dataset=my_dataset
Then I go to the server, run the experiment, and create a copy to run with a new model. On the copy, I go to the script path and modify it to be: -m src.train model=my_other_model loss=my_loss dataset=my_dataset
The new experiment, even though the script path now has my_other_model, starts training using my_model.
I can also see ...
Hey AgitatedDove14 does this work for you?
` from argparse import ArgumentParser
from tensorflow.keras import utils as np_utils
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow as tf
from clearml import Task
class Linear(tf.keras.Model):
    def __init__(self, in_shape=(784,), num_classes=10):
        super().__init__()
        self.l...