Pigar is capturing different versions than the ones I have installed on my local machine (not a problem except for one package). I just want to force the version of that package so that I don’t have to change it manually in the UI for every experiment.
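A minimal sketch of pinning one package in a captured requirements list, using only the standard library. The helper name `pin_requirement` and the package names in the usage note are hypothetical. ClearML also appears to offer `Task.add_requirements(package_name, package_version)` (called before `Task.init`) for this purpose, but treat that as an assumption to check against your installed version.

```python
import re


def pin_requirement(requirements, package, version):
    """Return a copy of `requirements` with `package` pinned to `version`.

    `requirements` is a list of requirement lines (as in requirements.txt).
    If the package is not listed, the pinned line is appended.
    """
    pinned = f"{package}=={version}"
    # Match the package name at the start of the line, followed by the end of
    # the line or by a version specifier, extras bracket, or marker separator.
    pattern = re.compile(rf"^{re.escape(package)}\s*($|[=<>!~\[;])", re.IGNORECASE)
    result = []
    found = False
    for line in requirements:
        if pattern.match(line.strip()):
            # The whole line is replaced, so any extras or markers are dropped.
            result.append(pinned)
            found = True
        else:
            result.append(line)
    if not found:
        result.append(pinned)
    return result
```

For example, `pin_requirement(["numpy==1.19.0", "tensorflow>=2.3"], "tensorflow", "2.4.1")` leaves the `numpy` line alone and rewrites only the `tensorflow` line.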
This works:
filepath = self.log_dir + os.sep + "checkpoint"
self.callbacks.append(
    ModelCheckpoint(
        filepath,
        monitor="val_loss",
        mode="min",
        save_best_only=True,
        save_weights_only=True,
    )
)
And this doesn’t:
filepath = self.log_dir + os.sep + "checkpoint.hdf5"
self.callbacks.append(
    ModelCheckpoint(
        filepath,
...
Yes! I think that’s what I will do 👌 Let me know if there is a way to contribute a mode that keeps logging off. We just don’t want to pollute the server when debugging.
AgitatedDove14 Thanks! I’m trying to figure out how to create a minimal working example! I am also working with Hydra, so that may be a factor. The extension is what’s causing it to fail (I haven’t figured out why).
I need to fetch a dataset for some simple tests, but since it doesn’t have credentials to the self-hosted server it won’t find the dataset.
I configured a firewall rule that opened the ports for the instance (not 100% sure if this is the right way) using network tags. Yes, the whole screen is black and no trains logo shows up: Safari can’t open the page because the server where this page is located isn’t responding.
Thanks AgitatedDove14 !
Yes AgitatedDove14 ! I’ll PM you
Sure! I enqueue the experiment from my local machine: python -m src.train model=my_model loss=my_loss dataset=my_dataset
Then I go to the server and run the experiment and create a copy to run with a new model. On the copy, I go to the script path and modify it to be: -m src.train model=my_other_model loss=my_loss dataset=my_dataset
The new experiment, even though the script path has my_new_model default, starts training using my_model.
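One way to narrow this down is to log which `key=value` overrides the training process actually received. The sketch below is a simplified stand-in, not Hydra's own parser (Hydra's override grammar is richer), and the function name `parse_overrides` is hypothetical.

```python
def parse_overrides(argv):
    """Collect plain `key=value` tokens from a command line into a dict.

    Tokens that start with "-" (flags such as "-m") or that contain no "="
    (such as a module name) are ignored.
    """
    overrides = {}
    for token in argv:
        if token.startswith("-") or "=" not in token:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = token.split("=", 1)
        overrides[key] = value
    return overrides


# Usage sketch: at the top of the training entry point, print what arrived.
# import sys
# print("overrides received:", parse_overrides(sys.argv[1:]))
```

If the printed `model` value is still `my_model` in the cloned experiment, the edited script path is not what the process is being launched with.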
I can also see ...
Hi CostlyOstrich36 ! The message is the following: clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868fa
They are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
Also, should I allow 8080, 8008, and 8081 on ingress and egress on GCP or is only egress enough?
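A quick way to see whether those ports are reachable from the client side is a plain TCP connection test. This is a minimal sketch using only the standard library; the function names are hypothetical and the host in the usage note is a placeholder. It only reports reachability and does not tell you which firewall rule is responsible.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=(8080, 8008, 8081), timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

Running `check_ports("<External IP>")` from the local machine shows which of the three ports accept connections from outside the instance.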
Thanks TimelyPenguin76 , the example works fine! I’ll debug further on my side!
Yes, everything is that way (work dir and args are ok) except the script path. It shows -m module arg1 arg2.
Side note: When running src.train as a module the server gets the command as src and has to be modified to be src.train.
[package_manager.force_repo_requirements_txt=true] Skipping requirements, using repository "requirements.txt"
Using base prefix '/opt/conda'
New python executable in /home/ramon/.clearml/venvs-builds/3.7/bin/python3.7
Also creating executable in /home/ramon/.clearml/venvs-builds/3.7/bin/python
Installing setuptools, pip, wheel...
2021-06-10 09:57:56
done.
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: p...
With pip I get the first error I showed. I tried conda and it starts running but at some point crashes with: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'
I get the URL to the checkpoint/weights. Can I use this to download the weights?
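If the URL is directly reachable without extra credentials (plain HTTP/HTTPS, for instance), a standard-library download may be enough. This is a minimal sketch under that assumption; the function name `download_weights` and the paths are hypothetical. For storage that needs credentials, ClearML's `StorageManager.get_local_copy(remote_url=...)` is probably the better route, though check that against your installed version.

```python
import shutil
import urllib.request
from pathlib import Path


def download_weights(url, destination):
    """Stream the file at `url` to `destination` and return the local path."""
    destination = Path(destination)
    # Create the target directory if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks so large weight files are not loaded into memory at once.
        shutil.copyfileobj(response, out)
    return destination
```

The returned path can then be passed to whatever loads the weights on your side.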
Yes, exactly! Unfortunately I am not so familiar with the internals of the library but I could take a look and figure that out.
Thanks AgitatedDove14 ! Seems to be the subclassed model + the extension
I’ll show you what I have through PM!
AgitatedDove14 task.set_archived(True) + the cleanup service should do it 👌 If we run in debug mode, the experiment goes directly to the archive and gets cleaned up, so we don’t pollute the main experiment page.
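The debug-mode idea could be wrapped in a small helper like the sketch below. It relies only on the `set_archived` call mentioned above; the helper name `archive_if_debug` and the `debug` flag are hypothetical, and the cleanup service is assumed to be running separately.

```python
def archive_if_debug(task, debug):
    """Archive `task` when `debug` is true; return whether it was archived.

    `task` is any object exposing `set_archived(bool)`, such as the object
    returned by ClearML's Task.init (assumed here, not imported).
    """
    if debug:
        task.set_archived(True)
    return bool(debug)


# Usage sketch (names below are placeholders):
# task = Task.init(project_name="my_project", task_name="my_experiment")
# archive_if_debug(task, debug=True)
```

Keeping the decision in one place means the training script does not need separate code paths for debug and normal runs.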
I have the agent configured to force install requirements.txt
Yes Martin! I have a package installed from GitHub but it’s using the PyPI version
AgitatedDove14 Thanks! I’ll give it a try! Makes sense 👌
Awesome AgitatedDove14 Thanks a lot 🙌
SuccessfulKoala55 on both 8080 and 8008 I get: Safari can’t open the page http://<External IP>:80XX because Safari can’t establish a secure connection to the server http://<External IP>:80XX.
On the server through the command line?