Hi CostlyOstrich36! The message is the following: clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868fa
They are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
AgitatedDove14 Thanks! I'll give it a try! Makes sense.
Using detect_with_pip_freeze: true runs into "package version not found" errors for some of the packages I have locally.
Pigar is capturing different versions than the ones I have installed on my local machine (not a problem except for one). I just want to force the version of that package so that I don't have to manually change it from the UI for every experiment.
AgitatedDove14 Well, I have a loss function which is something like:
    class MyLoss(...):
        def forward(...):
            weights = self.compute_weights(...)
            return (weights * (target - preds)).mean()
There seems to be a problem on a certain batch when computing the weights. What would be the best way to log the batch that causes the problem, along with the weights being computed?
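One possible way to capture the offending batch, sketched with the standard library only: check the weights for NaN or infinite values and, when any are found, pickle the batch together with the weights for later inspection. The names `find_nonfinite`, `dump_if_bad`, and the `bad_batches` directory are hypothetical, and the sketch assumes the weights have already been converted to a flat list of floats (for a tensor, something like `weights.flatten().tolist()`).

```python
import math
import pickle
from pathlib import Path


def find_nonfinite(values):
    """Return the indices of NaN or infinite entries in a flat sequence of floats."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def dump_if_bad(weights, batch, step, out_dir="bad_batches"):
    """Pickle the batch and weights when any weight is non-finite.

    Returns the path of the dump file, or None when all weights are finite.
    `out_dir` is a hypothetical location chosen for this sketch.
    """
    bad = find_nonfinite(weights)
    if not bad:
        return None
    path = Path(out_dir) / f"step_{step}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(
            {"step": step, "bad_indices": bad, "weights": list(weights), "batch": batch},
            fh,
        )
    return path
```

Calling `dump_if_bad(...)` right after `compute_weights` inside `forward` would leave one file per problematic step, which could then be uploaded as an artifact or inspected offline.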
It is the latest RC, I get the following:
` Executing Conda: /opt/conda/bin/conda install -p /home/ramon/.clearml/venvs-builds/3.8 -c pytorch -c conda-forge -c defaults 'pip<20.2' --quiet --json
Pass
Trying pip install: /home/ramon/.clearml/venvs-builds/3.8/task_repository/my-rep.git/requirements.txt
Executing Conda: /opt/conda/bin/conda install -p /home/ramon/.clearml/venvs-builds/3.8 -c pytorch -c conda-forge -c defaults numpy==1.20.3 --quiet --json
Pass
Warning, could not locate PyTorch to...
TimelyPenguin76 I found out it's just one package that is causing the error (cloudpickle breaks everything). Is there a way to use Pigar but force a single package to have a specific version?
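To make the request concrete, here is a minimal sketch of what "force a single package" amounts to when applied to a list of auto-detected requirement lines: every entry is kept as detected except the named package, which is replaced by an exact pin. This is plain Python and not a ClearML or Pigar feature; the helper name `pin_requirement` and the version numbers in the usage lines are made up for illustration.

```python
import re


def pin_requirement(requirements, package, version):
    """Return a copy of the requirement lines with `package` forced to `version`.

    Lines for other packages are left untouched. If the package is not
    present at all, the pin is appended at the end.
    """
    # Match the bare package name, optionally followed by a version specifier.
    name_re = re.compile(
        rf"^\s*{re.escape(package)}\s*(?:[=<>!~].*)?$", re.IGNORECASE
    )
    pinned = f"{package}=={version}"
    result, found = [], False
    for line in requirements:
        if name_re.match(line):
            result.append(pinned)
            found = True
        else:
            result.append(line)
    if not found:
        result.append(pinned)
    return result


# Hypothetical detected requirements; the versions are illustrative only.
detected = ["numpy==1.20.3", "cloudpickle==9.9.9"]
fixed = pin_requirement(detected, "cloudpickle", "1.0.0")
```

The sketch does not handle extras such as `package[extra]==1.0` or environment markers; those would need a proper requirements parser.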
Yes! I will take a look at it!
There are also ways to override the parameters, as stated in https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments .
Not yet AgitatedDove14. Does the agent use by default the Python version the command is run with? I installed conda and tried using package_manager.type=conda but then I get an error: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'
I am using pytorch_lightning, I'll try to create a snippet I can share! Thanks.
AgitatedDove14 update here! Something like this should work:
    from trains import StorageManager
    from trains.storage.helper import StorageHelper
    bucket = 'gs://bucket'
    helper = StorageHelper.get(bucket)
    remote_files = helper.list('folder')
    for f in remote_files:
        StorageManager.get_local_copy(bucket + "/" + f)
The * gives [] results, since the list method uses startswith, which treats the * as a literal character in the string and not as a wildcard.
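The difference can be reproduced without any storage backend. The sketch below contrasts a prefix filter based on `str.startswith` with a glob-style filter based on `fnmatch.fnmatchcase` from the standard library; the key names and the two helper functions are hypothetical and only illustrate the behaviour described above, not the actual StorageHelper implementation.

```python
from fnmatch import fnmatchcase


def list_by_prefix(keys, prefix):
    """Keep keys that start with `prefix`; a '*' in the prefix is matched literally."""
    return [k for k in keys if k.startswith(prefix)]


def list_by_pattern(keys, pattern):
    """Keep keys that match a glob-style `pattern`, where '*' acts as a wildcard."""
    return [k for k in keys if fnmatchcase(k, pattern)]


# Hypothetical object keys in a bucket.
keys = ["folder/a.pt", "folder/b.pt", "folder/notes.txt", "other/c.pt"]

prefix_hits = list_by_prefix(keys, "folder/*")        # no key contains a literal '*'
plain_prefix_hits = list_by_prefix(keys, "folder/")   # plain prefix works as expected
pattern_hits = list_by_pattern(keys, "folder/*.pt")   # '*' expands as a wildcard
```

So with a prefix-based listing, passing the folder name without the * and filtering the returned names afterwards is one way to get wildcard-like behaviour.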
I feel it's easier not to report than to clean up afterwards, but please correct me if I am overthinking it. I'll check if I could wrap the code in something that calls Task.delete when debugging.
Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. "Stopped" is not the right word, more like frozen: it just froze at an epoch. The console on the agent shows epoch 33, first batch, and the one on the server shows epoch 32, last batch. The experiment had been running for ~6 hours.
Yes Martin! I have a package installed from GitHub, but it's using the PyPI version.
So I would have to disconnect pytorch? And then upload the model at the end
AgitatedDove14 I filed an issue on the fire repo asking them to point us to the argument parsing method: https://github.com/google/python-fire/issues/291
Yes AgitatedDove14, I added the git user name and password to the trains.conf file. On the results tab of the UI, the clone command in the logs shows the SSH command instead of the HTTPS one:
Repository cloning failed: Command ['clone', 'git@gitlab.com: ...
Thanks SuccessfulKoala55 !
Best thing ever, thanks AgitatedDove14 !
AgitatedDove14 from this thread I understand hydra is not supported, and therefore overriding the parameters from the UI won't work. But is there still a way to track and add the parameters to the experiment? Will task.connect_configuration work with the yaml files?
Managed to get:
clearml_agent: ERROR: Command '['/home/ramon/.clearml/venvs-builds/3.9/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/var/tmp/requirements_tb0x2i3j.txt', '--extra-index-url', '
died with <Signals.SIGKILL: 9>.
while building the task with the id on the agent
It is failing exactly when the download finishes. Not sure if it is relevant, but in ~/.clearml/pip-download-cache only an empty cu120 folder appears. Should the torch wheel be saved there?