In another task I tried to evaluate this metric but received a similar error:
clearml.automation.optimization - WARNING - Could not find requested metric ('evaluate', 'val_loss') report on base task
My questions are:
- how can I avoid creating tens of new cache files?
- do you happen to know why this lock is created, and how it is connected to the error above (in the link, regarding "failing to clone...")?
I attached the TensorBoard log.
The scalar reported to the tf log is: 0.2631
The scalar reported to the scalars board in the UI is: 0.121
There is a major difference between the two.
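For reference, this is roughly how I pulled the raw values out of the TensorBoard event file to compare with the UI (just a sketch - the log directory and the tag name are guesses on my part, not the exact setup):

```python
# Sketch: read scalars straight from the TensorBoard event file to compare
# against what the ClearML scalars board shows.
# The './logs' directory and the 'val_loss' tag are assumptions, not my exact setup.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator('./logs')  # directory containing the events.out.tfevents.* file
acc.Reload()

for event in acc.Scalars('val_loss'):
    print(event.step, event.value)
```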
I will elaborate on the situation:
I have 2 agents - training and training_2. They are both listening to the services queue, but only 'training' pulls the tasks. At the beginning I had 2 tasks in the services queue. Then agent 'training' pulled one and is currently executing it, but for some reason it also pulled the 2nd task into its own queue, even though this agent is not free and I have another agent which is: 'training_2'.
I indeed have a different scalar there: val_loss. But I reported this metric in the checkpoint, not in the logger.
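For completeness, I guess reporting it explicitly through the logger would look roughly like this (just a sketch - the 'evaluate'/'val_loss' title and series are only what the warning above seems to expect):

```python
# Sketch: report the validation loss through the ClearML logger so the optimizer
# can find it as ('evaluate', 'val_loss') on the base task.
# Title/series names are assumptions based on the warning above.
from clearml import Task

task = Task.current_task()  # the task created by Task.init() in the training script
logger = task.get_logger()

val_loss, epoch = 0.2631, 0  # placeholders; in practice these come from the validation loop
logger.report_scalar(title='evaluate', series='val_loss', value=val_loss, iteration=epoch)
```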
I have some info that I'd rather not post here (for security reasons). Is there a way to share the link only with your user? 🙂
Unfortunately, I am not running on a community server
And afterwards, I get the following output, which continues for 300 iterations without any further metric reports.
I really like the first idea, but I already see a problem: if I make changes to the file, will I need to re-upload it every time?
I am currently using the repo cache, but unfortunately every time I run a new task with an existing cached repo, a new cache file is created - very weird behaviour, which I have already described in previous threads ( https://clearml.slack.com/archives/CTK20V944/p1651393535530439?thread_ts=1651063782.712379&cid=CTK20V944 )
I don't know if it has anything to do with it, but I now see that the repo which is cloned and saved in the cache is actually on a different branch than the one pulled by the agent.
From the log file:
repository = https://gitlab.com/data_science_team/PROJECT_NAME
branch = MYBRANCH
SOMENUMBER IP### DEBUG
warning: redirecting to https://gitlab.com/data_science_team/PROJECT_NAME.git/
Thank you for the clarification, everything is clear now 🙂
I would like to delete a tag name from the list of tags, not from a specific task, i.e. from here:
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
Another question: is there a way to group Dataset tasks together (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
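For context, this is roughly how I understand the parent relationship gets defined in the first place, at dataset creation time (the project and dataset names below are placeholders, not my actual ones):

```python
# Sketch: parent_datasets at creation time is what builds the dependency graph.
# Project/dataset names here are placeholders.
from clearml import Dataset

parent = Dataset.get(dataset_project='PROJECT_NAME', dataset_name='raw-data')

child = Dataset.create(
    dataset_project='PROJECT_NAME',
    dataset_name='raw-data-cleaned',
    parent_datasets=[parent.id],
)
child.add_files('./cleaned')
child.upload()
child.finalize()
```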
yes of course I specified it with: torch.save(state, f_path)
where f_path = os.path.join(tempfile.gettempdir(), 'trained_models') + '\my_model.pt'
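(On a second look, I would probably build the path like this instead of concatenating the backslash by hand - same placeholder names as above:)

```python
# Sketch: build the checkpoint path in a platform-independent way.
# File/dir names are the same placeholders as above.
import os
import tempfile
import torch

out_dir = os.path.join(tempfile.gettempdir(), 'trained_models')
os.makedirs(out_dir, exist_ok=True)
f_path = os.path.join(out_dir, 'my_model.pt')

state = {'epoch': 0}  # placeholder; in practice the checkpoint dict from training
torch.save(state, f_path)
```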
EDIT
I have disabled the VCS cache, and it seems the multiple cache files are still created when running a new task. The lock is also still created once a new experiment is run: first image - the cache after removing the lock; second image - a few seconds later, after running a new task. Also attached the log output of the uploaded task (with ### replacing non-relevant details).
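For reference, this is the setting I toggled, in the agent section of clearml.conf (copied from memory, so worth double-checking the exact key name):

```
# ~/clearml.conf - agent section (key name from memory, please double-check)
agent {
    vcs_cache {
        enabled: false
    }
}
```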
this is the correct file
Hi, joining in on this thread. I just got the same error with my task, which is not an optimization one.
created virtual environment CPython3.7.5.final.0-64 in 232ms
creator CPython3Posix(dest=/home/ubuntu/.clearml/venvs-builds/3.7, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ubuntu/.local/share/virtualenv)
added seed packages: pip==22.0.4, setuptool...
AgitatedDove14 and what about 'epoch_loss' without validation? That's the scalar I'm interested in understanding where it comes from. I thought that was just the loss reported at the end of the training epoch via tf.
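To illustrate what I mean by 'reported via tf', I imagine something along these lines runs at the end of each training epoch (the log dir, tag name and values below are placeholders, not the actual training code):

```python
# Sketch: write an end-of-epoch loss to the tf log; ClearML then picks it up
# from the TensorBoard events automatically. Log dir / tag / values are placeholders.
import tensorflow as tf

writer = tf.summary.create_file_writer('./logs')

epoch, epoch_loss = 0, 0.2631  # placeholders for the real training-loop values
with writer.as_default():
    tf.summary.scalar('epoch_loss', epoch_loss, step=epoch)
writer.flush()
```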
The problem is with the path I am trying to save the model to. Hence my question: how do I extract paths inside ClearML?
This is not something that we defined or created, if I understand your question correctly. It is created once a ClearML task is run, and stays there until the lock is deleted (which is something we do to handle another error I posted about here).
As it seems, torch.save() just saves to disk, so there is no need for a (server) parent path - just the file name, in my case 'my_model.pt'.
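And if I understand correctly, as long as the task is initialized with an output_uri, that torch.save() call gets picked up and uploaded automatically - roughly like this (project/task names and the flag value are placeholders on my side):

```python
# Sketch: with output_uri set, the checkpoint written by torch.save() should be
# registered and uploaded as an output model of the task.
# Project/task names are placeholders.
from clearml import Task
import torch

task = Task.init(project_name='PROJECT_NAME', task_name='train', output_uri=True)

state = {'epoch': 0}  # placeholder for the real checkpoint dict
torch.save(state, 'my_model.pt')
```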
Thanks for your help CostlyOstrich36
In the child dataset task I see the following :
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()
"94f4....": ["0385db..."]}
The child task is 94f4.. and the parent task is "0385db...", but what does the () line mean?