I attached the TensorBoard log.
The scalar reported in the TF log is 0.2631.
The scalar reported in the Scalars tab of the UI is 0.121.
There is a major difference between the two.
AgitatedDove14 And what about 'epoch_loss' without validation? That's the scalar whose origin I'm trying to understand. I thought it was just the loss reported at the end of the training epoch via tf.
As it seems, torch.save() just saves to disk, so there is no need for a (server) parent path, just the file name, in my case 'my_model.pt'.
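To illustrate, here is a minimal sketch of that behaviour (assuming a local torch install; the model and file name are made up for the example):

```python
import torch

# torch.save() writes to the local filesystem, so a bare (relative)
# file name is enough -- no server URL or parent path is needed.
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "my_model.pt")  # lands in the CWD

# Loading it back by the same bare name works from the same directory.
state = torch.load("my_model.pt")
print(sorted(state.keys()))  # ['bias', 'weight']
```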
Thanks for your help CostlyOstrich36
Interesting, I am only now seeing **optimizer_kwargs; it seems it will fix my problem. Is it too much to ask for an example of how to initiate the Optuna object with the kwargs (mainly how to initiate the 'trial', 'study', and 'objective' arguments)? 🙂
Thank you for the clarification, everything is clear now 🙂
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
Sending you to private CostlyOstrich36
CostlyOstrich36 yes - sorry for the wrong terminology
I had a task which I had cloned and reset a bunch of times; when I created the task as a new one, the error didn't appear again.
Hi, joining in on this thread. I have the same error with my task, which is not an optimization one:
```
created virtual environment CPython3.7.5.final.0-64 in 232ms
  creator CPython3Posix(dest=/home/ubuntu/.clearml/venvs-builds/3.7, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ubuntu/.local/share/virtualenv)
  added seed packages: pip==22.0.4, setuptool...
```
I don't know if it has anything to do with it, but I now see that the repo which is cloned and saved in the cache is actually on a different branch than the one pulled by the agent.
From the log file:
```
repository = https://gitlab.com/data_science_team/PROJECT_NAME
branch = MYBRANCH
SOMENUMBER IP### DEBUG
warning: redirecting to https://gitlab.com/data_science_team/PROJECT_NAME.git/
```
And afterwards, I get the following output, which continues for 300 iterations without any further metric reports:
I would like to delete a tag name from the list of tags, not from a task, i.e. from here:
Thanks! The second link is exactly what I was looking for 🙂
In the child Dataset task, I see the following under ARTIFACTS -> STATE (Dataset state):
```
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
  "0385db....": [],
  ()"94f4....": ["0385db..."]
}
```
The child task is 94f4... and the parent task is 0385db..., but what does the () on that line mean?
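For context, this is how I read such a dependency graph (a minimal stdlib sketch; the ids are shortened placeholders, not the real ones):

```python
# The graph maps dataset-version id -> list of its direct parent ids.
# An empty list means the version has no parent.
graph = {
    "0385db": [],          # parent dataset: no ancestors
    "94f4": ["0385db"],    # child dataset: built on top of the parent
}

def parents_of(dataset_id, graph):
    """Return the direct parents recorded for a dataset version."""
    return graph.get(dataset_id, [])

print(parents_of("94f4", graph))    # ['0385db']
print(parents_of("0385db", graph))  # []
```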
Another question: is there a way to group Dataset tasks together (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
Yeah, this is a lock which is always in our cache; I can't figure out why it's there, but when I delete the lock and the other files, they reappear as soon as I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, once I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>.
What will happen if I disable the cache? Is there a way to find out which experiment is hung, and why, in order to avoid this?
EDIT: I have disabled the VCS cache, and the multiple cache files are still created when running a new task. The lock is also still created once a new experiment runs: the first image shows the cache after removing the lock, and the second shows it a few seconds later, after running a new task. I have also attached the log output of the uploaded task (with ### replacing non-relevant details).
Hi, the URL contains some details which I wouldn't like to share on this thread. Can I send it to one of you in private message?
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots one wasn't.
Clearing my cookies solved the issue, Thanks 🙂
My questions are:
- How can I avoid creating tens of new cache files?
- Do you happen to know why this lock is created, and how it is connected to the above error (in the link, regarding "failing to clone...")?