
Does this relate to the error below? From reading the issue I didn't see anyone mentioning this error:
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='https://gitlab.com/data_science_team/____', branch='ilanit', commit_id='b5c___', tag='', docker_cmd='ubuntu:18.04', entry_point='training/____.py', working_dir='src')
2) Check if remote-worker has valid credentials [see worker configuration file]
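For point 2, the relevant part is the agent section of the worker's clearml.conf. A minimal sketch, assuming the credentials are a user/token pair (all values below are placeholders, not taken from my setup):
```
agent {
    # credentials the agent uses when cloning private repositories
    git_user: "my-gitlab-user"      # placeholder
    git_pass: "my-gitlab-token"     # placeholder (personal access token)
}
```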
I have some info that I wouldn't like to post here (due to security reasons). Is there a way to share the link only with your user? 🙂
Clearing my cookies solved the issue, thanks 🙂
Hi, the URL contains some details which I wouldn't like to share on this thread. Can I send it to one of you in a private message?
CostlyOstrich36 Another clarification:
The master branch cache is stored at ".clearml/vcs-cache" - the code file doesn't exist there, and the problem described above (multiple cache files of the same repo) is occurring in this folder.
My branch, meanwhile, is stored at ".clearml/venvs-builds/3.7/task_repository/".
Hi, joining in on this thread. I have the same error with my task, which is not an optimization one:
created virtual environment CPython3.7.5.final.0-64 in 232ms
creator CPython3Posix(dest=/home/ubuntu/.clearml/venvs-builds/3.7, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ubuntu/.local/share/virtualenv)
added seed packages: pip==22.0.4, setuptool...
Another question: Is there a way to group together Dataset tasks (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after the task creation and finalization?
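For context, a minimal sketch of how the parent/child relation is normally fixed at creation time via parent_datasets (the names and the parent id below are placeholders); whether this can be rewritten after finalize() is exactly the open question:
```python
from clearml import Dataset

# Hedged sketch: the dependency graph is derived from parent_datasets,
# which is set when the child dataset is created.
child = Dataset.create(
    dataset_name="my_dataset_v2",            # placeholder name
    dataset_project="my_project",            # placeholder project
    parent_datasets=["<parent_dataset_id>"], # placeholder parent id
)
child.add_files("data/")                     # placeholder local folder
child.upload()
child.finalize()
```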
CostlyOstrich36 yes - sorry for the wrong terminology
Interesting, I am only now seeing **optimizer_kwargs
It seems that it will fix my problem. Is it too much to ask for an example of how to initialize the Optuna object with the kwargs (mainly how to initialize the 'trial', 'study' and 'objective' arguments)? 🙂
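For reference, a minimal sketch of what I have in mind (not verified against the current ClearML version). As far as I understand, the study and trial objects are created internally by OptimizerOptuna, and the extra keyword arguments given to HyperParameterOptimizer are forwarded to it via **optimizer_kwargs; whether Optuna's sampler/pruner objects can be passed through this way is an assumption on my part:
```python
import optuna
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",  # placeholder
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title="validation",
    objective_metric_series="epoch_loss",
    objective_metric_sign="min",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",
    # assumed to be forwarded via **optimizer_kwargs to the Optuna study setup
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(),
)
optimizer.start()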
I had a task which I had cloned and reset a bunch of times; when I created it as a new task, the error didn't appear again.
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()"94f4....": ["0385db..."]
}
The child task is 94f4... and the parent task is "0385db...", but what does the () line mean?
EDIT CostlyOstrich36
Third image - cache after running another task, with a new cache file created even though the cache is disabled.
As it seems, torch.save() just saves to disk, so there is no need for a (server) parent path, just the name of the file - in my case: 'my_model.pt'.
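A minimal sketch of what I mean (hedged; the project, task name and output_uri are placeholders): torch.save() writes a plain local file, and ClearML's automatic framework binding should pick it up as an output model when output_uri is set on Task.init:
```python
import torch
import torch.nn as nn
from clearml import Task

task = Task.init(
    project_name="examples",         # placeholder
    task_name="save-model-example",  # placeholder
    output_uri=True,                 # upload captured models to the default files server
)

model = nn.Linear(4, 1)
# no parent path needed - just a local file name relative to the working dir
torch.save(model.state_dict(), "my_model.pt")
```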
Thanks for your help CostlyOstrich36
Thank you for the clarification, everything is clear now 🙂
Yeah, this is a lock which is always in our cache - I can't figure out why it's there, but when I delete the lock and the other files, they always reappear when I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, whenever I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>
This is not something that we defined or created, if I understand your question correctly. It is created once a ClearML task is run, and it stays there until the lock is deleted (which is something we do to handle another error I posted about here).
What will happen if I disable the cache? Is there a way to find out which experiment is hung, and why, in order to avoid this?
My questions are:
- How can I avoid creating tens of new cache files? (What I mean by disabling the cache is shown in the config sketch after this list.)
- Do you happen to know why this lock is created, and how it is connected to the above error (in the link - regarding "failing to clone...")?
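For the first question, what I mean by disabling the cache is the vcs_cache block in the agent's clearml.conf (a sketch, assuming the default key names and location):
```
agent {
    vcs_cache: {
        enabled: false              # disable the repository cache
        path: ~/.clearml/vcs-cache  # default location of the cached repos
    }
}
```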
EDIT
I have disabled the VCS cache and it seems that the multiple cache files are still created when running a new task. The lock is also still created once a new experiment is run: first image - cache after removing the lock; second image - a few seconds later, after running a new task. Also attached is the log output of the uploaded task (with ### replacing non-relevant details).
AgitatedDove14 And what about 'epoch_loss' without validation? That's the scalar I'm interested in - I want to understand where it comes from. I thought it was just the loss reported at the end of the training epoch via TF.
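For what it's worth, a minimal sketch of my understanding (assuming the standard Keras TensorBoard callback): the training loss is written once per epoch as the scalar 'epoch_loss' under the train writer, and a second 'epoch_loss' appears under validation only when validation data is provided - ClearML then picks both up automatically:
```python
import numpy as np
import tensorflow as tf

# Tiny model just to show where the scalars come from
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

model.fit(
    x, y,
    epochs=3,
    validation_split=0.25,  # adds validation/epoch_loss
    callbacks=[tf.keras.callbacks.TensorBoard(log_dir="./logs")],  # writes train/epoch_loss
)
```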