Does this relate to the error below? From reading the issue I didn't see anyone mentioning this error:
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='https://gitlab.com/data_science_team/____', branch='ilanit', commit_id='b5c___', tag='', docker_cmd='ubuntu:18.04', entry_point='training/____.py', working_dir='src')
2) Check if remote-worker has valid credentials [see worker configuration file]
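Regarding cause 2), the agent's git credentials normally live in the worker's clearml.conf; a minimal sketch of the relevant section, with placeholder values as assumptions:

```
# ~/clearml.conf on the worker machine (values are placeholders)
agent {
    git_user: "my-gitlab-username"
    git_pass: "my-gitlab-access-token"
}
```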
Another question: is there a way to group together Dataset tasks (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
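For context, a minimal sketch of how Dataset parents are normally declared at creation time with the clearml Dataset API; the project and dataset names are hypothetical placeholders, and whether the parent can be changed after finalization is exactly what is being asked:

```python
from clearml import Dataset

# Parents are declared when the child dataset is created; the names below
# are hypothetical placeholders.
parent = Dataset.get(dataset_project="data_science_team", dataset_name="base_dataset")
child = Dataset.create(
    dataset_project="data_science_team",
    dataset_name="derived_dataset",
    parent_datasets=[parent.id],  # this defines the dependency graph shown in the dashboard
)
child.add_files(path="new_files/")
child.upload()
child.finalize()  # once finalized, the underlying task is closed
```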
CostlyOstrich36 Another clarification:
The master branch cache is stored at ".clearml/vcs-cache" - the code file doesn't exist there, and the problem described above is occurring in this folder (multiple cache files of the same repo),
while my branch is stored at ".clearml/venvs-builds/3.7/task_repository/".
CumbersomeCormorant74 As you can see in the attachment, there were two experiments at the same time, but only one agent pulled a task, even though the second agent was free and listening to the queue.
Hi, joining in on this thread. I just have the same error with my task, which is not an optimization one:
created virtual environment CPython3.7.5.final.0-64 in 232ms
creator CPython3Posix(dest=/home/ubuntu/.clearml/venvs-builds/3.7, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ubuntu/.local/share/virtualenv)
added seed packages: pip==22.0.4, setuptool...
Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
In which queue? In 'services' there are no pending tasks because they were all pulled by 'training'.
Just to make sure, how do you start the agents? Are you using the --services-mode option?
I used:
clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!
Thanks! The second link is exactly what I was looking for 🙂
Yes, of course I specified it with:
torch.save(state, f_path)
where f_path = os.path.join(tempfile.gettempdir(), 'trained_models') + '\my_model.pt'
The problem is with the path I am trying to save the model to. Hence my question: how do I extract paths inside ClearML?
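For what it's worth, a minimal sketch of how that path could be built portably before calling torch.save; the project/task names and the dummy state dict are assumptions for illustration, and once Task.init has run, ClearML's PyTorch integration should pick up the saved checkpoint automatically:

```python
import os
import tempfile

import torch
from clearml import Task

task = Task.init(project_name="examples", task_name="save-checkpoint")  # hypothetical names

# Build the directory with os.path.join instead of concatenating separators,
# and create it first so torch.save does not fail on a missing folder.
save_dir = os.path.join(tempfile.gettempdir(), "trained_models")
os.makedirs(save_dir, exist_ok=True)

state = {"epoch": 1}  # dummy checkpoint contents for illustration
f_path = os.path.join(save_dir, "my_model.pt")
torch.save(state, f_path)  # saved via a valid local path; auto-logged by ClearML
```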
Just to clarify again - when I start the agents I run:
clearml-agent daemon --detached --queue training
and then:
clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
This is why there are 'training' and 'training_2' queues.
I don't know if it has anything to do with it, but I now see that the repo which is cloned and saved in the cache is actually a different branch than the one pulled by the agent.
From the log file:
repository = https://gitlab.com/data_science_team/PROJECT_NAME
branch = MYBRANCH
SOMENUMBER IP### DEBUG
warning: redirecting to https://gitlab.com/data_science_team/PROJECT_NAME.git/
Clearing my cookies solved the issue, Thanks 🙂
I indeed have a different scalar there: val_loss
but I have reported this metric in the checkpoint, not in the logger...
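For reference, a minimal sketch of reporting such a metric explicitly through the ClearML logger so it appears in the Scalars tab; the title/series names and values here are illustrative assumptions:

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="report-metric")  # hypothetical names
logger = task.get_logger()

# A value stored only inside a checkpoint file is not picked up as a scalar;
# reporting it through the logger is what makes it show up under Scalars.
for epoch, val_loss in enumerate([0.9, 0.7, 0.5]):  # dummy values
    logger.report_scalar(title="validation", series="val_loss", value=val_loss, iteration=epoch)
```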
Unfortunately, I am not running on a community server
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots one wasn't.
I have some info that I wouldn't like to post here (for security reasons); is there a way to share the link only with your user? 🙂
Where should I look to see this metric? In the Scalars tab?
Interesting, I am only now seeing **optimizer_kwargs -
it seems that it will fix my problem. Is it too much to ask if you could add an example of how to initiate the Optuna object with the kwargs (mainly how to initiate the 'trial', 'study', and 'objective' arguments)? 🙂
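In the meantime, a minimal sketch of how 'study', 'objective', and 'trial' relate in plain Optuna; how these map onto the kwargs forwarded by ClearML's Optuna optimizer is an assumption I have not verified:

```python
import optuna

# The objective receives a 'trial' object, asks it to suggest hyperparameter
# values, and returns the metric being optimized.
def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    return (lr - 0.01) ** 2  # dummy loss standing in for a real validation metric

# The 'study' drives the search and records every finished trial.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```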
Thank you for the clarification, everything is clear now 🙂
My questions are:
- How can I avoid creating tens of new cache files?
- Do you happen to know why this lock is created and how it is connected to the error above (in the link, regarding "failing to clone..")?