
I have some info that I wouldn't like to post here (for security reasons). Is there a way to share the link only with your user? 🙂
this is the correct file
Hi, joining this thread. I just have the same error with my task, which is not an optimization one:
created virtual environment CPython3.7.5.final.0-64 in 232ms
creator CPython3Posix(dest=/home/ubuntu/.clearml/venvs-builds/3.7, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ubuntu/.local/share/virtualenv)
added seed packages: pip==22.0.4, setuptool...
I had a task which I had cloned and reset a bunch of times; when I created the task as a new one, the error didn't appear again.
Where should I look to see this metric? In the Scalars tab?
Interesting, I am only now seeing **optimizer_kwargs
It seems that it will fix my problem. Would it be too much to ask for an example of how to initialize the Optuna object with these kwargs (mainly how to initialize the 'trial', 'study', and 'objective' arguments)? 🙂
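In the meantime, here is a minimal sketch of how 'study', 'objective', and 'trial' typically fit together in plain Optuna (a toy quadratic stands in for a real training loop and the search space is made up; how ClearML's OptimizerOptuna forwards its kwargs may differ):
```
import optuna

# The objective receives a `trial` object and uses it to sample hyperparameters.
# A toy quadratic stands in for a real training loop here.
def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 1, 8)
    return (lr - 0.01) ** 2 + (depth - 4) ** 2  # value the study tries to minimize

# The `study` owns the optimization loop and repeatedly calls the objective.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```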
CostlyOstrich36 Another clarification:
The master branch cache is stored at ".clearml/vcs-cache" - the code file doesn't exist there, and the problem described above is occurring in this folder (multiple cache files of the same repo).
My branch, on the other hand, is stored at ".clearml/venvs-builds/3.7/task_repository/"
AgitatedDove14 And what about 'epoch_loss' without validation? That's the scalar whose origin I'm trying to understand... I thought it was just the loss reported at the end of the training epoch via tf.
Thanks! The second link is exactly what I was looking for 🙂
I attached the TensorBoard log.
The scalar reported to the tf log is: 0.2631
The scalar reported to the Scalars board in the UI is: 0.121
There is a major difference between the two.
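For reference, a minimal sketch (with a made-up log directory and epoch losses) of how a scalar like 'epoch_loss' is typically written to TensorBoard via tf.summary; comparing the value written this way with what shows up in the Scalars tab can help localize where the discrepancy creeps in:
```
import tensorflow as tf

# Hypothetical log directory; ClearML's automatic logging normally picks up
# scalars written through tf.summary / TensorBoard.
writer = tf.summary.create_file_writer("logs/example")

with writer.as_default():
    for epoch, loss in enumerate([0.5, 0.37, 0.2631]):  # made-up epoch losses
        tf.summary.scalar("epoch_loss", loss, step=epoch)
writer.flush()
```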
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots one wasn't
Clearing my cookies solved the issue, Thanks 🙂
Yeah, this is a lock which is always in our cache; I can't figure out why it's there, but when I delete the lock and the other files, they always reappear when I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, each time I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>
EDIT
I have disabled the VCS cache, and it seems that the multiple cache files are still created when running a new task. The lock is also still created once a new experiment is run: first image - cache after removing the lock; second image - a few seconds later, after running a new task. Also attached is the log output of the uploaded task (with ### replacing non-relevant details).
Sending it to you in private, CostlyOstrich36
Just to clarify again - when I start the agents I run: clearml-agent daemon --detached --queue training
and then: clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
This is why there are 'training' and 'training_2' queues.
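For completeness, a minimal sketch (project and task names are made up) of how a task would end up in the 'training' queue that the first agent is listening on:
```
from clearml import Task

# Clone a hypothetical template task and send the clone to the 'training' queue.
template = Task.get_task(project_name="examples", task_name="train-template")
cloned = Task.clone(source_task=template, name="train-run")
Task.enqueue(cloned, queue_name="training")
```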
CostlyOstrich36 yes - sorry for the wrong terminology
CumbersomeCormorant74 As you can see in the attached - there were 2 experiments at the same time, but only one agent pulled the task, even though the second agent was free and listening to the queue.
EDIT CostlyOstrich36
Third image - cache after running another task, with a new cache file created even though the cache is disabled
Hi, the URL contains some details which I wouldn't like to share on this thread. Can I send it to one of you in a private message?
Does this relate to the error below? From reading the issue I didn't see anyone mentioning this error:
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository=' https://gitlab.com/data_science_team/____ ', branch='ilanit', commit_id='b5c___', tag='', docker_cmd='ubuntu:18.04', entry_point='training/____.py', working_dir='src')
2) Check if remote-worker has valid credentials [see worker configuration file]
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
I really like the first idea, but I already see a problem: if I make changes to the file, won't I need to re-upload it every time?
I am currently using the repo cache, but unfortunately every time I run a new task with an existing cached repo, a new cache file is created.. very weird behaviour which I have already described in previous threads ( https://clearml.slack.com/archives/CTK20V944/p1651393535530439?thread_ts=1651063782.712379&cid=CTK20V944 )
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()"94f4....": ["0385db..."]
}
The child task is 94f4..
and the parent task is "0385db..."
but what does the () line mean?
Another question: is there a way to group Dataset tasks together (i.e., redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
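For context, a minimal sketch (dataset names, project, and paths are made up) of how the parent/child relationship shown in the dependency graph above is normally declared at creation time via parent_datasets; whether it can be changed after finalization is exactly what I'm asking about:
```
from clearml import Dataset

# Create and finalize a parent dataset (names/paths are hypothetical).
parent = Dataset.create(dataset_name="base-data", dataset_project="examples")
parent.add_files("./data/v1")
parent.upload()
parent.finalize()

# Create a child dataset that declares the parent; this is what produces the
# {"<child_id>": ["<parent_id>"]} entry in the dependency graph.
child = Dataset.create(
    dataset_name="base-data-extended",
    dataset_project="examples",
    parent_datasets=[parent.id],
)
child.add_files("./data/v2")
child.upload()
child.finalize()
```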