DeterminedToad86 were you running a Jupyter notebook or a Jupyter console?
This will set more time before the timeout right?
Correct.
task.freeze_monitor()
download()
task.defrost_monitor()
Currently there isn't, but that's a good idea.
What would be the argument of using it vs increasing the timeout ?
btw: setting the resource timeout to 99999 will basically mean that it will wait until the first reported iteration, not that it will just sleep for 99999 sec
The main reason to add the timeout is because the warning was annoying to users
The secondary was that clearml will start reporting based on seconds from start, then when iterations start it will revert back to iterations. But if the iterations are "epochs" the numbers are lower so you end up with a graph that does not match the expected "iterations" x-axis. Make sense ?
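To make the timeout behaviour described above concrete, here is a minimal sketch of the decision as explained in this thread. It is an illustration only, not ClearML's actual implementation: the function `pick_x_axis` and its parameters are hypothetical names.

```python
from typing import Optional


def pick_x_axis(seconds_from_start: float,
                iteration_reported: bool,
                timeout_sec: float) -> Optional[str]:
    """Return the x-axis a resource-monitor sample would use, or None while waiting.

    Hypothetical helper that mirrors the explanation in the thread; it is not
    part of the clearml package.
    """
    if iteration_reported:
        # Once the training loop has reported an iteration, plot against iterations.
        return "iterations"
    if seconds_from_start >= timeout_sec:
        # No iteration arrived within the timeout: fall back to seconds from start.
        return "seconds"
    # Still inside the timeout window: keep waiting for the first iteration.
    return None
```

With a very large timeout (e.g. 99999) the third branch is taken until the first iteration shows up, which is why a large value means "wait for the first reported iteration" and not "sleep for that long".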
Hi CooperativeFox72
Sure:
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
Hi SubstantialElk6 I believe you just need to use clearml 1.0.5, and make sure you are passing the correct OS environment to the agent
Are you inheriting from their docker file ?
OHH nice, I thought that it just some kind of job queue on up and running machines
It's much more than that, it's a way of life
But seriously now, it allows you to use any machine as part of your cluster, and send jobs for execution from the web UI (any machine, even just a standalone GPU machine under your desk, or any cloud GPU instance, and mixing the two together :)
Maybe I need to change something here: apiserver.conf
Not sure, I'm still waiting on answer... It...
Hi CooperativeFox72 ,
From the backend guys, long story short, upgrade your machine => more cpu cores, more processes, it is that easy
The cool thing of using the trains-agent, you can change any experiment parameters and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines
https://github.com/allegroai/trains/tree/master/examples/optimization/hyper-parameter-optimization
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
CooperativeFox72 of course, anything trains related, this is the place
Fire away
Thanks CharmingShrimp37 !
Could you PR the fix ?
It will be just in time for the 0.16 release
in clearml.conf we could have:
azure.storage {
    max_connections = 10
    # containers: [
    #     {
    #         account_name: "clearml"
    #         account_key: "secret"
    #         # container_name:
    #     }
    # ]
}
Then in AzureContainerConfigurations:
    @classmethod
    def from_config(cls, configuration):
        ...

class AzureContainerConfigurations(object):
    def __init__(self, container_configs=None, max_connections=None):
        ...
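As a rough illustration of the proposal above, the sketch below shows how a `max_connections` value could flow from the config section into the configurations object. It is a simplified stand-in: the class name `AzureContainerConfigurationsSketch` is hypothetical, and `configuration` is assumed to be a plain dict mirroring the `azure.storage` section, not the real clearml config object.

```python
class AzureContainerConfigurationsSketch(object):
    """Hypothetical stand-in for illustration; not the real clearml class."""

    def __init__(self, container_configs=None, max_connections=None):
        self._container_configs = container_configs or []
        # None means "not configured", so the driver can keep its own default.
        self.max_connections = max_connections

    @classmethod
    def from_config(cls, configuration):
        # Assumption: `configuration` is a dict shaped like the azure.storage section.
        if not configuration:
            return cls()
        return cls(
            container_configs=list(configuration.get("containers", [])),
            max_connections=configuration.get("max_connections"),
        )


cfg = AzureContainerConfigurationsSketch.from_config({"max_connections": 10})
print(cfg.max_connections)
```

Keeping `None` as the unset value lets the upload and download paths fall back to whatever default they already use when the option is absent.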
Hi ShakyJellyfish91
It seems clearml is using a single connection, which takes a long time to download
Hmm, I found this one:
https://github.com/allegroai/clearml/blob/1cb5dbb276026644ae20fef63d58256cdc887818/clearml/storage/helper.py#L1763
Does max_connections=10 mean 10 concurrent connections ?
as i also noticed that uploads are sometimes slow, and i see here max_connections=2
Makes sense to me, please go ahead and add that as well (basically the same thing on _AzureBlobServiceStorageDriver.upload_object and an additional variable on the AzureContainerConfigurations class).
Could you PR a tested draft ? we will be able to take it from there
okay let's PR this fix ?
Actually looking at the code, when you call Task.create(...) it will always store the diff from the remote server.
Could that be the issue?
To edit the Task's diff:
task.update_task(dict(script=dict(diff='DIFF TEXT HERE')))
it seems it's following the path of the script i'm using to task.create, eg:
The folder it should run in is the script path you are passing (i.e. "script=ep_fn,")
Wrong path would imply that it is not finding the correct repository, is that the case ?
I want to schedule bulk tasks to run via agents, so I'm running create
I see, that makes sense.
especially when dealing with submodules,
BTW: submodule diff should always get stored, can you provide some error logs on fail cases?
Before manually modifying the diff:
If you have local commits (i.e. un-pushed) this might fail the diff apply, in that case you can set the following in your clearml.conf:
store_code_diff_from_remote: true
https://github.com/allegroai/clear...
Thank you!
one thing i noticed is that it's not able to find the branch name on >=1.0.6x , while on 1.0.5 it can
That might be it! let me check the code again...
ShakyJellyfish91 what exactly are you passing to Task.create?
Could it be you are only passing script= and leaving repo=None ?
Can you see the repo itself ? the commit id ?
additionally, I found that the clearml==1.0.5 package is able to find these partial changes, while newer versions find nothing at all, maybe it's because it's always comparing against remote
Hmm it was always from remote...
it is actually doing the following:
git rev-parse --abbrev-ref --symbolic-full-name @{u}
Then with the branch name output,
git diff --submodule=diff <add_branch_name_here>
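For anyone who wants to reproduce those two git steps by hand, here is a minimal sketch that chains them with `subprocess`. The helper names (`upstream_branch_cmd`, `submodule_diff_cmd`, `diff_against_upstream`) are hypothetical and only mirror the commands quoted above; this is not the code clearml itself runs.

```python
import subprocess
from typing import List


def upstream_branch_cmd() -> List[str]:
    # Resolves the upstream tracking branch of the current branch, e.g. "origin/master".
    return ["git", "rev-parse", "--abbrev-ref", "--symbolic-full-name", "@{u}"]


def submodule_diff_cmd(branch: str) -> List[str]:
    # Diffs the working tree against that branch, expanding submodule changes inline.
    return ["git", "diff", "--submodule=diff", branch]


def diff_against_upstream(repo_dir: str) -> str:
    """Run both steps inside repo_dir and return the diff text.

    Raises subprocess.CalledProcessError if git fails, for example when the
    current branch has no upstream configured.
    """
    branch = subprocess.check_output(
        upstream_branch_cmd(), cwd=repo_dir, text=True
    ).strip()
    return subprocess.check_output(
        submodule_diff_cmd(branch), cwd=repo_dir, text=True
    )
```

Running `diff_against_upstream(".")` in the affected repository and comparing the output with the diff stored on the Task may help show whether the first step (finding the branch name) is the part that fails.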
Thanks ShakyJellyfish91 ! please let me know what you come up with, I would love for us to fix this issue.
Thanks ShakyJellyfish91 this really helps to narrow it down!
Let me see what I can find
Change to add_missing_installed_packages=False, here, and see if you end up with git diff
https://github.com/allegroai/clearml/blob/1f82b0c4010799be6157f5c845c7f6ac48e71c0c/clearml/backend_interface/task/populate.py#L158
ShakyJellyfish91 can you check if version 1.0.6rc2 can find the changes ?
No worries, and I will make sure we output a warning if section names are not used
Hi SucculentBeetle7
The parameters passed to add_step need to contain the section name (maybe we should warn if it is not there, I'll see if we can add it).
So maybe something like:
{'Args/param1': 1}
Or
{'General/param1': 1}
Can you verify it solves the issue?
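If there are many parameters, a small helper can add the section prefix before the dict is handed to add_step. This is only a sketch: `with_section` is a hypothetical function, not part of clearml, and the default section name "General" is an assumption taken from the example above.

```python
from typing import Any, Dict


def with_section(params: Dict[str, Any], section: str = "General") -> Dict[str, Any]:
    """Prefix bare parameter names with a section name.

    Keys that already look like "Section/name" are left untouched.
    """
    return {
        (name if "/" in name else f"{section}/{name}"): value
        for name, value in params.items()
    }


# "param1" gets the default section, "Args/lr" is kept as-is.
overrides = with_section({"param1": 1, "Args/lr": 0.1})
print(overrides)
```

Which section to use depends on how the parameters were connected in the original task, so check the section names shown in the task's configuration tab first.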
Thanks DefeatedOstrich93
Let me check if I can reproduce it.