task._wait_for_repo_detection()
You can use the above to wait until the repository & packages are detected
(If this is something users need, we should probably make it a public function.)
and this
server_info['url'] = f"http://{server_info['hostname']}:{server_info['port']}/{server_info['base_url']}/"
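As a quick illustration, the same URL can be assembled from a `server_info` dict; the hostname/port/base_url values below are made-up placeholders, not anything from a real deployment:

```python
# Hypothetical server_info dict; keys mirror the snippet above.
server_info = {
    "hostname": "localhost",
    "port": 8081,
    "base_url": "files",
}
# Same f-string pattern as above, written with .format() for clarity
server_info["url"] = "http://{hostname}:{port}/{base_url}/".format(**server_info)
print(server_info["url"])  # http://localhost:8081/files/
```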
agree, but setting the agent's env variable TMPDIR
I think this needs to be passed to the docker with -e TMPDIR=/new/tmp
as additional container args:
see example
None
wdyt?
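For reference, a `clearml.conf` fragment that passes the extra `-e` argument through to the container might look roughly like this; the `/new/tmp` path is just a placeholder:

```
agent {
    # extra arguments appended to the docker run command
    extra_docker_arguments: ["-e", "TMPDIR=/new/tmp"]
}
```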
Hi @<1695969549783928832:profile|ObedientTurkey46>
Use --services-mode in the agent; it will run multiple Tasks on the same machine. This is usually associated with the services queue, but it can run on any queue. This way you could have the same machine easily running those multiple "control" tasks.
wdyt?
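As a sketch, spinning up an agent in services mode would look something like this (the queue name "services" is just an example; any queue works):

```shell
clearml-agent daemon --services-mode --queue services --docker --detached
```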
CheerfulGorilla72
Update: I see NaN in TensorBoard, and 0 in ClearML.
I have to admit, since NaNs are actually skipped in the graph, should we even log them?
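If we do want to skip them client-side, a tiny sketch of such a filter (the `safe_report` helper is hypothetical, not a ClearML API):

```python
import math

def safe_report(value):
    """Return None for NaN values instead of logging them,
    mirroring how the plot silently drops them."""
    if isinstance(value, float) and math.isnan(value):
        return None  # skipped, nothing reported
    return value

print(safe_report(float("nan")))  # None
print(safe_report(0.5))           # 0.5
```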
Hmm, so this is kind of a hack for ClearML AWS autoscaling ?
and every instance is running an agent? or a single Task?
I want a blacklist of things I DONT want to report
Only whitelisting is currently supported
Should not be very complicated to add:
https://github.com/allegroai/clearml/blob/b24ed1937cf8a685f929aef5ac0625449d29cb69/clearml/task.py#L4096
Maybe everything that starts with an exclamation mark is an exclusion, e.g. "!*.bin"
would skip only *.bin files
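A sketch of how such exclamation-mark exclusion patterns could be evaluated (the `matches` helper is hypothetical, just to illustrate the proposed semantics):

```python
from fnmatch import fnmatch

def matches(patterns, filename):
    """'!' prefix marks an exclusion pattern, e.g. '!*.bin' skips .bin files.
    A file is kept if it hits no exclusion pattern and (when plain patterns
    exist) hits at least one of them."""
    include = [p for p in patterns if not p.startswith("!")]
    exclude = [p[1:] for p in patterns if p.startswith("!")]
    if any(fnmatch(filename, p) for p in exclude):
        return False
    return any(fnmatch(filename, p) for p in include) if include else True

print(matches(["*", "!*.bin"], "weights.bin"))  # False
print(matches(["*", "!*.bin"], "config.yaml"))  # True
```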
I basically moved the Task.init() call below the imports
Okay, that is odd. Can you copy paste the before/after of the import, so we can fix that?!
however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work
I have to admit, this is where I'm losing you.
I thought you wanted to avoid the agent, since you wanted to run everything locally, wasn't that the issue ?
Maybe there is some background missing here, let me see if I can explain how the optimizer works.
In your actual training code you have something like: `params = {'lr': 0.3, ...`
WackyRabbit7
we did execute locally
Sure, instead of `pipe.start()`
use `pipe.start_locally(run_pipeline_steps_locally=False)`
, and this is it
I can install clearml and clearml-agent and run the worker inside a docker
oh I see, you should install it inside a docker container, then mount the docker socket so it can spin sibling containers, and lastly make sure the mounts are correct with this env variable:
None
Let me check... I think you might need to docker exec
Anyhow, I would start by upgrading the server itself.
Sounds good?
model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on device
Where was it running?
I take it that these files are also brought into the pipeline task's local disk?
Unless you changed the object, then no, they should not be downloaded (the "link" is passed)
SmarmyDolphin68 okay, what's happening is the process exits before the actual data is sent (report_matplotlib_figure is an async call, and data is sent in the background)
Basically you should just wait for all the events to be flushed: task.flush(wait_for_uploads=True)
That said, quickly testing it it seems it does not wait properly (again I think this is due to the fact we do not have a main Task here, I'll continue debugging)
In the meantime you can just do sleep(3.0)
And it wil...
Hi @<1720249421582569472:profile|NonchalantSeaanemone34>
pipeline decorator where lambda function call another function(say
xyz
) and during pipeline execution, error is thrown that
xyz
is not defined?
Each pipeline function becomes a standalone "script", so if the lambda function is defined outside of the decorated pipeline component function, I assume it would throw an undefined-name error.
My suggestion would be to define the lambda function as a nes...
Nice!
btw: clone=True
means creating a copy of the running Task, but basically there is no need for that. With clone=False, it will stop the running process and launch it on the remote host, logging everything on the original Task.
LOL
Make sure that when you train the model or create it manually you set the default "output_uri"
task = Task.init(..., output_uri=True)
or
task = Task.init(..., output_uri="s3://...")
yup, it's there in draft mode so I can get the latest git commit when it's used as a base task
Yes that seems to be the problem, if it is in draft mode, you have no outputs...
Gitlab has support for S3 based cache btw.
This might still be considered "slow" compared to local-dist/cluster mount
Would adding support for some sort of post task script help? Is something already there?
Interesting, can you expand on the use case? (currently there is only pre-task script, for setup)
No. Since you are using Pool, there is no need to call Task.init again. Just call it once before you create the Pool; then, when you want to use it, just do task = Task.current_task()
I've seen parameters connect and task create in seconds, and other times it takes 4 minutes.
This might be your backend (clearml-server) replying slowly because of load?
Is there a way (at the class level) to control the retry logic on connecting to the API server?
The difference in the two screenshots is literally only the URLs in
clearml.conf
and it went from 30s down to 2-3s.
Yes that could be network, also notice that there is aut...
ClumsyElephant70
Could it be virtualenv package is not installed on the host machine ?
(From the log it seems you are running in venv mode, is that correct?)
@<1523710674990010368:profile|GreasyPenguin14> make sure it uses https, not ssh:
edit ~/clearml.conf
force_git_ssh_protocol: false
and that you have both git_user & git_pass set in your clearml.conf
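Putting it together, the relevant `~/clearml.conf` fragment would look roughly like this (the user/token values are placeholders):

```
agent {
    force_git_ssh_protocol: false
    git_user: "my-git-user"      # placeholder
    git_pass: "my-git-token"     # placeholder
}
```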
CooperativeFox72 could you expand on "not working"?
If you have a yaml file, I would do:
```python
# local_path = './my_config.yaml'
path = task.connect_configuration(local_path, name=name)
if task.running_locally():
    with open(local_path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
    my_params_dict['change_me'] = 'new value'
    my_params_text = yaml.dump(my_params_dict)
    # store back the change, my_params assumed to be the content of the param file (tex...
```
RoundMosquito25 do notice the agent pulls the code from the remote repo, so you do need to push the local commits, but ClearML will take care of the uncommitted changes for you. Make sense?