It's the same but done from the outside; you want the same, and "offline" as well, right?
Let me check something
CooperativeFox72 could you expand on "not working"?
If you have a yaml file, I would do:
```
import yaml

# local_path = './my_config.yaml'
path = task.connect_configuration(local_path, name=name)
if task.running_locally():
    with open(local_path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
    my_params_dict['change_me'] = 'new value'
    my_params_text = yaml.dump(my_params_dict)
    # store back the change, my_params_text assumed to be the content of the param file (text)
    task.set_configuration_object(name=name, config_text=my_params_text)
else:
    # in remote execution `path` points at the configuration fetched from the server
    with open(path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
```
Very odd, I still can't reproduce. This is just the cleanup service running, without anything else?
What's the clearml version it is using?
Nope - confirmed to be running on the OS's Python environment,
okay so bare metal root is definitely not recommended.
I'm not sure how/why it gets stuck though
Any chance you can run the agent as non-root?
Also, running it in docker mode is maybe preferable, so it is easier for you to control the environment of the Task
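For example (the standard docker-mode daemon invocation; the image here is just a placeholder):
```
clearml-agent daemon --queue default --docker python:3.10
```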
Thanks DefeatedOstrich93
Let me check if I can reproduce it.
SubstantialElk6 (2) yes definitely will be fixed
Regarding (1), what do you mean by "via the code"? Do you mean like as a Task docker cmd?
@<1639799308809146368:profile|TritePigeon86> +1
```
task.wait_for_status()
task.reload()
task.artifacts["output"].get()
```
Well, it should work out of the box as long as you have the full route, i.e. Section/param
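For example (the parameter names here are hypothetical):
```
# override a hyperparameter using the full "Section/param" route
task.set_parameter("Args/batch_size", 64)
```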
and I run the agent from a local user, and I would expect that setting to have effect: `-v /home/localuser/.ssh:/home/testuser/.ssh`
It does not map it directly; it creates a temp copy of the entire ".ssh" folder in the host /tmp folder, then maps this folder inside the container:
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/clearml_agent/commands/worker.py#L3422
Notice that the "docker_internal_mounts" section is nested inside the "agent" section ...
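For example, in clearml.conf (a minimal sketch; exact key names may vary between agent versions):
```
agent {
    docker_internal_mounts {
        # where the temp copy of the host ".ssh" folder is mounted inside the container
        ssh_folder: "/root/.ssh"
    }
}
```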
Nice! So out of curiosity why didn't it work this time and you had to do it manually?
Hi @<1566596960691949568:profile|UpsetWalrus59>
All correct, with the exception of "...or 1GB Metric": this is a limit, since metrics (and metadata) are always stored on the clearml-server, so they are metered. There is also an API limit, basically anti-abuse, which of course resets every month, but if you are running tens of experiments at the same time you will hit this limit. Make sense?
why doesn't this happen on my other experiments?
same 100+ reports ?
(My new theory is that calling Task.reload() will fix it, and it might be called internally for the other experiments, like when reporting models/artifacts)
Could that be the case ?
Are there any services OOB like this?
On the open-source version I can't recall any, but it would probably be easy to write. The paid tier might have an offering though, not sure
oh, if this is the case, why not use the "main" server?
Because it lives behind a VPN and github workers don't have access to it
makes sense
If this is the case, I have to admit that combining offline-mode and remote execution makes sense, no?
SpotlessFish46
1. Yes, you can access the entire code in the uncommitted changes. You can test it with (see also the snippet after this list):
```
task = Task.get_task(task_id='aabb')
task_dict = task.export_task()
```
2. Correct, but then if you need the entire code base you need to clone the repo and apply the uncommitted changes. Basically trains-agent does that when executed with build:
```
trains-agent build --id aabb --target ~/my_task_env
```
3. See (2)
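To pull the uncommitted changes out of the exported dict, something along these lines should work (the exact key layout is from memory, treat it as an assumption):
```
from clearml import Task

task = Task.get_task(task_id='aabb')
task_dict = task.export_task()
# the uncommitted changes are stored as a git diff on the Task,
# presumably under the "script" section of the export
print(task_dict["script"]["diff"])
```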
... training script was set to upload every epoch. Seems like this resulted in a torrent of metrics being uploaded.
oh that makes sense, so basically you were bombarding the server with requests, ending up with a kind of denial of service
This should have worked with the latest clearml RC.
And you verified it is not working?
So there is a hack for it:
```
CLEARML_OFFLINE_MODE=1 python3 my_main.py
```
Which is the same as calling Task.set_offline
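i.e. (assuming the standard classmethod signature; as far as I remember it has to be called before Task.init):
```
from clearml import Task

# same effect as exporting CLEARML_OFFLINE_MODE=1
Task.set_offline(offline_mode=True)
```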
Then inside the code, after the Task.init call:
```
task = Task.init(...)
# not sure what the if here is?!
Task.debug_simulate_remote_task(task_id="offline-1")
```
This will make things act as if this is running remotely, i.e. your Task.running_remotely() logic will kick in.
Do notice that in remote mode, all the arguments / data are read from the clearml-server into the code.
tf datasets is able to handle batch downloading quite well.
SubstantialElk6 I was not aware of that, I was under the impression tf dataset is accessed on a file level, no?
Yes, albeit not actually "intercept", as the user will still be able to put Tasks directly in the B_machine_a/B_machine_b queues, but any time the user pushes Tasks into queue B, this service will pull them and push them to the individual machine queues.
what do you think?
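Something along these lines (a rough sketch; the APIClient response layout and the queue names are assumptions):
```
import itertools
import time

from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
# id of the "virtual" queue B that users push into (name is hypothetical)
queue_b = client.queues.get_all(name="queue_b")[0].id
# round-robin over the per-machine queues
machine_queues = itertools.cycle(["B_machine_a", "B_machine_b"])

while True:
    result = client.queues.get_next_task(queue=queue_b)
    if result and result.entry:
        # re-route the dequeued Task into the next machine's queue
        Task.enqueue(task=result.entry.task, queue_name=next(machine_queues))
    else:
        time.sleep(5.0)
```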
ClearML maintains a github action that sets up a dummy clearml-server,
You have one: http://app.clear.ml (not a dummy one, but for this purpose it will work)
thoughts?
Ok no, it only helps as far as I don't log the figure.
you mean if you create the matplotlib figure with no automagic connect, you still see the mem leak?
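i.e. a repro along these lines (hypothetical sketch):
```
import matplotlib.pyplot as plt

# create figures in a loop without reporting/logging them anywhere
for i in range(1000):
    fig = plt.figure()
    plt.plot(range(10))
    plt.close(fig)  # does memory still grow even with an explicit close?
```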
I just cloned it from the examples that are available in the SaaS console upon account creation
Ohhh! that would explain it. Maybe it is broken there?! let me check a second
Correct, which makes sense if you have a stochastic process and you are looking for the best model snapshot. That said I guess the default use case would be min/max (and not the global variant)
is it displaying that it is running anything?
but it is still not able to run any task after I abort and rerun another task
When you "run" a task you are pushing it to a queue, so how come a queue is empty? what happens after you push your newly cloned task to the queue ?