The main reason for adding the timeout was that the warning was annoying to users 🙂
The secondary reason was that clearml will start reporting based on seconds from start, and then, once iterations start, it will revert back to iterations. But if the iterations are "epochs", the numbers are lower, so you end up with a graph that does not match the expected "iterations" x-axis. Make sense?
This will set more time before the timeout, right?
Correct.
task.freeze_monitor()
download()
task.defrost_monitor()
Currently there isn't, but that's a good idea.
What would be the argument of using it vs increasing the timeout ?
btw: setting the resource timeout to 99999 basically means that it will wait until the first reported iteration, not that it will just sleep for 99999 sec 🙂
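To make the timeout behavior above a bit more concrete, here is a toy model of the decision as described in this thread. It is only a sketch, not ClearML's actual code: the function name `pick_x_axis` and its parameters are made up for illustration.

```python
from typing import Optional


def pick_x_axis(elapsed_seconds: float,
                last_reported_iteration: Optional[int],
                timeout_seconds: float) -> Optional[str]:
    """Toy model of the x-axis choice described above (hypothetical helper)."""
    if last_reported_iteration is not None:
        # The training loop has reported at least one iteration: use iterations.
        return "iterations"
    if elapsed_seconds < timeout_seconds:
        # Still inside the grace period: keep waiting for a first iteration.
        return None
    # Timeout passed with no iteration reported: fall back to wall-clock time.
    return "seconds_from_start"
```

With a very large timeout (e.g. 99999) the second branch keeps returning `None` until an iteration shows up, which is the "wait until the first reported iteration" behavior mentioned above.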
Yes, it is reproducible. Do you want a snippet?
Already fixed 🙂 please ping tomorrow, I think an RC should be out soon with the fix
😞 CooperativeFox72 please see if you can send a code snippet to reproduce the issue. I'd be happy to solve it ...
Hi CooperativeFox72
But my docker image has all my code and all the packages it needs. I don't understand why the agent needs to install all of those again?
So based on the docker file you previously posted, I think all your python packages are actually installed on the "appuser" and not as system packages.
Basically remove the "add user" part and the --user from the pip install.
For example:
` FROM nvidia/cuda:10.1-cudnn7-devel
ENV DEBIAN_FRONTEND noninteractive
RUN ...
Maybe we should rename it?! it actually creates a Task but will not auto connect it...
CooperativeFox72
Could you try to run the docker and then inside the docker try to do:
su root
whoami
Okay we have something 🙂
To your clearml.conf add:
agent.docker_preprocess_bash_script = [ "su root", "cp -f /root/*.conf ~/", ]
Let's see if that works
I am creating this user
Please explain, I think this is the culprit ...
But I think they did it for a reason, no?
Not a very good one, they just installed everything under the user and used --user for the pip.
It really does not matter inside a docker, the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with "root" user, but with 1000 user id.
Yes, this is definitely the issue: the agent assumes the docker user is "root".
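A quick way to confirm which user the container actually runs as, from inside Python rather than with `whoami`, could look like the sketch below. It relies only on the standard library; the helper name `running_as_root` is made up for illustration, and `os.geteuid` is available on POSIX systems only.

```python
import os


def running_as_root() -> bool:
    """Return True when the current process has effective user id 0 (POSIX only)."""
    return os.geteuid() == 0
```

If this returns `False` inside the container, the image has switched away from "root" (for example with a `USER` line in the Dockerfile), which is the situation discussed above.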
Let me check something
CooperativeFox72 could you expand on "not working"?
If you have a yaml file, I would do:
` # local_path = './my_config.yaml'
path = task.connect_configuration(local_path, name=name)
if task.running_locally():
    with open(local_path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
    my_params_dict['change_me'] = 'new value'
    my_params_text = yaml.dump(my_params_dict)
    # store back the change, my_params assumed to be the content of the param file (tex...
This one should work:
` path = task.connect_configuration(path, name=name)
if task.running_locally():
    my_params = read_from_path(path)
    my_params = change_params(my_params)  # change some stuff
    # store back the change, my_params assumed to be the content of the param file (text)
    task.set_configuration_object(name=name, config_text=my_params) `
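The two helpers in the snippet above, `read_from_path` and `change_params`, are placeholders. A minimal sketch of what they might look like follows, assuming the config file holds JSON so that the example needs only the standard library (for a YAML file you would swap in the equivalent `yaml` calls). The key `change_me` is taken from the earlier YAML example.

```python
import json
from pathlib import Path


def read_from_path(path: str) -> str:
    """Hypothetical helper: return the raw text of the config file."""
    return Path(path).read_text(encoding="utf-8")


def change_params(params_text: str) -> str:
    """Hypothetical helper: parse the text, change one value, serialize it back."""
    params = json.loads(params_text)
    params["change_me"] = "new value"
    return json.dumps(params, indent=2)
```

The returned text is what you would pass as `config_text` when storing the change back.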
Hi LudicrousDeer3
It should not be a problem, see the iteration argument in Logger.report_scalar
https://github.com/allegroai/clearml/blob/22d795f68f0175ba9511cabd444ea4dba464f3cd/examples/reporting/scalar_reporting.py#L19
https://allegro.ai/clearml/docs/rst/references/clearml_python_ref/logger_module/logger_logger.html?highlight=report_scalar#clearml.logger.Logger.report_scalar
LudicrousDeer3 when using Logger you can provide 'iteration' argument, is this what you are looking for?
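As a rough sketch of passing your own 'iteration' value, the helper below reports one scalar per epoch and uses the epoch index as the iteration. The function name `report_epoch_losses` and the default title/series strings are made up for illustration; the only thing it assumes about `logger` is a `report_scalar(title, series, value, iteration)` method, as in the linked example.

```python
from typing import Iterable


def report_epoch_losses(logger, losses: Iterable[float],
                        title: str = "loss", series: str = "train") -> int:
    """Report one scalar per epoch, passing the epoch index as `iteration`."""
    count = 0
    for epoch, loss in enumerate(losses):
        # The x-axis position is whatever we pass as `iteration`.
        logger.report_scalar(title=title, series=series, value=loss, iteration=epoch)
        count += 1
    return count
```

In a ClearML script you would typically pass `Logger.current_logger()` (or `task.get_logger()`) as `logger`.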
Hi RobustFlamingo1
The ClearML Orchestrator looks interesting. But the website suggests that K8S is required
No, k8s is not a must, only an option 🙂
We have a Linux training box (LambdaBox) where we want to run training. Can we place the ClearML orchestrator agent on the machine without needing K8S?
Yes should be quite easy.
If you intend to use containers, make sure you have docker installed.
Then just pip install clearml-agent and configure it:
https://clear.ml/doc...
Hi CooperativeFox72 trains 0.16 is out, did it solve this issue? (btw: you can upgrade trains to 0.16 without upgrading the trains-server)
CooperativeFox72 you can start by checking the latest RC :)
pip install trains==0.15.2rc0
Thanks CooperativeFox72 ! I'll test and keep you posted 🙂
CooperativeFox72 we are aware of Pool throwing exception that causes things to hang. Fix will be deployed in 0.16 (due to be released tomorrow).
Do you have code to reproduce it, so I can verify the fix solves the issue?
Hi CooperativeFox72
Sure 🙂
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
CooperativeFox72 this is indeed sad news 😞
When you have the time, please see if you can send a code snippet to reproduce the issue. I'd like to have it fixed
The issue itself is changing the default user.
USER appuser
WORKDIR /home/appuser
Any reason for it ?
GiganticTurtle0 notice that when you spin up an agent with --services-mode, you basically let it run many Tasks at once (in contrast to the default behavior, where you have one Task per agent).