Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
but the post_packages does not reinstall version 1.7.1
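(For reference, the pin I'm talking about would look something like this in clearml.conf; just a sketch of what I'd expect, assuming it's the clearml package that should be pinned, not necessarily my exact config.)
```
# sketch: pinning the version via the agent's post_packages
agent.package_manager.post_packages = ["clearml==1.7.1"]
```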
And so in the UI, in the Workers & Queues tab, I randomly see one of the two experiments for the worker that is running both experiments
yes, exactly: I run `python my_script.py`, the script executes, creates the task, calls `task.execute_remotely(exit_process=True)` and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml
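For reference, the script boils down to something like this (a minimal sketch; the project/task names and the queue are placeholders):
```python
from clearml import Task

# Minimal sketch of my_script.py (project/task names and queue are placeholders)
task = Task.init(project_name="examples", task_name="my_script")

# Stop local execution and enqueue the task for a clearml-agent;
# with exit_process=True the local process exits right after this call.
task.execute_remotely(queue_name="default", exit_process=True)

# Everything below only runs on the agent, never locally.
print("running remotely")
```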
There is a pinned GitHub thread at https://github.com/allegroai/clearml/issues/81 , seems like the right place?
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
Hi TimelyPenguin76 , any chance this was fixed? 🙂
The rest of the configuration is set with env variables
that would work for pytorch and clearml, yes, but what about my local package?
Hi CostlyOstrich36, I am not using Hydra, only OmegaConf, so you mean just calling `OmegaConf.load` should be enough?
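Just to be sure I understand, something along these lines? (a sketch; the config path and connecting it to the task are my assumptions)
```python
from omegaconf import OmegaConf
from clearml import Task

task = Task.init(project_name="examples", task_name="omegaconf-test")

# Load the config directly with OmegaConf (path is a placeholder)
cfg = OmegaConf.load("config.yaml")

# Register it on the task so it shows up under CONFIGURATION in the UI
task.connect_configuration(
    configuration=OmegaConf.to_container(cfg, resolve=True),
    name="omegaconf",
)
```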
Mmmh, unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code, to debug locally?
Something like replacing `Task.init` with `Task.get_task`, so that `Task.current_task` is the same task as the output of `Task.get_task`
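Roughly what I have in mind (only a sketch, I don't know whether this actually makes `Task.current_task` point at the fetched task; the task ID is a placeholder):
```python
from clearml import Task

# Fetch the existing task instead of creating a new one with Task.init
# (the task ID is a placeholder)
task = Task.get_task(task_id="<existing-task-id>")

print(task.name, task.get_status())
print(task.get_parameters())

# What I'd like is for this to return that same task:
print(Task.current_task())
```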
Is there any logic on the server side that could change the iteration number?
btw I monkey patched ignite's function `global_step_from_engine` to print the iteration, and passed the modified function to `ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine))`. It prints the correct iteration number when `ClearMLLogger.OutputHandler.__call__` is called.
```python
def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ...
```
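For completeness, the patch itself is roughly this (a sketch; the import path may differ between ignite versions, and the wrapper/print details are from memory):
```python
from ignite.handlers import global_step_from_engine  # on older ignite: ignite.contrib.handlers

def patched_global_step_from_engine(engine):
    # Wrap the stock transform so every lookup also prints the step it returns
    original = global_step_from_engine(engine)

    def transform(_engine, event_name):
        step = original(_engine, event_name)
        print(f"global_step_transform -> {step} ({event_name})")
        return step

    return transform
```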
in my clearml.conf, I only have:
```
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
```
AgitatedDove14 Should I create an issue for this to keep track of it?
I've been carrying this code over from older versions of trains; to be honest, I don't remember precisely why I did it that way
Why is it required when boto3 can figure the credentials out by itself from within the EC2 instance?
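What I mean is that inside the instance something like this already works without any keys in clearml.conf (a sketch, the bucket listing is just an example):
```python
import boto3

# No explicit credentials anywhere: boto3 falls back to its default chain
# (env vars, shared config files, and on EC2 the instance metadata service
# / attached IAM role), so this works from inside the instance.
s3 = boto3.client("s3", region_name="eu-central-1")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```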
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
mmmmh I just restarted the experiment and it seems to work now, I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with `--depth 1` in the agents?
Or more generally, try to catch this error and retry a few times?
AgitatedDove14 I see other RCs on PyPI but no corresponding tags in the clearml-agent repo? Are these releases legit?
What is the latest RC of clearml-agent? 1.5.2rc0?
Oh nice, thanks for pointing this out!
SuccessfulKoala55 I was able to make it work with `use_credentials_chain: true` in the clearml.conf and the following patch: https://github.com/allegroai/clearml/pull/478
Sure! Here are the relevant parts:
```
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
As to why: this is part of the piping that I described in a previous message. Task B requires an artifact from task A, so I pass the name of the artifact as a parameter of task B, so that B knows which artifact from A it should retrieve
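Concretely, the task B side looks roughly like this (a sketch; the parameter names and IDs are placeholders):
```python
from clearml import Task

task_b = Task.init(project_name="examples", task_name="task_b")

# The artifact name (and, here, the source task ID) arrive as parameters,
# so they can be overridden in the UI when task B is cloned/enqueued.
params = {"source_task_id": "<task-A-id>", "artifact_name": "dataset"}
params = task_b.connect(params)

# Fetch the named artifact from task A's outputs
task_a = Task.get_task(task_id=params["source_task_id"])
local_copy = task_a.artifacts[params["artifact_name"]].get_local_copy()
print("Artifact from task A:", local_copy)
```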