I am already trying with the latest pip 😞
Hey SuccessfulKoala55, unfortunately this doesn't work, because the dict contains other dicts: only the first-level object gets converted to a plain dict, while the inner ones are still ProxyDictPostWrite instances, which makes OmegaConf.create fail
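A rough sketch of the workaround I have in mind (the helper name is mine, and I am assuming ProxyDictPostWrite behaves like a regular dict/mapping):

from omegaconf import OmegaConf

def to_plain(obj):
    # recursively turn ProxyDictPostWrite (and nested containers) into plain Python types
    if isinstance(obj, dict):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj

# config_proxy is whatever task.connect_configuration returned
# conf = OmegaConf.create(to_plain(config_proxy))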
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick up another one. If I add one more experiment to the queue at that point (trains-agent-1 now running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it up because they are also running experiments)
line 13 is empty 🤔
No space, I will add and test 🙂
Hi CostlyOstrich36, I am not using Hydra, only OmegaConf, so you mean just calling OmegaConf.load should be enough?
with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader)
Note: could be related to https://github.com/allegroai/clearml/issues/790, not sure
I have a custom way of reading the config file
it would be nice if Task.connect_configuration could support custom yaml file readers for me
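Something like this is what I have in mind, as a sketch only: the read_config helper just wraps the reader above (the name is mine), and I am assuming connect_configuration is happy with the already-parsed dict:

import yaml
from clearml import Task

def read_config(path):
    # the custom reader from above, wrapped in a helper for illustration
    with open(path, "r") as stream:
        return yaml.load(stream, Loader=yaml.FullLoader)

task = Task.init(project_name="example", task_name="example")  # placeholder names
# pass the already-parsed dict instead of letting ClearML parse the file itself
config = task.connect_configuration(read_config("config.yaml"), name="config")  # placeholder path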
And since I ran the task locally with python3.9, it used that version in the docker container
Ooh, that's cool! I could place torch==1.3.1 there
AgitatedDove14 Didn’t work 😞
Mmmh, I just restarted the experiment and it seems to work now; I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with --depth 1
in the agents?
Or more generally, try to catch this error and retry a few times?
Hi SuccessfulKoala55, I was able to find the issue: I was creating a queue and worker subprocesses that were not properly cleaned up
Unfortunately this is difficult to reproduce... Nevertheless it would be important for me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they already ran (a mechanism similar to what https://luigi.readthedocs.io/en/stable/ offers). That would allow restarting the pipeline and skipping tasks until the point where it previously failed, something like the sketch below
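A rough sketch of the kind of check I mean (the project/step names are placeholders, and the status filter is a guess on my side):

from clearml import Task

def already_completed(project_name, task_name):
    # look for a previously completed task with the same project/name
    previous = Task.get_tasks(
        project_name=project_name,
        task_name=task_name,
        task_filter={"status": ["completed"]},
    )
    return len(previous) > 0

if already_completed("my_pipeline", "preprocess_step"):  # placeholder names
    print("Step already ran, skipping it")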
ClearML has a task.set_initial_iteration, and I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
But still the same issue; I am not sure whether I am using it correctly and whether it's a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Hi SmugDolphin23, thanks for the input! Will try now, but that seems hacky: to have it working I have to specify python3.8 in two places: once in the agent config file (agent.default_python is already python3.8, but it seems to be ignored), plus making sure it is available (using the python:3.8 docker image). Is there a way to prevent this redundancy? I.e. if I want to change the Python version, can I control it from a single place?
Traceback (most recent call last):
File "devops/train.py", line 73, in <module>
train(parse_args)
File "devops/train.py", line 37, in train
train_task.get_logger().set_default_upload_destination(args.artifacts + '/clearml_debug_images/')
File "/home/machine/miniconda3/envs/py36/lib/python3.6/site-packages/clearml/logger.py", line 1038, in set_default_upload_destination
uri = storage.verify_upload(folder_uri=uri)
File "/home/machine/miniconda3/envs/py36/lib/python3.6/site...
Yes, I stayed with an older version for a compatibility reason I cannot remember now 😄 - just tested with 1.1.2 and it’s the same
I tried specifying the bucket directly in my clearml.conf, same problem. I guess clearml just reads from the env vars first
I hit enter too fast ^^
Installing them globally via
$ pip install numpy opencv torch
will install locally with the warning:
Defaulting to user installation because normal site-packages is not writeable
therefore the installation will take place in ~/.local/lib/python3.6/site-packages instead of the default one. Will this still be considered as global site-packages and still be included in the experiments' envs? From what I tested, it does.
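For reference, a quick way to check where the user installation lands and whether that directory is visible to the interpreter (standard library only, nothing ClearML-specific):

import site
import sys

print(sys.version)                  # interpreter the experiment runs with
print(site.getusersitepackages())   # e.g. ~/.local/lib/python3.6/site-packages
print(site.getsitepackages())       # the "global" site-packages directories
print(site.ENABLE_USER_SITE)        # whether the user site dir is actually added to sys.path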
Yes that’s correct - the weird thing is that the error shows the right detected region
Trying your code now… should take a couple of mins
SuccessfulKoala55 I want to avoid writing creds in plain text in the config file
But clearml does read from env vars as well right? It’s not just delegating resolution to the aws cli, so it should be possible to specify the region to use for the logger, right?
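What I would like to end up with is something like this sketch (bucket, region and task names are placeholders, and I am assuming the standard AWS_DEFAULT_REGION env var is picked up):

import os
from clearml import Task

# assumption: boto3/clearml resolve the region from the standard env var
os.environ.setdefault("AWS_DEFAULT_REGION", "eu-west-1")  # placeholder region

task = Task.init(project_name="example", task_name="example")  # placeholder names
# same call as in the traceback above
task.get_logger().set_default_upload_destination("s3://my-bucket/clearml_debug_images/")  # placeholder bucket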