even if I explicitly use previous_task.output_uri = "s3://my_bucket", it is ignored and still saves the json file locally
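To be concrete, this is roughly what I'm doing (the task id and the artifact call are illustrative placeholders, not my exact code):
```python
from clearml import Task

previous_task = Task.get_task(task_id="<previous_task_id>")  # placeholder id
previous_task.output_uri = "s3://my_bucket"

# the json still ends up on the local disk instead of in the S3 bucket
previous_task.upload_artifact("my_config", artifact_object={"key": "value"})
```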
Otherwise I can try loading the file with a custom loader, saving it as a temp file, and passing that temp file to connect_configuration; it will return another temp file with the overridden config, which I can then pass to OmegaConf
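Something like this minimal sketch is what I have in mind (file names are placeholders, and the "custom loader" step is plain OmegaConf.load here):
```python
import tempfile
from pathlib import Path

from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="example", task_name="omegaconf-workaround")

# load the original config (custom loading logic would go here)
cfg = OmegaConf.load("config.yaml")

# dump it to a temp file so connect_configuration can pick it up
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    OmegaConf.save(cfg, f)
    tmp_path = Path(f.name)

# connect_configuration returns another temp file with the (possibly overridden) config
overridden_path = task.connect_configuration(tmp_path, name="config")

# finally load the overridden file back into OmegaConf
cfg = OmegaConf.load(overridden_path)
```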
I am actually calling the following later in the start_training function: `with idist.Parallel(backend="nccl") as parallel: parallel.run(training_func)`. So my backend should be nccl and not gloo, right? Not sure how important it is; I read at https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training
I will try with clearml==1.1.5rc2
In the comparison the problem will be the same, right? If I choose last/min/max values, it won't tell me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of compared experiments grows
As a quick fix, can you test with auto refresh (see the top-right button with the pause sign in your video)?
That doesn't work, unfortunately
But I would need to reindex everything, right? Is that an expensive operation?
yes, because it won't install the local package that has the setup.py with the install_requires problem described in my previous message
Default would be venv, and only use docker if an image is passed. Use case: not having to duplicate all queues to accept both docker and venv agents on the same instances
I also don't understand what you mean by "unless the domain is different"... The same way SSH keys are global, I would have expected the git creds to be used for any git operation
What is weird is:
- Executing the task from an agent: task.get_parameters() returns an empty dict
- Calling task.get_parameters() from a local standalone script returns the correct properties, as shown in the web UI, even if I updated them in the UI.

So I guess the problem comes from trains-agent?
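For reference, this is roughly the standalone check I'm running (the task id is a placeholder; the call is the same whether the import is trains or clearml):
```python
from trains import Task  # `from clearml import Task` on newer versions

task = Task.get_task(task_id="<id_of_the_task_run_by_the_agent>")
print(task.get_parameters())  # returns the parameters shown in the web UI
```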
Nice, thanks!
Yes I agree, but I get a strange error when using dataloaders:
`RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error`
only when I use num_workers > 0
And if you need a very small change, you can also simply monkey-patch it ( https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/ )
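For example, something as simple as this (patching json.dumps purely for illustration):
```python
import json

_original_dumps = json.dumps

def _patched_dumps(obj, **kwargs):
    # tweak the behaviour slightly, then fall back to the original function
    kwargs.setdefault("indent", 2)
    return _original_dumps(obj, **kwargs)

json.dumps = _patched_dumps  # every later call to json.dumps gets the patched version
print(json.dumps({"a": 1}))
```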
Thanks for the hint, I'll check the paid version, but I'd first like to understand how much effort it would be to fix the current situation myself 🙂
It worked like a charm! Awesome, thanks AgitatedDove14!
I reindexed only the logs to a new index afterwards; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
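For the record, the reindex step itself is roughly this (index names and mapping are placeholders; assumes a v7-style elasticsearch Python client that still accepts body=):
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# create the target index with the corrected mapping first
corrected_mappings = {"dynamic": "strict"}  # placeholder for the fixed mapping
es.indices.create(index="events-metrics-new", body={"mappings": corrected_mappings})

# then copy the documents over from the broken index
es.reindex(
    body={
        "source": {"index": "events-metrics-old"},
        "dest": {"index": "events-metrics-new"},
    },
    wait_for_completion=True,
)
```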
Hi TimelyPenguin76, any chance this was fixed already? 🙂
Hi TimelyPenguin76, any chance this was fixed? 🙂
Hi AgitatedDove14, thanks for the answer! I will try adding `multiprocessing_context='forkserver'` to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any performance drop using forkserver? If yes, did you test the variant suggested i...
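Concretely, what I plan to try looks like this (toy dataset and batch size are just placeholders):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        multiprocessing_context="forkserver",  # the workaround from the linked issue
    )

    for batch in loader:
        pass
```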
Oh yes, this could work as well, thanks AgitatedDove14!
Super! I'll give it a try and keep you updated here, thanks a lot for your efforts 🙂
So if all artifacts are logged in the pipeline controller task, I need the last step to access all the artifacts from that controller task, i.e. to execute something like PipelineController.get_artifact() in the last step's task
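Something along these lines in the last step is what I'm after (the artifact name is a placeholder; I'm assuming the controller task is the parent of the step task):
```python
from clearml import Task

current_step = Task.current_task()
controller = Task.get_task(task_id=current_step.parent)

# fetch an artifact that was logged on the pipeline controller task
processed = controller.artifacts["processed_data"].get()
```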
it worked for the other folder, so I assume yes --> I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, fixed the permissions, and now it works
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free it and it ends up accumulating. I don't know where yet
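This is roughly how I'm checking whether the figures are actually released (just a debugging sketch):
```python
import gc

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

plt.close("all")
gc.collect()

print("open figures:", plt.get_fignums())  # expected to be empty

# look for Figure objects the garbage collector still sees
leaked = [obj for obj in gc.get_objects() if isinstance(obj, Figure)]
print("Figure objects still alive:", len(leaked))
if leaked:
    # show a few of the objects still referencing the first leaked figure
    print(gc.get_referrers(leaked[0])[:3])
```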
```
trains-elastic | {"type": "server", "timestamp": "2020-08-12T11:01:33,709Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "trains", "node.name": "trains", "message": "uncaught exception in thread [main]",
trains-elastic | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];",
trains-elastic | "at org.elasticsearc...
```
AgitatedDove14 WOW, thanks a lot! I will dig into that 🙂
And after the update, the loss graph appears
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
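A possible workaround is to cast the proxy back to a plain dict first (the helper is my own; it assumes the proxy behaves like a regular dict, and the stand-in values below are just for the demo):
```python
from omegaconf import OmegaConf

def to_plain_dict(d):
    # recursively copy so nested proxy dicts become regular dicts too
    return {k: to_plain_dict(v) if isinstance(v, dict) else v for k, v in d.items()}

# `connected_params` stands for the ProxyDictPostWrite returned by task.connect(...)
connected_params = {"lr": 0.1, "model": {"depth": 4}}  # plain-dict stand-in for the demo
cfg = OmegaConf.create(to_plain_dict(connected_params))
print(cfg.model.depth)
```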
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I pass one task id (task A) as the parent id and the other one (the ID of task B) as a hyper-parameter, as you described 🙂
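Roughly how I wire it up (the parameter name and ids are my own placeholders):
```python
from clearml import Task

task_b_id = "<id of task B>"  # placeholder

# task A is set as the actual parent of task C; task B's id travels as a hyper-parameter
task_c = Task.init(project_name="example", task_name="task C")
task_c.set_parameter("General/parent_task_b", task_b_id)

# inside task C, recover task B from the hyper-parameter
task_b = Task.get_task(task_id=task_c.get_parameter("General/parent_task_b"))
```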