But I would need to reindex everything, right? Is that an expensive operation?
yes, because it won't install the local package, which has this setup.py with the problematic install_requires described in my previous message
Default would be venv, and docker would only be used if an image is passed. Use case: not having to duplicate all queues to accept both docker and venv agents on the same instances
I also don't understand what you mean by "unless the domain is different"... The same way SSH keys are global, I would have expected the git creds to be used for any git operation
What is weird is:
Executing the task from an agent: task.get_parameters() returns an empty dict.
Calling task.get_parameters() from a local standalone script: it returns the correct properties, as shown in the web UI, even after I updated them in the UI.
So I guess the problem comes from trains-agent?
Yes I agree, but I get a strange error when using dataloaders:
RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
And if you need a very small change, you can also simply monkey-patch it: https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/
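For example, a tiny sketch of what monkey-patching looks like (patching a stdlib function here purely for illustration):
```
import math

# keep a handle on the original implementation
_original_sqrt = math.sqrt

def logged_sqrt(x):
    # small behaviour change layered on top of the original
    print(f"sqrt called with {x}")
    return _original_sqrt(x)

# monkey-patch: every caller of math.sqrt now goes through the patched version
math.sqrt = logged_sqrt

print(math.sqrt(9.0))  # logs the call, then prints 3.0
```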
Thanks for the hint, I'll check the paid version, but I'd first like to understand how much effort it would take to fix the current situation by myself
It worked like a charm! Awesome, thanks AgitatedDove14!
I reindexed only the logs to a new index afterwards; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
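For reference, this is roughly what the reindex looked like (the endpoint, index names and field mappings below are placeholders, not the actual trains indices):
```
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# 1) create the destination index with the corrected mappings
requests.put(f"{ES}/events-log-fixed", json={
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},  # illustrative field
            "msg": {"type": "text"},
        }
    },
}).raise_for_status()

# 2) copy the documents over with the _reindex API, run as a background task
resp = requests.post(
    f"{ES}/_reindex",
    json={"source": {"index": "events-log-old"}, "dest": {"index": "events-log-fixed"}},
    params={"wait_for_completion": "false"},
)
resp.raise_for_status()
print(resp.json())  # contains the task id to poll via GET /_tasks/<id>
```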
Hi TimelyPenguin76, any chance this was fixed already?
Hi TimelyPenguin76, any chance this was fixed?
Hi AgitatedDove14, thanks for the answer! I will try adding multiprocessing_context='forkserver' to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any drop in performance using forkserver? If yes, did you test the variant suggested i...
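For reference, this is roughly what I plan to try (a minimal sketch with a dummy dataset):
```
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # dummy dataset, just to have something to iterate over
    dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

    # forkserver/spawn workers do not inherit the parent's CUDA context,
    # which is what trips the context_gpu.cu initialization error with fork
    loader = DataLoader(
        dataset,
        batch_size=8,
        num_workers=4,
        multiprocessing_context="forkserver",
    )

    for batch, labels in loader:
        pass  # training step would go here
```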
Oh yes, this could work as well, thanks AgitatedDove14!
So if all artifacts are logged in the pipeline controller task, I need the last task to access all the artifacts from the pipeline task. I need to execute something like PipelineController.get_artifact() in the last step task
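Something along these lines is what I have in mind for the last step (just a sketch; it assumes the controller task is the parent of the step task, and the artifact name is a placeholder):
```
from clearml import Task

# inside the last step of the pipeline
step_task = Task.current_task()

# assumption: the pipeline controller task is the parent of each step task
pipeline_task = Task.get_task(task_id=step_task.parent)

# pull an artifact that was logged on the controller task
artifact = pipeline_task.artifacts["merged_dataset"].get()  # "merged_dataset" is a placeholder name
```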
it worked for the other folder, so I assume yes --> I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, updated the permissions and now it works
I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
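This is roughly how I am trying to track it down (a small gc-based sketch, nothing conclusive yet):
```
import gc
import matplotlib
matplotlib.use("Agg")  # headless backend, just for the test
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

fig = plt.figure()
plt.plot([1, 2, 3])
plt.close("all")
del fig
gc.collect()

# any Figure still alive at this point is being kept alive by a lingering reference
alive = [obj for obj in gc.get_objects() if isinstance(obj, Figure)]
print(f"figures still alive: {len(alive)}")
for f in alive:
    # show who is holding on to it
    print(type(f), gc.get_referrers(f)[:2])
```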
` trains-elastic | {"type": "server", "timestamp": "2020-08-12T11:01:33,709Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "trains", "node.name": "trains", "message": "uncaught exception in thread [main]",
trains-elastic | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];",
trains-elastic | "at org.elasticsearc...
AgitatedDove14 WOW, thanks a lot! I will dig into that
And after the update, the loss graph appears
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
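My current workaround is to cast it to a plain dict before handing it to OmegaConf, something like this (it assumes ProxyDictPostWrite behaves like a regular dict, which seems to be the case):
```
from omegaconf import OmegaConf

def to_plain_dict(d):
    # recursively cast dict-like objects (e.g. ProxyDictPostWrite) to plain dicts
    if isinstance(d, dict):
        return {k: to_plain_dict(v) for k, v in d.items()}
    return d

# with the proxy returned by task.connect()/task.get_parameters() it would be:
#   cfg = OmegaConf.create(to_plain_dict(params))
cfg = OmegaConf.create(to_plain_dict({"lr": 0.001, "model": {"depth": 18}}))
print(cfg.model.depth)
```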
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I use one task id (task A) as the parent id and pass the other one (the id of task B) as a hyper-parameter, as you described.
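Concretely, it looks roughly like this (the task ids below are placeholders):
```
from clearml import Task

# when creating task C
task_c = Task.init(project_name="my_project", task_name="task_c")
task_c.set_parent("<task_a_id>")            # task A is the "real" parent
links = {"parent_task_b": "<task_b_id>"}    # task B id stored as a hyper-parameter
task_c.connect(links)

# later, inside task C, recover task B to use its outputs
task_b = Task.get_task(task_id=links["parent_task_b"])
```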
I can probably have a python script that checks if there are any tasks running/pending, and if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS, waits until it is finished, then restarts the clearml-server, wdyt?
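Roughly what I have in mind (a sketch; the volume id, the compose path and the status filter are assumptions on my side):
```
import subprocess
import boto3
from clearml import Task

# any tasks still running or queued?
busy = Task.get_tasks(task_filter={"status": ["in_progress", "queued"]})

if not busy:
    # stop the server so the EBS volume is in a consistent state
    subprocess.run(["docker-compose", "down"], cwd="/opt/clearml", check=True)

    ec2 = boto3.client("ec2")
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # placeholder EBS volume id
        Description="clearml-server data backup",
    )
    # block until the snapshot is finished
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # bring the server back up
    subprocess.run(["docker-compose", "up", "-d"], cwd="/opt/clearml", check=True)
```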
I've reindexed the data for the logs; now the mappings are correct, but I am missing one month of data and I have literally no idea where this data is or how it disappeared
Here are the logs of the agent :)
` (base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+-------+
| id | name | tags |
+----------------------------------+--------+-------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker | |
+----------------------------------+--------+-------+
Starting infinite tas...
btw, task._get_task_property('hyperparams') also gives me: ValueError: Task has no hyperparams section defined
(I didn't have this problem so far because I was using SSH keys globally, but I now want to switch to git auth using a Personal Access Token for security reasons)
btw in the pytorch_distributed_example I see that you call average_gradients, but the pytorch docs ( https://pytorch.org/tutorials/beginner/dist_overview.html ) say: DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
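So if I read that correctly, just wrapping the model in DDP should be enough, without any manual gradient averaging, e.g. (a minimal single-process CPU/gloo sketch, not your actual example):
```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int):
    # init the default process group (gloo works on CPU-only machines)
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(3):
        loss = model(torch.randn(4, 10)).sum()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces the gradients here, no manual averaging needed
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    run(rank=0, world_size=1)  # single process just to show the wiring
```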