
yea I just realized that you would also need to specify different subnets, etc… not sure how easy it is. But it would be very valuable, on-demand GPU instances are so hard to spin up nowadays in AWS
That would be awesome
Sure, just sent you a screenshot in PM
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the AWS creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
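For context, this is roughly how I pass the creds via env vars (a minimal sketch, values redacted; boto3 picks the standard AWS_* variables up on its own, this is not the actual trains.conf check):
    import os
    import boto3

    # standard AWS env vars, read automatically by boto3/botocore
    os.environ["AWS_ACCESS_KEY_ID"] = "..."         # redacted
    os.environ["AWS_SECRET_ACCESS_KEY"] = "..."     # redacted
    os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"  # example region

    s3 = boto3.client("s3")  # no explicit creds passed, botocore resolves them from the env
    s3.list_buckets()        # simple call to exercise the creds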
I now have a different question: when installing torch from wheel files, I am guaranteed to have the corresponding cuda library and cudnn bundled together, right?
with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader)
AppetizingMouse58 Yes and yes
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
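For reference, this is roughly what the change looks like (a sketch, not my actual training code; the point is that the backend has to be set before pyplot is imported):
    import matplotlib
    matplotlib.use("agg")             # force the non-interactive Agg backend
    import matplotlib.pyplot as plt   # imported only after the backend is set

    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
    plt.close(fig)                    # close figures explicitly so they can be garbage collected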
But I would need to reindex everything, right? Is that an expensive operation?
But that was too complicated, I found an easier approach
Interesting! Something like that would be cool yes! I just realized that custom plugins in Mattermost are written in Go, could be a good hackday for me to learn Go
AppetizingMouse58 the events_plot.json template misses the plot_len declaration, could you please give me the definition of this field? (reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed")
Actually it was not related to clearml, the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
-> wrong numpy version
AgitatedDove14 , my "uncommitted changes" ends with...
    if __name__ == "__main__":
        task = clearml.Task.get_task(clearml.config.get_remote_task_id())
        task.connect(config)
        run()
    from clearml import Task
    Task.init()
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
Yes, but I am not certain how: I just deleted the /data folder and restarted the server
Hi AgitatedDove14 , thanks for the answer! I will try adding multiprocessing_context='forkserver' to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any performance drop using forkserver? If yes, did you test the variant suggested i...
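For reference, the change I have in mind is roughly this (a sketch; my_dataset here stands in for my real dataset):
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # placeholder dataset standing in for the real one
    my_dataset = TensorDataset(torch.randn(128, 3), torch.randn(128, 1))

    loader = DataLoader(
        my_dataset,
        batch_size=16,
        num_workers=4,
        multiprocessing_context="forkserver",  # use forkserver instead of the default fork start method
    )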
When installed with http://get.docker.com , it works
same as the first one described
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed since wheels already contain all cuda/cudnn libraries required by torch
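A quick way to double-check this, assuming torch was installed from one of the cuXXX wheels:
    import torch

    print(torch.__version__)               # wheel build string
    print(torch.version.cuda)              # CUDA runtime version bundled with the wheel
    print(torch.backends.cudnn.version())  # cuDNN version shipped with the wheel
    print(torch.cuda.is_available())       # True as long as the host driver is recent enough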
Ok, I am asking because I often see the autoscaler starting more instances than the number of experiments in the queues, so I guess I just need to increase the max_spin_up_time_min
So I need to merge the small configuration files to build the bigger one
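Something like this is what I mean (a sketch, assuming the small files are plain YAML and later files override earlier ones; the file names are just examples):
    import yaml

    def merge_dicts(base, override):
        # recursively merge override into base, override wins on conflicts
        merged = dict(base)
        for key, value in override.items():
            if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
                merged[key] = merge_dicts(merged[key], value)
            else:
                merged[key] = value
        return merged

    def build_config(paths):
        config = {}
        for path in paths:
            with open(path, "r") as stream:
                config = merge_dicts(config, yaml.load(stream, Loader=yaml.FullLoader) or {})
        return config

    # config = build_config(["base.yaml", "model.yaml", "local.yaml"])  # example file names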
Any chance this is reproducible ?
Unfortunately not at the moment, I could not find a reproducible scenario. If I clone a task that was stuck and start it, it might not be stuck
How many processes do you see running (i.e. ps -Af | grep python) ?
I will check that when the next one is blocked
What is the training framework? is it multiprocess ? how are you launching the process itself? is it Linux OS? is it running inside a specific container ?
I train with p...
The task requires this service, so the task starts it on the machine - Then I want to make sure the service is closed by the task upon completion/failure/abortion
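What I am after is roughly this pattern (a sketch; start_service, stop_service and run_task are placeholders for whatever the actual service and task expose):
    def start_service():
        # placeholder: start the external service the task depends on
        ...

    def stop_service():
        # placeholder: shut the service down cleanly
        ...

    def run_task():
        # placeholder: the actual task body
        ...

    start_service()
    try:
        run_task()
    finally:
        stop_service()  # runs on completion, on failure, and on most aborts (not on SIGKILL)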
(by console you mean in the dashboard right? or the terminal?)