Hi SuccessfulKoala55, yes it's for the same host/bucket - I'll try with a different browser
what about the stacktrace of the error: Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]?
Very nice! Maybe we could have this option as a toggle setting in the user profile page, so that by default we keep the current behaviour, and users like me can change it 🙂 wdyt?
because I cannot locate libcudart or because cudnn_version = 0?
UnevenDolphin73 , task = clearml.Task.get_task(clearml.config.get_remote_task_id()) worked, thanks
AgitatedDove14 In my case I'd rather have it under the "Artifacts" tab because it is a big json file
(by console you mean in the dashboard right? or the terminal?)
To be fully transparent, I did a manual reindexing of the whole ES DB one year ago after it ran out of space; at that point I might have changed the mapping to strict, but I am not sure. Could you please confirm that the mapping is correct?
Now I am trying to restart the cluster with docker-compose while specifying the last volume - how can I do that?
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it's not possible to change this value after the index creation, is it true?
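If it indeed can't be changed in place, here is roughly what I'd try instead - just a sketch, the index names are placeholders and I'm assuming the ES REST API is reachable on localhost:9200:

import requests

ES = "http://localhost:9200"  # assuming the ClearML Elasticsearch container is exposed here

# Create a new index with two primary shards (shard count can only be set at creation time)
requests.put(ES + "/events-new", json={"settings": {"number_of_shards": 2}})

# Copy everything from the existing large index into the new one
requests.post(ES + "/_reindex", json={
    "source": {"index": "events-old"},
    "dest": {"index": "events-new"},
})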
I am still confused though - on the Get Started page of the PyTorch website, when choosing "conda", the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (cuda runtime)?
So it looks like it tries to register a batch of 500 documents
I fixed it, will push a fix in pytorch-ignite 🙂
In the Execution tab I see the old commit; in the logs I see an empty branch and the old commit
very cool, good to know, thanks SuccessfulKoala55 🙂
Thanks SuccessfulKoala55 ! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer and scheduler every epoch
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME
3. Have a service that periodically pulls failed experiments from the queue with the tag TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param
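The watcher in step 2 was roughly along these lines - a minimal sketch, assuming IMDSv1 and the EC2 spot interruption metadata endpoint; TO_RESUME is just the tag name I picked:

import threading
import time

import requests
from clearml import Task

# EC2 spot interruption notice endpoint (returns 404 until an interruption is scheduled)
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(poll_seconds=30):
    task = Task.current_task()
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
        except requests.RequestException:
            resp = None
        if resp is not None and resp.status_code == 200:
            # Instance is marked for stop: tag the task so the resume service picks it up
            task.add_tags(["TO_RESUME"])
            return
        time.sleep(poll_seconds)

# Run alongside training without blocking it
threading.Thread(target=watch_for_interruption, daemon=True).start()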
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
AppetizingMouse58 Yes and yes
I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39
Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?
Very cool! "Run two train-agent daemons, one per GPU on the same machine, with default Nvidia/CUDA Docker" - this is close to my use case, I just would like to run these two daemons without docker, would that be possible? I should just remove the --docker nvidia/cuda param, right?
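For reference, what I have in mind is simply starting the two daemons without the docker flag, something like this (the queue name is just an example, assuming I'm reading the CLI options correctly):

clearml-agent daemon --gpus 0 --queue default --detached
clearml-agent daemon --gpus 1 --queue default --detached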
yes, the new project is the one where I changed the layout and that gets reset when I move an experiment there