Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version the agent uses to decide which torch wheel to download
Regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it's currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...
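For what it's worth, here is a rough, untested sketch of how one might read the driver-side CUDA version from nvidia-smi output instead of relying on nvcc; the regex on the "CUDA Version" header field is an assumption about the output format (it only appears on reasonably recent drivers):

```python
import re
import subprocess

def cuda_version_from_nvidia_smi():
    """Best-effort parse of the 'CUDA Version: X.Y' field printed by nvidia-smi.

    Returns the version string (e.g. '11.4'), or None if nvidia-smi is missing
    or its output does not contain the field (older drivers).
    """
    try:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    match = re.search(r"CUDA Version:\s*([\d.]+)", out)
    return match.group(1) if match else None

print(cuda_version_from_nvidia_smi())
```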
Ok, but when nvcc is not ava...
and with this setup I can use the GPU without any problem, meaning that the wheel does contain the cuda runtime
yes - what happens in the case of installation with pip wheel files?
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent uses the output from nvcc (2) before checking the nvidia driver version (1)
Thanks for clarifying! Maybe this could be clarified in the agent logs of the experiments with something like the following?
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
and the agent says agent.cudnn_version = 0
But I can do:
```
$ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.backends.cudnn.version()
8005
```
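If I'm not mistaken, the runtime version bundled with the wheel can also be read from torch itself (as opposed to the driver version reported by nvidia-smi); something like:

```python
import torch

# CUDA runtime/toolkit version the installed wheel was built against
print(torch.version.cuda)               # e.g. '10.2'
# cuDNN version bundled with the wheel
print(torch.backends.cudnn.version())   # e.g. 8005
```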
because I cannot locate libcudart or because cudnn_version = 0?
Ok, this I cannot locate
From my experience, I only installed cuda drivers on my machines. I didn't use conda to install torch or cudatoolkit, I just let clearml-agent download the torch wheel file and install it
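For the record, a quick way to convince yourself the pip wheel bundles the CUDA runtime is to look for libcudart inside the installed package; the lib folder layout below is an assumption (Linux wheels, may differ between torch versions):

```python
from pathlib import Path
import torch

# pip wheels typically ship their CUDA libraries next to the package,
# e.g. .../site-packages/torch/lib/libcudart-*.so*
lib_dir = Path(torch.__file__).parent / "lib"
bundled = sorted(p.name for p in lib_dir.glob("libcudart*"))
print(bundled or "no bundled libcudart found")
```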
AppetizingMouse58 After some thought, we decided to install 0.16 from scratch, with no data migration, because we believe this was an edge case not worth spending effort on. Thank you very much for your help there, much appreciated. You guys rock!
sure, will be happy to debug that
Thanks! Unfortunately still not working, here is the log file:
I should also rename the /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool, right?
CostlyOstrich36, actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
no, at least not in clearml-server version 1.1.1-135 • 1.1.1 • 2.14
no, I think I could reproduce with multiple queues
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it's not possible to change this value after the index creation, is that true?
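In case it helps future readers, my understanding from that SO answer is that the shard count cannot be changed in place; the usual workaround is to create a new index with the desired number_of_shards and reindex into it. Something along these lines - an untested sketch against the 7.x Python elasticsearch client, with placeholder index names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Create a new index with the desired shard count
es.indices.create(
    index="events-log-new",  # placeholder name
    body={"settings": {"index": {"number_of_shards": 2}}},
)

# 2. Copy documents from the old (single-shard) index into the new one
es.reindex(
    body={
        "source": {"index": "events-log"},   # placeholder: the large index
        "dest": {"index": "events-log-new"},
    },
    wait_for_completion=True,
)
```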
Hi CostlyOstrich36, one more observation: it looks like when I don't open the experiment in the webUI before it is finished, then I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don't see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or rarely) afterwards. Is that possible?
That would be amazing!
Yes, it did spin up two instances for the same task
btw SuccessfulKoala55 the parameter is not documented in https://allegro.ai/clearml/docs/docs/references/clearml_ref.html#sdk-development-worker
Hi AgitatedDove14, that's super exciting news!
Regarding the two outstanding points:
In my case, I'd maintain a client python package that takes care of the pre/post processing of each request (rough sketch below), so that I only send the raw data to the inference service and post-process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happen on the server. What is challenging in this context? Defining how t...
to pass secrets to each experiment
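To illustrate the client-package idea from the first point above, here is a minimal, hypothetical sketch; the endpoint URL, payload schema and the preprocess/postprocess functions are all made up for the example:

```python
import requests
import numpy as np

INFERENCE_URL = "http://my-inference-service/predict"  # hypothetical endpoint

def preprocess(raw_sample: dict) -> list:
    # e.g. normalize the raw input before sending it to the service
    return (np.asarray(raw_sample["values"], dtype=float) / 255.0).tolist()

def postprocess(raw_output: dict) -> str:
    # e.g. map the raw model scores back to a human-readable label
    scores = raw_output["scores"]
    return raw_output["labels"][int(np.argmax(scores))]

def predict(raw_sample: dict) -> str:
    """Client-side wrapper: pre/post processing stays in this package,
    the inference service only ever sees the already-processed payload."""
    payload = {"inputs": preprocess(raw_sample)}
    response = requests.post(INFERENCE_URL, json=payload, timeout=10)
    response.raise_for_status()
    return postprocess(response.json())
```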