So you want these two on two different graphs?
PompousBeetle71 could you try trains-agent 0.15.0rc0? What OS are you using? Are you running in docker mode? If so, what's the docker version?
But functionality is working
Awesome, I will wait with the merge until it's tested internally.
There is a release coming out after the weekend; once it is out I expect we will merge it.
sdk.conf will add it to the default loaded values (as I think you deduced).
Can you copy-paste the sdk.conf here? (Maybe something is missing there?)
Sadly no 😞
(I mean you could quickly write a reader for TB and report it, but it is not built into the SDK)
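Something along these lines could work as a quick reader (a sketch only, not part of the SDK; it assumes the tensorboard package is installed and ./tb_logs is a hypothetical directory with your event files):
```python
# A quick sketch: read TensorBoard event files and re-report their scalars to ClearML.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from clearml import Task

task = Task.init(project_name="examples", task_name="tb reader")  # placeholder names
logger = task.get_logger()

ea = EventAccumulator("./tb_logs")  # hypothetical path to the TB event files
ea.Reload()
for tag in ea.Tags().get("scalars", []):
    for event in ea.Scalars(tag):
        logger.report_scalar(title=tag, series=tag, value=event.value, iteration=event.step)
```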
Okay, let me check the code and come back with follow-up questions
I'm glad you were able to solve the issue!
WackyRabbit7 I could not reproduce it, what did you pass in "GOOGLE_APPLICATION_CREDENTIALS" ?
Where would I put these credentials? I don't want to expose them in the logs as an environment variable or hard-code them.
Hi GleamingGrasshopper63
So basically you need a vault to store those credentials...
Unfortunately the open-source version does not contain vault support, but the paid tiers (Scale/Enterprise) do.
There you can have an environment variable defined in the vault; each time the agent runs your code, it will pull the value from the vault and set it on your process. wdyt?
Just verifying, the Pod does get allocated 2 GPUs, correct?
What do you have under the "script path" in the Task?
Should be under Profile -> Workspace (Configuration Vault)
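A vault entry uses the same HOCON format as clearml.conf, so (a sketch only, the exact section layout may differ) it could look something like:
```
# hedged example of a Configuration Vault entry; the agent applies it
# to the process it spins up, so the variable never appears in your code
environment {
  GOOGLE_APPLICATION_CREDENTIALS: "/path/inside/the/container/creds.json"
}
```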
PompousBeetle71 oh no 😞
Okay, this is a bit drastic, but let's see if it helps.
In your trains.conf, add the following section:
```
loggers {
  loggers {
    trains {
      level: ERROR
    }
  }
}
```
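If you prefer doing it from code, the SDK loggers are standard Python loggers, so something like this sketch should have the same effect:
```python
import logging

# silence everything below ERROR coming from the trains loggers
logging.getLogger("trains").setLevel(logging.ERROR)
```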
You can check the keras example: run it twice; the second time it will continue from the previous checkpoint, and you will have both an input and an output model.
https://github.com/allegroai/clearml/blob/master/examples/frameworks/keras/keras_tensorboard.py
Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server
This is fully supported 🙂
You can think of queues as a way to simplify resource allocation for users (you can do more than that, but let's start simple).
Basically you can create a queue per type of GPU; for example, a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin up the agents, you attach each agent to the "correct" queue for its type of machine.
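For example (illustrative commands only, using the hypothetical queue names from above), spinning two agents on one multi-GPU box could look like:
```
clearml-agent daemon --queue on_prem_1gpu --gpus 0 --detached
clearml-agent daemon --queue on_prem_2gpus --gpus 1,2 --detached
```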
Int...
clearml - WARNING - Could not retrieve remote configuration named 'hyperparams'
What's the clearml-server version you are working with ?
In both logs (even in the single GPU log) it seems you "see" two GPUs, is that correct?
GPU 0,1 Tesla V100-SXM2-32GB (arch=7.0)
Last question: this is a relatively old clearml version (0.17.5); can you test with the latest version (1.1.1)?
PompousBeetle71, these are CUDA versions; I'm looking for the NVIDIA driver version, for example 440.xx or 418.xx.
The reason is, we set an OS environment variable for the driver, and I remember that old drivers did not support it. Basically they do not support NVIDIA_VISIBLE_DEVICES=all, so I'm trying to see if that's the case; then we could add a fix.
GreasyPenguin14
Is it possible in ClearML to have a main task (the complete cross validation) and subtasks (one for each fold)?
You mean to see it nested in the UI? Or auto-logged by the code?
GreasyPenguin14 could you test with the matplotlib example? (I cannot reproduce it, and it seems like something to do with PyCharm and the matplotlib backend)
https://github.com/allegroai/clearml/blob/master/examples/frameworks/matplotlib/matplotlib_example.py
Oh that makes sense.
So now you can just get the models as a dict as well (basically clearml allows you to access them both as a list, so it is easy to get the last one created, and as a dict, so you can match the filenames)
This one will get the list of models:
```
print(task.models["output"].keys())
```
Now you can just pick the best one:
```
model = task.models["output"]["epoch13-..."]
my_model_file = model.get_local_copy()
```
Sure, run:
```
clearml-agent init
```
It is a CLI wizard to configure the initial configuration file.
I prefer serving my models in-house and only performing the monitoring via ClearML.
clearml-serving is an infrastructure for you to run models 🙂
To clarify, clearml-serving is running on your end (meaning this is not SaaS where a 3rd party is running the model)
By the way, I saw there is a project dashboard app which might support the visualization I am looking for. Is it suitable for such use case?
Hmm interesting, actually it might, it does collect metrics over time ...
Thanks BattyLizard6 , fix is on its way 🙂
Yes, that makes sense. I think what happened is that one of the processes completed the Task (i.e. closed it) before the others did, and so they threw an exception.
I switched to have all tasks in a separate process
I think that's probably the best (performance wise as well), nice!
@<1523710674990010368:profile|GreasyPenguin14> If I understand correctly you can use tokens as user/pass (it's basically the same interface from the git client perspective, meaning from ClearML's side):
```
git_user = gitlab-ci-token
git_pass = <the_actual_token>
```
WDYT?
Hi RipeGoose2
There is no need for any TrainsLogger in pytorch lightning, as they switched to TensorBoard logging by default, and everything passed there we automagically catch.
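So something as minimal as this sketch should be auto-logged (MyModel here is your hypothetical LightningModule, and the project/task names are placeholders):
```python
from clearml import Task
import pytorch_lightning as pl

# Task.init first, then everything Lightning sends to its default
# TensorBoard logger is picked up automagically
task = Task.init(project_name="examples", task_name="lightning run")
trainer = pl.Trainer(max_epochs=3)
trainer.fit(MyModel())  # MyModel is your hypothetical LightningModule
```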
What do you think is missing? or can be improved ?
Hi @<1724960475575226368:profile|GloriousKoala29>
Is there a way to aggregate the results, such as defining an iteration as the accuracy of 100 samples
Hmm, I'm assuming what you actually want is to store it with the actual input/output and a score, is that correct?
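If it's just the windowed aggregation, a minimal sketch could be (per_sample_correct is a hypothetical iterable of 0/1 results, and the names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="eval")
logger = task.get_logger()

window = []
for i, correct in enumerate(per_sample_correct):
    window.append(correct)
    if len(window) == 100:
        # one reported "iteration" == accuracy over the last 100 samples
        logger.report_scalar(title="accuracy", series="val",
                             value=sum(window) / len(window), iteration=i // 100)
        window = []
```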
Hi @<1631102016807768064:profile|ZanySealion18>
ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.
What do you mean by "does not pick up"? Is it that the container is up but was not executed with --gpus, so there is no GPU access?
I don't know whether you have access to the backend,
Creepy, no I do not 🙂
I can't make anything appear in the console part of the ui
```
clearml_task.logger.report_text("some text")
```
should work