Yes, I tried to run steps 1, 2, 3, and 4 in order, but got stuck at step 3
SuccessfulKoala55 Yes, I am using the --docker flag.
You are right about the Keyring. Once I make sure credentials are stored in a secure way, it works as expected. Thanks :)
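For anyone who finds this later: a minimal sketch of stashing the credentials in the OS keyring with the third-party keyring package (service and user names are placeholders, and this is just one way to do it, not something built into trains):
```python
import keyring

# store the token once in the OS keyring instead of a plain-text file
keyring.set_password("github", "my-username", "my-token")  # placeholders

# read it back wherever the credential is needed
token = keyring.get_password("github", "my-username")
```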
The Docker container in step 3 does not run because of the incompatibility.
That makes sense. The configuration file is located at ~/trains.conf, which I believe is the default location.
No, I can't see my username printed out in the dump
Yes, I am using Pool. Here is what I think is happening: clearml launches a subprocess, which I assume is a daemonic process. That process in turn launches a subprocess for training, which causes the error I mentioned.
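For context, the error is reproducible without clearml at all; a minimal sketch with plain multiprocessing:
```python
# Pool workers are daemonic, and daemonic processes may not spawn children,
# so nesting a Process inside a Pool worker raises
# "AssertionError: daemonic processes are not allowed to have children".
import multiprocessing as mp

def train():
    print("training...")

def worker(_):
    p = mp.Process(target=train)
    p.start()  # raises the "daemonic processes ..." AssertionError
    p.join()

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        pool.map(worker, range(2))
```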
Hi AgitatedDove14 Thanks for checking. I would like to compare several experiments (plots, hyperparams, etc.), so it would be nice to do it in the UI. Right now I have to search through the long list. With Python, I can only do a few of the things that I intend to do. Is this something that might be added in the future?
What would be the query? Are you reporting 100+ different scalars?
At the moment I am not reporting any scalars related to inference; I'm only reporting data related to training a model. But I would like to report records that result from an inference process. For example, a record would contain key_1, key_2, datetime, pred_1, pred_2 ... pred_n. I would have about 20 scalars if each of these fields is reported as a scalar.
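If it helps, this is roughly what I mean, a sketch using clearml's scalar reporting (project/task names and the record fields are placeholders):
```python
from clearml import Task

task = Task.init(project_name="inference", task_name="record-report")  # hypothetical names
logger = task.get_logger()

# one placeholder inference record; real ones would carry ~20 fields
record = {"pred_1": 0.91, "pred_2": 0.07}
for name, value in record.items():
    # one scalar series per field; iteration would index the record number
    logger.report_scalar(title="inference", series=name, value=value, iteration=0)
```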
The query can be a simple filtering criterion matching some keys ...
Also, it might be better (although not necessary) to have a separate collection for storing inference results, for better organization.
fatal: could not read Username for ' ': terminal prompts disabled
error: Could not fetch origin
Why is trains-agent trying to read from a terminal prompt instead of trains.conf?
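For reference, this is the kind of trains.conf section I'd expect it to read credentials from, a sketch with placeholder values (agent.git_user / agent.git_pass are the keys as I understand the agent's config):
```
agent {
    git_user: "my-username"   # placeholder
    git_pass: "my-token"      # placeholder
}
```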
Ok, I will look into artifacts. However, I will probably need high-performance query functionality. For example, say I have a model and hundreds of thousands of inference records for that model; I want to be able to query them efficiently. My guess is that wouldn't be possible with artifacts, but it should be possible with Task.get_tasks.
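Something along these lines is what I have in mind (project name and the tag filter are placeholders, and the task_filter keys are my assumption of the server's query fields):
```python
from clearml import Task

# fetch matching tasks server-side instead of scanning everything locally
tasks = Task.get_tasks(
    project_name="inference",              # hypothetical project
    task_filter={"tags": ["inference"]},   # assumed filter key
)
for t in tasks:
    print(t.id, t.name)
```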
Hi AppetizingMouse58
Yes, I tried to perform steps 3-10; however, step 3 raised an error because the Mongo data files were incompatible between 3.6 and >4.0
I come across many small questions like these which may have been answered earlier, but they are hard to find in Slack messages. Is it better to post such questions on Stack Overflow so they benefit everybody? I might post the link here.
I was getting the error at step 3
Hi AgitatedDove14 Thanks, I'll check these out.
What is the exact use case you have in mind?
I want to store some additional data that is not relevant to training a model. For example, store inference results, explanations, etc., and then use them in a different process. I currently use a separate database for this.
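For example, a rough sketch of storing one such record as a task artifact (the artifact name and payload are placeholders):
```python
from clearml import Task

task = Task.init(project_name="inference", task_name="explanations")  # hypothetical names

# placeholder payload; any pickleable object works as an artifact
results = {"pred_1": 0.91, "explanation": "feature_3 dominated"}
task.upload_artifact(name="inference_results", artifact_object=results)
```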
Btw, I had been busy with another project and hadn't logged in here for some time. I see that you guys have made a lot of progress in the last two months! I'm excited to di...
SuccessfulKoala55
For security reasons I don't want to have my password written out in a file. I'm trying to use a personal access token (PAT) from GitHub ( https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/creating-a-personal-access-token ) but I get an authentication error. Is there an issue using a PAT?
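One thing I'm considering, a sketch of passing the PAT via environment variables so it never lands in trains.conf. I'm assuming the agent honors TRAINS_AGENT_GIT_USER / TRAINS_AGENT_GIT_PASS overrides; with a PAT, the token goes in the password field:
```python
import os
import subprocess

env = dict(
    os.environ,
    TRAINS_AGENT_GIT_USER="my-username",              # placeholder git login
    TRAINS_AGENT_GIT_PASS="<personal-access-token>",  # placeholder PAT
)
# spin up the agent with the credentials injected only into its environment
subprocess.Popen(["trains-agent", "daemon", "--queue", "default"], env=env)
```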
The second subprocess is by design. It becomes the primary process when clearml does not use multiprocessing. I hope I'm not confusing you further.
(Do notice that even though you can spin up two agents on the same GPU, the NVIDIA drivers cannot share allocated GPU memory, so if one Task consumes too much memory, the other will not have enough free GPU memory to run.)
Basically the same restriction as manually launching two processes using the same GPU
That makes sense. Currently, I use Python multiprocessing to launch multiple experiments on the same GPU device. I'm guessing using trains-agent will be similar
Steps 1 and 2 on this https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration/ say to back up /opt/clearml/data/mongo and uncompress it into /opt/clearml/data/mongo_4. Isn't that just copying the old data files?
You will need to have multiple trains-agents, but they will be sharing the same queue (i.e. pulling jobs from the same queue the HPO process is pushing to)
Make sense?
Hmm. So say I have a parameter NUM_PARALLEL_EXECUTIONS, I can programmatically launch that many trains-agents for every optimization run?!
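i.e. something like this rough sketch, assuming the usual trains-agent daemon --queue invocation ("hpo" is a placeholder queue name):
```python
# Spin up NUM_PARALLEL_EXECUTIONS agent daemons, all pulling from the same
# queue the HPO process pushes to.
import subprocess

NUM_PARALLEL_EXECUTIONS = 2  # hypothetical parameter

agents = [
    subprocess.Popen(["trains-agent", "daemon", "--queue", "hpo"])
    for _ in range(NUM_PARALLEL_EXECUTIONS)
]
for agent in agents:
    agent.wait()  # block until the agents exit
```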