I was getting the error in step number 3
Yes, I tried to run steps 1, 2, 3 and 4 in order but got stuck at step 3
I posted the question on Stack Overflow with the answer :) https://stackoverflow.com/questions/64636294/trains-reusing-previous-task-id/64636297#64636297
I had to manually create a dump of the MongoDB data and import it into 4.4. I was just referring to adding a note to the documentation for other users.
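For the documentation note, this is roughly what I ended up doing - just a sketch, the container name and paths are placeholders for whatever your trains-server deployment uses:
```python
import subprocess

# Dump all databases from the old MongoDB container
subprocess.run(
    ["docker", "exec", "trains-mongo", "mongodump", "--out", "/backup/mongo_dump"],
    check=True,
)

# ...then switch the deployment to the MongoDB 4.4 image and restore the dump
subprocess.run(
    ["docker", "exec", "trains-mongo", "mongorestore", "/backup/mongo_dump"],
    check=True,
)
```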
Yes, I am using Pool. Here is what I think is happening: clearml launches a subprocess, which I assume is a daemonic process. That process in turn launches a subprocess for training, which causes the error I mentioned
2. Interesting error, maybe we can revert to "thread mode" if running under a daemon. (I have to admit, I'm not sure why Python has this limitation, let me check it...)
Yes, I'm not sure either. I have banged my head against the wall trying to have multiple levels of subprocesses, but it gets too complicated with Python. Let me know what you find out
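Here is a minimal reproduction of the limitation outside of clearml entirely, just plain multiprocessing, so you can see the exact error I'm hitting:
```python
import multiprocessing as mp


def worker(_):
    # The training launch amounts to starting another subprocess from the Pool worker.
    # Pool workers are daemonic, and daemonic processes may not have children, so this
    # raises: AssertionError: daemonic processes are not allowed to have children
    child = mp.Process(target=print, args=("training",))
    child.start()
    child.join()


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        pool.map(worker, range(2))
```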
(Do notice that even though you can spin up two agents on the same GPU, the NVIDIA drivers cannot share allocated GPU memory, so if one Task consumes too much memory the other will not have enough free GPU memory to run)
Basically the same restriction as manually launching two processes using the same GPU
That makes sense. Currently, I use Python multiprocessing to launch multiple experiments on the same GPU device. I'm guessing using trains-agent will be similar
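Roughly what my current setup looks like (train_one is just a placeholder for the real training entry point):
```python
import multiprocessing as mp


def train_one(config):
    # Placeholder for the actual training code; the real version builds the model
    # and trains it on the single shared GPU.
    print(f"training with lr={config['lr']}")


if __name__ == "__main__":
    configs = [{"lr": 0.01}, {"lr": 0.001}, {"lr": 0.0001}]
    procs = [mp.Process(target=train_one, args=(cfg,)) for cfg in configs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```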
I cannot execute step 4 because I can't get past step 3. Does that make sense?
The docker container in step 3 does not run because of the incompatibility
Hi AgitatedDove14, I'll wait for you to reply on GitHub before I add my comments to these points.
Ok, so Git credentials are present at two locations: 1) outside the agent config and 2) inside it. I updated the credentials at both locations and now I'm seeing agent.git_user = <username> in the dump, but I still have the same issue.
```
# Set GIT user/pass credentials
# leave blank for GIT SSH credentials ...
```
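For reference, this is roughly how I filled in that part of the agent section in trains.conf (the values are placeholders, of course):
```
agent {
    # Set GIT user/pass credentials
    # leave blank for GIT SSH credentials
    git_user="my-git-username"
    git_pass="my-git-password-or-token"
}
```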
Hmm, ok. Yes that would make it easier.
From an architectural point of view - say I know I'll be running the experiment on a trains-agent. When I initialize and execute the experiment locally, how hard would it be to instead send all the execution details and environment to the trains-agent and run it directly there? Can the configuration be packaged when we initialize the Task? Does the question make sense?
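To make the question concrete, this is the kind of flow I'm imagining - execute_remotely is just my guess at what such an API might look like, and the queue name is made up:
```python
from trains import Task

# Initialize locally so the code, git diff and environment are captured as usual
task = Task.init(project_name="examples", task_name="remote_run")

# Hypothetical call: stop executing locally and enqueue this task for a trains-agent
task.execute_remotely(queue_name="default")

# Anything from here on would only run on the agent machine
print("running on the agent")
```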
That makes sense. The configuration file is located at ~/trains.conf, which I believe is the default location.
No, I can't see my username printed out in the dump
Ok, I will look into artifacts. However, I will probably need high-performance query functionality. For example, say I have a model and hundreds of thousands of inference records for that model. I want to be able to query those records efficiently. My guess is that wouldn't be possible with artifacts, but it should be possible with Task.get_tasks.
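To make it concrete, this is the kind of query I have in mind (the project, task and artifact names are made up, and I'm assuming each task stores its inference records as an artifact):
```python
from trains import Task

# Fetch all tasks for a (made-up) project that logged inference records
tasks = Task.get_tasks(project_name="inference_logs", task_name="model_v1")

for t in tasks:
    # Assumed artifact name; Artifact.get() downloads and deserializes the object
    records = t.artifacts["inference_records"].get()
    print(t.id, len(records))
```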
I come across many small questions like these which may have been answered earlier, but they are hard to find in Slack messages. Is it better to post such questions on Stack Overflow so they benefit everybody? I might post the link here.
There were some complications during the upgrade, so I had to resort to the manual process.
I have now been able to upgrade by dumping the MongoDB data and restoring it independently.
SuccessfulKoala55 Yes, I am using the --docker flag.
You are right about the Keyring. Once I make sure credentials are stored in a secure way, it works as expected. Thanks :)
I'm using docker to run the experiment. Could it be that the config in the docker container doesn't have the git credentials?