
AgitatedDove14 These were the logger names I can see locally when running the code; they might differ when running remotely.
['trains.utilities.pyhocon.config_parser', 'trains.utilities.pyhocon', 'trains.utilities', 'trains', 'trains.config', 'trains.storage', 'trains.metrics', 'trains.Repository Detection']
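For reference, a minimal sketch of adjusting their verbosity with the standard logging module (assuming the same logger names apply remotely):

```python
import logging

# raise the level for every trains logger seen locally; names may differ remotely
for name in ['trains.utilities.pyhocon.config_parser', 'trains.utilities.pyhocon',
             'trains.utilities', 'trains', 'trains.config', 'trains.storage',
             'trains.metrics', 'trains.Repository Detection']:
    logging.getLogger(name).setLevel(logging.WARNING)
```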
Regarding reproducing it: have a long data-processing step after initializing the task and before setting the input model/output model.
TimelyPenguin76 yes, both 0.15.1
the trains version is still 0.14; it will take time to switch it
I thought about changing to a connected dictionary, though.
AgitatedDove14 I'm using both argparse and sys.argv to start different processes, each of which interacts with a single GPU. So each process has a specific argument with a different value to differentiate between them (only the main process interacts with trains). At the moment I'm having issues getting the arguments from the processes I spawn. I'm explicitly calling python my_script.py --args...
and each process knows to interact with the other. It's a bit complicated to explain...
I created a wrapper to work like executing python -m torch.distributed.launch --nproc_per_node 2 ./my_script.py
but from my script. I do call trains.init
in the subprocesses. The actual difference between the subprocesses is supposed to be, in terms of arguments, local_rank.
That's all. It may be that I'm not distributing the model between the GPUs in an optimal way, or at least not in a way that matches your framework.
If you have an example it would be great.
AgitatedDove14 Hi, I solved that by passing the arguments injected into argparse to the spawned processes as part of their command line. The examples helped.
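Roughly what that looks like; this is only a sketch, my_script.py and spawn_workers are made-up names, and --local_rank is the per-process argument I mentioned:

```python
import subprocess
import sys

# sketch: re-pass the already-parsed arguments on each child's command line,
# adding a per-process --local_rank so every child knows which GPU it owns
def spawn_workers(num_gpus, base_args):
    procs = []
    for rank in range(num_gpus):
        cmd = [sys.executable, 'my_script.py'] + base_args + ['--local_rank', str(rank)]
        procs.append(subprocess.Popen(cmd))
    for p in procs:
        p.wait()

if __name__ == '__main__':
    # everything after the launcher's own script name is forwarded as-is
    spawn_workers(num_gpus=2, base_args=sys.argv[1:])
```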
AgitatedDove14 Well, after starting a new project it works. I guess it's a bug.
AgitatedDove14 Yes, I can. I didn't delete the previous project yet.
SteadyFox10 ModelCheckpoint is not for PyTorch I think; I couldn't find anything like it.
TimelyPenguin76 I see it in the web-app under the model.
AgitatedDove14 yes, you're right. it was 10.2 or 10.1 if I recall.
AgitatedDove14 I'm using that code in the meanwhile
```bash
### This script checks the number of GPUs and creates a list like 0,1,2...
### Then it adds '--gpus' before that list of GPUs
NUM_GPUS=$(nvidia-smi -L | wc -l)
NUM_GPUS=$(($NUM_GPUS-1))
OUT=()
if [ $NUM_GPUS -ge 0 ]
then
    for i in $(seq 0 $NUM_GPUS); do OUT+=( "$i" ); done
    echo ${OUT[*]} | tr ' ' ',' | awk '{print "--gpus "$1}'
else
    echo ""
fi
```
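For example, assuming the snippet is saved as detect_gpus.sh (a name I just made up), its output can be appended to whatever command expects a --gpus flag: python my_script.py $(bash detect_gpus.sh)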
AgitatedDove14 No, there's no reason in my case to pass an empty string; that's why I removed the type=str part.
the version of the agent (the worker that received the job) was 0.14.1
the one that created the template was 0.14.2
AgitatedDove14 Good to know! 🙂
I think it's good the way you described it (the second option).
let's call the one that has experiments an applicative project, and have an abstract/parent project, or some other name, that groups applicative projects.
AgitatedDove14 Thanks Martin, I know that. I'm just saying it's a bug.
AgitatedDove14 v0.14
AgitatedDove14 thanks, I'll check it out.
I use torch and yes, I use save so your code will catch it.
AgitatedDove14 You were right. I can get them as system tags.
I've written a class that wraps a training session and the interaction with trains, since when loading/saving the experiment I need more than just the 'model.bin'.
So I use these tags to match the specific aux files that were saved with their model.
yes, it was.
Yes, there's a use for empty strings; for example, in text generation you may generate the next word given some prefix, and that prefix may be an empty string.
TimelyPenguin76 the tag names are 'Epoch 1', 'Step 5705'
the return value of InputModel(<put a string copy from the UI with the tag id>).tags is an empty array.
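For completeness, this is roughly what I'm checking; the model id placeholder is copied from the UI, tags is the property that comes back empty, and reading system_tags is an assumption on my side that may depend on the trains version:

```python
from trains import InputModel

# model id string copied from the web UI
model = InputModel('<model id copied from the UI>')

print(model.tags)  # this is the call that returns an empty array for me
# assumption: the 'Epoch 1' / 'Step 5705' labels show up as system tags instead
print(getattr(model, 'system_tags', None))
```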
SteadyFox10 AgitatedDove14 Thanks, I really did change the name.