AgitatedDove14 I'm using that code in the meantime
` ### This script checks the number of GPUs, creates a list like 0,1,2...
### Then adds '--gpus' before that list of GPUs
NUM_GPUS=$(nvidia-smi -L | wc -l)
NUM_GPUS=$(($NUM_GPUS-1))
OUT=()
if [ $NUM_GPUS -ge 0 ]
then
for i in $(seq 0 $NUM_GPUS); do OUT+=( "$i" ); done
echo "${OUT[*]}" | tr ' ' ',' | awk '{print "--gpus "$1}'
else
echo ""
fi `
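For what it's worth, the same logic can be condensed into a single function. This is only a sketch: `gpu_flag` is a hypothetical helper name, and on a real machine the count would come from `nvidia-smi -L | wc -l` rather than being passed in by hand.

```shell
#!/bin/sh
# Build a "--gpus 0,1,...,N-1" flag from a GPU count.
# The count is passed explicitly so the function is testable without a GPU;
# on a real machine you would call: gpu_flag "$(nvidia-smi -L | wc -l)"
gpu_flag() {
    count="$1"
    [ "$count" -le 0 ] && { echo ""; return; }
    # seq -s, joins the sequence with commas directly
    echo "--gpus $(seq -s, 0 $((count - 1)))"
}

gpu_flag 2   # prints: --gpus 0,1
```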
AgitatedDove14 yes, you're right. it was 10.2 or 10.1 if I recall.
AgitatedDove14 v0.14
the solution that worked: `[logging.getLogger(name).setLevel(logging.ERROR) for name in logging.root.manager.loggerDict if "trains" in name]`
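The same pattern works for any noisy library: iterate over the root manager's logger registry and raise the level of every matching logger. A small self-contained sketch (the two `trains.*` loggers are created manually here to stand in for what the imported library would register):

```python
import logging

# Create loggers the way an imported library would
logging.getLogger("trains.storage")
logging.getLogger("trains.metrics")

# Silence every logger whose name contains "trains"
for name in list(logging.root.manager.loggerDict):
    if "trains" in name:
        logging.getLogger(name).setLevel(logging.ERROR)

print(logging.getLogger("trains.storage").level == logging.ERROR)  # True
```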
I think if there's a default value it should override the type, otherwise go with the type
AgitatedDove14 no, there's no reason in my case to pass an empty string. that's why I removed the type=str part.
yes, it was.
TimelyPenguin76 yes, both 0.15.1
AgitatedDove14 You were right. I can get them as system tags.
I've written a class that wraps a training session and the interaction with trains, as upon loading/saving the experiment I need more than just the 'model.bin'
So I use these tags to match the specific aux files that were saved with each model.
TimelyPenguin76 the tags names are 'Epoch 1', 'Step 5705'
the return value of `InputModel(<Put a string copy from the UI with the tag id>).tags` is an empty array.
SteadyFox10 AgitatedDove14 Thanks, I really did change the name.
yes, there's a use for empty strings, for example in text generation you may generate the next word given some prefix, the prefix may be an empty string.
I thought about changing to a connected dictionary though.
AgitatedDove14 When the default is None I expect the default value to be None even if the type is str. But I'll use your recommendation 🙂
SteadyFox10 ModelCheckpoint is not for pytorch I think, couldn't find anything like it.
AgitatedDove14 I can't try the new agent at the moment. The OS is Ubuntu 18.04, more specifically amazon/Deep Learning Base AMI (Ubuntu 18.04) Version 22.0, and no docker; running directly on the machine.
AgitatedDove14 Drastic indeed, I believe I will lose all the trains logs that way. In that case I prefer to keep the redundant logs.
If you find a more specific solution I'd love to know what it is 🙂
I use torch and yes, I use save so your code will catch it.
AgitatedDove14
These were the loggers names I can see locally running the code, it might differ running remotely.
['trains.utilities.pyhocon.config_parser', 'trains.utilities.pyhocon', 'trains.utilities', 'trains', 'trains.config', 'trains.storage', 'trains.metrics', 'trains.Repository Detection']
Regarding reproducing it: have a long data-processing step after initializing the task and before setting the input model/output model.
AgitatedDove14 The question is whether it's done post-experiment or not.
After you've conducted experiments for a few projects and you want to organize them, our way of thinking works.
If you want subversions as you go, and the experiments become conceptually different enough to require a different project, you're doing something not very organized. In that case the other option would be better, but that's not my style of work.
I created a wrapper that works like executing `python -m torch.distributed.launch --nproc_per_node 2 ./my_script.py`, but from my script. I do call trains.init in the subprocesses; the only actual difference between the subprocesses, in terms of arguments, is supposed to be local_rank. It may also be that I'm not distributing the model between the GPUs in an optimal way, or at least not in a way that matches your framework.
If you have an example it would be great.
AgitatedDove14 Hi, so I solved that by passing the arguments injected into argparse to the created processes as part of the command line. The examples helped.
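A minimal sketch of that approach, under stated assumptions: `build_launch_cmds` is a hypothetical helper, not the actual code. It rebuilds the command line from the parsed argparse namespace and appends a distinct `--local_rank` per subprocess:

```python
import sys

def build_launch_cmds(script, parsed_args, nproc):
    """Rebuild a command line per subprocess from parsed argparse values,
    varying only --local_rank (hypothetical helper, for illustration)."""
    base = [sys.executable, script]
    for key, value in vars(parsed_args).items():
        base += ["--" + key, str(value)]
    return [base + ["--local_rank", str(rank)] for rank in range(nproc)]

# Usage: each command could then be started with subprocess.Popen(cmd)
import argparse
ns = argparse.Namespace(lr=0.001, epochs=3)
cmds = build_launch_cmds("my_script.py", ns, nproc=2)
```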
AgitatedDove14 I've tried the drastic measure suggested above, as I had a 1 GB log file filled with `trains.frameworks - WARNING - Could not retrieve model location, skipping auto model logging`.
It didn't work :S
AgitatedDove14 Thanks Martin, I know that. I'm just saying it's a bug.
AgitatedDove14
I think excluding arguments from the argparser is a good idea.
Regarding the other parameters, such as the working directory and script path: I just want to automate it, as when running the script from my local machine to create the "template" of the experiment it gets values that won't work when running in the worker. I just thought it could be automated from the code.
I've solved the first part by importing trains after parsing the arguments. Still not sure about the second part of my question.