Yeah, I thought to use an artifact. I wondered if I could avoid using it, or on the other hand use only it, just to define "the model" as a folder.
Thanks.
I actually tried to print logging.getLogger("trains.frameworks").level
and it was ERROR as expected, so I'm not quite sure that's the problem. Next I thought to patch your functions.
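For reference, this is roughly how I checked it (a minimal sketch; "trains.frameworks" is just the logger name I was inspecting, the rest is plain stdlib logging):
`import logging

# Inspect the current level of the trains frameworks logger.
logger = logging.getLogger("trains.frameworks")
print(logging.getLevelName(logger.level))  # printed ERROR in my case

# If needed, raise the level explicitly to silence lower-severity messages.
logger.setLevel(logging.ERROR) `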
I think if there's a default value it should override the type; otherwise, go with the type.
The trains version is still 0.14; it will take time to switch it.
AgitatedDove14 No, there's no reason in my case to pass an empty string; that's why I removed the type=str part.
AgitatedDove14 v0.14
AgitatedDove14 The question is whether it's done post-experiment or not.
After you've conducted experiments for a few projects and you want to organize them, our way of thinking works.
If you want sub-versions as you go, while the experiments are conceptually different enough to require a different project, you're doing something not very organized. In that case the other option will be better; it's just not my style of work.
AgitatedDove14 Good to know! 🙂
I think it's good the way you described it (the second option).
Let's call it an applicative project, which has experiments, and an abstract/parent project (or some other name) that groups applicative projects.
AgitatedDove14 Thanks Martin, I know that. I'm just saying it's a bug.
I think it's either a parent project or a parent experiment; you don't need both.
AgitatedDove14 You were right. I can get them as system tags.
I've written a class that wraps a training session and the interaction with trains, since upon loading/saving the experiment I need more than just the 'model.bin'.
So I use these tags to match the specific aux files that were saved with their model.
AgitatedDove14 My solution actually works better when I want to copy the model + aux to a different s3 folder for deployment, as the aux files are very light and I can copy the model without downloading it. But thanks for the suggestion.
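For context, the wrapper is roughly along these lines (a minimal sketch, not my exact code: the class name and file patterns are illustrative, and it uses upload_artifact for the aux files instead of the tag matching described above):
`from pathlib import Path
from trains import Task

class TrainingSessionWrapper:
    """Illustrative wrapper: keeps the model and its aux files on one task,
    so they can later be located and copied together for deployment."""

    def __init__(self, project_name, task_name):
        self.task = Task.init(project_name=project_name, task_name=task_name)

    def save_aux(self, model_dir):
        # model.bin itself is saved by the framework; the light aux files
        # are attached to the same task as artifacts.
        for aux in Path(model_dir).glob("*.json"):
            self.task.upload_artifact(name=aux.stem, artifact_object=str(aux))

    def load_aux(self, name):
        # Fetch a specific aux file that was saved with the model.
        return self.task.artifacts[name].get_local_copy() `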
TimelyPenguin76 yes, both 0.15.1
TimelyPenguin76 I see it in the web-app under the model.
AgitatedDove14 I can't try the new agent at the moment. The OS is Ubuntu 18.04, more specifically: amazon/Deep Learning Base AMI (Ubuntu 18.04) Version 22.0,
and no docker. Running directly on the machine.
AgitatedDove14 I'm using this code in the meantime:
` ### This script checks the number of GPUs, builds a list like 0,1,2...
### then prints '--gpus' followed by that comma-separated list.
NUM_GPUS=$(nvidia-smi -L | wc -l)  # count the available GPUs
NUM_GPUS=$(($NUM_GPUS-1))          # highest GPU index
OUT=()
if [ $NUM_GPUS -ge 0 ]
then
  for i in $(seq 0 $NUM_GPUS); do OUT+=( "$i" ); done
  # join the indices with commas and prepend the '--gpus' flag
  echo ${OUT[*]} | tr ' ' ',' | awk '{print "--gpus "$1}'
else
  echo ""
fi `
AgitatedDove14 Yes, you're right. It was 10.2 or 10.1, if I recall.
SteadyFox10 AgitatedDove14 Thanks, I really did change the name.
AgitatedDove14 Well, after starting a new project it works. I guess it's a bug.
AgitatedDove14 Yes, I can. I didn't delete the previous project yet.
AgitatedDove14
I think excluding arguments from the argparser is a good idea.
Regarding the other parameters, such as the working directory and script path: I just want to automate them, since when I run the script from my local machine to create the "template" of the experiment, they get values that won't work when running in the worker. I just thought it could be automated from the code.
AgitatedDove14 I'm using both argparse and sys.argv to start different processes, each of which will interact with a single GPU. So each process has a specific argument with a different value to differentiate between them (only the main one interacts with trains). At the moment I'm encountering issues with getting the arguments from the processes I spawn. I'm explicitly calling python my_script.py --args...
and each process knows how to interact with the others. It's a bit complicated to explain...
I created a wrapper that works like executing python -m torch.distributed.launch --nproc_per_node 2 ./my_script.py
but from my script. I do call trains.init
in the subprocesses; the only actual difference between the subprocesses, in terms of arguments, is supposed to be local_rank
and that's all. It may be that I'm not distributing the model between the GPUs in an optimal way, or at least not in a way that matches your framework.
If you have an example it would be great.
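In the meantime, this is roughly what my launcher does (a minimal sketch; the script name, project/task names and the GPU count are illustrative, and in this sketch only the launcher process calls Task.init):
`import subprocess
import sys

from trains import Task

def launch(script="./my_script.py", nproc_per_node=2):
    # Mimic torch.distributed.launch: one subprocess per GPU,
    # differing only in the --local_rank argument.
    Task.init(project_name="my_project", task_name="distributed_run")

    procs = []
    for local_rank in range(nproc_per_node):
        cmd = [sys.executable, "-u", script, "--local_rank", str(local_rank)]
        procs.append(subprocess.Popen(cmd))

    # Wait for all workers to finish.
    for p in procs:
        p.wait()

if __name__ == "__main__":
    launch() `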
SuccessfulKoala55 No, that's not what I mean.
Take a look at the process on the machine: /home/ubuntu/.trains/venvs-builds/3.6/bin/python -u path/to/script/my_script.py
That's how the process starts. Therefore, when I try to get sys.argv, all I get is path/to/script/my_script.py.
I'm talking about allowing arguments that are not injected into argparse, so it would look like:
` /home/ubuntu/.trains/venvs-builds/3.6/bin/python -u path/to/script/my_script.py --... `
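On my side, a possible workaround (plain argparse, nothing trains-specific) is to tolerate the extra arguments with parse_known_args, so undeclared flags on the command line don't break the script:
`import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)

# parse_known_args() returns the unrecognized arguments instead of erroring out.
args, extra = parser.parse_known_args()
print("known:", args, "extra:", extra, "raw argv:", sys.argv) `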
AgitatedDove14 It will probably take me a few days, but I'll let you know.