
same here, changing arguments in the Args section of Hyperparameters doesn't work, training script starts with the default values.
trains 0.16.0
trains-agent 0.16.0
trains-server 0.16.0
I updated the version in the Installed packages section before starting the experiment
I don't connect anything explicitly, I'm using argparse, it used to work before the update
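roughly how the script picks up the args, just a minimal sketch (project/experiment names and arguments are placeholders) — trains is supposed to patch argparse automatically once Task.init() is called:

import argparse
from trains import Task

# Task.init() hooks into argparse, so values edited in the Args section
# of the Web UI should override the defaults parsed below
task = Task.init(project_name="my-project", task_name="my-experiment")

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--batch-size", type=int, default=32)
args = parser.parse_args()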
thanks! this bug and cloning problem seem to be fixed
copy-pasting the entire training command into the command line
I change the arguments in the Web UI, but it looks like they are not parsed by trains
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
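something like this is what I have in mind for the cleanup — a sketch with the elasticsearch Python client, the host/port and the index name are assumptions, not what our server actually uses:

from elasticsearch import Elasticsearch

# Elasticsearch of the trains-server deployment (assumed default host/port)
es = Elasticsearch("http://localhost:9200")

# list all indices with their sizes to see which ones are worth deleting
for idx in es.cat.indices(format="json"):
    print(idx["index"], idx["store.size"])

# delete a specific old index (hypothetical name, double-check before running)
es.indices.delete(index="events-training_stats_scalar-2020-05")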
nope, old cleanup task fails with trains_agent: ERROR: Could not find task id=e7725856e9a04271aab846d77d6f7d66 (for host: )
Exception: 'Tasks' object has no attribute 'id'
weirdly enough, curl http://apiserver:8008 from inside the container works
{
  username: "username"
  password: "password"
  name: "John Doe"
},
well okay, it's probably not that weird considering that the worker just runs the container
I guess I could manually explore different containers and their content. As far as I remember, I had to update Elastic records when we moved to the new cloud provider in order to update model URLs
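in case it helps, the URL rewrite looked roughly like this — just a sketch with the elasticsearch Python client, the index name, field and bucket URLs here are made up, ours were different:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# rewrite the URL prefix on every matching document
# (index name, field name and URL prefixes are hypothetical)
es.update_by_query(
    index="some-index-*",
    body={
        "script": {
            "source": "ctx._source.url = ctx._source.url.replace(params.old, params.new)",
            "params": {"old": "s3://old-bucket/", "new": "s3://new-bucket/"},
        },
        "query": {"prefix": {"url": "s3://old-bucket/"}},
    },
)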
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
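for my own notes, a quick way to just see what's where — a minimal pymongo sketch, the connection string is an assumption and the database/collection names will differ per setup:

from pymongo import MongoClient

# connect to the trains-server MongoDB container (assumed default port)
client = MongoClient("mongodb://localhost:27017")

# list databases and their collections to see what lives in Mongo
for db_name in client.list_database_names():
    print(db_name, client[db_name].list_collection_names())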
if you click on the experiment name here, you get 404 because link looks like this:
https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID
when it should look like this:
https://DOMAIN/projects/PROJECT_ID/experiments/EXPERIMENT_ID
btw, are there any examples of exporting metrics using the Python client? I could only find the last_metrics attribute of the task
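this is about as far as I got — a sketch of reading last_metrics through the Python client; the task id is a placeholder and I'm not sure this is the intended way to export full metric histories:

from trains import Task

# fetch an existing experiment by id (placeholder id)
task = Task.get_task(task_id="YOUR_TASK_ID")

# last_metrics holds only the most recent value per metric/variant,
# not the full history
print(task.data.last_metrics)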
we do log a lot of different metrics, maybe this could be part of the problem
it will probably screw up my resource monitoring plots, but well, who cares
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
thank you, I'll let you know if setting it to zero worked
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
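if I add the direct link myself, I assume the entries would look something like this (the wheel URLs are placeholders for whatever nightly build actually matches):

torch @ https://download.pytorch.org/whl/nightly/cu101/<torch-1.6.0.dev20200430-wheel>.whl
torchvision @ https://download.pytorch.org/whl/nightly/cu101/<torchvision-0.7.0.dev20200430-wheel>.whl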
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news
nope, that's the point, quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment is using a checkpoint from a previous experiment, since we need to manually insert the S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using artifacts of this experiment
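right now the closest workaround I can think of is passing the checkpoint around explicitly as an artifact — a rough sketch, the task id and artifact name are placeholders:

from trains import Task

# in the first experiment: register the checkpoint as an artifact
task = Task.current_task()
task.upload_artifact(name="checkpoint", artifact_object="checkpoints/last.ckpt")

# in the follow-up experiment: pull the checkpoint from the earlier task
parent = Task.get_task(task_id="PARENT_TASK_ID")
checkpoint_path = parent.artifacts["checkpoint"].get_local_copy()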
nope, same problem even after creating a new experiment from scratch