Ok, so the problem was indeed the way docker was installed (with snap)
Ok, I guess I'll just delete the whole loss series. Thanks!
Yeah, that's what I thought. I do have trains server 0.15
Alright, I had a look in the /tmp/.trains_agent_daemon_outabcdef.txt logs, not many insights from here. For the moment, I simply started a new trains-agent daemon in services mode and I will wait to see what happens.
If I remove security_group_ids and leave only subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
With what I shared above, I now get: docker: Error response from daemon: network 'host' not found.
I now have a different question: when installing torch from wheel files, I am guaranteed to have the corresponding cuda library and cudnn together, right?
Thanks a lot, I will play with that!
AgitatedDove14 Yes exactly! it is shown in the recording above
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the aws creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
Would you like me to open an issue for that or will you fix it?
but I also make sure to write the trains.conf to the root directory in this bash script:
echo "
sdk.aws.s3.key = ***
sdk.aws.s3.secret = ***
" > ~/trains.conf
...
python3 -m trains_agent --config-file "~/trains.conf" ...
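One thing that may be worth ruling out in a script like this: bash does not expand a tilde inside double quotes, so the program receives the literal string ~/trains.conf. Whether trains_agent expands it on its own is not assumed here either way. The sketch below only shows how such a path can be resolved explicitly in Python; the helper name resolve_config_path is hypothetical.

```python
import os


def resolve_config_path(path):
    """Expand a leading '~' and return an absolute path (hypothetical helper)."""
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    raw = "~/trains.conf"  # what a quoted "~/trains.conf" looks like to the program
    resolved = resolve_config_path(raw)
    print(raw, "->", resolved)
    print("exists:", os.path.exists(resolved))
```

Running it under the same user account as the agent shows which home directory the tilde actually points to.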
Hi NonchalantHedgehong19 , thanks for the hint! What should be the content of the requirements file then? Can I specify my local package inside? How?
So the problem comes when I do my_task.output_uri = "s3://my-bucket": trains, in the background, checks whether it has access to this bucket, and it is not able to find/read the creds
And I can verify that ~/trains.conf exists in the su home folder
  File "devops/valid.py", line 80, in <module>
    valid(parse_args)
  File "devops/valid.py", line 41, in valid
    valid_task.output_uri = args.artifacts
  File "/data/.trains/venvs-builds/3.6/lib/python3.6/site-packages/trains/task.py", line 695, in output_uri
    ", check configuration file ~/trains.conf".format(value))
ValueError: Could not get access credentials for 's3://ml-artefacts' , check configuration file ~/trains.conf
region is empty, I never entered it and it worked
AgitatedDove14 Yes exactly, I tried the fix suggested in the GitHub issue (urllib3>=1.25.4) and the ImportError disappeared
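To check whether an environment already satisfies that urllib3>=1.25.4 constraint, something along these lines could work. It is a simplified comparison of the numeric parts only, not a full version parser, and the function names version_tuple and meets_minimum are made up for this sketch.

```python
import re


def version_tuple(version):
    """Turn '1.25.4' into (1, 25, 4), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum="1.25.4"):
    """Return True if `version` is at least `minimum` (numeric parts only)."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import urllib3
    except ImportError:
        print("urllib3 is not installed")
    else:
        status = "ok" if meets_minimum(urllib3.__version__) else "too old"
        print(urllib3.__version__, status)
```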
(btw, yes, I adapted to use Task.init(...output_uri=))
Now, I know the experiments having the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents ending with 0?
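The selection itself can be sketched in plain Python, independent of how the metrics store is queried. Note that keeping multiples of 10 means the documents to delete are the ones whose iteration does not end in 0. The field name "iter" and the function split_by_stride are assumptions made for this sketch, not names taken from the trains server.

```python
def split_by_stride(documents, stride=10, key="iter"):
    """Split documents into (keep, delete) by whether doc[key] is a multiple of stride."""
    keep, delete = [], []
    for doc in documents:
        if doc[key] % stride == 0:
            keep.append(doc)
        else:
            delete.append(doc)
    return keep, delete


if __name__ == "__main__":
    # Toy data standing in for metric documents; "iter" is an assumed field name.
    docs = [{"iter": i, "value": 1.0 / (i + 1)} for i in range(25)]
    keep, delete = split_by_stride(docs)
    print("kept iterations:", [d["iter"] for d in keep])
    print("documents to delete:", len(delete))
```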
I'd like to move to a setup where I don't need these tricks
AgitatedDove14 That's a good point: the experiment failing with this error does show the correct aws key: ... sdk.aws.s3.key = ***** sdk.aws.s3.region = ...
So most likely trains was masking the original error, it might be worth investigating to help other users in the future
Does what you suggested here ( https://github.com/allegroai/trains-agent/issues/18#issuecomment-634551232 ) also apply to containers used by the services queue?
I will probably just use an absolute path everywhere, to be robust against different machine user accounts: /home/user/trains.conf
I'll try to pass these values using the env vars
And after the update, the loss graph appears
Without the envs, I had the error:
ValueError: Could not get access credentials for 's3://my-bucket' , check configuration file ~/trains.conf
After using the envs, I got the error:
ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'