
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
Why is it required when boto3 can figure the credentials out itself within the EC2 instance?
SuccessfulKoala55 I was able to make it work with use_credentials_chain: true
in the clearml.conf and the following patch: https://github.com/allegroai/clearml/pull/478
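For reference, a minimal sketch of the relevant clearml.conf setting (assuming the option lives under the sdk.aws.s3 section, as introduced by the linked PR). It tells boto3 to fall back to its default credential chain, which picks up the EC2 instance's IAM profile:

```
sdk {
    aws {
        s3 {
            # Let boto3 resolve credentials via its default chain
            # (env vars, shared config files, or the EC2 instance profile)
            use_credentials_chain: true
        }
    }
}
```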
Yea I really need that feature, I need to move away from key/secrets to iam roles
I will go for lunch actually 🙂 back in ~1h
for some reason when cloning task A, trains sets an old commit in task B. I tried to recreate task A to enforce a new task id and new commit id, but still the same issue
AgitatedDove14 I finally solved it: The problem was --network='host'
should be --network=host
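The difference matters because a POSIX shell strips the single quotes before docker ever sees them, but when the argument string is split without shell parsing (e.g. taken from a config value), the quotes are passed through literally and docker receives a value it rejects. A quick illustration in Python (the argument string is just an example):

```python
import shlex

arg = "--network='host'"

# shlex.split mimics POSIX shell parsing: quotes are removed,
# so docker would receive --network=host
print(shlex.split(arg))

# naive whitespace splitting keeps the quotes, so docker would
# receive the literal value 'host' (quotes included)
print(arg.split())
```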
Hoo I found:
```
user@trains-agent-1: ps -ax
5199 ?  Sl  29:25  python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04  python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
```
here is the function used to create the task:
```
def schedule_task(parent_task: Task,
                  task_type: str = None,
                  entry_point: str = None,
                  force_requirements: List[str] = None,
                  queue_name="default",
                  working_dir: str = ".",
                  extra_params=None,
                  wait_for_status: bool = False,
                  raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
```
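The snippet is truncated, but the last two parameters suggest the helper optionally blocks until the task reaches a terminal state and raises if that state is listed in raise_on_status. A minimal standalone sketch of that check (names and status values are illustrative stand-ins, not the actual implementation):

```python
# Illustrative stand-ins for Task.TaskStatusEnum values
FAILED = "failed"
COMPLETED = "completed"

def check_final_status(status, raise_on_status=(FAILED,)):
    """Raise if the task ended in one of the undesired states."""
    if status in raise_on_status:
        raise RuntimeError(f"task finished with status: {status}")
    return status

print(check_final_status(COMPLETED))  # -> completed
```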
RobustRat47 It can also simply be that the instance type you declared is not available in the zone you defined
without the envs, I had the error: ValueError: Could not get access credentials for 's3://my-bucket', check configuration file ~/trains.conf
After using envs, I got error: ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'
Could you please share the stacktrace?
Yes, but I am not certain how: I just deleted the /data folder and restarted the server
what would be the name of these vars?
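If the question is about the credential environment variables that boto3 reads, the standard names are documented by AWS; the values below are dummies for illustration:

```python
import os

# Standard variable names boto3's default credential chain looks up
# (dummy values, for illustration only)
os.environ["AWS_ACCESS_KEY_ID"] = "dummy-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "dummy-secret"
os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"

print(sorted(k for k in os.environ if k.startswith("AWS_")))
```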
Isn't it overkill to run a whole Ubuntu 18.04 image just to run a dead simple controller task?
Sure 🙂 Opened https://github.com/allegroai/clearml/issues/568
ubuntu18.04 is actually 64MB, I can live with that 🙂
So most likely trains was masking the original error, it might be worth investigating to help other users in the future
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
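A plain scp copy can drop file permissions and ownership inside the data folder, which is a common cause of such errors. A minimal sketch of a copy that at least preserves permission bits and timestamps (paths are placeholders; ownership still has to be fixed separately, e.g. with chown):

```python
import os
import shutil
import stat
import tempfile

# Placeholder stand-ins for /opt/trains/data on the two machines
src = tempfile.mkdtemp(prefix="src-")
dst = os.path.join(tempfile.mkdtemp(prefix="dst-"), "data")

path = os.path.join(src, "conf")
with open(path, "w") as f:
    f.write("x")
os.chmod(path, 0o640)

# copy2 preserves permission bits and timestamps (but not ownership)
shutil.copytree(src, dst, copy_function=shutil.copy2)

mode = stat.S_IMODE(os.stat(os.path.join(dst, "conf")).st_mode)
print(oct(mode))  # -> 0o640
```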
So the migration from one server to another + adding new accounts with password worked, thanks for your help!
AgitatedDove14 That's a good point: The experiment failing with this error does show the correct aws key:... sdk.aws.s3.key = ***** sdk.aws.s3.region = ...
Thanks! Corrected both, now it's building
Yea so I assume that training my models using docker will be slightly slower so I'd like to avoid it. For the rest using docker is convenient
For new projects it works 🙂
CostlyOstrich36 , this also happens with clearml-agent 1.1.1 on an AWS instance…
Thanks a lot for the solution SuccessfulKoala55! I'll try that if the solution "delete old bucket, wait for its name to be available, recreate it with the other AWS account, transfer the data back" fails
CostlyOstrich36 , actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other