
extra_configurations = {'SubnetId': "<subnet-id>"}
with brackets, right?
But we can easily extend, right?
extra_configurations = {"SubnetId": "<subnet-id>"}
That fixed it 🙂
Still getting the same error, it is not taken into account 🤔
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to ssh into the machine, will keep you informed 🙂
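(For reference, a minimal sketch of what the extended dict could look like, assuming the autoscaler passes extra_configurations verbatim as boto3 run_instances keyword arguments, hence the EC2 API key casing; all values are placeholders:)
`
# Sketch, assuming extra_configurations is merged as-is into
# boto3's ec2.run_instances(**kwargs); values are placeholders.
extra_configurations = {
    "SubnetId": "<subnet-id>",          # VPC subnet to launch the instance in
    "SecurityGroupIds": ["<sg-id>"],    # must allow inbound TCP 22 for ssh
    "KeyName": "<key-pair-name>",       # EC2 key pair used for the ssh login
}
`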
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far 🙂
` resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
    }
}
queues {
    aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libx...
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one no longer exists/is not accessible, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
I get the following error:
trains==0.16.4
And after the update, the loss graph appears
and in the logs:
`
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /...
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
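(For context, a minimal sketch of the working call pattern, assuming a pandas DataFrame and the clearml Task API; the trains package exposes the same calls. The ValueError above is what you get when upload_artifact runs against a task that is no longer created/in_progress:)
`
import pandas as pd
from clearml import Task

# Placeholder project/task names.
task = Task.init(project_name="demo", task_name="table-artifact")
df = pd.DataFrame({"col": [1, 2, 3]})

# Only succeeds while the task is still 'created' or 'in_progress';
# on a completed/closed task it raises the ValueError quoted above.
task.upload_artifact(name="table", artifact_object=df)
`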
Indeed, I actually had the old configuration that was not JSON - I converted it to JSON and now it works 🙂
Woohoo! Thanks 🙂
That would work for pytorch and clearml, yes, but what about my local package?
Hi AgitatedDove14 , initially I was doing this, but then I realised that with the approach you suggest, all the packages of the local environment also end up in the "installed packages", while in reality I only need the dependencies of the local package. That's why I use _update_requirements ; with this approach, only the required packages will be installed by the agent.
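(A sketch of that pattern; note that _update_requirements is a private method, so the exact signature is an assumption and may change between versions:)
`
from clearml import Task

task = Task.init(project_name="demo", task_name="local-package")  # placeholder names

# Replace the auto-detected "installed packages" with only the
# local package's dependencies (hypothetical pins below), so the
# agent installs just these instead of the whole local environment.
task._update_requirements(["torch==1.10.0", "numpy>=1.20"])
`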
Sure! Here are the relevant parts:
` ...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
Hi NonchalantHedgehong19 , thanks for the hint! What should the content of the requirements file be then? Can I specify my local package inside? How?
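(One possible reading, as a hedged sketch: Task.add_requirements, called before Task.init, forces a requirement into the task's "installed packages"; whether this is what was meant here is an assumption:)
`
from clearml import Task

# Must be called before Task.init; forces the requirement into the
# task's "installed packages" so the agent installs it.
Task.add_requirements("my_local_package")  # hypothetical package name

task = Task.init(project_name="demo", task_name="reqs-demo")  # placeholder names
`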
Well, as long as you're using a single node, it should indeed alleviate the shard disk size limit, but I'm not sure ES will handle that too well. In any case, you can't change that for existing indices; you can modify the mapping template and reindex the existing index (you'll need to index to another name, delete the original, and create an alias with the original name, as the new index can't be renamed...)
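(A sketch of that reindex-and-alias dance with the elasticsearch Python client, assuming the 7.x-style API; the endpoint and index names are placeholders:)
`
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# 1. Copy the old index into a new one created from the updated mapping template.
es.reindex(body={"source": {"index": "events-old"}, "dest": {"index": "events-new"}})

# 2. Delete the original, then alias the old name to the new index,
#    since an existing index cannot simply be renamed.
es.indices.delete(index="events-old")
es.indices.put_alias(index="events-new", name="events-old")
`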
Ok thanks!
Well, as long as you use a single node, multiple shards offer no sca...
Yes, that's what it looks like. Somehow when you clone the experiment repo, you correctly set the git creds in the url, but when the dependencies are installed, the git creds are not taken into account
I also don't understand what you mean by "unless the domain is different"...
The same way ssh keys are global, I would have expected the git creds to be used for any git operation
See my answer in the issue - I am not using docker
I don't have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won't need to update that image often anyway
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
Done! Also I tried to use the git credential cache ( https://git-scm.com/docs/git-credential-cache ) as a workaround (hoping that the first time it clones the experiment repo, it caches the creds for the next times), but I then get a different error: fatal: unable to find a suitable socket path; use --socket
(I didn't have this problem so far because I was using ssh keys globally, but I now want to switch to git auth using a Personal Access Token for security reasons)