Nevermind, the nvidia-smi command fails in that instance, the problem lies somewhere else
Still failing with the same error
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding CUDA library and cuDNN along with it?
I am still confused though - from the Get Started page of the PyTorch website, when choosing "conda" the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (the CUDA runtime)?
Alright, I am starting to get a better picture of this puzzle.
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch.
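A quick way to confirm what a given wheel bundles (just a sanity check, assuming torch is already installed from a pip wheel in the active environment) is to ask torch itself which CUDA and cuDNN versions it was built with:

python3 -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.is_available())"

If torch.version.cuda prints a version even though no system-wide cudatoolkit is installed, the runtime really did come with the wheel.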
Thanks SuccessfulKoala55 for the answer! One followup question:
When I specify:
agent.package_manager.pip_version: '==20.2.3'
in the trains.conf, I get:
trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
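For comparison, the syntax that appears in the autoscaler extra_trains_conf further down uses = and double quotes rather than : and single quotes; a minimal trains.conf fragment along those lines would be (just a sketch, whether the single quotes or the colon triggered the ParseException is my assumption, not something verified here):

# same setting, written the way it appears in the extra_trains_conf below
agent.package_manager.pip_version = "==20.2.3"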
Isn't it overkill to run a whole Ubuntu 18.04 image just to run a dead simple controller task?
I have two controller tasks running in parallel in the trains-agent services queue
Sure, just sent you a screenshot in PM
Still getting the same error, it is not taken into account.
The host is accessible, I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" and get results from the local machine.
Yes, because it won't install the local package whose setup.py has the problem in its install_requires described in my previous message.
That gave me
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3
Building Task 94jfk2479851047c18f1fa60c1364b871 inside docker: ubuntu:18.04
Starting docker build
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
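That 'could not select device driver "" with capabilities: [[gpu]]' message usually means Docker cannot find the NVIDIA container runtime on the host. A minimal check to run on the instance, assuming an Ubuntu host where the NVIDIA driver and the nvidia-docker apt repository are already set up (the image tag below is only an example):

# install the runtime hook that exposes GPUs to docker, then restart the daemon
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# sanity check: this should print the same table as nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

In this thread the AMI was the main suspect (see the exchange further down), but this check is a quick way to tell whether the host can hand GPUs to containers at all.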
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to ssh into the machine, will keep you informed.
extra_configurations = {"SubnetId": "<subnet-id>"}
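Since the SubnetId key follows the EC2 RunInstances naming, my assumption is that extra_configurations is forwarded to that call as-is; the other two fields mentioned above would then look roughly like this (key names follow the EC2 API, values are placeholders of mine, so worth checking against the boto3 docs):

# hypothetical values; SecurityGroupIds expects a list, KeyName an existing EC2 key pair name
extra_configurations = {
    "SubnetId": "<subnet-id>",
    "SecurityGroupIds": ["<security-group-id>"],
    "KeyName": "<key-pair-name>"
}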
That fixed it.
Ha nice, where can I find the mapping template of the original ClearML so that I can copy and adapt it?
trains==0.16.4
I am using pip as the package manager, but I start the trains-agent inside a conda env.
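For context, the package manager the agent uses for task environments is chosen in trains.conf independently of the environment the agent process itself runs in; the fragment I have in mind is something like this (a sketch assuming the agent.package_manager.type key, so double-check it against the default agent config):

agent {
    package_manager {
        # use pip inside the task venvs, even though the agent itself lives in a conda env
        type: pip
    }
}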
resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
    }
}
queues {
    aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libx...
Hi CostlyOstrich36! No, I am running in venv mode.
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one does not exist / is not accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
CostlyOstrich36 super, thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied), or just referenced?