Yes, the new project is the one where I changed the layout, and that gets reset when I move an experiment there
AMI: ami-08e9a0e4210f38cb6, availability zone: eu-west-1a
`trains-agent daemon --gpus 0 --queue default & trains-agent daemon --gpus 1 --queue default &`
Well, as long as you're using a single node, it should indeed alleviate the shard disk-size limit, but I'm not sure ES will handle that too well. In any case, you can't change that for existing indices; you can modify the mapping template and reindex the existing index (you'll need to reindex to another name, delete the original, and create an alias pointing the original name to the new index, since the new index can't be renamed...)
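Roughly, that reindex-and-alias flow would look something like this (just a sketch using the elasticsearch Python client; the index names are placeholders):

```python
from elasticsearch import Elasticsearch

# Minimal sketch, assuming elasticsearch-py 7.x and placeholder index names.
es = Elasticsearch(["http://localhost:9200"])

# Copy the documents into a new index that picks up the updated mapping template
es.reindex(
    body={"source": {"index": "events-old"}, "dest": {"index": "events-new"}},
    wait_for_completion=True,
)

# Drop the original index and point its name at the new one via an alias
es.indices.delete(index="events-old")
es.indices.put_alias(index="events-new", name="events-old")
```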
Ok thanks!
Well, as long as you use a single node, multiple shards offer no sca...
And with this setup I can use the GPU without any problem, meaning that the wheel does contain the CUDA runtime.
The ubuntu18.04 image is actually 64 MB, I can live with that 😛
`RuntimeError: CUDA error: no kernel image is available for execution on the device`
Yes, I am preparing them 🙂
There’s a reason for the ES index max size
Does ClearML enforce a max index size? What typically happens when that limit is reached?
I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel to the main thread?
So the controller task finished, and now only the second trains-agent services-mode process is showing up as registered. So this is definitely something linked to switching back to the main process.
I had this problem before
my docker-compose for the master node of the ES cluster is the following:
```yaml
version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
```
So I updated the config with:

```
resource_configurations {
  A100 {
    instance_type = "p3.2xlarge"
    is_spot = false
    availability_zone = "us-east-1b"
    ami_id = "ami-04c0416d6bd8e4b1f"
    ebs_device_name = "/dev/xvda"
    ebs_volume_size = 100
    ebs_volume_type = "gp3"
    key_name = "<my-key-name>"
    security_group_ids = ["<my-sg-id>"]
    subnet_id = "<my-subnet-id>"
  }
}
```
but I get the following in the autoscaler logs:
`Warning! exception occurred: An error occurred (InvalidParam...
Yes! Not a strong use case though; rather, I wanted to ask if it was supported somehow.
So it seems like it doesn't copy /root/clearml.conf and it doesn't pass the environment variables (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY)
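One workaround I might try (just a sketch; all the values below are placeholders): passing the credentials programmatically so the task doesn't depend on /root/clearml.conf or on those environment variables being forwarded:

```python
from clearml import Task

# Sketch of a possible workaround (placeholder values): provide the server
# credentials in code instead of relying on /root/clearml.conf or on the
# CLEARML_API_* environment variables being present in the container.
Task.set_credentials(
    api_host="https://api.clear.ml",  # placeholder API host
    key="<access_key>",               # placeholder access key
    secret="<secret_key>",            # placeholder secret key
)

task = Task.init(project_name="my-project", task_name="credentials-debug")
```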
So when I create a task locally using `task = Task.init(project_name=config.get("project_name"), task_name=config.get("task_name"), task_type=Task.TaskTypes.training, output_uri="s3://my-bucket")`, the artifact is correctly logged remotely, but when I create the task remotely (from an agent) the artifact is logged locally (on the agent machine, not on S3).
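For context, the pattern is roughly this (simplified sketch; the project/task names and the artifact are placeholders):

```python
from clearml import Task

# Simplified repro sketch (placeholder names): output_uri should control where
# artifacts and models are uploaded.
task = Task.init(
    project_name="my-project",
    task_name="artifact-upload-test",
    task_type=Task.TaskTypes.training,
    output_uri="s3://my-bucket",
)

# Run locally, this uploads to s3://my-bucket; when executed by the agent, the
# same call ends up storing the artifact on the agent machine instead.
task.upload_artifact(name="results", artifact_object={"accuracy": 0.9})
```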
Sure, where can I find this file?
Yes, but I still don't understand why the `post_packages` didn't work; could be worth investigating.
Erf, I have the same problem with `ProxyDictPreWrite` 😄 What is the use case of this one?
In the UI the value is the correct one (not empty, a string).