extra_configurations = {"SubnetId": "<subnet-id>"}
That fixed it!
Ha nice, where can I find the mapping template of the original clearml so that I can copy and adapt it?
trains==0.16.4
I am using pip as the package manager, but I start the trains-agent inside a conda env.
` resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
    }
}
queues {
    aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
    agent.package_manager.system_site_packages = true
    agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
    sudo apt-get install -y libsm6 libxext6 libx...
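(Side note: the block above is plain HOCON, so it can be sanity-checked locally before handing it to the autoscaler. A minimal sketch, assuming pyhocon is installed and reusing the same placeholder values:)

```python
# Minimal sketch: parse the autoscaler HOCON snippet locally with pyhocon and
# read back the fields it defines (values are the placeholders from my config).
from pyhocon import ConfigFactory

conf = ConfigFactory.parse_string("""
resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
    }
}
queues {
    aws_a100 = [["A100", 15]]
}
""")

a100 = conf["resource_configurations"]["A100"]
print(a100["instance_type"], a100["ami_id"])  # -> p3.2xlarge ami-04c0416d6bd8e4b1f
print(conf["queues"]["aws_a100"])             # -> [['A100', 15]]
```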
Hi CostlyOstrich36! No, I am running in venv mode
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one doesn't exist / isn't accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
CostlyOstrich36 super, thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied), or just referenced?
Yes, thanks! In my case, I was actually using TrainsSaver from pytorch-ignite with a local path, then I understood, looking at the code, that under the hood it actually changes the output_uri of the current task. That's why my previous_task.output_uri = "s3://my_bucket" had no effect (it was placed BEFORE the training).
So I updated the config with:
resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
        key_name = "<my-key-name>"
        security_group_ids = ["<my-sg-id>"]
        subnet_id = "<my-subnet-id>"
    }
}
but I get in the logs of the autoscaler:
` Warning! exception occurred: An error occurred (InvalidParam...
For some reason the configuration object gets updated at runtime to:
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
yes, here is the error (the space at the end of the line is there)
` Applying uncommitted changes
Executing: ('git', 'apply'): b'error: corrupt patch at line 13\n'
Failed applying diff
trains_agent: ERROR: Failed applying git diff:
diff --git a/configs/2.2.2_from_scratch.yaml b/configs/2.2.2_from_scratch.yaml
index 9fece48..5816f78 100644
--- a/configs/2.2.2_from_scratch.yaml
+++ b/configs/2.2.2_from_scratch.yaml
@@ -136,7 +136,7 @@ data_processing:
optimizer:
type: 'RMSprop'
args:
- lr: 2.5e...
In all the steps I want to store them as artifacts in S3 because it's very convenient.
The last step should merge them all, i.e. it needs to know all the artifacts of the previous steps.
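As a rough sketch of the flow I have in mind (the project name, bucket and task ids below are placeholders, not my real ones): each step uploads its result as an artifact, and the merge step pulls the artifacts of the previous tasks by id.

```python
# Rough sketch; project/bucket names and task ids are placeholders.
from clearml import Task

# --- in each intermediate step ---
step_task = Task.init(project_name="my_project", task_name="step_1",
                      output_uri="s3://my_bucket")  # artifacts go to S3
step_task.upload_artifact(name="partial_result", artifact_object={"rows": 123})
step_task.close()

# --- in the final merge step ---
merge_task = Task.init(project_name="my_project", task_name="merge")
previous_ids = ["<task-id-1>", "<task-id-2>"]  # ids of the previous steps
partials = []
for task_id in previous_ids:
    t = Task.get_task(task_id=task_id)
    partials.append(t.artifacts["partial_result"].get())  # download + deserialize
# ... merge `partials` here
```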
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
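(For context, key_name / security_group_ids / subnet_id are standard parameters of the EC2 RunInstances API, so supporting them would essentially mean forwarding them to boto3. A hedged sketch of such a call, not the actual autoscaler code:)

```python
# Hedged sketch of the boto3 call the autoscaler would need to issue in order to
# honour key_name / security_group_ids / subnet_id -- NOT the actual autoscaler code.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId="ami-04c0416d6bd8e4b1f",
    InstanceType="p3.2xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="<my-key-name>",          # placeholder
    SecurityGroupIds=["<my-sg-id>"],  # placeholder
    SubnetId="<my-subnet-id>",        # placeholder
)
```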
(Btw the instance listed in the console has no name, is it normal?)
So I want to be able to visualise it quickly as a table in the UI and be able to download it as a dataframe, which of report_media or artifact is better?
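To make the question concrete, here is a small sketch of the two options I am weighing, with a placeholder DataFrame:

```python
# Sketch of the two options (df is a placeholder DataFrame).
import pandas as pd
from clearml import Task, Logger

task = Task.init(project_name="my_project", task_name="report-df")  # placeholders
df = pd.DataFrame({"metric": ["a", "b"], "value": [1.0, 2.0]})

# Option 1: render it as a table plot in the UI
Logger.current_logger().report_table(
    title="results", series="summary", iteration=0, table_plot=df
)

# Option 2: store it as an artifact that can be downloaded back as a DataFrame
task.upload_artifact(name="results_df", artifact_object=df)
```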
` # Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: ...
I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel to the main thread?
That would be awesome, yes; though on my side I have zero knowledge of the pip codebase.
Still investigating: task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training.
You mean it will resolve by itself in the coming days, or should I do something? Or is there nothing to do and it will stay this way?
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
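For reference, this is roughly how I call it; a sketch of my resume path with placeholder project/task names, not a confirmed fix:

```python
# Sketch of the resume path: check the stored iteration and reset the offset
# in the main process, before the ignite engine starts reporting.
from clearml import Task

task = Task.init(project_name="my_project", task_name="training")  # placeholders
print(task.data.last_iteration)  # matches engine.state["iteration"] after resuming

# Reset the offset so new scalars are not shifted by the previous run:
task.set_initial_iteration(0)

# ... create the ignite engine and resume training afterwards
```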
There is no way to filter on long types? I can't believe it.
Maybe the agent could be adapted to have a max_batch_size parameter?
Something like that?
` curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "variant": "loss_model"
                    }
                },
                {
                    "match": {
                        "task": "8f88e4b8cff84f23bde74ed4b7213ec6"
                    }
                }
            ]
        }
    },
    "aggs": {
        "series": {
            "terms": { "field": "iter" }
        }
    }
}...
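(Related to my earlier question about filtering on long types: the same query can be issued from Python, and a range clause on the numeric iter field should work as a filter. A sketch, reusing the index name and task id from the curl above:)

```python
# Same query as the curl above, issued from Python; the extra "range" clause shows
# how a numeric (long) field such as "iter" can be filtered directly.
import json
import requests

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"variant": "loss_model"}},
                {"match": {"task": "8f88e4b8cff84f23bde74ed4b7213ec6"}},
                {"range": {"iter": {"gte": 0, "lte": 1000}}},  # filter on a long field
            ]
        }
    },
    "aggs": {"series": {"terms": {"field": "iter"}}},
}

resp = requests.get(
    "http://localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(resp.json())
```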
On /data or /opt/clearml? These are two different disks.
It indeed has the old commit, so they match; no problem actually.