Adding back clearml logging with matplotlib.use('agg'): it uses more RAM, but nothing that suspicious
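For context, a minimal sketch of what I mean (project/task names are just placeholders), assuming ClearML's automatic matplotlib binding:
```python
import matplotlib
matplotlib.use("agg")  # non-interactive backend, must be selected before pyplot is imported

import matplotlib.pyplot as plt
from clearml import Task

# Hypothetical project/task names, only for this sketch
task = Task.init(project_name="debug", task_name="matplotlib-agg-logging")

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Report the figure explicitly; the automatic binding would normally
# also capture figures shown via plt.show()
task.get_logger().report_matplotlib_figure(
    title="test plot", series="series A", figure=fig, iteration=0
)
plt.close(fig)
```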
Sure, just sent you a screenshot in PM
Ok, I am asking because I often see the autoscaler starting more instances than the number of experiments in the queues, so I guess I just need to increase the max_spin_up_time_min
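For reference, this is roughly the hyper-parameter block I'm talking about, following the layout of the aws_autoscaler example; the values, and every name other than max_spin_up_time_min, are just my assumptions:
```python
# Rough sketch of the autoscaler hyper-parameters (placeholder values;
# names other than max_spin_up_time_min are from memory, treat them as assumptions)
hyper_params = {
    "max_idle_time_min": 15,         # shut an instance down after 15 idle minutes
    "polling_interval_time_min": 5,  # how often the queues are polled
    "max_spin_up_time_min": 30,      # how long to wait for a new instance to register as a worker
}
```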
I see that I have several volumes:
` $ docker volume ls
DRIVER VOLUME NAME
local 5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local 43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local 1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local 7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local 7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local a55cb71edff48c2138a5da9d8d1e26df3b...
extra_configurations = {"SubnetId": "<subnet-id>"}
That fixed it 😄
Ok, in that case it probably doesn’t work, because if the default value is 10 secs, that doesn’t match what I see in the experiment logs: tqdm adds a new line every second
I have a custom way of reading the config file
So the controller task finished and now only the second trains-agent services-mode process shows up as registered. So this is definitely something linked to switching back to the main process.
mmmh probably yes, I can’t say for sure (because I don’t remember precisely when I upgraded to 0.17) but it looks like that
alright I am starting to get a better picture of this puzzle
There is a pinned GitHub thread at https://github.com/allegroai/clearml/issues/81, seems like the right place?
Should I try to disable dynamic mapping before doing the reindex operation?
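Something like this is what I had in mind (the endpoint and index names are hypothetical, just to illustrate the idea):
```python
import requests

ES = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

# Create the destination index with dynamic mapping disabled, so the
# reindexed documents cannot add new fields on the fly
requests.put(
    f"{ES}/events_v2",  # hypothetical destination index
    json={"mappings": {"dynamic": "false"}},
).raise_for_status()

# Then copy the documents over
requests.post(
    f"{ES}/_reindex",
    params={"wait_for_completion": "false"},  # run async for large indices
    json={"source": {"index": "events_v1"}, "dest": {"index": "events_v2"}},
).raise_for_status()
```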
Hi AgitatedDove14, here is the full log.
Both python versions (local and remote) are Python 3.6. Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it’s not really related to clearml-agent, but rather to pip, which cannot find a proper Ubuntu wheel for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you have to take care of the whole maintenance of your machines, design the interactions between them, etc. This is where trains shines, it provides these features out-of-t...
Configuration:
` {
    "resource_configurations": {
        "v100": {
            "instance_type": "g4dn.2xlarge",
            "availability_zone": "us-east-1a",
            "ami_id": "ami-05e329519be512f1b",
            "ebs_device_name": "/dev/sda1",
            "ebs_volume_size": 100,
            "ebs_volume_type": "gp3",
            "key_name": "key.name",
            "security_group_ids": [
                "sg-asd"
            ],
            "is_spot": false,
            "extra_configura...
nothing wrong from ClearML side 🙂
Alright, so the steps would be:
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
That would create a base docker image, base_env_services.
Then how should I ensure that trains-agent uses that base image for the services queue? My guess is:
trains-agent daemon --services-mode --detached --queue services --create-queue --docker base_env_services --cpu-only
Would that work?
AgitatedDove14 In my case I'd rather have it under the "Artifacts" tab because it is a big json file
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
I tested by installing flask in the default env -> it was installed in the ~/.local/lib/python3.6/site-packages folder. Then I created a venv with the --system-site-packages flag. I activated the venv, and flask was indeed available.
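Roughly, this is what the test boils down to (flask and the paths are just the ones from my setup):
```python
import subprocess
import venv

# Create a venv that can also see the system/user site-packages,
# then check that flask resolves from outside the venv
venv.create("test-venv", system_site_packages=True, with_pip=True)

out = subprocess.run(
    ["test-venv/bin/python", "-c", "import flask; print(flask.__file__)"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
print(out.stdout)  # in my case this pointed at ~/.local/lib/python3.6/site-packages
```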