
More context:
trains, trains-agent and trains-server all 0.16 Session.api_version -> 2.9
(both when executed in trains-agent and in local script)
And since I ran the task locally with python3.9, it used that version in the docker container
Hi AgitatedDove14 , I want to circle back on this issue. It is still relevant, and I could collect the following on an EC2 instance running a clearml-agent with a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, many processes forked from the main training process are still alive. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
For some reason the configuration object gets updated at runtime to:
```
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
```
Thanks for your inputs, I will try that! For completeness, here is how I retrieve the parameters:
```python
from trains import Task

task = Task.init("test", "test")
parent_task = Task.get_task(task.parent)
task.get_logger().report_text(task.get_parameters())
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get()
```
Looking at the source code, it seems like I should call `data_processing_task._artifact_manager.flush()` to make sure I have the latest version of the artifacts in the task, right?
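For context, a rough sketch of how I would use it (the call is copied from the message above; note `_artifact_manager` is a private attribute, so this may break across versions, and the project/task names here are made up):

```python
from trains import Task

# Hypothetical upstream task that produces an artifact
data_processing_task = Task.init("test", "data processing")
data_processing_task.upload_artifact("dataset", artifact_object="/tmp/dataset.csv")

# Flush pending uploads so a downstream task sees the latest artifact version
data_processing_task._artifact_manager.flush()
```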
AppetizingMouse58 Yes and yes
I get the same error when trying to run the task using clearml-agent in services mode with docker, so weird
this is the last line, same as before
Hi AgitatedDove14 , how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It's blocking me atm
Hi AgitatedDove14 , so I ran 3 experiments:
- One with my current implementation (using "fork")
- One using "forkserver"
- One using "forkserver" + the DataLoader optimization
I sent you the results via private message, here are the outcomes:
- fork -> 101 mins, low RAM usage (5 GB, constant), almost no IO
- forkserver -> 123 mins, high RAM usage (16 GB, fluctuating), high IO
- forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
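For context, switching the start method between runs is a one-liner; below is a minimal sketch assuming the standard library `multiprocessing` API (the training launch itself is elided):

```python
import multiprocessing as mp

if __name__ == "__main__":
    # "fork" was the baseline run; "forkserver" was used for the other two.
    mp.set_start_method("forkserver")
    # ... start training / DataLoader workers as usual
```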
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
```python
import time

from clearml import Task  # `from trains import Task` on older versions

controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)

while True:
    periodic_task = Task.clone(template_task_id)
    # Change parameters of periodic_task if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS)
```
Here are the logs of the agent :)
```
(base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+------+
| id                               | name   | tags |
+----------------------------------+--------+------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker |      |
+----------------------------------+--------+------+
Starting infinite tas...
```
Ok, this I cannot locate
AnxiousSeal95 The main reason for me not to use clearml-serving triton is the lack of documentation tbh. I am not sure how to make my PyTorch model run there
That would be amazing!
So I updated the config with:
```
resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
        key_name = "<my-key-name>"
        security_group_ids = ["<my-sg-id>"]
        subnet_id = "<my-subnet-id>"
    }
}
```
but I get in the logs of the autoscaler:
```
Warning! exception occurred: An error occurred (InvalidParam...
```
I am also interested in the clearml-serving part
AgitatedDove14 yes, but I don't see in the docs how to attach it to the logger of the EarlyStopping handler
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent I mentioned used the output of nvcc (2) before checking the NVIDIA driver version (1)
Awesome! Thanks!
ExcitedFish86 I have several machines with different CUDA driver/runtime versions; that is why you might be confused, as I am referring to one or another
And if you need a very small change, you can also simply monkey-patch it ( https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/ )
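For illustration, a minimal generic monkey-patching sketch (the class and method here are made up, not from clearml):

```python
# Replace a method at runtime without touching the library's source.
class Reporter:
    def report(self) -> str:
        return "original"

def patched_report(self) -> str:
    # Injected behavior
    return "patched"

Reporter.report = patched_report  # the actual monkey-patch
assert Reporter().report() == "patched"
```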
What is weird is:
- Executing the task from an agent: `task.get_parameters()` returns an empty dict
- Calling `task.get_parameters()` from a local standalone script returns the correct properties, as shown in the web UI, even if I updated them in the UI.
So I guess the problem comes from trains-agent?
Yes, I stayed with an older version for a compatibility reason I cannot remember now - just tested with 1.1.2 and it's the same
I tried specifying the bucket directly in my clearml.conf, same problem. I guess ClearML just reads from the env vars first
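For reference, this is the kind of per-bucket section I mean in clearml.conf (bucket name and keys are placeholders):

```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    bucket: "my-bucket"
                    key: "<access-key>"
                    secret: "<secret-key>"
                }
            ]
        }
    }
}
```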
- Stopping the server
- Editing the docker-compose.yml file, adding the logging section to all services (see the sketch below)
- Restarting the server
This freed 10 GB of Docker logs.
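For reference, a sketch of the kind of logging section I added under each service (sizes are just examples, not the values I used):

```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```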
Notice the last line should not have `--docker`
Did you mean `--detached` ?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR
So it could be that when restarting the docker-compose stack, it used another volume, hence the loss of data