CostlyOstrich36 , this also happens with clearml-agent 1.1.1 on an AWS instance…
AgitatedDove14 That's a good point: the experiment failing with this error does show the correct AWS key: `... sdk.aws.s3.key = ***** sdk.aws.s3.region = ...`
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far 🙂
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs: there is no limit by default, so their size will grow forever, which doesn't sound ideal. https://docs.docker.com/compose/compose-file/#logging
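Something along these lines, if I read the Compose docs correctly (the service name and the size/count values are just examples I picked):
```yaml
services:
  apiserver:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate each container log file at ~10 MB
        max-file: "3"     # keep at most 3 rotated files per container
```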
So the problem comes when I do `my_task.output_uri = "s3://my-bucket"` : trains, in the background, checks whether it has access to this bucket and is not able to find/read the creds
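For reference, this is roughly the section of my trains.conf where I would expect it to pick up the creds (all values here are placeholders):
```
sdk {
    aws {
        s3 {
            key: "<access-key>"
            secret: "<secret-key>"
            region: "<region>"
            # per-bucket credentials, in case the defaults above are not used
            credentials: [
                {
                    bucket: "my-bucket"
                    key: "<access-key>"
                    secret: "<secret-key>"
                }
            ]
        }
    }
}
```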
with what I shared above, I now get: `docker: Error response from daemon: network 'host' not found.`
Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task id is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to manually set the parent task for each one of them, probably with a similar hack, right?
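For context, the pattern in the controller is roughly this (a sketch - the project/template/queue names are made up, and I'm not 100% sure parent needs to be passed explicitly when cloning from inside the controller):
```python
from clearml import Task

controller = Task.current_task()

# template experiment to clone for each split (name is a placeholder)
template = Task.get_task(project_name="my-project", task_name="training-template")

for split in ("training", "validation", "testing"):
    child = Task.clone(
        source_task=template,
        name=f"{split} task",
        parent=controller.id,  # make the controller the parent of the cloned task
    )
    Task.enqueue(child, queue_name="default")
```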
There is an example in the https://github.com/allegroai/clearml/blob/master/docs/datasets.md#workflow section of the link I shared above
btw I monkey-patched ignite's global_step_from_engine function to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)) . It prints the correct iteration number when ClearMLLogger.OutputHandler.__call__ is called.
```python
def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ...
```
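The patch itself was roughly along these lines (a sketch from memory, not the exact code I ran):
```python
from ignite.handlers import global_step_from_engine

def patched_global_step_from_engine(engine):
    # wrap ignite's global_step_from_engine so the step that actually gets
    # handed to the ClearML output handler is printed before being returned
    wrapped = global_step_from_engine(engine)

    def global_step_transform(_engine, event_name):
        step = wrapped(_engine, event_name)
        print(f"global step passed to ClearMLLogger: {step}")
        return step

    return global_step_transform
```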
AgitatedDove14 any chance you found something interesting? 🙂
ClearML has a task.set_initial_iteration , I used it as such:
```python
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
```
But still the same issue; I am not sure whether I use it correctly and whether it's a bug or not, AgitatedDove14 ? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Mmmh, unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code, to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task
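Something like this, conceptually (I don't know whether this is actually supported; the task id is a placeholder):
```python
from clearml import Task

# conceptual sketch, not working code: attach to the already-created task
# instead of creating a new one, so the rest of the code keeps using it
task = Task.get_task(task_id="<existing-task-id>")
# ...and ideally Task.current_task() would then return this same task
```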
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak
but not as much as the ELB reports
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd -> wrong numpy version
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
mmmh it fails, but if I connect to the instance and execute ulimit -n , I do see 65535 , while the tasks I send to this agent fail with:
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
```python
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
```
which gives me in the logs of the task: b'1024' . So nofiles is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
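If the limit inside the container is the culprit, maybe I can raise it by passing a docker `--ulimit` flag through the agent config - a sketch, assuming `agent.extra_docker_arguments` in clearml.conf forwards these to the docker run command (values are examples):
```
agent {
    # extra arguments appended to the docker run command of each task container
    extra_docker_arguments: ["--ulimit", "nofile=65535:65535"]
}
```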
There’s a reason for the ES index max size
Does ClearML enforce a max index size? What typically happens when that limit is reached?
Whohoo! Thanks 👌
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
I think this is because this API is not available in elastic 5.6
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
yes, the new project is the one where I changed the layout, and the layout gets reset when I move an experiment there
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
Is there any logic on the server side that could change the iteration number?
AgitatedDove14 Up 🙂 I would like to know if I should wait for the next release of trains or if I can already start implementing Azure support