No, an old experiment changed, nothing was rerun
Ohh, that is odd. I think the max iteration value is stored in the DB, which makes it odd that it changed after an update.
BTW: just making sure, could it be these Tasks were imported? (i.e. offline execution + import)
Can configuration objects refer to one another internally in ClearML?
Interesting, please explain?
Hi WickedGoat98
This sounds like a great design (obviously you have scale in mind 🙂). Feel free to ask "stupid" questions; based on what you already wrote, I doubt they will be.
A few questions that come to mind (probably a few others after):
You mentioned FS synchronization, from where? i.e. what is the single source of truth? K8s (Rancher 2.0 is basically a k8s manager) can take care of mounting volumes, so there is no need to sync. Is this a valid solution?
BTW: (you can drag and drop an i...
and about a month later for some reason the initial iteration seems to have changed to 0
Hmm, I see your point. Just so I fully understand: you are not saying old experiments were changed, but that new experiments (running the same code-ish) have a totally different max iterations value. Is this correct?
We're not using a load balancer at the moment.
The easiest way is to add an ELB and have Amazon add the HTTPS on top (basically a few clicks on their console).
The data I'm syncing comes from a data provider which supports only an FTP connection....
Right ... that makes sense :)
No worries WickedGoat98, feel free to post questions when they arise. BTW: we are now improving the k8s glue, so by the time you get there the integration will be even easier 🙂
Should work, follow the backup process, and restore into a new machine:
Hmm, is this similar to this one? https://allegroai-trains.slack.com/archives/CTK20V944/p1597845996171600?thread_ts=1597845996.171600&cid=CTK20V944
Hi JitteryCoyote63
The NVIDIA_VISIBLE_DEVICES environment variable is set automatically for the process the trains-agent spins up, so from your code it is transparent: you can only "see" GPU 0.
(Obviously, when not using Docker you can forcefully change the OS environment at runtime, but you should avoid that ;))
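If you want to sanity-check this from inside your training code, here is a minimal sketch (assuming PyTorch is installed, purely for illustration):

import os

import torch

# The agent sets NVIDIA_VISIBLE_DEVICES before spawning the process,
# so only the allocated GPU(s) are visible, renumbered starting from 0.
print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))

# From the process's point of view the allocated GPU is always device 0
print("visible GPU count:", torch.cuda.device_count())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")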
The .ssh is mounted, but the owner is my local user,
sudo -H clearml-agent ...
to allow sudo to access the home directory
ScantWorm7
TensorBoard is automatically captured and sent to the trains server. This is in addition to the local copy of your TB files. Actually, in most cases the local copy is redundant.
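For reference, a minimal sketch of that flow (assuming the clearml package and PyTorch's SummaryWriter; project/task names are placeholders, and with older versions the import would be from trains instead of clearml):

from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# As soon as Task.init is called, TensorBoard reports are captured
# and sent to the server, in addition to the local event files.
task = Task.init(project_name="examples", task_name="tb auto capture")

writer = SummaryWriter(log_dir="./runs")  # the local copy, usually redundant
for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)
writer.close()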
What do you have here in your docker-compose:
Hi RipeGoose2
Can you try with the latest from git?
pip install -U git+
Okay, I think I understand, but I'm missing something. It seems you call get_parameters from the old API; is your code actually calling get_parameters? The trains-agent runs the code externally, so whatever happens inside the agent should have no effect on the code. So who exactly is calling task.get_parameters, and, well, why? :)
RoughTiger69 yes, I think the "Scale" tier covers it 🙂
To store all the debug samples; it can also store all the models (if you configure output_uri='http://file_server_here:8081'). Yes: instead of the file server, use 's3://<ip_of_minio>:9000/bucket' and make sure you add the credentials for the MinIO in the trains.conf. Yes, basically once you have the credentials in the trains.conf, you could do StorageManager.get_local_copy('s3://<minio>:9000/bucket/file') (and also upload, of course 🙂).
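On the code side it would look roughly like this sketch (the MinIO address, bucket, and project/task names are placeholders; the credentials themselves still live in trains.conf / clearml.conf):

from clearml import Task, StorageManager

# Send models/artifacts to the MinIO bucket instead of the file server
task = Task.init(
    project_name="examples",
    task_name="minio storage",
    output_uri="s3://<ip_of_minio>:9000/bucket",
)

# With the credentials in the conf file, downloads work transparently
local_path = StorageManager.get_local_copy("s3://<ip_of_minio>:9000/bucket/file")
print(local_path)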
PompousBeetle71 you can check this example:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_torch_distributed.py
I think it should help. If you want a more manual approach, you can check the Popen subprocesses here:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_subprocess.py
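The manual route boils down to something like this sketch (worker.py, the environment variable names, and the worker count are assumptions for illustration, not part of the linked example):

import os
import subprocess
import sys

from clearml import Task

# The parent process owns the Task; each worker gets the task id via the environment
task = Task.init(project_name="examples", task_name="manual subprocess workers")

workers = []
for rank in range(4):
    env = dict(os.environ, WORKER_RANK=str(rank), PARENT_TASK_ID=task.id)
    workers.append(subprocess.Popen([sys.executable, "worker.py"], env=env))

for proc in workers:
    proc.wait()

Inside worker.py each process could then report back into the same Task, e.g. with Task.get_task(task_id=os.environ['PARENT_TASK_ID']).get_logger().report_scalar(...).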
Hmm, you will have to set up the trains-server on a machine somewhere; it can be any machine: Windows / Mac / Linux.
Hi WickedGoat98
Will I need to wrap their execution in Python with system calls?
That would probably be the easiest solution 🙂
Then you can plug it into your pipeline as a preprocessing Task:
You can check this example:
https://github.com/allegroai/trains/tree/master/examples/pipeline
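For the wrapping itself, a minimal sketch (the binary name and its arguments are placeholders):

import subprocess

from clearml import Task

# A thin Task wrapping an external executable, so it can be cloned,
# enqueued, and plugged into a pipeline as a preprocessing step
task = Task.init(project_name="examples", task_name="preprocessing wrapper")

result = subprocess.run(
    ["./my_preprocessing_binary", "--input", "data/raw", "--output", "data/clean"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)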
Hi CrookedSeal85
However, I systematically notice a jump of some number of "ghost iterations" when resuming my trainings...
Try the following:
task = Task.init(..., continue_last_task=0)
From the Task.init docstring (notice this value can be both a boolean and an integer):
:param bool continue_last_task: Continue the execution of a
...
- An integer - Specify initial iteration offset (override the auto automatic last_iteratio...
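Putting it together, a minimal sketch (project and task names are placeholders):

from clearml import Task

# continue_last_task=0 continues the previous task and sets the initial
# iteration offset to 0, instead of the automatic last-iteration offset
task = Task.init(
    project_name="examples",
    task_name="resume training",
    continue_last_task=0,
)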
looks like service-writing-time for me!
Nice!
persist/restore state so that tasks are restartable?
You mean if you write preemption-ready training code?
I'm wondering what happens if I were to host the instance and one of these were to go down from time to time in production, as the deployments provided by the Helm chart are not redundant.
Long story short, it will break the clearml-server. Please do not take them down; if you do need to do that, also take down the clearml-server. The Python clients will wait until it is up again, so no session will be destroyed.
are you referring to the same line? 47 in cache.py?
preempting lower priority tasks to allow a higher priority task to come in
Well, this is usually outside the scope of a "single researcher" / "tiny team"...
This is typically a large-scale problem.
That said, it would be fairly easy to write a service that aborts Tasks, tags them as "to be continued", and then later (at night?!) pushes them back into a queue... wdyt?
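A rough sketch of such a service (the project name, queue name, and tag are hypothetical, and the exact filter/abort calls may differ between versions):

import time

from clearml import Task

LOW_PRIORITY_PROJECT = "low_priority"  # hypothetical project
QUEUE_NAME = "default"                 # hypothetical queue
TAG = "to_continue"                    # hypothetical tag


def preempt_running_tasks():
    # Abort currently running low-priority tasks and tag them for later
    for task in Task.get_tasks(
        project_name=LOW_PRIORITY_PROJECT,
        task_filter={"status": ["in_progress"]},
    ):
        task.mark_stopped()
        task.add_tags([TAG])


def requeue_tagged_tasks():
    # Later (at night?!) push the tagged tasks back into the queue
    for task in Task.get_tasks(
        project_name=LOW_PRIORITY_PROJECT,
        task_filter={"tags": [TAG]},
    ):
        Task.enqueue(task, queue_name=QUEUE_NAME)


if __name__ == "__main__":
    preempt_running_tasks()
    time.sleep(8 * 60 * 60)  # wait until "night"
    requeue_tagged_tasks()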
Are you getting the error from boto failing to launch additional EC2 instances?
Thanks DefeatedOstrich93
Let me check if I can reproduce it.
But adding a simple force_download flag to the get_local_copy
That sounds like a good idea
HealthyStarfish45
No, it should work 🙂