I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) when queuing a task, and have the agents pick up those tasks only if they have the requested resources. With this, the user need not think about which queue to send the task to; the users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation?
EmbarrassedSpider34
Sync_folder and upload several times along the code, and then...
Do notice they overwrite one another...
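A minimal sketch of what that means (dataset/folder names are hypothetical):
from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.sync_folder(local_path="./data_v1")  # dataset content now mirrors data_v1
ds.sync_folder(local_path="./data_v2")  # overwrites: content now mirrors data_v2 only
ds.upload()
ds.finalize()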
Is this information stored anywhere or do I need to explicitly log this data somehow?
On the creating Task, alongside all the other reports.
Basically each model stores its creating Task (Task ID); using the Task ID you can query all the metrics reported by the task.
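A minimal sketch (the model ID is a placeholder):
from clearml import Model, Task

model = Model(model_id="<model-id>")       # placeholder model ID
task = Task.get_task(task_id=model.task)   # the Task that created this model
scalars = task.get_reported_scalars()      # all scalars reported by that task
print(list(scalars.keys()))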
Is the agent itself registered on the clearml-server (a.k.a. can you see it in the UI?)
Hi GreasyPenguin14
Could you tell me what the differences are and why we should use ClearML data?
The first difference is in the approach itself: DVC ties the data with the code (i.e. the git repo), whereas we (ClearML, but not just us) think data should be abstracted from the code base and become a standalone argument, allowing users to build/execute against different datasets/versions. ClearML Data becomes part of the workflow as it is visible from the UI, including the abili...
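As a rough sketch of that decoupling (names/paths are hypothetical):
from clearml import Dataset

# create a standalone dataset version, independent of any git repo
ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.add_files(path="./data")
ds.upload()
ds.finalize()

# any code base can later fetch it by name/project, regardless of git state
local_copy = Dataset.get(dataset_name="my_dataset", dataset_project="datasets").get_local_copy()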
ProudMosquito87 I think this is what you are looking for: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L101
I suppose the same would need to be done for any client PC running clearml such that you are submitting dataset upload jobs?
Correct
That is, the dataset is perhaps local to my laptop, or on a development VM that is not in the clearml system, but from there I want to submit a copy of a dataset; then I would need to configure the storage section in the same way as well?
Correct
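For reference, a minimal sketch of that storage section in clearml.conf (credentials/region are placeholders):
sdk {
    aws {
        s3 {
            key: "ACCESS_KEY"        # placeholder
            secret: "SECRET_KEY"     # placeholder
            region: "us-east-1"
        }
    }
}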
Hi CleanPigeon16
You need to pass the private repository docker credentials to the AWS instance; I would use the custom bash script option of the AWS autoscaler to create the docker credentials file.
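For example, something along these lines in that bash script (registry and variable names are hypothetical, assuming the credentials are made available on the instance):
echo "${DOCKER_PASSWORD}" | docker login -u "${DOCKER_USERNAME}" --password-stdin registry.example.com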
GreasyPenguin14 thank you! That will make our life a lot easier 🙂
SubstantialElk6 on the client side?
I can definitely feel you!
(I think the implementation is not trivial: metrics data size is collected and stored as a cumulative value on the account, so going over it per Task is actually quite taxing for the backend. Maybe it should be an async request? Like "get me a list of the X largest Tasks"? How would the UI present it? Fyi, keeping some sort of bookkeeping per Task is not trivial either, hence the main issue)
BeefyCow3 see this https://allegroai-trains.slack.com/archives/CTK20V944/p1593077204051100 :)
Thanks for the ping ConvolutedChicken69, I missed it 🙂
from what i see in the docs it's only for Jupyter / VS Code, i didn't see anything about pycharm
PyCharm is basically SSH, which is supported 🙂
(Maybe we should mention it in the docs?)
The issue I want to avoid is aborting of the dataset task that these regular tasks update.
HelpfulHare30 could you post pseudo code of the dataset update?
(My point is, I'm not sure the Dataset actually supports updating in place, as it needs to re-upload the previous delta snapshot.) Wouldn't it be easier to add another child dataset and then use dataset.squash (like one would do in git)?
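A minimal sketch, where the names and the parent dataset ID are placeholders:
from clearml import Dataset

# add a child dataset holding only the new delta
child = Dataset.create(
    dataset_name="my_dataset_delta",
    dataset_project="datasets",
    parent_datasets=["<parent-dataset-id>"],
)
child.add_files(path="./new_data")
child.upload()
child.finalize()

# squash the lineage into a single standalone dataset (like squashing commits in git)
squashed = Dataset.squash(dataset_name="my_dataset_squashed", dataset_ids=[child.id])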
clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
If the user running this command can run "docker run", then you should be fine.
DeliciousBluewhale87 could you restart the pod and ssh to the host, and make sure the folder /opt/clearml/agent exists and there is no *.conf file in it?
Hi DangerousDragonfly8
You mean you want to trigger something when users archive a Task?
WorriedParrot51 I now see ...
Two solutions that I can quickly think of:
1. In the code add:
import sys
sys.path.append('./my_sub_module')
Assuming you always have to add the sub-directories to make the code work, and assuming they are part of the repository, this is probably the stable solution.
2. In the UI, in the Docker base image, add -e PYTHONPATH=/folder
or from code (which is exactly what you did), a cleaner interface:
task.set_base_docker("nvidia/cuda -e PYTHONPATH=/folder")
Hi GrievingTurkey78
task.models['output'][-1] should return the last stored model.
What do you have under task.models['output'][-1].url?
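i.e. a quick sketch to check it (the task ID is a placeholder):
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder task ID
last_model = task.models["output"][-1]     # last stored output model
print(last_model.url)                      # remote location of the weights file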
Hi FloppyDeer99
What is the meaning of "no real scheduling"?
I think the meaning is that from the moment a k8s job is created, k8s is in charge of actually spinning the container. Since k8s has no real priority/order, the scheduling order is not guaranteed from this point.
The idea of the clearml-k8s-glue is that the glue will launch a job on the k8s cluster only if it is sure there are enough resources to actually spin the job now (as opposed to sometime in the future), this mea...
CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of it. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more process workers; you just need to make sure the machine has enough memory to support it.
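On a docker-compose deployment this is usually an environment change on the apiserver service; a sketch, assuming your server version supports the CLEARML_USE_GUNICORN / CLEARML_GUNICORN_WORKERS variables (please verify against your version):
services:
  apiserver:
    environment:
      CLEARML_USE_GUNICORN: "1"      # assumption: run the apiserver under gunicorn
      CLEARML_GUNICORN_WORKERS: "8"  # assumption: number of worker processes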
can you tell me what the serving example is in terms of the explanation above and what the triton serving engine is,
Great idea!
This line actually creates the control Task (2):
clearml-serving triton --project "serving" --name "serving example"
This line configures the control Task (the idea is that you can do that even when the control Task is already running, but in this case it is still in draft mode).
Notice the actual model serving configuration is already stored on the crea...
Hi JealousParrot68
do tasks that are created through create_function_task run the entry_script again instead of just the pure function
Basically they will run the code until the create_function_task call, but never after it. We are working on adding a decorator to a function, making it a "standalone" script; is this what you actually need?
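A minimal sketch of that behavior (project/function names are arbitrary):
from clearml import Task

task = Task.init(project_name="examples", task_name="main")

def process(a, b):
    return a + b

# everything above runs as usual; the created function task executes only process()
func_task = task.create_function_task(func=process, func_name="process_task", a=1, b=2)
# code after this call does not run inside the created function task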
Could it be the credentials are actually incorrect? Because it seems like you can access the server? (I assume you were able to browse to it and generate credentials, right?)
🙂 Let me know if it solved the issue 🙂
Hi @<1663354518726774784:profile|CrookedSeal85>
I am trying to optimize storage on my ClearML file server when doing a lot of experiments.
This is not straightforward; you will need to get a list of all the events via
None
filter on image events
and then delete the URL you are getting via the StorageManager.
But to be honest, why not just direct it to S3 or something like that ?
@<1542316991337992192:profile|AverageMoth57> it sounds like you should use SSH authentication for the agent; just set force_git_ssh_protocol: true
None
And make sure you have the SSH keys on the agent's machine.
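i.e. in the agent's clearml.conf, a minimal sketch:
agent {
    force_git_ssh_protocol: true
}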
GrievingTurkey78 notice that when enqueuing an aborted Task, the agent will not delete the previously reported metrics/logs.