MotionlessCoral18 I think there is a fix in the latest clearml-agent RC 1.4.0rc0. Can you test and update if you are still having this issue?
DeliciousBluewhale87 Yes I think so, but do notice that you might end up with a maximum of 12 pods.
You can also do the following with a maximum of 10 nodes (notice --queue can always take a list of queues; tasks are pulled based on the order of the queues):
python k8s_glue_example.py --queue high_priority_q low_priority_q --ports-mode --num-of-services 10
Hi @<1523701240951738368:profile|RoundMosquito25>
Sure you can 🙂
from clearml import Task

task = Task.get_task("task_id_here")
# returns the last reported value of every scalar metric
metrics = task.get_last_scalar_metrics()
print(metrics[":monitor:gpu"])
shows that the trains-agent is stuck running the first experiment, not
The trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee process is the second trains-agent instance running inside the docker; if the task is aborted, this process should have quit...
Any suggestions on how I can reproduce it?
PlainSquid19 Trains will analyze the entire repository if the code is part of a git repo, and only the single script file if no repository is found.
It will not analyze an entire folder if it is not in a git repository, because it would not be able to recreate this folder anyhow. Does that make sense?
Hi @<1524560082761682944:profile|MammothParrot39>
By default you have the last 100 iterations there (not sure why you are only seeing the last 3), but this is configurable:
None
Hi HandsomeCrow5 .
Remember the debug images are events with links to the actual images, so you first have to get the events, and then you can download the images with https://allegro.ai/docs/examples/examples_storagehelper/#storagemanager (which by definition has the credentials, because it was able to upload them 🙂).
To get the events:
from trains.backend_api.session.client import APIClient
client = APIClient()
client.events.debug_images(task='aabbcc')
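For example, a rough sketch of iterating the response and downloading each image. The nested response fields (metrics -> iterations -> events -> 'url') are my assumption here; verify them against your server version:
from trains.backend_api.session.client import APIClient
from trains.storage import StorageManager

client = APIClient()
res = client.events.debug_images(task='aabbcc')
# assumed structure: each metric holds iterations, each iteration holds events with a 'url'
for metric in res.metrics:
    for iteration in metric.iterations:
        for event in iteration.events:
            # downloads the remote image to the local cache and returns the local path
            local_path = StorageManager.get_local_copy(event['url'])
            print(local_path)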
Hi AdventurousRabbit79
Try:"extra_clearml_conf" : "aws { s3 {key: A, secret : B, region: C, }} ",Generally speaking no need for the quotes on the secret/key
You also need the comma to separate between keys.
You can test if it is working by adding the same string to your local clearml.conf and importing the cleaml package
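For reference, the equivalent section in a local clearml.conf would look roughly like this (values are placeholders):
sdk {
  aws {
    s3 {
      key: "A"
      secret: "B"
      region: "C"
    }
  }
}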
Hi @<1663354518726774784:profile|CrookedSeal85>
I am trying to optimize storage on my ClearML file server when doing a lot of experiments.
This is not straightforward; you will need to get a list of all the events via
None
filter on the image events, and then delete the URL you are getting via the StorageManager.
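A rough sketch of that flow; the response fields and the StorageHelper.delete call are assumptions on my side, so double-check both against your SDK version:
from clearml.backend_api.session.client import APIClient
from clearml.storage.helper import StorageHelper

client = APIClient()
res = client.events.debug_images(task='your_task_id_here')
# assumed structure: metrics -> iterations -> events, each event holding a 'url'
for metric in res.metrics:
    for iteration in metric.iterations:
        for event in iteration.events:
            url = event['url']
            # get the storage helper matching the URL scheme and delete the remote file
            helper = StorageHelper.get(url)
            helper.delete(url)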
But to be honest, why not just direct it to S3 or something like that ?
Hmm that makes sense. I "think" the enterprise offering has a solution for that as well (i.e. full separation over a static cluster), but probably the best way to pursue this avenue is to talk to Sales (I'm assuming they'll set up a call to discuss the details).
Going back to the open source: I think that adding the credentials as part of the source code might allow the "credentials" to auto-populate as part of the remote execution, wdyt?
CloudyHamster42 FYI the warning will not be shown in the next Trains version, the issue is now fixed, thank you 🙂
Regarding the double axes, see if adding plt.clf() helps. It seems the axes are leftovers from the previous figure that somehow are still there...
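A minimal sketch of the idea, clearing the current figure before drawing the next one:
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])
plt.title("first figure")
plt.show()

plt.clf()  # drop the leftover axes before the next figure
plt.plot([1, 2, 3], [2, 4, 6])
plt.title("second figure")
plt.show()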
Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?
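Something along these lines, assuming your clearml-agent version supports the --create-queue flag (the queue name is a placeholder):
clearml-agent daemon --queue my_queue --create-queue --docker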
Hmm, what does your preprocessing code look like?
No worries. You should probably change it to pipe.start(queue='queue'), not start it locally.
Is it working when you are calling it with start locally?
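For context, a minimal sketch of the two launch modes (pipeline/project/queue names are placeholders):
from clearml import PipelineController

pipe = PipelineController(name='my pipeline', project='examples', version='1.0.0')
# ... add pipeline steps here ...

pipe.start(queue='services')  # enqueue the controller so an agent runs it remotely
# pipe.start_locally()        # or run the controller in the current process, for debugging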
One issue that I see is that the Dockerfile inside the agent container
Not sure I follow, these are settings for the default container to be used when the agent spins a Task for you.
How are you running the agent itself ?
Whoa, are you saying there's an autoscaler that doesn't use EC2 instances?...
Just to be clear, the ClearML Autoscaler (AWS) will spin instances up/down based on jobs in the queue it is listening to (the type of EC2 instance and its configuration are fully configurable).
@<1546303293918023680:profile|MiniatureRobin9>
, not the pipeline itself. And that's the last part I'm looking for.
Good point, any chance you want to PR this code snippet ?
def add_tags(self, tags):
    # type: (Union[Sequence[str], str]) -> None
    """
    Add Tags to this pipeline. Old tags are not deleted.
    When executing a Pipeline remotely (i.e. launching the pipeline from the UI/enqueuing it), this method has no effect.

    :param tags: A li...
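If this lands as a PipelineController method, usage would presumably be as simple as (tag names are placeholders):
pipe.add_tags(['nightly', 'best-model'])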
(also, could you make sure all posts regarding the same question are put in the thread of the first post to the channel?)
BattyLion34 are you saying you do not have the "APP CREDENTIALS" section in the profile page?
Yes clearml is much better 🙂
(joking aside, mlops & orchestration in clearml is miles better)
CheerfulGorilla72 What are you looking for?
Hi LudicrousParrot69
A bit of background:
A Task is a job executed in the system (sometimes it is an experiment training, sometimes a controller like the pipeline). Basically every process can be a Task.
Specifically, the pipeline controller itself (i.e. the process running the Bayesian optimization) is a Task in the system (i.e. a job running). What it does (using the HyperParameterOptimizer) is clone previously executed Tasks (e.g. training experiments), change their parameters and moni...
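To make that flow concrete, a rough sketch of such a controller (the template task id, metric names and parameter ranges are placeholders):
from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange

# the controller itself is a Task in the system
task = Task.init(project_name='examples', task_name='HPO controller', task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id='template_task_id_here',  # previously executed training Task to clone
    hyper_parameters=[
        UniformIntegerParameterRange('General/batch_size', min_value=32, max_value=128, step_size=32),
    ],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    execution_queue='default',  # queue where the cloned experiments are enqueued
)
optimizer.start()
optimizer.wait()
optimizer.stop()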
I'm wondering why this is the case, as docker best practices do indicate we should use a non-root user on production images.
The docker image for the service-agent is not running as root, no?
Hi VexedCat68
So if I understand correctly, the issue is this argument:
parameter_override={'Args/dataset_id': '${split_dataset.split_dataset_id}', 'Args/model_id': '${get_latest_model_id.clearml_model_id}'},
I think what is missing is telling it this is an artifact:
parameter_override={'Args/dataset_id': '${split_dataset.artifacts.split_dataset_id.url}', 'Args/model_id': '${get_latest_model_id.clearml_model_id}'},
You can see the example here:
https://clear.ml/docs/latest/docs/ref...
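For context, the override would sit inside the pipeline step definition, roughly like this (step/project/task names are placeholders):
pipe.add_step(
    name='train',
    parents=['split_dataset', 'get_latest_model_id'],
    base_task_project='examples',
    base_task_name='train task',
    parameter_override={
        'Args/dataset_id': '${split_dataset.artifacts.split_dataset_id.url}',
        'Args/model_id': '${get_latest_model_id.clearml_model_id}',
    },
)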
An upload of 11GB took around 20 hours which cannot be right.
That is very, very slow; this is roughly 152 KB/s ...
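For the arithmetic: 11 GB ≈ 11 × 10⁹ bytes over 20 hours (72,000 seconds) comes to roughly 153,000 bytes/s, i.e. about 150 KB/s, or ~1.2 Mbps.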
Thanks VexedCat68 !
This is a great example, maybe PR it to the clearml-serving repo? wdyt?
RoughTiger69 yes, I think the "Scale" tier covers it 😉
DistressedGoat23 you are correct: since at the end this becomes a plotly object, the extra_layout is for general-purpose layout, but this specific entry sits next to the data. Bottom line, can you open a GitHub issue, so we do not forget to fix it? In the meantime you can use the general plotly reporting as SweetBadger76 suggested.
StraightDog31 can you elaborate? Where are the parameters stored? Who is trying to access them, and maybe for what purpose?