It's dead simple to install:
pip install trains-agent
then you can simply do:
trains-agent execute --id myexperimentid
Thanks SparklingHedgehong28
So I think I'm missing information on what you call "Instance protection" ?
You mean like respinning spot instances? Or is it a way to review the performance of the AWS ASG (i.e. like a watchdog of sorts)?
Okay, that kind of makes sense. Now my follow-up question is: how are you using the ASG? I mean the clearml autoscaler does not use it, so I just wonder what the big picture is, before we solve this little annoyance 🙂
So this should be easier to implement, and would probably be safer.
You can basically query all the workers (i.e. agents) and check if they are running a Task, then if they are not (for a while) remove the "protection flag"
wdyt?
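Something along these lines might work (a rough sketch, assuming the standard APIClient worker listing; the actual "remove protection" call is on the AWS side and is only a placeholder here):
```python
from clearml.backend_api.session.client import APIClient

# Rough sketch: list all registered workers (agents) and flag the idle ones.
# The ASG scale-in protection removal itself would be done via boto3 and is
# only hinted at in the print below.
client = APIClient()
for worker in client.workers.get_all():
    running_task = getattr(worker, "task", None)
    if not running_task:
        # worker is idle, so it is a candidate for removing instance protection
        print(f"Worker {worker.id} is idle, remove ASG scale-in protection here")
```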
My task starts up and checks the mounted EFS volume for x data, if x data does not exist there, it then pulls x data from S3.
BoredHedgehog47 you can just use StorageManager and configure clearml cache for the EFS, it will essentially do the same 🙂
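For example, something like this (a minimal sketch; the S3 path is made up, and it assumes the clearml cache folder, the sdk.storage.cache section in clearml.conf, points at the EFS mount):
```python
from clearml import StorageManager

# Minimal sketch: the S3 URL below is hypothetical.
# With the clearml cache directory on the EFS mount, the first call downloads
# from S3 into EFS, and later calls (from any machine sharing the mount)
# reuse the cached copy instead of pulling from S3 again.
local_path = StorageManager.get_local_copy(remote_url="s3://my-bucket/datasets/x_data.zip")
print(local_path)
```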
Regarding the helm chart with EFS,
you need to configure the clearml-glue pod template with the EFS mount
example :
https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/e7f647f4e6fc76f983d61522e635353005f1472f/examples/kubernetes/volu...
SparklingHedgehong28 this is actually quite cool! Still not sure why not just use the built-in autoscaler https://github.com/allegroai/clearml/tree/master/examples/services/aws-autoscaler , but it is a really cool usage of ASG 🤩
EFS get downloaded to the k8 pod local volume?
EFS is an Amazon service that mounts a persistent FS into EC2 instances. I believe they have support for k8s as a service as well, which would make it kind of like a PV, only as a service.
Does that make sense ?
Hi BoredHedgehog47
You mean like EFS for caching ?
the services queue (where the scaler runs) will be automatically exposed to new EC2 instance?
Yes, using this extra_clearml_conf parameter you can add configuration that will be passed to the clearml.conf of the instances it will spin up.
Now an example of the values you want to add: agent.extra_docker_arguments: ["-e", "ENV=value"]
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L149
wdyt?
I have timeseries dataset with dimension 1,60,1 which the first dimension is number of data, the second one is timestep
I think it should be --input-size 1 60 if the last dimension is the batch size?
(BTW: this goes directly to Triton configuration, it is the information Triton needs in order to run the model itself)
does this work for multiple levels?
Yep 😄
Hmm EmbarrassedPeacock82
Let's try with --input-size -1 60 1 --aux-config input.format=FORMAT_NCHW
BTW: this seems like a triton LSTM configuration issue, we might want to move the discussion to the Triton server issue, wdyt?
It said the command --aux-config got invalid input
This seems like an interface bug.. let me see if we can fix that 🙂
BTW: this seems like a triton LSTM configuration issue, we might want to move the discussion to the Triton server issue, wdyt?
Definitely!
Could you start an issue at https://github.com/triton-inference-server/server/issues , and I'll join the conversation?
Is there any reference about integrating Kafka data streaming directly to clearml-serving...
Thanks @<1630377234361487360:profile|RoughSeaturtle43>
server certificate verification failed. CAfile: none CRLfile: none
Oh I see, this is an HTTPS issue inside the container, you need to mount your self-signed certificate.
add something like that to your agent.conf:
extra_docker_arguments: ["-v", "/path/to/cert.pem:/etc/ssl/certs/myca.pem"]
I want is to manually provide a name to each series equal to the subject name (Subject 1, Subject 2, etc.)
They appear as they are reported to TB. I think this is a PyTorch Lightning thing... If you look at the TB it produced, you will get the same naming scheme, no?!
On my to do list, but will have to wait for later this week (feel free to ping on this thread to remind me).
Regarding the issue at hand, let me check the requirements it is using.
Hi @<1630377234361487360:profile|RoughSeaturtle43>
code from gitlab repo with ssl cert.
what do you mean by ssl secret? is it SSH or app-token ?
however when I clone or reset said task after completion and then enqueue it again, I get the above error.
This part is somewhat confusing... There is no magic happening behind the scenes, cloning a Task and creating it, is basically the same ... Do you have a reference to the YOLOv5 code base itself, maybe I can figure out what's the issue?
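For reference, programmatically it boils down to something like this (a sketch only; the task ID and queue name are placeholders), which is basically what the UI clone + enqueue does:
```python
from clearml import Task

# Sketch: "<original_task_id>" and "default" are placeholders.
# Cloning creates a new draft Task with the same configuration;
# enqueuing sends it to an agent for execution.
cloned_task = Task.clone(source_task="<original_task_id>", name="clone of my YOLOv5 run")
Task.enqueue(cloned_task, queue_name="default")
```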
Thanks GentleSwallow91
That's a good tip, where in the docs would you add it?
You're suggesting that the false is considered a string and not a bool?
The clearml-server always stores the values as strings (serializing them), and the casting is done when they are passed back to the code at runtime. The issue here is there is actually no "way" to tell the argparser this is a boolean (basically any value that is passed is treated as a string). What I think we should do is fix the casting function so that if this is exactly the same value we use the default value (i.e. boole...
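To illustrate the argparse limitation (plain Python, nothing clearml-specific):
```python
import argparse

parser = argparse.ArgumentParser()
# argparse has no real boolean casting: type=bool just calls bool() on the
# incoming string, and any non-empty string (including "false") is truthy.
parser.add_argument("--flag", type=bool, default=False)

args = parser.parse_args(["--flag", "false"])
print(args.flag)  # prints True, because bool("false") == True
```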
Anyway, in the docs, there is a function called task.register_artifact()
Yes, this is rather deprecated... The idea is that it will monitor an object and auto-sync it (i.e. serialize and upload it).
That said, it is just so much easier to do task.upload_artifact (and you can always update/overwrite if you are passing the same name) that I cannot see the actual use case. Does that make sense? What are you using it for?
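For example (a minimal sketch, the project/artifact names are arbitrary):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact demo")

# Upload once; calling upload_artifact again with the same name overwrites it.
task.upload_artifact(name="results", artifact_object={"accuracy": 0.93})
task.upload_artifact(name="results", artifact_object={"accuracy": 0.95})  # overwrite
```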
I want to build a real time data streaming anomaly detection service with clearml-serving
Oh, so the way it currently works is that clearml-serving will push the data in real-time into Prometheus (you can control the stats/input/output), then you can build the anomaly detection in Grafana (for example, alerts on histograms over time are out-of-the-box, and clearml creates the histograms over time).
Would you also need access to the stats data in Prometheus ? or are you saying you need to process it ...
Hi GracefulDog98
The agent will map the ~/.ssh folder automatically into the docker's /root/.ssh
It will also convert http links to ssh pull if you set force_git_ssh_protocol
in your clearml.conf :
https://github.com/allegroai/clearml-agent/blob/351f0657c3dcf707659875d7e0a52fa387709978/docs/clearml.conf#L25
This is assuming you can just run two copies of your code, and they will become aware of one another.
Hangs there ? could it be that it's uploading slowly ?
Can you check the network ?
, but it seems like I can only trigger a task using a Task scheduler but not a pipeline.
@<1523701132025663488:profile|SlimyElephant79> Maybe we should state it better, but a Pipeline is "just" another type of Task, so triggering a Task with the Pipeline ID is essentially triggering the pipeline (do notice you need to select the "services" queue to be used so that the pipeline runs on the correct resource). Make sense?
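Something like this should do it (a sketch; the pipeline Task ID is a placeholder and the schedule is just illustrative):
```python
from clearml.automation import TaskScheduler

# Sketch: "<pipeline_task_id>" is a placeholder. A pipeline is just another Task,
# so its ID can be scheduled like any other Task.
scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<pipeline_task_id>",
    queue="services",   # pipelines are expected to run on the services queue
    minute=30,          # illustrative: trigger every hour at minute 30
)
scheduler.start_remotely(queue="services")
```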
Hi SkinnyPanda43
Are you trying to access the same Task or an external one ?