so if the node went down and then some other node came up, the data is lost
That might be the case. Where is the k8s running? A cloud service?
None of them is problematic, this is what I'm trying to say 🙂
I think the minio browser gets confused.
If you want to test the upload time on the client you can try:
from time import time
task.flush(wait_for_uploads=True)
tic = time()
task.upload_artifact('test', '/tmp/localfile')
task.flush(wait_for_uploads=True)
print(time() - tic)
Hmm you will have to set up the trains-server on a machine somewhere, it can be any machine: Windows / Mac / Linux
task = Task.get_task(project_name='project', task_name='best_model_ever')
Great!
BTW: you can take some inspiration from here:
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
Or from the full pipeline:
https://github.com/allegroai/trains/blob/master/examples/pipeline/pipeline_controller.py
BattyLion34 are you saying you do not have the "APP CREDENTIALS" section in the profile page?
And are you sure you are pointing to the correct API server and not mixing the API address with the WEB address?
Also what's the clearml-server version?
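A quick way to sanity check the API vs WEB mix-up is to look at the port in the configured address. The sketch below is only an illustration: it assumes a stock clearml-server deployment with the default ports (8080 for the web UI, 8008 for the API server, 8081 for the file server), and the helper names guess_service and looks_like_api_host are made up for this example, not part of the clearml package.

```python
from urllib.parse import urlparse

# Assumption: default ports of an unmodified clearml-server deployment.
DEFAULT_PORTS = {8080: "web", 8008: "api", 8081: "files"}


def guess_service(url: str) -> str:
    """Guess which server a URL points at, based only on its port."""
    port = urlparse(url).port  # None when the URL carries no explicit port
    return DEFAULT_PORTS.get(port, "unknown")


def looks_like_api_host(url: str) -> bool:
    """True when the URL uses the default API port."""
    return guess_service(url) == "api"


if __name__ == "__main__":
    # Hypothetical value copied from a clearml.conf "api_server" entry.
    configured = "http://localhost:8080"
    if not looks_like_api_host(configured):
        print(f"{configured} looks like the {guess_service(configured)} address, not the API server")
```

If the server sits behind a reverse proxy or uses custom ports, the port check tells you nothing and the helper returns "unknown".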
Hi @<1729309120315527168:profile|ShallowLion60>
How did you create those credentials ?
if i put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
pipe.start should be called after the pipeline was constructed and should be the "last" call of the script.
Not sure I follow what is "before" the code?
WackyRabbit7 you can configure the AWS autoscaler with two types of instances, with priority to one of them. So in theory you do not need two autoscaler processes; with that in mind I "think" a single IAM should suffice
StaleButterfly40 are you sure you are getting the correct image on your TB (toy255) ?
I want to be able to delete only the logs since they are taking a lot of space in my case.
I see... I do not think this is possible 😞
You can disable the auto logging though: pass auto_connect_streams=False to Task.init
No need, it should auto close it if you started it with Task.init (or the agent executed it)
Sorry my bad, you are looking for:
None
Could you right click on the failed experiment, select reset, and send it again for execution?
Could that error be a random network issue ?
(Basically this seems like a generic network error not actually related to the trains-agent)
Is the trains-agent running in docker mode or venv mode?
Open source defaults 😊
Hi @<1607909176359522304:profile|UnevenCow76>
followed the below documentation to implement the clearml monitoring using prometheus and grafana
Did you try following this example, it includes both deploying a model and adding grafana metrics:
None
Hmm yes we should probably provide metrics:
client.workers.get_stats(..., items=[dict(key='cpu_usage'), dict(key='gpu_usage')])
When I give my Minio to output_uri argument, it uploads 500 KB /sec as before.
But it worked well when using StorageManager and uploading to the minio directly, is that correct?
.. I give my Minio to output_uri argument
How long did it take to run the demo code I posted?
(The one you mentioned took 0.16s to run locally)
BTW: we are now adding "datasets chunks for a more efficient large dataset storage"
but this gives me an idea, I will try to check if the notebook is considered as trusted, perhaps it isn't and that causes issues?
This is exactly what I was thinking (communication with the jupyter service is done over http, to localhost, sometimes AV/Firewall software will block it, false-positive detection I assume)
ThickDove42 looking at the code, I suspect it fails interacting with the actual jupyter server (that is running on the same machine, but still).
Any chance you have a firewall on the Windows machine ?
Feel free to add to the UI request list:
https://github.com/allegroai/trains/issues/81
Hi HealthyStarfish45
Funny just today I had a similar discussion on slurm:
https://allegroai-trains.slack.com/archives/CTK20V944/p1603794531453000
Anyhow, when you say "[scale up agents]" are you referring to a machine constantly running an agent pulling jobs from the queue, where the machine itself (aka the resource) is managed as a slurm job?
Hi @<1673501397007470592:profile|RelievedDuck3>
how can I configure my alerts to be notified when the distribution of my metrics (variables) changes on my heatmaps?
This can be done inside grafana, here is a simple example:
None
Specifically you need to create a new metric that is the distance of the current distribution (i.e. heatmap) from the previous window, then on the distance metric, ...
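To make the "distance of the current distribution from the previous window" idea concrete, here is a minimal Python sketch of what such a distance metric computes. The thread does not name a distance; Jensen-Shannon divergence is used here purely as an assumption, and the function names, the sample bucket counts, and the 0.1 threshold are all hypothetical. In Grafana itself this would be expressed as a query over the histogram buckets, not as Python.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Turn raw histogram bucket counts into a probability distribution."""
    smoothed = [c + eps for c in counts]  # avoid log(0) on empty buckets
    total = sum(smoothed)
    return [c / total for c in smoothed]


def js_divergence(current: Sequence[float], previous: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two histograms, in [0, 1]."""
    if len(current) != len(previous):
        raise ValueError("histograms must have the same number of buckets")
    p, q = normalize(current), normalize(previous)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: List[float], y: List[float]) -> float:
        return sum(a * math.log2(a / b) for a, b in zip(x, y))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drifted(current: Sequence[float], previous: Sequence[float],
            threshold: float = 0.1) -> bool:
    """Alert condition: the distance metric exceeds a chosen threshold."""
    return js_divergence(current, previous) > threshold


if __name__ == "__main__":
    previous_window = [5, 20, 50, 20, 5]   # hypothetical bucket counts
    current_window = [40, 30, 20, 8, 2]
    print(js_divergence(current_window, previous_window))
    print(drifted(current_window, previous_window))
```

The alert rule then only has to watch a single scalar series (the distance) instead of the whole heatmap.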
Hi SmoothSheep78
Do you need to import the previous state of the trains-server, or are you starting from scratch ?