Do you mean it recently became part of the enterprise version?
I do not think so, but it seems the support in the open-source version is more like a PoC
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
Hi DepressedChimpanzee34 , took me a while but I think there is a solution:
In your docker-compose file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
Hi MiniatureCrocodile39
I would personally recommend the ClearML show
https://www.youtube.com/watch?v=XpXLMKhnV5k
https://www.youtube.com/watch?v=qz9x7fTQZZ8
Are you getting the error from boto failing to launch additional ec2 instances ?
I guess. or pipelines that you can compose after running experiments to see that experiments are connected to each other
hmm what do you mean by "compose after running experiments" ? like a way to group them? what is the relation between one "item" to another ?
If this is a sequence of Tasks , are they executed by a controller ?
Awesome! any way to hear the talk w/o/ registering for the whole conference?
CloudySwallow27 Anyway we will make sure we upload the talk to the clearml youtube channel after the Talk
BTW
Grafana Visualizing endpoint request latency as well as prediction result value distributions
Just making sure I understand, you want to upload your models with clearml to the Yandex compatible S3 storage?
Hi WickedElephant66
So I'm trying to upload an artefact to clearml's fileserver (I have a self hosted clearml server running),
Are you trying to upload an artifact? If so I would do:
task.upload_artifact('local file', artifact_object="/path/to/file")
Or is it about Model files?
You can also check how to upload artifacts / models here:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
https://github.com/allegroai/clearml/blob/master/examples/reporti...
I "think" I have a clue on the issue that is lost here in the translation:
Specifically to me it all comes down to the definition of "pipeline"
From the clearml perspective:
Manual Task - code that is executed by the user (or any other mechanism Outside of the agent)
Remote Task - code that is executed by the Agent
Pipeline is a Task
Pipeline can be "manual task" but also "remote task"
Pipeline generates "remote tasks"
Task status (e.g. pipeline status as it is also a Task) can be: draft, a...
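To make the terminology above concrete, here is a minimal Python sketch of the relationships. It is not the clearml API: `SketchTask`, `SketchPipeline`, and the `executed_by` field are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchTask:
    """Hypothetical stand-in for a Task; not the clearml Task class."""
    name: str
    executed_by: str = "user"  # "user" -> manual task, "agent" -> remote task
    status: str = "draft"

    @property
    def is_remote(self) -> bool:
        return self.executed_by == "agent"


@dataclass
class SketchPipeline(SketchTask):
    """A pipeline is itself a Task, so it can be manual or remote."""
    steps: List[SketchTask] = field(default_factory=list)

    def add_step(self, name: str) -> SketchTask:
        # Steps generated by a pipeline are remote tasks (executed by an agent).
        step = SketchTask(name=name, executed_by="agent")
        self.steps.append(step)
        return step


# A pipeline started by hand is a "manual task" ...
pipeline = SketchPipeline(name="my-pipeline", executed_by="user")
# ... yet every step it generates is a "remote task".
train = pipeline.add_step("train")
```

The point of the sketch is only that "pipeline" and "remote" are independent properties: the pipeline controller can run anywhere, while the steps it launches are picked up by agents.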
That would be great! Might have to use 2>/dev/null in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting. We have setup sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd
inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
@<1523701083040387072:profile|UnevenDolphin73> it's looking for any of the files:
Hi CurvedHedgehog15
User aborted: stopping task (3)
?
This means "someone" externally aborted the Task. In your case the HPO aborted it: the sophisticated HyperBand Bayesian optimization algorithms we use (both Optuna and HpBandster) will early-stop experiments based on their performance and continue them later if needed.
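As a rough illustration of why an optimizer aborts tasks on its own, here is a toy successive-halving loop, the idea HyperBand builds on. It is a sketch under stated assumptions, not the Optuna, HpBandster, or ClearML implementation: `successive_halving`, the experiment names, and the scores are all hypothetical, and unlike the real algorithms it never resumes an aborted experiment.

```python
from typing import Callable, Dict, List


def successive_halving(
    configs: List[str],
    score_at: Callable[[str, int], float],
    min_budget: int = 1,
    rounds: int = 3,
) -> Dict[str, str]:
    """Toy successive halving: after each round keep the better half, abort the rest."""
    status = {c: "running" for c in configs}
    alive = list(configs)
    budget = min_budget
    for _ in range(rounds):
        if len(alive) <= 1:
            break
        # Rank the surviving experiments by their score at the current budget.
        ranked = sorted(alive, key=lambda c: score_at(c, budget), reverse=True)
        keep = max(1, len(ranked) // 2)
        for loser in ranked[keep:]:
            status[loser] = "aborted"  # this is the externally triggered abort
        alive = ranked[:keep]
        budget *= 2
    for c in alive:
        status[c] = "completed"
    return status


# Hypothetical scores: each experiment has a fixed quality, independent of budget.
quality = {"exp_a": 0.9, "exp_b": 0.5, "exp_c": 0.7, "exp_d": 0.2}
result = successive_halving(list(quality), lambda name, budget: quality[name])
```

In this toy run only the best experiment survives; the others end up aborted by the optimizer rather than by a person.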
Would be cool to let it get untracked as well, especially if we want to as an option
How would you decide what should be tracked?
ColossalDeer61 btw, it turns out the docker-compose services docker was ill configured on the GitHub. I suggest you get the latest copy of it:
curl
-o docker-compose.yml
Hi @<1661180197757521920:profile|GiddyShrimp15>
I think the is a better channel for this kind of question
(they will be able to help with that)
Hi @<1663354518726774784:profile|CrookedSeal85>
However, I systematically notice a jump of some number of "ghost iterations" when resuming my trainings...
Try the following:
task = Task.init(..., continue_last_task=0)
from the Task.init docstring (Notice this value can be both boolean and integer)
:param bool continue_last_task: Continue the execution of a
...
- An integer - Specify initial iteration offset (override the auto automatic last_iteratio...
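A minimal sketch of what an initial iteration offset does to the reported iteration numbers, and how it could show up as "ghost iterations". `SketchReporter` is a hypothetical class, not part of clearml, and the numbers are made up; the only behavior taken from the docstring above is that an integer sets the initial offset.

```python
class SketchReporter:
    """Hypothetical reporter that shifts every reported iteration by an initial offset."""

    def __init__(self, initial_offset: int = 0) -> None:
        self.initial_offset = initial_offset
        self.reported = []

    def report(self, local_iteration: int) -> int:
        # The value that ends up on the plot is offset + the script's own counter.
        global_iteration = self.initial_offset + local_iteration
        self.reported.append(global_iteration)
        return global_iteration


# Assumption: the training script restores its own step counter from a checkpoint.
resumed_step = 100

# With a non-zero automatic offset (assumed to be 100 here), the resumed run jumps ahead.
auto = SketchReporter(initial_offset=100)
# With the offset forced to 0, reported iterations match the script's own counter.
manual = SketchReporter(initial_offset=0)
```

If the script already restores its own step counter, an automatic offset is applied on top of it, which is why passing 0 is suggested above.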
Hi @<1661904968040321024:profile|SpotlessOwl43>
My problem is that when the AWS virtual machine is killed, my Pipelines and Scheduling stop working because of the killed ClearML agent,
are you using the ClearML AWS autoscaler to spin that machine ? or are you spinning it manually ?
Ohh then use the AWS autoscaler, it is basically what you want: it spins an EC2 instance and sets up an agent there, then if the EC2 instance goes down (for example if it is a spot instance), it will spin it up again automatically with the running Task on it.
wdyt?
Yes that's the reason. Basically there is a background thread analyzing the code; at the end of the execution, if it is still running (hence the question regarding execution time), we give it an extra 10 seconds to come up with answers, otherwise we terminate it, so the code won't get stuck. Makes sense to you?
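The grace-period pattern described above can be sketched with the standard library alone. This is an illustration of the general technique, not the clearml code: `run_with_grace_period` and both analysis functions are hypothetical, and the timeouts are shortened so the example finishes quickly.

```python
import threading
import time


def run_with_grace_period(analyze, grace_seconds: float):
    """Run `analyze` in a background thread; wait at most `grace_seconds` for it at exit."""
    result = {}

    def worker():
        result["value"] = analyze()

    # Daemon thread: it can never keep the process alive on its own.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    # ... the main program would do its real work here ...
    thread.join(timeout=grace_seconds)
    finished = not thread.is_alive()
    return finished, result.get("value")


def fast_analysis():
    return "analysis done"


def slow_analysis():
    time.sleep(0.5)
    return "too late"


fast_done, fast_value = run_with_grace_period(fast_analysis, grace_seconds=1.0)
slow_done, slow_value = run_with_grace_period(slow_analysis, grace_seconds=0.05)
```

A short script can exit before the background analysis finishes, in which case its result is simply dropped instead of blocking the exit.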
Hi @<1526371965655322624:profile|NuttyCamel41>
I think that the only way to actually get huge number of api calls is with a lot of machines.
For example, regardless of the amount of console logs you print, it will only be a single call, as these are packaged every 2-10 seconds. The same goes for metric reporting etc.
On the free tier you can already test the amount of API calls, I think the mechanism is exactly the same
fyi: I would put this question in the channel
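To show why the number of printed lines does not translate into the number of API calls, here is a small sketch of interval-based batching. `BatchingLogSender` is a hypothetical class, not the clearml implementation, and time is passed in explicitly so the example is deterministic.

```python
from typing import Callable, List


class BatchingLogSender:
    """Hypothetical sender: buffers lines and sends them as one call per flush interval."""

    def __init__(self, send: Callable[[List[str]], None], flush_interval: float = 2.0) -> None:
        self._send = send
        self._flush_interval = flush_interval
        self._buffer: List[str] = []
        self._last_flush = 0.0
        self.api_calls = 0

    def log(self, line: str, now: float) -> None:
        self._buffer.append(line)
        # Only send once the flush interval has elapsed since the last send.
        if now - self._last_flush >= self._flush_interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self._buffer:
            self._send(list(self._buffer))
            self.api_calls += 1
            self._buffer.clear()
        self._last_flush = now


sent_batches: List[List[str]] = []
sender = BatchingLogSender(sent_batches.append, flush_interval=2.0)

# 1000 console lines printed within about one second of simulated time ...
for i in range(1000):
    sender.log(f"line {i}", now=0.001 * i)
# ... followed by the periodic flush.
sender.flush(now=2.0)
```

All 1000 lines leave in one batch, so the call count grows with elapsed time and number of machines, not with log volume.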
Hi @<1541229812243238912:profile|PoisedMoth54>
We should probably add a better interface (please feel free to open a github issue on the interface) until then
dataset._task.connect_configuration(configuration="path/to/file", name="my config")
Why does my task execution freeze after pip installation (running agent in foreground mode)?
Hi AdventurousButterfly15
Are you running in agent docker mode or venv mode ?
What do you mean by freeze? Do you see anything on the Task console log in the UI? What's the host OS?
If I checkout/download dataset D on a new machine, it will have to download/extract 15GB worth of data instead of 3GB, right? At least I cannot imagine how you would extract the 3GB of individual files out of zip archives on S3.
Yes, I'm not sure there is an interface to extract only partial files from the zip (although worth checking).
I also remember there is a GitHub issue with uploading 50GB dataset, and the bottom line is, we should support setting chunk size, so that we can uploa...
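A rough sketch of what a configurable chunk size could mean for a dataset: files are packed into several smaller archives instead of one large one, so a client only needs the archives that contain the files it wants. `split_into_chunks`, the file names, and the sizes are hypothetical; this is not how clearml packs datasets.

```python
from typing import Dict, List


def split_into_chunks(file_sizes_mb: Dict[str, int], chunk_size_mb: int) -> List[List[str]]:
    """Greedily pack files, in the given order, into archives of at most `chunk_size_mb`."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in file_sizes_mb.items():
        # Start a new archive when adding this file would exceed the limit.
        if current and current_size + size > chunk_size_mb:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


files = {"a.bin": 200, "b.bin": 300, "c.bin": 500, "d.bin": 100}
chunks = split_into_chunks(files, chunk_size_mb=512)
```

A file larger than the limit still gets its own archive in this sketch; it is never split.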
And other question is clearml-serving ready for serious use?
Define serious use? KFserving support is in the pipeline, if that helps.
Notice that clearml-serving is basically a control plane for the serving engine. Not to neglect its importance, but the heavy lifting is done by Triton (or any other backend we will integrate with, maybe Seldon)
Pycharm does get confused sometimes
Hi @<1523701337353621504:profile|FlutteringSheep58>
are you asking how to convert a worker IP into a dns resolved host name ?
Hi @<1610083503607648256:profile|DiminutiveToad80>
Yes, it does. They are also cached by default (on the machine with the agent)
what do you mean? the same env for all components ? if they are using/importing exactly the same packages, and using the same container, then yes it could
Hmm so you are saying you have to be logged out to make the link work? (I mean pressing the link will log you in and then you get access)