I guess the thing that's missing from offline execution is being able to load an offline task without uploading it to the backend.
UnevenDolphin73 you mean so as to get the Task object from it?
(This might be doable, the main issue would be the metrics / logs loading)
What would be the use case for the testing ?
Hi @<1671689437261598720:profile|FranticWhale40>
Are you positive the Triton container finished syncing ?
Could you provide the docker log (both the serving and the triton)?
What is the clearml-serving version you are using ?
Could you add a print in the "preprocess" function, just to validate you are getting to the correct model version ?
Hi TenderCoyote78
I'm trying to clearml-agent in my dockerfile,
I'm not sure I'm following, are you trying to create a docker container containing the agent inside? For what purpose ?
(notice that the agent can spin any off the shelf container, there is no need to add the agent into the container, it will take care of itself when it is running it)
Specifically to your docker file:
RUN curl -sSL
| sh
No need for this line
COPY clearml.conf ~/clearml.conf
Try the ab...
BattyLion34 if everything is installed and used to work, what's the difference from the previous run that worked ?
(You can compare in the UI the working vs non-working, and check the installed packages, it would highlight the diff, maybe the answer is there)
but the requirement was already satisfied.
I'm assuming it is satisfied on the host python environment, do notice that the agent is creating a new clean venv for each experiment. If you are not running in docker-mode, then you ca...
ElegantCoyote26 what you are after is: docker run -v ~/clearml.conf:/root/clearml.conf -p 9501:8085
Notice the internal port (i.e. inside the docker) is 8080, but the external one is changed to 9501
Tried context provider for Task?
I guess that would only make sense inside notebooks ?!
or even different task types
Yes there are:
https://clear.ml/docs/latest/docs/fundamentals/task#task-types
https://github.com/allegroai/clearml/blob/b3176a223b192fdedb78713dbe34ea60ccbf6dfa/clearml/backend_interface/task/task.py#L81
Right now I don't see differences, is this a deliberate design?
You mean on how to use them? I.e. best practice ?
https://clear.ml/docs/latest/docs/fundamentals/task#task-states
Click on the Task it is running and abort it, it seems to be stuck, I guess this is why the others are not pulled
Notice this is only when:
Using Conda as package manager in the agent
The requested python version is already installed (multiple python version installations on the same machine/container are supported)
TeenyFly97 the TL;DR is:
Task.close() should be called when you previously used Task.init (i.e. in the code creating the task)
Task.mark_stopped() should be called to stop a remote Task running.
I hope it helps 🙂
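To make the distinction above concrete, here is a minimal sketch. The project and task names are made up, and both functions need the clearml package plus a configured server, so the imports are done lazily inside them.

```python
def run_and_close():
    """The code that created the task with Task.init also closes it."""
    from clearml import Task  # lazy import: needs clearml and a configured server

    # "examples" / "close demo" are hypothetical names
    task = Task.init(project_name="examples", task_name="close demo")
    # ... your training / logging code goes here ...
    task.close()


def stop_remote_task(task_id):
    """Stop a task that is running somewhere else (e.g. on an agent)."""
    from clearml import Task  # lazy import, same reason as above

    remote_task = Task.get_task(task_id=task_id)
    remote_task.mark_stopped()
```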
Hi ScantChimpanzee51
Is it possible to run multiple agent on EC2 machines started by the Autoscaler?
I think that by default you cannot,
having the Autoscaler start 1x p3.8xlarge (4 GPU) on AWS might be better than 4x p3.2xlarge (1 GPU) in terms of availability, but then we’d need one Agent per GPU.
I think that this multi-GPU setup is only available in the enterprise tier.
That said, the AWS pricing is linear, it costs the same having 2 instances with 1 GPU as 1 instanc...
I still don't get resource logging when I run in an agent.
@<1533620191232004096:profile|NuttyLobster9> there should be no difference ... are we still talking about <30 sec? or a sleep test? (no resource logging at all?)
have a separate task that is logging metrics with tensorboard. When running locally, I see the metrics appear in the "scalars" tab in ClearML, but when running in an agent, nothing. Any suggestions on where to look?
This is odd and somewhat consistent with actu...
MortifiedCrow63 , hmmm can you test with manual upload and verify ?
(also what's the clearml version you are using)
CleanWhale17 what is " Online-Training  Support(for Dataset Shifts" ?
PompousParrot44 , so you mean like a base conda env?
Configuring trains-agent to use conda is done here:
https://github.com/allegroai/trains-agent/blob/699d13bbb34649c7e5337b4187cda59b7fa6fd33/docs/trains.conf#L44
Then for every experiment trains-agent will create a new conda environment based on the requirements of that experiment.
You can tell it to inherit the base conda env (or the one it is running from, I think) by setting system_site_packages: true https://github.com/allegroai/tr...
I mean what is the actual link?
File:// is a path to a file.
If your machine cannot access that path you get an error.
For example:
file:///home/user/file.bin
translates to /home/user/file.bin
If you do not have the file /home/user/file.bin on your machine you get an error.
GrievingTurkey78 make sense ?
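If it helps, this is roughly how you can check it yourself before handing the link over. A small sketch using only the Python standard library, assuming a POSIX-style path; the helper names are mine, not part of trains / clearml.

```python
import os
from urllib.parse import unquote, urlparse


def file_url_to_path(url):
    """Translate a file:// URL into the local path it points at."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URL: {url!r}")
    # the path part of file:///home/user/file.bin is /home/user/file.bin
    return unquote(parsed.path)


def is_reachable(url):
    """True only if the file behind the URL exists on this machine."""
    return os.path.exists(file_url_to_path(url))


print(file_url_to_path("file:///home/user/file.bin"))  # /home/user/file.bin
```

If is_reachable returns False on the machine that runs the code, that is the error you are seeing.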
Note that by default trains / clearml will not upload your weights file anywhere; it will only do that if you set "output_uri" to a specific location.
Hi @<1645597514990096384:profile|GrievingFish90>
You mean the agent itself inside a docker then the agent spins sibling dockers for the Tasks ?
Can you send the console output of this entire session please ?
I'm not sure on the frequency it updates though
Essentially the example provided just prints out ids to the log file,
What do you mean?
Hi PompousParrot44
Well this kind of control is tricky. If you don't mind processes "fighting over cpu" you can just spin two trains-agents in cpu-mode. It will work as long as they have a different TRAINS_WORKER_NAME
The other option (might be a bit of an overkill) is to use K8s, which will set the CPU % for the entire agent.
What do you think?
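For the first option, something along these lines should do. It is a sketch only: the queue name "default" and the worker names are assumptions, and it expects trains-agent to be on the PATH. The one thing that really matters is that each agent gets its own TRAINS_WORKER_NAME.

```python
import os
import subprocess


def build_agent_launch(worker_name, queue="default"):
    """Return (command, environment) for one CPU-only trains-agent daemon."""
    env = dict(os.environ)
    env["TRAINS_WORKER_NAME"] = worker_name  # must be different for every agent
    cmd = ["trains-agent", "daemon", "--queue", queue, "--cpu-only"]
    return cmd, env


def launch_agents(worker_names, queue="default"):
    """Start one daemon per worker name (needs trains-agent installed)."""
    if len(set(worker_names)) != len(worker_names):
        raise ValueError("each agent needs a different TRAINS_WORKER_NAME")
    procs = []
    for name in worker_names:
        cmd, env = build_agent_launch(name, queue)
        procs.append(subprocess.Popen(cmd, env=env))
    return procs
```

Calling launch_agents(["cpu-agent-1", "cpu-agent-2"]) would then spin the two agents on the same machine.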
you need to set
CLEARML_DEFAULT_BASE_SERVE_URL:
So it knows how to access itself
The problem is due to tight security on this k8 cluster, the k8 pod cannot reach the public file server url which is associated with the dataset.
Understood, that makes sense, if this is the case then the path_substitution feature is exactly what you are looking for
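Conceptually, path substitution swaps a registered prefix for a local one before the link is accessed. If I remember correctly it is configured in clearml.conf under sdk.storage.path_substitution, with registered_prefix / local_prefix entries. The snippet below only illustrates the idea in plain Python, it is not ClearML's code, and both URLs are made up.

```python
def substitute_prefix(url, rules):
    """Swap the first matching registered prefix for its local prefix."""
    for registered_prefix, local_prefix in rules:
        if url.startswith(registered_prefix):
            return local_prefix + url[len(registered_prefix):]
    return url  # no rule matched, keep the registered URL as-is


# Hypothetical values: a public file server the pod cannot reach,
# and an internal address that it can reach.
rules = [("https://files.example.com", "http://fileserver.internal:8081")]
print(substitute_prefix("https://files.example.com/proj/data.zip", rules))
# http://fileserver.internal:8081/proj/data.zip
```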
@<1651395720067944448:profile|GiddyHedgehong81> just to be clear, Dataset.get_local_copy returns a path to your files,
You have to manually add the additional path to the specific files you need to use. It does not know that in advance.
That was the initial issue you had, and I assume it is the same one here. Does that make sense ?
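In code it would look something like this. A sketch, where the dataset id and the relative file path are whatever applies to your dataset, and the helper names are mine. The clearml import is done lazily, because that part needs a configured server.

```python
from pathlib import Path


def resolve_dataset_file(dataset_root, relative_path):
    """Join the dataset folder with the relative path of the file you need."""
    full_path = Path(dataset_root) / relative_path
    if not full_path.is_file():
        raise FileNotFoundError(f"{relative_path} not found under {dataset_root}")
    return full_path


def load_from_dataset(dataset_id, relative_path):
    """Get a local copy of the dataset, then point at one file inside it."""
    from clearml import Dataset  # lazy import: needs clearml and a configured server

    dataset_root = Dataset.get(dataset_id=dataset_id).get_local_copy()
    return resolve_dataset_file(dataset_root, relative_path)
```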
And command is a list instead of a single str
"command list", you mean the command argument ?
When using the UI with regex to search for experiments, due to the greedy nature of the search, it consistently pops up the "ERROR Fetch Experiments failed" window when starting to use groups in regex (that is, parentheses of any kind).
hmm that is a good point (i.e. only on enter it would actually search)
Could it be updated so that if an invalid regex pattern is given, it simply highlights the search bar in red (or similar) rather than stop us while writing the search pattern?
...
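The "validate first, search only if valid" idea could look roughly like this. It is a Python illustration only (the actual UI is not written in Python), and the function names are hypothetical.

```python
import re


def check_pattern(pattern):
    """Return (is_valid, message) instead of raising on an incomplete pattern."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return False, str(exc)
    return True, ""


def search_names(pattern, names):
    """Filter names only when the pattern compiles, otherwise return no hits."""
    is_valid, _ = check_pattern(pattern)
    if not is_valid:
        return []  # the caller can highlight the search bar instead of failing
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]
```

A half-typed group such as "(resnet" is reported as invalid rather than raising an error, and the search only runs once the closing parenthesis has been typed.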
Oh if this is the case you can probably do
` import os
import subprocess
from time import sleep

from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
queue_ids = client.queues.get_all(name="queue_name_here")
while True:
    # pull the next task from the queue (if there is one)
    result = client.queues.get_next_task(queue=queue_ids[0].id)
    if not result or not result.entry:
        # queue is empty, wait a bit and try again
        sleep(5)
        continue
    task_id = result.entry.task
    # mark the task as started before running it
    client.tasks.started(task=task_id)
    env = dict(**os.environ)
    env['CLEARML_TASK_ID'] = ta...