Hi HelpfulHare30
I mean situations where training is long and parts of it can be parallelized in some way, like in Spark or Dask
Yes, that makes sense. The function we are parallelizing is usually bottlenecked on both data & CPU, and both frameworks try to split & stream the data.
ClearML does not do data split & stream, but what you can do is launch multiple Tasks from a single "controller" and collect the results. I think that one of the main differences is that a ClearML Task is ...
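For illustration, here is a minimal sketch of that "controller" pattern, assuming you already have a template Task to clone and a queue called "default" (the task id, queue name and parameter names are placeholders):
```python
# Minimal sketch of the "controller" pattern: clone a template Task several
# times, enqueue the clones, then wait and collect the results.
from clearml import Task

TEMPLATE_TASK_ID = "<template-task-id>"  # placeholder
param_grid = [{"Args/lr": 0.01}, {"Args/lr": 0.001}]

children = []
for i, overrides in enumerate(param_grid):
    child = Task.clone(source_task=TEMPLATE_TASK_ID, name=f"worker-{i}")
    child.set_parameters(overrides)            # override the cloned hyperparameters
    Task.enqueue(child, queue_name="default")  # hand it to an available agent
    children.append(child)

for child in children:
    child.wait_for_status()                    # block until the child finishes
    print(child.id, child.get_last_scalar_metrics())
```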
LovelyHamster1 what do you mean by "assume the permissions of a specific IAM Role" ?
In order to spin up an EC2 instance (AWS autoscaler) you have to have the correct credentials; to pass those credentials you must create a key/secret pair and give it to the autoscaler. There is no direct support for IAM Roles. Makes sense?
Hi MoodyCentipede68, I think I saw something like it, can you post the full log? The Triton error is above; also I think it restarted the container automatically and then it worked.
TrickySheep9
Is there a way to see a roadmap on such things?
Hmm I think we have some internal one, I have to admit these things change priority all the time (so it is hard to put an actual date on them).
Generally speaking, pipelines with functions should be out in a week or so, TaskScheduler + Task Triggers should be out at about the same time.
UI for creating pipelines directly from the web app is in the works, but I do not have a specific ETA on that
You mean like a name of the artifact ?
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 and above, because with 0.16 we added sections, and compatibility is per version. Meaning, if you have tasks generated with trains 0.16 you need trains 0.16 to clone them from code (so you can properly control the arguments)
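For example (a sketch, assuming the source task connected its argparse arguments so they live under the "Args" section; ids, queue name and values are placeholders):
```python
# Clone an existing task from code and override its argv-style parameters.
# With trains/clearml >= 0.16 parameters are namespaced by section, so
# argparse values are addressed as "Args/<argument-name>".
from clearml import Task

source = Task.get_task(task_id="<source-task-id>")           # placeholder id
cloned = Task.clone(source_task=source, name="clone with new args")
cloned.set_parameters({"Args/batch_size": 64, "Args/epochs": 10})
Task.enqueue(cloned, queue_name="default")                   # placeholder queue
```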
Is there a reason clearml will use the demo server when there is no ~/clearml.conf ?
It's the default server for an easy getting-started journey, e.g. you run some sample code and it just works, with zero configuration.
That said, you can set an environment flag to disable the default server behavior: CLEARML_NO_DEFAULT_SERVER=1
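For example, one way to make sure the flag is set before the SDK is imported (exporting it in the shell works just as well):
```python
# Disable the demo-server fallback before clearml is imported; with this set and
# no ~/clearml.conf present, Task.init should raise instead of silently
# reporting to the public demo server.
import os
os.environ["CLEARML_NO_DEFAULT_SERVER"] = "1"

from clearml import Task
task = Task.init(project_name="my project", task_name="my task")
```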
ReassuredTiger98
wdyt?
BTW:
it will push potentially proprietary data to the public demo server.
The server if su...
and what are their names ?
worker:0 worker:1 etc ?
the Task scheduler itself is a Task. What we did is we added a new parameter section on the Task (the task.connect call), so that we can later clone and modify it and use the new value at runtime
(Task.connect will put the data from the Task/UI back into the dict when the agent is running the Scheduler)
Does that make sense?
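Roughly, the pattern looks like this (section and field names here are only illustrative):
```python
# The scheduler's configuration is exposed via task.connect, so a cloned copy of
# this Task can be edited in the UI and the agent-side run picks up the new values.
from clearml import Task

task = Task.init(project_name="services", task_name="my scheduler")

schedule_config = {"cron": "0 3 * * *", "queue": "default"}
# Creates an editable parameter section; when an agent runs a cloned copy,
# connect() writes the UI-edited values back into schedule_config.
task.connect(schedule_config, name="scheduling")

print("running with:", schedule_config)
```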
can i run it on an agent that doesn't have gpu?
Sure this is fully supported
when I run clearml-serving it throws me an error: "please provide specific config.pbtxt definition"
Yes, this is a small file that tells the Triton server how to load the model:
Here is an example:
https://github.com/triton-inference-server/server/blob/main/docs/examples/model_repository/inception_graphdef/config.pbtxt
TroubledHedgehog16 if you have a preinstalled conda env, then why would you need to reinstall it from a yml file? Also, if this is the default python env, clearml-agent will inherit from it and use it (no real overhead there)
Notice the reason for "inheriting" system python environments is so that the agent can cache the individual Task requirements, meaning next time it will not need to reinstall anything
wdyt?
If possible, can we have a "only one experiment can be given a single tag"
You mean "moving a tag" automatically (i.e. if someone else had the same tag it is removed from it)?
IrritableJellyfish76 hmm, maybe we should add an extra argument partial_name_matching=False to maintain backwards compatibility?
Hi IrritableJellyfish76
https://clear.ml/docs/latest/docs/references/sdk/task#taskget_tasks
task_name (str) – The full name or partial name of the Tasks to match within the specified project_name (or all projects if project_name is None). This method supports regular expressions for name matching. (Optional)
You are right, this is a bit confusing, I will make sure that we add in the docstring an examp...
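Something along these lines (project and task names are made up):
```python
# task_name is matched as a regular expression, so a partial name matches,
# and regex anchors give an effectively exact match.
from clearml import Task

partial = Task.get_tasks(project_name="examples", task_name="train")
exact = Task.get_tasks(project_name="examples", task_name="^train_cifar10$")

print([t.name for t in partial])
```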
Number of entries in the dataset cache can be controlled via clearml.conf: sdk.storage.cache.default_cache_manager_size
If you need to change the values: config_obj.set(...)
You might want to edit the object on a copy, not the original
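For reference, the corresponding clearml.conf entry would look something like this (the value 100 is just an example):
```
sdk {
  storage {
    cache {
      # maximum number of entries kept in the dataset cache
      default_cache_manager_size: 100
    }
  }
}
```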
Notice you have in the path: /home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi
But you should have: /home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/
Hi CheekyElephant36
First you need to run it once on your machine; once this is done (only a few steps are enough), you can clone it and enqueue it. Then, to actually connect the AWS autoscaler (the part that spins up machines and runs tasks), go to Applications and select the AWS autoscaler.
Btw I think the next video will be about YOLO + autoscaler
Oh no need to specify one, this is optional configuration.
Basically follow these steps only:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac
Hi DrabCockroach54
Notice the free GPU memory is global (hence low), but the memory usage (at least with new nvidia drivers) is reported per process. I'm assuming the process using the memory is not a subprocess? Could that be? What's the OS you are running on?
Thanks GorgeousMole24
That is a very good point! Passing it to the product guys.
Instead you can do: TRAINS_WORKER_NAME="trains-agent:$DYNAMIC_INSTANCE_ID"
Then the Worker ID will have the running instance appended to the worker name. This means that even if you use the same $DYNAMIC_INSTANCE_ID twice, you will not have two agents registering under the same name.
IntriguedRat44 how do I reproduce it?
Can you confirm that commenting out the Task.init(..) call fixes it?
Hi IntriguedRat44
Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent), in that case we do not change the CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check the :monitor:gpu in the Scalars tab under Results.)
Also, what's the Trains/ClearML version you are using, and the OS?
And does the Executor actually run something, or is it IO?
SolidSealion72 this makes sense, clearml deletes artifacts/models after they are uploaded, so I have to assume these are torch internal files