Hi RipeAnt6
What would be the best way to add another model from another project say C to the same triton server serving the previous model?
You can add multiple calls to clearml-serving, each one with a new endpoint and a new project/model to watch; when you launch it, it will set up all the endpoints on a single Triton server (the model optimization loading is taken care of by Triton anyhow)
SoggyBeetle95 the question is, where does clearml store these arguments, and the answer is on the Task object (from there the agent will take them and apply them to the docker execution). Now since all users see all the tasks, they also see these arguments. Wdyt?
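For illustration, a minimal sketch of reading those stored settings back (the task ID is a placeholder):
```python
from clearml import Task

# any user who can see this task can also read its docker settings
task = Task.get_task(task_id='some_task_id')  # placeholder ID
# returns the base docker image / arguments the agent will apply
print(task.get_base_docker())
```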
You can query the system and get all the experiments based on date, then grab the machine GPU metrics.
DefeatedCrab47 check the cleanup service, it queries the system with the APIClient.
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/examples/services/cleanup/cleanup_service.py#L72
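Along the same lines, a rough sketch (the date-filter syntax follows the cleanup service linked above; the ':monitor:gpu' scalar title is an assumption, verify it against your own reported scalars):
```python
from datetime import datetime, timedelta

from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
# tasks whose status changed within the last 7 days
week_ago = datetime.utcnow() - timedelta(days=7)
tasks = client.tasks.get_all(
    status_changed=['>{}'.format(week_ago)],
    only_fields=['id'],
    page_size=100,
)
for t in tasks:
    scalars = Task.get_task(task_id=t.id).get_reported_scalars()
    # machine monitoring metrics are reported as regular scalars
    gpu_metrics = scalars.get(':monitor:gpu', {})
    print(t.id, list(gpu_metrics.keys()))
```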
Happy new year @<1618780810947596288:profile|ExuberantLion50>
- Is this the right place to mention such bugs?
Definitely the right place to discuss them; usually, if verified, we ask to also add them in GitHub for easier traceability / visibility
m (i.e. there's two plots shown side-by-side but they're actually both just the first experiment that was selected). This is happening across all experiments, all my workspaces, and all the browsers I've tried.
Can you share a screenshot? is this r...
Using agent v1.01r1 in k8s glue.
I think a fix was recently committed, let me check it
I see now.
Let's assume you know which snapshot that was:
```python
from clearml import Task

prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second-from-last checkpoint
checkpoint_url = prev_task.models['output'][-2].url
prev_scalars = prev_task.get_reported_scalars()

new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# do some for loop and report the prev_scalars with logger.report_scalar
new_task.flush(wait_for_uploads=True)
new_task.set_initial_iteration(22000)
# start the training
```
I'm sorry, I mean if the queue name is not provided to the agent, the agent will look for the queue with the "default" tag. If you are specifying the queue name, there is no need to add the tag.
Is it working now?
Hi TastyOwl44
So this depends on your code itself, but usually you need a CPU machine to run the ClearML server (or use the free community server), then a machine to run the pipeline controller (usually the same machine running the clearml-server, as the pipeline control code is basically controller-only and does not execute the Task itself), and lastly you need machines with GPUs running the clearml-agent (these GPU machines are the ones actually doing the training / inference etc.)
Make ...
Will this still be considered as global site-packages?
This is a pip setting, I "think" it inherits from the local user's installation, but I would actually install with "sudo pip", which will definitely be "inherited"
What is the specific use case, updating a file on existing dataset and creating a new version?
if they're mission critical, but rather the clearml cache folder?
hmmm... they are important, but only when starting the process. Any specific suggestion?
(and they are deleted after the Task is done, so they are temp)
any idea why I cannot select text inside the table?
Ichh, seems again like plotly 😞 I have to admit it's quite annoying to me as well ... I would vote here: None
The issue itself is changing the default user:
```dockerfile
USER appuser
WORKDIR /home/appuser
```
Any reason for it?
feature request: tell me what gets passed along each edge of the pipeline graph
Nice! Please feel free to add it to the GH issue 🙂
Hmm yes that is odd, let me see if I can reproduce
Yes please 🙂
BTW: I originally thought the double quotes (in your PR) were also a bug, this is why I was asking, wdyt?
I mean, manually you can get the results and rescale, but not through the UI
Check the links that are generated in the UI when you upload an artifact or model
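For reference, a small sketch (project / artifact names are made up; the printed URL points at whatever files server or bucket you configured):
```python
from clearml import Task

task = Task.init(project_name='examples', task_name='artifact demo')
task.upload_artifact('my_dict', artifact_object={'a': 1})
task.flush(wait_for_uploads=True)
# the link reflects the configured files server / storage bucket
print(task.artifacts['my_dict'].url)
```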
I see TightElk12
You can always set up the OS environment variables CLEARML_API_HOST, CLEARML_WEB_HOST and CLEARML_FILES_HOST with the correct configuration, or you can simply set CLEARML_NO_DEFAULT_SERVER=1 which will prevent any usage of the default demo server. Wdyt?
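As a minimal sketch (the server addresses are placeholders; set the variables before clearml is imported):
```python
import os

# placeholders: point these at your own clearml-server
os.environ['CLEARML_API_HOST'] = 'http://my-server:8008'
os.environ['CLEARML_WEB_HOST'] = 'http://my-server:8080'
os.environ['CLEARML_FILES_HOST'] = 'http://my-server:8081'
# or refuse any fallback to the default demo server
os.environ['CLEARML_NO_DEFAULT_SERVER'] = '1'

from clearml import Task

task = Task.init(project_name='examples', task_name='env-config test')
```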
doing some extra "services"
what do you mean by "services"? (from the system perspective, any Task that is executed by an agent running in "services-mode" is a service; there are no actual limitations on what it can do 🙂)
Send the full Task log, you can DM it if that is easier
Hmm that makes sense. BTW, the PYTHONPATH set by the agent would be the working dir listed under the Task, but if you set agent.force_git_root_python_path the agent would also add the root of the git repo to the python path
Hmm, so the way the configuration works is: it loads the default configuration (equivalent to the example in the docs), then it adds ~/clearml.conf on top. That means you can tell your users to just copy-paste the credentials from the UI into a template you make. How is that?
yes, or (because I deployed clearml using helm in kubernetes) from the same machine, but multiple pods (tasks).
Oh now I see. Long story short, no 😞 The correct way of doing that is: every node/pod creates its own dataset, then when you are done, you create a new version with the X datasets that you created as parents. The newly created version is just "meta", it basically tells the system how to combine the previously generated datasets (i.e. no data is actually re-uploa...
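A rough sketch of that flow (project / dataset names, paths and IDs are placeholders):
```python
from clearml import Dataset

# on each node/pod: create and upload its own dataset version
ds = Dataset.create(dataset_project='examples', dataset_name='shard_0')
ds.add_files('/data/shard_0')  # placeholder path
ds.upload()
ds.finalize()

# once all pods are done: a "meta" version that only references its parents,
# so no data is actually re-uploaded
merged = Dataset.create(
    dataset_project='examples',
    dataset_name='merged',
    parent_datasets=['<shard_0_id>', '<shard_1_id>'],  # IDs from each pod
)
merged.finalize()
```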
HandsomeCrow5 I see, my bad.
BTW: Did you see this one?
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
And the helper classes here: https://github.com/allegroai/trains/tree/master/trains/automation
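For context, a condensed sketch based on that example (the base task ID, parameter name and metric names are placeholders; RandomSearch stands in for whichever search strategy you pick):
```python
from clearml.automation import (
    HyperParameterOptimizer, RandomSearch, UniformIntegerParameterRange,
)

optimizer = HyperParameterOptimizer(
    base_task_id='<template_task_id>',  # the task to clone and mutate
    hyper_parameters=[
        UniformIntegerParameterRange(
            'General/batch_size', min_value=32, max_value=128, step_size=32),
    ],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    optimizer_class=RandomSearch,
    execution_queue='default',
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```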
Still I wonder if it is normal behavior that clearml exits the experiments with status "completed" and not with failure
Well that depends on the process exit code; if for some reason (not sure why) the process exits with return code 0, it means everything was okay.
I assume the "Detected an exited process, so exiting main" message is an internal print of your code; I guess it just leaves the process with exit code 0
Understood, then I would use Task.execute_remotely()
Basically:
```python
from clearml import Task

task = Task.init(...)
# config some stuff
task.execute_remotely(queue_name='queue_name_here')
# this line onward will be executed on the remote machine only
```
This will automatically log your code / repo with Task.init, and the call to task.execute_remotely will stop the local process (on your machine that runs the hydra sweep) and continue on the remote machine.
This will both allow you to use the Hydra sweep & schedule / run on remote ...
So the thing is, regardless of the link, you should end up with:
helper <clearml.storage.helper.StorageHelper object at 0x....>
But the code that failed seemed to return None, which makes me suspect the url itself is somehow broken.
Any chance you have a space before the "s3://" ?
BTW : what's the clearml version you are using ?
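A quick sanity check, as a sketch (the bucket URL is a placeholder):
```python
from clearml.storage.helper import StorageHelper

# placeholder URL; watch out for a leading space before 's3://'
url = 's3://my-bucket/my-folder'
helper = StorageHelper.get(url)
# expected: <clearml.storage.helper.StorageHelper object at 0x...>, not None
print(helper)
```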
Although I didn't understand why you mentioned torch in my case?
Just a guess 🙂 other frameworks do multi-processing as well.
I would guess it relates to the parallelization of Task execution by the HyperParameterOptimizer class?
Yes that might be it, it's basically a by-product of using python's "Process" class for multiprocessing. We are working on a fix, it's not trivial unfortunately