Hi WickedElephant66
So I'm trying to upload an artefact to clearml's fileserver (I have a self hosted clearml server running),
Are you trying to upload an artifact? If so I would do: task.upload_artifact('local file', artifact_object='/path/to/file')
Or is it about Model files?
You can also check how to upload artifacts / models here:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
https://github.com/allegroai/clearml/blob/master/examples/reporti...
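For example, a minimal sketch (project / task names here are hypothetical):
from clearml import Task

task = Task.init(project_name='examples', task_name='artifact upload')
# upload a local file as an artifact
task.upload_artifact('local file', artifact_object='/path/to/file')
# upload a python object; it is serialized for you
task.upload_artifact('stats', artifact_object={'accuracy': 0.9})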
Are kwargs supported in functions decorated as a pipeline component?
They are, but I think the main issue is the casting; without prior knowledge, everything will be a string
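So to be safe, cast explicitly inside the component. A minimal sketch (names are hypothetical):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=['total'])
def sum_values(**kwargs):
    # when executed remotely the kwarg values may arrive as strings,
    # so cast explicitly instead of relying on type hints
    return sum(float(v) for v in kwargs.values())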
I see,
@<1571308003204796416:profile|HollowPeacock58> can you please send the full log?
(The odd thing is it is trying to install the python 3.10 version of torch, when your command line suggests it is running python 3.8)
Yes in the UI, clone or reset the Task, then you can edit the installed packages section under the Execution tab
Can you do the following
Clone the Task you previously sent me the installed packages of, then enqueue the cloned Task to the queue of the agent running conda.
Then send me the full log of the Task that the agent ran
So the original looks good, could it be you tried to clone a Task that was executed with an agent with pip, and then pushed into an agent running conda?
@<1569496075083976704:profile|SweetShells3> remove these from your pbtxt:
name: "conformer_encoder"
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
Second, what do you have in your preprocess_encoder.py?
And where are you getting the Error? (is it from the triton container? or from the REST request?)
data["encoded_lengths"]
This makes no sense to me, data is a numpy array, not a pandas frame...
CrookedWalrus33 I found the issue, this is only failing with Python 3.6.
Let me check something
Oh, I think I understand your point now.
basically you can:
Create the initial Task, once it is in the system clone it and adjust parameters externally. A simple example here:
https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/examples/automation/manual_random_param_search_example.py#L41
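And a minimal sketch of the same flow (project / queue names are hypothetical):
from clearml import Task

template = Task.get_task(project_name='examples', task_name='my template')
cloned = Task.clone(source_task=template, name='clone with new params')
cloned.set_parameters({'General/learning_rate': 0.01})  # adjust externally
Task.enqueue(cloned.id, queue_name='default')  # send it to an agent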
wdyt?
Hi FancyWhale93 you can disable the auto model uploading with:
@PipelineDecorator.component(..., auto_connect_frameworks={'pytorch': False})
def step():
    pass
Hi AttractiveCockroach17
Many of these experiments appear with status running on clearml even though they have finished running,
Could it be their process just terminated (i.e. not properly shut down)?
How are you running these multiple experiments?
BTW: if the server does not see any change in a Task for (I think the default is 2 hours) it will automatically mark these Tasks as aborted
Hi IrritableGiraffe81
You can access the model object with task.models['output']
To set the model metadata I would recommend making sure you have the latest clearml package, I think this is a relatively new addition
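A minimal sketch, assuming a recent clearml version where Model.set_metadata is available (the id / key / value here are hypothetical):
from clearml import Task

task = Task.get_task(task_id='aabbcc')  # hypothetical task id
model = task.models['output'][-1]  # last reported output model
model.set_metadata('input_shape', '[1, 3, 224, 224]')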
Hi IrritableGiraffe81
PipelineDecorator.debug_pipeline() runs everything as regular python functions, but "PipelineDecorator.run_locally()" is actually simulating all the steps on the same local machine (so that it is easier to debug the "real" pipeline running on multiple machines)
What I think is happening is that the casting of the arguments passed to the component fails.
Basically the type hints are currently ignored (we are working on using them for casting in the next version)
but righ...
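For reference, a minimal sketch of the two debug modes, with explicit casting as a workaround (names are hypothetical):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=['doubled'])
def double(x):
    return int(x) * 2  # cast explicitly, type hints are currently ignored

@PipelineDecorator.pipeline(name='debug example', project='examples')
def logic():
    print(double(2))

if __name__ == '__main__':
    PipelineDecorator.debug_pipeline()  # everything as regular python functions
    # PipelineDecorator.run_locally()  # or simulate each step as a local process
    logic()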
I suppose one way to perform this is with a
that kicks
Yes, that was my thinking.
It seems more efficient to support a triggered response to task failure.
Not sure I follow this one, I mean the pipeline logic itself monitors the execution. If I'm not mistaken, try/except will catch a step that fails, and a global one will catch the entire pipeline. Am I missing something ?
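Something along these lines, a minimal sketch (step names are hypothetical):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=['value'])
def shaky_step():
    raise RuntimeError('step failed')

@PipelineDecorator.component(return_values=['value'])
def fallback_step():
    return 0

@PipelineDecorator.pipeline(name='fault tolerant example', project='examples')
def logic():
    try:
        value = shaky_step()  # the failing step raises here
    except Exception:
        value = fallback_step()  # recover inside the pipeline logic
    return value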
it's saved in a lightning_logs folder where I started the script instead.
It should be saved there + it should upload it to your file server
Can you send the Task log? (this is odd)
yey working 🙂
You can already sort and filter experiments based on any hyperparameter or metric that the experiment reports, there is no need for any custom language query. Also any filtered/sorted table can be shared exactly as it is, so you can create leaderboards and share specific filters. You can also use the search bar to filter based on experiment name / comment. Tags will be added soon as well 🙂
Example of custom columns is here (the screen grab is a bit old, now there is als...
what is the best approach to update the package if we have frequent updates on this common code?
since this package has an indirect effect on the model endpoint, I would package it with the preprocess code of the endpoint.
Each server updates its own local copy, and it will make sure it can pick up the new version and deploy it hand over hand without breaking its ability to serve these endpoints.
the "wastefulness" of holding multiple copies is negligible compared to a situation where everyone ...
However, regarding your recommendation of using the StorageManager class to delete the URL, it seems that this class only contains methods for checking existence of files, downloading files and uploading files, but no method for actually deleting files based on their URL (see doc and ).
Yes you are correct 🙂 you should use a "deeper" class:
helper = StorageHelper.get(remote_url)
helper.delete(remo...
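For completeness, a minimal sketch (the URL is hypothetical, and I'm assuming delete takes the full remote URL):
from clearml.storage.helper import StorageHelper

remote_url = 's3://bucket/path/artifact.bin'
helper = StorageHelper.get(remote_url)
helper.delete(remote_url)  # remove the remote object at that URL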
, but what I really want to achieve is to share this code:
You mean to share the code between them, unless this is a "preinstalled" package in the container, each endpoint has its own separate set of modules / files
(this is on purpose, so you could actually change them, just imagine different versions of the same common.py file)
Should work out of the box, maybe the only thing to notice is that you will get a Task for every local_rank 0 process
does that make sense ?
Okay this is a bit tricky (and come to think about it, we should allow a more direct interface):
pipe.add_step(
    name='train',
    parents=['data_pipeline', ],
    base_task_project='xxx',
    base_task_name='yyy',
    task_overrides={
        'configuration.OmegaConf': dict(
            value=yaml.dump(MY_NEW_CONFIG),
            name='OmegaConf',
            type='OmegaConf YAML',
        )
    },
)
Notice that if you had any other configuration on the base task, you should add them as well (basically it overwrites the configurati...
I start the TaskScheduler, register a task, and stop the scheduler, how do I restart the TaskScheduler in a way that re-registers the tasks?
if it's aborted, just re-enqueue it?
(it serializes itself and stores its state on the Task object, so when re-launched it will deserialize from the last state)
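A minimal sketch of that flow (the id / queue names are hypothetical; see the TaskScheduler docs for the schedule arguments):
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# register a task to launch on a schedule (hypothetical id / queue)
scheduler.add_task(schedule_task_id='aabbcc', queue='default', minute=30)
# start_remotely stores the scheduler state on its own Task,
# so re-enqueuing that Task restores the registered entries
scheduler.start_remotely(queue='services')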
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Every clearml-serving session (you can have multiple different "sessions") is assumed to be homogeneous. This means it will serve the same models on as many nodes as possible, supporting multiple models per pod.
In your example I think the easiest is to create two serving sessions one with a node selector for the 24GB node and another for the 16GB node, wdyt?
Hi @<1587615463670550528:profile|DepravedDolphin12>
Is there any way to get the id of the pipeline using the pipeline name?
In the UI top right "details" panel should have the Pipeline ID
Is this what you are looking for ?
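Programmatically, a minimal sketch (assuming the pipeline run is stored as a Task; project / name here are hypothetical):
from clearml import Task

pipeline_task = Task.get_task(project_name='my project', task_name='my pipeline')
print(pipeline_task.id)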
btw, I looked deeper into the log:
File "/tmp/tmpfa8ifmka.py", line 80, in <module>
model.train(data='coco128.yaml',epochs=20)
I'm assuming this all starts here, I think that the pipeline is Not running the code from the same folder, and you are just missing the 'coco128.yaml'. Try to pass a full path, wdyt?
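For example (assuming the ultralytics YOLO API your log shows; weights / paths are hypothetical):
from pathlib import Path
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# pass an absolute path so the step finds the file regardless of its working directory
model.train(data=str(Path(__file__).parent / 'coco128.yaml'), epochs=20)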
I think I was not able to fully express my point. Let me try again.
When you are running the pipeline Fully locally (both logic and components) the assumption is this is for debugging purposes.
This means that the code of each component is locally available, could that be a reason?
Hi @<1644147961996775424:profile|HurtStarfish47>
I see Add image.jpg being printed for all my data items ...
I assume you forgot to call upload? The sync "marks" files for upload / deletion, but the upload call actually does the work.
Kind of like git add / push, if that makes sense?
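A minimal sketch of the full flow (project / folder names are hypothetical):
from clearml import Dataset

dataset = Dataset.create(dataset_project='examples', dataset_name='images')
dataset.sync_folder(local_path='/path/to/images')  # marks files for upload / deletion
dataset.upload()  # actually uploads the marked files
dataset.finalize()  # close this dataset version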
Hi CooperativeFox72 ,
From the backend guys, long story short, upgrade your machine => more cpu cores, more processes, it is that easy 🙂