
Hi SmarmyDolphin68
You have two options:
1. Automatically upload the models while training by passing output_uri to Task.init. For example, output_uri=True will upload to the clearml-server, output_uri='s3://bucket/folder' will upload to S3, etc.
2. Manually upload a model that you have locally: https://github.com/allegroai/clearml/blob/9ff52a8699266fec1cca486b239efa5ff1f681bc/examples/reporting/model_config.py#L37
ohh sorry, weights_url=path
Basically the url can be the local path to the weights file 🙂
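If it helps, here is a minimal sketch of that manual import (the local path and model name are placeholders, not from the original example):

from clearml import Task, InputModel

task = Task.init(project_name="examples", task_name="import local model")

# weights_url can simply point to a local weights file
input_model = InputModel.import_model(
    weights_url="/path/to/my_model_weights.pkl",  # hypothetical local path
    name="manually imported model",               # hypothetical name
)
task.connect(input_model)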
GloriousPanda26 wouldn't it make more sense that multi run would create multiple experiments ?
but I have no idea what's behind 1, 2 and 3 compared to the first execution
This is why I would think multiple experiments, since each one will store all the arguments (and I think these arguments are somehow being lost).
wdyt?
is how you would create different queues,
SarcasticSquirrel56 you can create them from the UI, when the server is already running
(if you are saying, how do I create them in the first installation, then yes you are correct, this is possible in the helm chart, I think 🙂)
just want to be very precise and concise about them
Always appreciated 🙂
I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.
I'm assuming you need multiple "file-server" instances running on the "clearml-server" with a load-balancer of a sort...
that is odd..
So if you have 3 agents, how many concurrent experiments are they running? (actually running, not just registered as running)
Let me see if I can reproduce something
FileNotFoundError: [Errno 2] No such file or directory: 'tritonserver': 'tritonserver'
This is odd.
Can you retry with the latest from the GitHub repo?
pip install git+
Hi WickedGoat98 ,
I think you are correct 🙂
I would guess it is something with the ingress configuration (i.e. ConfigMap)
LOL I see a meme waiting for GrumpyPenguin23 🙂
ClearML does not work easily with Google Drive.
Yes, Google Drive is not Google Storage (which ClearML supports 🙂)
Seems like you solved it?
LOL 🙂
Make sure that when you train the model or create it manually you set the default "output_uri"
task = Task.init(..., output_uri=True)
or
task = Task.init(..., output_uri="s3://...")
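As a rough sketch of how this plays out (project/task/file names are placeholders, and it assumes the joblib auto-logging picks up the save call):

import joblib
from sklearn.linear_model import LogisticRegression
from clearml import Task

# output_uri=True -> upload saved models to the clearml file-server
task = Task.init(project_name="examples", task_name="train with auto upload", output_uri=True)

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# assuming framework auto-logging is enabled, this save is captured and the
# weights file is uploaded to the configured output_uri
joblib.dump(model, "model.pkl")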
What about the epochs though? Is there a recommended number of epochs when you train on that new batch?
I'm assuming you are also using the "old" images ?
The main factor here is the ratio between the previously used data and the newly added data; you might also want to resample (i.e. train on more) new data vs. old data. Makes sense?
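Just to make the ratio idea concrete, a tiny sketch in plain Python (the 2x oversampling factor is an arbitrary choice, not a recommendation):

import random

def build_training_set(old_samples, new_samples, new_oversample=2):
    # mix previously used data with the newly added data,
    # repeating the new data so it is seen new_oversample times as often
    mixed = list(old_samples) + list(new_samples) * new_oversample
    random.shuffle(mixed)
    return mixed

# usage sketch
train_set = build_training_set(old_samples=range(1000), new_samples=range(100))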
Hi FierceHamster54
Thanks for bringing it up 🙂
... in terms of secret management / key-value stores
Currently the open-source version does not include the Vault support (i.e. secret management); this is something they added to the enterprise version a few versions ago, and as far as I understand it is a per user/project/company granularity feature (i.e. company-wide settings merging with project settings merging with user-specific settings).
Is this what you are looking for or am I missing something ?
Very lacking wrt to how things interact with one another
If I'm reading it correctly, what you are saying is that some of the "big picture" / holistic approach on how different parts interact with one another is missing, is that correct?
I think ClearML would benefit itself a lot if it adopted a documentation structure similar to numpy ecosystem
Interesting thought, what exactly would you suggest we "borrow" in terms of approach?
So this is very odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.*
because by default it ignores the 4th+ parts of the version (they are unstable and torch has a tendency to remove them), and for some reason pip will not match 2.1.0.*
with, for example, "2.1.0.dev20230306+cu118"
but based on the docs it should work:
see here: None
As a workaround you can always edit and change to the final url for example: so ...
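If you want to poke at the matching behavior outside the agent, here is a quick check with the packaging library (which pip uses for version matching) - assuming its defaults mirror pip's pre-release handling:

from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet("==2.1.0.*")
candidate = Version("2.1.0.dev20230306+cu118")

# dev releases count as pre-releases, so by default they are excluded
print(candidate in spec)                           # expected: False
print(spec.contains(candidate, prereleases=True))  # expected: True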
Hi @<1526371965655322624:profile|NuttyCamel41>
I think that the only way to actually get huge number of api calls is with a lot of machines.
For example, regardless of the amount of console logs you print, it will only be a single call, as these are packaged every 2-10 seconds. The same goes for metric reporting etc.
On the free tier you can already test the amount of API calls, I think the mechanism is exactly the same
fyi: I would put this question in the channel
Hi ComfortableHorse5
Yes, this is more of a suggestion that you should write them using the platform capabilities; the UI implementation is being worked on, as well as a few helper classes. I think you'll be able to see a few in the next release 🙂
Now in case I needed to do it, can I add new parameters to cloned experiment or will these get deleted?
Adding new parameters is supported 🙂
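A small sketch of that flow (the task ID, parameter names and queue name are placeholders):

from clearml import Task

# clone an existing experiment (the source task ID is a placeholder)
template = Task.get_task(task_id="aabbccdd11223344")
cloned = Task.clone(source_task=template, name="clone with extra params")

# add / override parameters on the clone before enqueueing it
cloned.set_parameters({"General/new_param": 42, "General/batch_size": 64})

Task.enqueue(cloned, queue_name="default")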
Hi RipeGoose2
when creating a task the default path is still there
What do you mean by "PATH"? Do you want to provide a path for the config file? Is it for trains manual execution or the agent?
I'm hoping i can find an end to end solution that also includes experiment management
Well of course biased here, but ClearML with the hyperdatasets is probably the most complete one.
Specifically for model performance analysis I would add the voxel open-source tool to dissect specific results, but the combination of the abstraction and query capabilities of hyperdatasets, orchestration and experiment management is really unmatched.
(and again of course I'm biased, but really there is n...
Hi BroadMole98
What I think I am understanding about trains so far is that it's great at tracking one-off script runs and storing artifacts and metadata about training jobs, but doesn't replace kubeflow or snakemake's DAG as a first-class citizen. How does Allegro handle DAGgy workflows?
Long story short, yes you are correct. kubeflow and snakemake for that matter, are all about DAGs where each node is running a docker (bash) for you. The missing portions (for both) are:
How do I cr...
Hi GiganticTurtle0
You can have ClearML follow the dictionary and automatically update the UI:
args = task.connect(args)
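Something like this minimal sketch (the argument names and values are arbitrary):

from clearml import Task

task = Task.init(project_name="examples", task_name="connect args")

# a plain dictionary; connect() registers it so the UI shows the values
# (and an agent run can override them)
args = {"learning_rate": 0.001, "epochs": 10}
args = task.connect(args)

print(args["learning_rate"])  # reflects any value overridden from the UI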
, the easiest way possible would be if I could just somehow run the task and let the LSF manage the environment
You mean let LSF set up the conda/venv? Or do you also mean to get the code-base, changes etc.?
Yes! I checked, it should work (it checks if you have a load(...) function on the preprocess class and if you do it will use it):
None
# assumes: import joblib ; from clearml import Model
def load(self, local_file_name):
    # local_file_name is the path to the downloaded weights file
    self._model = joblib.load(local_file_name)
    self._preprocess_model = joblib.load(Model(hard_coded_model_id).get_weights())
Assuming this is a followup on:
https://clearml.slack.com/archives/CTK20V944/p1626184974199700?thread_ts=1625407069.458400&cid=CTK20V944
This depends on how you set it with: clearml-serving --endpoint my_model_entry
curl <serving-engine-ip>:8000/v2/models/my_model_entry/versions/1