ElegantCoyote26 It means we need to have a Keras logger that logs everything to Trains, then we need to hook it in automatically.
Do you feel like PR-ing the logger (the hooking I can take care of 🙂)?
LethalDolphin75 Yes, you are correct, we should add it here:
https://github.com/allegroai/clearml/blob/400c6ec103d9f2193694c54d7491bb1a74bbe8e8/clearml/automation/optuna/optuna.py#L210

```python
elif isinstance(p, UniformLogarithmicParameterRange):
    hp_type = 'suggest_float'
    hp_params = dict(
        low=p.min_value,
        high=p.max_value if p.include_max else p.max_value - p.step_size,
        log=True,
        step=p.step_size,
    )
```
btw: I'm not sure if the ...
any chance StorageManager could re-download files only if their size is different from file in cache (as an option)?
I think there is a `force` argument, to force the download.
I think the main issue is getting the size from the different backends (i.e. S3 / HTTPS / etc.).
Maybe we should add it as a GitHub feature request issue?
The main limitation is that the driver `list()` does not return the file size.
For example, it might be an issue with the default HTTP files-server.
wdyt?
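For reference, a minimal sketch of the existing flag, which is `force_download` in `StorageManager.get_local_copy` (the bucket URL here is hypothetical):

```python
from clearml import StorageManager

# Always re-download, ignoring any cached copy.
# Size-based cache invalidation is not supported, hence the feature request above.
local_path = StorageManager.get_local_copy(
    remote_url="s3://my-bucket/data/file.zip",  # hypothetical URL
    force_download=True,
)
```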
Is it consistent (the error)? Meaning, does it happen on other integer values?
So in a simple "all-or-nothing" scenario, this is actually the only solution, unless preemption is supported, i.e. aborting a running Task to free up an agent...
There is no "magic" solution for complex multi-node scheduling; even SLURM will essentially do the same...
BeefyHippopotamus73 this error seems like it is coming from boto3. Are you sure the credentials are properly configured and that you have read permission?
Actually it would be interesting to combine the two; Feast is fully open-source and supported by the Linux Foundation, so I cannot see the harm in that.
wdyt?
TightElk12 are you still looking for a way to create a new "sub-task"?
when I duplicate the experiment and clone it remotely, the call is ignored and the recorded values are used?
Yes ScantChimpanzee51 exactly.
Think of it as the initial value you want to set on the Task when you are running the code on your machine; later, when you clone the Task, you can edit the base docker image in the UI (or with the API). Of course the new value is used when the agent spins this Task, and to avoid the actual docker (the one you changed in the UI) being overwritten by ...
Correct, and that also means the code that runs is not auto-magically logged.
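To make the flow concrete, a minimal sketch of setting the initial value from code (project / task names and the docker image are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="base docker demo")  # hypothetical names
# Initial value recorded on the Task while running locally;
# when the Task is cloned, this field can be edited in the UI before the agent picks it up.
task.set_base_docker("nvidia/cuda:11.8.0-runtime-ubuntu22.04")  # hypothetical image
```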
Hi @<1546303293918023680:profile|MiniatureRobin9>
This is the "regular" message when calling `Dataset.get` without an alias.
This means the Dataset is not registered on the Task itself; just give it a name (i.e. pass the `alias` argument to `get`).
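For example, a minimal sketch of passing the alias (project / dataset names are placeholders):

```python
from clearml import Dataset

# Passing `alias` registers the Dataset on the calling Task,
# which also silences the "no alias" message.
ds = Dataset.get(
    dataset_project="my_project",  # hypothetical
    dataset_name="my_dataset",     # hypothetical
    alias="training_data",
)
local_copy = ds.get_local_copy()
```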
HelplessCrocodile8
Basically the file URI might be different on a different machine (out of my control) but they point to the same artifact storage location
We might have thought of that...
in your clearml.conf file:
```
sdk {
    storage {
        path_substitution = [
            # Replace registered links with local prefixes,
            # solve mapping issues, and allow for external resource caching.
            {
                registered_prefix = file:///mnt/data/...
```
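The snippet above is truncated; for context, a hedged sketch of what a complete entry might look like (both prefixes are placeholders):

```
sdk {
    storage {
        path_substitution = [
            {
                # hypothetical prefixes, replace with your own
                registered_prefix = "file:///mnt/data/"
                local_prefix = "file:///home/user/data/"
            }
        ]
    }
}
```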
Ohh I see. Okay, the next pipeline version (coming very very soon 🙂) will have the option of a function as a Task; would that be better for your use case?
(Also, in the case of local execution, and I can totally see why this is important, how would you specify where the current code base is? Are you expecting it to be local?)
I'm all for trying to help with debugging pipelines, because this is really challenging.
BTW: you can run your code as if it is executed from an agent (including the param ove...
Are you seeing the entire Jupyter notebook in the "uncommitted changes" section?
You can also see the code there. Could you run a quick test against the demo-server? This might be a UI issue that was fixed in the last version.
NICE! CurvedHedgehog15 cool stuff! and my pleasure 🙂
My task starts up and checks the mounted EFS volume for x data, if x data does not exist there, it then pulls x data from S3.
BoredHedgehog47 you can just use StorageManager and configure the clearml cache to the EFS mount; it will essentially do the same 🙂
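For example, pointing the clearml cache at the EFS mount in clearml.conf (the mount path is a placeholder):

```
sdk {
    storage {
        cache {
            # hypothetical EFS mount point
            default_base_dir: "/mnt/efs/clearml_cache"
        }
    }
}
```

With that in place, `StorageManager.get_local_copy()` pulls from S3 once and serves subsequent calls from the EFS-backed cache.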
Regarding the helm chart with EFS:
you need to configure the clearml-glue pod template with the EFS mount.
Example:
https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/e7f647f4e6fc76f983d61522e635353005f1472f/examples/kubernetes/volu...
Hi @<1523711619815706624:profile|StrangePelican34>
if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,
Main limitation is Triton's ability to dynamically load / unload models. We know Nvidia is adding this capability, but I think it is still not out; once they support it, it should be transparent.
Hi BeefyHippopotamus73
. I checked the template task and the list of "Installed Packages" indeed does not have one of my required packages in the list.
Basically the "installed packages" list is auto-populated based on the packages directly imported in your code base.
Could it be you do not import snowflake-connector-python directly, and it is a derivative package (i.e. required by a different package)?
BTW: when you clone your Task in the UI you can edit and add the missing packages,...
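If you prefer to fix it in code, a minimal sketch (note that `Task.add_requirements` must run before `Task.init`; names are placeholders):

```python
from clearml import Task

# Explicitly record the missing package on the Task's "installed packages"
Task.add_requirements("snowflake-connector-python")  # optionally pin a version
task = Task.init(project_name="examples", task_name="snowflake job")  # hypothetical names
```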
Hi @<1671689437261598720:profile|FranticWhale40>
Are you positive the Triton container finished syncing?
Could you provide the docker logs (both the serving and the triton containers)?
What is the clearml-serving version you are using?
Could you add a print in the "preprocess" function, just to validate you are getting to the correct model version?
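Something like this, a minimal sketch assuming the standard clearml-serving Preprocess class layout:

```python
# preprocess.py for the endpoint
class Preprocess(object):
    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # quick debug print to verify the request reaches this model version
        print("preprocess called, request body:", body)
        return body
```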
This one should work:

```python
path = task.connect_configuration(path, name=name)
if task.running_locally():
    my_params = read_from_path(path)
    my_params = change_params(my_params)  # change some stuff
    # store back the change; my_params is assumed to be the content of the param file (text)
    task.set_configuration_object(name=name, config_text=my_params)
```
JitteryCoyote63 no I think this is all controlled from the python side.
Let me check something
SoreDragonfly16 in the Hyperparameters tab, you have "parallel coordinates" (next to the "add experiment" button, press the button saying "values" and there should be "parallel coordinates").
Is that it?
Hi @<1644147961996775424:profile|HurtStarfish47>
. I see `Add image.jpg` being printed for all my data items ...
I assume you forgot to call `upload`? The sync call "marks" files for upload / deletion, but the `upload` call actually does the work.
Kind of like git add / git push, if that makes sense?
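To illustrate, a minimal sketch (project / dataset names and the folder are placeholders):

```python
from clearml import Dataset

ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")  # hypothetical
ds.sync_folder("./data")  # like `git add`: marks files for upload / deletion
ds.upload()               # like `git push`: actually uploads the file contents
ds.finalize()
```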
Wait, @<1686547375457308672:profile|VastLobster56>, per your config: `clearml-fileserver`, who sets this domain name? Could it be that it is resolvable only on your host machine? You can quickly test by running any docker on your machine and running `ping clearml-fileserver` from inside the docker.
Also, your log showed "could not download None ...", I would expect it to be None ..., no?
CooperativeFox72 a bit of info on how it works:
In "manual" execution (i.e. without an agent):
`path = task.connect_configuration(local_path, name=name)`
here path == local_path, and the content of local_path is stored on the Task.
In "remote" execution (i.e. with an agent):
`path = task.connect_configuration(local_path, name=name)`
here local_path is ignored; path is a temp file, and the content of the temp file is the content that is stored (or edited) on the Task configuration.
Make sense?
Is Task.current_task() creating a task?
Hmm it should not, it should return a Task instance if one was already created.
That said, I remember there was a bug (not sure if it was in a released version or an RC) that caused it to create a new Task if there wasn't an existing one. Could that be the case?
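A quick way to check, as a minimal sketch:

```python
from clearml import Task

task = Task.current_task()  # should return the existing Task instance, or None
if task is None:
    # no Task exists yet in this process
    task = Task.init(project_name="examples", task_name="my task")  # hypothetical names
```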
CooperativeFox72
Could you try to run the docker, and then inside the docker try to do:

```
su root
whoami
```