Hi @<1585078763312386048:profile|ArrogantButterfly10>
Now I want to clone the pipeline and change the hyperparameters of the train task. Is that possible? If so, how?
The pipeline arguments are for the pipeline DAG/logic; you need to pass one of them as an argument to the training step/task. Makes sense?
@<1585078763312386048:profile|ArrogantButterfly10> could it be that the "base task" of the pipeline step does not have any hyperparameters? (I mean the Task that the pipeline clones and is supposed to set new hyperparameters for...)
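A minimal sketch of wiring a pipeline argument into a step's hyperparameters (the project/task names here are hypothetical):
```python
from clearml import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0")

# Expose a pipeline-level argument, editable when the pipeline is cloned
pipe.add_parameter(name="learning_rate", default=0.001)

# Forward it into the training step's hyperparameters, so cloning the
# pipeline and editing "learning_rate" changes the step's configuration
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train task",
    parameter_override={"General/learning_rate": "${pipeline.learning_rate}"},
)

pipe.start()
```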
Although I didn't understand why you mentioned `torch` in my case?
Just a guess 🙂 other frameworks do multi-processing as well; I would guess it relates to the parallelization of Task execution in the `HyperParameterOptimizer` class?
Yes, that might be it; it's basically a by-product of using Python's "Process" class for multiprocessing. We are working on a fix; not a trivial one, unfortunately.
In order to facilitate multiple credentials one must use the ClearML SDK, obviously.
Yes 🙂
Is this example working for you?
https://github.com/allegroai/clearml/blob/master/examples/reporting/model_config.py
Okay, now I'm lost. Is this reproducible? Are you saying a Dataset with remote links to S3 does not work?
Did you provide credentials for your S3 (in your clearml.conf)?
The default is the ClearML data server
Yes, the default is the ClearML files server. What did you configure it to? (e.g. it should be something like None )
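For reference, S3 credentials go under the sdk.aws.s3 section of clearml.conf; a sketch with placeholder values (not real keys):
```
sdk {
    aws {
        s3 {
            # default credentials used for any s3:// access (placeholders)
            key: "AWS_ACCESS_KEY"
            secret: "AWS_SECRET_KEY"
            region: "us-east-1"

            # per-bucket credentials, if a specific bucket needs its own keys
            credentials: [
                {
                    bucket: "my-bucket"
                    key: "BUCKET_ACCESS_KEY"
                    secret: "BUCKET_SECRET_KEY"
                }
            ]
        }
    }
}
```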
It is available of course, but I think you need clearml-server 1.9+
Which version are you running?
Let's say I don't have the data on my local machine, only in an S3 bucket.
You can still register it, but make sure you do not delete it from the S3 bucket, because the dataset will keep a link to it. For example:
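A minimal sketch, assuming hypothetical project and bucket names:
```python
from clearml import Dataset

ds = Dataset.create(dataset_project="examples", dataset_name="s3-only-data")

# Register links to the remote files without downloading them locally;
# the dataset keeps references, so do not delete the S3 objects
ds.add_external_files(source_url="s3://my-bucket/training-data/")

ds.upload()    # uploads only the metadata / file listing
ds.finalize()
```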
Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /
What did you put in `output_uri`?
Hi @<1562610699555835904:profile|VirtuousHedgehong97>
I think you need to upgrade your self-hosted clearml-server, could that be the case?
Hi
The Squash operation copies all the data and is then no longer linked to the previous commits?
Yes. Basically the idea is that if you have a data version that relies on many parents that need to be merged, squash will create a merged copy and push it all as a single version, and then, yes, the parent versions are no longer needed.
I thought this operation is like git squash, but it seems to me ...
Yeah... we did not want to actually delete the parents because, unlike git, the operation is done ...
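A minimal sketch of the squash call (the dataset ids are placeholders):
```python
from clearml import Dataset

# Create a single flat version holding a copy of all the parents' data
merged = Dataset.squash(
    dataset_name="merged-data",
    dataset_ids=["<parent_id_1>", "<parent_id_2>"],
)
# The parent versions stay untouched; they are not deleted
```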
@<1535793988726951936:profile|YummyElephant76>
Whenever I create any task, the "uncommitted changes" are the contents of `ipykernel_launcher.py`. Is there a way to make ClearML recognize that I'm running inside a venv?
This sounds like a bug, it should have the entire notebook there, no?
So it sounds as if, for some reason, calling `Task.init` inside a notebook on your JupyterHub is not detecting the notebook.
Is there anything special about the JupyterHub deployment? How is it deployed? Is it password protected? Is this reproducible?
@<1535793988726951936:profile|YummyElephant76> oh, you mean the Jupyter server was running, then inside the notebook you started a new venv, and in that venv the "notebook" package was missing, hence it failed to detect the notebook?
Does `Task.connect` send each element of the dictionary as a separate API request? Has anyone else encountered this issue?
Hi SuperiorPanda77
`task.connect` ends up as a single call, with all the data sent in a single request.
That said, maybe a connected dict is not the best solution for a thousand-key dictionary...
Maybe an artifact or `connect_configuration` is better suited (see the sketch below)?
wdyt?
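A minimal sketch of the alternatives (project/task names are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="big-dict")

big_dict = {f"key_{i}": i for i in range(10_000)}

# task.connect(big_dict) would register every entry as an editable
# hyperparameter; for thousands of keys, a configuration object or an
# artifact is lighter weight:
task.connect_configuration(big_dict, name="big_dict")            # one config blob
# or:
task.upload_artifact(name="big_dict", artifact_object=big_dict)  # stored as an artifact
```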
Thanks a lot. I meant running a bash script after cloning the repository and setting up the environment.
Hmm, that is currently not supported 😞
The main issue with adding support is where to store this bash script...
Perhaps somewhere inside ClearML there is an order of startup actions that can be changed?
Not that I can think of,
but let's assume you could have such a thing; what would you put in the bash script? (Basically I want to see if maybe there is a workaround...
the `eval` built-in. wdyt?
`eval` is never recommended, as basically you could do Args/float='os.system("rm ...")'
🙂
In theory the type is stored on the hyperparameter (this is a relatively new feature the backend supports).
The casting, though, is done based on the original value type, which means `Task.connect` needs to be called with the original dict. Is there a specific reason for using `get_parameters` instead of `task.connect`?
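A minimal sketch of connecting the original dict so the types are recorded (project/task names are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="typed-params")

# Connect the ORIGINAL dict: the value types (float, int, bool) are recorded,
# so when the task is cloned and edited in the UI, the new values are cast
# back to the original types at runtime
params = {"lr": 0.001, "epochs": 10, "use_amp": True}
task.connect(params)

print(params["lr"])  # when executed by an agent, this is a float again
```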
BTW: what happens if you pass the same s3://bucket to `Task.init`'s `output_uri`? I assume you are getting the same access issue?
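For reference, a minimal sketch of passing an S3 destination to `Task.init` (the bucket path is hypothetical):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="s3-output",
    output_uri="s3://my-bucket/clearml-artifacts",  # models/artifacts go to S3
)
```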
Hi FierceHamster54
Do I need to instantiate a task inside my component? Seems a bit redundant...
Yes, so the idea is that the Task (along with the code) will be automatically linked with the output model, for better traceability.
That said, you can "import" a model into the system (i.e. it was created somewhere else and you want to register it) with `InputModel.import_model` :
https://clear.ml/docs/latest/docs/clearml_sdk/model_sdk#importing-models
I guess "Input" from that perspective...
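A minimal sketch of registering an externally-created model (the weights location here is hypothetical):
```python
from clearml import InputModel

# Register a model created outside ClearML so it appears in the model registry
model = InputModel.import_model(
    name="pretrained-model",
    weights_url="s3://my-bucket/models/model.pt",  # local path or http url also work
)
# The registered model can then be connected to a Task for traceability
```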
FierceHamster54 are you sure you have write permissions?
Hi FierceHamster54
Thanks for bringing it up 🙂
... in terms of secret management/key-value stores
Currently the open-source version does not include Vault support (i.e. secret management); this is something that was added to the enterprise version a few versions ago, and as far as I understand it is a per user/project/company granularity feature (i.e. company-wide settings merging with project settings merging with user-specific ones).
Is this what you are looking for, or am I missing something?
Hi @<1658281093108862976:profile|EncouragingPenguin15>
Should work. I'm assuming multiple nodes are running agents? Or are you saying Ray spins up the jobs and ClearML logs them?
Should work out of the box; maybe the only thing to note is that you will get a Task for every local_rank 0 process.
Does that make sense?
`connect_configuration` seems to take about the same amount of time, unfortunately!
I think it is a better solution; that said, from your description it sounds like the issue is the upload bandwidth (i.e. JSON-ing the dict itself). Could that be it?
(And even 1000 entries seems like something that would end up as a ~1MB upload, which is not that much.)
SuperiorPanda77 I have to admit I'm not sure what would cause the slowness only on GCP... (if anything, I would expect the network infrastructure to be faster)
Hi @<1554275802437128192:profile|CumbersomeBee33>
What do you mean by "will the dependencies be removed or not"?
The next time the agent spins up a new Task, it will create a new venv and delete the previous one.