Hi MuddySquid7, the issue is verified, v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!
Hi @<1657918724084076544:profile|EnergeticCow77>
Can I launch training with the HuggingFace accelerate package using multi-GPU?
Yes,
It detects torch distributed but I guess I need to set up the main task?
It should 🤞
Under the Execution tab, in the script path, you should see something like -m torch.distributed.launch ...
When I passed specific arguments (for example --steps) it ignored them...
script.py test blah1 blah2 blah3 42
Is this how it is intended to be used ?
CooperativeFox72 a bit of info on how it works:
In "manual" execution (i.e. without an agent)
path = task.connect_configuration(local_path, name=name)
path = local_path , and the content of local_path is stored on the Task
In "remote" execution (i.e. agent)
path = task.connect_configuration(local_path, name=name)
"local_path" is ignored, path is a temp file, and the content of the temp file is the content that is stored (or edited) on the Task configuration.
Make sense ?
CooperativeFox72 could you expand on "not working"?
If you have a yaml file, I would do:
```python
# local_path = './my_config.yaml'
path = task.connect_configuration(local_path, name=name)
if task.running_locally():
    with open(local_path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
    my_params_dict['change_me'] = 'new value'
    my_params_text = yaml.dump(my_params_dict)
    # store back the change, my_params assumed to be the content of the param file (text)
```
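For completeness, a self-contained sketch of how that snippet could continue (the store-back call and the remote branch are my assumption of the intended continuation, and the project/task names are illustrative):
```python
import yaml
from clearml import Task

task = Task.init(project_name="examples", task_name="config example")
local_path = './my_config.yaml'
name = 'my_config'

path = task.connect_configuration(local_path, name=name)
if task.running_locally():
    with open(local_path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
    my_params_dict['change_me'] = 'new value'
    my_params_text = yaml.dump(my_params_dict)
    # store the edited content back on the Task configuration (plain text)
    task.set_configuration_object(name, config_text=my_params_text)
else:
    # in remote execution `path` is a temp file holding the (possibly edited) Task configuration
    with open(path, "r") as config_file:
        my_params_dict = yaml.load(config_file, Loader=yaml.FullLoader)
```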
if so is there any doc/examples about this?
Good point, passing to docs 🙂
https://github.com/allegroai/clearml/blob/51af6e833ddc5a8ba1efaaf75980f58616b25e85/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py#L123
I mean it is mentioned, but we should highlight it better
I've tried setting up a ClearML application on OpenShift
First, my condolences 🙂 openshift ...
Second, what you need to make sure is that each container (i.e. ELK/MongoDB etc.) has its own PV for persistent storage; I'm assuming this is the root cause of the error.
Make sense to you ?
Hi @<1627478122452488192:profile|AdorableDeer85>
I'm sorry I'm a bit confused here, any chance you can share the entire notebook ?
Also any reason why this is pointing to "localhost" and not IP/host of the clearml-server ? is the agent running on the same machine ?
Hi @<1627478122452488192:profile|AdorableDeer85>
Are you referring to running the pipeline on a remote machine ? could you provide the full Task/Pipeline log ?
I'm getting a lot of bizarre errors running without a docker image attached
I think there is a mix in terminology
ClearML Agent can run in two different modes:
- virtual env mode - where it creates a new venv for every Task executed
- docker mode - where it spins up a docker container as the base environment, then inside the docker (in real time) it will fetch the code, install missing python packages etc. There is no need to build a specific docker container; for example, you can use the "python:3.10-bullseye" docker image (see the example below).
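For instance, spinning up an agent in docker mode with that image could look like this (the queue name is my assumption):
```
clearml-agent daemon --queue default --docker python:3.10-bullseye
```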
@<1523701868901961728:profile|ReassuredTiger98> what are you getting with:
nvidia-smi
And here:
ls -la /usr/local/
LittleShrimp86 what do you have in the Configuration Tab of the Cloned Pipeline?
(I think that it has empty configuration -> which means empty DAG, so it does nothing and leaves)
Hi @<1729309120315527168:profile|ShallowLion60>
ClearML in our case is installed on k8s using the Helm chart (version: 7.11.0)
It should be done "automatically", I think there is a configuration var in the helm chart to configure that.
What URLs are you seeing now, and what should be there?
Hi CooperativeFox72
I think the upload reporting (files over 5MB) was added after version 0.17, hence the log.
The default upload chunk reporting threshold is 5MB, but it is not configurable; maybe we should add it to the clearml.conf? wdyt?
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
Good point!
. We need to temporarily pause(kill or something else) running HPO task and reassign the resource for other needs.
Oh I see now....
Later, when more important experiments have been completed, we can continue the HPO task from the same state.
Quick question: when you say the HPO Task, you mean the HPO controller logic Task...
what are user properties
Think of them as parameters you can add post execution, that you can also add to the Task table (i.e. customize columns)
how can I add parameters
task.set_user_properties([{"name": "backbone", "description": "network type", "value": "great"}])
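For example, a minimal end-to-end sketch (project/task names are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="user properties demo")
# user properties can also be added or edited after the task finished running
task.set_user_properties(
    [{"name": "backbone", "description": "network type", "value": "great"}]
)
```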
A few more details on the New RC (1.1.2rc0) change set:
Upload dataset now supports chunksize, for multi-part upload/download (useful with large datasets)
backwards compatibility, i.e. parent datasets do not have to support multi-part datasets
Notice multi-part datasets should be accessed with the latest RC:
clearml-data upload --chunk-size
Dataset().upload(..., chunk_size=None)
Get Dataset supports partial download (i.e. for debugging, or for more efficient multi-node support); see the sketch below.
Notice total n...
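For reference, a rough sketch of how the chunked upload and partial download might be used (dataset names and sizes are illustrative; treat the exact parameters as an assumption based on the RC notes above):
```python
from clearml import Dataset

# upload with multi-part chunks (chunk size given in MB per the RC notes)
ds = Dataset.create(dataset_project="examples", dataset_name="big_dataset")
ds.add_files("/data/raw")
ds.upload(chunk_size=512)
ds.finalize()

# later, fetch only one part of the dataset (e.g. one part per node in a multi-node job)
ds = Dataset.get(dataset_project="examples", dataset_name="big_dataset")
local_path = ds.get_local_copy(part=0, num_parts=4)
```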
None
So this is the only place we need to change to support it, do you feel like messing around with it and adding IAM roles ?
Hi @<1546303293918023680:profile|MiniatureRobin9> could it be the pipeline logic is created via the clearml-task CLI? If this is the case, I think this is an edge case we should fix. Basically it creates a Task instead of a pipeline, which in essence only affects the UI. To solve it, just run the pipeline locally; notice that by default when you start it, it will actually stop the local run and relaunch itself on an agent.
Also, could you open a GitHub issue so we add a flag for it?
Hi GrittyKangaroo27
How could I turn off model logging when running this training step?
This is a good point! I think we cannot pass these arguments.
Would this make sense to you? PipelineDecorator.component(..., auto_connect_frameworks)
wdyt?
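If that argument gets added, usage could look roughly like this (purely a sketch of the proposal, not a confirmed API here):
```python
from clearml import PipelineDecorator

# hypothetical: disable automatic PyTorch model logging for this component only
@PipelineDecorator.component(auto_connect_frameworks={"pytorch": False})
def training_step(data):
    ...
```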
Hi JitteryCoyote63 ,
These properties are usually not available in the UI and are used internally, hence the lack of documentation. Regarding the parent property, it will hold a parent Task.id (str); that said, it has no real effect on the Task itself. You can however search for Tasks with a specific parent ID (for example, this is how the hyperparameter optimization class uses this property).
Hi RoughTiger69
but still get the semantics of knowing when an (external) file changed?
How would you know it changed?
This implies you have a way to verify the hash, which means you download the data, no?
Hmm, interesting, why would you want that? Is this because some of the packages will fail?
SarcasticSparrow10 LOL there is a hack around it 🙂
Run your code with python -O
Which basically skips over all assertion checks
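A quick way to see the effect (file name is hypothetical):
```python
# save as check_assert.py and compare:
#   python check_assert.py     -> raises AssertionError
#   python -O check_assert.py  -> prints "passed" (asserts are compiled out)
assert 1 == 2, "this only fires without -O"
print("passed")
```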
I pass my dataset as parameter of pipeline:
@<1523704757024198656:profile|MysteriousWalrus11> I think you were expecting the dataset_df dataframe to be automatically serialized and passed, is that correct ?
If you are using add_step, all arguments are simple types (i.e. str, int etc.)
If you want to pass complex types, your code should be able to upload it as an artifact and then you can pass the artifact url (or name) for the next step.
Another option is to use pipeline from decorators, as in the sketch below.
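Here is a rough sketch of that decorator approach, where return values (including dataframes) are serialized between steps for you (all names are illustrative):
```python
import pandas as pd
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset_df"])
def load_data():
    # runs as its own Task; the returned dataframe is stored as an artifact
    return pd.DataFrame({"a": [1, 2, 3]})

@PipelineDecorator.component(return_values=["rows"])
def count_rows(dataset_df):
    # the artifact is deserialized back into a dataframe for this step
    return len(dataset_df)

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def run():
    df = load_data()
    print(count_rows(df))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # or remove to launch the steps on agents
    run()
```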
Apparently the error comes when I try to access the pipeline component load_model from get_model_and_features. If it is not set as a pipeline component and only as a helper function, there is no error (provided it is declared before the component that calls it; I already understood that and fixed it, different from the code I sent above).
ShallowGoldfish8 so now I'm a bit confused, are you saying that now it works as expected ?
Hi PerplexedCow66
I'm assuming an extension for this:
https://github.com/allegroai/clearml-serving/issues/32
Basically JWT can be used as a general access/block for all endpoints, which is most efficiently used if handled by the k8s load balancer (nginx/envoy),
but if you want a per-endpoint check (or maybe do something based on the JWT values)
See adding JWT to FastAPI here:
https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/?h=jwt#oauth2-with-password-and-hashing-bearer-with-jwt-tokens
T...
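For illustration, a minimal per-endpoint JWT check in FastAPI could look like this (a sketch only, assuming PyJWT and a shared signing secret; not the clearml-serving implementation):
```python
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET = "change-me"  # assumption: shared signing secret

def verify_token(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    try:
        # decode and validate the JWT; raises if expired or invalid
        return jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")

@app.get("/serve/my_endpoint")
def serve(claims: dict = Depends(verify_token)):
    # 'claims' holds the decoded JWT payload, e.g. per-endpoint permissions
    return {"ok": True}
```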
Hi SuperiorDucks36
Could you post the entire log?
(could not resolve host seems to be coming from the "git clone" call).
Are you able to manually clone the repository on the machine running trains-agent?