Okay, found the issue. To disable SSL verification globally, add the following env variable:
CLEARML_API_HOST_VERIFY_CERT=0
(I will make sure we fix the actual issue with the config file)
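For reference, a minimal sketch of setting that variable from Python before the SDK creates its API session (setting it in the shell before launching works just as well; the project/task names below are placeholders):

import os

# must be set before the ClearML SDK initializes its API session
os.environ["CLEARML_API_HOST_VERIFY_CERT"] = "0"

from clearml import Task

task = Task.init(project_name="examples", task_name="ssl-verify-disabled")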
I see...
Currently (and this will change soon) the entire delta is stored in a single file, so there is no real way to download a "subset" of the data, only a parent version 🙂
Let's say that this small dataset has an ID ....
Yes this would be exactly the way to do so:
param = {'dataset': small_train_dataset_id_here}
task.connect(param)
dataset_folder = Dataset.get(param['dataset']).get_local_copy()
... Locally it will use the small_train_dataset_id_here, then whe...
Do people use ClearML with huggingface transformers? The code is std transformers code.
I believe they do 🙂
There is no real way to differentiate between "storing a model" using torch.save
and storing configuration ...
Hi StrangeStork48
I have good news, v1.0 is out with hashed passwords support.
Yes, that should work. The only thing is you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses, if you want the automagic to kick in). That said, usually there is no need, they are supposed to report everything back to the main one anyhow.
basically
@call_parse
def main(
    gpus:Param("The GPUs to use for distributed training", str)='all',
    script:Param("Script to run", str, opt=False)='',
    args:Param("Args to pass to script", nargs=...
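To illustrate the Task.init / Task.current_task() point above, here is a minimal sketch (assuming torch.multiprocessing spawns the workers; the project/task names are placeholders):

from clearml import Task
import torch.multiprocessing as mp


def worker(rank):
    # subprocesses pick up the task that was created by the master process
    task = Task.current_task()
    if task:
        task.get_logger().report_scalar("worker", "rank", value=rank, iteration=0)


if __name__ == "__main__":
    # Task.init is called once, on the master process only
    Task.init(project_name="examples", task_name="distributed-example")
    mp.spawn(worker, nprocs=2)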
(or woman or in between, we are supportive as long as code is working 🙂 )
but still clearml-agent will raise the same error
which one?
Hi WickedGoat98
Regardless of the ingress configuration (which it seems you have the hang of), the API instance itself needs to be configured with a persistent volume (the web / file server do not need direct access to the API server).
Can you get the API to run properly?
Regarding the trains-agent:
once you have the API/Web/File server configured, you can configure it like the trains-agent-services is configured inside the docker-compose (e.g. set the environment variable with the c...
For example, ServerA stores files at /opt/clearml but ServerB stores at /some_path/clearml
As long as you adjust your docker-compose yaml file, should be just fine
check the latest RC, it solved an issue with dataset uploading,
Let me check if it also solved this issue
Hi FlatOctopus65
You are almost there:
prev_task: Task = Task.get_task(task_id=<prev_task_id_here>)
model = prev_task.models['output'][-1]
my_check_point = model.get_local_copy()
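If the checkpoint was created with torch.save, loading it locally could look like this (a small sketch, not tied to any specific framework version):

import torch

# my_check_point is the local file path returned by model.get_local_copy()
state_dict = torch.load(my_check_point, map_location="cpu")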
Hi TenderCoyote78
I'm trying to run clearml-agent in my dockerfile,
I'm not sure I'm following. Are you trying to create a docker container containing the agent inside? For what purpose?
(notice that the agent can spin up any off-the-shelf container, there is no need to add the agent into the container, it will take care of itself when it is running it)
Specifically regarding your dockerfile:
RUN curl -sSL ... | sh
No need for this line
COPY clearml.conf ~/clearml.conf
Try the ab...
So agents on different nodes will probably require different cuda-version images.
That makes sense SarcasticSquirrel56
I would edit the helm chart (or deploy manually) based on a selector that will select the different nodes/gpus and assign the correct containers (i.e. matching CUDA versions to the different GPUs / drivers)
BTW: you can also play around with the k8s glue, which would dynamically spin up pods based on clearml Tasks.
wdyt?
So like a UI for creating pipelines doing different things on the different solutions?
SubstantialElk6 This seems to be the issue:
cp: failed to access '/root/default_clearml.conf': Permission denied
clearml_agent: ERROR: Could not find task id=024a421c0e174650a1c7ff64af756c26 (for host: )
Notice it seems it just cannot read the clearml.conf, wdyt?
Hi JoyousElephant80
Another possibility would be to run a process somewhere that periodically polls ClearML Server for tasks that have recently finished
this is the easiest way to implement what you are after, and have full control over the logic itself.
Basically you inherit from the Monitor class and implement the callback function:
https://github.com/allegroa...
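As a rough sketch of that pattern (assuming the Monitor base class from clearml.automation.monitor and its process_task callback, as in the linked example; the print is just a stand-in for your own logic):

from clearml.automation.monitor import Monitor


class FinishedTaskMonitor(Monitor):
    # called for each task the monitor detects as newly finished
    def process_task(self, task):
        print("Task finished:", task.id, task.name, task.status)


if __name__ == "__main__":
    # blocking loop, polls the ClearML Server every 60 seconds
    FinishedTaskMonitor().monitor(pool_period=60.0)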
This is the reason you are getting an error 🙂
Basically the session asks the agent to set up a new SSH server with credentials on the remote machine. This is not an issue inside a container, as it is an isolated environment, but when running in venv mode the user running the agent is not root, hence it cannot spin up / configure an SSH server.
Make sense ?
BattyLion34 is this running with an agent?
What's the comparison with a previously working Task (in terms of python packages) ?
Okay, that makes sense. If this is the case, I'm assuming you have set the files server to point to your S3 bucket, is that correct?
Could it be you are missing the credentials for that? (it is trying to upload the preprocessing code there, so the clearml-serving container would be able to pull it later)
LittleShrimp86 did you try to run the pipeline form the UI on remote machines (i.e. with the agents)? Did that work?
sets up the venv correctly, prints
Starting Task Execution:
then does nothing
Can you provide a log?
Do you see the code/git reference in the Pipeline Task details - Execution Tab ?
Hi ScaryKoala63
Sure, add the following to your clearml.conf:
sdk.storage.cache.default_cache_manager_size = 400
I think you are correct, it seems like for some reason you hit the cache limit, and a previous entry was deleted
RoughTiger69 the easiest thing would be to use the override option of Hydra:
parameter_override={'Args/overrides': '[the_hydra_key={}]'.format(a_new_value)}
wdyt?
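For context, a hedged sketch of one place such an override can be passed, assuming a PipelineController step (the project/task names, the_hydra_key, and a_new_value are placeholders):

from clearml import PipelineController

a_new_value = "some_value"  # placeholder for the value you want to inject

pipe = PipelineController(name="hydra-pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="hydra training task",
    # overrides the Hydra composition of the cloned step at runtime
    parameter_override={'Args/overrides': '[the_hydra_key={}]'.format(a_new_value)},
)
pipe.start()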
After testing the code again, I see the task parameter dictionary has been removed properly
Great!
However, I still have the same problem with duplicate tasks, as you can see in the image.
Any chance the pipeline script itself is running from the agent (as opposed to running the pipeline code locally, with the pipeline steps then executed on the agent)?
BTW: What's the TF / Keras version?