If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls...
They are batched together, so at least in theory, if flushing is fast you should not hit 10K that quickly. But a very good point
Oh nice! Is that for all logged values? How will that count against the API call budget?
Basically this is the "auto flush": it will flush (and batch) all the logs in a 30-second period, and yes, this is for all the logs (...
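The batching/auto-flush idea can be sketched roughly like this: reported values are buffered and sent together once per flush period, so one API call carries many scalar reports. The names here (`BatchingReporter`, `send_batch`) are illustrative only, not the ClearML API:

```python
import time

# Illustrative sketch (NOT the ClearML implementation): buffer scalar
# reports and send them in one batch per flush period, so the number of
# API calls grows with flush periods, not with individual reports.
class BatchingReporter:
    def __init__(self, flush_period_sec=30.0, send_batch=None):
        self.flush_period_sec = flush_period_sec
        self.send_batch = send_batch or (lambda batch: None)
        self.buffer = []
        self.last_flush = time.time()
        self.api_calls = 0

    def report_scalar(self, title, value, iteration):
        self.buffer.append((title, value, iteration))
        if time.time() - self.last_flush >= self.flush_period_sec:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_batch(self.buffer)  # one API call for the whole batch
            self.api_calls += 1
            self.buffer = []
        self.last_flush = time.time()
```

With a 30-second period, 20 scalars logged every 2000 steps collapse into far fewer calls than the 10k individual reports.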
Notice that you need to pass the returned scroll_id to the next call
scroll_id = response["scroll_id"]
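The scroll loop could look like the sketch below; `fetch_page` is a hypothetical stand-in for the actual API call, and the `"tasks"` key is an assumption about the response shape:

```python
# Hypothetical sketch of scroll-based pagination: keep passing the
# returned scroll_id back to the next call until no results come back.
# `fetch_page` stands in for the real API call; adjust the response
# keys ("tasks", "scroll_id") to match your endpoint.
def fetch_all(fetch_page, page_size=500):
    results, scroll_id = [], None
    while True:
        response = fetch_page(scroll_id=scroll_id, size=page_size)
        batch = response.get("tasks", [])
        if not batch:
            break
        results.extend(batch)
        scroll_id = response["scroll_id"]  # pass to the next call
    return results
```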
TenseOstrich47 / PleasantGiraffe85
The next version (I think releasing today) will already contain scheduling, and the next one (probably an RC right after) will include triggering. That said, currently the UI wizard for both (i.e. creating the triggers) is only available in the community hosted service. Creating them from code (triggers/schedule) actually makes a lot of sense,
pipeline presented in a clear UI,
This is actually actively worked on, I think Anxious...
UpsetCrocodile10
Does this method expect
my_train_func
to be in the same file as
As long as you import it and you can pass it, it should work.
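The "import it and pass it" point can be sketched like this; `run_task` is a hypothetical stand-in for whatever API the callable is handed to:

```python
# my_train_func can live in another module; as long as it is importable
# and passed as a callable, a runner can invoke it.
# `run_task` is a hypothetical stand-in, NOT a ClearML API.
def my_train_func(epochs):
    return "trained for {} epochs".format(epochs)

def run_task(func, **kwargs):
    # the runner only needs the callable, not the file it came from
    return func(**kwargs)
```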
Child exp gets aborted immediately ...
It seems it cannot find the file "main.py"; it assumes all code is part of a single repository. Is that the case? What do you have under the "Execution" tab for the experiment?
Hi @<1657918724084076544:profile|EnergeticCow77>
Can I launch training with HuggingFace's accelerate package using multi-gpu
Yes,
It detects torch distributed but I guess I need to setup main task?
It should
Under the Execution tab, script path, you should see something like -m torch.distributed.launch ...
however setting up the interpreter on PyCharm is different on Mac for some reason, and the video just didn't match what I see
MiniatureCrocodile39 Are you running on a remote machine (i.e. PyCharm + remote ssh) ?
If you are using the "default" queue for the agent, notice you might need to run the agent with --services-mode to allow for multiple pipeline components on the same machine
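A typical invocation might look like the following (the queue name is just an example; check the flags against your installed clearml-agent version):

```
clearml-agent daemon --queue default --docker --services-mode
```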
HurtWoodpecker30 currently in the open source only AWS is supported, I know the SaaS pro version supports it (I'm assuming enterprise as well).
You can however manually spin an instance on GCP and launch an agent on the instance (like you would on any machine)
Okay, I'll make sure we always quote ", since it seems to work either way.
We will release an RC soon, with this fix.
Sounds good?
odd message though ... it should have said something about boto3
ConvolutedSealion94 what's your python version?
(the error itself is clearml failing to execute git diff, or read the output, I suspect unicode or something, assuming you were able to run the same command manually)
Generally speaking, for that exact reason: if you are passing a list of files or a folder, it will actually zip them and upload the zip file. Specifically for pipelines it should be similar. BTW, I think you can change the number of parallel upload threads in StorageManager, but as you mentioned, zipping into one file is faster. Makes sense?
BTW: the above error is a mismatch between the TF and the docker, TF is looking for cuda 10, and the docker contains cuda 11
Hi BroadSeaturtle49
torchvision!=0.13.0,>=0.8.1 is this what you have in the requirements ?
The clearml-agent is parsing the requested version and tries to match it to the version found/supported by the installed cuda
There is the possibility the combination either does not exist, or for some reason the parsing (i.e. clearml-agent's parsing) fails
can you maybe provide the Task's full log?
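Conceptually, the agent first has to split a specifier like `torchvision!=0.13.0,>=0.8.1` into individual clauses before it can match a version against the installed CUDA. A rough stdlib-only illustration of that split (this is NOT clearml-agent's actual parser):

```python
import re

# Rough illustration of splitting a pip-style requirement such as
# "torchvision!=0.13.0,>=0.8.1" into (operator, version) clauses.
# NOT clearml-agent's actual parser, just the general idea.
def parse_requirement(req):
    name_match = re.match(r"^([A-Za-z0-9_.\-]+)", req)
    name = name_match.group(1)
    spec = req[name_match.end():]
    clauses = []
    for part in filter(None, spec.split(",")):
        m = re.match(r"(==|!=|>=|<=|>|<|~=)(.+)", part.strip())
        if m:
            clauses.append((m.group(1), m.group(2)))
    return name, clauses
```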
UnsightlySeagull42 the assumption is that the agent has a read-only user with access to all the repositories.
At the moment there is no way to configure a different user/pass per repository in the clearml.conf
You can however:
- Embed the user/pass in the repository link (not very secure)
- Use an ssh-key and have it under .ssh on the host machine
- Use .git-credentials and configure them (with per-project user/pass)
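For the .git-credentials route, git's standard store format is one URL per line with the credentials embedded (hypothetical values shown), enabled via `git config --global credential.helper store`:

```
https://some_user:some_token@github.com
https://other_user:other_token@gitlab.example.com
```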
Sounds like something very similar, I'll try to use it,
You can set it per container with -e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1
Or add it here:
https://github.com/allegroai/clearml-agent/blob/51eb0a713cc78bd35ca15ed9440ddc92ffe7f37c/docs/clearml.conf#L149
extra_docker_arguments: ["-e", "CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1"]
BTW: from the instance name it seems like it is a VM with preinstalled pytorch, why don't you add system site packages, so the venv will inherit all the preinstalled packages, it might also save some space
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it on the agent's conf file to:
system_site_packages: true
I guess the thing that's missing from offline execution is being able to load an offline task without uploading it to the backend.
UnevenDolphin73 you mean like as to get the Task object from it?
(This might be doable, the main issue would be the metrics / logs loading)
What would be the use case for the testing ?
Hi @<1671689437261598720:profile|FranticWhale40>
Are you positive the Triton container finished syncing ?
Could you provide the docker log (both the serving and the triton)?
What is the clearml-serving version you are using ?
Could you add a print in the "preprocess" function, just to validate you are getting to the correct model version ?
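As a sketch, the print could go at the top of `preprocess` in your Preprocess class. The method signature below follows the usual clearml-serving layout, but double-check it against your clearml-serving version:

```python
# Sketch only: a Preprocess class in the usual clearml-serving layout,
# with a print added to confirm which model endpoint/version is hit.
# Verify the exact method signature against your clearml-serving version.
class Preprocess:
    def __init__(self):
        # populated by the serving service at runtime
        self.model_endpoint = None

    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        print("preprocess called, endpoint:", self.model_endpoint)
        return body
```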
Hi TenderCoyote78
I'm trying to clearml-agent in my dockerfile,
I'm not sure I'm following. Are you trying to create a docker container with the agent inside? For what purpose?
(notice that the agent can spin up any off-the-shelf container; there is no need to add the agent into the container, it will take care of itself when running it)
Specifically to your docker file:
RUN curl -sSL
| sh
No need for this line
COPY clearml.conf ~/clearml.conf
Try the ab...
BattyLion34 if everything is installed and used to work, what's the difference from the previous run that worked ?
(You can compare in th UI the working vs non-working, and check the installed packages, it would highlight the diff, maybe the answer is there)
but the requirement was already satisfied.
I'm assuming it is satisfied on the host python environment, do notice that the agent is creating a new clean venv for each experiment. If you are not running in docker-mode, then you ca...
ElegantCoyote26 what you are after is:
docker run -v ~/clearml.conf:/root/clearml.conf -p 9501:8085
Notice the internal port (i.e. inside the docker is 8080, but the external one is changed to 9501)
Tried context provider for Task?
I guess that would only make sense inside notebooks ?!
or even different task types
Yes there are:
https://clear.ml/docs/latest/docs/fundamentals/task#task-types
https://github.com/allegroai/clearml/blob/b3176a223b192fdedb78713dbe34ea60ccbf6dfa/clearml/backend_interface/task/task.py#L81
Right now I don't see differences, is this a deliberate design?
You mean on how to use them? I.e. best practice ?
https://clear.ml/docs/latest/docs/fundamentals/task#task-states
Click on the Task it is running and abort it, it seems to be stuck, I guess this is why the others are not pulled
Notice this is only when:
- Using Conda as the package manager in the agent
- The requested python version is already installed (multiple python version installations on the same machine/container are supported)