Seems like pip 20.1.1 has the issue, but >= 22.2.2 does not.
Notice we changed the value there; it now has two versions, one for Python < 3.10 and one for Python >= 3.10.
The main reason is that pip changed its resolving algorithm, and the new one can break its own dependencies (i.e. pip freeze > requirements.txt followed by pip install might not actually work)
None
But I have no idea what the input of step 2 will be.
What do you mean by that? The assumption is that the output of step 1 will somehow be passed (as a string reference) to step 2. What am I missing?
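For illustration, a rough sketch with the decorator-based pipeline (the project, pipeline and function names here are just placeholders); the controller registers each step's return value and resolves the reference when the next step runs:
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def step_one():
    # whatever step 1 returns is stored as an artifact of the step's Task
    return "some-string-result"

@PipelineDecorator.component()
def step_two(result):
    # the controller resolves the stored reference and passes the actual value here
    print("step 2 received:", result)

@PipelineDecorator.pipeline(name="toy pipeline", project="examples", version="1.0")
def run_pipeline():
    result = step_one()
    step_two(result)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug locally; drop this line to run on agents
    run_pipeline()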
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
Correct
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
Correct (that said I do not understand how come a single Task does not utilize the CPU, I was under the impression it is run...
Hmm, we could add an optional test for the Python version, and then fail the Task if the Python version is not found. wdyt?
Let me check what's the subsampling threshold
Oh, in that case add --remote-gateway <external_ip> - it will connect to the provided address instead of the local one (you can also add --public-ip, which will automatically resolve the public IP of the server)
I found the issue: on the first run it skips over the first day (let me check if we can quickly fix that)
BattyLion34 the closest I can think of is the monitoring class, which can easily be extended.
Datasets are a type of Task, so we can monitor a project and trigger an action when we see a change in number of Tasks/Datasets that are completed.
Monitoring class:
https://github.com/allegroai/clearml/blob/master/clearml/automation/monitor.py
Monitoring example:
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py
I think a dataset monitoring example wil...
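For illustration, a rough sketch of a dataset/task monitor built on that class; the hook names (process_new_instances, monitor) follow the slack_alerts example and the exact signatures may differ between versions:
from clearml.automation.monitor import Monitor

class DatasetMonitor(Monitor):
    # hypothetical subclass: react whenever a newly completed Task/Dataset shows up
    def process_new_instances(self, new_instances):
        for task in new_instances:
            print("New completed task: {} ({})".format(task.name, task.id))

if __name__ == "__main__":
    monitor = DatasetMonitor()
    # poll the backend every 60 seconds (same pattern as the slack_alerts example);
    # project filtering hooks also exist on the class, names may vary by version
    monitor.monitor(pool_period=60.0)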
BTW: you still can get race/starvation cases... But at least no crash
Yes, or at least credentials and API...
Maybe inside your code you can later copy the model into a fixed location?
This way you have the model in the model repository and a copy in a fixed location (StorageManager can upload to a specific bucket/folder with the same credentials you already have)
Would that work?
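A minimal sketch of that idea (the bucket path is just a placeholder):
from clearml import StorageManager

# copy the trained model file to a fixed, well-known location,
# using the same credentials already configured in clearml.conf
uploaded_url = StorageManager.upload_file(
    local_file="model.pt",
    remote_url="s3://my-bucket/models/latest/model.pt",
)
print(uploaded_url)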
p.s. you should remove this line 🙂
extra_index_url: ["git@github.com:salimmj/xxxx"]
Hi @<1610808279263350784:profile|FriendlyShrimp96>
Is there a way to get a list of variants given a metric, or even just a full list of metrics and variants for a given task id?
Try this
None
from clearml.backend_api.session.client import APIClient
# query the backend API directly for the metric/variant events reported by a task
c = APIClient()
metrics = c.events.get_task_metrics(tasks=["TASK_ID_HERE"], event_type="training_debug_image")
print(metrics)
I think API ...
I'm glad it worked out, thanks SmallBluewhale13 🙂
Ok, just my ignorance then?
LOL, no it is just that with a single discrete parameter the strategy makes less sense 🙂
That makes sense...
Basically in the open-source version the approach is everyone sees everything for maximum transparency (and also ease of use). I know there are access-roles in the paid tier and vault for exactly these types of things...
Where do you currently save them? and how do you pass them to the remote machine ?
So the original looks good, could it be you tried to clone a Task that was executed with an agent with pip, and then pushed into an agent running conda?
JitteryCoyote63 the agent.cuda_version (or the CUDA_VERSION env var) tells the agent which PyTorch wheel to download. The cuDNN library can be included inside any wheel and it will work as long as cuda / cudart exist on the system; for example, PyTorch wheels include the cuDNN they use. agent.cudnn_version should actually be deprecated, and is not actually used.
For future reference, dependency order:
Nvidia drivers
CUDA library and CUDA-runtime libraries (libcuda.so / libcudart.so)
cuDN...
Hi @<1523701797800120320:profile|SteadySeagull18>
...the job -> requeue it from the GUI, then a different environment is installed
The way it works is: in the "originating" (i.e. first, manual) execution only the directly imported packages are listed (not the derivative packages that are required by the original packages)
But when the agent is reproducing the job, it creates a whole clean venv for the experiment, installs the required packages, then pip resolves the derivatives, and ...
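If the goal is to have the agent reinstall the exact original environment instead of re-resolving it, one option (a sketch; call it before Task.init on the original run) is to freeze the full environment:
from clearml import Task

# store the full `pip freeze` output on the original run instead of only the
# directly imported packages, so the agent reinstalls the exact same versions
Task.force_requirements_env_freeze(force=True)

task = Task.init(project_name="examples", task_name="frozen-env-run")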
There may be cases where failure occurs before my code starts to run (and, perhaps, after it completes)
Yes, that makes sense, especially from an IT failure perspective
Hi RipeGoose2 all PR's are welcome, feel free to submit :)
Hi ThankfulOwl72 check out the TrainsJob object. It should essentially do what you need:
https://github.com/allegroai/trains/blob/master/trains/automation/job.py#L14
You can always access the entire experiment data from python
Task.get_task(task_id).data
It should all be there.
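For example (the task ID is a placeholder):
from clearml import Task

task = Task.get_task(task_id="TASK_ID_HERE")
print(task.data)                    # the full experiment object as stored in the backend
print(task.get_parameters())        # hyperparameters as a flat dict
print(list(task.artifacts.keys()))  # registered artifacts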
What's the exact use case you had in mind?
So now for it to take place you need to enqueue the Task and set an agent to pick it up and run it.
When the agent is running the Task the new parameter will be passed.
Does that make sense?
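For reference, a minimal sketch of doing the same from code (the task ID and queue name are placeholders):
from clearml import Task

# enqueue the cloned Task (with the edited parameter); any agent
# listening on the "default" queue will pick it up and run it
task = Task.get_task(task_id="CLONED_TASK_ID")
Task.enqueue(task, queue_name="default")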
In the case of scalars it is easy to see (the maximum number of iterations is a good starting point)
Thanks for the logs @<1627478122452488192:profile|AdorableDeer85>
Notice that the log you attached shows that the preprocessing is executed and the GPU backend is returning an error.
Could you provide the log of the docker compose? Specifically, the interesting part is the Triton container - I want to verify it loads the model properly.
Hi TrickyRaccoon92
If you are reporting to tensor-board, then "iteration" equals step. Is this the case?
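For instance, with TensorBoard auto-logging (a minimal PyTorch-flavored sketch; project and task names are placeholders):
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="examples", task_name="tb-iterations")
writer = SummaryWriter()
for step in range(100):
    loss = 1.0 / (step + 1)
    # the global_step passed here is what shows up as the iteration in ClearML
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()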
For example, for some of our models we create pdf reports that we save in a folder on the NFS disk
Oh, why not as artifacts? At least you will be able to access them from the web UI, and avoid VFS credential hell 🙂
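A minimal sketch of that (paths and names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="report-example")
# register the generated PDF as an artifact; it is uploaded and
# linked in the experiment's Artifacts tab in the web UI
task.upload_artifact(name="model report", artifact_object="/nfs/reports/model_report.pdf")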
Regarding clearml datasets:
https://www.youtube.com/watch?v=S2pz9jn26uI