I see, actually what you should do is a fully custom endpoint:
- preprocessing -> download the video
- processing -> extract frames and send them to Triton with gRPC (see below how)
- postprocessing -> return a human readable answer
Regarding the processing itself, what you need is to take this function (copy paste):
None
have it as internal `_process...
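For the Triton call itself, a minimal sketch of what such an internal processing helper could look like, assuming Triton is listening for gRPC on localhost:8001 and that the model and tensor names ("frame_classifier", "input", "output") match your deployment (all of these names are placeholders):

import numpy as np
import tritonclient.grpc as grpcclient

def _process(frames):
    # frames: a batch of preprocessed video frames, e.g. shape (N, 3, H, W)
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    infer_input = grpcclient.InferInput("input", list(frames.shape), "FP32")
    infer_input.set_data_from_numpy(frames.astype(np.float32))
    result = client.infer(model_name="frame_classifier", inputs=[infer_input])
    # raw model output, to be turned into a human readable answer in postprocessing
    return result.as_numpy("output")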
BTW: the same holds for tagging multiple experiments at once
yep, that's the reason it is failing, how did you train the model itself ?
HurtWoodpecker30 in order to have the venv cache activated, it uses the full "pip freeze" stored in the "installed packages" section; this means that when you clone a Task that was already executed, you will see it is using the cached venv.
(BTW: the packages themselves are cached locally, meaning no time is spent on downloading, just on installing, but that is also time consuming, hence the full venv cache feature).
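If it helps, the full venv cache is controlled from the agent section of clearml.conf; roughly something like the snippet below, though the exact keys and defaults are best checked against your own clearml.conf, so treat this as a sketch:

agent {
    venvs_cache: {
        # maximum number of cached venvs to keep around
        max_entries: 10
        # setting a path enables the full venv cache
        path: ~/.clearml/venvs-cache
    }
}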
Make sense ?
Question - why is this the expected behavior?
It is 🙂 I mean the original python version is stored, but pip does not support replacing the python version. It is doable with conda, but then you have to use conda for everything...
You mean for running a worker? (I think plain vanilla python / ubuntu works)
The only change would be pip install clearml / clearml-agent ...
MysteriousBee56 there is no way to tell the trains-agent to pull from a local copy of your repository...
You might be able to hack it, if you copy the entire local repo to the trains-agent version control cache. Would that help you?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more once the current launching is done; notice this limits the launching, not the actual execution), it is just a way to limit it.
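As a generic illustration of that idea (not the actual pipeline internals), a thread pool only bounds how many launch calls are in flight at the same time, not how many jobs end up executing:

from multiprocessing.pool import ThreadPool

def launch(step_id):
    # e.g. clone a template task and enqueue it
    print(f"launching step {step_id}")

pool = ThreadPool(16)           # at most 16 launch calls running at once
pool.map(launch, range(100))    # all 100 steps still get launched eventually
pool.close()
pool.join()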
ReassuredTiger98 if you use the latest RC I sent and run with --debug, in the log you will see the full content of /tmp/conda_envaz1ne897.yml
Here it is copied from your log, do you want to see if this one works:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- de...
I have the problem that "debug samples" are not shown anymore after running many iterations.
ReassuredTiger98 could you expand on it? What do you mean by "not shown anymore" ?
Can you see other reports ?
I see, so basically pull a fixed set of configurations for everyone from the server.
Currently only the scale/enterprise version supports such a feature 😞
If I checkout/download dataset D on a new machine, it will have to download/extract 15GB worth of data instead of 3GB, right? At least I cannot imagine how you would extract the 3GB of individual files out of zip archives on S3.
Yes, I'm not sure there is an interface to extract only partial files from the zip (although worth checking).
I also remember there is a GitHub issue with uploading a 50GB dataset, and the bottom line is, we should support setting chunk size, so that we can uploa...
Yep 🙂 but only in RC (or github)
but here I can tell them: return a dictionary of what you want to save
If this is the case you have two options: either store the dict as an artifact (this makes sense if it is not a standalone model you would like to use later), or store it as a model.
Artifact example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
getting them back
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts_retrieval.py
Model example:
https:/...
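For the dict-as-artifact route, a minimal sketch; the project/task names and the task ID below are placeholders:

from clearml import Task

# store the dict as an artifact on the current task
task = Task.init(project_name="examples", task_name="store dict")
task.upload_artifact(name="my_dict", artifact_object={"lr": 0.1, "epochs": 10})

# later, from any machine, fetch it back by task ID
source_task = Task.get_task(task_id="aaabb111")
restored = source_task.artifacts["my_dict"].get()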
OddAlligator72 what you are saying is, take the repository / packages from the runtime, aka the python code calling the "Task.create(start_task_func)" ?
Is that correct ?
BTW: notice that the execution itself will be launched on other remote machines, not on this local machine
so that you can get the latest artifacts of that experiment
what do you mean by "the latest artifacts"? do you have multiple artifacts on the same Task, or is it the latest Task holding a specific artifact?
Hi CloudySwallow27
This error occurs randomly during training (in other words training does successfully start).
What's the clearml-agent version you are using, and the clearml version?
Our server is deployed on a kube cluster. I'm not too clear on how the Helm charts etc. work.
The only thing that I can think of is that something is not right with the load balancer on the server, so maybe some requests coming from an instance on the cluster are blocked ...
Hmm, saying that out loud, that actually could be it! Try to add the following line to the end of the clearml.conf on the machine running the agent:
api.http.default_method: "put"
Thanks TroubledHedgehog16 for the context.
sdk.development.worker.report_period_sec
Yes please update to the latest version 1.8.0 for full support (to be released today, I think)
https://github.com/allegroai/clearml/blob/f6238b8a0fb662540bca9095cc0c22bd7af483c1/docs/clearml.conf#L196
https://github.com/allegroai/clearml/blob/f6238b8a0fb662540bca9095cc0c22bd7af483c1/docs/clearml.conf#L199
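For reference, the setting those links point at lives under the sdk section of clearml.conf; roughly along these lines, with 30 used only as an example value:

sdk {
    development {
        worker {
            # how often (in seconds) resource and console reports are sent to the server
            report_period_sec: 30
        }
    }
}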
we have been running agents on 3 on-premise systems.
Do notice that by default an...
EnviousStarfish54 Yes, I'm not sure what happens there, we will have to dive deeper, but now that you got us a code snippet to reproduce the issue it should not be very complicated to fix (I hope 🤞)
Although it's still really weird how it was failing silently
totally agree, I think the main issue was the agent had the correct configuration, but the container / env the agent was spinning up was missing it,
I'll double check how come it did not print anything
Just to clarify, where do I run the second command?
Anywhere, just open a python console and import the offline task:
from trains import Task
Task.import_offline_session('./my_task_aaa.zip')
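For context, the full offline round trip might look roughly like this, assuming your trains version already has offline mode (the zip filename is taken from the example above; project/task names are placeholders):

from trains import Task

# on the machine without connectivity to the server
Task.set_offline(True)
task = Task.init(project_name="examples", task_name="offline run")
# ... run the experiment; on completion a session zip is written locally ...

# later, on a machine that can reach the server
Task.import_offline_session('./my_task_aaa.zip')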
Related, how to I specify in my code the cache_dir where the zip is saved?
This is the Trains cache folder, you can set it in the trains.conf file:
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/docs/trains.conf#L24
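The entry the link points at looks roughly like the snippet below; treat it as a sketch and verify the exact key against your own trains.conf:

sdk {
    storage {
        cache {
            # local folder where downloaded artifacts / files are cached
            default_base_dir: "~/.trains/cache"
        }
    }
}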
Hi FlutteringMouse14
In the latest project I created, Hydra conf is not logged automatically.
Any chance the Task.init call is not on the main script (where the Hydra is) ?
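In other words, the Task.init call should live in the same script as the Hydra entry point, along these lines (the config path/name and project/task names are placeholders):

import hydra
from omegaconf import DictConfig
from clearml import Task

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Task.init runs in the main script, next to the hydra entry point,
    # so the Hydra configuration can be picked up automatically
    task = Task.init(project_name="examples", task_name="hydra run")
    print(cfg)

if __name__ == "__main__":
    main()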
Hi GrittyCormorant73
When I archive the pipeline and go into the archive and delete the pipeline, the artifacts are not deleted.
Which clearml-server version are you using? The artifact delete was only recently added
Hi RoundSeahorse20
Try the following, let me know if it worked:
import logging
clear_logger = logging.getLogger('clearml.metrics')
clear_logger.setLevel(logging.ERROR)
I look forward to your response on Github.
Great, I would like to make this discussion a bit more open and accessible so GitHub is probably better
I'd like to start contributing to the project...
That will be awesome!
Awesome ! thank you so much!
1.0.2 will be out in an hour
PompousParrot44 That should be very easy to do, basically a service mode code that clones a base task and puts it into a queue:
This should more or less do what you need :)
from trains import Task
task = Task.init('devops', 'daily train', task_type='controller')
# stop the local execution of this code, and put it into the service queue, so we have a remote machine running it.
task.execute_remotely(queue_name='services')
while True:
    a_task = Task.clone(source_task='aaabb111')
    Task.enqueu...
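A complete minimal version of that loop could look roughly like this; the task ID, queue names and the sleep interval are placeholders, and the exact Task.clone / Task.enqueue signatures are worth verifying against your trains/clearml version:

import time
from trains import Task

task = Task.init('devops', 'daily train', task_type='controller')
task.execute_remotely(queue_name='services')

while True:
    # clone the template experiment and push the copy into an execution queue
    a_task = Task.clone(source_task='aaabb111', name='daily training run')
    Task.enqueue(a_task, queue_name='default')
    time.sleep(60 * 60 * 24)  # launch one copy per day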
My only point is, if we have no force_git_ssh_port or force_git_ssh_user, we should not touch the SSH link (i.e. less chance of us messing with the original URL if no one asked us to)