Test it on your local setup (I would hate to push a broken fix)
Is that possible?
Our datasets are more than 1TB in size and will keep growing (probably to 4TB and up), which means we would also need 4TB of local storage
Yes, because you will have to store your unzipped files somewhere.
Or you point to the S3 bucket and fetch the data when you need to access it (or prefetch it), using the S3 links the Dataset stores, i.e. only when accessed
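A minimal sketch of that flow (bucket path and names here are hypothetical): register the files as external links so only the link metadata is stored, and fetch on access:

from clearml import Dataset

# register S3 objects as external links - the files themselves are not uploaded
ds = Dataset.create(dataset_name="raw_data", dataset_project="examples")
ds.add_external_files(source_url="s3://my-bucket/raw/")  # hypothetical bucket
ds.upload()  # uploads only the link metadata
ds.finalize()

# later, fetch (or prefetch) the actual files only when you access them
local_path = Dataset.get(dataset_name="raw_data", dataset_project="examples").get_local_copy()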
SoggyBeetle95 the question is, where does clearml store these arguments, and the answer is on the Task object (from there the agent will take them and apply them to the docker execution). Now since all users see all the tasks, they also see these arguments. Wdyt?
BTW, how can I run 'execute_orchestrator' concurrently?
It is launching simultaneously (i.e. if you are not processing the output of the pipeline step function, the execution will not wait for its completion; notice that the call itself might take a few seconds, as it creates a Task and enqueues/sub-processes it, but it is Not waiting for it)
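For example, a minimal sketch with decorator-based steps (all names here are made up); the two step calls launch concurrently because their outputs are not consumed until the end:

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def heavy_step(x):
    return x * 2

@PipelineDecorator.pipeline(name="concurrent demo", project="examples", version="1.0")
def pipe():
    # both calls return almost immediately and run in parallel,
    # because we are not touching their outputs yet
    a = heavy_step(1)
    b = heavy_step(2)
    # accessing the outputs is what actually waits for completion
    print(a, b)

if __name__ == "__main__":
    pipe()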
Hi @<1784754456546512896:profile|ConfusedSealion46>
the clearml server takes up a lot of memory, especially Elasticsearch
Yeah that depends on how many metrics/logs you have there, but you really have to have at least 8GB RAM
delete old experiments?
Sure thing, feel free to ping 🙂
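If it helps, a rough sketch of deleting old experiments programmatically (project name and filter values are placeholders):

from clearml import Task

# fetch completed tasks from a project and delete them to free storage
old_tasks = Task.get_tasks(
    project_name="examples",
    task_filter={"status": ["completed"]},
)
for t in old_tasks:
    t.delete(delete_artifacts_and_models=True)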
Think multiple hyper-parameter sections that we need to reference
(under the Task's Configuration tab, the Hyper Parameters can have multiple sections)
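For example (a minimal sketch; the section names are arbitrary), each connect() call with a different name creates its own section:

from clearml import Task

task = Task.init(project_name="examples", task_name="multi section demo")
general = {"epochs": 10, "lr": 0.01}
augmentation = {"flip": True, "rotate": 15}
# each named connect() shows up as a separate section under Configuration
task.connect(general, name="General")
task.connect(augmentation, name="Augmentation")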
No, after. Do you see the poetry.lock removed in the uncommitted changes?
NastySeahorse61 I would try to open it in incognito mode (i.e. no cookies etc.). Did you also change the address of the server?
So basically development on a "shared" GPU?
but I cannot compare between them
I think we noticed it, and this will be fixed in the next server update (again, some plotly.js issue there)
CrookedWalrus33 any chance you can think of a sample code to reproduce?
Here, this new entry in the log is 2 min after the env setup completed =>
1702378941039 box132 DEBUG 2023-12-12 11:02:16,112 - clearml.model - INFO - Selected model id: 9be79667ca644d7dbdf26732345f5415
This seems to be something in your code; just add print("starting") in your entry python file, before any imports (because they might actually do something)
Because from the agent's perspective, after printing Starting Task Execution: it literally calls the python script, nothing else...
In order to clone the Task, it needs to complete syncing, which implies closing. I guess the use case of executing remotely while still running was not considered. How / why is this your workflow? Specifically, how does Jupyter get into the picture?
For future readers, see discussion here:
https://clearml.slack.com/archives/CTK20V944/p1629840257158900?thread_ts=1629091260.446400&cid=CTK20V944
Hi ScaryBluewhale66
The TaskScheduler I created: the status is still running. Any idea?
The TaskScheduler needs to actually run in order to trigger the jobs (think cron daemon)
Usually it will be executed on the clearml-agent services queue/machine.
Make sense?
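A minimal sketch of what that looks like (the task id and queue names are placeholders):

from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# re-launch the given task every day at 10:30 on the "default" queue
scheduler.add_task(
    schedule_task_id="<task-id>",
    queue="default",
    minute=30,
    hour=10,
)
# the scheduler itself must keep running, e.g. on the services queue
scheduler.start_remotely(queue="services")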
using only a subset of the features
ShallowGoldfish8 if you have some parameter that controls it (i.e. selects different features) then you can launch it with two sets of parameters.
Am I missing something?
for example:
my_features_select = {"type": "set_a"}
Task.current_task().connect(my_features_select)
if my_features_select["type"] == "set_a":
    ...  # do something
else:
    ...  # do something else
wdyt?
Verified, and already fixed with 1.0.6rc2
I don't want a new task every 5 minutes as that will create a lot of tasks over a day. It would be better if I had just one task.
Oh you mean the Task that will be launched will override the previous "instance", correct?
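If that is the goal, a sketch assuming the reuse behavior of Task.init covers it (names are placeholders):

from clearml import Task

# reuse_last_task_id=True reuses the previous Task (unless it was
# published or archived) instead of creating a new one on every run
task = Task.init(
    project_name="examples",
    task_name="periodic job",
    reuse_last_task_id=True,
)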
https://stackoverflow.com/questions/60860121/plotly-how-to-make-an-annotated-confusion-matrix-using-a-heatmap
MagnificentSeaurchin79 see plotly example here:
https://allegro.ai/clearml/docs/docs/examples/reporting/plotly_reporting.html
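Putting the two together, a minimal sketch (matrix values are made up): build the annotated heatmap with plotly and report it to ClearML:

import plotly.figure_factory as ff
from clearml import Task

task = Task.init(project_name="examples", task_name="confusion matrix demo")
fig = ff.create_annotated_heatmap(
    z=[[50, 3], [5, 42]],
    x=["pred 0", "pred 1"],
    y=["true 0", "true 1"],
)
task.get_logger().report_plotly(
    title="Confusion matrix", series="validation", iteration=0, figure=fig
)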
JitteryCoyote63
are the calls from the agents made asynchronously/in a non-blocking separate thread?
You mean like request processing on the apiserver are multi-threaded / multi-processed ?
Correct, but do notice that (1) task names are not unique, and you can change them after the Task was executed; (2) when you clone the Task, you can actually rename it. When an agent is running the Task, the init function is basically ignored, because the Task already exists. Make sense?
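For example, a sketch of cloning and renaming (project/task names are placeholders):

from clearml import Task

# names are not unique - get_task returns the most recently updated match
source = Task.get_task(project_name="examples", task_name="train")
cloned = Task.clone(source_task=source, name="train - tuned copy")
Task.enqueue(cloned, queue_name="default")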
As we can't create keys in our AWS due to infosec requirements
Hmmm
Any chance you have a toy pytest that replicates it?
it will constantly try to resend logs
Notice this happens in the background, in theory you will just get stderr messages when it fails to send but the training should continue
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log, this is what conda is installing; it should have worked:
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
-...