but I have no idea what's behind 1, 2 and 3 compared to the first execution
This is why I would suggest multiple experiments, since each run will store all its arguments (and I think these arguments are somehow being lost).
wdyt?
I think it would make sense to have one task per run, to make the comparison of hyper-parameters easier
I agree. Could you maybe open a GitHub issue for it? I want to make sure we solve it 🙂
Hmm MiniatureHawk42, how many files are in the zip?
Correct, you can pass it as keys on the "task_filter" argument, e.g. `Task.get_tasks(..., task_filter={'status': ['failed']})`
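For context, a minimal sketch of how that might be used (the project name is hypothetical, and actually running the query requires a configured ClearML server, so that call is left commented out):

```python
# Server-side filter passed to Task.get_tasks(); the 'status' key takes a
# list of task states (e.g. 'failed', 'completed', 'aborted').
task_filter = {'status': ['failed']}

# With a configured ClearML server this would return the matching tasks:
# from clearml import Task
# failed_tasks = Task.get_tasks(project_name='my_project', task_filter=task_filter)
```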
Hi JealousParrot68
I'll try to shed some light on these modules and use cases.
StorageManager is, generally speaking, low-level access to http/object-storage/file utilities. In most cases there is no need to use it directly if objects are already stored/managed on ClearML (for example artifacts/models/datasets). But it is quite handy to use with your own S3 buckets etc.
Artifacts: Passing an artifact between Tasks will usually be something like:
`artifact_object = Task.get_task('task_id').artifacts['name'].get()`
Hi JealousParrot68
This is the same as:
https://clearml.slack.com/archives/CTK20V944/p1627819701055200
and,
https://github.com/allegroai/clearml/issues/411
There is something odd happening in the files-server: it replaces the header (i.e. guessing the content of the stream), and this breaks the download (what happens is that the clients automatically ungzip the csv).
We are working on a hot fix for the issue (BTW: if you are using object-storage / shared folders, this will not happen)
Working on it as we speak 🙂 probably a day, worst case 2. This is quite strange and we are not sure where the fault is, as nothing in the code itself changed...
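A minimal stdlib sketch of why that header rewrite breaks the download (this just simulates the client's transparent ungzip, it does not touch the files-server):

```python
import gzip

# A .csv.gz artifact as stored: gzip-compressed CSV bytes.
csv_bytes = b"a,b\n1,2\n"
stored = gzip.compress(csv_bytes)

# If the server responds with a gzip Content-Encoding header, the HTTP
# client transparently ungzips the body once on download:
downloaded = gzip.decompress(stored)
assert downloaded == csv_bytes

# Code that still expects a .gz file then tries to ungzip a second time,
# which fails because the payload is already plain CSV:
try:
    gzip.decompress(downloaded)
    print("unexpected: second ungzip succeeded")
except gzip.BadGzipFile:
    print("second ungzip fails: not a gzipped stream")
```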
That should work π
BTW, you might play around with `clearml-agent execute --id <task_id_here>`
This will basically clone the code, create a venv with the python packages, apply uncommitted changes and run the actual code. This could be a replacement for your bash script. (Notice it means that you need to clone the Task in the UI; then you can change parameters, then run the agent manually in SLURM and it will take the params from the UI.)
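As a sketch of that flow (the task id is a placeholder, so this is not directly runnable as-is):

```
# 1. Clone the experiment in the UI and edit its parameters there
# 2. On the SLURM node (inside your job allocation), run:
clearml-agent execute --id <task_id_of_the_clone>
```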
Hi UnsightlyLion90
from my understanding the agent does the job of SLURM,
That is kind of correct (they overlap in some ways 🙂 )
Any guide of how to integrate both of them?
The easiest way is to just add the `Task.init()` call to your code, and use SLURM to schedule the job. This will make sure all jobs are fully logged (this can also include automatically uploading the models, artifact support, etc.)
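A minimal sketch of that integration (the project/task names here are made up, and it assumes clearml is installed and configured on the cluster nodes):

```python
# train.py -- submitted via SLURM as usual, e.g. sbatch --wrap "python train.py"
from clearml import Task

# The single extra line: every SLURM run of this script becomes a fully
# logged experiment (console output, git diff, installed packages, models).
task = Task.init(project_name='slurm-experiments', task_name='training-run')

# ... the rest of the training code stays unchanged ...
```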
Full SLURM support (i.e. similar to the k8s glue support), is currently ou...
I think so, when you are saying "clearml (bash script..." you basically mean, "put my code + packages + and run it" , correct ?
Hi DisgustedDove53
Now for the clearml-session tasks, a port-forward should be done each time if I need to access the Jupyter notebook UI for example.
So basically this is why the k8s glue has --ports-mode.
Essentially you set up a k8s service (handling the ingest of the TCP ports); then the template.yaml that is used by the k8s glue should specify said service. Then clearml-session knows how to access the actual pod via the parameters the k8s glue sets on the Task.
Make sense ?
Correct (with the port mapping service in it)
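To illustrate the ports-mode setup, such a service could look roughly like this (all names, labels and the port number here are assumptions for the sketch, not the glue's actual defaults):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: clearml-session-service   # hypothetical name
spec:
  type: NodePort
  selector:
    app: clearml-session-pod      # must match the labels set in the glue's template.yaml
  ports:
    - name: session-port
      port: 10022
      targetPort: 10022
```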
Hi JealousParrot68
You mean by artifact names ?
Let me check the API reference
https://clear.ml/docs/latest/docs/references/api/endpoints#post-tasksget_all
So not a straight query, but maybe:
https://clear.ml/docs/latest/docs/references/api/endpoints#post-tasksget_all_exall
section might do the trick.
SuccessfulKoala55 any chance you have an idea on what to pass there ?
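For reference, a raw call to that endpoint could look roughly like this (the host, auth scheme, and body fields are assumptions; check the API docs above for the exact parameters):

```
curl -X POST "https://<api-server>/tasks.get_all_ex" \
     -H "Content-Type: application/json" \
     -u "$CLEARML_API_ACCESS_KEY:$CLEARML_API_SECRET_KEY" \
     -d '{"status": ["failed"], "only_fields": ["id", "name"]}'
```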
The file itself is csv.gz compressed; it's actually the file-server sending it back that messes things up
(you can test with `output_uri='/tmp/folder'`)
Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.
Hmm this is odd, is this a download issue? if this is reproducible maybe we should investigate further...
Actually it doesn't matter (systemd and init.d are different ways to spin up services on different linux distros); you can pick whichever seems more convenient for you, and whichever is supported by the linux you are running (in most cases both are) 🙂
So good news: (1) Dashboard is being worked on as we speak. (2) We released clearml-serving doing exactly that; the next release of clearml-serving will include integration with kfserving (under the hood), essentially managing the serving endpoints on top of the k8s cluster. wdyt?
Let me know if I understand you correctly, the main goal is to control the model serving, and deploy to your K8s cluster, is that correct ?
Hi DisgustedDove53
When you say "deployment" there are a lot of ways to interpret that 🙂 what exactly are you looking for?
BTW: if you feel like pushing forward with integration I'll be more than happy to help PRing new capabilities, even before the "official" release
This part is odd: `SCRIPT PATH: tmp.7dSvBcyI7m`
How did you end up with this random filename? How are you running this code?
I'm using the default operation mode which uses kubectl run. Should I use templates and specify a service in there to be able to connect to the pods?
Ohh, the default "kubectl run" does not support the "ports-mode" 😞
There's a static number of pods which services are created for…
You got it! 🙂
SoreDragonfly16 could you test with `Task.init` using `reuse_last_task_id=False`?
For example: `task = Task.init('project', 'experiment', reuse_last_task_id=False)`
The only thing that I can think of is running two experiments with the same project/name on the same machine; this flag will ensure that every time you run the code, you create a new experiment.
SoreDragonfly16 the torchvision warning has nothing to do with the Trains warning.
The Trains warning means that somehow someone changed the state of the Task from running (in_progress) to "stopped" (aborted). Could it be that one of the subprocesses raised an exception?
Hi SoreDragonfly16
The warning you mention means that the state of the experiment was changed to aborted, which in turn will actually kill the process.
What do you mean by "If I disable the logger," ?