Maybe something similar to Docker's
I like this approach, maybe we could add --name as well, so it is easier to name them:
trains-agent daemon stop --gpus all
trains-agent daemon stop --cpu-only
trains-agent daemon stop --gpus 0
What do you think?
Also, being able to separate their configuration files would be good (maybe there is a way and I don't know?)
This is already supported: --config-file, see trains-agent --help for details 🙂
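For example (just a sketch, the config file names below are made up), each daemon could point at its own file:
trains-agent --config-file ~/trains_gpu0.conf daemon --gpus 0
trains-agent --config-file ~/trains_cpu.conf daemon --cpu-only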
hi @<1546303293918023680:profile|MiniatureRobin9>
I can still see the metrics in Grafana. I
It will not delete it from Grafana, it means it is no longer collected. Make sense?
Okay, that makes sense. If this is the case I would just use clearml-agent execute --id <task_id here> to continue the training Task.
Do notice you have to reload your last checkpoint from the Task's models/artifacts to continue 🙂
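Something along these lines should do it (a sketch; the task id and the torch.load line are placeholders):
from clearml import Task

prev_task = Task.get_task(task_id="<task_id here>")
last_ckpt = prev_task.models["output"][-1]   # last reported output model / checkpoint
ckpt_path = last_ckpt.get_local_copy()       # download it (or reuse the local cache)
# e.g. model.load_state_dict(torch.load(ckpt_path))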
Last question: what is the HPO optimization algorithm, is it just grid/random search or Optuna/BOHB? If it is the latter, how do you make it "continue"?
should reload the reported scalars
Exactly (notice it also understands when the last scalar report happened, so it will automatically continue increasing the iterations, i.e. you will not accidentally overwrite previously reported scalars).
and the task needs to reload last checkpoints only, right?
Correct 🙂
We didn't figure out the best way of continuing for both the grid and optuna. Can you suggest something?
That is a good point, not sure if we have a GH issue for that, but wo...
Hi UnevenDolphin73
Is there an easy way to add a link to one of the tasks panels? (as an artifact, configuration, info, etc)?
You can add a link as an artifact, that is probably the easiest:
task.upload_artifact(name="just link", artifact_object="<link URL>")
EDIT: And follow up regarding the dataset. As discussed somewhere previously, the datasets are now automatically moved to a hidden "sub-project" prefixed with .datasets. This creates several annoyances that I...
For now we've monkey-patched it to our usecase:
LOL, that's a cool hack
That gives us the benefit of creating "local datasets" (confined to the scope of the project, do not appear in the Datasets tab, but appear as normal tasks within the project)
So what would be a "perfect" solution here?
I think I'm missing the point on why it became an issue in the first place.
Notice that in new versions Datasets will be registered on the Tasks that use them (they are already...
Actually this should be a flag
Is there a way to document these non-standard entry points?
@<1541954607595393024:profile|BattyCrocodile47> you should see the "run" in the Args section under Configuration
In the case of HF you should see "-m huggingface" and then the rest in the Args section
(if this does not work, then I assume this is a bug 🙂 )
The idea is of course that you can always enqueue and reproduce, so if that part is broken we should fix it 😊
Do you have your Task.init call inside the "train.py" script? (and if you do, what are you getting in the Execution tab of the task?)
Why does ClearML hide the dataset task from the main WebUI?
Basically you have the details from the Dataset page, why should it be mixed with the others?
If I specified a project for the dataset, I specifically want it there, in that project, not hidden away in some .datasets hidden sub-project.
This may be a request for a "Dataset" tab under the project; why you would need the Dataset Task itself is the main question?
Not all dataset objects are equal, and perhap...
Hi RoughTiger69
A. Yes, makes total sense. Basically you can use Task.export / Task.import to achieve this (notice we assume the dataset artifact links are available on both, usually this is the case)
B. The easiest way would be to use Process: one subprocess exports from dev, where the credentials and configuration are passed via OS environment variables. Another subprocess imports it to the prod server (again with OS environment variables pointing to the prod server). Make sense?
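Roughly something like this (server URLs, credentials and the task id below are placeholders, not a tested recipe):
import os
import subprocess
import sys

EXPORT_CODE = (
    "import json, sys\n"
    "from clearml import Task\n"
    "json.dump(Task.get_task(task_id=sys.argv[1]).export_task(), sys.stdout, default=str)\n"
)
IMPORT_CODE = (
    "import json, sys\n"
    "from clearml import Task\n"
    "Task.import_task(json.load(sys.stdin))\n"
)

def run_with_server(code, server_env, arg=None, stdin=None):
    # each subprocess talks to a different ClearML server via its environment variables
    env = dict(os.environ, **server_env)
    cmd = [sys.executable, "-c", code] + ([arg] if arg else [])
    return subprocess.run(cmd, input=stdin, capture_output=True, text=True,
                          env=env, check=True).stdout

dev_env = {"CLEARML_API_HOST": "https://dev-server:8008",
           "CLEARML_API_ACCESS_KEY": "<dev key>",
           "CLEARML_API_SECRET_KEY": "<dev secret>"}
prod_env = {"CLEARML_API_HOST": "https://prod-server:8008",
            "CLEARML_API_ACCESS_KEY": "<prod key>",
            "CLEARML_API_SECRET_KEY": "<prod secret>"}

exported = run_with_server(EXPORT_CODE, dev_env, arg="<dev_task_id>")
run_with_server(IMPORT_CODE, prod_env, stdin=exported)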
HungryArcticwolf62 a transformer model is, at the end, a pytorch/tf model with pre/post processing.
The pytorch/tf model inference is done with Triton (probably the most efficient engine today), while clearml runs the pre/post processing on a different CPU machine (making sure we fully utilize all the HW). Does that answer the question?
Latest docs here:
https://github.com/allegroai/clearml-serving/tree/dev
expect a release after the weekend 😉
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
Good point!
We need to temporarily pause (kill or something else) the running HPO task and reassign the resource for other needs.
Oh I see now....
Later, when more important experiments have been completed, we can continue the HPO task from the same state.
Quick question: when you say the HPO Task, do you mean the HPO controller logic Task...
"warm" as you do not need to sync it with the dataset, every time you access the dataset, clearml
will make sure it is there in the cache, when you switch to a new dataset the new dataset will be cached. make sense?
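For reference, this is roughly what the cached access looks like (dataset project/name are just examples):
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
# returns a read-only cached copy; repeated calls reuse the local cache,
# and switching to another dataset just caches that one as well
local_folder = ds.get_local_copy()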
repeat it until they are all dead 🙂
I see... We could definitely add an argument to control it. I'll update here once there is an RC
ElegantCoyote26 what you are after is:
docker run -v ~/clearml.conf:/root/clearml.conf -p 9501:8085
Notice the internal port (i.e. inside the docker is 8080, but the external one is changed to 9501)
Thanks BroadSeaturtle49
I think I was able to locate the issue: != breaks the pytorch lookup.
I will make sure we fix it asap and release an RC.
BTW: how come 0.13.x has no linux x64 support? and the same for 0.12.x
https://download.pytorch.org/whl/cu111/torch_stable.html
SlipperyDove40
FYI: args = task.connect(args, name="Args")
is a "kind of" reserved section for argparse. Meaning you can always use it, but argparse will also push/pull things from there. Is there any specific reason for not using a different section name?
SlipperyDove40 following on the missing section name, this seems like a backwards compatibility issue. Try calling with backwards_compatibility=False:
my_params = task.get_parameters(backwards_compatibility=False)
This should always add the section name prefix.
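For example (a minimal sketch, project/task names are made up), this should show the "Args/" prefix on every key:
from argparse import ArgumentParser
from clearml import Task

parser = ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
args = parser.parse_args()

task = Task.init(project_name="examples", task_name="args sections")
task.connect(args, name="Args")

# with backwards_compatibility=False every key comes back as "<section>/<name>", e.g. "Args/lr"
print(task.get_parameters(backwards_compatibility=False))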
I think they (DevOps) said something about next week, internal roll-out is this week (I think)
So I assume, trains assumes I have nvidia-docker installed on the agent machine?
docker + nvidia-docker-runtime are assumed to be installed
The nvidia/cuda docker image is pulled when requested (like any other container image)
Moreover, since I'm going to use Task.execute_remotely (and not through the UI), is there any code way to specify the docker image to be used?
Sure:
task.set_base_docker(docker_cmd='nvidia/cuda -v /mnt:/tmp')
Notice that you can not only pass the dock...
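i.e. something along these lines (the queue name and image are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# container (and extra docker arguments) the agent will use when it picks up the task
task.set_base_docker(docker_cmd="nvidia/cuda -v /mnt:/tmp")
# everything after this call runs on the agent, not locally
task.execute_remotely(queue_name="default")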
Hi MortifiedDove27
Looks like there is a limit of 100 images per experiment,
The limit is 100 images per unique combination of title/series.
This means that changing the title or the series name will add 100 more images (notice the 100 limit is over previous iterations)
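To illustrate (a sketch, titles/series are made up): every title/series pair keeps its own limited history, so splitting reports across series keeps older images around.
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="debug images")
logger = task.get_logger()

for i in range(300):
    img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
    # each unique title/series combination has its own (limited) image history
    logger.report_image(title="samples", series="stream_%d" % (i % 3), iteration=i, image=img)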
LOL love that approach.
Basically here is what I'm thinking,
from clearml import Task, InputModel, OutputModel

task = Task.init(...)

# run this part once
if task.running_locally():
    my_auxiliary_stuff = OutputModel()
    my_auxiliary_stuff.system_tags = ["DATA"]
    my_auxiliary_stuff.update_weights_package(weights_path="/path/to/additional/files")
    input_my_auxiliary = InputModel(model_id=my_auxiliary_stuff.id)
    task.connect(input_my_auxiliary, "my_auxiliary")

task.execute_remotely()
my_a...
This seems to only work for a single file (weights_path implies a single file, not multiple ones). Is that the case?
See update_weights_package, it actually packages an entire folder as a zip and will do the extraction when you get it back (check the function docstring, I think you can also specify wildcards etc. if needed)
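i.e. roughly (paths and names are placeholders; get_local_copy extracts the zip back into a folder):
from clearml import Task, InputModel, OutputModel

task = Task.init(project_name="examples", task_name="aux files")
out_model = OutputModel()
# the whole folder is zipped and uploaded as a single weights package
out_model.update_weights_package(weights_path="/path/to/additional/files")

# later / elsewhere: download and extract it back into a local folder
in_model = InputModel(model_id=out_model.id)
local_folder = in_model.get_local_copy()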
Why do you see this as preferred to the dataset method we have now?
So it answers a few requirements that you raised
It is fully visible as part of the project and se...
SubstantialElk6 on the client side?