Okay this seems correct...
Can you share both yaml files (server & serving) and env file?
We do upload the final model manually.
Wait, you said "upload manually", and now you are saying "saved automatically", I'm confused.
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
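For reference, that section of `clearml.conf` looks roughly like this (key names per the linked file, values are placeholders):
```
agent {
    # Set GIT user/pass credentials; if set, the agent clones over https
    # (leave blank to use the machine's GIT SSH credentials instead)
    git_user: ""
    git_pass: ""
}
```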
HandsomeCrow5 check the latest RC, I just ran the same code and it worked 🙂
Oh sorry, from the docstring, this will work:
```
:param bool continue_last_task: Continue the execution of a previously executed Task (experiment)
    .. note::
        When continuing the execution of a previously executed Task,
        all previous artifacts / models / logs are intact.
        New logs will continue iteration/step based on the previous-execution maximum iteration value.
        For example:
        The last train/loss scalar reported was iteration 100, the next report will be iteration 101.
```
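In practice that means something like this (project/task names here are just placeholders):
```python
from clearml import Task

# Continue the previous execution of this task; artifacts/models/logs stay intact
# and new scalar reports pick up after the last reported iteration.
task = Task.init(
    project_name="examples",
    task_name="my_experiment",
    continue_last_task=True,
)
```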
UnevenDolphin73 you mean the clearml-server helm chart?
The default cleanup service should work with S3 with a correctly configured clearml service agent if I understand the workings correctly.
Yes I think you are correct
I am referring to the UI.
In that case, no 😞. This is actually a backend server change (from the UI side it should be relatively simple). Is this somehow a showstopper?
Hi GrittyCormorant73
In the end everything goes through session.send, you could add a print there.
btw: why would you print all the requests? What are we debugging here?
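If you do want to dump every request anyway, a rough monkey-patch sketch (untested, assuming `Session.send` is the single entry point mentioned above):
```python
from clearml.backend_api import Session

_orig_send = Session.send

def _send_and_print(self, req, *args, **kwargs):
    # Print the request type before forwarding it to the original send
    print("clearml request:", type(req).__name__)
    return _orig_send(self, req, *args, **kwargs)

Session.send = _send_and_print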
- Could you explain how I can reproduce the missing jupyter notebook (i.e. the ipykernel_launcher.py)
ChubbyLouse32 and this works when running the Python code directly, but not when the agent is running it?
On the same machine?
BroadMole98
I'm still exploring what trains is for.
I guess you can think of Trains as Experiment manager + MLOps tied together.
The idea is to give a quick and easy way to move from coding/running on one machine to scaling it to multiple remote machines, with everything that comes with it.
In some ways it is like snakemake: it sets up your environment and executes the code. Snakemake also allows you to set up data, which in Trains is done via code (StorageManager, see the sketch below); pipelines are also...
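A minimal StorageManager sketch (the bucket URL is a placeholder):
```python
from clearml import StorageManager

# Fetch a remote object and cache it locally; returns the local file path
local_path = StorageManager.get_local_copy(
    remote_url="s3://my-bucket/datasets/dataset.zip"
)
print(local_path)
```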
Hi ContemplativeCockroach39
Seems like you are running the exact code as in the git repo:
Basically it points you to the exact repository https://github.com/allegroai/clearml and the script examples/reporting/pandas_reporting.py
Specifically:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/reporting/pandas_reporting.py
That said, you might have accessed the artifacts before any of them were registered
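If it helps, a minimal consumer-side sketch (the task id and artifact name are placeholders; make sure the producing task finished uploading first):
```python
from clearml import Task

# Fetch the producing task and read one of its registered artifacts
task = Task.get_task(task_id="<task_id_here>")
df = task.artifacts["my_dataframe"].get()  # returns the stored object, e.g. a DataFrame
```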
My question was about the automatically uploaded models. Those that were uploaded by clearml client.
So there is a way to add a callback, would that work?
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/binding/frameworks/__init__.py#L137
```python
def callback(_, model_info):
    model_info.name = "my new name"
    return model_info
```
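If I remember correctly, you register it with `WeightsFileHandler` from that same module (double-check against the linked line):
```python
from clearml.binding.frameworks import WeightsFileHandler

# Called before a model is stored; must return the (possibly modified) model_info
WeightsFileHandler.add_pre_callback(callback)
```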
ResponsiveCamel97
could you attach the full log?
Hi SharpDove45
what was suggested about how it fails on bad/missing credentials
Yes, this is correct; since you specifically set the hosts, worst case you will end up with wrong credentials 🙂
Okay this is very close to what the agent is building: could you start a new conda env, then install cudatoolkit=11.1, then run:
```
conda env update -p <conda_env_path_here> --file the_env_yaml.yml
```
(fyi: once we have a solid idea here, please open a github issue on the feature request, I'll try to see if we can push it fwd for the next RC 🙂 )
Hi CluelessElephant89
Hi guys, if I spot an issue with the documentation, where should I post it?
The best way from our perspective is a PR with the fix 🙂 this is why we put it on GitHub
It runs into the above error when I clone the task or reset it.
from here:
`AssertionError: ERROR: --resume checkpoint does not exist`
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change; in other words, why would it be looking for the file only when cloning? Could it be you put the state on the Task, then cloned it (i.e. cloned the exact same dict), and now the newly cloned Task "thinks" it is resuming?!
Verified, and already fixed with 1.0.6rc2
My bad, I wrote `refresh` and then edited it to the correct `reload` 😞
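i.e. something like this, assuming you already hold a task object:
```python
# Pull the latest state of the task from the backend
task.reload()
```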
I'd prefer to use config_dict, I think it's cleaner
I'm definitely with you
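Something along these lines (values are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")

config_dict = {"lr": 0.001, "batch_size": 32}
# Connected values show up in the UI and can be overridden when cloning
config_dict = task.connect(config_dict)
```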
Good news: "when a new `best_model` is saved, add a tag `best`" is already supported (you just can't see the tag, but it is there 🙂).
My question is, what do you think would be the easiest interface to tell (post/pre) store, tag/mark this model as best so far (btw, obviously if we know it's not good, why do we bother to store it in the first place...)
Just making sure, the pip package is installed in your Conda env, correct?
however when I clone or reset said task after completion and then enqueue it again, I get the above error.
This part is somewhat confusing... There is no magic happening behind the scenes; cloning a Task and creating it are basically the same. Do you have a reference to the YOLOv5 code base itself? Maybe I can figure out what the issue is.