Maybe something similar to docker's?
I like this approach; maybe we could add --name as well, so it is easier to name them.
trains-agent daemon stop --gpus all
trains-agent daemon stop --cpu-only
trains-agent daemon stop --gpus 0
What do you think?
Also being able to separate their configuration files would be good (maybe there is a way and I don't know?)
This is already supported: --config-file, see trains-agent --help for details 🙂
Oh that makes sense.
So now you can just get the models as a dict as well (basically clearml allows you to access them both as a list, so it is easy to get the last created one, and as a dict, so you can match the filenames)
This one will get the list of models:
print(task.models["output"].keys())
Now you can just pick the best one:
model = task.models["output"]["epoch13-..."]
my_model_file = model.get_local_copy()
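Putting it together, a minimal sketch assuming a finished training Task (the task_id is a placeholder):

from clearml import Task

# fetch an existing Task; "<your_task_id>" is a placeholder
task = Task.get_task(task_id="<your_task_id>")

# dict-style access: keys are the stored model filenames
print(task.models["output"].keys())

# list-style access: the last entry is the most recently created model
last_model = task.models["output"][-1]

# download a local copy of the weights file
model_file = last_model.get_local_copy()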
Hi @<1523703472304689152:profile|UpsetTurkey67>
I circumvented the problem by putting timestamp in task name, but I don't think this is necessary.
Just pass reuse_last_task_id=False to Task.init; it will never try to reuse them 🙂
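For example, a minimal sketch (project/task names are placeholders):

from clearml import Task

# reuse_last_task_id=False forces a brand-new Task every run
task = Task.init(
    project_name="my_project",
    task_name="my_experiment",
    reuse_last_task_id=False,
)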
Maybe the configuration file changed?
The logic is: if the name and the project are the same, there are no artifacts/models, and the Task was created less than 72 hours ago, reuse the Task
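For illustration, a rough Python sketch of that rule (not ClearML's actual implementation; field names are made up):

from datetime import datetime, timedelta

def should_reuse(prev: dict, name: str, project: str, now: datetime) -> bool:
    # rough sketch of the reuse rule described above, not ClearML's actual code
    same_identity = prev["name"] == name and prev["project"] == project
    is_empty = not prev["artifacts"] and not prev["models"]
    is_recent = now - prev["created"] < timedelta(hours=72)
    return same_identity and is_empty and is_recent

# an empty Task with a matching name/project created an hour ago would be reused
prev = {"name": "exp", "project": "demo", "artifacts": [], "models": [],
        "created": datetime.now() - timedelta(hours=1)}
print(should_reuse(prev, "exp", "demo", datetime.now()))  # True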
Yes, the same will work with artifacts. Just pass the full URL as the artifact_object; it should register it as-is.
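For example, a sketch (project/artifact names and the URL are placeholders):

from clearml import Task

task = Task.init(project_name="demo", task_name="register-artifact")

# passing a full URL as the artifact_object should register the remote
# object as-is, without re-uploading it
task.upload_artifact(name="raw_data", artifact_object="s3://my-bucket/path/data.csv")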
DilapidatedDucks58 so is this more like a pipeline DAG that is built ?
I'm assuming this is more than just grouping ?
(by that I mean, accessing a Task's artifact does not necessarily point to a "connection", no? Is it a single Task everyone is accessing, or a "type" of Task?)
Is this process fixed, i.e. for a certain project we have a flow: (1) execute a Task of type A, then a Task of type B using the artifacts from Task A. This implies we might have multiple Tasks of types A/B but they are alw...
should I only do MongoDB?
No, you should do all 3 DBs: ELK, Mongo, Redis
In both cases, if I get the element from the list, I am not able to get when the task started. Where is this info stored?
If you are using client.tasks.get_all(...), it should be under the started field
Specifically you can probably also do:
queried_tasks = Task.query_tasks(additional_return_fields=['started'])
print(queried_tasks[0]['id'], queried_tasks[0]['started'])
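If you are going through the APIClient route mentioned above, a rough sketch (assuming only_fields can trim the response to just what you need):

from clearml.backend_api.session.client import APIClient

client = APIClient()
# ask the server for only the id and started fields per task
tasks = client.tasks.get_all(only_fields=["id", "started"])
for t in tasks:
    print(t.id, t.started)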
os.environ['CLEARML_PROC_MASTER_ID'] = ''
Nice catch! (I'm assuming you also called Task.init somewhere before, otherwise I do not think this was necessary)
I think I solved it by deleting the project and running the base_task once before the hyperparameter optimization
So is it working now? Is everything there?
I'm running hyperparameter optimization on an LSF cluster where every task is an LSF job running without clearml-agent
WOW this is so cool! 🎊
Hi BitterStarfish58
What's the clearml version you are using?
Dataset uploads both work fine
Artifacts / Datasets are uploaded correctly?
Can you test if it works if you change " http://files.community.clear.ml " to " http://files.clear.ml "?
UnevenDolphin73 something like this one?
https://github.com/allegroai/clearml/pull/225
When I look at the model artifact details in the ClearML UI, it's been saved the usual way, and none of the tags I added in the OutputModel constructor are there.
Did you disable the autologging? Are you saying the tags not appearing is a bug (it might be)?
Also, I don't mind auto-logging either, as long as I have control over publishing the model or not directly from that script, and adding tags etc., like with OutputModel.
Sure, you can publish models / add tags etc., either from the UI or pr...
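To illustrate the programmatic side, a sketch (project/model names are placeholders; this assumes tagging and publishing via OutputModel behave as described):

from clearml import Task, OutputModel

task = Task.init(project_name="demo", task_name="tagged-model")

# attach an output model with tags; auto-logged weights get bound to it
output_model = OutputModel(task=task, name="my_model", tags=["baseline", "v1"])

# ... training / model saving happens here ...

# publish programmatically instead of from the UI
output_model.publish()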
Thanks PompousBeetle71
Quick question, what frameworks are you using?
Do you use the save method directly with a file stream (or any other direct storage)?
Also, is there a way to reproduce this issue of not capturing the model?
PompousBeetle71, the reason I'm asking is that the warning you see is due to the fact it cannot detect the filename you are saving your model to... I'm trying to figure out how that actually happened.
BTW: in the next version we will probably remove this warning altogether, but I'm still curious on how to reproduce 🙂
Hi PompousBeetle71
Could you test the latest RC? I think the warnings were fixed:
pip install trains==0.16.2rc0
Let me know...
BoredHedgehog47 can you test this one? Is it close to your code?
I basically moved the Task.init() call below the imports
Okay, that is odd. Can you copy paste the before/after of the import, so we can fix that?!
I think the crux of the issue is the subprocess calls I removed.
That kind of makes sense, though if the subprocess function also had a Task.init call it should have worked.
Would that be the setup to try to replicate?
BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...
Thanks BoredHedgehog47 !
And yes, if the Task.init() call was only in main.py, then the TB inside the subprocess (train.py) would, as you perceived, not be captured.
Did you by any chance test calling Task.init in both main.py and train.py?
So in summary: subprocess calls appear to break ClearML tracking, even if I do Task.init() in both main.py and train.py.
Okay let me see if we can reproduce & fix this, it should not be long
BoredHedgehog47 you need to make sure "<path here>/train.py" also calls Task.init (again no need to worry about calling it twice with different project/name)
The Task.init call will make sure the auto-connect works.
BTW: if you do os.fork, then there is no need for the Task.init; the main difference is that Popen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. calling Task.init)
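For example, a minimal sketch of that pattern (project/file names are placeholders):

# main.py
import subprocess
import sys

from clearml import Task

task = Task.init(project_name="demo", task_name="parent")

# Popen starts a brand-new process, so train.py must call Task.init itself
# for its TensorBoard / console output to be captured
subprocess.Popen([sys.executable, "train.py"]).wait()

# train.py
from clearml import Task

# calling Task.init again in the child is safe; it attaches to the same Task
task = Task.init(project_name="demo", task_name="parent")
# ... training code with TensorBoard logging ...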
Maybe before everything else, can you share some background on the rationale for starting a new subprocess?
What I'm trying to do is give the DSes a lightweight base class, independent of clearml, that they use, while a framework has all the clearml-specific code. This will allow them to experiment outside of clearml and only switch to it when they are in an OK state. This will also help not to pollute clearml spaces with half-baked ideas.
So you want the DS to manually tell the base class what to store?
then the base class will store it for them, for example with joblib, is this the...
but here I can tell them: return a dictionary of what you want to save
If this is the case you have two options: either store the dict as an artifact (this makes sense if it is not a standalone model you would like to use later), or store it as a model. A short sketch follows the links below.
Artifact example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
getting them back
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts_retrieval.py
Model example:
https:/...
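As referenced above, a short sketch of the artifact flow (project/artifact names and values are placeholders):

from clearml import Task

task = Task.init(project_name="demo", task_name="store-dict")

# the DS code returns a dict of what to save; the base class registers it
results = {"accuracy": 0.91, "loss": 0.23}
task.upload_artifact(name="results", artifact_object=results)

# getting it back later, from another script or Task
prev = Task.get_task(project_name="demo", task_name="store-dict")
restored = prev.artifacts["results"].get()
print(restored)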