It seems that once the job has been completed once, it doesn't accept any new report...
We are at Server 3.18.2-1126 and PyPI version 1.12.2.
It worked! I added this call shortly after Task.init:
`tf.summary.create_file_writer("C:/mypath/logs")`
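For context, a minimal sketch of the ordering being described (the project/task names and log path are placeholders, not from the original setup):

```python
import tensorflow as tf
from clearml import Task

# Initialize the ClearML task first so its TensorBoard/tf.summary bindings are in place.
task = Task.init(project_name="example-project", task_name="opennmt-training")

# Then create the summary file writer shortly after Task.init;
# events written through it are what get picked up downstream.
writer = tf.summary.create_file_writer("C:/mypath/logs")
```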
So, according to the article (and the code as far as I could tell), OpenNMT-tf automatically enables TensorBoard. That is, it auto-logs the relevant features through tf.summary ( https://www.tensorflow.org/api_docs/python/tf/summary ). This is output on the cmd line with the likes of:
```
INFO:tensorflow:Evaluation result for step 9000: loss = 1.190986 ; perplexity = 3.290324 ; bleu = 63.569644
INFO:tensorflow:Step = 9100 ; steps/s = 2.17, source words/s = 28293, target words/s = 39388 ; Lea...
```
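For illustration, a hedged sketch of how metrics like these get written as tf.summary events (this is not OpenNMT-tf's actual code; the values and metric names just mirror the log above):

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("C:/mypath/logs")

# Write evaluation metrics as scalar summaries; this is the kind of
# tf.summary traffic that TensorBoard (and TB auto-logging hooks) read.
with writer.as_default():
    tf.summary.scalar("loss", 1.190986, step=9000)
    tf.summary.scalar("perplexity", 3.290324, step=9000)
    tf.summary.scalar("bleu", 63.569644, step=9000)
writer.flush()
```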
In TensorFlow's __init__.py, TensorBoard appears to be initialized (including tf.summary):
```python
# Hook external TensorFlow modules.

# Import compat before trying to import summary from tensorboard, so that
# reexport_tf_summary can get compat from sys.modules. Only needed if using
# lazy loading.
_current_module.compat.v2  # pylint: disable=pointless-statement
try:
  from tensorboard.summary._tf import summary
  _current_module.__path__ = (
      [_module_util.get_parent_dir(summary)] + _current_m...
```
That is great to hear! Is there any documentation on how it works, and if it can be configured?
@<1523701205467926528:profile|AgitatedDove14> - for some reason none of those solutions are working. I am forcing "mark_started" - but it doesn't register. Models don't have the report_* endpoints and even trying with the artifact - once the job finishes, the artifact will no longer update.
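For reference, a sketch of the kind of calls being attempted here (the task ID, titles, and artifact object are placeholders; this assumes the standard ClearML SDK):

```python
from clearml import Task

# Fetch the already-completed task and try to force it back into a started state.
task = Task.get_task(task_id="<task-id>")
task.mark_started(force=True)

# Attempted follow-up reporting; per the message above, neither the scalar
# nor the artifact update registers once the job has already finished.
task.get_logger().report_scalar(
    title="post-hoc", series="metric", value=0.95, iteration=0
)
task.upload_artifact(name="extra-report", artifact_object={"metric": 0.95})
```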
The only place I see subprocess being called in opennmt is to determine the batch size, but not for the primary training task.
Thanks! I'll look forward to it. If there is a draft that can give a general scope and function to the integration, that would also be helpful for this phase of our project.
@<1523701205467926528:profile|AgitatedDove14> - after the model_trainer.train step it is marked as complete. This is done using our own repo - None . The extra reporting steps are not added here (I am working on that locally), but it is still marking the job as complete.
But with poetry running in a docker container, the cache is not persistent. Should I map the poetry cache volume to a location on the host?
No, TB (TensorBoard) is not enabled. I just googled it and found this: https://forum.opennmt.net/t/running-tensorboard/4242 . I will try enabling TB and see if that fixes it.
At least I made S3 enough like Google Drive to make our team happy.
Move from gdrive to S3 bucket storage. This mounts the S3 bucket storage as a network drive. ClearML does not work easily with Google Drive.
If ClearML does not implement this, we may have to implement it ourselves - None .
Let's see if I understand:
- Triton server deployments only support manual, static deployment of models for inference (without enterprise)
- ClearML can load and unload models based upon usage, but has to do so from the hard drive
- Triton server does not support offloading models to normal (system) RAM for faster loading/unloading
- Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few sec...