Thanks! I'll look forward to it. If there is a draft that gives a general scope and function for the integration, that would also be helpful for this phase of our project.
It worked! I added this call shortly after Task.init:
`tf.summary.create_file_writer("C:/mypath/logs")`
The only place I see subprocess being called in OpenNMT is to determine the batch size, not for the primary training task.
But with Poetry running in a Docker container, the cache is not persistent. Should I map the Poetry cache volume to a location on the host?
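For what it's worth, one way to persist the cache is to bind-mount Poetry's cache directory to the host. A sketch (assuming the default cache path `/root/.cache/pypoetry` inside the container; the service name, image name, and host path are made up, adjust to your setup):

```yaml
# docker-compose.yml (fragment) -- paths and names are assumptions
services:
  trainer:
    image: my-training-image        # hypothetical image
    volumes:
      # Persist Poetry's package cache across container restarts
      - ./poetry-cache:/root/.cache/pypoetry
```

You can confirm the actual cache path inside the container with `poetry config cache-dir`.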
At least I made S3 enough like Google Drive to make our team happy.
Move from Google Drive to S3 bucket storage. This mounts the S3 bucket as a network drive. ClearML does not work easily with Google Drive.
In TensorFlow's `__init__.py`, TensorBoard appears to be initialized (including `tf.summary`):

```python
# Hook external TensorFlow modules.
# Import compat before trying to import summary from tensorboard, so that
# reexport_tf_summary can get compat from sys.modules. Only needed if using
# lazy loading.
_current_module.compat.v2  # pylint: disable=pointless-statement
try:
  from tensorboard.summary._tf import summary
  _current_module.__path__ = (
      [_module_util.get_parent_dir(summary)] + _current_m...
```
Let's see if I understand:
- Triton server deployments only have manual, static deployment of models for inferencing (without enterprise)
- ClearML can load and unload models based upon usage, but has to do so from the hard drive
- Triton server does not support saving models off to normal RAM for faster loading/unloading
- Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few sec...
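The load/unload behavior described above can be sketched as a usage-based (LRU) cache. This is only an illustration of the pattern, not ClearML's actual implementation; the class, capacity, and load/unload hooks are all hypothetical:

```python
from collections import OrderedDict

class ModelCache:
    """Toy sketch of usage-based model load/unload, assuming at most
    `capacity` models fit in GPU memory at once. Names and hooks are
    hypothetical, not the ClearML or Triton API."""

    def __init__(self, capacity, load_fn, unload_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # e.g. reads weights from the hard drive (slow)
        self.unload_fn = unload_fn    # e.g. frees GPU memory
        self._loaded = OrderedDict()  # model_id -> model, in LRU order

    def get(self, model_id):
        if model_id in self._loaded:
            self._loaded.move_to_end(model_id)  # mark as recently used
            return self._loaded[model_id]
        if len(self._loaded) >= self.capacity:
            # Evict the least recently used model to make room
            evicted_id, evicted = self._loaded.popitem(last=False)
            self.unload_fn(evicted_id, evicted)
        model = self.load_fn(model_id)  # this is the few-second load cost
        self._loaded[model_id] = model
        return model
```

With capacity 5 and 100 registered models, any request outside the 5 currently loaded pays the disk-load latency, which matches the behavior described above.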
That is great to hear! Is there any documentation on how it works, and if it can be configured?
@<1523701205467926528:profile|AgitatedDove14> - after the model_trainer.train step, the task is marked as complete. This is done using our own repo - None . The extra reporting steps are not added here (I am working on that locally), but it is still marking the job complete.
If ClearML does not implement this, we may have to ourselves - None .
It seems that once the job has completed once, it doesn't accept any new report...
We are on server 3.18.2-1126 and PyPI version 1.12.2.
So, according to the article (and the code, as far as I could tell), OpenNMT-tf automatically enables TensorBoard. That is, it auto-logs the relevant features through tf.summary ( https://www.tensorflow.org/api_docs/python/tf/summary ). This is output on the command line with the likes of:
```
INFO:tensorflow:Evaluation result for step 9000: loss = 1.190986 ; perplexity = 3.290324 ; bleu = 63.569644
INFO:tensorflow:Step = 9100 ; steps/s = 2.17, source words/s = 28293, target words/s = 39388 ; Lea...
```
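As a stopgap, evaluation lines like these could also be parsed out of the logs and reported manually. A minimal sketch (the helper name and regex are my own, not part of OpenNMT-tf or ClearML):

```python
import re

# Matches e.g. "Evaluation result for step 9000: loss = 1.190986 ; bleu = 63.569644"
EVAL_RE = re.compile(r"Evaluation result for step (\d+): (.+)")

def parse_eval_line(line):
    """Parse an OpenNMT-tf evaluation log line into (step, {metric: value}).

    Returns None if the line is not an evaluation-result line.
    """
    m = EVAL_RE.search(line)
    if not m:
        return None
    step = int(m.group(1))
    metrics = {}
    for part in m.group(2).split(";"):
        name, _, value = part.partition("=")
        metrics[name.strip()] = float(value.strip())
    return step, metrics

step, metrics = parse_eval_line(
    "INFO:tensorflow:Evaluation result for step 9000: loss = 1.190986 ; "
    "perplexity = 3.290324 ; bleu = 63.569644"
)
print(step, metrics["bleu"])  # 9000 63.569644
```

Each `(step, metrics)` pair could then be fed to whatever scalar-reporting call your tracker exposes.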
@<1523701205467926528:profile|AgitatedDove14> - for some reason none of those solutions are working. I am forcing "mark_started", but it doesn't register. Models don't have the report_* endpoints, and even when trying with an artifact, once the job finishes the artifact will no longer update.
No, TB (TensorBoard) is not enabled. I just googled it and found this: https://forum.opennmt.net/t/running-tensorboard/4242 . I will try enabling TB and see if that fixes it.