Hi FierceFly22
Hi, does anyone know where trains stores tensorboard data
TensorBoard data is stored wherever you point your file-writer to 🙂
What trains does is this: while TensorBoard writes its own data to disk, trains takes the data (in-flight) and sends it to the trains-server. The trains-server puts everything in the DB, so later everything is viewable & searchable.
Basically you don't need to store your TB files after your experiment is done, you have all the data in the trains-s...
Hi @<1529633468214939648:profile|CostlyElephant1>
Is it possible to get user ID of the current user
On the Task.data object itself there should be a field named "user"; that's the user ID of the owner (creator) of the Task.
You can filter based on this ID with
Task.get_tasks(..., task_filter={'user': ["user-id-here"]})
wdyt?
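A minimal sketch of the filter above, assuming the `clearml` package is installed and configured against a reachable server. `build_user_filter` and `tasks_owned_by` are hypothetical helper names for illustration, not part of the library; the `user` key is taken from the answer above.

```python
from typing import Iterable, Optional


def build_user_filter(user_ids: Iterable[str]) -> dict:
    """Build the task_filter dict that restricts results to the given owner IDs."""
    return {"user": list(user_ids)}


def tasks_owned_by(user_id: str, project_name: Optional[str] = None):
    """Return the tasks created by `user_id` (requires a reachable server)."""
    # Imported lazily so the filter helper works without clearml installed.
    from clearml import Task

    return Task.get_tasks(
        project_name=project_name,
        task_filter=build_user_filter([user_id]),
    )
```

For example, `tasks_owned_by(task.data.user)` would list the tasks created by the owner of an existing `task`.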
Yes I think the writer.add_figure somehow crops the image
SteadyFox10 I suspect you are correct 🙂
CourageousLizard33 see also section (4) here:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md#launching-the-trains-server-docker-in-linux-or-macos
however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work
I have to admit, this is where I'm losing you.
I thought you wanted to avoid the agent, since you wanted to run everything locally, wasn't that the issue ?
Maybe there is some background missing here, let me see if I can explain how the optimizer works.
In your actual training code you have something like:` params = {'lr': 0.3, ...
The main question I have is why is the ALB not passing the request, I think you are correct it never reaches the serving server at all, which leads me to think the ALB is "thinking" the service is down or is not responding, wdyt?
@<1523703080200179712:profile|NastySeahorse61> so glad you managed to solve it 🎊 🚀
Hi @<1523704207914307584:profile|ObedientToad56>
What would be the right way to extend this with, let's say, a custom engine that is currently not supported?
as you said 'custom' 🙂
This is actually a custom engine, (see (3) in the readme, and the preprocessing.py implementing it) I think we should actually add a specific example to custom so this is more visible. Any thoughts on what would...
@<1523707653782507520:profile|MelancholyElk85>
What's the clearml version you are using ?
Just making sure... base_task_id has to point to a Task that is in "draft" mode, for the pipeline to use it
But it should work out of the box ...
Yes it should ....
The user and personal access token are used as is and it propagates down to submodules, since those are simply another git repository.
Can you successfully run this manually:
git clone --recursive https://user:token@github.com/company/repo_with_submodules
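The manual check can also be scripted. This is a rough sketch that assumes `git` is on the PATH; `build_clone_url` and `clone_with_submodules` are hypothetical helper names. Percent-encoding the credentials is an addition not mentioned in the thread, but it matters when a token contains characters such as `@` or `/`.

```python
import subprocess
from urllib.parse import quote


def build_clone_url(user: str, token: str, host: str, repo_path: str) -> str:
    """Embed credentials in an HTTPS clone URL, percent-encoding special characters."""
    return f"https://{quote(user, safe='')}:{quote(token, safe='')}@{host}/{repo_path}"


def clone_with_submodules(url: str, target_dir: str) -> int:
    """Run `git clone --recursive` and return git's exit code (0 means success)."""
    result = subprocess.run(["git", "clone", "--recursive", url, target_dir])
    return result.returncode
```

If the clone fails only on the submodules, the credentials are probably valid for the parent repository but not for the submodule repositories.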
@<1523707653782507520:profile|MelancholyElk85> I just ran a single step pipeline and it seemed to use the "base_task_id" without cloning it...
Any insight on how to reproduce ?
A few examples here:
Grafana model performance example:
browse to
login with: admin/admin
create a new dashboard
select Prometheus as data source
Add a query: 100 * increase(test_model_sklearn:_latency_bucket[1m]) / increase(test_model_sklearn:_latency_sum[1m])
Change type to heatmap, and select on the right hand-side under "Data Format" s...
I am logging debug images via Tensorboard (via add_image function), however apparently these debug images are not collected within fileserver,
ZanyPig66 what do you mean not collected to the file server? are you saying the TB add_image is not automatically uploading images? or that you cannot access the files on your files server?
StaleButterfly40 are you sure you are getting the correct image on your TB (toy255) ?
I'm trying to get a task to run using a specific docker image and to source a bash script before execution of the python script.
Are you running an agent in docker mode ? if so you should be able to see the Output of your bash script first thing in the log
(and it will appear in the docker CMD)
Oh :) task.get_parameters_as_dict()
Hi IcySwallow94
Are you deploying the clearml server with the helm chart ?
Hi @<1523701868901961728:profile|ReassuredTiger98>
is there something like a clearml context manager to disable automatic logging?
Sure, just use a wildcard for the files you actually want to autolog; the rest will be ignored:
task = Task.init(..., auto_connect_frameworks={'pytorch' : '*.pt'})
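To see which file names a wildcard such as `'*.pt'` would select, here is a small sketch using Python's standard `fnmatch` module as a stand-in. How clearml matches names internally is not shown in this thread, so treat the matching rule as an assumption; `would_autolog` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def would_autolog(filename: str, pattern: str = "*.pt") -> bool:
    """Return True if `filename` matches the shell-style wildcard `pattern`."""
    return fnmatchcase(filename, pattern)


# Hypothetical file names a training script might save.
candidates = ["model.pt", "checkpoint_epoch3.pt", "optimizer.pth", "notes.txt"]
selected = [name for name in candidates if would_autolog(name)]
```

Under this assumption only the names ending in `.pt` are selected; `optimizer.pth` and `notes.txt` are left out.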
Thanks BitterStarfish58 !
Hmm, let me check something again.
Thanks for the details @<1597762318140182528:profile|EnchantingPenguin77>
clearml.Auto-Scaler - INFO - New instance b97e702d-e2b3-4f28-adab-be59648601ea listening to test-gpu queue
This looks like a new agent was spun up on your EC2 account, can you see it in the "Workers" page ?
no requests are being served as in there is no traffic indeed
It might be that it only pings when requests are served
what is actually setting the task status to Aborted?
server watchdog, basically saying, no one is pinging "I'm alive" on this "Task" I should abort it
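The watchdog idea can be illustrated with a toy model. This is only a sketch of the general heartbeat pattern, not the server's actual implementation; the `Watchdog` class, its method names, and the timeout value are all hypothetical.

```python
from typing import Dict, List


class Watchdog:
    """Toy watchdog: running tasks that stop pinging get marked 'aborted'."""

    def __init__(self, timeout_sec: float) -> None:
        self.timeout_sec = timeout_sec
        self.last_ping: Dict[str, float] = {}
        self.status: Dict[str, str] = {}

    def ping(self, task_id: str, now: float) -> None:
        """Record an "I'm alive" signal and mark the task as running."""
        self.last_ping[task_id] = now
        self.status[task_id] = "running"

    def sweep(self, now: float) -> List[str]:
        """Abort every running task whose last ping is older than the timeout."""
        aborted = []
        for task_id, seen in self.last_ping.items():
            if self.status[task_id] == "running" and now - seen > self.timeout_sec:
                self.status[task_id] = "aborted"
                aborted.append(task_id)
        return aborted
```

In this model a task that is alive but never calls `ping` is indistinguishable from a dead one, which matches the behavior described above.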
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug,...
GreasyPenguin14 GrittyKangaroo27 the new release contains a fix, could you verify it solves the issue in your scenario as well (there is now a smart timeout to detect the inconsistent state, that means the close/exit procedure might be delayed (10sec) instead of hanging in these specific rare scenarios)
However, it's very interesting why ability to cache the step impacts artifacts behavior
From your log:
videos_df = StorageManager.download_file(videos_df)
Seems like "videos_df" is the DataFrame, why are you trying to download the DataFrame ? I would expect to try and download the pandas file, not a DataFrame object
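One way to guard against this mix-up is to download only when the value is actually a URL or path string. This is a hedged sketch: `fetch_if_remote` is a hypothetical helper, and it falls back to `StorageManager.get_local_copy` from `clearml` (assumed installed) rather than the `download_file` call from the log.

```python
from typing import Callable, Optional


def fetch_if_remote(source: object, fetch: Optional[Callable[[str], str]] = None) -> object:
    """Download only when `source` is a URL/path string; pass other objects through."""
    if not isinstance(source, str):
        # Already an in-memory object (for example a DataFrame): nothing to download.
        return source
    if fetch is None:
        # Lazy import; get_local_copy returns the path of a locally cached copy.
        from clearml import StorageManager

        fetch = StorageManager.get_local_copy
    return fetch(source)
```

With this guard, a step that receives an already-loaded DataFrame uses it directly, and a step that receives a link to the stored file gets a local path it can then load with pandas.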
Hi @<1634001100262608896:profile|LazyAlligator31>
Is this because the code repo is being recreated in this directory?
Yes this is correct 🙂
Basically the entire code base + venv is installed there, to make sure it does not interfere with the "system" preinstalled environment
(it also allows for caching on the host machine 🙂 )
Hi JollyChimpanzee19
I found this one:
https://clearml.slack.com/archives/CTK20V944/p1622134271306500
TroubledHedgehog16 if you have a preinstalled conda env then why would you need to reinstall it from a yml file? Also, if this is the default python env, clearml-agent will inherit from it and use it (no real overhead there)
Notice the reason for "inheriting system" python environments is so that the agent could cache the individual Task requirements, meaning next time it will not need to reinstall anything
wdyt?
My use case is when I have a merge request for a model modification: I need to provide several pieces of information for our Quality Management System, one of which is to show that the experiment is a success and the model has some improvement over the previous iteration.
Sounds like a good approach 🙂
Obviously I don't want the reviewer to see all ...
Maybe publish the experiment and move it to a dedicated folder ? Then even if they see all other experiments, they are under "development" p...
Hi SpicyOtter88
plt.plot([0, 1], [0, 1], 'r--', label='')
It cannot have a legend without a label, so it gives it an "anonymous" label. I think it should just get "unlabeled 0", wdyt?
Thanks VexedKangaroo32 , this is great news :)