Eureka! That is because my own machine has 10.2 (not the Docker container, but the machine the agent is on)
Let's see if this is really the issue
I showed you this phenomenon in the UI screenshots in the other thread
The inference table is a pandas DataFrame
and the machine I have is on CUDA 10.2.
I also tried nvidia/cuda:10.2-base-ubuntu18.04, which is the latest
I don't think I can; this is private IP, and creating a dummy example of a pipeline and execution would take more time than I can dedicate to this
But remember, it also didn't work with the default one (nvidia/cuda)
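For context, this is roughly how I pin the container per task instead of relying on the agent's default image. Just a minimal sketch assuming the standard clearml SDK; the project/task names are placeholders, and the image tag is the 10.2 one I mentioned above:

```
import subprocess

from clearml import Task

# Print the host's driver / CUDA pairing, to compare against the CUDA
# version the docker image ships with (my machine reports 10.2).
subprocess.run(["nvidia-smi"], check=False)

task = Task.init(project_name="examples", task_name="cuda check")  # placeholder names

# Ask the agent to run this task inside a CUDA 10.2 base image,
# matching the 10.2 installation on the host.
task.set_base_docker("nvidia/cuda:10.2-base-ubuntu18.04")
```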
I guess not many TensorFlow users are running agents around here if this wasn't brought up already
Thanks very much
Now something else is failing, but I'm pretty sure it's on my side now... So have a good day and see you in the next question 😄
That's awesome, but my problem right now is that I have my own cronjob that deletes the contents of /tmp at each interval, and it deletes the cfg files... So I understand I must skip deleting them from now on
So how do I solve the problem? Should I just relaunch the agents? Because they can't execute jobs now
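For reference, this is the kind of cleanup I mean, as a minimal sketch with a placeholder age threshold: age-based deletion under /tmp that now skips the cfg files so the agents keep working.

```
import time
from pathlib import Path

TMP_DIR = Path("/tmp")
MAX_AGE_SECONDS = 24 * 60 * 60  # placeholder: delete anything older than a day

now = time.time()
for path in TMP_DIR.iterdir():
    # Skip the agents' .cfg files so they survive the cleanup interval.
    if path.suffix == ".cfg":
        continue
    try:
        if path.is_file() and now - path.stat().st_mtime > MAX_AGE_SECONDS:
            path.unlink()
    except OSError:
        # A file can disappear between listing and deletion; ignore it.
        pass
```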
I guess the AMI auto-updated
Increased to 20 GB; let's see how long it will last 🙂
why does it deplete so fast?
I mean, I barely have 20 experiments
I'll check if this works tomorrow
(it works now, with 20 GB)
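(If it helps figure out what actually eats the space, here's a quick sketch I can run on the agent host; the ~/.clearml path is an assumption for wherever the agent keeps its venv builds and caches, so adjust it if your config points elsewhere.)

```
from pathlib import Path

# Assumption: the agent keeps venv builds and caches under ~/.clearml;
# adjust the path if the agent config points somewhere else.
root = Path.home() / ".clearml"

if root.exists():
    sizes = {
        child.name: sum(f.stat().st_size for f in child.rglob("*") if f.is_file())
        for child in root.iterdir()
        if child.is_dir()
    }
    for name, total in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {total / 1024 ** 3:.2f} GB")
else:
    print(f"{root} does not exist on this machine")
```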
Maybe what happens is that after start / start_locally the reference to the pipeline task somehow disappears? O_O
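To show what I mean, here is a minimal sketch of how I keep an explicit reference to the controller around start / start_locally, assuming the usual PipelineController API; the project, pipeline, and step names are placeholders:

```
from clearml.automation import PipelineController

# Keep an explicit reference to the controller so the pipeline task
# stays reachable after it starts.
pipe = PipelineController(
    name="my pipeline",      # placeholder
    project="examples",      # placeholder
    version="0.0.1",
)
pipe.add_step(
    name="stage_one",
    base_task_project="examples",    # placeholder
    base_task_name="step template",  # placeholder: an existing template task
)

# start_locally() runs the controller in this process;
# start() would enqueue it for an agent instead.
pipe.start_locally(run_pipeline_steps_locally=True)
print("controller still referenced after start:", pipe)
```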
Okay, so regarding the version: we are using 1.1.1
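(For completeness, this is how I check the installed SDK version on each host, so we're comparing the same thing everywhere:)

```
import clearml

# Print the installed clearml SDK version for this host.
print(clearml.__version__)
```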
The thing with this error is that it happens sometimes, and when it happens it never goes away...
I don't know what causes it, but we have one host where it works okay; then someone else checks out the repo and tries, and it fails with this error, while another person can do the same and it works for them
This is part of a bigger process that takes quite some time and resources; I hope I can try this soon if it will help get to the bottom of this
If you want, we can do a live Zoom call or something so you can see what happens