Hi CooperativeFox72
Sure 🙂
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
DilapidatedDucks58
all our workers went down after starting the slack bot, is it expected?)
Oh dear... I can't see any connection... What is the last log you have there?
JitteryCoyote63 I think that without specifically adding torch to the requirements, the agent will not be able to automatically resolve the correct cuda/torch version. Basically you should add torch to the requirements.txt file and provide it to Task.create, or use Task.force_requirements_env_freeze
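For example (the torch version pin below is just a placeholder):
` # requirements.txt should explicitly list torch, e.g.:
# torch==2.1.0

from clearml import Task

# force the agent to install this exact requirements file,
# skipping automatic package resolution (call before Task.init)
Task.force_requirements_env_freeze(requirements_file="requirements.txt")

task = Task.init(project_name="examples", task_name="torch training") `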
JitteryCoyote63 did you add the bash script here: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L99
I pass my dataset as a parameter of the pipeline:
@<1523704757024198656:profile|MysteriousWalrus11> I think you were expecting the dataset_df dataframe to be automatically serialized and passed, is that correct ?
If you are using add_step, all arguments are simple types (i.e. str, int etc.)
If you want to pass complex types, your code should be able to upload it as an artifact and then you can pass the artifact url (or name) for the next step.
Another option is to use a pipeline built from decorators (PipelineDecorator), where complex objects can be passed between steps.
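For example, the artifact hand-off between add_step steps could look like this (step/artifact names are just placeholders):
` from clearml import Task

# inside the first step: upload the dataframe as an artifact
task = Task.current_task()
task.upload_artifact(name="dataset_df", artifact_object=dataset_df)

# inside the next step: fetch it back, given the first step's Task ID
# (passed between steps as a simple string argument)
source_task = Task.get_task(task_id=first_step_task_id)
dataset_df = source_task.artifacts["dataset_df"].get() `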
Hi MistakenDragonfly51
Notice that Models are their own entity; you can query them based on tags/projects/names etc.
Querying and getting Models is done with the Model class:
https://clear.ml/docs/latest/docs/references/sdk/model_model#modelquery_models
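Something along these lines (project/tag values are just placeholders):
` from clearml import Model

# query models by project / tags / name
models = Model.query_models(project_name="examples", tags=["production"])
for model in models:
    print(model.id, model.name) `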
task.get_models()
is always empty.
How come there are no Models on the Task? (in other words how come this is empty?)
Hi MelancholyChicken65
I'm not sure you can control it; the UI deduces the URL based on the address you are browsing to. So if you go to http://app.clearml.example.com you will get the correct ones, but you have to put them on the right subdomains:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#subdomain-configuration
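For reference, the client-side clearml.conf would then point at the three subdomains (hostnames taken from the example above):
` api {
    web_server: https://app.clearml.example.com
    api_server: https://api.clearml.example.com
    files_server: https://files.clearml.example.com
} `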
for a GPU with more than 16GB GRAM and less than 40GB, so sometimes we need to provision an A100 to get the training speed we want but we don't use all the GRAM
Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html
Yes it seems so 😞
Seems like a Task contained an invalid artifact link.
I wouldn't sweat over it, it's basically a warning that it could not locate the actual file to delete (albeit an ugly warning 🙂 )
I think AnxiousSeal95 would know when will the new version be ready.
regardless, is it actually deleting old Tasks ?
Hi RipeGoose2
Any logs on the console ?
Could you test with a dummy example on the demoserver ?
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server):
` echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart `
Did you try the above, and you are still getting the same error ?
MysteriousBee56 what do you mean by "delete a worker"?
stop the agent running remotely ?
Task.connect is "automagic", i.e. it writes to the server when in manual mode, and reads from the server when in agent mode.
set_parameter is one-way only and should be used to set an external Task's parameters.
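A quick sketch of the difference (the Task ID is a placeholder):
` from clearml import Task

task = Task.init(project_name="examples", task_name="params demo")

# two-way: values are sent to the server in manual mode,
# and overridden from the server when executed by an agent
config = {"lr": 0.001, "batch_size": 32}
config = task.connect(config)

# one-way: explicitly set a parameter on an external Task
other_task = Task.get_task(task_id="1234567890abcdef")  # placeholder ID
other_task.set_parameter("General/lr", 0.01) `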
Hi FiercePenguin76
It seems it fails to detect the notebook server and thinks this is a "script running".
What is exactly your setup?
docker image ?
jupyter-lab version ?
clearml version?
Also are you getting any warning when calling Task.init ?
BoredHedgehog47 you need to configure the clearml k8s glue to spin up pods (instead of statically allocating agents per pod). Does that make sense ?
A quick fix will be:
` import os
import dotenv

# load the .env file before importing clearml, so the env vars are set
# (load_dotenv does not expand "~" on its own)
dotenv.load_dotenv(os.path.expanduser('~/.env'))

from clearml import Task  # now we can load it
import argparse

if __name__ == "__main__":
    # do stuff `
wdyt?
my question is how to recover, must I recreate the agents or is there another way?
Yes you have to recreate the Task (I assume they failed, no?!)
server-->agent is fast, but agent-->server is slow.
Then multiple connections will not help; this is the bottleneck of the upload speed of your machine, regardless of what the target is (file-server, S3, etc...)
Notice the parents argument when creating a new Dataset
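For example (project/dataset names are just placeholders; the SDK keyword is parent_datasets):
` from clearml import Dataset

# get the current version to use as a parent
parent = Dataset.get(dataset_project="examples", dataset_name="my dataset")

# create a child version on top of it
child = Dataset.create(
    dataset_project="examples",
    dataset_name="my dataset",
    parent_datasets=[parent.id],
)
child.add_files("new_files/")
child.upload()
child.finalize() `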
UnevenDolphin73 are you saying offline does not work?
` stream.write(msg + self.terminator)
ValueError: I/O operation on closed file. `
This is an internal Python error, how come there is no stream?
Then, as you suggested, I would just use sys.path; it is probably the easiest and actually very safe (because the subfolders are always next to the "main" source code)
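Something like this (folder and module names are hypothetical):
` import os
import sys

# the subfolder is always next to this "main" script, so build the path from __file__
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "subfolder"))

import my_helper_module  # hypothetical module inside ./subfolder `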
I've tried setting up a clearml application on openshift
First, my condolences 🙂 openshift ...
Second, what you need to make sure is that each container (i.e. ELK/Mongo etc.) has its own PV for persistent storage; I'm assuming this is the root cause for the error.
Make sense to you ?
If this is the case, there is nothing you need to change, just provide the docker image (no need to pass packages)
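For example (the image name is just a placeholder):
` from clearml import Task

task = Task.init(project_name="examples", task_name="docker run")
# when executed by an agent, run inside this image and use
# the packages already installed in it
task.set_base_docker("nvidia/cuda:11.8.0-runtime-ubuntu22.04") `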
Wait, how do I reproduce it on the community server? Maybe it has something to do with the number of columns ? Or whether it is already wider than the screen? What's your browser / OS ?
Hmm I tested on chromium and it seemed to work, let me see if I can reproduce it...
Thanks GiganticTurtle0
So the bug is that "mock_step" is storing the "NUMBER_2" argument value in the second instance?
no need for it actually