Clearml 1.13.1
Could you try the latest (1.16.2)? I remember there was a fix specific to Datasets
FierceHamster54 are you saying that inside the container it took 20 min to run, or that spinning up the GCP instance until it registered as an Agent took 20 min?
Most of the time is taken by building wheels for numpy and pandas ...
BTW: This happens if there is a version mismatch and pip decides it needs to build numpy from source. Can you send the full logs of that? Maybe we can somehow avoid it.
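As a rough aid for reading such logs, here is a minimal sketch that lists which packages pip reported building from source. It assumes the log contains pip's usual "Building wheel for <name>" progress lines; the sample log and the function name are made up for illustration.

```python
import re
from typing import List

# Matches pip progress lines such as
# "Building wheel for numpy (pyproject.toml): started".
# The summary line "Building wheels for collected packages: ..." is not matched.
_BUILD_LINE = re.compile(r"Building wheel for ([A-Za-z0-9_.\-]+)")


def packages_built_from_source(log_text: str) -> List[str]:
    """Return the unique package names pip reported building, in first-seen order."""
    seen: List[str] = []
    for match in _BUILD_LINE.finditer(log_text):
        name = match.group(1).lower()
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical log excerpt, not taken from a real run.
sample_log = """\
Collecting numpy==1.19.5
Building wheels for collected packages: numpy, pandas
  Building wheel for numpy (pyproject.toml): started
  Building wheel for pandas (pyproject.toml): started
"""
print(packages_built_from_source(sample_log))
```

If numpy or pandas show up here, pinning a version that has a prebuilt wheel for the container's Python version is usually what avoids the long build.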
Hmm I guess it's doable 🙂 could you open a GitHub issue with a feature request?
If there is enough support, it will get bumped up in priority 🤞
DefiantCrab67
Where will you copy it from ?
I was just able to reproduce with "localhost"
UnevenDolphin73 following the discussion https://clearml.slack.com/archives/CTK20V944/p1643731949324449, I suggest this change in the pseudo code:
` # task code
task = Task.init(...)
if not task.running_locally() and task.is_main_task():
    # pre-init stage
    StorageManager.download_folder(...)  # Prepare local files for execution
else:
    StorageManager.upload_file(...)  # Repeated for many files needed
    task.execute_remotely(...) `
Now that I look at it, it kind of makes sense to h...
I'm assuming you cannot directly access port 10022 (default ssh port on the remote machine) from your local machine, hence the connection issue. Could that be?
Thanks!
FYI: This section is not necessary if you have a clearml.conf file in ~/
Task.set_credentials(api_host="", web_host="", files_host="", key='********************', secret='***********************')
Let me check the code for a min
Hi CooperativeFox72
Sure 🙂
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
DilapidatedDucks58
all our workers went down after starting the slack bot, is it expected?)
Oh dear... I can't see any connection... What is the last log you have there?
JitteryCoyote63 I think that without specifically adding torch to the requirements, the agent will not be able to automatically resolve the correct cuda/torch version. Basically you should add torch to the requirements.txt file and pass it to Task.create, or use Task.force_requirements_env_freeze
JitteryCoyote63 did you add the bash script here: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L99
I pass my dataset as parameter of pipeline:
MysteriousWalrus11 I think you were expecting the dataset_df dataframe to be automatically serialized and passed, is that correct?
If you are using add_step, all arguments are simple types (i.e. str, int etc.)
If you want to pass complex types, your code should be able to upload it as an artifact and then you can pass the artifact url (or name) for the next step.
Another option is to use pipeline from dec...
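To illustrate the "pass a reference, not the object" idea, here is a minimal sketch using only the standard library. A local CSV file stands in for the uploaded artifact, and the step functions are hypothetical names, not ClearML APIs. In a real pipeline the first step would upload the file as an artifact and hand the artifact name or URL to the next step.

```python
import csv
import os
import tempfile
from typing import Dict, List


def step_produce(rows: List[Dict[str, str]], out_dir: str) -> str:
    """First step: persist the table and return its path (a plain str)."""
    path = os.path.join(out_dir, "dataset.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    # A real step would upload the file as an artifact here and return
    # the artifact name or URL instead of a local path.
    return path


def step_consume(path: str) -> List[Dict[str, str]]:
    """Second step: receives only the string reference and loads the data itself."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    ref = step_produce([{"id": "1", "label": "cat"}, {"id": "2", "label": "dog"}], tmp)
    loaded = step_consume(ref)
    print(type(ref).__name__, len(loaded))
```

The only thing crossing the step boundary is `ref`, a string, which is the kind of simple type add_step arguments can carry.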
Hi MistakenDragonfly51
Notice that Models are their own entity, you can query them based on tags/projects/names etc.
Querying and getting Models is done by Model class:
https://clear.ml/docs/latest/docs/references/sdk/model_model#modelquery_models
task.get_models() is always empty.
How come there are no Models on the Task? (in other words how come this is empty?)
Hi MelancholyChicken65
I'm not sure you can control it. The UI deduces the URLs based on the address you are browsing to, so if you go to http://app.clearml.example.com you will get the correct ones, but you have to put the services on the right subdomains:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#subdomain-configuration
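As an illustration of the subdomain naming convention (not the web UI's actual code), the sketch below derives the `api.` and `files.` addresses from an `app.` address. The function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def derive_service_urls(app_url: str) -> dict:
    """Swap a leading 'app.' label for 'api.' and 'files.' in the host name."""
    parts = urlsplit(app_url)
    host = parts.netloc
    if not host.startswith("app."):
        raise ValueError("expected a host starting with 'app.': %r" % host)
    base = host[len("app."):]
    return {
        name: urlunsplit((parts.scheme, "%s.%s" % (name, base), "", "", ""))
        for name in ("api", "files")
    }


print(derive_service_urls("http://app.clearml.example.com"))
```

This is why browsing to the server by a bare IP or an unrelated hostname can leave the UI pointing at the wrong API and file-server addresses.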
for a TPU with more than 16GB GRAM and less than 40GB, so sometimes we need to provision an A100 to get the training speed we want but we don't use all the GRAM
Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html
Yes it seems so 😞
Seems like a Task contained an invalid artifact link.
I wouldn't sweat over it, it's basically a warning that it could not locate the actual file to delete (albeit an ugly warning 🙂 )
I think AnxiousSeal95 would know when the new version will be ready.
regardless, is it actually deleting old Tasks ?
Hi RipeGoose2
Any logs on the console ?
Could you test with a dummy example on the demoserver ?
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server):
echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
Did you try the above, and you are still getting the same error ?
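To confirm the setting actually took effect, a small check like the following can be run on the host. It is a sketch that assumes a Linux machine exposing `/proc/sys/vm/max_map_count`; the helper names are made up.

```python
REQUIRED_MAX_MAP_COUNT = 262144  # value used in the commands above
PROC_PATH = "/proc/sys/vm/max_map_count"  # Linux only


def parse_max_map_count(text: str) -> int:
    """Parse the single integer stored in the proc file."""
    return int(text.strip())


def is_sufficient(current: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """True when the current limit is at least the required value."""
    return current >= required


def check_host(path: str = PROC_PATH):
    """Return True/False on Linux, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return is_sufficient(parse_max_map_count(f.read()))
    except OSError:
        return None


print(check_host())
```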
MysteriousBee56 what do you mean by "delete a worker"?
Stop the agent running remotely?
Task.connect is "automagic" and two-way: values go to the server when running manually, and come from the server when running under an agent.
set_parameter is one-way only and should be used to set an external Task's parameters.
Hi FiercePenguin76
It seems it fails detecting the notebook server and thinks this is a "script running".
What is exactly your setup?
docker image ?
jupyter-lab version ?
clearml version?
Also are you getting any warning when calling Task.init ?
BoredHedgehog47 you need to configure the clearml k8s glue to spin up pods (instead of statically allocating an agent per pod). Does that make sense?
A quick fix will be:
` import os
import dotenv
dotenv.load_dotenv(os.path.expanduser('~/.env'))  # expand '~' explicitly
from clearml import Task  # Now we can load it.
import argparse
if __name__ == "__main__":
    pass  # do stuff `
wdyt?
my question is how to recover, must i recreate the agents or there is another way?
Yes you have to recreate the Task (I assume they failed, no?!)
server-->agent is fast, but agent-->server is slow.
Then multiple connections will not help. The bottleneck is the upload speed of your machine, regardless of what the target is (file-server, S3, etc.)
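A back-of-the-envelope model makes the point. It is idealised (no protocol overhead, a fully saturated uplink) and the numbers are hypothetical, but it shows why splitting an upload over several connections does not shorten it when the uplink is the limit.

```python
def upload_seconds(size_mb: float, uplink_mbit_per_s: float) -> float:
    """Lower bound on upload time: payload bits divided by uplink bandwidth."""
    return (size_mb * 8) / uplink_mbit_per_s


def per_connection_mbit(uplink_mbit_per_s: float, connections: int) -> float:
    """With a saturated uplink, parallel connections share the same pipe."""
    return uplink_mbit_per_s / connections


# Hypothetical numbers: a 1000 MB artifact over a 20 Mbit/s uplink.
one = upload_seconds(1000, 20)
# Four connections each carry a quarter of the data at a quarter of the rate.
four = upload_seconds(1000 / 4, per_connection_mbit(20, 4))
print(one, four)
```

Both cases take the same time, which matches the observation that server-->agent is fast while agent-->server is slow.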
Notice the parents argument when creating a new Dataset
UnevenDolphin73 are you saying offline does not work?
stream.write(msg + self.terminator) ValueError: I/O operation on closed file.
This is an internal Python error. How come there is no stream?
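For reference, the same ValueError can be reproduced with nothing but the standard library: the stream object still exists, but it has already been closed when the write happens. This is only a minimal reproduction of the error itself, not of whatever closed the stream in the original report.

```python
import io


def write_to_closed_stream() -> str:
    """Close a text stream, then write to it and return the error message."""
    stream = io.StringIO()
    stream.close()
    try:
        stream.write("message\n")
    except ValueError as exc:
        return str(exc)
    return "no error"


print(write_to_closed_stream())
```

So the question to chase is what closed stdout/stderr (or the handler's stream) while a logging handler was still holding a reference to it.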