RobustGoldfish9 I see.
So in theory, spinning up an experiment on an agent would be: clone code -> build docker -> mount code -> execute code inside docker?
(no need for requirements etc.?)
Not yet 🙂
It should not be complex to implement; the actual AWS auto-scaler class implements just two functions:
def spin_up_worker(self, resource, worker_id_prefix, queue_name):
https://github.com/allegroai/clearml/blob/e9f8fc949db7f82b6a6f1c1ca64f94347196f4c0/clearml/automation/auto_scaler.py#L104
def spin_down_worker(self, instance_id):
https://github.com/allegroai/clearml/blob/e9f8fc949db7f82b6a6f1c1ca64f94347196f4c0/clearml/automation/auto_scaler.py#L...
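So a custom scaler boils down to subclassing AutoScaler and overriding those two methods. A minimal sketch (the cloud-provider calls are placeholders you would fill in):
` from clearml.automation.auto_scaler import AutoScaler

class MyCloudAutoScaler(AutoScaler):
    def spin_up_worker(self, resource, worker_id_prefix, queue_name):
        # start an instance of type "resource" via your cloud API,
        # bootstrap a clearml-agent on it listening on queue_name,
        # and name the worker using worker_id_prefix
        ...

    def spin_down_worker(self, instance_id):
        # terminate the instance with this id via your cloud API
        ... `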
You can see the class here:
https://github.com/allegroai/clearml/blob/9b962bae4b1ccc448e1807e1688fe193454c1da1/clearml/binding/frameworks/__init__.py#L52
Basically you do:
` from clearml.binding.frameworks import WeightsFileHandler

def my_callback(load_or_save, model):
    # type: (str, WeightsFileHandler.ModelInfo) -> WeightsFileHandler.ModelInfo
    assert load_or_save in ('load', 'save')
    # do something with the model info here
    skip = False  # your own logic deciding whether to ignore this model
    if skip:
        return None  # returning None skips this load/save event
    return model

WeightsFileHandler.add_pre_callback(my_callback) `
I came across it before but thought it's only relevant for credentials
We are working on improving the docs, hopefully it will get clearer 🙂
Hi SillyPuppy19
I think I lost you half way through.
I have a single script that launches training jobs for various models.
Is this like the automation example on GitHub, i.e. cloning/enqueuing experiments?
...a flag which is the model name, and dynamically loading the module to train it.
a Model has a UUID in the system as well, so you can use that instead of name (which is not unique), would that solve the problem?
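For example, fetching a model by its ID instead of its name (a sketch; the ID is a placeholder):
` from clearml import InputModel

model = InputModel(model_id='aabbcc0123456789')  # the unique model UUID from the UI `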
This didn't mesh well with Trains, because the project a...
I'm just curious about how the trains server on different nodes communicates about the task queue
We start manually: we tell the agent to just execute the task (notice we never enqueued it). If all goes well, we will get to the multi-node part 🙂
What if the cleanup service is launched using the ClearML-Agent Services container?
The easiest is to use the container args and pass the AWS credentials as env variables: -e AWS_ACCESS_KEY_ID=abcd -e ....
Make sense?
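One way to set those container args from code is Task.set_base_docker. A sketch, assuming a recent clearml SDK (the image and key values are placeholders):
` from clearml import Task

task = Task.get_task(task_id='...')  # the cleanup service task
task.set_base_docker(
    docker_image='python:3.9',
    docker_arguments=['-e', 'AWS_ACCESS_KEY_ID=abcd', '-e', 'AWS_SECRET_ACCESS_KEY=xyz'],
) `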
Simply record the type of each argument when you store it, and keep it in the database, unbeknownst to the user. What do you say?
This is now supported, but then you still need to flatten the dict.
Maybe we can just support "empty_dict/new_value = 42" if the original was "empty_dict = {}"
WDYT?
Train Data Params/a = {}
Train Data Params/b = ...
Then maybe we could "hack" it so that if you edit it in the UI like so:
Train Data Params/a = {'new': 'value'}
Train Data Params/b = ...
You end up with param = {'a': {'new': 'value'}, 'b': ...}
What do you think?
yes ...
What's your use case for passing an empty dict? (meaning how would one use it later)
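For context, this is how a nested dict connected to a task is flattened into "Section/key" entries today. A sketch with hypothetical project/task names:
` from clearml import Task

task = Task.init(project_name='examples', task_name='params demo')
params = {'a': {}, 'b': 0.1}
task.connect(params, name='Train Data Params')
# appears in the UI as:
#   Train Data Params/a = {}
#   Train Data Params/b = 0.1 `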
Hi SmarmySeaurchin8
Could you open a bug on GitHub, so this is not lost? Let's assume 'a' is tracked, how would one change 'a' in the UI?
Thanks SolidSealion72 !
Also, I found out that adding "pool.join()" after pool.close() seems to solve the issue in the minimal example.
This is interesting, I'm pretty sure it has something to do with the subprocess not "closing" properly (or too fast or something)
Let me see if I can reproduce
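For reference, the pattern that fixed the minimal example looks roughly like this (a generic sketch, not the actual reproduction code):
` from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    results = pool.map(square, range(10))
    pool.close()  # no more work will be submitted
    pool.join()   # block until all worker subprocesses have exited cleanly `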
Oh that is odd... let me check something
FierceHamster54 are you saying that inside the container it took 20 min to run? Or that spinning up the GCP instance until it registered as an agent took 20 min?
Most of the time is taken by building wheels for numpy and pandas ...
BTW: This happens if there is a version mismatch and pip decides it needs to build the numpy from source, Can you send the full logs of that? Maybe we can somehow avoid that?
are you referring to extra_docker_shell_script?
Correct
the thing is that this runs before you create the virtual environment, so then in the new environment those settings are no longer there
Actually that is better, because this is what we need to set up pip before it is used. So instead of passing --trusted-host, just do:
` extra_docker_shell_script: ["echo \"[global] \n trusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_S...
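The intent of that echo is to produce a pip config along these lines (a sketch; the truncated YOUR_S... placeholder is kept as-is):
` [global]
trusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_S... `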
DeliciousBluewhale87 you can try:
` import sqlite3
import pandas as pd

conn = sqlite3.connect('test_database')
sql_query = pd.read_sql_query('''
    SELECT *
    FROM products
''', conn)
sql_query.to_csv(...) `
I always have my notebooks in git repo but suddenly it's not running them correctly.
What do you mean?
Can I switch off git diff (change detection?)
Yes, Task.init(..., auto_connect_frameworks={"detect_repository": False})
Have to get the glue setup, which I couldn't understand fully, so that's a different topic
I suggest using the apply template setup (basically you provide a Job/Service template, and it uses that to setup k8s jobs based on the Tasks coming in from the specific queue)
And can I store models with no attachment to tasks?
Assuming you have the Model ID:
model = InputModel(model_id='aabbcc')
local_file_or_folder = model.get_weights()
Is this what you are looking for?
ReassuredTiger98 do you know if tensorboard (not tensorboardX) also supports gif there?
Okay, make sure that in your trains.conf on all the trains-agent machines you add the following:
agent.extra_docker_arguments: ["-v", "/etc/hosts:/etc/hosts",]
clearml sdk (i.e. Python client)
The issue is that Task.create did not add the repo link (again, as mentioned above, you need to pass the local folder or repo link to the repo argument of the Task.create function). I "think" it could automatically deduce the repo from the script entry point, but I'm not sure, hence my question about the clearml package version.
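For example (a sketch; the repo URL and script name are placeholders):
` from clearml import Task

task = Task.create(
    project_name='examples',
    task_name='remote task',
    repo='https://github.com/your-org/your-repo.git',  # or a local folder path
    script='train.py',
) `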
Actually scikit implies joblib 🙂 (so you should use scikit, anyhow I'll make sure we add joblib as it is more explicit)
Hi RotundSquirrel78
How did you end up with this command line?
/home/sigalr/.clearml/venvs-builds/3.8/code/unet_sindiff_1_level_2_resblk --dataset humanml --device 0 --arch unet --channel_mult 1 --num_res_blocks 2 --use_scale_shift_norm --use_checkpoint --num_steps 300000
The arguments passed are odd (there should be none; they are passed inside the execution), and I suspect this is the issue
Okay that might explain the issue...
MysteriousBee56 so what you are saying is:
python3 -m trains-agent --help
does NOT work, but
trains-agent --help
does work?