Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server?
This is fully supported 🙂
You can think of queues as a way to simplify resource allocation for users (you can do more than that, but let's start simple)
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin up the agents, you attach each agent to the "correct" queue for its machine type.
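For example, creating those queues programmatically could look roughly like this (a rough sketch assuming the APIClient queues.create call; you can also just create queues from the UI, and the names are the illustrative ones above):

from clearml.backend_api.session.client import APIClient

client = APIClient()

# one queue per resource type; agents on matching machines listen on the matching queue
for queue_name in ["on_prem_1gpu", "on_prem_2gpus", "ec2_t4", "ec2_v100"]:
    client.queues.create(name=queue_name)

# on each machine you would then run something like:
#   clearml-agent daemon --queue on_prem_1gpu --gpus 0
# so the agent only pulls jobs meant for that resource type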
Int...
Try adding this environment variable: export TRAINS_CUDA_VERSION=0
How can I turn off git diff uploading?
Sure, see here
None
Please send the full log, I just tested it here, and it seems to be working
Hmm what do you have here?
os.system("cat /var/log/studio/kernel_gateway.log")
@<1541954607595393024:profile|BattyCrocodile47> first let me say I ❤ the dark theme you have going on there, we should definitely add that 🙂
When I run
python set_triggers.py; python basic_task.py
, they seem to execute, b
Seems like you forgot to start the trigger, i.e.
None
(this will cause the entire script of the trigger inc...
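Something along these lines (a rough sketch assuming the TriggerScheduler from clearml.automation; the trigger parameters and task ID are just illustrative):

from clearml.automation import TriggerScheduler

# check for new events every few minutes
trigger = TriggerScheduler(pooling_frequency_minutes=3)

# illustrative trigger: enqueue a copy of a template task whenever a model is published
trigger.add_model_trigger(
    name="model publish trigger",
    schedule_task_id="<template_task_id>",
    schedule_queue="default",
    trigger_project="examples",
    trigger_on_publish=True,
)

# without this call the triggers are only registered, never executed
trigger.start()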
Hi @<1523701797800120320:profile|SteadySeagull18>
...the job -> requeue it from the GUI, then a different environment is installed
The way it works is: in the "originating" (i.e. first manual) execution, only the directly imported packages are listed (not the derivative packages that are required by the original packages)
But when the agent is reproducing the job, it creates a whole clean venv for the experiment, installs the required packages, then pip resolves the derivatives, and ...
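If you prefer the originating run to store the full pip freeze (derivative packages included) rather than only the directly imported ones, something like this should do it (a minimal sketch; assumes Task.force_requirements_env_freeze is available in your clearml version and is called before Task.init):

from clearml import Task

# record the complete `pip freeze` of the local environment
# instead of only the directly imported packages
Task.force_requirements_env_freeze(force=True)

task = Task.init(project_name="examples", task_name="full-env-capture")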
GiddyTurkey39 can you ping the server address?
(just making sure, this should be the IP of the server not 'localhost')
Hi @<1544128915683938304:profile|DepravedBee6>
You mean like backup the entire instance and restore it on another machine? Or are you referring to specific data you want to migrate?
BTW if you are upgrading from an old version of the server, I would recommend upgrading through every version in between (a few of them have migration scripts that need to run)
Should work, follow the backup process, and restore into a new machine:
None
To get all the image metrics: client.events.get_task_metrics(tasks=['6adb929f66d14731bc76e3493ab89d80'], event_type='training_debug_image')
metric=image is the name shown in the debug images dropdown
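Putting it together, a small sketch (the task ID is the example one from above; the print is just to inspect the raw response):

from clearml.backend_api.session.client import APIClient

client = APIClient()

# fetch all debug-image events reported by this task
res = client.events.get_task_metrics(
    tasks=["6adb929f66d14731bc76e3493ab89d80"],
    event_type="training_debug_image",
)
print(res)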
FYI: ssh -R 8080:localhost:8080 -R 8008:localhost:8008 -R 8081:localhost:8081 replace_with_username@ubuntu_ip_here
solved the issue 🙂
Hi UptightBeetle98
The hyperparameter example assumes you have agents ( trains-agent ) connected to your account. These agents pull the jobs from the queue (where they are now, i.e. pending), set up the environment for each job (venv or docker+venv), and execute the job with the specific arguments the optimizer chose.
Make sense?
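For reference, a rough sketch of what the optimizer side could look like (illustrative values; base_task_id is the template experiment the optimizer clones, and execution_queue is the queue your agents listen on):

from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<template_task_id>",
    hyper_parameters=[
        UniformIntegerParameterRange("General/batch_size", min_value=16, max_value=128, step_size=16),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    execution_queue="default",           # agents listening on this queue will run the trials
    max_number_of_concurrent_tasks=2,    # roughly matches the number of available agents
)
optimizer.start()
optimizer.wait()
optimizer.stop()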
Hi @<1544853695869489152:profile|NonchalantOx99>
I would assume the clearml-server configuration / access key is misconfigured in your copy of example.env
JitteryCoyote63 you mean from code?
HandsomeCrow5 check the latest RC, I just ran the same code and it worked 🙂
That said, it might be a different backend, I'll test with the demo server
JitteryCoyote63 S3 should work. Go to your profile page and check whether you already have some old credentials there, maybe this is the issue.
Hi ApprehensiveFox95
You mean removing the argparse arguments from code?
Or post-execution in the UI?
Sure: task = Task.init(..., auto_connect_arg_parser={'arg_not_to_log': False})
This will cause all argparse arguments to be automatically logged (and later editable), with the exception of the argument arg_not_to_log
Notice that if you have --arg-something, to exclude it add 'arg_something': False to the dict
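A minimal end-to-end sketch (argument names are just for illustration):

from argparse import ArgumentParser
from clearml import Task

parser = ArgumentParser()
parser.add_argument("--batch-size", type=int, default=32)
parser.add_argument("--arg-something", type=str, default="do-not-log-me")

# note: the CLI dash becomes an underscore in the exclusion dict
task = Task.init(
    project_name="examples",
    task_name="argparse exclusion",
    auto_connect_arg_parser={"arg_something": False},
)

args = parser.parse_args()  # --batch-size is logged and editable, --arg-something is not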
now it stopped working locally as well
At least this is consistent 🙂
How so? Is the "main" Task still running?
because comparing experiments using graphs is very useful. I think it is a nice to have feature.
So currently when you compare the graphs you can select the specific scalars to compare, and it updates in real time!
You can also bookmark the actual URL and it is fully reproducible (i.e. full state is stored)
You can also add custom columns to the experiment table (with the metrics) and sort / filter based on them, and create a summary dashboard (again, like all pages in the web app, the URL is...
YEY 🙂 🙂
It should be fairly easy to write such a daemon:
from time import time
from datetime import datetime

from clearml.backend_api.session.client import APIClient

client = APIClient()
timestamp = time() - 60 * 60 * 2  # last 2 hours
tasks = client.tasks.get_all(
    status=["in_progress"],
    only_fields=["id"],
    order_by=["-last_update"],
    page_size=100,
    page=0,
    created=[">{}".format(datetime.utcfromtimestamp(timestamp))],
)
...
references:
[None](https://clear.ml/...
I guess I would need to put this in the extra_vm_bash_script param of the auto-scaler, but it will reboot in a loop, right? Isn't there an easier way to achieve that?
You can edit the extra_vm_bash_script
which means the next time an instance is booted the bash script will be executed.
In the meantime, you can ssh to the running instance and change the ulimit manually, wdyt?
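And for the next instances, a rough idea of what you could paste into that extra_vm_bash_script field (illustrative values, not tested; shown here as the string the autoscaler configuration expects):

# illustrative only: raise the open-file limit on every newly booted instance
extra_vm_bash_script = """
echo '* soft nofile 65535' >> /etc/security/limits.conf
echo '* hard nofile 65535' >> /etc/security/limits.conf
"""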