Then when I queue up a job on the 1x16GB
queue it would run on one of the two GPUs?
Oh, I forgot to mention: `pip install tensorboard` also.
OK, so if I've got, like, 2x16GB GPUs and 2x32GB GPUs, I could allocate all the 16GB GPUs to one queue? And all the 32GB ones to another?
Or we could do:
@misc{clearml,
  title  = {ClearML - Your entire MLOps stack in one open-source tool},
  year   = {2019},
  note   = {Software available from http://github.com/allegroai/clearml},
  url    = {https://clear.ml/},
  author = {Allegro AI},
}
IrritableOwl63 pm'd you a task ID
CostlyOstrich36 I made a code snippet for you:
```
from clearml import Task
import random
from collections import defaultdict

# figuring out the project ID
project_list = Task.get_projects()  # get all the projects
project_id = Task.get_project_id("your project name here")

# getting all the tasks for a project
tasks = Task.get_all(project=[project_id]).response.tasks

# loop through and get approximate maximum gpu-seconds by type
task = random.choice(tasks)
print(dir(task))
print(task.runtim...
```
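Since that snippet cuts off, here's a sketch of how the aggregation could finish. The `started`/`completed` timestamps and the `runtime` dict with a `gpu_type` key are my guesses at the field names, so check them against the `dir(task)` output:
```
gpu_seconds = defaultdict(float)
for task in tasks:
    # skip tasks that never started or never finished
    if not (task.started and task.completed):
        continue
    # "gpu_type" is an assumed key; inspect task.runtime to confirm
    gpu_type = (task.runtime or {}).get("gpu_type", "unknown")
    gpu_seconds[gpu_type] += (task.completed - task.started).total_seconds()

for gpu_type, seconds in gpu_seconds.items():
    print(f"{gpu_type}: ~{seconds:,.0f} gpu-seconds")
```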
I suppose I could upload 200 different "datasets", rather than one dataset with 200 folders in it, but then `clearml-data search` would have 200 entries in it? It seemed like a good idea to put them all in one at the time.
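If splitting them ever does make sense, a rough sketch of the per-folder route with clearml's `Dataset` API; grouping them under one `dataset_project` keeps them listable together (the paths and names here are placeholders):
```
from pathlib import Path
from clearml import Dataset

root = Path("./all_folders")  # placeholder: parent dir of the 200 folders
for folder in sorted(p for p in root.iterdir() if p.is_dir()):
    # one dataset per folder, all grouped under the same dataset project
    ds = Dataset.create(dataset_name=folder.name, dataset_project="my_datasets")
    ds.add_files(path=folder)
    ds.upload()
    ds.finalize()
```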
I gather there's a distinction between the two, with app.clear.ml being the public cloud-based SaaS version.
Or do you just want:
@misc{clearml,
  title  = {ClearML - Your entire MLOps stack in one open-source tool},
  year   = {2019},
  note   = {Software available from http://github.com/allegroai/clearml},
  url    = {https://clear.ml/},
  author = {ClearML},
}
AgitatedDove14 yes, I see the scalars. Screenshot attached.
Code to reproduce: I'll try to come up with a sample you will be able to run. But the code we're using is basically just https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/examples/pytorch/language-modeling/run_mlm.py
Good point! Any pointers to API docs to start looking?
Reproduce the training (how to run):
You need to pip install the requirements first. I think the following would do: `transformers datasets clearml tokenizers torch`.
`CLEAR_DATA` has `train.txt` and `validation.txt`; the .txt files just need to have text data on separate lines. For debugging, anything should do.
For training you need tokenizer files as well: `vocab.json`, `merges.txt`, and `tokenizer.json`.
You also need a `config.json`, then it should work.
```
export CLEAR_DATA="./data/dataset_for...
```
I suppose the flow would be something like:
1. select all experiments from project X with iterations greater than Y
2. pull the runtime for each one
3. add them all up

I just don't know what API calls to make for 1 and 2.
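Something like this sketch might cover steps 1 and 2, using `Task.get_tasks()` and filtering client-side; the project name and iteration threshold are placeholders, and `t.data.started`/`t.data.completed` is my assumption about where the timestamps live:
```
from clearml import Task

MIN_ITERATIONS = 1000  # placeholder for "y"

# step 1: all experiments from the project (iteration filter applied below)
tasks = Task.get_tasks(project_name="project x")

total_seconds = 0.0
for t in tasks:
    if (t.get_last_iteration() or 0) <= MIN_ITERATIONS:
        continue
    # step 2: runtime = completed - started (assumed datetime fields)
    started, completed = t.data.started, t.data.completed
    if started and completed:
        total_seconds += (completed - started).total_seconds()

print(f"total: {total_seconds / 3600:.1f} hours")
```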
Yup, not hoping to open the server to the world. As for "rerun it", I don't think I want to rerun the experiments, I want to show the results of the original training runs.
Is there any way to export the results from the internal server?
sounds good to me!
Sounds doable, I will give it a try.
The `task.execute_remotely` thing is quite interesting, I didn't know about that!
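For anyone who hasn't seen it, the pattern is roughly this; the project and queue names are placeholders:
```
from clearml import Task

task = Task.init(project_name="project x", task_name="remote run")
# everything above runs locally; this call enqueues the script on the
# given queue and, by default, exits the local process
task.execute_remotely(queue_name="default")

# code below here only runs on the agent that picks up the task
```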
Here's the hours/days version, corrected now lol:
```
gpu_hours = {}
gpu_days = {}
for gpu_type, gpu_time_seconds in gpu_seconds.items():
    gpu_time_hours = gpu_time_seconds / 3600
    gpu_hours[gpu_type] = gpu_time_hours
    gpu_days[gpu_type] = gpu_time_hours / 24
```
This sort of behavior is what I was thinking about when I saw "wildcard or pathlib Path" listed as options
Sure, I don't seem to be having any trouble with 1.0.3rc1. As for 1.0.2, like I said, the original issue seems to have mysteriously gone away, like some sort of heisenbug that goes away when I mess with the notebook.
With a completely fresh notebook, I added the cells to install clearml 1.0.2 and initiate a Task, ran the notebook again, and... the issue seems to have disappeared again.
Not sure how to even replicate the original issue anymore, sorry I couldn't be of more help!
I see a "publish" button on here, but would that make it visible on the wider internet?
OK, so with the RC, the issue has gone away. I can now import torch without issue.
I might not be able to get to that but if you create an issue I'd be happy to link or post what I came up with, wdyt?
Well, I can just work around it now that I know, by creating a folder with no subfolders and uploading that. But... 🤔 perhaps allow the interface to take in a list or generator? As in:
```
files_to_upload = [f for f in output_dir.glob("*") if f.is_file()]
Task.current_task().upload_artifact(
    "best_checkpoint", artifact_object=files_to_upload)
```
And then it could zip up the list and name it "best_checkpoint"?
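In the meantime, a workaround sketch that zips the file list manually and uploads the archive (paths are placeholders; `upload_artifact` accepts a path to an existing file):
```
import zipfile
from pathlib import Path
from clearml import Task

output_dir = Path("./output")  # placeholder
zip_path = output_dir / "best_checkpoint.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    for f in output_dir.glob("*"):
        if f.is_file() and f != zip_path:
            zf.write(f, arcname=f.name)

Task.current_task().upload_artifact("best_checkpoint", artifact_object=zip_path)
```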
Before I enqueued the job, I manually edited Installed Packages thus: `boto3 datasets clearml tokenizers torch`, and added `pip install git+` to the setup script.
And the docker image is `nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04`.
I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500
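Incidentally, the same settings can be applied from code instead of editing the UI each run; a sketch assuming `Task.add_requirements()` and `set_base_docker()` are available in your clearml version:
```
from clearml import Task

# add_requirements must be called before Task.init() to take effect
for pkg in ("boto3", "datasets", "clearml", "tokenizers", "torch"):
    Task.add_requirements(pkg)

task = Task.init(project_name="project x", task_name="run_mlm")
task.set_base_docker("nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04")
```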
This should work. It has the tokenizer files, the `train.txt`, the `validation.txt`, and a `config.json`.
Ah, makes sense! Have you considered adding a "this is the old website! Click here to get to the new one!" banner, kinda like on the docs for Python 2 functions? https://docs.python.org/2.7/library/string.html