SuccessfulKoala55 I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is via http://app.pro.clear.ml , I don't think we have ClearML running locally. We were tracking experiments before we set up the queue and the workers and all that.
IrritableOwl63 can you confirm - we didn't set up our own server to, like, handle experiment tracking and such?
This discussion might be relevant; it shows how to query a Task for metrics in code: https://clearml.slack.com/archives/CTK20V944/p1626992991375500?thread_ts=1626981377.374400&cid=CTK20V944
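For reference, the kind of thing that thread describes looks roughly like this (just a sketch; the task ID placeholder is mine, and I'm going from the current SDK docs rather than that exact thread):

```python
from clearml import Task

task = Task.get_task(task_id='<task id>')  # placeholder: any finished experiment

# Last/min/max values of every reported scalar, keyed by title and series
print(task.get_last_scalar_metrics())

# Full scalar history (all iterations), useful for custom plots or stats
scalars = task.get_reported_scalars()
print(list(scalars.keys()))
```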
{'input': ['Input Model #0'], 'output': [<clearml.model.Model object at 0x7f6d7d6a2750>,
...omitted some here
<clearml.model.Model object at 0x7f6d7d4b1350>]}
Input Model #0
AttributeError Traceback (most recent call last)
<ipython-input-83-65009a52f91b> in <module>()
22
23
---> 24 pretraining_task_id = input_model.task
25 print(f"pretraining_task_id {pretraini...
It's not a big deal because it happens after I'm done with everything; I can just reset the Colab runtime and start over
As an alternate solution, if I could group runs and get stats across the group, that would be cool
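Something like this is what I have in mind (just a sketch; the project name, task-name pattern, and the 'eval'/'loss' metric keys are all made-up placeholders):

```python
from statistics import mean
from clearml import Task

# Grab a "group" of runs by project and a name pattern (task_name acts as a regex)
runs = Task.get_tasks(project_name='my_project', task_name='finetune_')

values = []
for t in runs:
    metrics = t.get_last_scalar_metrics()
    try:
        values.append(metrics['eval']['loss']['last'])  # placeholder metric keys
    except KeyError:
        pass

if values:
    print(f"{len(values)} runs, mean={mean(values):.4f}, min={min(values):.4f}, max={max(values):.4f}")
```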
AgitatedDove14 I should have probably expanded my last message a bit more to say "Right, natanM, right now it's on http://app.pro.clear.ml , not http://app.clear.ml , can you advise, given that it is on .pro?"
So for example:
{'output_dir': 'shiba_ner_trainer', 'overwrite_output_dir': False, 'do_train': True, 'do_eval': True, 'do_predict': True, 'evaluation_strategy': 'epoch', 'prediction_loss_only': False, 'per_device_train_batch_size': 16, 'per_device_eval_batch_size': 16, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 1, 'eval_accumulation_steps': None, 'learning_rate': 0.0004, 'weight_decay': 0.0, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam...
Or at least not conveniently
Anyhow, it seems that moving it to main() didn't help. Any ideas?
Ah... so there actually is a way to share it then, so long as people are signed up? How would one do this? Do I just share a link to the experiment, like https://app.pro.clear.ml/projects/b4a1875539cb4d9798529439801402ee/experiments/6f4cb4718c7c4a25b3a041c63f6ff2b4/output/execution?columns=selected&columns=type&columns=last_iteration&columns=hyperparams.Args.num_train_epochs&columns=name&columns=status&columns=users&columns=started&columns=last_update&columns=tags&columns=parent.name&colum...
I see a "publish" button on here, but would that make it visible on the wider internet?
OK, so with the RC, the issue has gone away. I can now import torch without issue.
I suppose I could upload 200 different "datasets", rather than one dataset with 200 folders in it, but then clearml-data search would have 200 entries in it? It seemed like a good idea to put them all in one at the time
As in, I edit Installed Packages, delete everything there, and put that particular list of packages.
AgitatedDove14 I'm making some progress on this. I've currently got the situation that my training run saved all of these files, and Task.get_task(param['TaskA']).models['output'][-1] gets me just one of them, training_args.bin . Then -2 gets me another, rng_state.pth
If I just get Task.get_task(param['TaskA']).models['output'] , I end up getting a huge list of, like, [<clearml.model.Model object at 0x7fec2841c880>, <clearml.model.Model object at 0x7fec2841...
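To make sense of that list I've been printing names and URLs, roughly like this (a sketch; param['TaskA'] is the same task ID as above, and I'm assuming each Model's name/url reflects the checkpoint file it came from):

```python
from clearml import Task

task = Task.get_task(task_id=param['TaskA'])  # same task ID as in the snippets above
output_models = task.models['output']

for i, m in enumerate(output_models):
    # enough to tell training_args.bin, rng_state.pth, etc. apart
    print(i, m.name, m.url)

# pick a specific file by name instead of relying on [-1] / [-2]
matches = [m for m in output_models if 'training_args' in (m.name or '')]
```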
Alas, no luck. Uploaded the same things, did not upload trainer_state.json
Yup, not hoping to open the server to the world. As for "rerun it", I don't think I want to rerun the experiments, I want to show the results of the original training runs.
Is there any way to export the results from the internal server?
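In case it helps, the closest I've found in the SDK so far looks like this (a sketch; I'm not certain it captures everything the web UI shows, and the task ID is a placeholder):

```python
import json
from clearml import Task

task = Task.get_task(task_id='<task id on our server>')  # placeholder

exported = {
    'task': task.export_task(),              # full task definition as a plain dict
    'scalars': task.get_reported_scalars(),  # all reported scalar curves
}

with open('exported_task.json', 'w') as f:
    json.dump(exported, f)
```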
Yeah, we don't even get to line 480; all the training loop is within line 469, I think.
Ah, makes sense! Have you considered adding a "this is the old website! Click here to get to the new one!" banner, kinda like on docs for python2 functions? https://docs.python.org/2.7/library/string.html
Martin I found a different solution (hardcoding the parent tasks by hand), but I'm curious to hear what you discover!
So for example, I'm able to view in the UI that my finetuning task 7725f5bed94848039c68f2a3a573ded6 has an input model, and I can find the creating experiment for that. But how would I do this in code?
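For context, what I had been attempting looks something like this (a sketch; it assumes a recent clearml SDK where Model exposes the creating task ID via a .task property, which may be exactly where my AttributeError above came from on an older version):

```python
from clearml import Task

# The finetuning task I can see in the UI
finetune_task = Task.get_task(task_id='7725f5bed94848039c68f2a3a573ded6')

# Its input models (the same dict printed further up)
input_model = finetune_task.models['input'][0]
print(input_model.name, input_model.id)

# In recent SDK versions this should be the ID of the task that created the model
pretraining_task_id = input_model.task
pretraining_task = Task.get_task(task_id=pretraining_task_id)
print(pretraining_task.name)
```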
Well they do all have different names
I'm not sure I follow. Can you elaborate on what you mean? Pseudo stack?
OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469
trainer.train() just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, ClearML uploads all the other files. How can I hook into... whatever ...
So in theory we could hook into one of those functions and add a line to have ClearML upload that particular json we want
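For instance, with HuggingFace's Trainer it might look like a small TrainerCallback (a sketch; I haven't verified this against run_mlm.py, and the artifact name is made up):

```python
import os
from clearml import Task
from transformers import TrainerCallback

class UploadTrainerStateCallback(TrainerCallback):
    """Upload trainer_state.json to ClearML every time the Trainer saves a checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        # The Trainer writes checkpoints to <output_dir>/checkpoint-<global_step>/
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        state_file = os.path.join(ckpt_dir, "trainer_state.json")
        task = Task.current_task()
        if task and os.path.isfile(state_file):
            task.upload_artifact(
                name=f"trainer_state_step_{state.global_step}",  # made-up artifact name
                artifact_object=state_file,
            )
        return control

# then, somewhere in run_mlm.py after the Trainer is built:
# trainer.add_callback(UploadTrainerStateCallback())
```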