Hi NonsensicalSparrow35
however for the remote file it always creates the name with the following pattern:
{filename_prefix}checkpoint{n}.pt
..
Is this the main issue?
Notice that the model name (i.e. the entry on the Task itself) is not directly connected with the stored file name on the target file server (or S3)
Oh this is Only in the SaaS server ...
(I'm sorry I was not clear on that)
When you set up the pod, make sure you mount the clearml local cache folder to the PV
basically /root/.clearml/cache/
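For example, a pod spec fragment along these lines (the volume and claim names are placeholders, sketch only):

```yaml
# Sketch: mount a PersistentVolumeClaim over the clearml cache folder
volumes:
  - name: clearml-cache
    persistentVolumeClaim:
      claimName: clearml-cache-pvc   # placeholder claim name
containers:
  - name: clearml-agent
    volumeMounts:
      - name: clearml-cache
        mountPath: /root/.clearml/cache/
```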
That's a very neat solution! Maybe there's a way to inject Task.init into the code through a plugin, or worst case push it into some internal base package, and only call it when the code is orchestrated automatically (usually there is an environment variable that is set to signal that, like CI_something)
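That env-var gate could be sketched like this (CI_PIPELINE_ID, the project and task names are all placeholders, not ClearML or CI specifics):

```python
import os

def maybe_init_task(project_name, task_name):
    """Create a ClearML Task only when running under automated orchestration.

    Assumes the orchestrator sets an environment variable such as
    CI_PIPELINE_ID; the variable name is hypothetical, adapt it to your CI.
    """
    if not os.environ.get("CI_PIPELINE_ID"):
        # Interactive / local run: skip experiment tracking entirely
        return None
    # Import lazily so the base package works even without clearml installed
    from clearml import Task
    return Task.init(project_name=project_name, task_name=task_name)
```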
Task.current_task().connect(training_args, name='huggingface args')
And you should be able to change them when launching remotely 😉
SmallDeer34 btw: "set_parameters_as_dict" will replace all the arguments (and is one-way) ...
This seems to be okay to me, are you seeing the dataset in the web UI?
Also:
my_local_dataset_folder = Dataset.get(dataset_project=project, dataset_name=name).get_mutable_local_copy()
what exactly are you seeing in " my_local_dataset_folder " directory?
(it should contain the copy of the S3 file)
but realized calling that from the extension would be hard, so we opted to have the TypeScript code make calls to the ClearML API server directly, e.g.
POST /tasks.get_all_ex
.
did you manage to get that working?
- To get the credentials, we read the
~/clearml.conf file. I tried hard, but couldn't get a TypeScript library to work to parse the HOCON config file format... so I eventually resorted to using (likely brittle) regex to grab the ClearML endpoint and API ke...
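For illustration, that brittle-regex approach can be sketched like this (the sample config snippet is simplified; the real ~/clearml.conf may differ, and a proper HOCON parser would be far more robust):

```python
import re

# Simplified stand-in for the contents of ~/clearml.conf
SAMPLE_CONF = '''
api {
    api_server: https://api.clear.ml
    credentials {
        "access_key" = "ABC123"
        "secret_key" = "DEF456"
    }
}
'''

def extract_api_settings(conf_text):
    """Regex extraction of endpoint + credentials, as described above.

    HOCON allows both ':' and '=' separators and optional quoting,
    which is why this is likely brittle.
    """
    def grab(key):
        m = re.search(r'"?%s"?\s*[:=]\s*"?([^\s"]+)"?' % key, conf_text)
        return m.group(1) if m else None

    return {
        "api_server": grab("api_server"),
        "access_key": grab("access_key"),
        "secret_key": grab("secret_key"),
    }
```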
query tasks that are both Running --> You mean status=["in_progress"]
Yes!
How do I figure out the other possible values I can use with the status parameter?
https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
https://clear.ml/docs/latest/docs/references/api/definitions#taskstask
Filter only tasks that started, say, 10 min ago. Is there a parameter for that as well?
last_update or created then use...
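A sketch of what that filter could look like (the ">=" date-operator syntax here is an assumption on my side; verify it against the tasks.get_all reference above):

```python
from datetime import datetime, timedelta, timezone

# ISO timestamp for "10 minutes ago", in UTC
ten_min_ago = (datetime.now(timezone.utc) - timedelta(minutes=10)).strftime(
    "%Y-%m-%dT%H:%M:%S"
)

# Hypothetical filter combining status and a date-field comparison
task_filter = {
    "status": ["in_progress"],
    "last_update": [">={}".format(ten_min_ago)],
}

# Then, with a configured clearml client (not run here):
# from clearml import Task
# tasks = Task.get_tasks(task_filter=task_filter)
```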
EnviousStarfish54
oh, this is a bit different from my expectation. I thought I could use artifacts for dataset or model version control.
You totally can use artifacts as a way to version data (actually we will have it built in in the next versions)
Getting an artifact programmatically:
Task.get_task(task_id='aabb').artifacts['artifactname'].get()
Models are logged automatically. No need to log manually
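Putting the two directions together, a minimal sketch (artifact and project names are illustrative; assumes a running ClearML setup):

```python
def save_data_version(dataframe):
    """Sketch: attach a dataframe to the current task as a named artifact.

    Assumes Task.init() was already called; the artifact name is illustrative.
    """
    from clearml import Task  # lazy import keeps the sketch self-contained
    Task.current_task().upload_artifact(
        name="training_data", artifact_object=dataframe
    )

def load_data_version(task_id):
    # Retrieve the stored artifact from a previous task by its ID
    from clearml import Task
    return Task.get_task(task_id=task_id).artifacts["training_data"].get()
```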
Hi QuaintJellyfish58
This is odd, this "undefined" project is also marked as "Example" which would explain why you cannot delete it, but not how you ended up with one
Any idea on what changed on your server ?
It looks like the tag being used is hardcoded to 1.24-18. Was this issue identified and fixed in later versions?
BoredHedgehog47 what do you mean by "hardcoded 1.24-18"? A tag of what? I think I lost context here
AbruptWorm50 my apologies, I think I misled you: yes, you can pass generic arguments to the optimizer class, but specifically for Optuna this is disabled (not sure why)
Specifically to your case, the way it works is:
your code logs to tensorboard, clearml catches the data and moves it to the Task (on clearml-server), optuna optimization is running on another machine, trial values are manually updated (i.e. the clearml optimization pulls the Task reported metric from the server and updat...
Hey WickedGoat98
I found the bug: the numpy data (passed to plotly) contains both datetime and NaN values, and plotly.js does not like that. I'll make sure this is fixed; in the meantime you can just remove the first row (it contains the NaN):
df = pd.concat([tickerDf.Close, tickerDf_Change.Close_pcent], axis=1)
df = df[1:]
If possible, can we have a "only one experiment can be given a single tag"
You mean "moving a tag" automatically (i.e. if someone else had the same tag it is removed from it)?
Hi TeenyFly97
Can I super-impose the graphs while comparing experiments?
Hmm not at the moment, I think someone asked for the option to control it, in both comparison mode and "standalone" mode.
There is a long discussion on this feature here:
https://github.com/allegroai/trains/issues/81#issuecomment-645425450
Feel free to chime in 🙂
I think that the latest agreement is a switch in the UI, separating or collecting (super-imposing) those graphs.
Was trying to figure out how the method knows that the docker image ID belongs to ECR. Do you have any insight into that?
Basically you should have the docker service login before running the agent, then the agent uses docker to run the image from the ECR.
Make sense ?
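For example, the login step could look something like this before starting the agent (region and account ID are placeholders, assuming AWS CLI v2):

```shell
# Log the docker service in to ECR first (placeholders: region + account id)
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Then start the agent in docker mode; it can now pull the ECR image
clearml-agent daemon --queue default --docker
```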
Hi RipeGoose2
You can also report_table them? what do you think?
https://github.com/allegroai/clearml/blob/master/examples/reporting/pandas_reporting.py
https://github.com/allegroai/clearml/blob/9ff52a8699266fec1cca486b239efa5ff1f681bc/clearml/logger.py#L277
Can you do the following:
Clone the Task you previously sent me the installed packages of, then enqueue the cloned Task to the queue that the agent with conda is listening to.
Then send me the full log of the task that the agent ran.
GiganticTurtle0
I'm assuming here that self.dask_client.map(read_and_process_file, filepaths) actually does the multi process/node processing. The way it needs to work, it has to store the current state of the process and then restore it on any remote node/process. In practice this means pickling the local variables (Task included).
First I would try to use a standalone static function for the map, DASK might be able to deduce it does not need to pickle anything, as it is standalone.
A...
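The standalone-function idea can be sketched like this (the file handling is just an illustration, not the actual processing code):

```python
def read_and_process_file(path):
    """Module-level function: it captures no object state, so DASK only
    has to pickle the path argument, not the surrounding class
    (and the Task it holds).
    """
    with open(path) as f:
        return len(f.read())

# Inside the class, the call itself stays the same, e.g.:
# futures = self.dask_client.map(read_and_process_file, filepaths)
```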
I am struggling with configuring ssh authentication in docker mode
GentleSwallow91 Basically the agent will automatically mount the .ssh into the container, just make sure you set the following in the clearml.conf:
force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L30
Hi SmallDeer34
Is the Dataset in clearml-data ? If it is then Dataset.get().get_local_copy() will get you a cached local copy of the entire dataset.
If it is not, then you can use StorageManager.get_local_copy(url_here) to download the dataset.
- Any Argparser is automatically logged (and later can be overridden from the UI). Specifically HfArgumentParser will be automatically logged https://github.com/huggingface/transformers/blob/e43e11260ff3c0a1b3cb0f4f39782d71a51c0191/examples/pytorc...
Hi SmallDeer34
Did you call Task.init ?
TenseOstrich47 every agent instance has its own venv copy. Obviously every new experiment will remove the old venv and create a new one. Make sense?
Damn, JitteryCoyote63 seems like a bug in the backend, it will not allow you to change the task type to the new types 😞
the time taken to upload halved. It is puzzling because as you say it's not that much to upload.
Maybe it was the load on the server? meaning dealing with multiple requests at the same time delayed the requests?!
For now I've whittled down the number of entries to a more select but useful few, and that has solved the issue. If it crops up again I will try connect_configuration properly.
Thanks for your help!
My pleasure 🙂
EnviousStarfish54 we just fixed an issue that relates to "installed packages" on windows.
RC is due to be release in the upcoming days, I'll keep you posted
I guess we should have obfuscated the name better 😄
What do you mean? every Model has a unique ID, what do you consider a version?