Reputation
Badges 1
25 × Eureka!but I'd prefer to have a new instance deployed for each new experiment and that it also terminates when no new experiments are queued
I'm not objecting, just wondered on the rational behind the decision π
Back to the AWS autoscaler:
Basically if you have the services-agent running on your cluster, it will just run the aws-autoscaler for you π
The idea of the service-agent is to run logic/monitoring Tasks suck as the aws autoscaler. Notice that service-mode means multiple job per...
GreasyPenguin14 whats the clearml version you are using, OS & Python ?
Notice this happens on the "connect_configuration" that seems to be called after the Task was closed, could that be the case ?
Make sense π
Just make sure you configure the git user/pass in the docker-compose so the agent has your credentials for the repo clone.
By your description it seems to make no difference whether I added the files via sync or add, since I will have to create a new dataset either way.
Sync is design to take a local folder/s and add/remove files from a dataset based on the local changes (it does that automatically based on file existence / content)
The changes (i.e. added files) are uploaded as delta changes relative to the parent version, this means we are not always uploading all files.
Add on the other hand means you...
PompousBeetle71 BTW: if you remove the type=str from the argparse, it will do what you want, None will stay None (instead of ''), all other values will be of type str as this is always the default π
Hmm CourageousLizard33 seems you stumbled on a weird bug,
This piece of code only tries to get the username of the current UID, but since you are running inside a docker and probably set the environment UID but there is no "actual" UID by that number on /etc/passwd , and so it cannot resolve it.
I'm attaching a quick fix, please let me know if it solved the problem.
I'd like to make sure we have it in the next RC as soon as possible.
task = Task.get_task('task_id_here') task.mark_started(force=True) task.upload_artifact(..., wait_on_upload=True) task.mark_completed()
Hi SarcasticSparrow10 , so yes it does, this is more efficient when using pytorch loaders, and in some other situations.
To disable it add to your clearml.conf:sdk.development.report_use_subprocess = false2. interesting error, maybe we can revert to "thread mode" if running under a daemon. (I have to admit, I'm not sure why python has this limitation, let me check it...)
Why do you ask? is your server sluggish ?
how did you install trains?pip install git+
If I access the dataset on the same location directly it works fine:
wait, I'm confused, how is it the datset us there? did it download the dataset?
are you saying this line for example will fail? (assuming you actually have a dataset by that name)
data_path = Dataset.get(dataset_name="002_Datenset_MASAM_for_fintuning", alias="002_Datenset_MASAM_for_fintuning").get_local_copy()
this topic is about the issue with reporting a configuration with a string inside a tuple that has backslash
So the encoding itself is done YAML style, and based on your example \b Has to be encoded to \b because this is string encoding, like \n will become "new line"
Make sense ?
. Are there any option to remove the example projects?
So sorry just realized I missed your message
Yes, but I'm not sure it will have an effect, see here
why the memory usage of the elastic search still persist on 32 gb after removing experiments?
did you restart the server after removing the experiments?
Hmm apparently it is not passed, but it could be.
Would the object itslef be enough to get the values? wouldn't it make sense to get them from outside somehow? (I'm assuming there is one set of args used at any certain moment?)
hmmm, somehow I have a bed feeling about it... Could you check the log, it should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?
for example train.py & eval.py under the same repo
Hi JumpyDragonfly13
Let's assume we have two machines, one we call remote, one we call laptop (at least for this discussion)
On the Remote machine we need to run: (notice we must have docker preinstalled on the remote machine, it can work without docker, let me know if this is the case for you)clearml-agent daemon --queue interactive --create-queue --docker
On the Laptop we runclearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04What clearml-session will do is crea...
Yes that should work, only thing is you need to call Task init on the master process (and make sure you call Task.current_task() on the subprocesses, if you want to automagic to kick in, that said, usually there is no need, they are supposed to report everything back to the main one anyhow
basically
` @call_parse
def main(
Β Β gpus:Param("The GPUs to use for distributed training", str)='all',
Β Β script:Param("Script to run", str, opt=False)='',
Β Β args:Param("Args to pass to script", nargs=...
Thanks! a few thoughts below π
- not true β you can specify the image you want for each stepMy apologies, looking at the release notes, it was added a while back and I have not noticed π
- re: role-base access control - see Outerbounds Platform that provides a layer of security and auth features required by enterprisesRole based access meaning limiting access in metaflow i.e. specific users/groups can only access specific projects etc. ...
The issue is uploading reporting fro http uploads (object storage will report upload). Basically the http upload is post with urllib that does not support upload callbacks for progress report. If you have an idea here, we will gladly add it (as you mentioned it can be quite annoying to have to open network manager to verify the upload is progressing)
Hi CloudyHamster42
how do i have the trains-agent install myΒ
requirements.txt
Β file from my repo when creating the environment?
BTW if you clear all "the installed packages", then trains-agent will user requirements.txt and update back all the packages in the UI
Metadata might be expensive, it's a RestAPI call, and we have found users putting hundreds of artifacts, with preview entries ...
So this is verry odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.* because by default it ignores the 4th+ parts (they are unstable and torch have tendency to remove them) . and for some reason pip will not match 2.1.0.* with for example "2.1.0.dev20230306+cu118"
but based on the docs it should work:
see here: None
As a workaround you can always edit and change to the final url for example: so ...
Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.
Hmm this is odd, is this a download issue? if this is reproducible maybe we should investigate further...
Hi @<1661542579272945664:profile|SaltySpider22>
question 1: are parallel writes to a dataset with the same version possible?
When you are saying parallel what do you mean? from multiple machines ?
Whats the recommended way to append the dataset in a future version?
Once a dataset was finalized the only way to add files is to add another version that inherits from the previous one (i.e. the finalized version becomes the parent of the new version)
If you are worried about multip...
will my datasets be stored on the same machine that hosts the clearml server?
By default yes, they will be stored to the files-server (but you can change it, this is an argument for both the CLI and the python interface)
Yes, I mean trains-agent. Actually I am using 0.15.2rc0. But, I am using local files, I mean I clone trains and trains-agent repos and install them. Their versions are 0.15.2rc0
I see, that's why we get the git ref, not package version.