Hello @<1523701070390366208:profile|CostlyOstrich36> , thank you for the response. Would you mind pointing me to some documentation, because I cannot find the system tag?
No, this is very useful, thank you
@<1523701070390366208:profile|CostlyOstrich36> , wondering if my clarification made sense and if what I am describing is possible?
When I run the example, the plots show up fine, but when I run my own code for a different project, the only plots that show up are GPU and CPU usage, not the validation loss or accuracy.
I would like to apply MLOps best practices to my project.
So in my DataModule class, I would load the ClearML data and prep it into train and test. In the LightningModule class, I would create my model, and finally use the Trainer class to train.
How do I best utilize ClearML in this scenario so that any coworker of mine is able to reproduce my work with the same pipeline?
Thank you so much, it worked :D
Sorry, I wasn't clear. I already have everything set up correctly in the clearml.conf to work with S3. I know this works because when I push datasets, the data is stored in S3. It's just that every time I run an ML experiment, ClearML uses my machine as storage, which is fine for experiment tracking and comparison. The only problem is that other members of my team want to use my models but they can't, since they can't pull them. How do I tell ClearML to use the S3 storage for an experiment inst...
So if my parent dataset is 1 TB and I add a single file to create a child dataset, will there now be 2 TB of data on the server? Is the parent dataset duplicated on the server?
I have to assume that I do not know the dataset ID
That is very useful. Thank you.
A use case would be the following: I have a 200 GB dataset and I want to pull 3 files that are 20 MB each.
Hello @<1523701070390366208:profile|CostlyOstrich36> , I am on a self-hosted server. This project has experiments, datasets and models.
Hello @<1523701070390366208:profile|CostlyOstrich36> , I am using a self-hosted server.
If I don't run it, I get the original error back: temporary failure in name resolution')) : /auth.login
So I have to restart Docker every day.
Thank you for your response, but I don't think that would solve the problem.
I'm imagining a case where all you know is the project name and you want to pull the 2nd version out of 10 and you don't know its ID.
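A minimal sketch of one way to approach the lookup described above, assuming `Dataset.list_datasets` returns one dictionary per dataset version with `name`, `id` and `created` keys (this matches the documented return format, but it may differ between ClearML versions). The helper names `nth_version_id` and `fetch_nth_version` are made up for illustration and are not part of ClearML.

```python
from typing import Dict, List


def nth_version_id(records: List[Dict], dataset_name: str, n: int) -> str:
    """Return the ID of the n-th oldest version (1-based) of `dataset_name`.

    `records` is assumed to be a list of dicts with "name", "id" and
    "created" keys, such as the list returned by Dataset.list_datasets.
    """
    versions = [r for r in records if r["name"] == dataset_name]
    versions.sort(key=lambda r: r["created"])  # oldest version first
    if not 1 <= n <= len(versions):
        raise IndexError(
            f"{dataset_name!r} has {len(versions)} version(s), asked for #{n}"
        )
    return versions[n - 1]["id"]


def fetch_nth_version(project: str, dataset_name: str, n: int):
    """Look up the n-th version by project and name only, then fetch it."""
    # Imported lazily so nth_version_id can be used without ClearML installed.
    from clearml import Dataset

    records = Dataset.list_datasets(dataset_project=project, only_completed=True)
    return Dataset.get(dataset_id=nth_version_id(records, dataset_name, n))
```

If the version string is known rather than its position, `Dataset.get(dataset_project=..., dataset_name=..., dataset_version=...)` may be a simpler route, assuming the installed ClearML release supports the `dataset_version` argument.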
I think this is great 😄 I do have another question.
I am using S3 as my remote. When creating datasets and uploading, everything is great, it is pushed to S3. How do I push a model to the S3 server? As is, after training, my models are saved locally. How do I push a trained model to the S3 server specified in my clearml.conf file?
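A hedged sketch of the usual approach to the question above: `Task.init` accepts an `output_uri` argument, and models saved during the run are then uploaded to that destination instead of staying on the local disk. The bucket path `s3://my-bucket/models`, the project and task names, and the helper names `is_remote_uri` and `start_tracked_run` are all placeholders for illustration. The sketch assumes the S3 credentials are already configured in clearml.conf.

```python
from urllib.parse import urlparse

# URI schemes ClearML treats as remote object storage.
REMOTE_SCHEMES = {"s3", "gs", "azure"}


def is_remote_uri(uri: str) -> bool:
    """True if `uri` looks like remote object storage with a bucket/host part."""
    parsed = urlparse(uri)
    return parsed.scheme in REMOTE_SCHEMES and bool(parsed.netloc)


def start_tracked_run(project: str, name: str, output_uri: str):
    """Start a ClearML task whose model artifacts are uploaded to `output_uri`."""
    if not is_remote_uri(output_uri):
        # Fail early rather than silently keeping models on the local machine.
        raise ValueError(f"expected a remote URI such as s3://bucket/path, got {output_uri!r}")
    # Imported lazily so the check above can run without ClearML installed.
    from clearml import Task

    return Task.init(project_name=project, task_name=name, output_uri=output_uri)


# Example usage (placeholder names):
# task = start_tracked_run("my-project", "train-v1", "s3://my-bucket/models")
```

Checking the URI before calling `Task.init` is only a convenience here, so that a mistyped destination is caught before training starts.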
ohh, that is really clever!! I did not think about that! Thank you very much 😄
So I got it to work by running sudo service docker restart. The only thing is that I have to run this every morning. Not sure why this works but it does.
Does this apply if I'm using an external S3 storage? Because the stored data appears as a large zip file in S3.
Would that have been the reason for the deletion of a project?
In other words, how do you combine a PyTorch Lightning module with a ClearML task?
And to answer your question, I do not know how it became hidden. It just happened one morning.
Is it possible that it's caused by the following: "ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start"
@<1523701070390366208:profile|CostlyOstrich36> I realized that even though the project does not appear, the experiments do appear, but under "all experiments", as if they don't belong to a project.
I am using ClearML 1.10.3 and pytorch_lightning 1.9.5. It works fine for the experiment. I believe I am on a self-deployed server.
Hello @<1523701070390366208:profile|CostlyOstrich36> , thank you for your recommendation, but I double-checked everything and I am still running into the same issue. The credentials are correct and not revoked.
And the weird part is that I got it working fine last week. I ran docker compose down on the containers and aborted all the tasks, and now it won't work any more and I keep getting those errors.
Is this still true if the child dataset is smaller than the parent? If the parent dataset is 1 TB and I delete half the files, will I still be pushing 2 TB of data to the server?