ShinyWhale52 any time 🙂
Feel free to follow up with more questions
Ok, this makes more sense. Thank you very much. I'll take a closer look at your code when I have a better picture of ClearML.
Hi ShinyWhale52
This is just a suggestion, but this is what I would do:
1. Use `clearml-data` and create a dataset from the local CSV file: `clearml-data create ...` followed by `clearml-data sync --folder <folder where the csv file is>`
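If you'd rather stay in Python for that first step, a rough SDK equivalent of those two CLI calls would be something like this (the dataset/project names and folder path are placeholders, not anything from the thread):
```python
from clearml import Dataset

# create a new dataset entry and attach the folder holding the local csv
raw = Dataset.create(dataset_name='raw_csv', dataset_project='my_project')
raw.add_files(path='/path/to/csv_folder')

# upload the files and close the dataset so it can be consumed downstream
raw.upload()
raw.finalize()
```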
2. Write Python code that takes the csv file from the dataset and creates a new dataset of the preprocessed data:
```python
from clearml import Dataset

# get a local copy of the original csv dataset
original_csv_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()

# process the csv file -> generate a new csv (new_created_file)

# register the preprocessed csv as a new dataset
preprocessed = Dataset.create(...)
preprocessed.add_files(new_created_file)
preprocessed.upload()
preprocessed.close()
```
3. Train the model (i.e. get the dataset prepared in (2)) and add `output_uri` to upload the model (say, to your S3 bucket or to the clearml-server):
```python
# get the preprocessed dataset created in step (2)
preprocessed_csv_folder = Dataset.get(dataset_id='preprocessed_dataset_id').get_local_copy()
# Train here
```
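For step 3, a minimal training sketch could look like the following. The project/task names, S3 bucket, csv file name, `label` column, and the scikit-learn/joblib choice are all assumptions for illustration, not anything prescribed above:
```python
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

from clearml import Dataset, Task

# output_uri tells ClearML where to upload model snapshots
# (project/task names and the S3 bucket are placeholders)
task = Task.init(project_name='my_project', task_name='train model',
                 output_uri='s3://my-bucket/models')

# fetch the preprocessed dataset produced in step (2)
preprocessed_csv_folder = Dataset.get(dataset_id='preprocessed_dataset_id').get_local_copy()
df = pd.read_csv(os.path.join(preprocessed_csv_folder, 'preprocessed.csv'))  # hypothetical file name

# train a simple model (hypothetical 'label' target column)
model = LogisticRegression().fit(df.drop(columns=['label']), df['label'])

# with automatic framework logging (the default), ClearML should capture this save,
# upload the weights file to output_uri, and register it in the model repository
joblib.dump(model, 'model.pkl')
```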
4. Use the clearml model repository (see the Models tab in the project's experiment table) to get / download the trained model
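For that last step, a small sketch of pulling a trained model back out of the model repository; the model id is a placeholder you would copy from the Models tab or from the training experiment:
```python
from clearml import InputModel

# model_id is a placeholder - copy the real id from the Models tab in the UI
model = InputModel(model_id='trained_model_id')

# downloads the weights file from wherever output_uri pointed (e.g. S3) and returns a local path
local_weights = model.get_local_copy()
print(local_weights)
```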
wdyt?
Ok, thanks a lot. This is not exactly what I expected, so I don't fully understand. For example, let's say you have a basic project in which the workflow is:
1. You read a csv stored in your filesystem.
2. You transform this csv, adding some new features, scaling, and things like that.
3. You train a model (usually doing several experiments with different hyperparameters).
4. You deploy the model and it's ready to make predictions.
How would you structure this workflow in Tasks in ClearML?
To organize work, we designate a special task type for datasets (so it's easy to search and browse through them), as well as tags that give you finer-grained search capabilities.
In ClearML open source, a dataset is represented by a task (or experiment, in UI terms). You can add datasets to projects to indicate that a dataset is related to the project, but it's purely a logical entity, i.e. you can have a dataset (or datasets) per project, or one project holding all your datasets.
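For example (the project, dataset name, and tag below are made up), since a dataset is just a task you can fetch it by project/name and tag it like any other experiment:
```python
from clearml import Dataset, Task

# look up a dataset the same way you would browse it in the UI
ds = Dataset.get(dataset_project='my_project', dataset_name='customers_raw')
print(ds.id)

# under the hood it is a task, so regular task operations (e.g. tagging) apply
task = Task.get_task(task_id=ds.id)
task.add_tags(['preprocessed'])
```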