This is just a suggestion, but this is what I would do:
1. Use `clearml-data` to create a dataset from the local CSV file:
`clearml-data create ...` then `clearml-data sync --folder <folder where the csv file is>`
2. Write a Python script that takes the csv file from the dataset and creates a new dataset of the preprocessed data:
```
from clearml import Dataset

# Get a local copy of the original CSV dataset
original_csv_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()
# process the csv file -> generate a new csv
preprocessed = Dataset.create(...)
preprocessed.close()
```
3. Train the model (i.e. get the dataset prepared in step 2), and add `output_uri` to upload the model (say, to your S3 bucket or to the clearml-server):
```
preprocessed_csv_folder = Dataset.get(dataset_id='preprocessed_dataset_id').get_local_copy()
# Train here
```
4. Use the ClearML model repository (see the Models tab in the project's experiment table) to get / download the trained model.
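For the "process csv file -> generate a new csv" part of step 2, a minimal self-contained sketch could look like the following. The column names (`price`, `units`) and the min-max scaling are hypothetical stand-ins for your own feature engineering, and the clearml `Dataset.get` / `Dataset.create` calls from the snippet above are left out so this runs on its own:

```python
import csv

def preprocess_csv(src_path, dst_path):
    """Read a raw csv, add derived features, and write a new csv.
    In step 2 the output folder would then be wrapped in a new
    clearml Dataset via Dataset.create(...)."""
    with open(src_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Hypothetical feature engineering: min-max scale 'price' to [0, 1]
    prices = [float(r["price"]) for r in rows]
    lo, hi = min(prices), max(prices)
    for r, p in zip(rows, prices):
        r["price_scaled"] = (p - lo) / (hi - lo) if hi > lo else 0.0
        # Hypothetical derived feature: price per unit
        r["price_per_unit"] = p / float(r["units"])

    with open(dst_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```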
Ok, thanks a lot. This is not exactly what I expected, so I don't fully understand. For example, let's say you have a basic project with the following workflow:
You read a csv stored on your filesystem. You transform this csv, adding some new features, scaling, and things like that. You train a model (usually running several experiments with different hyperparameters). You deploy the model and it is ready to make predictions. How would you structure this workflow in Tasks in ClearML?
In ClearML open source, a dataset is represented by a task (or an experiment, in UI terms). You can add datasets to projects to indicate that a dataset is related to the project, but this is purely a logical grouping, i.e. you can have a dataset (or datasets) per project, or one project with all your datasets.