Split CSV into Train and Validation datasets (85%/15%)


Wed Dec 07 2022 08:28:42 GMT+0000 (Coordinated Universal Time)

Saved by @edubrigham #python #datasets #split #train #validation

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data from the CSV file
data = pd.read_csv("data.csv")

# Split the data into train (85%) and validation (15%) sets, stratified on the "labels" column so both sets keep the same label proportions
train_data, validation_data = train_test_split(data, train_size=0.85, test_size=0.15, random_state=42, stratify=data["labels"])

# Write the train and validation datasets to CSV files
train_data.to_csv("train.csv", index=False)
validation_data.to_csv("valid.csv", index=False)

I used the following ChatGPT prompt to generate this code snippet: To train an ML model on a multi-label classification task, I need to split a CSV file into train and validation datasets using a Python script. The ratio should be 85% of the data in the train dataset and 15% in the validation set. The split datasets should contain the same number of labels. Write the resulting datasets into 2 CSV files called train.csv and valid.csv.
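
To check that the stratified split keeps roughly the same label proportions in both files, something like the sketch below may help. It does not read data.csv; the toy DataFrame, its "text" column, and the class names "a", "b", "c" are made up for illustration. Only the "labels" column name and the split parameters come from the snippet above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for data.csv: 100 rows, 3 classes (60/30/10).
data = pd.DataFrame({
    "text": [f"sample {i}" for i in range(100)],
    "labels": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
})

# Same split call as in the snippet above.
train_data, validation_data = train_test_split(
    data, train_size=0.85, test_size=0.15, random_state=42, stratify=data["labels"]
)

# Share of each label in the full data and in each split, side by side.
proportions = pd.DataFrame({
    "all": data["labels"].value_counts(normalize=True),
    "train": train_data["labels"].value_counts(normalize=True),
    "valid": validation_data["labels"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

Note that stratify treats each distinct value in the "labels" column as one class, and train_test_split raises a ValueError if any class has only one row. If a row can carry several labels at once, each label combination counts as its own class, so rare combinations may need to be merged or handled separately before splitting.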