How do I create test and train samples from one DataFrame with pandas?

randomtip 2023. 1. 25. 08:44

I have a fairly large dataset in the form of a DataFrame, and I was wondering how I could split it into two random samples (80% and 20%) for training and testing.

Thanks!

Scikit-learn's train_test_split is a good one. It splits both NumPy arrays and DataFrames.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
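As a minimal sketch of the call above, assuming a small synthetic DataFrame (the column names "a" and "b" and the seed 42 are made up for illustration), passing random_state makes the split repeatable, and the two parts share no rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical columns "a" and "b").
df = pd.DataFrame({"a": np.arange(100), "b": np.arange(100) * 2})

# random_state fixes the shuffle so repeated runs give the same split.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(len(train), len(test))  # 80 20
# The index labels of the two parts do not overlap.
print(train.index.intersection(test.index).empty)  # True
```

Both results are still DataFrames and keep their original index labels.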

I would just use NumPy's random functions (randn for the sample data, rand for the mask):

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see that this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
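Because each row is assigned independently, the 80/20 ratio is only approximate, as the 79/21 result shows. A small sketch of the same idea with a seeded generator, assuming NumPy's default_rng (the seed 0 is arbitrary), so that the mask can be reproduced:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, so the mask is repeatable
df = pd.DataFrame(rng.standard_normal((100, 2)))

# Each row independently lands in train with probability 0.8.
msk = rng.random(len(df)) < 0.8
train, test = df[msk], df[~msk]

# Sizes are only roughly 80/20, but every row ends up in exactly one part.
print(len(train), len(test), len(train) + len(test) == len(df))
```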

pandas' random sample will also work:

train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)
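One thing to watch with this pattern: drop(train.index) removes rows by label, so it only leaves exactly the unsampled rows when the index labels are unique. A sketch under that assumption, using a made-up frame with repeated labels that are reset before splitting:

```python
import pandas as pd

# Hypothetical frame whose index labels repeat (for example after concatenating files).
df = pd.DataFrame({"x": range(10)}, index=[0, 1, 2, 3, 4] * 2)

# Make the labels unique first so that drop() removes only the sampled rows.
df = df.reset_index(drop=True)

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```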

Use scikit-learn's own train_test_split and generate the split from the index:

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # training rows as a DataFrame (X_train holds index labels, so use .loc)

There are many ways to create train/test and even validation samples.

Case 1: the classic way, train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

Case 2: very small datasets (fewer than 500 rows). Use cross-validation to get results for every row; at the end you will have one prediction for each row of your available training set.

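from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# shuffle=True is required for random_state to have any effect
kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    # .iloc assumes X and y are pandas objects; use X[train_index] for NumPy arrays
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)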
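If the goal is just one out-of-fold prediction per row, scikit-learn's cross_val_predict does the loop for you and returns the predictions in the original row order. A minimal sketch, assuming synthetic regression data (the shapes, coefficients, and seeds are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data: 60 rows, 3 features, target driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0)

# Each row is predicted by the model trained on the folds that exclude it.
y_hat = cross_val_predict(reg, X, y, cv=kf)
print(y_hat.shape)  # (60,)
```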

Case 3a: unbalanced datasets for classification. Following case 1, here is the equivalent solution:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
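To see what stratify=y buys you, here is a small sketch on a made-up imbalanced dataset (90 rows of class 0, 10 rows of class 1; the column names are hypothetical). The share of the rare class stays close to 10% in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 rows of class 0 and 10 rows of class 1.
df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})
X, y = df[["feature"]], df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Fraction of the rare class in each part (both should be near 0.1).
print(y_train.mean(), y_test.mean())
```

Without stratify, a small test set can easily end up with very few or no rows of the rare class.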

Case 3b: unbalanced datasets for classification. Following case 2, here is the equivalent solution:

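from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import StratifiedKFold

# shuffle=True is required for random_state to have any effect
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    # .iloc assumes X and y are pandas objects; use X[train_index] for NumPy arrays
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)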

Case 4: you need to create train/test/validation sets on big data to tune hyperparameters (60% train, 20% test, 20% validation).

from sklearn.model_selection import train_test_split
# hold out 40% of the rows, then split that portion in half into test and validation
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)
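A quick way to sanity-check a 60/20/20 split is to run it on toy data and print the sizes. In this sketch (synthetic arrays, arbitrary seed) the first call holds out 40% of the rows and the second call halves that held-out part, stratifying on the labels that belong to it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
# Step 2: split the held-out 40% in half; stratify on y_rest, not on the full y.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_test), len(X_val))  # 60 20 20
```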

No need to convert to NumPy. Just use a pandas DataFrame for the split and it will return pandas DataFrames.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y:

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

And if you want to split the whole df:

X, y = df[list_of_x_cols], df[y_col]

You can use the code below to create test and train samples:

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

The test size can vary depending on the percentage of data you want to put in your test and train datasets.

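There are many valid answers. Adding one more to the bunch, this one using only pandas (the old sklearn.cross_validation module that used to provide train_test_split has since been replaced by sklearn.model_selection):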

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]

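You may also consider a stratified split into training and testing sets. A stratified split also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.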

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.

You can use ~ (the tilde operator) to exclude the rows sampled with df.sample(), letting pandas alone handle the sampling and the filtering of indexes to obtain the two sets.

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

If you need to split your data with respect to the label column in your dataset, you can use this:

import pandas as pd

def split_to_train_test(df, label_column, train_frac=0.8):
    train_parts, test_parts = [], []
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(train_parts), pd.concat(test_parts)

and use it:

train, test = split_to_train_test(data, 'class', 0.7)

You can also pass random_state to sample() if you want to control the randomness of the split, or rely on a global random seed.
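For a shorter version of the same per-label idea, pandas (1.1 and later) can sample inside each group directly. This is a sketch on made-up data, assuming a unique index and a hypothetical "class" column, with random_state fixing the draw:

```python
import pandas as pd

# Hypothetical data: two classes with 10 rows each.
df = pd.DataFrame({"value": range(20), "class": ["a"] * 10 + ["b"] * 10})

# Sample 70% of the rows inside each class, with a fixed seed.
train = df.groupby("class").sample(frac=0.7, random_state=1)
# Everything that was not sampled becomes the test set (needs unique index labels).
test = df.drop(train.index)

print(train["class"].value_counts().to_dict())
print(test["class"].value_counts().to_dict())
```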

To split into more than two sets, such as train, test, and validation, you can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs >= 0.7) & (probs < 0.85)
validation_mask = probs >= 0.85


df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]

This will put approximately 70% of the data in training, 15% in test, and 15% in validation.
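If the sizes need to be exact rather than approximate, one alternative (a sketch, not the approach above) is to shuffle the rows once and cut by position. The toy frame and the seed are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100)})  # synthetic stand-in for the real data

# Shuffle every row once with a fixed seed, then cut by position.
shuffled = df.sample(frac=1, random_state=0)
train_end = round(0.70 * len(df))
test_end = round(0.85 * len(df))

df_training = shuffled.iloc[:train_end]
df_test = shuffled.iloc[train_end:test_end]
df_validation = shuffled.iloc[test_end:]

print(len(df_training), len(df_test), len(df_validation))  # 70 15 15
```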

shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]

Just select a range of rows from df like this:

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
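Slicing by position is only a random split if the rows are already in random order. If the frame is sorted (by date, by label, and so on), a hedged variant is to shuffle first; the toy data and seed below are assumptions:

```python
import pandas as pd

# Hypothetical frame sorted by label, so a plain slice would be biased.
df = pd.DataFrame({"x": range(10), "label": [0] * 5 + [1] * 5})

# Shuffle all rows, then take the first fifth as the test set.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
split_point = len(shuffled) // 5
test_data, train_data = shuffled[:split_point], shuffled[split_point:]

print(len(test_data), len(train_data))  # 2 8
```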

import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

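target_attribute = data['column_name']

# drop the target column from the features and keep 20% of the rows for testing
features = data.drop(columns=['column_name'])

X_train, X_test, y_train, y_test = train_test_split(features, target_attribute, test_size=0.2)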

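This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but I didn't like that I could not control the size of the datasets exactly (it would sometimes be 79, sometimes 81, and so on).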

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = sorted(set(tot_ix) - set(test_ix))

    # .ix was removed from pandas; these are positions, so use .iloc
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()

There are many great answers above, so I just want to add one more example for the case where you want to specify the exact number of samples for the train and test sets, using only the numpy library.

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]
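For comparison, train_test_split also accepts an integer for train_size (or test_size), which is read as an absolute number of rows. A sketch with synthetic data; the names data_df and labels_df mirror the snippet above, while the contents and column names are made up:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and labels with 1500 rows each.
data_df = pd.DataFrame(np.arange(3000).reshape(1500, 2), columns=["f1", "f2"])
labels_df = pd.Series(np.arange(1500) % 2, name="label")

# An integer train_size means "exactly this many rows", not a fraction.
train_data, test_data, train_labels, test_labels = train_test_split(
    data_df, labels_df, train_size=1000, random_state=17
)

print(len(train_data), len(test_data))  # 1000 500
```

Features and labels are split with the same shuffle, so their indexes stay aligned.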

If you wish to split into train, test, and validation sets, you can use this function:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_proportion = val_length / len(temp.index)
    train, val = train_test_split(temp, test_size=new_val_proportion)
    return train, test, val

If you want one DataFrame in and two DataFrames out (not NumPy arrays), this should do the trick:

def split_data(df, train_perc = 0.8):

   df['train'] = np.random.rand(len(df)) < train_perc

   train = df[df.train == 1]

   test = df[df.train == 0]

   split_data ={'train': train, 'test': test}

   return split_data

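You can convert the DataFrame to a NumPy array (df.as_matrix() in old pandas versions, df.to_numpy() in current ones) and pass that:

from sklearn.model_selection import train_test_split

# target_col is a placeholder for the name of your label column
Y = df.pop(target_col)
X = df.to_numpy()  # as_matrix() was removed in pandas 1.0
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model.fit(x_train, y_train)
model.predict(x_test)  # assuming model is a scikit-learn style estimator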

A bit more elegant, to my taste, is to create a random column and then split by it. This way you get a split that suits your needs and is still random.

def split_df(df, p=[0.8, 0.2]):
    import numpy as np
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    # iterate in the order of p so that r[i] corresponds to proportion p[i]
    r = [df[df["rand"] == val] for val in range(len(p))]
    return r

You need to convert the pandas DataFrame into a NumPy array and then convert the NumPy array back into a DataFrame.

import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)

In my case, I wanted to split a DataFrame into train, test, and dev sets with specific sizes. Here is my solution.

First, assign a unique id to each row of the DataFrame (if one does not already exist):

import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split numbers:

train = 120765
test  = 4134
dev   = 2816

The split function:

def df_split(df, n):
    
    first  = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace = True)
    second.reset_index(drop=True, inplace = True)
    return first, second

Now split into train, test, and dev:

train, test = df_split(df, 120765)
test, dev   = df_split(test, 4134)

I think you also need to take a copy, not a slice of the DataFrame, if you want to add columns later.

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)

How about this? df is my DataFrame:

import math

total_size = len(df)

train_size = math.floor(0.66 * total_size)  # roughly 2/3 of my dataset

#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)

K-fold cross-validation is said to give more reliable results than a single train_test_split. Here is the page on how to apply it with sklearn, from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
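As a rough sketch of what that looks like in practice, assuming synthetic data from make_regression and a plain LinearRegression as a stand-in model (both are illustrative choices, not part of the answer above), cross_val_score returns one score per fold:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem used only for illustration.
X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# One R^2 score per fold; every row is used for testing exactly once.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(scores.mean(), scores.std())
```

The spread of the scores gives a feel for how much a single train/test split could vary.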

Source URL: https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas
