Deep Learning 04: tf.data Pipeline


SO__OS 2022. 7. 14. 20:27

PlayData Big Data Camp study notes, 7/13

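1. The tf.data module

- A module for building data input pipelines

: a pipeline that produces the data fed into a model
=> feeds large datasets to a model for training and evaluation
=> handles preprocessing of the raw dataset, batching, shuffling, and so on in one place

- Provides several classes derived from the tf.data.Dataset abstract class
- Offers different creation methods depending on the form of the input source

- Each tf.data method returns a Dataset object that applies the corresponding processing.

=> The returned Dataset can be passed on to another tf.data method for further processing.

The data flows through these Datasets and is processed at each stage (hence "pipeline").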

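2. Using the Dataset API

2.1 Creating a Dataset

1) Specify the raw dataset

=> data loading

2) Build a Dataset from data in memory or in files, using:

- the from_tensor_slices() class method: creates a Dataset from in-memory lists, NumPy arrays, or TensorFlow tensors => the one used most often
- the from_generator() class method
- the tf.data.TFRecordDataset class, and so on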
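from_generator() is listed above but not demonstrated in these notes. The following is a minimal sketch of how it might be used, assuming TensorFlow 2.4 or later (for the output_signature argument). The generator count_to_five and the variable names are hypothetical, chosen only for illustration.

```python
import tensorflow as tf


def count_to_five():
    """Hypothetical generator: yields 0, 1, 2, 3, 4 one value at a time."""
    for i in range(5):
        yield i


# output_signature describes the shape and dtype of each yielded element.
gen_dataset = tf.data.Dataset.from_generator(
    count_to_five,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

# as_numpy_iterator() yields NumPy values instead of tf.Tensor objects.
gen_values = [int(v) for v in gen_dataset.as_numpy_iterator()]
print(gen_values)  # [0, 1, 2, 3, 4]
```

A generator is a reasonable choice when the data does not fit in memory as a single array, since elements are produced one at a time.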

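2.2 Preprocessing the provided data

- map(function) : transforms each element one by one
=> function: the transformation to apply; declare one parameter per component of an element (for example, two parameters for an (X, y) dataset)

- filter(function) : provides only the elements that satisfy a condition
=> function: defines the condition; declare one parameter per component of an element and return a bool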

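2.3 Settings for how data is provided

- batch(size) : sets the batch size provided at a time during training/evaluation
    - size: int. The batch size.
    - drop_remainder: bool. If True, the final batch is dropped when it has fewer elements than the batch size.

- shuffle(buffer size) : provides the elements of the dataset in shuffled order (by default it reshuffles on every pass over the dataset, i.e. every epoch)
    - buffer size (int): the size of the space in which elements are shuffled
      => if it is greater than or equal to the number of elements, the shuffle is complete;
         if it is smaller, only that many elements are held and mixed at a time, so the shuffle is only partial
    - The data is not shuffled in place; it is shuffled in a temporary memory area (the buffer), and the size decides how many elements are loaded into it
    - If the dataset is too large for the available memory, use a smaller buffer size
    - If memory is sufficient, set it equal to the number of elements

- repeat(count) : after providing the whole dataset once, provides it again
    - count: how many times to provide the dataset
    - For a shuffled Dataset, the data is reshuffled for each repetition (per epoch)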
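The effect of the shuffle buffer size described above can be checked with a small experiment. This is a sketch under the assumption of a 10-element dataset; the variable names (no_shuffle, partial, window_ok, full) are hypothetical.

```python
import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(np.arange(10))

# buffer_size=1: the buffer holds a single element, so the order cannot change.
no_shuffle = [int(v) for v in ds.shuffle(1).as_numpy_iterator()]

# buffer_size=3: the k-th output (0-based) is drawn from the first k+3 elements,
# so its value can never exceed k + 2. The shuffle is only partial.
partial = [int(v) for v in ds.shuffle(3).as_numpy_iterator()]
window_ok = all(v <= k + 2 for k, v in enumerate(partial))

# buffer_size=10 (the dataset size): any element can appear at any position.
full = [int(v) for v in ds.shuffle(10).as_numpy_iterator()]

print(no_shuffle)
print(partial, window_ok)
print(full)
```

With a small buffer, early elements can only move a few positions, which is why a buffer at least as large as the dataset is needed for a complete shuffle.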

3. Tensor : TensorFlow's basic data type

- Just as NumPy manages data with ndarray, TensorFlow manages data with Tensor.

- The datasets a TensorFlow model uses for training and evaluation (train, validation, and test datasets) are handled as tf.Tensor.

=> If an ndarray is passed as the input dataset to model.fit() or model.evaluate(), it is converted to tf.Tensor internally.

- The values held by a tf.Tensor can be exposed as an ndarray.

=> A tensor can be converted to NumPy, and a NumPy array or list can be converted to a tensor.

t = tf.constant([1,2,3], dtype="float32")
t

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 2., 3.], dtype=float32)>

- Converting a tensor to a NumPy array

# tensor => numpy conversion
a = t.numpy()
a

array([1., 2., 3.], dtype=float32)

- Two ways to convert a NumPy array / list to a tensor

# Method 1
t2 = tf.constant(a)

# Method 2
t3 = tf.convert_to_tensor(a)
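The conversions above can be combined into a round trip. The sketch below starts from a plain Python list (the notes only show an ndarray as input) and checks that the values survive the trip. The names values, t_from_list, back, t_again, and same are hypothetical.

```python
import numpy as np
import tensorflow as tf

values = [1, 2, 3]

# list -> tensor, with an explicit dtype
t_from_list = tf.convert_to_tensor(values, dtype=tf.float32)

# tensor -> ndarray
back = t_from_list.numpy()

# ndarray -> tensor again
t_again = tf.constant(back)

# Element-wise comparison reduced to a single Python bool
same = bool(tf.reduce_all(tf.equal(t_from_list, t_again)))
print(type(back).__name__, back.dtype, same)  # ndarray float32 True
```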

4. The tf.data pipeline, explained step by step

4.1 import

import tensorflow as tf
import numpy as np

4.2 Creating a Dataset

1) Example of loading a raw dataset

raw_data1 = np.arange(10)  # raw data
raw_data1  # an ndarray held in memory

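2) Create a Dataset that reads an in-memory ndarray (or Tensor)

=> Use the Dataset.from_tensor_slices(variable) class method.

Pass in an array object and it is turned into a Dataset.

- The returned object is a TensorSliceDataset (the exact class name can differ between TensorFlow versions).

- Creating a Dataset only defines the pipeline; no elements are produced or processed at that point, so nothing is computed before it is needed.
==> The elements are produced when the Dataset is actually used (model training, evaluation, or iteration).

==> This is called lazy execution.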

dataset = tf.data.Dataset.from_tensor_slices(raw_data1)
print(type(dataset))

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>

- A Dataset is iterable.

=> It can be used in a for-in loop, which yields the elements one at a time (or one batch at a time if batch() has been applied).

for data in dataset:
    print(data)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
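The lazy execution mentioned above can be observed directly by counting how often a data source is touched. This sketch uses from_generator() with a call counter rather than from_tensor_slices(), because a generator makes the moment of reading visible. The names calls, tracked_generator, and lazy_ds are hypothetical, and TensorFlow 2.4 or later is assumed for output_signature.

```python
import tensorflow as tf

calls = {"count": 0}


def tracked_generator():
    """Hypothetical generator that records each time its body starts running."""
    calls["count"] += 1
    for i in range(3):
        yield i


lazy_ds = tf.data.Dataset.from_generator(
    tracked_generator,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

# Nothing has been read yet: creating the Dataset only defines the pipeline.
calls_after_creation = calls["count"]

# Iterating is what actually pulls data from the source.
lazy_values = [int(v) for v in lazy_ds.as_numpy_iterator()]
calls_after_iteration = calls["count"]

print(calls_after_creation, lazy_values, calls_after_iteration)
```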

3) Creating a Dataset for deep learning

==> Provide X and y together: for fit(), build the Dataset from the tuple (X, y).

- To provide two or more arrays together, bundle them in a tuple.

=> When feeding, the Dataset bundles the values at the same index into a tuple.

raw_data1 = np.arange(10) 
raw_data2 = np.arange(10,20)

dataset2 = tf.data.Dataset.from_tensor_slices((raw_data1, raw_data2))

- The for-in output shows that elements at the same index are returned as a tuple (raw_data1 element, raw_data2 element).

for X, y in dataset2:  # tuple unpacking
    print(X, y, sep = '|||')

tf.Tensor(0, shape=(), dtype=int32)|||tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)|||tf.Tensor(11, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)|||tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)|||tf.Tensor(13, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)|||tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)|||tf.Tensor(15, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)|||tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)|||tf.Tensor(17, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)|||tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)|||tf.Tensor(19, shape=(), dtype=int32)
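The tuple structure of such a Dataset can also be inspected without looping over it, through the element_spec property. The following is a small sketch; features, labels, and pair_ds are hypothetical names standing in for X, y, and the paired dataset.

```python
import numpy as np
import tensorflow as tf

features = np.arange(10)      # plays the role of X
labels = np.arange(10, 20)    # plays the role of y

pair_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# element_spec mirrors the structure of one element: a tuple of two TensorSpecs.
spec = pair_ds.element_spec
print(spec)

# Pull just the first (X, y) pair.
first_x, first_y = next(iter(pair_ds))
print(int(first_x), int(first_y))  # 0 10
```

The number of entries in element_spec is also the number of parameters a function passed to map() or filter() on this Dataset has to declare.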

4.3 .take(count) : provides only the specified number of elements

Pipeline (each stage runs separately)

: raw_data1 --- read ---> TensorSliceDataset (dataset) --- take 3 values ---> TakeDataset (dataset3)

- Reading values through dataset3

: dataset reads the data => dataset3 provides at most 3 of those values.

dataset3 = dataset.take(3)

print(type(dataset3)) 
for data in dataset3:
    print(data)

<class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)

4.4 .shuffle(buffer size)

Pipeline
: raw_data1 --- read ---> TensorSliceDataset (dataset) --- shuffle ---> ShuffleDataset (dataset4)

- buffer size: 10 (equal to the number of elements in dataset => complete shuffle)

dataset4 = dataset.shuffle(10)

print(type(dataset4))
for data in dataset4:
    print(data)

<class 'tensorflow.python.data.ops.dataset_ops.ShuffleDataset'>
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)

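4.5 .batch

- drop_remainder : bool (default False)
- If True, the final batch is not provided when it has fewer elements than batch_size.

- Setting it to True for a train dataset lets every step train with the same batch size.

- If the data is shuffled before batching, the dropped elements differ from epoch to epoch, so they still get used in other epochs. Without shuffling, the same trailing elements are dropped every epoch.

Pipeline

: raw_data1 -- read --> TensorSliceDataset (dataset) -- batch --> BatchDataset (dataset5)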

dataset5 = dataset.batch(3, drop_remainder = True)

print(type(dataset5))
for data in dataset5:
    print(data)

<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
tf.Tensor([0 1 2], shape=(3,), dtype=int32)
tf.Tensor([3 4 5], shape=(3,), dtype=int32)
tf.Tensor([6 7 8], shape=(3,), dtype=int32)

- Result with drop_remainder = False (the default)

tf.Tensor([0 1 2], shape=(3,), dtype=int32)
tf.Tensor([3 4 5], shape=(3,), dtype=int32)
tf.Tensor([6 7 8], shape=(3,), dtype=int32)
tf.Tensor([9], shape=(1,), dtype=int32)
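Whether dropped elements reappear in later epochs depends on shuffling before batching. The sketch below iterates the same Dataset three times, once without and once with shuffle(), and records which values each pass delivers. The helper seen_in_epoch and the other names are hypothetical.

```python
import numpy as np
import tensorflow as tf

base = tf.data.Dataset.from_tensor_slices(np.arange(10))


def seen_in_epoch(ds):
    """Collect (sorted) every value that one full pass over ds delivers."""
    return sorted(int(v) for batch in ds.as_numpy_iterator() for v in batch)


# No shuffle: the same trailing element (9) is dropped on every pass.
fixed = base.batch(3, drop_remainder=True)
fixed_epochs = [seen_in_epoch(fixed) for _ in range(3)]

# Shuffle before batch: each pass still delivers 9 values,
# but which value is left out can change from pass to pass.
shuffled = base.shuffle(10).batch(3, drop_remainder=True)
shuffled_epochs = [seen_in_epoch(shuffled) for _ in range(3)]

print(fixed_epochs)
print(shuffled_epochs)
```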

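4.6 .repeat(count)

- Older versions required it for training; with current Keras, fit() iterates over the Dataset once per epoch on its own, so repeat() is rarely needed for that purpose.

- Provides the whole dataset the given number of times.

- If count is omitted, the dataset repeats indefinitely.

Pipeline

: raw_data1 -- read --> TensorSliceDataset (dataset) -- repeat --> RepeatDataset (dataset7)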

dataset7 = dataset.repeat(2) 

print(type(dataset7))
for data in dataset7:
    print(data)

<class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
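When the count is omitted, a for loop over the repeated Dataset never ends on its own. One way to look at such a Dataset safely is to bound it with take(). This is a minimal sketch; the name bounded and the limit of 25 are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))

# repeat() with no count repeats forever; take(25) cuts the stream off after 25 elements.
bounded = dataset.repeat().take(25)

values = [int(v) for v in bounded.as_numpy_iterator()]
print(len(values))   # 25
print(values[20:])   # [0, 1, 2, 3, 4]
```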

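- What happens when it is combined with shuffle and batch? Each of the 3 repetitions is reshuffled and split into 2 batches of 5, giving 6 batches in total.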

dataset8 = dataset.shuffle(10).batch(5).repeat(3)
for data in dataset8:
    print(data)

tf.Tensor([2 6 1 0 5], shape=(5,), dtype=int32)
tf.Tensor([8 4 3 7 9], shape=(5,), dtype=int32)
tf.Tensor([9 8 6 1 5], shape=(5,), dtype=int32)
tf.Tensor([2 7 0 4 3], shape=(5,), dtype=int32)
tf.Tensor([3 4 7 0 8], shape=(5,), dtype=int32)
tf.Tensor([1 5 9 2 6], shape=(5,), dtype=int32)
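The order in which shuffle() and batch() are chained changes what gets shuffled. The sketch below contrasts the two orders on the same 10-element dataset; the variable names are hypothetical.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))

# shuffle -> batch: individual elements are mixed, then grouped into batches.
shuffle_then_batch = [
    b.tolist() for b in dataset.shuffle(10).batch(5).as_numpy_iterator()
]

# batch -> shuffle: batches are formed first, so only the order of whole
# batches can change; the contents of each batch stay contiguous.
batch_then_shuffle = [
    b.tolist() for b in dataset.batch(5).shuffle(2).as_numpy_iterator()
]

print(shuffle_then_batch)
print(batch_then_shuffle)
```

For training data, shuffling before batching is usually what is wanted, since it changes the composition of each batch.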

4.7 .map(mapping_func)

- mapping_func() : a function that receives one element of the dataset and processes it

- Parameters: declare one parameter per component of an element
- Return value: the processed element

- Note: pass the function object itself to map, not a call. Write mapping_func, not mapping_func().

Pipeline

: raw_data1 - read -> TensorSliceDataset (dataset) - map -> MapDataset (dataset9)

- The output shows that every element has been squared.

def mapping_func(x):
    return x**2

dataset9 = dataset.map(mapping_func) 

print(type(dataset9))
for data in dataset9:
    print(data)  # each element has been squared

<class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(25, shape=(), dtype=int32)
tf.Tensor(36, shape=(), dtype=int32)
tf.Tensor(49, shape=(), dtype=int32)
tf.Tensor(64, shape=(), dtype=int32)
tf.Tensor(81, shape=(), dtype=int32)

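- Applying map to a dataset whose elements are tuples

- Each element of dataset2 is a tuple (raw_data1 element, raw_data2 element) => it is unpacked into the function's parameters

-> Goal: return the cube of the raw_data1 value and leave the y value unchanged.

=> mapping_func2 declares two parameters to match the number of components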

def mapping_func2(x, y): 
    return x**3, y

dataset2 = tf.data.Dataset.from_tensor_slices((raw_data1, raw_data2))
dataset10 = dataset2.map(mapping_func2)

- The same kind of processing written as lambda expressions (note that dataset12 also squares y, unlike mapping_func2)

dataset11 = dataset.map(lambda x: x**2)

dataset12 = dataset2.map(lambda x, y: (x**3, y**2))
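The notes do not print the results of the tuple mappings, so the following sketch inspects them. It rebuilds the two-array dataset and applies the same two transformations; the names cubed, cubed_pairs, both, and both_pairs are hypothetical.

```python
import numpy as np
import tensorflow as tf

raw_data1 = np.arange(10)
raw_data2 = np.arange(10, 20)
dataset2 = tf.data.Dataset.from_tensor_slices((raw_data1, raw_data2))

# Same transformation as mapping_func2: cube x, leave y unchanged.
cubed = dataset2.map(lambda x, y: (x**3, y))
cubed_pairs = [(int(x), int(y)) for x, y in cubed.as_numpy_iterator()]

# Same transformation as the dataset12 lambda: cube x and square y.
both = dataset2.map(lambda x, y: (x**3, y**2))
both_pairs = [(int(x), int(y)) for x, y in both.as_numpy_iterator()]

print(cubed_pairs[:3])  # [(0, 10), (1, 11), (8, 12)]
print(both_pairs[:3])   # [(0, 100), (1, 121), (8, 144)]
```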

4.8 .filter(filter_func)

- filter_func()

: defines the condition for the data to provide. Only elements for which it returns True are fed to the model.

- Parameter: a variable that receives an element from the Dataset
- Return value: bool

Pipeline

: raw_data1 - read -> TensorSliceDataset (dataset) - filter -> FilterDataset (dataset13)

def filter_func(x):
    return x % 2 == 0  # whether x is a multiple of 2

dataset13 = dataset.filter(filter_func)

print(type(dataset13))
for i in dataset13:
    print(i)

<class 'tensorflow.python.data.ops.dataset_ops.FilterDataset'>
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)

- Written as a lambda expression

dataset14 = dataset.filter(lambda x: x > 5)  # condition: values greater than 5
for data in dataset14:
    print(data)

tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
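filter() works the same way on a dataset of tuples: the function takes one parameter per component and still returns a single bool. The sketch below keeps only the pairs whose second value is even; even_label and pairs are hypothetical names.

```python
import numpy as np
import tensorflow as tf

raw_data1 = np.arange(10)
raw_data2 = np.arange(10, 20)
dataset2 = tf.data.Dataset.from_tensor_slices((raw_data1, raw_data2))

# Two parameters because each element is an (x, y) tuple.
# The condition looks only at y; the whole pair is kept or dropped together.
even_label = dataset2.filter(lambda x, y: y % 2 == 0)

pairs = [(int(x), int(y)) for x, y in even_label.as_numpy_iterator()]
print(pairs)  # [(0, 10), (2, 12), (4, 14), (6, 16), (8, 18)]
```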

5. Putting it together: implementing a data input pipeline (a pipeline that produces the data fed into a model)

=> To feed a large dataset to a model for training/evaluation, the following processing is implemented as a pipeline

The result of each stage becomes the input of the next:
raw_data (-10 to 10, ndarray) - read -> TensorSliceDataset
- filter -> FilterDataset
- map -> MapDataset
- shuffle -> ShuffleDataset
- batch -> BatchDataset (dataset_final)

raw_data = np.arange(-10,11)

dataset_final = tf.data.Dataset.from_tensor_slices(raw_data)\
                               .filter(lambda x:x>=0)\
                               .map(lambda y:y+10)\
                               .shuffle(raw_data.size)\
                               .batch(3)
                               
# \ : continues a statement on the next line. Nothing may follow the backslash on the same line, neither a comment nor whitespace.

for data in dataset_final:
    print(data)

- A dataset with all the desired processing applied, like the one below, is what goes into the model as input.

tf.Tensor([19 14 17], shape=(3,), dtype=int32)
tf.Tensor([20 18 11], shape=(3,), dtype=int32)
tf.Tensor([12 15 13], shape=(3,), dtype=int32)
tf.Tensor([16 10], shape=(2,), dtype=int32)
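As an alternative to backslash continuation, the same chain can be wrapped in parentheses, which also allows a comment on each stage. This is a sketch of that style rather than the form used in the notes; pipeline, batches, batch_sizes, and all_values are hypothetical names.

```python
import numpy as np
import tensorflow as tf

raw_data = np.arange(-10, 11)

# Parentheses let the chain span several lines, with comments per stage.
pipeline = (
    tf.data.Dataset.from_tensor_slices(raw_data)
    .filter(lambda x: x >= 0)    # keep 0..10 (11 elements)
    .map(lambda x: x + 10)       # shift to 10..20
    .shuffle(raw_data.size)      # buffer covers the whole dataset
    .batch(3)                    # 11 elements -> batches of 3, 3, 3, 2
)

batches = [b.tolist() for b in pipeline.as_numpy_iterator()]
batch_sizes = [len(b) for b in batches]
all_values = sorted(v for b in batches for v in b)

print(batch_sizes)  # [3, 3, 3, 2]
print(all_values)   # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```

The order inside each batch changes from run to run because of the shuffle, so only the batch sizes and the overall set of values are stable.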
