Installing TabNet

2024. 1. 13. 18:51

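Wow, my first post in a billion years.

~~(To be honest, I started yet another blog on Naver, so I don't drop by here much anymore...)~~

Anyway, cutting to the chase! (lol) Installing TabNet is really no big deal.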

Reference blog URL

https://slowsteadystat.tistory.com/23

[Paper Review] Deep Learning for Tabular Data | Tabnet

slowsteadystat.tistory.com

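1. Installing PyTorch ~

First, TabNet needs PyTorch, so install that before anything else.

Installing PyTorch isn't particularly hard, and the homepage even spells everything out,

but people doing it for the first time often get lost. That's fine, though! I did too, haha.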

PyTorch

pytorch.org

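On the page above, the install selector is helpfully laid out near the bottom of the home page, as in the screenshot below.

If your computer has a GPU, pick CUDA and install that build.

I'm on a laptop with no GPU, so I'm going with the CPU option.

Once you click the options that match your setup, the page generates the install command for you.

I work inside an Anaconda virtual environment, so that's where I'll install it.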
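Once the command from the selector has run, a quick check from Python can confirm that the install landed in the active environment. This is only a minimal sketch: `describe_torch` is a helper name I made up (it is not part of PyTorch), and it only relies on `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch():
    """Report whether PyTorch is importable and which device it would use."""
    # find_spec returns None when the package is not installed in this environment
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "version": None, "device": None}

    import torch  # imported lazily so the check works even without PyTorch

    # is_available() is False on a CPU-only build or a machine without a GPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return {"installed": True, "version": torch.__version__, "device": device}


if __name__ == "__main__":
    print(describe_torch())
```

On a GPU-less laptop like mine, the `device` entry should come back as `cpu`.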

2. Installing TabNet ~

pytorch-tabnet

PyTorch implementation of TabNet

pypi.org

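The page above shows the install command for TabNet (`pip install pytorch-tabnet`).

Run it in your own virtual environment, or whichever environment you use.

I work in a virtual environment created with Anaconda, so I installed it there, as in the screenshot below.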
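If you want to be sure the package goes into the same environment your script or notebook runs in, one option is to call pip through the current interpreter. A hedged sketch: `pip_install_command` and `is_installed` are hypothetical helper names of mine, and the only outside facts assumed are the PyPI name `pytorch-tabnet` and the import name `pytorch_tabnet`, both of which appear in this post.

```python
import importlib.util
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter that is currently running."""
    # sys.executable is the python of the active (e.g. conda) environment
    return [sys.executable, "-m", "pip", "install", package]


def is_installed(module_name):
    """True if module_name can be imported from the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pytorch_tabnet"):
        print("pytorch-tabnet is already installed")
    else:
        # Only prints the command; nothing is installed by this sketch
        print("run:", " ".join(pip_install_command("pytorch-tabnet")))
```

To actually install from Python, the returned list can be passed to `subprocess.run(cmd, check=True)`.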

3. TabNet

TabNet: Attentive Interpretable Tabular Learning | Proceedings of the AAAI Conference on Artificial Intelligence

ojs.aaai.org

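You can get the paper itself from the site above. hehe

[Paper Review] Deep Learning for Tabular Data | Tabnet

slowsteadystat.tistory.com

The paper can be a bit hard going, but others have explained it well elsewhere. Reading the blog post above helped it click for me.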

GitHub - dreamquark-ai/tabnet: PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf

github.com

This is the TabNet GitHub repository, where you can get the example code. I'll paste just one of the examples below.

from pytorch_tabnet.tab_model import TabNetClassifier
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
np.random.seed(0)
import os
import wget
from pathlib import Path
from matplotlib import pyplot as plt

# download census-income dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
dataset_name = 'census-income'
out = Path(os.getcwd()+'/data/'+dataset_name+'.csv')
out.parent.mkdir(parents=True, exist_ok=True)
if out.exists():
    print("File already exists.")
else:
    print("Downloading file...")
    wget.download(url, out.as_posix())


# load data and split
col_name = ['age','workclass','fnlwgt','education','education-num', 'marital-status', 'occupation' ,
            'relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country',
            '<=50K']
train = pd.read_csv(out, names = col_name)
target = '<=50K'
if "Set" not in train.columns:
    train["Set"] = np.random.choice(["train", "valid", "test"], p =[.8, .1, .1], size=(train.shape[0],))

train_indices = train[train.Set=="train"].index
valid_indices = train[train.Set=="valid"].index
test_indices = train[train.Set=="test"].index


# simple preprocessing
nunique = train.nunique()
types = train.dtypes

categorical_columns = []
categorical_dims =  {}
for col in train.columns:
    if types[col] == 'object' or nunique[col] < 200:
        # print(col, train[col].nunique())
        l_enc = LabelEncoder()
        train[col] = train[col].fillna("VV_likely")
        train[col] = l_enc.fit_transform(train[col].values)
        categorical_columns.append(col)
        categorical_dims[col] = len(l_enc.classes_)
    else:
        # fill missing values in this numeric column only, using the training-set mean
        train[col] = train[col].fillna(train.loc[train_indices, col].mean())

# Define categorical feature for categorical embeddings
unused_feat = ['Set']
features = [ col for col in train.columns if col not in unused_feat+[target]] 
cat_idxs = [ i for i, f in enumerate(features) if f in categorical_columns]
cat_dims = [ categorical_dims[f] for i, f in enumerate(features) if f in categorical_columns]

# define your embedding sizes : here just a random choice
cat_emb_dim = [5, 4, 3, 6, 2, 2, 1, 10]

# check that pipeline accepts strings
# (LabelEncoder sorted the raw labels, so 0 is "<=50K" and 1 is ">50K")
train[target] = train[target].map({0: "not_wealthy", 1: "wealthy"})

# network parameters
tabnet_params = {"cat_idxs":cat_idxs,
                 "cat_dims":cat_dims,
                 "cat_emb_dim":1,
                 "optimizer_fn":torch.optim.Adam,
                 "optimizer_params":dict(lr=2e-2),
                 "scheduler_params":{"step_size":50, # how to use learning rate scheduler
                                 "gamma":0.9},
                 "scheduler_fn":torch.optim.lr_scheduler.StepLR,
                 "mask_type":'entmax', # "sparsemax"
                 "gamma" : 1.3 # coefficient for feature reusage in the masks
                }

clf = TabNetClassifier(**tabnet_params)

# training
X_train = train[features].values[train_indices]
y_train = train[target].values[train_indices]

X_valid = train[features].values[valid_indices]
y_valid = train[target].values[valid_indices]

X_test = train[features].values[test_indices]
y_test = train[target].values[test_indices]

max_epochs = 100 if not os.getenv("CI", False) else 2

# This illustrates the warm_start=False behaviour
save_history = []
for _ in range(2):
    clf.fit(
        X_train=X_train, y_train=y_train,
        eval_set=[(X_train, y_train), (X_valid, y_valid)],
        eval_name=['train', 'valid'],
        eval_metric=['auc'],
        max_epochs=max_epochs , patience=20,
        batch_size=1024, virtual_batch_size=128,
        num_workers=0,
        weights=1,
        drop_last=False
    )
    save_history.append(clf.history["valid_auc"])

# plot losses
plt.plot(clf.history['loss'])
plt.show()

# plot auc
plt.plot(clf.history['train_auc'])
plt.plot(clf.history['valid_auc'])
plt.show()

# plot learning rates
plt.plot(clf.history['lr'])
plt.show()

# prediction
preds = clf.predict_proba(X_test)
test_auc = roc_auc_score(y_score=preds[:,1], y_true=y_test)


preds_valid = clf.predict_proba(X_valid)
valid_auc = roc_auc_score(y_score=preds_valid[:,1], y_true=y_valid)

print(f"BEST VALID SCORE FOR {dataset_name} : {clf.best_cost}")
print(f"FINAL TEST SCORE FOR {dataset_name} : {test_auc}")

# global feature importances
feat_importances = pd.Series(clf.feature_importances_, index=features)
feat_importances.plot(kind='barh')
plt.show()

# local explanations: one mask per decision step, shown for test rows 30-49
explain_matrix, masks = clf.explain(X_test)
fig, axs = plt.subplots(1, 3, figsize=(20,20))

for i in range(3):
    axs[i].imshow(masks[i][30:50])
    axs[i].set_title(f"mask {i}")
plt.show()
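The preprocessing step in the listing above is the part that tends to confuse people, so here is a toy version of it. This is a minimal sketch in plain Python (no pandas or scikit-learn) on a made-up three-column table. `encode_column` is a hypothetical helper that mimics what `LabelEncoder.fit_transform` does, namely mapping the sorted unique values to 0..n-1. `cat_idxs` then holds the positions of the categorical columns within `features`, and `cat_dims` holds the number of distinct values in each.

```python
def encode_column(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Toy data, invented for illustration only
table = {
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
    "sex": ["Male", "Female", "Male", "Male"],
}
features = list(table)
categorical_columns = ["workclass", "sex"]

categorical_dims = {}
for col in categorical_columns:
    table[col], classes = encode_column(table[col])
    categorical_dims[col] = len(classes)

# Positions of the categorical features, and how many categories each one has
cat_idxs = [i for i, f in enumerate(features) if f in categorical_columns]
cat_dims = [categorical_dims[f] for f in features if f in categorical_columns]

print(table["workclass"])  # [2, 0, 0, 1]
print(cat_idxs, cat_dims)  # [1, 2] [3, 2]
```

Note that the real listing also treats numeric columns with fewer than 200 distinct values as categorical, which this toy version skips.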

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
