Tuesday, May 7, 2019

[NLP] Natural Language Processing Concepts, Part 2 (Preprocessing Examples with Keras)



Word Embedding

  • Word embeddings vectorize text; one-hot encoding does the same job but produces huge sparse matrices.
  • Embeddings are dense, low-dimensional, and learned from the data.
  • In practice you can train an embedding layer on your own corpus, or reuse pretrained word embeddings (e.g. the English GloVe vectors) to get better results when training data is scarce.
  • To use pretrained embeddings, define the model first, then set the first (embedding) layer's weights to the GloVe matrix: model.layers[0].set_weights([embedding_matrix]).
  • When training with pretrained embeddings, freeze the embedding layer with model.layers[0].trainable = False so the already-learned vectors are not updated; see the sketch after this list.
  • Embedding spaces trained on different domains can differ substantially.
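
A minimal sketch of the last two points, assuming an embedding_matrix has already been built from GloVe (a zero placeholder is used here so the snippet runs on its own; the real matrix is built in the GloVe section near the end of this post):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
import numpy as np

# Hypothetical sizes; replace with the real GloVe matrix in practice.
vocab_size, embedding_dim, maxlen = 10000, 100, 100
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # placeholder for the GloVe weights

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# Load the pretrained weights into the first (embedding) layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])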


Take a look at the word-frequency statistics
total_counts.most_common()
#[('and', 392764),
# ('the', 302872),
# ('a', 263210),
# ('for', 252534),
# ('in', 230538),
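
total_counts is not defined in the post; a minimal sketch, assuming it is a collections.Counter accumulated over a whitespace-tokenized corpus called data:

from collections import Counter

total_counts = Counter()
for text in data:          # data: an iterable of raw text strings (assumed)
    total_counts.update(text.split())

total_counts.most_common(5)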

Removing stopwords
import re
import jieba
from tqdm import tqdm_notebook

# Filter with a stopword list and/or regular expressions
sentences = []

for i, text in enumerate(tqdm_notebook(data)):
    line = []

    for w in jieba.cut(text, cut_all=False):

        ## remove stopwords and digits
        ## you can define your own rules
        if w not in stop_words and not bool(re.match('[0-9]+', w)):
            line.append(w)

    sentences.append(line)
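
The snippet above assumes a stop_words collection already exists. One common way to build it (an assumption; the original post does not show this step) is to read one stopword per line from a plain-text file:

# Hypothetical helper: load stopwords from a text file, one word per line.
with open('stop_words.txt', encoding='utf-8') as f:
    stop_words = set(line.strip() for line in f if line.strip())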
Cleaning the text data (regular expressions can be used)
import numpy as np

## filter rules: strip URLs, then mark empty strings as missing values
article['content'] = article['content'].str.replace(r'https?://\S*', '', regex=True)
article['content'] = article['content'].replace('', np.nan)

## word count: runs of ASCII letters/digits plus individual CJK characters
## http://blog.csdn.net/gatieme/article/details/43235791 (Chinese regular expressions)
df['word_count'] = df['content'].str.count(r'[a-zA-Z0-9]+') + df['content'].str.count(r'[\u4e00-\u9fff]')

## compute correlation across the last four columns
df.iloc[:, -4:].corr()
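
A quick sanity check of the counting rule on a made-up string (illustrative only): each run of ASCII letters/digits counts as one word, and each CJK character counts as one.

import pandas as pd

s = pd.Series(['NLP處理123很有趣'])
print(s.str.count(r'[a-zA-Z0-9]+') + s.str.count(r'[\u4e00-\u9fff]'))
# 2 ASCII runs ('NLP', '123') + 5 CJK characters = 7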
Bag of words
import pickle
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

## define the transformer
vectorizer = CountVectorizer()
count = vectorizer.fit_transform([' '.join(x) for x in sentences])

## save data as pickle format
with open("article_count", "wb") as file:
    pickle.dump([vectorizer, count], file)

## create a dictionary: id as key ; word as values
id2word = {v:k for k, v in vectorizer.vocabulary_.items()}

## columnwise sum: words frequency
sum_ = np.array(count.sum(axis=0))[0]

## top 10 frequency's wordID
most_sum_id = sum_.argsort()[::-1][:10].tolist()
most_sum_id
# [73627, 198934, 95899, 37001, 243708, 258736, 257519, 305714, 256024, 283981]

## print top 10 frequency's words
features = [id2word[i] for i in most_sum_id]
features 
# ['八卦', '有沒有', '台灣', '一個', '現在', '知道', '真的', '覺得', '看到', '肥宅']

## build a DataFrame of the top-10 words' counts for each article
data = pd.DataFrame(count[df.idx.values, :][:, most_sum_id].toarray(), columns=features)
data[:5]
Using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

## define transformer (轉換器)
vectorizer = TfidfVectorizer(norm=None) ## skip normalization
tfidf = vectorizer.fit_transform([' '.join(x) for x in sentences])

## save data as pickle format
with open("article_tfidf", "wb") as file:
    pickle.dump([vectorizer, tfidf], file)

## create a dictionary: id as key ; word as values
id2word = {v:k for k, v in vectorizer.vocabulary_.items()}

## columnwise average: words tf-idf
avg = tfidf.sum(axis=0) / (tfidf!=0).sum(axis=0)

## zero out words whose document frequency is below 20
avg[(tfidf != 0).sum(axis=0) < 20] = 0
avg = np.array(avg)[0]

## top 10 average tf-idf wordIDs
most_avg_id = avg.argsort()[::-1][:10].tolist()

## print the top 10 tf-idf words
features = [id2word[i] for i in most_avg_id]
features
# (the original output listed the word IDs and words such as 'charlie' and 'united')

## build a DataFrame of these words' tf-idf values and compute the correlation
data = pd.DataFrame(tfidf[df.idx.values, :][:, most_avg_id].toarray(), columns=features)
data.corr()
Experiments combining preprocessing with algorithms
Vectorizing the raw IMDB text data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # cut each review off after 100 words (only look at the first 100 words)
training_samples = 200  # train on 200 samples
validation_samples = 10000 # validate on 10,000 samples
max_words = 10000  # only consider the 10,000 most frequent words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts) # turn the texts into sequences (lists) of integers

word_index = tokenizer.word_index
print(list(word_index.items())[:10])
print('Used %s unique tokens in total.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen) # keep only the first 100 words of each review (truncate or pad) to build the data tensor
labels = np.asarray(labels)  # turn the label list into a NumPy array (the label tensor)

print('Shape of the data tensor:', data.shape) # (25000, 100)
print('Shape of the label tensor:', labels.shape) # (25000,)

indices = np.arange(data.shape[0])  # split the data into training and validation sets, but shuffle it first, because the samples are ordered (all negative reviews first, then the positive ones)
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
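
The tokenizer code above assumes texts and labels already exist. A sketch of how they could be loaded from the raw aclImdb/train directory (the path below is hypothetical):

import os

imdb_dir = '/path/to/aclImdb'        # hypothetical location of the raw IMDB data
train_dir = os.path.join(imdb_dir, 'train')

texts, labels = [], []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)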
A two-input question-answering model with the Keras functional API
from keras import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500
       #↓1...                   #↓2...
text_input = Input(shape=(None, ), dtype='int32', name='text') 
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input) #← 3...
print(embedded_text.shape)   #→ (?, ?, 64)
encoded_text = layers.LSTM(32)(embedded_text) #← 4...
print(encoded_text.shape)  # → (?, 32)

question_input = Input(shape=(None, ), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input) #5..
print(embedded_question.shape)   #→ (?, ?, 32)
encoded_question = layers.LSTM(16)(embedded_question)
print(encoded_question.shape)   #→ (?, 16)
             #↓6...
concatenated = layers.concatenate([encoded_question, encoded_text], axis=-1) 
print(concatenated.shape)  #→ (?, 48)

answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated) #← 7...
print(answer.shape)  #→ (?, 500) 

model = Model([text_input, question_input], answer) #← 8...
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

#1. shape=(None,) means the tensor's shape is not fixed, so the text input can be a variable-length sequence of integers.
#2. Note that naming the inputs is optional; the reason for doing so is training method 2 in Listing 7.2 below.
#3. Feed the input through an embedding layer, encoding it into 64-dimensional word embeddings (this branch handles the "reference text" input).
#4. Then an LSTM layer encodes the sequence of vectors into a single vector.
#5. The branch that handles the "question" input (same flow as the "reference text" branch).
#6. Concatenate the encoded "question" and "reference text" vectors into one. axis=-1 means the concatenation happens along the last axis of the inputs.
#7. Finally, add a Dense layer (a softmax classifier) on top of the concatenated vector to produce the model's output tensor, answer.
#8. When instantiating the model, the two inputs are passed together as a list (there are two of them), and the output is answer.
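
Note #2 refers to the two ways of fitting this model (Listing 7.2): because the Input layers are named 'text' and 'question', the training data can be passed either as an ordered list or as a dict keyed by those names. A minimal sketch with random dummy data (illustrative assumption only):

import numpy as np
from keras.utils import to_categorical

num_samples, max_length = 1000, 100
text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))
question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))
answers = to_categorical(np.random.randint(0, answer_vocabulary_size, size=num_samples),
                         answer_vocabulary_size)

# Method 1: pass the inputs as a list, in the same order as in Model([...])
model.fit([text, question], answers, epochs=10, batch_size=128)

# Method 2: pass a dict keyed by the Input names ('text', 'question')
model.fit({'text': text, 'question': question}, answers, epochs=10, batch_size=128)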

Using GloVe as a pretrained embedding
import os
import numpy as np
from keras.layers import Embedding
from keras.initializers import Constant

# assumes GLOVE_DIR, MAX_NUM_WORDS, EMBEDDING_DIM and MAX_SEQUENCE_LENGTH are defined,
# and that word_index comes from the Tokenizer fitted earlier
print('Preparing embedding matrix.')

# first, build an index mapping words in the embeddings set
# to their embedding vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors in Glove embeddings.' % len(embeddings_index))
#print(embeddings_index["google"])

# prepare embedding matrix - rows are the words from word_index, columns are the embeddings of that word from glove.
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load these pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
print("Preparing of embedding matrix is done")
Preparing embedding matrix.
Found 400000 word vectors in Glove embeddings.
Preparing of embedding matrix is done

1D CNN Model with pre-trained embedding

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

print('Define a 1D CNN model.')

cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
cnnmodel.add(Dense(len(labels_index), activation='softmax'))  # labels_index: class-name -> id mapping, assumed to be defined during data loading

cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on the test set (test_data / test_labels are assumed to be prepared the same way as the training data):
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Ref:
  • 深度學習必讀: Keras 大神帶你用 Python 實作 (the Traditional Chinese edition of François Chollet's Deep Learning with Python)
