Natural Language Processing Concepts, Part 2
Preprocessing Examples with Keras
Word Embedding
- Word embeddings vectorize text; using one-hot encoding instead produces sparse, high-dimensional matrices.
- Embeddings are dense, low-dimensional, and learned from data.
- In practice you can train an embedding layer on the current corpus, or reuse pretrained word embeddings (for example the English GloVe vectors) to improve results when only a small amount of training data is available.
- With pretrained word embeddings, once the model is defined, set the first (embedding) layer's weights to the GloVe matrix: model.layers[0].set_weights([embedding_matrix]).
- When training with pretrained word embeddings, the embedding layer must be frozen, model.layers[0].trainable = False, so that the already-learned representations are not updated (a minimal sketch follows this list).
- Embedding spaces can differ greatly between domains.
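A minimal sketch of the two bullets above, assuming a Keras Sequential model whose first layer is an Embedding layer and an embedding_matrix already built from GloVe (the sizes and the sigmoid classifier head are illustrative assumptions, not a fixed recipe):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# assumed sizes; embedding_matrix is assumed to be a (max_words, embedding_dim) NumPy array built from GloVe
max_words, embedding_dim, maxlen = 10000, 100, 100

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# load the pretrained GloVe weights into the first layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])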
total_counts.most_common()
# [('and', 392764),
#  ('the', 302872),
#  ('a', 263210),
#  ('for', 252534),
#  ('in', 230538), ...]
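total_counts is not built in this excerpt; presumably it is a collections.Counter over every token in the corpus. A minimal sketch, assuming a list of already-tokenized documents named docs (a placeholder name):

from collections import Counter

# docs: a list of token lists, e.g. [['the', 'cat', ...], ...] (placeholder)
total_counts = Counter()
for tokens in docs:
    total_counts.update(tokens)

total_counts.most_common(5)  # the top of the list is dominated by stopwords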
Remove stopwords

import re
import jieba
from tqdm import tqdm_notebook

# you can define a stopword filter and/or regular expressions of your own
sentences = []
for i, text in enumerate(tqdm_notebook(data)):
    line = []
    for w in jieba.cut(text, cut_all=False):
        ## remove stopwords and digits
        ## can define your own rules
        if w not in stop_words and not bool(re.match('[0-9]+', w)):
            line.append(w)
    sentences.append(line)
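The stop_words set used above is not defined in this excerpt; a common pattern is to load one word per line from a stopword list (the file name here is hypothetical):

# hypothetical stopword file, one word per line
with open('stopwords.txt', encoding='utf-8') as f:
    stop_words = set(line.strip() for line in f)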
Clean up the text data (regular expressions can be used)

## filter rules
article['content'] = article['content'].str.replace(r'https?://\S*', '', regex=True)
article['content'] = article['content'].replace('', np.nan)

## word count
## http://blog.csdn.net/gatieme/article/details/43235791 (Chinese regular expressions)
df['word_count'] = df['content'].str.count('[a-zA-Z0-9]+') + df['content'].str.count('[\u4e00-\u9fff]')

## compute correlation
df.iloc[:, -4:].corr()
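The word-count rule counts each run of Latin letters/digits as one word and each CJK character individually. A quick check on a toy string (pandas assumed available as pd):

import pandas as pd

toy = pd.DataFrame({'content': ['Keras 深度學習 123']})
eng = toy['content'].str.count('[a-zA-Z0-9]+')    # 2 -> 'Keras' and '123'
cjk = toy['content'].str.count('[\u4e00-\u9fff]')  # 4 -> 深, 度, 學, 習
print((eng + cjk).iloc[0])  # 6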
Bag of words

import pickle
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

## define transformer
vectorizer = CountVectorizer()
count = vectorizer.fit_transform([' '.join(x) for x in sentences])

## save data as pickle format
with open("article_count", "wb") as file:
    pickle.dump([vectorizer, count], file)

## create a dictionary: id as key ; word as value
id2word = {v: k for k, v in vectorizer.vocabulary_.items()}

## column-wise sum: word frequency
sum_ = np.array(count.sum(axis=0))[0]

## top-10 frequency word IDs
most_sum_id = sum_.argsort()[::-1][:10].tolist()
most_sum_id
# [73627, 198934, 95899, 37001, 243708, 258736, 257519, 305714, 256024, 283981]

## top-10 frequency words
features = [id2word[i] for i in most_sum_id]
features
# ['八卦', '有沒有', '台灣', '一個', '現在', '知道', '真的', '覺得', '看到', '肥宅']

## print the data
data = pd.DataFrame(count[df.idx.values, :][:, most_sum_id].toarray(), columns=features)
data[:5]
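A short usage note: because both the fitted transformer and the count matrix were pickled together, they can be loaded back later and reused to encode new text with the same vocabulary (the example document here is made up):

import pickle

with open("article_count", "rb") as file:
    vectorizer, count = pickle.load(file)

# the fitted vectorizer can encode unseen documents with the original vocabulary
new_vec = vectorizer.transform(['今天 天氣 真的 很好'])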
Using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

## define transformer
vectorizer = TfidfVectorizer(norm=None)  # do not normalize
tfidf = vectorizer.fit_transform([' '.join(x) for x in sentences])

## save data as pickle format
with open("article_tfidf", "wb") as file:
    pickle.dump([vectorizer, tfidf], file)

## create a dictionary: id as key ; word as value
id2word = {v: k for k, v in vectorizer.vocabulary_.items()}

## column-wise average over non-zero entries: each word's average tf-idf
avg = tfidf.sum(axis=0) / (tfidf != 0).sum(axis=0)

## set words with document frequency < 20 to 0
avg[(tfidf != 0).sum(axis=0) < 20] = 0
avg = np.array(avg)[0]

## top-10 tf-idf word IDs
most_avg_id = avg.argsort()[::-1][:10].tolist()

## top-10 tf-idf words
features = [id2word[i] for i in most_avg_id]

## print the data and compute correlation
data = pd.DataFrame(tfidf[df.idx.values, :][:, most_avg_id].toarray(), columns=features)
data.corr()
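Why norm=None matters here: by default TfidfVectorizer rescales every document row to unit L2 length, which would distort the raw column-wise averages computed above. A toy comparison (the corpus is made up):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ['the cat sat', 'the cat sat on the mat']
raw = TfidfVectorizer(norm=None).fit_transform(toy)
l2 = TfidfVectorizer().fit_transform(toy)   # default norm='l2'
print(raw.toarray().round(2))
print(l2.toarray().round(2))                # each row rescaled to unit length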
Preprocessing combined with a learning algorithm: vectorizing the raw IMDB text data

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100                 # cut reviews after 100 words (only look at the first 100 words of each review)
training_samples = 200       # train on 200 samples
validation_samples = 10000   # validate on 10,000 samples
max_words = 10000            # only consider the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # turn the texts into sequences of integer lists

word_index = tokenizer.word_index
print(list(word_index.items())[:10])
print('Used %s tokens in total.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)   # keep only the first 100 words of each review (truncate or pad)
labels = np.asarray(labels)                      # turn the label list into a NumPy array (label tensor)
print('Shape of data tensor:', data.shape)       # (25000, 100)
print('Shape of label tensor:', labels.shape)    # (25000,)

# split the data into a training set and a validation set,
# but shuffle it first, because the samples are ordered (all negative reviews first, then all positive reviews)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
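The code above assumes texts and labels already hold the raw IMDB reviews and their 0/1 sentiment labels. A hedged sketch of how they are typically read from the unpacked aclImdb download (the path is an assumption; adjust it to wherever the dataset lives):

import os

imdb_dir = 'aclImdb'                      # assumed path to the unpacked IMDB dataset
train_dir = os.path.join(imdb_dir, 'train')

texts, labels = [], []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)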
A two-input question-answering model with the Keras functional API

from keras import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# 1. shape=(None,) places no constraint on the tensor shape, so the text input can be a variable-length sequence of integers.
# 2. Note that naming the inputs is optional; the reason for naming them here is training method 2 in Listing 7.2 below.
text_input = Input(shape=(None,), dtype='int32', name='text')

# 3. Feed the input into an embedding layer, encoding it as word embeddings of size 64 (this handles the "reference text" input).
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)
print(embedded_text.shape)      # (?, ?, 64)

# 4. An LSTM layer then encodes the sequence of vectors into a single vector.
encoded_text = layers.LSTM(32)(embedded_text)
print(encoded_text.shape)       # (?, 32)

# 5. The "question" input goes through the same kind of pipeline as the "reference text" input.
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
print(embedded_question.shape)  # (?, ?, 32)
encoded_question = layers.LSTM(16)(embedded_question)
print(encoded_question.shape)   # (?, 16)

# 6. Concatenate the encoded "question" and "reference text" vectors into one; axis=-1 concatenates along the last axis.
concatenated = layers.concatenate([encoded_question, encoded_text], axis=-1)
print(concatenated.shape)       # (?, 48)

# 7. Finally, a Dense layer (softmax classifier) on top of the concatenated vector produces the output tensor, answer.
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)
print(answer.shape)             # (?, 500)

# 8. At model instantiation, the two inputs are passed together as a list, and the output is answer.
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
model.summary()
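The named inputs matter because the model can be trained by passing a dictionary keyed by the input layer names, the "training method 2" the note refers to. A minimal sketch with random dummy data (the sample count and sequence length are assumptions):

import numpy as np
from keras.utils import to_categorical

num_samples, max_length = 1000, 100
text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))
question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))
answers = to_categorical(np.random.randint(0, answer_vocabulary_size, size=num_samples),
                         answer_vocabulary_size)

# method 1: a list of inputs, in the same order as in the Model definition
model.fit([text, question], answers, epochs=1, batch_size=128)
# method 2: a dict keyed by the input layer names 'text' and 'question'
model.fit({'text': text, 'question': question}, answers, epochs=1, batch_size=128)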
Using GloVe as the pretrained embedding

import os
import numpy as np
from keras.layers import Embedding
from keras.initializers import Constant

print('Preparing embedding matrix.')

# first, build index mapping words in the embeddings set
# to their embedding vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors in GloVe embeddings.' % len(embeddings_index))
# print(embeddings_index["google"])

# prepare embedding matrix - rows are the words from word_index,
# columns are the embeddings of that word from GloVe.
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will be all zeros
        embedding_matrix[i] = embedding_vector

# load these pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

print("Preparing of embedding matrix is done")
1D CNN Model with pre-trained embedding
print('Define a 1D CNN model.')
cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
cnnmodel.add(Dense(len(labels_index), activation='softmax'))
cnnmodel.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
#Train the model. Tune to validation set.
cnnmodel.fit(x_train, y_train,
batch_size=128,
epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)
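Two things this block assumes: the inputs must be padded to MAX_SEQUENCE_LENGTH because of the Embedding layer's input_length, and the labels must be one-hot encoded to match categorical_crossentropy. A hedged sketch of that preparation (train_sequences, val_sequences, train_labels, and val_labels are placeholder names, not defined in this note):

from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# pad/truncate every sequence to the fixed length expected by embedding_layer
x_train = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_val = pad_sequences(val_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# one-hot encode the integer class labels for categorical_crossentropy
y_train = to_categorical(train_labels, num_classes=len(labels_index))
y_val = to_categorical(val_labels, num_classes=len(labels_index))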
- Reference: 《深度學習必讀：Keras 大神帶你用 Python 實作》 (the Traditional Chinese edition of Deep Learning with Python by François Chollet)