IMDB Dataset of 50K Movie Reviews
An IMDB dataset of 50K movie reviews for natural language processing and text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative using either classical classification or deep learning algorithms.
For more dataset information, please go through the following link,
http://ai.stanford.edu/~amaas/data/sentiment/
Kaggle dataset identifier: imdb-dataset-of-50k-movie-reviews
import pandas as pd

df = pd.read_csv("imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.info()

RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
Examples:
{
"sentiment": "positive",
"review": "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.

The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from t…(truncated)",
}
{
"sentiment": "positive",
"review": "A wonderful little production.

The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece.

The actors are extremely well chosen- not only \"has got …(truncated)",
}
{
"sentiment": "positive",
"review": "I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). Whi…(truncated)",
}
{
"sentiment": "positive",
"review": "Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.

This movie is slower, a little too dark, and not too engaging. There are many popcorn movies at the multiplex these days, this movie appears …(truncated)",
}
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os

for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import os
import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk import pos_tag
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfVectorizer,
    ENGLISH_STOP_WORDS,
)
from wordcloud import WordCloud

df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.head()
df.shape
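## Quick check (a sketch, not in the original notebook) of the 25,000/25,000
## class balance described in the dataset card:
df["sentiment"].value_counts()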

# **Preprocessing**
## Cleaning html tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

# Cleaning punctuation, digit-bearing tokens, and stray quote characters
import string

def clean_text(text):
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)  ## drop bracketed text
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  ## drop punctuation
    text = re.sub(r"\w*\d\w*", " ", text)  ## drop words containing digits
    text = re.sub("[‘’“”…]", " ", text)  ## drop curly quotes and ellipses
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"\r", " ", text)
    text = re.sub(r"\s+", " ", text)  ## collapse whitespace
    return text
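## A quick sanity check (a sketch, not in the original notebook): clean_text
## lowercases, strips punctuation, and drops digit-bearing tokens.
print(clean_text("It was GREAT!!! 10/10"))  ## prints roughly: "it was great"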

df["clean_review"] = df["review"].apply(remove_html_tags)
## converting the sentiment labels to numeric (positive -> 1, negative -> 0)
df["sentiment"] = df["sentiment"].apply(lambda x: 1 if x == "positive" else 0)
df.head()
## split the data into train and test
train, test = train_test_split(df, test_size=0.3, random_state=42, shuffle=True)
train.shape, test.shape
## BERT accepts a maximum of 512 tokens per input
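## (note, an assumption worth checking against the TF Hub docs: the
## preprocessing model used below pads/truncates every review to a fixed
## length, 128 tokens by default, so the tail of a long review is discarded
## before it reaches the encoder)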
train_reviews = train["clean_review"].tolist()
train_labels = train["sentiment"].tolist()
test_reviews = test["clean_review"].tolist()
test_labels = test["sentiment"].tolist()
## converting into tensorflow dataset format for bert input
train_ds = tf.data.Dataset.from_tensor_slices((train_reviews, train_labels))
test_ds = tf.data.Dataset.from_tensor_slices((test_reviews, test_labels))
## Common BERT batch sizes are 32, 64, 128, ...; I use 32 because of limited computational power.
## shuffle the training data, then batch (shuffling before batching mixes
## individual examples rather than whole batches)
train_ds = train_ds.shuffle(100).batch(32)
test_ds = test_ds.batch(32)
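## Optional (a sketch, not in the original notebook): prefetching overlaps
## input preparation with training on the accelerator.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.prefetch(tf.data.AUTOTUNE)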

# **Preparing bert layer**
## bert preprocess and encoder layer
bert_preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
)
bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
)
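## Note: hub.KerasLayer loads with trainable=False by default, so as written
## the BERT weights stay frozen and only the classification head below is
## trained; passing trainable=True to bert_encoder would fine-tune BERT itself.
## A quick sketch (not in the original notebook) of what the preprocessing
## layer emits for a toy input:
sample = bert_preprocess(tf.constant(["a great movie"]))
print(sample.keys())  ## dict with input_word_ids, input_mask, input_type_ids
print(sample["input_word_ids"].shape)  ## (1, 128): the fixed default sequence length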

# building bert classification model
def build_classifier_model():
    ## input layer: raw review strings
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="input_sentence")
    ## preprocess the text into token ids, input mask and input type ids
    preprocessed_text = bert_preprocess(text_input)
    ## the bert encoder layer gives two outputs: pooled output and sequence output
    encoded_output = bert_encoder(preprocessed_text)
    ## pooled_output has shape [batch_size, 768]
    pooled_output = encoded_output["pooled_output"]
    ## sequence_output is a float32 tensor of shape [batch_size, seq_length, 768]
    sequence_output = encoded_output["sequence_output"]
    net = tf.keras.layers.Dropout(0.1)(pooled_output)
    net = tf.keras.layers.Dense(128, activation="relu")(net)
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation="sigmoid", name="classifier")(net)
    return tf.keras.Model(text_input, net)

classifier_model = build_classifier_model()
## checking the summary
classifier_model.summary()
## Checking that the model runs end to end on a sample input
classifier_model.predict(["This is a sample input sentence"])
## Compile the model
classifier_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
## Training the model
history = classifier_model.fit(train_ds, epochs=5, validation_data=test_ds)
## model evaluation
loss, accuracy = classifier_model.evaluate(test_ds)
print("Test accuracy:", accuracy)
print("Test loss:", loss)

# **Model evaluation**
import matplotlib.pyplot as plt
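## The notebook is cut off here; below is a minimal sketch of the
## training-curve plot this section presumably builds from the `history`
## object returned by fit() (the key names follow from the "accuracy" metric
## compiled above):
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="val accuracy")
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.legend()
plt.show()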
