Deploy REL for a new Wikipedia corpus
Although we will do our best to continuously provide the community with recent corpuses, it may be the case that a user wants to, for example, use an older corpus for a specific evaluation. For this reason we provide the user with the option to do so. We must, however, note that some steps are outside the scope of the REL package, which makes support for some of these steps a difficult task.
This tutorial is divided in four parts. The first part deals with extracting a Wikipedia corpus and creating a p(e|m) index. After extracting the aforementioned index and thus obtaining a sqlite3 database, we are also in need of Embeddings. We obtain new embeddings by training a Wikipedia2Vec model. To train our own Entity Disambiguation model, we need to generate training, validation and test files. These aforementioned p(e|m) index, can be used to train your own Entity Disambiguation model. After obtaining this model, a user may choose to evaluate the obtained model on Gerbil or for E2E Entity Linking.
Creating a folder structure
Previously we have defined our base_url
, but now we need to create a new sub folder in our directory to obtain
the following folder structure:
.
├── generic
└─── wiki_2014
| ├── basic_data
| └── anchor_files
| └── generated
└─── wiki_2019
| ├── basic_data
| └── anchor_files
| └── generated
└─── your_corpus_name
| ├── basic_data
| └── anchor_files
| └── generated
Extracting a Wikipedia dump
There are several platforms that host Wikipedia dumps. These platforms provide xml
files that need processing.
A tool that does this is called WikiExtractor. This tool takes as an input a
Wikipedia dump and spits out files that are required for our package. We, however, had to alter it slightly such that it
stored some additional files that are required for this package. As such, we have added this edited edition to our scripts
folder. To process a Wikipedia dump run the command below in a terminal. We define the file size (bytes
) as one GB, but it can
be changed based on the user's wishes. We advice users to run the script in the basic_data
folder. After the script is
done, the user only needs to copy the respective wikipedia dump (this excludes the wiki id/name mapping, disambiguation pages
and redirects) into the anchor_files
folder.
python WikiExtractor.py ./wiki_corpus.xml --links --filter_disambig_pages --processes 1 --bytes 1G
Generate p(e|m) index
Now that we have extracted the necessary data from our Wikipedia corpus, we may create the p(e|m) index. This index
is automatically stored in the same database as the embeddings that can be found in the generated
folder. The first
thing we need to do is define define a variable where we store our database. Secondly, we instantiate a Wikipedia
class
that loads the wikipedia id/name mapping, disambiguation file and redirect file.
wiki_version = "your_corpus_name"
wikipedia = Wikipedia(base_url, wiki_version)
Now all that is left is to instantiate our WikipediaYagoFreq
class that is responsible for parsing the Wikipedia and
YAGO articles. Here we note that the function compute_custom()
by default computes the p(e|m) probabilities of
YAGO, but that it can be replaced by any Knowledge Base of your choosing. To replace YAGO, make sure that the input dictionary
to the aforementioned function is of the following format:
{mention_0: {entity_0: cnt_0, entity_n: cnt_n}, ... mention_n: {entity_0: cnt_0, entity_n: cnt_n}}
wiki_yago_freq = WikipediaYagoFreq(base_url, wiki_version, wikipedia)
wiki_yago_freq.compute_wiki()
wiki_yago_freq.compute_custom()
wiki_yago_freq.store()
Embeddings
Training Wikipedia2Vec embeddings
Training new embeddings is based on the Wikipedia2Vec package. For extra information
about this package we refer to their website. We did, however, feel obligated to provide users with the same scripts that
we used to train our embeddings. These two shell scripts first install Wikipedia2vec and then asks you where
your Wikipedia dump is stored. Please make sure that the dump is still zipped and thus has the extensions .xml.bz2
.
The two scripts are located in scripts/w2v
, where you first run preprocess.sh
which requires you to enter the
location of your Wikipedia dump. After this is done, you can run train.sh
which will train a Wikipedia2Vec model and
store it in the required word2vec format.
Storing Embeddings in DB
Now that the Embeddings are trained and stored you might notice that the file is huge. This is exactly the reason
why we choose for a database approach, because it was simply infeasible to load all the embeddings into memory. After
the package is installed, all we have to do is run the code below. Please make sure to not not change the variables save_dir
and db_name
. The variable embedding_file
needs to point to the trained Wikipedia2vec file.
from REL.db.generic import GenericLookup
save_dir = "{}/{}/generated/".format(base_url, wiki_version)
db_name = "entity_word_embedding"
embedding_file = "./enwiki_w2v_model"
# Embedding load.
emb = GenericLookup(db_name, save_dir=save_dir, table_name='embeddings')
emb.load_word2emb(embedding_file, batch_size=5000, reset=True)
Generating training, validation and test files
To train your own Entity Disambiguation model, training, validation and test files are required. To obtain these
files we first define our wiki_version
and instantiate a Wikipedia
class that are required.
from REL.wikipedia import Wikipedia
from REL.generate_train_test import GenTrainingTest
wiki_version = "wiki_2019/"
wikipedia = Wikipedia(base_url, wiki_version)
GenTrainingTest
that parses the raw training and test files that can be found in our
generic
folder. The user may choose to only retrieve one of the listed datasets below by simply changing the name
in the function process_wned
or process_aida
. The reason for separating these functions was due to the way they were
provided and had to be parsed.
data_handler = GenTrainingTest(base_url, wiki_version, wikipedia)
for ds in ["aquaint", "msnbc", "ace2004", "wikipedia", "clueweb"]:
data_handler.process_wned(ds)
for ds in ["train", "test"]:
data_handler.process_aida(ds)
Training your own Entity Disambiguation model
For our final tutorial we will train our own Entity Disambiguation model. The first step, as always, is to import
the necessary packages and to define the wiki_version
.
from REL.training_datasets import TrainingEvaluationDatasets
from REL.entity_disambiguation import EntityDisambiguation
wiki_version = "wiki_2019"
Our second step is loading our training and evaluation datasets that can be found in our generated
folder. These
datasets are provided by us, but can also be generated by the user in previous steps.
datasets = TrainingEvaluationDatasets(base_url, wiki_version).load()
Our third step consists of defining the config dictionary that serves as an input to the model. If the mode is set
to eval
then the model will evaluate the given datasets given that the model exists (else it will throw an error). The
mode should be set to train
if the user wishes to train a new model and save it in the given model_path
.
config = {
"mode": "train",
"model_path": "{}/{}/generated/model".format(
base_url, wiki_version
),
}
model = EntityDisambiguation(base_url, wiki_version, config)
Our fourth step consists of training or evaluating the model. This is dependent on the mode
that was chosen previously,
# 3. Train or evaluate model
if config["mode"] == "train":
model.train(
datasets["aida_train"], {k: v for k, v in datasets.items() if k != "aida_train"}
)
else:
model.evaluate({k: v for k, v in datasets.items() if "train" not in k})
Now that we have obtained our model, we want to express how confident we are in our prediction. By default this is not integrated in our model, but we implemented an additional step, where we use Logistic Regression to obtain confidence scores. However, to obtain such scores, we first need to train the model. This can be done by running the code below.
# 3. Train and predict using LR
model_path_lr = "{}/{}/generated/".format(base_url, wiki_version)
model.train_LR(
datasets,
model_path_lr
)