Here is a short blog post I was asked to write about building a personal wiki from Wikipedia. It shows the basic steps of text processing, so I hope it will be useful for data scientists. It requires some knowledge of setting up MediaWiki on a web server and some (not very advanced) knowledge of the Python programming language. If you know Python and the basic ideas of data science, creating such a wiki from Wikipedia articles takes only several days. Here are the steps:
(1) Install MediaWiki with basic extensions and insert some templates from Wikipedia (~2 h).
(2) Download the Wikipedia dump file (with the extension *.bz2) from https://dumps.wikimedia.org/. A BitTorrent client is recommended since the file is large (~17 GB).
(3) Create a Python script that reads this file and writes out only the articles in certain categories. Read it wisely: you cannot load the entire file into memory (my old computer had only 8 GB of RAM), so use other techniques to parse it, such as reading it piece by piece. In data science, this step is called data skimming. I wanted all categories related to data science and science. My script creates a TXT file with wiki markup (30 min on a commodity computer). However, it took about 2 days of (not very hard) work to write this Python code.
(4) Data cleaning. Since my wiki is about “concepts”, not about people and other things, I did a second pass with my Python script to remove articles on people that ended up in my categories. While doing this, I detected a massive number of empty “stubs”, self-promotions and junk articles without references. I was surprised that professional Wikipedians do nothing about such entries. It took about 2 hours of work during my weekend to write this script.
(5) During data cleaning I also shortened some long articles (keeping up to 10 sections) and removed many Wikipedia infoboxes, since installing the templates they need takes time. Plus, I wanted easy-to-read articles. This step is called data slimming.
(6) The hardest work was moving some articles (on particular topics) to new namespaces and replacing the internal links. I wanted a better organized wiki. Wikipedia just dumps all articles into the “main namespace”. In my wiki, I wanted links like [[Gyroscope]] to go to the “Physics” namespace, i.e. Physics:Gyroscope. I also had to convert some Wikipedia links to plain text when my wiki did not have the target article. This required building a Python map of all my titles and using it for the link replacements.
(7) Removal of duplicate entries. Some articles appeared in the main namespace and under the dedicated namespaces for specific sciences. A small script removed such duplicate entries.
(8) Finally, I added a link to the full Wikipedia article in the “External links” section of each article, stating that it was sourced from Wikipedia. All of this took me one day of thinking, but the actual implementation was simple: create a Python map and use it to replace the text between the [[ and ]] tags.
(9) The very last step is to import my TXT files with the selected articles into MediaWiki. The MediaWiki “maintenance” directory provides PHP scripts for this. The import takes about 3 h for 20,000 slimmed articles. With the default MediaWiki setup, you do not need to copy images from Wikimedia Commons.
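To make the "data skimming" in step (3) more concrete, here is a minimal sketch of reading the compressed dump one page at a time instead of loading it into memory. It is not my original script. The category keywords in WANTED, the helper names and the output delimiter line are all assumptions made for illustration.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Hypothetical keywords; the post only says "data science and science".
WANTED = ("data science", "statistics", "physics")

CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) one page at a time from a dump stream."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            root.clear()  # drop finished pages so memory use stays flat


def in_wanted_category(wikitext):
    """True if any [[Category:...]] tag contains one of the keywords."""
    cats = [c.strip().lower() for c in CATEGORY_RE.findall(wikitext)]
    return any(key in cat for cat in cats for key in WANTED)


def skim(dump_path, out_path):
    """Stream the .bz2 dump and write only the matching articles."""
    with bz2.open(dump_path, "rb") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for title, text in iter_pages(dump):
            if in_wanted_category(text):
                # Hypothetical delimiter line between articles.
                out.write(f"@@TITLE@@ {title}\n{text}\n")
```

Calling skim() with the path of the downloaded dump and the path of an output TXT file runs the whole pass. Because bz2.open decompresses on the fly and iterparse hands over one element at a time, only the current page has to fit in RAM.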
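The cleaning pass in step (4) can be sketched in a similar way. The post does not say how people or junk articles were recognized, so the rules below are purely hypothetical heuristics: birth/death/living-people categories mark a person, and a stub template, a very short text or the absence of any <ref> tag marks junk. Self-promotion cannot be detected this simply and is not covered.

```python
import re

CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)

# Hypothetical heuristics, not taken from the original script.
PERSON_CATEGORY_RE = re.compile(
    r"\b(\d{4} (births|deaths)|living people)\b", re.IGNORECASE
)
STUB_RE = re.compile(r"\{\{[^{}|]*stub\s*\}\}", re.IGNORECASE)
MIN_LENGTH = 500  # assumed threshold (characters) for an "empty" article


def is_person(wikitext):
    """Guess that an article is a biography from its categories."""
    return any(
        PERSON_CATEGORY_RE.search(cat) for cat in CATEGORY_RE.findall(wikitext)
    )


def is_junk(wikitext):
    """Flag stubs, near-empty pages and pages without references."""
    has_refs = "<ref" in wikitext.lower()
    is_stub = bool(STUB_RE.search(wikitext)) or len(wikitext) < MIN_LENGTH
    return is_stub or not has_refs


def keep(wikitext):
    """Keep only concept articles that pass both checks."""
    return not is_person(wikitext) and not is_junk(wikitext)
```

Rules like these will misclassify some articles in both directions, so it is worth printing the rejected titles and checking a sample by eye.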
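For the "data slimming" in step (5), a rough sketch could look like the following. Infoboxes contain nested {{...}} templates, so a plain regular expression is not enough and the code counts braces instead. The function names are made up for this example, and it assumes that "section" means a top-level "== Heading ==" block.

```python
import re

MAX_SECTIONS = 10  # the post keeps up to 10 sections

INFOBOX_RE = re.compile(r"\{\{\s*Infobox", re.IGNORECASE)
# Top-level headings only: "== Title ==", not "=== Subsection ===".
HEADING_RE = re.compile(r"^==[^=].*?==[ \t]*$", re.MULTILINE)


def strip_infoboxes(wikitext):
    """Remove every {{Infobox ...}} template, including nested templates.

    If an infobox is never closed, the rest of the text is dropped.
    """
    out = []
    i = 0
    while True:
        match = INFOBOX_RE.search(wikitext, i)
        if match is None:
            out.append(wikitext[i:])
            break
        out.append(wikitext[i:match.start()])
        depth, j = 0, match.start()
        while j < len(wikitext):
            if wikitext.startswith("{{", j):
                depth += 1
                j += 2
            elif wikitext.startswith("}}", j):
                depth -= 1
                j += 2
                if depth == 0:
                    break  # reached the end of the infobox
            else:
                j += 1
        i = j
    return "".join(out)


def keep_first_sections(wikitext, limit=MAX_SECTIONS):
    """Cut the article just before heading number limit + 1."""
    headings = list(HEADING_RE.finditer(wikitext))
    if len(headings) <= limit:
        return wikitext
    return wikitext[:headings[limit].start()].rstrip() + "\n"
```

Cutting at a heading keeps the lead paragraph and the first sections intact. Note that a blind cut like this can also drop a trailing "References" section, so it may be better to treat such sections separately.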
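The link replacement in step (6) boils down to a dictionary lookup for every [[...]] link. Here is a minimal sketch under a few assumptions: TITLE_NAMESPACE is a made-up example of the title map, an empty namespace string stands for the main namespace, and Category/File/Image links are left untouched.

```python
import re

# Hypothetical title map: article title -> namespace ("" = main namespace).
TITLE_NAMESPACE = {"Gyroscope": "Physics", "Histogram": ""}

# Matches [[Target]] and [[Target|label]] without nested brackets.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

SKIP_PREFIXES = ("Category:", "File:", "Image:")


def rewrite_links(wikitext, title_map):
    """Point links at the new namespaces; unlink missing articles."""

    def replace(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if target.startswith(SKIP_PREFIXES):
            return match.group(0)
        page, _, anchor = target.partition("#")
        page = page.strip()
        if not page:
            return match.group(0)  # same-page link such as [[#History]]
        # MediaWiki ignores the case of the first letter of a title.
        key = page[:1].upper() + page[1:]
        if key not in title_map:
            return label  # no such article in my wiki: keep plain text
        namespace = title_map[key]
        new_target = f"{namespace}:{key}" if namespace else key
        if anchor:
            new_target += "#" + anchor
        if label == new_target:
            return f"[[{new_target}]]"
        return f"[[{new_target}|{label}]]"

    return LINK_RE.sub(replace, wikitext)
```

For example, rewrite_links("A [[gyroscope]] by [[Albert Einstein|Einstein]]", TITLE_NAMESPACE) turns the first link into [[Physics:Gyroscope|gyroscope]] and the second into the plain word Einstein. Keeping the original label means the visible text of the article does not change.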
Done. You can find this Wiki here. The whole project took about a week, spending ~2-3 h per day. Later I noticed that some old Wikipedia templates were still missing. This should be easy to fix in the future.
(This blog post was re-posted from jWork.ORG with permission from the author.)