Contributed by Bernard Ong. He enrolled in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which ran from July 5th to September 23rd, 2016. This post is based on his third project, Web Scraping, due in the 6th week of the program. The original article can be found here.
The eBook business is thriving. The likes of Amazon Kindle, Apple iBookstore, and Google eBookstore all provide a robust variety of channels by which to publish any eBook on any subject you could think of. Amazon generates an average of 1.07MM in eBook paid sales volume, which translates to about $5.8MM in revenue, every day.
A huge community of eBook followers exists because eBooks are a proven model for generating passive income for good writers. There are many great writers out there, but why is it sometimes difficult to generate the expected revenue in the market? The key to real success for capable, passionate writers is preparation and research. You have to know the market and current trends.
It is not enough for writers to come up with a topic they are interested in, start drafting and writing what they feel most passionate about, get published, then sit back and reap the benefits of their efforts. There are many websites that offer advice and recommendations on how to get into the eBook publishing business, but it boils down to good old-fashioned market research and audience targeting to drive attention and trigger those sales.
Many tools and utilities are available to help writers research their subjects of choice and understand keyword search volume. They also reveal how much competition exists for the subject areas you're writing about, what the reading public is clamoring for, and what price customers are willing to pay for the eBook.
When you think about the purchasing behavior of eBook readers, they would normally go to their favorite eBook stores, type in the relevant keywords, and start browsing through the hundreds, if not thousands, of eBooks out there. If your eBook does not appear on the first or second page of the most relevant search results, your chances of being noticed are slim. Consumers will always try to go for the most relevant, most interesting, most popular, most highly rated, most favorably reviewed, and most inexpensive eBook with the best value they can get their hands on. Sounds daunting, right?
The eBook market is saturated with all the popular topics, so writers need to be more creative in their approach to the entire eBook publishing process. Do not start writing a word until you clearly understand how the market will react.
Business of eBooks
There is an interesting market psychology that drives the eBook market. The savviest writers are great marketers first, excellent wordsmiths second. If one’s intent is to purely follow their passion and write something they truly care about with the prospect of generating income as a second thought, then no amount of market research will convince the author to sway from his or her prime directive. However, if you’re like most people, you would want to not only write something you really care about, but also try to maximize the potential of earning passive income while you’re doing it. And why not? This mindset has been prevalent especially in this generation as more Baby Boomers are in or close to their retirement periods. Many feel compelled to follow their passion more, and at the same time find ways to supplement their retirement income. Writing and publishing eBooks has been one of these passive income generators that a lot of people vie for, but sadly, not many succeed.
eBooks offer so much potential for passive income that many business models have been built around the business of eBooks itself. Websites and courses exist to teach, mentor, and guide eBook writers on how to hone and excel in the craft of research, marketing, and writing rolled into one.
The most successful ones make it a point to create and publish an eBook once every two to three months. These savvy publishers go for volume, so it becomes a numbers game (to an extent). But these eBooks are not cranked out blindly. The electronic format and plethora of delivery channels make it convenient and efficient to produce and publish eBooks at a rate that traditional book publishers have never seen before. So it boils down to research and more research.
So where does one start? Right out of the gate.
When potential readers are hunting for the best eBook, there are cues that can be managed effectively to capture their attention and raise the chance of a sale. Once search results are displayed, the book's title and cover are the little-known doorways to grabbing the reader's attention. The cover design speaks visually to the consumer, while the eBook title (and subtitle) competes among thousands of keywords and appeals to the consumer's mental and emotional state.
Let's focus on how to choose a title for an eBook so that it is more visible and more likely to match what readers are searching for. We will focus on three main objectives: how to find the top words to title a Kindle eBook, how to find the most similar words related to the subject, and how to combine two technical approaches to fulfill these objectives.
Titles and headlines change the way we think. What you read affects what you see, and what grabs your attention changes the way you feel. The hook, line, and sinker exist in all forms of media. Part of advertising agencies' expertise is coming up with catch phrases and headline grabbers for news pieces, articles, blogs, printed and electronic books, movies, marquees, and advertisements. These are just a sample of how titling can make a difference.
Word Counts and the Doc2Vec Neural Network
I targeted the Amazon Kindle site as the source for an in-depth analysis of eBook inventory listings, ratings, reviews, and pricing. Code was written to dynamically scrape the entire result set returned for a keyword search. From there, a process tokenizes and cleans up the titles into groups of words, and each word is counted under the rating category of the eBook title it came from. A total weighted score is then calculated for each word as the sum, over the rating categories, of the rating score multiplied by the word count in that category.
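The weighted score can be illustrated with a minimal sketch. The function name `weighted_score` and the sample tallies are hypothetical, and the sketch assumes that the rating score for each category is simply its star value (1 to 5).

```python
def weighted_score(counts):
    """Sum of (star rating x word count) over the five rating categories."""
    # counts[0] holds the 1-star tally, counts[4] the 5-star tally
    return sum(stars * count for stars, count in enumerate(counts, start=1))


# Hypothetical tallies: word -> occurrences in 1-, 2-, 3-, 4- and 5-star titles
corpus = {
    "python": [0, 1, 2, 10, 7],
    "cookbook": [1, 0, 0, 3, 2],
}

scores = {word: weighted_score(counts) for word, counts in corpus.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Words that appear often in highly rated titles rise to the top of `ranked`, while words that appear mostly in poorly rated titles contribute little.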
A second and complementary approach using Doc2Vec was used to analyze the entire eBook title listing. Doc2Vec extends the Word2Vec approach from words to documents. The objective is to convert whole documents, or in this case bodies of text made up of eBook titles, into a numerical representation in a multidimensional vector space. That means any title can be represented as a point in that space. A single point on its own does not offer any value, but when a cluster of these points is created, a pattern starts to emerge and relationships between the words start to form. The magic and power of Doc2Vec is its ability to mathematically create context out of words and, inversely, to produce words given their context.
Doc2Vec is based on a shallow two-layer neural network. It learns representations for words and labels simultaneously. It operates in a purely unsupervised mode and needs no labels other than an arbitrary unique ID per text example. You train it to find similar words (using cosine similarity) in the context of each other, based on how frequently words co-occur near each other. You can even pass entirely new text to the trained model, and it can infer a compatible vector and find the most similar words most likely to appear in context.
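Cosine similarity itself is simple to compute. The sketch below uses made-up three-dimensional vectors purely for illustration; real Doc2Vec vectors have many more dimensions, and the helper names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical word vectors, not output from a trained model
vectors = {
    "python": [0.9, 0.1, 0.0],
    "programming": [0.8, 0.2, 0.1],
    "gardening": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Rank every other word by its cosine similarity to `word`."""
    target = vectors[word]
    others = [(w, cosine_similarity(target, v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


print(most_similar("python", vectors))
```

Vectors pointing in nearly the same direction score close to 1, and unrelated ones score close to 0, which is why "programming" ranks above "gardening" for this toy data.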
The approach is to create the weighted word frequency count to get the most relevant words from the Amazon Kindle site, then intersect that word list with the most similar words produced by the Doc2Vec neural network. The intersected result set is sorted by the relevancy score derived from the Amazon Kindle site to produce the recommended list of words for forming an eBook title that could increase the likelihood of being noticed and getting the sale. This serves two purposes. One, it ensures that the list relevancy remains high due to its consistent use of the Amazon Kindle inherent scoring mechanism. Two, the most similar words that the Doc2Vec neural network learned from the entire Amazon Kindle title listing are possibly related search words. Using these words will increase the likelihood that the eBook title will appear on the first two to three pages of Amazon's search results.
The initial process is all about the scraping strategy and approach. Python was the language, and the BeautifulSoup library took care of the web scraping. A fully object-oriented approach was also implemented so that full modularity, reuse, componentization, and abstraction were achieved. The ReviewCorpus class does the pre-processing on incoming titles. This class has four methods.
This class method removes the higher-order non-ASCII characters. It is fairly common to get these characters in scraped titles, and they detract from accurately processing text into a high-dimensional vector space.
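A minimal sketch of this intersect-then-sort idea, using plain Python sets and hypothetical data (the function name `recommend`, the scores, and the word lists are all made up for illustration):

```python
def recommend(word_scores, similar_words, top_n=25):
    """Keep words present in both lists, ranked by weighted score."""
    shared = set(word_scores) & set(similar_words)
    return sorted(shared, key=lambda w: word_scores[w], reverse=True)[:top_n]


# Hypothetical weighted scores from the word frequency count
word_scores = {"python": 83, "cookbook": 23, "beginners": 51, "guide": 40}
# Hypothetical most-similar words from the neural network
similar_words = ["beginners", "python", "tutorial", "guide"]

recommended = recommend(word_scores, similar_words)
print(recommended)
```

Only words found by both methods survive, and the weighted score decides their order.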
def remove_non_ascii(self, text):
    # removes all non-ascii characters
    return ''.join([i if ord(i) < 128 else '' for i in text])
These class methods use the Python nltk library functions to tokenize the titles, remove stop words, and take out all punctuation.
def cleanupTitle(self, s):
    # remove stopwords
    stopset = set(stopwords.words('english'))
    punctuations = list(string.punctuation)
    tokens = [i for i in nltk.word_tokenize(re.sub(r'\d+', '', s.lower()))
              if i not in punctuations]
    cleanup = " ".join(filter(lambda word: word not in stopset, tokens))
    cleanup = self.remove_non_ascii(cleanup)
    cleanup = cleanup.replace('...','')
    cleanup = cleanup.replace("'s",'')
    cleanup = cleanup.replace("''",'')
    cleanup = cleanup.replace("``",'')
    cleanup = cleanup.replace("-",'')
    cleanup = cleanup.replace("''",'')
    cleanup = cleanup.replace("'",'')
    return cleanup
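To see what this kind of cleanup does to a single title, here is a simplified, standard-library-only sketch. It is not the method above: it swaps nltk's tokenizer and stop word list for a plain whitespace split and a tiny hard-coded stop word set (an assumption made so the example runs without extra downloads), and it uses Python 3 syntax.

```python
import re
import string

# Tiny illustrative stop word set; nltk's English list is far longer
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

# Translation table that turns every punctuation character into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def cleanup_title(title):
    # lowercase the title and drop digits
    text = re.sub(r"\d+", "", title.lower())
    # drop possessive endings before punctuation is stripped
    text = text.replace("'s", "")
    # drop non-ASCII characters
    text = "".join(ch for ch in text if ord(ch) < 128)
    # replace punctuation with spaces, split on whitespace, drop stop words
    words = text.translate(PUNCT_TO_SPACE).split()
    return " ".join(w for w in words if w not in STOPWORDS)


print(cleanup_title("The Python Cookbook: 101 Recipes for Beginners!"))
```

The result keeps only the content-bearing words of the title, which is the form the counting and vectorization steps work on.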
The next class method, named .add, builds a dictionary that maps each title word to a count for each rating category.
def add(self, title, rating):
    newsent = self.cleanTokens(title)
    for x in range(len(newsent)):
        if newsent[x] not in self.corpus:
            # one counter per rating category, from 1 star to 5 stars
            self.corpus[newsent[x]] = [0,0,0,0,0]
        if rating > 0.00 and rating <= 1.44:
            self.corpus[newsent[x]][0] += 1
        if rating > 1.44 and rating <= 2.44:
            self.corpus[newsent[x]][1] += 1
        if rating > 2.44 and rating <= 3.44:
            self.corpus[newsent[x]][2] += 1
        if rating > 3.44 and rating <= 4.44:
            self.corpus[newsent[x]][3] += 1
        if rating > 4.44 and rating <= 5.00:
            self.corpus[newsent[x]][4] += 1
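The rating thresholds can also be expressed as a lookup. The following is a standalone sketch of the same bucketing using the standard `bisect` module; the names `rating_bucket` and `add_words` are hypothetical and not part of the ReviewCorpus class.

```python
from bisect import bisect_left

# Upper bounds of the 1-, 2-, 3- and 4-star categories, as in the method above
UPPER_BOUNDS = [1.44, 2.44, 3.44, 4.44]


def rating_bucket(rating):
    """Return the category index (0-4) for a rating in (0, 5], else None."""
    if not 0.0 < rating <= 5.0:
        return None
    return bisect_left(UPPER_BOUNDS, rating)


def add_words(corpus, words, rating):
    """Tally each word under the rating category of its title."""
    bucket = rating_bucket(rating)
    for word in words:
        counts = corpus.setdefault(word, [0, 0, 0, 0, 0])
        if bucket is not None:
            counts[bucket] += 1


corpus = {}
add_words(corpus, ["python", "guide"], 4.6)
add_words(corpus, ["python"], 3.0)
print(corpus)
```

A rating of exactly 1.44 still lands in the first category, matching the inclusive upper bounds of the chained comparisons.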
The AmazonKindle class scrapes the Amazon Kindle site dynamically. This code needs to be very robust, as the search results from Amazon could go to a maximum of 400 pages, and each page could have 15 to 20 title listings with metadata (titles, ratings, number of comments, prices).
The first method of the AmazonKindle class initializes the query string and scrapes the maximum page number from the Amazon result page. This tells the looping code how many pages it needs to go through to scrape all the titles.
AMZ_ROOT = 'https://www.amazon.com/'
def __init__(self, query):
    # Preprocess Search Query
    exclude = set(string.punctuation)
    self.query = ''.join(ch for ch in query if ch not in exclude)
    # Get maximum page to retrieve
    self.maximum = self.maxPage()
The buildURL method is then used to formulate the URL to scrape. Since Amazon generates this dynamically, I had to get the URL down to a reproducible format and protocol.
The retrieveSource method is the main module to collect the entire HTML byte stream. The request header needs to be formulated so that the request looks like it comes from a browser client, since Amazon does not allow scraping bots on its site.
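Since neither method is listed here, the following is only a rough sketch of what building the URL and the browser-like request could look like, in Python 3 with the standard library. The query parameter names (`keywords`, `page`), the `s?` path, the User-Agent string, and both function names are assumptions for illustration, not the author's actual values. The sketch only constructs the request object and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

AMZ_ROOT = "https://www.amazon.com/"


def build_url(query, page):
    # 'keywords' and 'page' are assumed parameter names, for illustration only
    return AMZ_ROOT + "s?" + urlencode({"keywords": query, "page": page})


def build_request(url):
    # A browser-like User-Agent header; the exact string is illustrative
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    return Request(url, headers=headers)


request = build_request(build_url("python programming", 2))
print(request.full_url)
```

Keeping URL construction separate from the request makes it easy to loop over page numbers while reusing the same headers.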
# Process Reviews
if reviews:
    for a in reviews.find_all(lambda a: (a.name == 'a' and
                                         'customerReviews' in a['href']), href=True):
        stringer = a.text
        stringer = stringer.replace(',', '')
        reviews = int(stringer)
# Return Result
if title and price and rating and reviews:
    return (title,
            float(price.text.replace(',', '')[1:]),
            float(rating.text.replace(' out of 5 stars', '')),
            reviews)
else:
    return None
Lastly, the AmazonKindle class is designed as an iterable: its __iter__ method is a generator that yields each result for further processing.
def __iter__(self):
    # Amazon Kindle results iterator
    for i in range(1, self.maximum+1):
        print('Scraping and Processing: Page '+str(i)+' / '+str(self.maximum))
        raw_html = self.retrieveSource(self.buildURL(self.query, i))
        for j in self.processPage(raw_html):
            yield j
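The same iterator pattern can be tried out in isolation. In this sketch, the class name `PagedResults` and the in-memory `pages` list are hypothetical stand-ins for the scraped result pages, so no network access is involved.

```python
class PagedResults:
    """Iterates over every item on every page, one page at a time."""

    def __init__(self, pages):
        # 'pages' stands in for the scraped result pages (hypothetical data)
        self.pages = pages
        self.maximum = len(pages)

    def __iter__(self):
        # Page numbers start at 1, mirroring the result page numbering
        for i in range(1, self.maximum + 1):
            for item in self.pages[i - 1]:
                yield item


results = PagedResults([["title A", "title B"], ["title C"]])
collected = [item for item in results]
print(collected)
```

Because `__iter__` is a generator, the caller can write a plain `for` loop over the object, and each page is only processed when the loop reaches it.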
After the two classes above are created, a Python application brings it all together. The main application passes the parameters to the processing classes, cleans up all text as it collects it, gathers all the metadata, stores it as vectors, then writes it all out to CSV files for further processing by the Doc2Vec procedure.
'''
------------------------------------------------------------------------------
Creator: Bernard Ong
Created: Aug 2016
Project: Web Scraping Project
Purpose: Scraper Code for Amazon Kindle eBook Site
------------------------------------------------------------------------------
'''

# import libraries
import pandas as pd
from rc import ReviewCorpus
from amazon import AmazonKindle
import sys

# get the command line argument for the Amazon search
query = sys.argv[1]

# initialize the variables to capture price/rating/review (prr) and review corpus
prr = []
rc = ReviewCorpus()
titles = []

# instantiate class for the Amazon Kindle ebook list based on keywords
amz = AmazonKindle(query)

# execute the class iterator method
# build out the prr collection
for a in amz:
    if a:
        title, price, rating, review = a
        rc.add(title, rating)
        prr.append([price, rating, review])
        titles.append(rc.cleanupTitle(title))

# build out the ebook list with frequency count of each rating
rev_corp = []
for key, val in rc.corpus.items():
    key = rc.cleanupTitle(key)
    if key.strip() != "":
        rev_corp.append([key] + val)

# Convert to dataframe and Export Data to csv files
df_revcorp = pd.DataFrame(rev_corp)
df_prr = pd.DataFrame(prr)
df_titles = pd.DataFrame(titles)

# convert dataset to csv format for export readiness
rc_data = df_revcorp.to_csv(index=False, header=['title', 'tr1', 'tr2', 'tr3', 'tr4', 'tr5'])
pr_data = df_prr.to_csv(index=False, header=['price', 'rating', 'review'])
titles_data = df_titles.to_csv(index=False)

# write the entire dataset to file
# (csv_revcorp, csv_prr and csv_titles are file handles assumed to be opened beforehand)
csv_revcorp.write(rc_data)
csv_prr.write(pr_data)
csv_titles.write(titles_data)

# close the files
csv_revcorp.close()
csv_prr.close()
csv_titles.close()
The next step is to vectorize the entire title list and train the neural network on the entire corpus. The Doc2Vec function is called with the chosen hyper-parameters. How the hyper-parameters are set is the most critical step in the vectorization, model build-out, and training. The code is as follows.
# build
model = Doc2Vec(size=int(vec_dim), window=win_size, alpha=alpha,
                min_count=min_count, sample=sample, workers=workers)
model.build_vocab(tList.toArray())
tmpList = tList.toArray()
print('Training Model...')
for e in range(epochs):
    print('\tEpoch ' + str(e))
    # reshuffle the titles, then make one training pass over them
    random.shuffle(tmpList)
    model.train(tmpList)
    # lower the learning rate and hold it fixed for the next pass
    model.alpha -= 0.001
    model.min_alpha = model.alpha
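The loop above lowers the learning rate by 0.001 on every pass. As a small illustration of what that schedule looks like, the sketch below lists the rate in force at the start of each epoch. The starting value of 0.025 and the 10 epochs are assumptions chosen for the example (the actual hyper-parameter values are not given here), and `alpha_schedule` is a hypothetical helper.

```python
def alpha_schedule(start_alpha, epochs, step=0.001):
    """Learning rate at the start of each epoch when it drops by a fixed step."""
    return [round(start_alpha - step * e, 6) for e in range(epochs)]


# Assumed values for illustration only
rates = alpha_schedule(0.025, 10)
print(rates)
```

With these assumed values the rate falls from 0.025 to 0.016, so the number of epochs has to be chosen so that the rate stays positive.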
An inference engine is called to get the most similar set of words from the vectorized cloud. The parameters need to be passed in manually for this version, but can be abstracted easily as appropriate.
# Inference Test - hardcoded for now
inferred = model.infer_vector(["test1", "test2"], steps=15)
similar_vec = model.similar_by_vector(inferred, topn=200)
Finally, the data sets are converted to sets, and the set intersection function is called to obtain the final recommended list of words to use when formulating the eBook title.
# convert the similar words found into set format
# (each entry of similar_vec is a (word, similarity) pair, so keep the word)
top200_vec = set(map(lambda x: x[0], similar_vec))
print(pd.DataFrame(list(top200_vec), columns=["words"])[:35])
top200_count = open('top200_count.csv', 'r')
top200_count = set([word for word in top200_count.read().split('\n')])
# show the intersection set of words (resulting as set format)
intersect = top200_count.intersection(top200_vec)
intersect = pd.DataFrame(list(intersect), columns=["words"])
wordscore = pd.read_csv('top200_countscores.csv', sep=',',
                        names=["words", "score"], skiprows=[0])
final_list = (pd.merge(intersect, wordscore)
              .sort_values("score", ascending=False)[0:25]
              .reset_index().drop("index", axis=1))
print(final_list)
Here are some examples of the output using the weighted word frequency count and Doc2Vec neural network, tried with various keyword combinations, and final result sorted using Amazon’s relevancy score.
Future Next Steps
It would be an interesting exercise to use the recommended bag of words to come up with final title recommendations. I believe that a deep learning model built with Keras on Theano could formulate human-readable titles using natural language generation.