Development of an online reputation monitor

GJC Venter
21735514

Dissertation submitted in fulfilment of the requirements for the degree Magister in Computer and Electronic Engineering at the Potchefstroom Campus of the North-West University

Supervisor: Prof WC Venter
November 2014

Declaration

I, Gerhardus Jacobus Christaan Venter, hereby declare that the dissertation entitled "Development of an online reputation monitor" is my own original work and has not already been submitted to any other university or institution for examination.

________________
G.J.C. Venter

Acknowledgements

I would like to thank:
• My father and study-leader Prof W.C. Venter for his support, assistance and guidance throughout this research.
• My mother Dr. A. Venter for proofreading some of my articles, supporting me throughout this research and bringing me a cup of coffee whenever I needed it most.
• Ornico House Ltd for their financial support as well as providing technical assistance.
My thanks to anyone else who contributed to the project and is not mentioned above. All glory unto Him.

Abstract

The opinion customers hold of a company is very important, as it can influence the company's profit. Companies often get customer feedback via surveys or other official methods in order to improve their services. However, some customers feel threatened when their opinions are asked for publicly and thus prefer to voice their opinions on the internet, where they take comfort in anonymity. This form of customer feedback is difficult to monitor, as the information can be found anywhere on the internet and new information is generated at an astonishing rate.

Currently there are companies, such as Brandseye and Brand.Com, that provide online reputation management services. These services have various shortcomings: they are costly and they are incapable of accessing historical data. Companies are also not able to purchase the software outright and can only use it on a subscription basis.

The design proposed in this document will be able to scan any number of user-defined websites and save all the information found on the websites in a series of index files, which can be queried for occurrences of user-defined keywords at any time. Additionally, the software will also be able to scan Twitter and Facebook for any number of user-defined keywords and save any occurrences of the keywords to a database. After scanning the internet, the results will be passed through a similarity filter, which will filter out insignificant results as well as any duplicates that might be present. Once passed through the filter, the remaining results will be analysed by a sentiment analysis tool, which will determine whether the sentence in which the keyword occurs is positive or negative. The analysed results will determine the overall reputation of the keyword that was used.

The proposed design has several advantages over current systems:
• Owing to the modular design, several tasks can execute at the same time without influencing each other. For example, information can be extracted from the internet while existing results are being analysed.
• Because the user provides the keywords and websites that the system will use, the user has full control over the online reputation management process.
• By saving all the information contained in a website, the user will be able to take historical information into account to determine how a keyword's reputation changes over time.
Saving the information will also allow the user to search for any keyword without rescanning the internet. The proposed system was tested and successfully used to determine the online reputation of many user defined keywords. Disstertation keywords: Online Reputation Monitor, Web crawler, Facebook, Twitter, dtSearch, Sentiment Analysis. iv Table of Contents List of Figures ____________________________________________________________ vii List of Tables ____________________________________________________________ viii List of Abbreviations _______________________________________________________ ix Chapter 1 Introduction ______________________________________________________ 1 1.1. Scenario ________________________________________________________________ 2 1.2. The problem ____________________________________________________________ 2 1.3. Project objectives ________________________________________________________ 3 1.4. Research methodology ____________________________________________________ 4 1.4.1. Study existing ORM software ___________________________________________________ 4 1.4.2. Study existing tools ____________________________________________________________ 4 1.4.3. Design the ORM system________________________________________________________ 5 1.4.4. Experiments _________________________________________________________________ 5 1.4.5. Conclusion and recommendations _______________________________________________ 5 1.5. Dissertation Outline ______________________________________________________ 5 Chapter 2 Background and Literature Study ____________________________________ 7 2.1. The need for ORM _______________________________________________________ 8 2.2. How does ORM work? ____________________________________________________ 9 2.3. ORM components _______________________________________________________ 10 2.3.1. Web crawlers _______________________________________________________________ 11 2.3.2. Social network crawler _______________________________________________________ 14 2.3.2.1. Twitter ___________________________________________________________________ 14 2.3.2.2. Facebook _________________________________________________________________ 15 2.3.3. String similarity algorithm ____________________________________________________ 16 2.3.4. Sentiment analysis ___________________________________________________________ 17 2.3.4.1. Lexical approach __________________________________________________________ 17 2.3.4.2. Machine learning approach _________________________________________________ 18 2.3.4.3. Optimal approach _________________________________________________________ 18 2.4. Existing solutions _______________________________________________________ 19 2.4.1. Brandseye __________________________________________________________________ 19 2.4.2. Brand.Com _________________________________________________________________ 20 Chapter 3 Design __________________________________________________________ 22 3.1. The Design _____________________________________________________________ 23 3.1.1. The Back-End _______________________________________________________________ 24 3.1.2. The Front-End ______________________________________________________________ 25 3.1.3. The Website _________________________________________________________________ 27 3.2. Component selection _____________________________________________________ 28 3.2.1. Web crawlers _______________________________________________________________ 28 3.2.1.1. 
dtSearch Engine ___________________________________________________________ 28 3.2.1.2. HTML Agility Pack ________________________________________________________ 30 v 3.2.1.3. Open-source web crawlers __________________________________________________ 32 3.2.1.3.1. Abot web crawler: ______________________________________________________ 33 3.2.1.3.2. Tenteikura ____________________________________________________________ 34 3.2.1.3.3. Weaver _______________________________________________________________ 34 3.2.1.4. Speed comparisons _________________________________________________________ 35 3.2.1.5. Essential features __________________________________________________________ 36 3.2.1.6. Additional feature comparison _______________________________________________ 37 3.2.1.7. Customizability and cost comparison _________________________________________ 38 3.2.1.8. Web crawler conclusion ____________________________________________________ 39 3.2.2. Social Network Crawlers ______________________________________________________ 40 3.2.2.1. Twitter API _______________________________________________________________ 40 3.2.2.1.1. TweetInvi and StreamInvi _______________________________________________ 41 3.2.2.1.2. Linq2Twitter __________________________________________________________ 42 3.2.2.1.3. Effectiveness comparison ________________________________________________ 44 3.2.2.1.4. Existing features _______________________________________________________ 44 3.2.2.1.5. Complexity ____________________________________________________________ 45 3.2.2.1.6. Customizability ________________________________________________________ 45 3.2.2.1.7. Twitter component conclusion ____________________________________________ 46 3.2.2.2. Facebook SDK ____________________________________________________________ 47 3.2.3. String similarity formula ______________________________________________________ 47 3.2.4. Sentiment analysis tool _______________________________________________________ 49 3.3. Revision of concept design ________________________________________________ 50 3.3.1. The Back-End _______________________________________________________________ 50 3.3.2. The Front-End ______________________________________________________________ 51 3.3.3. The Website _________________________________________________________________ 52 Chapter 4 Implementation __________________________________________________ 53 4.1. The Back-End __________________________________________________________ 54 4.1.1. Web crawler ________________________________________________________________ 54 4.1.2. Twitter API _________________________________________________________________ 61 4.1.3. Facebook API _______________________________________________________________ 64 4.1.4. Complete Back-End implementation ____________________________________________ 66 4.2. The Front-End __________________________________________________________ 67 4.2.1. dtSearch Engine _____________________________________________________________ 67 4.2.2. Similarity Filter _____________________________________________________________ 67 4.2.3. Sentiment analysis tool _______________________________________________________ 70 4.2.4. Complete Front-End implementation ___________________________________________ 71 4.3. The Website ____________________________________________________________ 72 4.4. Final method of operation ________________________________________________ 72 Chapter 5 Results _________________________________________________________ 74 5.1. 
Process ________________________________________________________________ 75 5.2. The Back-End __________________________________________________________ 75 5.2.1. Web crawler ________________________________________________________________ 75 5.2.2. Twitter Scanner _____________________________________________________________ 78 5.2.3. Facebook Scanner ___________________________________________________________ 79 5.3. The Front-End __________________________________________________________ 80 5.3.1. Web crawler result generator __________________________________________________ 80 5.3.2. Twitter scanner results _______________________________________________________ 84 vi 5.3.3. Facebook scanner results ______________________________________________________ 84 5.3.4. Online reputation calculation __________________________________________________ 85 5.4. The Website ____________________________________________________________ 86 Chapter 6 Conclusion and Recommendations ___________________________________ 90 Appendix A Index files ______________________________________________________ 93 Appendix B Conference Presentations _________________________________________ 96 Bibliography _____________________________________________________________ 105 vii List of Figures Figure 1: Overall process of an ORM system _____________________________________________________ 9 Figure 2: Detailed process of an ORM system ___________________________________________________ 10 Figure 3: Basic Crawler Architecture __________________________________________________________ 12 Figure 4: Crawl Depth Illustration ____________________________________________________________ 12 Figure 5: Proposes System Architecture ________________________________________________________ 23 Figure 6: Proposed Back-End Architecture _______________________________________________________ 24 Figure 7: Proposed Front-End Architecture _____________________________________________________ 26 Figure 8: Proposed Website Architecture _______________________________________________________ 27 Figure 9: dtSearch Engine search method _______________________________________________________ 29 Figure 10: Document Model __________________________________________________________________ 31 Figure 11: HTML Agility Pack search method ___________________________________________________ 32 Figure 12: Abot search method _______________________________________________________________ 33 Figure 13: Weaver search method _____________________________________________________________ 35 Figure 14: Twitter API Comparison____________________________________________________________ 41 Figure 15: StreamInvi scanning architecture ____________________________________________________ 42 Figure 16: Linq2Twitter search method _________________________________________________________ 43 Figure 17: Final Back-End Architecture ________________________________________________________ 51 Figure 18: Final Front-End Architecture _______________________________________________________ 52 Figure 19: Single instance web crawler and multiple instance web crawler operation ____________________ 55 Figure 20: Web crawler internet traffic _________________________________________________________ 58 Figure 21: More efficient internet traffic ________________________________________________________ 59 Figure 22: Verify amount of web crawlers _______________________________________________________ 60 Figure 23: Database save methods comparison 
__________________________________________________ 63 Figure 24: Complete Back-End implementation __________________________________________________ 66 Figure 25: Final Front-End Architecture _______________________________________________________ 71 Figure 26: Final ORM operational flow ________________________________________________________ 72 Figure 27: Web crawler execution time _________________________________________________________ 76 Figure 28: Web crawler URLs and Links scanned ________________________________________________ 76 Figure 29: Web crawler URLs and Links scanned per second _______________________________________ 77 Figure 30: Twitter results breakdown __________________________________________________________ 78 Figure 31: Facebook results breakdown ________________________________________________________ 79 Figure 32: Screenshot of web crawler results ____________________________________________________ 81 Figure 33: Results report of web crawler _______________________________________________________ 82 Figure 34: Regenerated web page of web crawler result ___________________________________________ 83 Figure 35: Web crawler results review _________________________________________________________ 84 Figure 36: Twitter results processor ___________________________________________________________ 84 Figure 37: Facebook results processor _________________________________________________________ 85 Figure 38: Profile overview __________________________________________________________________ 86 Figure 39: Website results – Internet ___________________________________________________________ 87 Figure 40: Website results – Twitter ___________________________________________________________ 87 Figure 41: Website results – Facebook _________________________________________________________ 88 Figure 42: Website - Overall sentiment _________________________________________________________ 89 viii List of Tables Table 1: Web crawler speed comparisons _______________________________________________________ 36 Table 2: Essential features comparison _________________________________________________________ 37 Table 3: Web crawler decision matrix __________________________________________________________ 39 Table 4: Twitter component comparisons _______________________________________________________ 44 Table 5: Twitter weighted averages ____________________________________________________________ 46 Table 6: Web crawler threading comparison, 1Mb/s ______________________________________________ 56 Table 7: Web crawler threading comparison, 16Mb/s _____________________________________________ 57 Table 8: Twitter record process; no filter vs filter _________________________________________________ 62 Table 9: New/Filtered Facebook results per minute _______________________________________________ 65 Table 10: Similarity filter tuning table __________________________________________________________ 68 Table 11: 20% to 30% similarity filter investigation _______________________________________________ 69 Table 12: AlchemyAPI multithreading __________________________________________________________ 70 Table 13: Amount of web crawler results ________________________________________________________ 80 ix List of Abbreviations API: Application programming interface ORM: Online reputation management SDK: Software development kit 1 Chapter 1 Introduction This chapter will serve as an introduction for this dissertation. 
The chapter will start by providing a short scenario that demonstrates how consumer opinions influence a company, followed by a general overview of the need for and use of an ORM system and the problems this research aims to address. To finish the chapter, the overall objectives of this research will be stated, followed by a brief outline of the methodology this research will follow and a general outline of the following chapters.

1.1. Scenario

The Xbox One, the third generation of home entertainment and video game consoles from Microsoft™, was unveiled on 21 May 2013. While technologically superior to its predecessors, the console caused controversy amongst critics and consumers due to strict digital rights management policies, such as requiring the user to connect the console to the internet every 24 hours and blocking the use of pre-owned games. Due to these restrictions, the perception of the Xbox One by the online community was largely negative. Many unhappy customers used the internet to voice their concerns on blogs and social networking sites such as Facebook and Twitter, with many of them planning to purchase one of the Xbox One's competitors instead. Microsoft has listened to the feedback and changed many of the Xbox One's policies since the original announcement, but many consumers still hold a negative perception of the console, which influences its sales up to the present day.

1.2. The problem

Word-of-mouth communication is considered to be a valuable marketing resource and is often underestimated. It includes all forms of information exchange among customers regarding characteristics, usage and experiences with particular products, brands or companies [1]. According to Reichheld [2], the tremendous cost of marketing and other promotions makes it hard for a company to grow profitably. Reichheld believes that the only path to a profitable growth rate lies in the company's ability to get loyal customers to become the company's marketing department by sharing positive information or experiences involving the company. His research showed that there is a strong correlation between a company's growth rate and the number of customers who are likely to recommend the company's services to somebody else.

Most companies know this and employ techniques such as focus groups and surveys to generate various statistics, as detailed in [3]. However, these methods are not always effective; consumers often feel under pressure when their opinions are asked for publicly and therefore adjust their answers to avoid any potential confrontation. Instead, consumers often voice their opinions on the internet by making use of blogs and/or social networking sites, where they take comfort in anonymity. Most companies are aware of this but are unable to generate statistics from these sources as they are too numerous and new information is generated too rapidly. Therefore companies require computerized techniques that allow them to monitor their online reputation.

1.3. Project objectives

Determining online reputation is not a new field and has been done for years by organizations such as BrandsEye [4] and Brand.Com [5]. However, the services these companies offer have several limitations: in order to make use of the ORM services a company has to pay an ORM company a monthly fee which often amounts to thousands of rands.
At the time of writing, the "small" package at BrandsEye costs $500 per month (R5 500 at an exchange rate of R11 = $1) and the "medium" package $800 per month (R8 800 at an exchange rate of R11 = $1) [6]. In order to lessen this cost many companies would rather opt to purchase ORM software and perform the reputation monitoring themselves, but none of the ORM companies offer such an option.

Another problem with existing ORM systems is that many of them do not scan all available information sources, such as web pages and social networking posts, or they scan information sources that are not relevant to a specific brand, for example looking for "CNA" on websites that only contain articles about fast-food restaurants. This either causes a significant portion of customer opinions to be missed or leads to the collection of too many results, which then need to be filtered out.

Lastly, existing ORM services cannot access historical data, as information for a brand is only collected from the present day onwards. If the user wants to add a new brand, they have to wait while the ORM system starts scanning for the new brand before a general overview can be calculated. This also eliminates any quick-search functionality, which would otherwise be a useful feature.

Many of these limitations are implemented in order to make the ORM system as user friendly as possible while still providing the client with the necessary service. However, this makes it very hard for the user or company to customize the ORM service to their specific needs, which may lead to subpar results. As such, the goal of this research is to develop an ORM system that presents the user with as much control as possible while solving the problems mentioned above. To be successful, the ORM system must be capable of:
• scanning a list of user-specified web pages at a sufficient rate to ensure that all the results are kept up to date;
• scanning social networking sites such as Twitter and Facebook for public opinions;
• storing all the gathered information in a database or other storage system to allow the ORM system access to historical data;
• analysing the gathered information for user-specified keywords in real time and reporting information regarding any keyword occurrences, such as the location and sentiment of the mention;
• displaying the analysed mentions to any interested party.

To achieve these goals the implementation of the new ORM system will primarily make use of existing components; there is no use in reinventing the wheel. Available components will be evaluated by measuring their performance according to criteria that are considered important for this research. After the evaluation, the components that best match the criteria will be selected.

1.4. Research methodology

In order to complete this project the following methodology will be used.

1.4.1. Study existing ORM software

Existing ORM systems will be studied to obtain information about the services these systems provide, how users interact with these systems and the components that make up an ORM system. The ORM systems that will be investigated are:
• BrandsEye
• Brand.Com

1.4.2. Study existing tools

The components used in these ORM systems will be studied in more detail. These components include:
• a web crawler;
• a social network scanner;
• a string similarity filter;
• a sentiment analysis tool.

1.4.3. Design the ORM system

A new system will be designed once enough information has been gathered regarding the internal operation of an ORM system as well as the components that make up such a system.
The design will prioritize customizability, allowing the user to control as much of the process as possible.

1.4.4. Experiments

The new ORM system will be tested to determine the effectiveness and accuracy of the system. These tests include:
• using different components and implementations;
• altering the number of active web crawling instances;
• using different computer hardware and internet connections.

The results from the experiments will be used to optimize the software as well as to determine the strengths and limitations of the system.

1.4.5. Conclusion and recommendations

Conclusions on the effectiveness of the system will be drawn after all tests have been executed, and future research areas that could improve the system will be highlighted.

1.5. Dissertation Outline

1. Introduction. This is the current chapter, which gives some information regarding the factors that inspired the research, the problem which the research aims to solve and the methodology that will be used to solve the problem.
2. Background and literature study. In this chapter the use of ORM systems will be clarified and existing ORM systems will be investigated. The components that make up an ORM system, which include web and social network crawlers, similarity filters and sentiment analysis tools, will also be discussed.
3. Concept design. The new ORM system will be conceptualized and all available components will be investigated and compared, whereby the most suitable component will be identified and selected for the new ORM system. After all the components have been selected, the concept for the new ORM system will be revised and any changes that are required by the components will be incorporated into the design.
4. Implementation. The selected components will be implemented and optimized. Any other significant features of the software, such as the use of multithreading and text filters, will be discussed as well.
5. Tests and results. All tests that were carried out on the system, their respective results and their influence on the software will be discussed.
6. Conclusion and recommendations. In the last chapter a brief overview of the research will be given, along with the objectives that were achieved and a summary of the results. Potential future work and research ideas will also be discussed for anyone who wishes to continue with the project.
7. Appendices. Appendices are additional chapters that give extra information on specific concepts that are either mentioned or used in this dissertation.

Chapter 2
Background and Literature Study

This chapter will provide the reader with detailed information regarding the importance of customer opinions for company sales and why companies require ORM services. Next, the components that make up an ORM system will be discussed by first providing in-depth information about each component, followed by its method of operation and its interaction with other ORM components. Lastly, existing ORM systems will be evaluated and the advantages and disadvantages of several existing ORM solutions will be listed and discussed.

2.1. The need for ORM

The goal of any business venture is to produce a profit. The most common way this can be achieved is either by selling goods or by delivering a service. For many businesses profit is so important that profit margins are the starting point for any budget planning.
However, estimating future sales is one of the most difficult tasks any business faces, as it is difficult to predict the number of potential customers with reasonable accuracy. Therefore businesses try to influence the general public into becoming potential customers by means of advertising [7].

Advertising is a mass communication tool that communicates the same message to every member of the public. According to Ayanwale [7], advertising is used to establish a basic awareness of a product or service in the potential customer by providing selected information about the product or service. Advertising can be used to great effect: Ayanwale's research [7] shows that advertising has a major influence on customer preferences, and a separate study by Stephen Hoch [8] claims that while consumers state they do not believe everything advertisements claim, advertising does help them to make decisions. Ayanwale also noted that brand preferences do exist and that customers will stay with a tried-and-tested brand even if there are better alternatives on the market.

From the research done by Ayanwale [7] and Hoch [8] it can be seen that brand preferences influence a company's sales. Therefore a company must investigate how the public perceives its brand. In order to do this, the company must first determine the public's opinion regarding the company itself and its associated brands. Retrieving this information is not that simple: customers fear potential backlash from providing their opinions publicly and therefore use the internet, where they take comfort in anonymity, to express their opinions. This presents a problem, as references to the company or brand can potentially be found anywhere on the internet, but scanning the entire internet is impossible as it is simply too large and grows too fast. However, much of the information on the internet is repeated as different websites write articles or posts about the same information. Therefore relevant information can be retrieved by scanning a specific number of websites relevant to the information that is required.

Scanning web pages manually is a very exhausting task, and a number of factors such as wrong interpretation of opinions, typing errors, sickness and fatigue can impact performance. Therefore the ideal solution is to develop a software application in order to lessen the amount of human interaction. Such software is called Online Reputation Monitoring (ORM) software.

2.2. How does ORM work?

Online reputation management (ORM) consists of monitoring various media such as web and social networking sites, detecting relevant content and analysing what people say about an entity [9]. In order to accomplish this, an ORM service must scan the internet for specific keywords, then download and analyse the results before displaying them. This is shown in Figure 1.

Figure 1: Overall process of an ORM system

The three processes demonstrated in Figure 1 should be capable of operating independently from each other; a user must be able to process any web crawler results even if the web crawler is currently busy acquiring new information from the internet. Likewise, a user should be able to view results for a specific day even if another user is busy processing results for a different day. While Figure 1 gives the reader a general overview of the functionality of an ORM system, each component can be further expanded upon, as shown in Figure 2.
Figure 2: Detailed process of an ORM system

In order to acquire information from the internet the user must make use of a web crawler. Web crawlers are programs that explore the World Wide Web, retrieve information according to specific criteria and store any results in a database or some other storage for future use [10]. Unfortunately, for reasons that will be discussed in Section 2.3.2, web crawlers cannot scan social networking sites, and therefore a tool that can extract information from such sites will also be needed.

Once the information from the web crawlers has been retrieved, the results must be processed. Not all results will contain significant meaning, for example tags within a web page that only highlight important aspects of that page. Therefore a similarity filter must be included to filter out results that are deemed too similar to the keyword that was used to detect them. If a result passes through the similarity filter it must be passed to the final part of the result processor, which will calculate the result's sentiment. Once the processing is complete the result must be saved back to the database, where it can be retrieved and shown at will.

2.3. ORM components

As can be seen from Figure 2, an ORM system consists of several processes that work together to produce a result. Therefore, in order to design an ORM system, the components that make up such a system must be investigated. As shown in Figure 2 the components are:
• a web crawler;
• a social network scanner;
• a method of analysing the results from the web and social network crawlers;
• a sentiment analysis tool.

The way web crawler results are generated depends on the web crawler that will be used for this ORM system. This will be discussed in the web crawler component section (Section 3.2.1). The rest of the components will be investigated in the following sections.

2.3.1. Web crawlers

Web crawlers are programs that explore the World Wide Web. A key motivation for designing web crawlers is to retrieve web pages and store them, or any relevant data, for future use [10]. This process is known as web crawling and is most notably employed by search engines to locate new resources on the web. The type of data that can be extracted from web pages depends on the implementation of the web crawler. Some web crawlers are configured to extract only specified phrases [11], while others extract and index each word in a web page for future use [12].

Figure 3 shows the architecture of a basic web crawler [10]. Before a web crawler is started a user must specify a series of seed URLs, which are stored in the frontier, a list of URLs that must be investigated. When the web crawler starts it will load the first URL in the frontier, download the associated web page, scan the page and store any relevant information in a database or local storage. Once the crawler has finished scanning the page it will load the next URL from the frontier and repeat the process until all the web pages in the frontier have been scanned. This is known as the crawling loop.

Figure 3: Basic Crawler Architecture

Crawlers are often configured to scan a website up to a certain depth. This is the extent to which a web crawler will scan pages within a website. Many web sites contain multiple pages, which in turn contain additional subpages. This is illustrated in Figure 4.

Figure 4: Crawl Depth Illustration

The crawl depth for each web site is specified when the user adds a seed URL to the frontier.
If the user wishes to scan only the original page, the crawl depth must be set to 0. If the user sets the crawl depth to 1, the web crawler will scan the seed URL and add the URLs for Page 1, Page 2 and Page 3 to the top of the frontier. Once the web crawler has finished scanning the seed URL it will proceed to download and scan Page 1, followed by Page 2 and finally Page 3. If the user sets the crawl depth to 2, the crawler will add the URLs for Page 4, Page 5 and Page 6 to the top of the frontier when scanning Page 1. Once Page 1 has been scanned, the web crawler will scan Page 4, Page 5 and Page 6 before proceeding to Page 2. This process will be repeated for the links within Page 2 and Page 3. It should be noted that as the crawl depth increases, the number of web pages in the frontier increases as well. Therefore the time it takes the web crawler to scan all the pages in the frontier increases dramatically with the crawl depth.

Web crawler execution ends when all the links in the frontier have been scanned. At this stage a timer is often initialized which will redeploy the crawler at a specific time or after a specified interval.

When not optimized, web crawlers are designed to extract information from websites as fast as possible. While this would be perfect for the applications that use the results from the web crawlers, it adds a massive load on the web servers that house the websites being scanned. For a single web crawler this would not be a problem, but when multiple web crawlers are active simultaneously the web server can run out of resources such as available memory and processing power. This leads to websites becoming unresponsive or stopping altogether, whether or not they are the ones being scanned. Therefore web servers need to manage their resources. Preventing the website from being scanned by web crawlers would be counterproductive, as the information contained in the website would not be scanned by search engines, which would cause the website to become obscure to the online community, but allowing web crawlers to scan the web site at full capacity will impact stability. To solve this problem, websites usually implement a "polite policy": a file named "robots.txt" is added to the website which aids web crawlers by providing specific information such as:
• a list of web files that are archived or contain insufficient information;
• a list of web directories that are not to be scanned;
• different kinds of permissions for specific web crawlers;
• a crawl delay between web pages.

This ensures the web crawler will only scan the correct information, such as the website's articles and content, while excluding files such as style sheets which are only needed to maintain the visuals of the webpage. This "polite policy" also speeds up web crawler execution; by reading the "robots.txt" file web crawlers will scan only the files that contain the information while skipping unnecessary ones. This lessens the time a web crawler spends analysing each website, allowing it to move on to the next website in its frontier at a faster rate.
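As a simple illustration of the crawling loop, crawl depth and polite delay described above, the sketch below shows a minimal single-threaded crawler written for the .Net Framework. It is only a conceptual sketch: the names (SimpleCrawler, CrawlAsync, StorePage) are illustrative placeholders rather than any of the crawler components evaluated in Chapter 3, and the link extraction is deliberately naive.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Minimal sketch of the crawling loop described above: a frontier of
// (URL, depth) pairs, a crawl-depth limit and a polite delay between requests.
class SimpleCrawler
{
    private readonly HttpClient _http = new HttpClient();

    public async Task CrawlAsync(string seedUrl, int maxDepth, TimeSpan crawlDelay)
    {
        var frontier = new Queue<(string Url, int Depth)>();
        var visited = new HashSet<string>();
        frontier.Enqueue((seedUrl, 0));

        while (frontier.Count > 0)
        {
            var (url, depth) = frontier.Dequeue();
            if (!visited.Add(url)) continue;            // skip pages already scanned

            string html;
            try { html = await _http.GetStringAsync(url); }
            catch (HttpRequestException) { continue; }  // unreachable page: move on

            StorePage(url, html);                       // hand the page to the indexer/storage

            // Only follow hyperlinks while the configured crawl depth allows it.
            if (depth < maxDepth)
            {
                foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
                    frontier.Enqueue((m.Groups[1].Value, depth + 1));
            }

            // "Polite policy": wait between requests (e.g. the Crawl-delay from robots.txt).
            await Task.Delay(crawlDelay);
        }
    }

    private void StorePage(string url, string html)
    {
        // Placeholder: a real implementation would index the text or save it to storage.
        Console.WriteLine($"Scanned {url} ({html.Length} characters)");
    }
}
```

A production crawler would additionally honour the remaining robots.txt directives listed above and run several of these loops in parallel, as investigated in Chapter 4.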
2.3.2. Social network crawler

In order to provide a more accurate analysis of a company or brand, social networking sites must be scanned as well. Unlike regular web sites, social networking sites cannot be scanned by web crawlers. Social networking sites only present data once the user has registered, and even when logged in the site will only show data regarding the people the current user is connected to. Therefore the problem is two-fold: normal web crawlers do not possess the necessary authentication capabilities to access the content of social networking sites, and those that do will only have access to limited information.

To allow applications access to some of their data, various social networking sites have designed an interface that allows registered applications to access the site's public data. This interface is known as an Application Programming Interface (API) and is often the only way of interfacing with a social network's database systems. All social networking sites have different APIs and therefore different methods of extracting information. Due to time constraints it is not possible to design a universal method of interfacing with all social networking sites. The research in this dissertation will focus on interfacing with the two most popular social networking sites, namely Facebook and Twitter.

2.3.2.1. Twitter

Twitter is an online social networking service with microblogging capabilities, created by Jack Dorsey in March 2006. Since then the site has grown rapidly and had 500 million registered users in 2012 who posted 340 million messages per day [13]. Twitter allows people to send tweets, short messages with a maximum length of 140 characters, from a computer or almost any mobile device. All users can access the site and read tweets, but only registered users can create and send new tweets. Additionally, registered users can opt to "follow" other registered users, which automatically provides the user with any messages posted by the users they follow. All tweets are public by default, but a sender may choose to send tweets that are visible only to his or her followers. Though a tweet may be about any topic, a study in August 2009 by Pear Analytics showed that more than 70% of tweets are considered either "pointless nonsense" or conversational, the latter being of use to an ORM service [14].

Twitter has created the Twitter Application Programming Interface (API), which allows external applications to access its databases [15]. The Twitter API allows developers to access a certain number of tweets for various purposes, including but not limited to statistic generation, targeted marketing and online reputation management such as this project. Some of the information the API provides includes:
• the tweet sender;
• the tweet content;
• the tweet date;
• the unique tweet URL;
• the tweet language.

In order to use the Twitter API, the user must first create a Twitter account and register an application. Once the application is registered the user will receive:
• a consumer key;
• a secret consumer key;
• an access token;
• a secret access token.

These four keys are used by Twitter to uniquely identify the application and determine which resources the application can access. If the user wishes to request data from the Twitter API, the user must provide the API with the four keys. Once the application has been authenticated, the API will provide the user with an authentication token, which the user will use with any subsequent API calls. After verification the API can be used for a variety of tasks as detailed in the official Twitter documentation [16] (note that this link requires a secure connection; in order to view the page and its content the user must register a Twitter account and enable application development).
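To make the authentication material and the tweet fields listed above concrete, the sketch below shows one way an ORM application could represent them in code. The class names and the ITwitterScanner interface are hypothetical placeholders; the actual Twitter components considered for this design are compared in Section 3.2.2.1 and perform the OAuth authentication with these four keys internally.

```csharp
using System;
using System.Collections.Generic;

// The four keys issued by Twitter when an application is registered (Section 2.3.2.1).
class TwitterCredentials
{
    public string ConsumerKey { get; set; }
    public string ConsumerSecret { get; set; }       // the "secret consumer key"
    public string AccessToken { get; set; }
    public string AccessTokenSecret { get; set; }    // the "secret access token"
}

// The tweet fields listed above that the ORM system would store for every keyword match.
class TweetMention
{
    public string Keyword { get; set; }   // keyword that triggered the match
    public string Sender { get; set; }    // tweet sender
    public string Content { get; set; }   // tweet content
    public DateTime Date { get; set; }    // tweet date
    public string Url { get; set; }       // unique tweet URL
    public string Language { get; set; }  // tweet language
}

// Hypothetical wrapper around whichever Twitter library is selected in Section 3.2.2.1;
// the chosen library uses the credentials above to perform the authentication.
interface ITwitterScanner
{
    IEnumerable<TweetMention> Search(string keyword);
}
```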
2.3.2.2. Facebook

Facebook is an online social networking service that was founded on 4 February 2004 by Mark Zuckerberg at Harvard University. Since then the site has grown rapidly; at the end of January 2014, 1.23 billion users were active on the website every month, while more than 945 million users were connecting via mobile devices. In 2013 Facebook's revenue was $7.85bn, a 55% increase over the previous year [17].

When first registering, the user is asked to create a personal profile, where he or she can specify information such as name, surname, residential address and current interests. Once the profile has been created the user can connect to other people, which the site refers to as friends, who may view each other's profiles and post personal comments. Like Twitter, Facebook gives the user the option to post messages, which may be public or private depending on the user's preference. Messages usually include text, images and occasionally hyperlinks that redirect to other websites. Unlike Twitter, Facebook messages are private by default.

In order to allow developers to access Facebook data with their applications, Facebook has developed the Facebook Software Development Kit (SDK). While the SDK supports several features such as profile management, it also allows applications to access several public posts, which is of use to ORM systems. To use the Facebook API the user must first create a Facebook account by registering on the website. Once registered, the user must enable the developer settings on the account and register a new application in order to receive two Facebook keys:
• an Application ID;
• a secret Application ID.

These two keys are used by Facebook to uniquely identify the user's application and determine whether the user has access to the application. If the user wishes to request data from the Facebook API, the user must first authenticate the Facebook application by using the two Facebook keys. Once authenticated, the user can use the API to perform a large variety of tasks as detailed in the official Facebook Developer documentation.

2.3.3. String similarity algorithm

Once the web or social networking crawlers have finished their crawling processes, all the results for a specific keyword will be returned. However, a certain portion of the web crawler results will have insignificant meaning, such as tags within web pages. Below is an example of such a message:

Keyword: MTN
Retrieved result: MTN Business

The result above only refers to MTN as a business and has no semantic meaning, making it useless for an ORM system. To filter out these results an algorithm must be developed that compares the result to the keyword that was used to detect it and determines whether the result and the keyword differ enough. This will be done by using a mathematical comparison which determines the number of operations that are needed to transform one string into another.
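One common choice for such a comparison is the Levenshtein distance, which counts the single-character insertions, deletions and substitutions needed to transform one string into another. The sketch below is only an illustration of this idea; the actual similarity formula used by the system is selected in Section 3.2.3 and its threshold is tuned in Section 4.2.2.

```csharp
using System;

static class StringSimilarity
{
    // Levenshtein distance: the number of single-character insertions, deletions
    // and substitutions needed to transform string a into string b.
    public static int LevenshteinDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                            d[i, j - 1] + 1),     // insertion
                                   d[i - 1, j - 1] + cost);       // substitution
            }
        return d[a.Length, b.Length];
    }

    // Similarity expressed as a percentage: identical strings score 100%.
    public static double SimilarityPercent(string keyword, string result)
    {
        int maxLen = Math.Max(keyword.Length, result.Length);
        if (maxLen == 0) return 100.0;
        int distance = LevenshteinDistance(keyword.ToLower(), result.ToLower());
        return 100.0 * (maxLen - distance) / maxLen;
    }
}
```

With this measure a fragment such as "MTN Business" is roughly 25% similar to the keyword "MTN", while a full sentence containing the keyword scores far lower, so a threshold can be used to separate insignificant fragments from genuine mentions.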
2.3.4. Sentiment analysis

If the result passes through the similarity filter it must be presented to the user. However, the user often wishes to know whether the content of the message is positive or negative, otherwise known as the sentiment. Sentiment analysis on a computer is not an easy task, as computers cannot guess the meaning of a word from its context as humans can. In fact, the main problem with sentiment analysis is to determine how the sentiments are expressed within the sentences. According to Nasukawa and Yi [18], sentiment analysis involves identification of:
• sentiment expressions;
• the polarity and strength of the expressions;
• the relationship of the expressions to the subject.

Though there are several methods to calculate the sentiment of a piece of text, the two main methods use either a lexical or a machine learning approach [19].

2.3.4.1. Lexical approach

A system that is based on the lexical approach uses a dictionary or lexicon of pre-tagged words. Each word that is present in the text is compared against the dictionary to find its polarity. This indicates whether the word has a positive or negative meaning as well as the strength of the sentiment. After the sentiment of each word has been determined, the sentiment of the given text is calculated by summing the sentiment of each word. If the total sentiment is positive the text has a positive meaning, whereas if the total sentiment is negative the text has a negative meaning. According to Annett [19], the accuracy of such a system varies between 64% and 82%, depending on the statistical metrics and the dictionary that were used.

2.3.4.2. Machine learning approach

A system that uses the machine learning approach uses two components: a series of feature vectors and a collection of tagged corpora. A tagged corpus is a collection of documents that the system uses to initially train itself [20]. Feature vectors are usually a variety of uni-grams, single words from within a document, or n-grams, two or more words from a document that are in sequential order. Other features that are often proposed include the number of positive words, the number of negative words and the length of a document. Both the feature vectors and the collection of tagged corpora are used to train a classifier, which can be applied to an untagged document to determine its sentiment. According to Annett [19], the accuracy of such a system varies between 63% and 82%, but the results are dependent on the features that were selected.

2.3.4.3. Optimal approach

Both the lexical and machine learning approaches have several advantages and disadvantages. While a lexical system is significantly easier to develop and does not have to be trained, it requires a powerful linguistic resource from which decisions can be made, and such a resource can be hard to find. It is also difficult to take sentence context into account when making decisions, for example sarcasm [21]. A lexical system might categorize a sentence such as "The prices at MTN are ridiculously low" as negative due to the negative modifier "ridiculously", which would be wrong as the sentence actually carries a positive sentiment.

With the machine learning approach the dictionary is not required. Current systems that use this technique display a high level of accuracy, but high accuracies can only be achieved by using a representative collection of labelled training texts and a careful selection of features. In addition, classifiers that work in one domain, such as blogs, might give sub-par results in other domains [21].

Research done by Blinov [21] indicates that though machine learning techniques are capable of providing results with a high accuracy, systems based on the lexical approach can achieve matching accuracies using only small dictionaries. He concluded that though the advantages of the machine learning approach are numerous, the lexical approach should not be discarded but should instead be used in conjunction with the machine learning approach.
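As an illustration of the lexical approach described in Section 2.3.4.1, the sketch below sums pre-tagged word polarities to score a sentence. The word list is a tiny illustrative sample rather than a real lexicon, and the sketch is not the sentiment analysis tool selected for this system (see Section 3.2.4); as noted above, it would also mishandle sarcasm such as "ridiculously low".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of the lexical approach: each word is looked up in a pre-tagged
// dictionary and the word scores are summed to give the sentence sentiment.
static class LexicalSentiment
{
    // Tiny illustrative lexicon; a real system would load thousands of tagged words.
    private static readonly Dictionary<string, int> Lexicon = new Dictionary<string, int>
    {
        { "good", 1 }, { "great", 2 }, { "love", 2 }, { "low", 1 },
        { "bad", -1 }, { "terrible", -2 }, { "hate", -2 }, { "expensive", -1 }
    };

    public static int Score(string sentence)
    {
        var words = sentence.ToLower()
                            .Split(new[] { ' ', ',', '.', '!', '?' },
                                   StringSplitOptions.RemoveEmptyEntries);
        // Sum the polarity of every word found in the lexicon; unknown words score 0.
        return words.Sum(w => Lexicon.TryGetValue(w, out int v) ? v : 0);
    }
}

// Usage: a positive total indicates a positive sentence, a negative total a negative one.
// LexicalSentiment.Score("The service at MTN was great") returns 2 (positive).
```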
2.4. Existing solutions

Online reputation management systems are nothing new and have been in use for several years. In addition to performing online reputation calculations, ORM systems often contain additional functionality which is used to offer extra services to attract potential customers. In order to determine what kind of additional functionality is included, the following existing ORM services were studied:
• Brandseye
• Brand.Com

Other ORM systems such as Radian 6, SaidWot and Alterian SM2 were initially included in the investigation, but due to a lack of feedback they could not be investigated.

2.4.1. Brandseye

Brandseye originated in 2004 when consumers wanted to know what people were saying about their brands. At first the Brandseye team manually searched the internet for relevant information using search engines, but this process was too cumbersome and using various free services did not yield any better results. As such, the team began developing their own system capable of monitoring brands online and received their first paying client in 2006 [4].

Brandseye features a web based application with support for the latest versions of Mozilla Firefox and Google Chrome. Microsoft Internet Explorer is not supported and produced faulty results during testing. In order to use the system the user has to enter several 'brands' and 'phrases'. Brands are pre-determined categories under which phrases are stored, whereas phrases are the keywords the system will search for on the web. The system supports various types of searching, such as direct searches, inclusive searches and exclusive searches. With a direct search the system searches for a series of keywords and will present only results in which all the keywords feature. With an inclusive search two keywords are specified and the system will only present results in which both of the keywords feature, though they do not have to follow each other as in the search string. Exclusive searches are the opposite of an inclusive search: the system will present only results in which a primary keyword features and the secondary keywords do not.

Once a phrase has been detected by the system, the number of occurrences for both the phrase and the brand is updated, with the brand showing the total number of hits all the phrases assigned to it have received. The number of hits each phrase received can also be shown individually by the system. Once the system has received any number of results, various reports such as 'number of posts linked to Twitter and Facebook', 'country of author origin' and 'post language' can be generated. The system also allows custom reports to be generated by providing the user with a wizard in which they can specify the report criteria.

The advantages of Brandseye include rapid website scanning and a user friendly interface that is very easy to use. In addition, the web site presents the user with as much control as possible without exposing too much technical functionality. However, the system does not make use of historical data and will only report results for a keyword from the date it was added. Additionally, the system does not automatically calculate the sentiment, which must be added manually by the user, and it is quite expensive to use, costing $500 per month (R5 500 at an exchange rate of R11 = $1) for the small package and $800 per month (R8 800 at the same exchange rate) for the medium package [6].

2.4.2. Brand.Com

Like Brandseye, Brand.Com features a web based application with support for the latest versions of Mozilla Firefox, Google Chrome and Microsoft Internet Explorer.
After the user has successfully registered and entered some keywords for the system to detect, the user is taken to his or her Dashboard, from where he or she can control the system. The Dashboard provides the user with a general overview of the account and the keywords the system must detect. On the Dashboard the user can see the number of positive and negative results and the number of times the keyword was searched per month. Selecting either of these options will generate a graph showing the user the percentage of results that were generated per month for the last 12 months. These results are generated by using a search engine; at the time of investigation Brand.Com had been successfully integrated with Google, Yahoo and Bing. Brand.Com also offers a live feed, in which information from Twitter, Facebook and additional websites that contain the user-provided keywords is automatically listed.

An interesting feature is the ability to add system-provided keywords that are related to the user-provided ones. However, this functionality could not be explored further due to the limitations of the free account.

Based on the abilities of the free account, Brand.Com has some obvious advantages. Brand.Com scans the internet at a significant pace, as new results are usually added at a rate of one per second. The related keywords feature is also a very helpful tool, as it will help users to refine their search results. However, during the time of testing no results were marked as positive or negative, though it is not clear whether the user has to mark results him or herself or whether this functionality is disabled for the free account. In addition, the user has to select the data server location and only a few options are provided, all of them limited to the USA, which will result in the user primarily using US websites in order to detect keyword mentions. It is not clear how this influences results. At the time of writing, using Brand.Com costs upwards of $3 500 (R38 500 at an exchange rate of R11 = $1).

Chapter 3
Design

This chapter will detail the methodology that was followed to design the new ORM system. The chapter will start by stating the goals that the ORM system must accomplish and any restrictions that must be taken into account. Next, a new ORM architecture will be proposed along with motivation for why certain design choices were made. For each section of the new ORM architecture, various existing tools will be researched and evaluated in order to determine whether any pre-programmed tools can be incorporated into the design of the ORM system or whether a new tool must be designed. The chapter will end with a re-evaluation of the proposed ORM architecture to determine whether any of the existing tools require a change in the architecture.

3.1. The Design

As stated in Section 1.3, the main goal of an ORM system is to:
• scan web pages and different social networking sites at a sufficient rate to ensure all results are kept up to date;
• analyse the results to give the user information such as the location, page, paragraph and sentiment of the results;
• use the analysed results to generate reports.

Before a concept design can be done, the requirements and boundaries of the research must be established. At the start of the project the following requirements were established:
• The ORM software must be compatible with Microsoft Windows platforms, with Windows 7 as the oldest acceptable operating system. Other operating systems are optional.
• The software must be written within the .Net Framework, with .Net Framework 3.5 as the lowest acceptable version.
• The system must focus on English results.

Due to these requirements it was decided to use Microsoft Visual Studio 2010 as development environment and Microsoft SQL Server as database management system.

Figure 5: Proposed System Architecture

From a design viewpoint it is a good idea to divide the ORM system into three sections, as shown in Figure 5. The sections consist of:
• the Back-End;
• the Front-End;
• the Website.

3.1.1. The Back-End

To start the ORM process, information must be retrieved from the internet. This will be done by web crawlers and social network scanners, which will be maintained by the Back-End. The information will be passed through several filters which will remove any duplicate and non-English results. The results that pass through the filters will be saved to a database or any designated storage device. Once the Back-End has finished crawling the web and social networking sites it will proceed to wait for a predetermined amount of time before restarting the process.

Figure 6: Proposed Back-End Architecture

Figure 6 shows the proposed Back-End architecture. When the crawling process starts, the Back-End will read the list of websites and keywords from the database and divide the information among the web and social network crawlers. The websites and keywords must be inserted via the Front-End before the web crawling process starts. After dividing the information, the web crawlers will proceed to scan the web sites while the social network crawlers scan the Twitter and Facebook public streams for any mentions of the keywords. Once the web crawlers have finished scanning the sites in their frontiers they will save the information to the database and enter a waiting period, repeating the web crawl after a pre-determined amount of time has passed. The social network scanners will not shut down and will continue scanning until manually stopped.

It can be seen that the Back-End does no calculations other than those that are necessary to maintain the web and social network crawlers as well as their filters. A primary characteristic of the Back-End will be its ability to function with as little human interaction as possible. The only interaction that is required will be to monitor that the Back-End has not crashed or to turn off the system in times of maintenance.

The number of active web crawlers in the Back-End can be changed by a system administrator. This will allow the system administrator to optimally allocate the available internet bandwidth across all the applications that require an internet connection while simultaneously reducing the time it takes the Back-End to scan all the allocated web sites. This will also prevent the internet connection from being overloaded; if the system administrator activates too many web crawlers, some of the web crawlers may not receive enough information to continue operating, which will result in a time-out. This will be further discussed in Section 4.1.1.
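The following sketch summarises the Back-End behaviour of Figure 6: the configured websites are loaded, divided among a configurable number of crawler instances, the crawl runs to completion, and the Back-End then waits for a predetermined period before repeating, while the social network scanners run continuously. All helper methods are placeholders; the actual implementation is described in Chapter 4.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Conceptual sketch of the Back-End loop from Figure 6. Helper methods are placeholders.
class BackEnd
{
    private readonly int _crawlerCount;          // set by the system administrator
    private readonly TimeSpan _waitBetweenRuns;  // predetermined waiting period

    public BackEnd(int crawlerCount, TimeSpan waitBetweenRuns)
    {
        _crawlerCount = crawlerCount;
        _waitBetweenRuns = waitBetweenRuns;
    }

    public async Task RunAsync()
    {
        // The social network scanners run continuously and are started only once.
        _ = Task.Run(() => ScanTwitterStream());
        _ = Task.Run(() => ScanFacebookPosts());

        while (true)
        {
            List<string> websites = LoadWebsitesFromDatabase();

            // Divide the websites among the crawler instances (round robin).
            var batches = websites.Select((url, i) => (url, i))
                                  .GroupBy(x => x.i % _crawlerCount)
                                  .Select(g => g.Select(x => x.url).ToList());

            // Run one crawler instance per batch and wait for all of them to finish.
            await Task.WhenAll(batches.Select(batch => Task.Run(() => CrawlBatch(batch))));

            await Task.Delay(_waitBetweenRuns);   // idle until the next scheduled crawl
        }
    }

    private List<string> LoadWebsitesFromDatabase() => new List<string>();   // placeholder
    private void CrawlBatch(List<string> urls) { /* placeholder: crawl and save results */ }
    private void ScanTwitterStream() { /* placeholder: runs until manually stopped */ }
    private void ScanFacebookPosts() { /* placeholder: runs until manually stopped */ }
}
```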
This will lessen the number of results that will have to be processed in the following steps. The sentiment analysis tool will use the results that have passed through the similarity filter and determine whether the results are positive or negative. The sentiments will be used to determine the overall opinion of the company or brand that is being investigated. Once the sentiments of the results have been calculated they will be saved back to the database. Figure 7: Proposed Front-End Architecture Figure 7 shows the proposed architecture of the Front-End system. Unlike the Back-End, the Front-End has no web crawling capabilities and will be used only for information extraction and refining. The Front-End will also be designed to allow multiple instances of the software to be run in parallel, which will allow multiple system administrators to access the system at the same time and thus increase the number of results that can be processed simultaneously. Unlike the Back-End, it will not be possible to fully automate the result generating processes of the Front-End. Computers cannot (yet) detect the meaning of words based on context, which may lead to incorrect results. This is shown in the example below: Keyword: Kalahari Returned results: ? "I bought this lovely watch from Kalahari for only R299.99" ? "Boy, the Kalahari is hot this time of year." ? "Kalahari has a major special on games, you should go check it out." ? "Rain in the Kalahari, who would have known?" While a human would be fully capable of determining which results refer to the online marketplace and which to the Kalahari Desert, a computer cannot do so. Instead the computer will use all the results that contain the keyword "Kalahari" to determine the online reputation for the keyword, regardless of their semantic meaning, which would greatly influence the results, especially if people are more favourable towards the online marketplace than towards the desert. As such a system administrator must filter through the results provided by the web crawler and must manually determine which results are relevant. Once the system administrator has marked a result as relevant the system will calculate the result's sentiment and category by using the sentiment analysis tool before adding the result to a list that contains all the results that the user has marked as relevant. Once the user has finished marking the relevant results he or she may proceed to review them. Once processed, the results will be saved back to the database, from where they will be used to generate various statistics. 3.1.3. The Website The website will be the final part of the ORM system. A website will be developed that will use the stored, analysed results to visualise and display the data to any interested party. This process is shown in Figure 8. Figure 8: Proposed Website Architecture As shown in Figure 8 the website will not contain any processing functionality and will only be used to display the results on the web. 3.2. Component selection As stated in Section 1.3 the ORM system will make use of existing components. Each of these components will be discussed in the following sections: 3.2.1. Web crawlers Ideally the ORM system would make use of a powerful web crawler such as GoogleBot or BingBot, developed by Google and Microsoft respectively, but unfortunately these crawlers are not available for public use. An investigation found many free and commercially available web crawlers on the internet, such as: ?
A commercial web crawler from dtSearch named the dtSearch Engine [12]. ? A web crawler development kit named the HTML Agility Pack [22]. ? Several open-source web crawlers from GitHub that are written within the .Net Framework. 3.2.1.1. dtSearch Engine Background and features The dtSearch Engine is developed and maintained by the dtSearch Corporation. The company began developing text retrieval systems in 1988 and started marketing its software in Virginia in 1991. Since then the dtSearch Engine has expanded from a desktop application to include web applications [23]. The dtSearch Engine has numerous features that include: ? support for Microsoft Windows and Linux platforms; ? support for C++, Java and the .Net Framework; ? natural language searches, which allow users to phrase search requests in 'plain English'; ? support for entire phrases, Boolean operators, wildcard searches and words that are within proximity of each other; ? support for phonetic searches and for words with variations in their endings, such as applies and applied; ? saving of the web pages where results are located. Method of operation In order to use the dtSearch Engine within a programming environment the user must first specify which websites to crawl as well as their respective crawl depths. Additional information that the dtSearch Engine can use, but which is not compulsory, includes: ? file filters that allow the web crawler to automatically skip specific files; ? the maximum time to spend on a website; and ? the maximum file size to download. When the web crawler is started it will acquire the first website in its frontier and proceed to scan it. The dtSearch Engine will not look for any user-specified keywords on the web; instead it will generate a list containing every word on the page as well as the locations of the words and store the results in an index file, a collection of documents that contains every word the dtSearch Engine has detected as well as the positions where the words were found. This technique allows the dtSearch Engine to search large volumes of text very quickly [24]. Unfortunately, these index files cannot be saved to the database and must be saved to a local storage drive. Should the crawl depth of the web crawler be 0, the web crawler will proceed to scan the next page in the frontier. If not, the web crawler will extract all the hyperlinks in the current page and add them to the top of the frontier, repeating this process until it has reached its specified depth as shown in Section 2.3.1. This process is demonstrated in Figure 9. Figure 9: dtSearch Engine search method The index files are created by an algorithm unique to the dtSearch Engine. As such, only the dtSearch Engine is capable of reading data from the index files. In order to use the data the dtSearch Engine contains several result generating capabilities; once the user provides the system with a keyword the dtSearch Engine is capable of quickly scanning through all the index files and extracting the paragraphs that contain the specified keyword. By using the index files the result generator is also capable of regenerating the web site on which the results were found. For more detail about the index files, see Appendix A. Advantages and disadvantages Using the dtSearch Engine will provide the system with a web crawler that is actively being maintained and currently used worldwide by several companies including ContractIQ [25], Densan Consoltants [26] and American Technology Services Inc [27].
As such no custom web crawler functionality will have to be written, which will save a lot of development time. Unfortunately, the dtSearch Engine is a commercial product and is therefore not free. A trial version can be used to test its functionality, but should the ORM software ever be used commercially the dtSearch Engine will have to be purchased, which would cost $2 500 (R27 500 at an exchange rate of R11 = $1). The user will also not have any control over the internal functionality of the web crawler and will instead have to use the provided settings to customize the dtSearch Engine for the software application. Lastly, due to its use of index files the dtSearch Engine cannot save its results directly to a custom database. 3.2.1.2. HTML Agility Pack Background and features The HTML Agility Pack (HAP) is a library that allows the user to parse web pages that have not undergone any HTML restructuring, for example web pages that have missing sections or tags. Using this functionality the HAP allows the user to build document models that can be used to fix the HTML, as well as provide an easy way to extract information from web pages, which may be used to build web crawlers [22]. HAP features: ? support for Microsoft Windows platforms; ? support for the .NET Framework; ? complete user control over his or her web crawling application. Method of operation As with the dtSearch Engine, the user must specify a list of URLs the HAP must scan, but instead of using index files the HAP makes use of document models, which display document information in a tree-like structure. An example of a document model is shown in Figure 10. Figure 10: Document Model By using the model the user can scan every sentence in the web document, check whether a specific keyword is present and, if so, extract various information such as the redirected URL, the title of the page and any hyperlinks within the document. What happens from here depends on the user: the user can either store the text for later use or leave the web page and continue scanning for more results, depending on what the function of the web crawler must be. Like the dtSearch Engine, the HAP will continue adding links to the frontier until the crawl depth has been reached and stop the crawling process once all the links in the frontier have been scanned. This entire process is detailed in Figure 11. Figure 11: HTML Agility Pack search method Advantages and disadvantages The HAP has two main advantages: it is free and the user will have full control over the web crawling application. Unfortunately the HAP only contains functionality to support the development of web crawlers. The HAP does not contain a web crawler by itself and therefore the user would have to develop and test a web crawler from scratch, something which could take up a lot of time. Additionally, the HAP is open-source and developed by a community of developers. While this may lead to rapid development and the addition of several new features, it also means the community may stop supporting the HAP at any time, which would increase development difficulties, especially if there are any errors within the HAP itself. At the time of writing the HAP was last updated on 10 Jul 2012, which suggests the community may have moved on to other projects. 3.2.1.3. Open-source web crawlers Along with the dtSearch Engine and the HTML Agility Pack, several open-source web crawlers from an open-source repository named GitHub were investigated.
Though an initial search for “web crawler” yielded hundreds of matches the results were restricted to those that satisfied the project requirements, namely web crawlers that operate on Microsoft Windows platforms which use the .Net Framework. Filtering out the irrelevant results gave 44 results, of which the three with the best ratings were selected. At the time of investigation these results were: 33 ? Sjdirect/abot (https://github.com/sjdirect/abot) ? Bolthar/tenteikura (https://github.com/bolthar/tenteikura) ? Dfdemar/Weaver (https://github.com/dfdemar/Weaver) 3.2.1.3.1. Abot web crawler: The Abot system is an open source web crawler kit developed in C#, which has been designed with speed and flexibility in mind. It uses the HTML Agility Pack (Section 3.2.1.2) as its basis and improves on it by automating many of the HTML Agility Pack’s processes such as building document models, multithreading, http requests, scheduling and link parsing. However, the user is still required to design his or her own events for actually processing the page data and generating results. Method of operation As the Abot system uses the HAP as its basis, interfacing with a web crawler developed by using the Abot system differs only marginally from a HAP application. Firstly the user has to specify the URLs in the frontier and configure the web crawler application. The web crawler will proceed to scan the first URL in the frontier and present the data to the user, which can be analysed and stored for later use before the system will move onto the next page. Like the dtSearch Engine and the HAP, the Abot system will continue adding links to the frontier until the crawl depth has been reached and stop the crawling process once all the links in the frontier have been scanned. This entire process is detailed in Figure 12. Figure 12: Abot search method 34 Advantages and disadvantages: Many of the advantages and disadvantages from the HAP carry over to the Abot system. Like its basis the Abot system is free and the user will have full control over the web crawler application. The Abot system compares favourably to the HTML Agility Pack as the user will not have to develop the low- level processes such as multithreading and link parsing which will lower the development time required to create a full web crawler. The Abot system is also actively maintained by the community, with the latest update released on 1 Feb 2014. However, like the HAP the Abot system only contains functionality to aid in the development of web crawlers. While many low-level processes have been pre-developed, many others like result generation and web page recalling must be developed manually. This will significantly increase development time. 3.2.1.3.2. Tenteikura This web crawler is merely an implementation of the HAP. As the HAP and an implementation of the HAP have already been examined in Sections 3.2.1.2 and 3.2.1.3.1 the Tenteikura web crawler will be discarded from the investigation. 3.2.1.3.3. Weaver As explained in Section 2.3.1, web crawlers implement a polite policy in order to prevent the web servers they are scanning from malfunctioning. This is where the Weaver web crawler differs from regular web crawlers; this crawler is designed to extract as much information from its target website by ignoring any polite policies. Method of operation The Weaver crawler is a custom web crawler implementation that does not rely on any external components. 
Once the user has specified a list of seed URLs, the web crawler will start downloading the first URL in the frontier before immediately creating a second instance that will run in parallel with the original one. This instance will download the next URL before repeating the process for the following URLs. Once a thread has finished scanning a URL it will list all the URLs within the web page and scan them by creating a thread for each URL, provided the crawl depth and thread limit have not been reached. The web crawler will stop once all threads have finished scanning. This is shown in Figure 13. Figure 13: Weaver search method Advantages and disadvantages The lack of any external components makes the Weaver web crawler easier to deploy than the other web crawlers, as no external libraries have to be maintained. Its focus on speed is also a major advantage. The Weaver web crawler's largest disadvantage lies in its extensive use of parallel instances. More system resources are used as the number of instances increases, which can lead to a system malfunction if too many instances are created. Another disadvantage is the absence of any result generating capabilities. As such, all result generating capabilities would have to be developed, which can take a significant amount of time. 3.2.1.4. Speed comparisons In order to compare the speed of the web crawlers, each web crawler was timed to see how long it took to scan through the same 100 websites with a crawl depth of 0. Each test was carried out 3 times in order to determine an average. Due to the different methods each web crawler uses to generate results, the tests only determine the time it took to crawl the websites. Table 1 shows the web crawler speed comparisons.
Table 1: Web crawler speed comparisons
Web Crawler          Min (s)    Avg (s)    Max (s)
dtSearch Engine      422.09     463.10     503.11
HTML Agility Pack    486.87     545.22     647.24
Abot                 708.10     730.49     777.92
Weaver               -          -          -
All tests were carried out on the following system: ? Intel Core i7-2600QM 2.4GHz; ? 8GB memory; ? 1Mb/s internet connection speed. During testing it was noticed that the Weaver web crawler was highly unstable and could not complete a single crawl test. Due to this instability the viability of the Weaver web crawler's architecture was called into question and it was dropped from the investigation. 3.2.1.5. Essential features When picking an existing component, care must be taken to select a component that provides as much of the required functionality as possible. If a web crawler does not have some of the required functionality it must be added and tested, which can take a significant amount of time. This can also increase the risk of the system becoming unstable, especially if the added functionality does not interface correctly with the chosen component. For the proposed ORM service, the essential features are: ? An active web crawler: does the component provide a web crawler that is ready to use or must the user develop additional functionality in order to use the system? ? Threading: does the web crawler have internal logic that supports threading or must threading capabilities be added? ? Storage: does the web crawler have a method of saving the results or must the method be developed? ? Result generator: does the crawler have a method of generating results from the web crawl data? The dtSearch Engine features a fully functioning web crawler that is capable of threading and can be immediately used upon installation.
As the results that are gathered by the dtSearch Engine are stored in index files by using a unique algorithm, the dtSearch Engine features a result generator capable of interpreting the index files. Conversely, as the HAP is a web crawler development kit it contains algorithms that support web crawling but does not contain a functioning web crawler that can be used upon installation, nor does it contain any threading capabilities, storage algorithm or result generator. Lastly, the Abot system improves upon the HAP by adding several features such as threading capabilities, but it also does not feature a fully functioning web crawler, storage algorithm or result generator. By using the information in the previous paragraph a feature matrix can be completed that compares the functionality of each web crawler. The feature matrix is shown in Table 2.
Table 2: Essential features comparison
                     Functional web crawler    Threading capable    Storage algorithm available    Result generator available    Final count
dtSearch Engine      Yes                       Yes                  Yes                            Yes                           4/4
HAP                  No                        No                   No                             No                            0/4
Abot System          No                        Yes                  No                             No                            1/4
From Table 2 it can be seen that the dtSearch Engine contains all the essential systems. 3.2.1.6. Additional feature comparison Additional features are features that are not essential to the ORM service but might be used later on. The dtSearch Engine contains the following additional features: ? The user is able to set the crawl depth for each individual web site, e.g. setting the crawl depth for Website A to 1 and the crawl depth for Website B to 2. ? A score will automatically be calculated by the dtSearch Engine when results are generated. The score will indicate the likelihood of the usefulness of the result. ? When a result is generated it will automatically include the URL, the page and the paragraph where the result was located. ? The user will be able to regenerate the web page where results were located. The dtSearch Engine accomplishes this by using the indexed data. ? It can index documents such as Word or PDF files. ? It contains logging capabilities allowing the user to see exactly which web sites were detected and scanned. ? It supports a polite crawling policy. The HTML Agility Pack provides the user with the following additional feature: ? It allows the user to build a document model from which various information, such as titles, headers, footers and content, can be extracted. The Abot system has the following additional features: ? It contains logging capabilities allowing the user to see exactly which web sites were detected and scanned. This is mostly used whenever a web crawler developed by using the Abot system malfunctions. ? It can limit the number of web pages to crawl and the number of times an HTML page may be redirected. This prevents pages from looping indefinitely and will lessen the number of web pages to scan. ? It supports a polite crawling policy. From the above description it can be seen that the dtSearch Engine has 7 additional features that may be of use, the HTML Agility Pack 1 and the Abot web crawler 3. 3.2.1.7. Customizability and cost comparison The dtSearch Engine is closed-source. Its source code can therefore not be modified and the product has to be bought. The user must use the settings that are provided by the dtSearch Engine in order to configure it. Unlike the dtSearch Engine, the HTML Agility Pack and Abot system are open-source and can therefore be customized to a great extent. There are no costs involved in acquiring and using the open-source web crawlers. 3.2.1.8.
Web crawler conclusion To summarize, the dtSearch Engine is a very fast but expensive commercial web crawler that is actively being maintained and developed. The HTML Agility Pack allows the user to develop a web crawler that will fit all requirements of an ORM service but it is in danger of being dropped by the community. The Abot system is a refined version of the HTML Agility Pack that is actively being maintained but still requires the user to develop his or her own web crawler. A weighted average will be used to determine which web crawler will be best suited for this project, which is as follows: ? Speed: 30%. As stated in Section 1.3, one of the main goals of the ORM service is that it must be fast enough to keep the information as updated as possible. As such the web crawler must be capable of both scanning the web and generating results at a rapid pace; ? Essential Features: 30%. Due to the scope of this project the implementation of the web crawler must not take too long. Implementation of the web crawler includes: designing, development, testing and integration into the ORM service; ? Additional Features: 15%. Additional features may increase customer satisfaction; ? Customizability: 15%. How easily the web crawler can be adapted to fit the ORM specification. ? Cost: 10%. The entire software package must be cost affordable for any potential client. These factors are not the only ones that can be used to compare the web crawlers but are deemed the most important for this project. The results are shown in Table 3. Table 3: Web crawler decision matrix Category dtSearch Engine HTML Agility Pack Abot Speed (30%) 30 25 19 Essential Features (30%) 30 0 8 Existing Features (15%) 15 2 7 Customizability (15%) 7 15 12 Cost (10%) 0 10 10 Total (100%) 82 52 56 In Table 3 the weighted averages were applied to each web crawler. It can be seen that despite its cost the dtSearch Engine will be the best web crawler to use with the ORM service. While the HTML Agility 40 Pack and the Abot web crawler may be almost as fast and free, it will take a lot of time to develop an entire web crawler with all the required features. 3.2.2. Social Network Crawlers As explained in Section 2.3.2, API’s must be used to extract information from social networking sites. The following sections will detail how the ORM system will interface with the social networking sites; 3.2.2.1. Twitter API The Twitter API is the name given to a collection of separate components developed by Twitter that each has a separate function. Two of the components within the API allow the user to extract information from the Twitter databases, namely the REST API [15] and the Streaming API [16]. Both components will provide the user with data from the Twitter database by different means: the REST API will allow the user to extract information on a query basis whereas the Streaming API will open a continuous feed of data. Querying the Twitter database with the REST API will grant the user access to historical data but is slow and will cause significant internet overhead. Additionally the REST API limits the user to 180 queries per 10 minute time frame. In contrast, the Streaming API gives the user low latency access to Twitter’s global stream of data; this will present the user with results as they become available but the user will not have access to historical data. 
This is shown in Figure 14. Figure 14: Twitter API Comparison Due to its restrictions and limitations the REST API will not be suitable and therefore the ORM system will make use of the Streaming API. This will leave the user unable to access any historical data, but upon further investigation it was noticed that the user can only extract the last 100 records from the Twitter database by using the REST API. Therefore it would be better to collect results by using the Streaming API and to save all results to a local database where they can be accessed at any given time. To aid with Twitter integration, programmers have developed several free components that can access the REST API and Streaming API. For this project two such components were identified, namely TweetInvi and Linq2Twitter. Both of these components were investigated to see if they could be of use; if neither component is found to be viable, a custom Twitter integration method will have to be developed. 3.2.2.1.1. TweetInvi and StreamInvi TweetInvi [28] is a .Net software development kit (SDK) developed in C# by the TweetInvi Team. The SDK consists of two components: TweetInvi (a component of the TweetInvi package) and StreamInvi. TweetInvi provides the user with an easy way to interface with the Twitter REST API whereas StreamInvi provides the user with a way to interface with the Twitter Streaming API. As it was decided that the ORM system will use the Twitter Streaming API, the system will only make use of the StreamInvi component. By using StreamInvi the user can access two types of streams: a sample stream, which contains unfiltered Twitter data, or a filtered stream, which only presents the user with data that matches a set of conditions. Method of operation To use StreamInvi the user must first pass a validation check by providing the four Twitter keys, as discussed in Section 2.3.2.1. After the application has been verified the user must specify whether to use the sample stream or the filtered stream, provide the keywords which will serve as filters and start the process. The stream will continue scanning Twitter's public data and report a result once a keyword has been detected. This result will be saved to the computer's internal memory, where it will be processed and saved to a database from where it can be accessed later on. The stream will continue scanning Twitter until the user stops the process or the computer is shut down. This process is demonstrated in Figure 15. Figure 15: StreamInvi scanning architecture Advantages and disadvantages The StreamInvi component is very easy to implement within an application. The component also allows the user to easily add more keywords to its filter and gives the user complete control over the results, allowing the user to either save the results or implement additional logic before saving. Additionally, StreamInvi includes several debugging capabilities, which allow the user to identify and handle any potential errors that may occur during software execution. However, the information provided by the StreamInvi component will only identify the user that made the tweet, the time of the tweet and the actual message. Additional filters will have to be developed to detect the language of the tweet in order to filter out incompatible results. TweetInvi and StreamInvi are both open-source components; while the components are actively being maintained, the community may drop the project in the future if they lose interest, which could lead to maintenance problems. 3.2.2.1.2.
Linq2Twitter LINQ is a .NET Framework component developed by Microsoft that adds native data querying capabilities. It is commonly used to extract data from arrays, XML or enumerable classes by using a query structure similar to SQL. 43 By using LINQ’s capabilities a community group has developed an open-source library named Linq2Twitter [29]. The library uses LINQ syntax to handle queries from the Twitter API and present the user with the results. By using the Linq2Twitter library the user can develop Twitter applications capable of sending and receiving posts from their timeline as well as receive a stream of continuous Twitter posts. Method of operation In order to start using the Linq2Twitter library the user must initialize the library and verify the application by using the four Twitter keys. After the application has been verified the user must provide the keywords that will specify which sentences must be returned by the Twitter filtered stream. Once these parameters have been set the stream can be started. The stream will scan Twitter and provide the user with any results that contains one or more of the specified keywords. All results will be saved in the computer’s internal memory where it will be processed and saved to a database where it can be accessed from later on. The stream will continue scanning Twitter posts until the user stops the process or the computer is shut down. Figure 16: Linq2Twitter search method Advantages and disadvantages: Like StreamInvi, Linq2Twitter contains a lot of functionality such as Twitter authentication that would otherwise have to be programmed manually. Once a result has been provided the user will be able to 44 implement any custom functionality that will process the result. Additionally, Linq2Twitter has detected more results than StreamInvi (see Table 4) and the results contains additional information such as the language of the message, the number of replies to the message and the number of the message has been shared. However, Linq2Twitter requires the use of LINQ syntax which is complex to use. Linq2Twitter is also an open-source component, and while it is currently being maintained the community may drop the project in the future. 3.2.2.1.3. Effectiveness comparison Evaluating Twitter components are more difficult than evaluating web crawlers as the number of generated tweets vary by the second. To compensate both applications were run in parallel in order to determine which component would gather the most tweets within the allocated time. Both tests were executed twice. Table 4: Twitter component comparisons Test StreamInvi Linq2Twitter Difference Test 1 1747 1815 3.89% Test 2 1939 2036 5.00% All tests were carried out on the following system: ? Intel Core i7-2600QM 2.4GHz; ? 8GB Memory; ? 1Mb/s internet connection speed. 3.2.2.1.4. Existing features StreamInvi has the following unique features: 45 ? StreamInvi allows the user to easily extract the owner of a Twitter message as well as the date on which it was posted, the actual message and the keyword that was detected. Linq2Twitter has the following unique features: ? Linq2Twitter provides the user a lot of information regarding Twitter messages, including but not limited to the Twitter ID, location and alias of the user that created the message, as well as how many times the message was spread by others. ? Linq2Twitter identifies the language of the message. ? It is indicated within the results whether the message is an original or a reply to another message. 
From the above description it can be seen that Linq2Twitter has 3 additional features that may be of use while StreamInvi only has 1. 3.2.2.1.5. Complexity Implementing the StreamInvi component within a programming environment is very simple as there are no special techniques required in order to interface with the component. Conversely, using Linq2Twitter requires the use of LINQ syntax, which is harder to implement and maintain. 3.2.2.1.6. Customizability While both components are open-source, Linq2Twitter can be more easily adapted to any situation than StreamInvi due to the large amount of information that the Linq2Twitter results contain. Should more information be required from the StreamInvi component, the user would have to make the changes within the component's source code and recompile the entire component, whereas the user can simply extract the required information from the Linq2Twitter results without modifying the component itself. 3.2.2.1.7. Twitter component conclusion To summarize, StreamInvi is very easy to implement and use but provides only limited information. Linq2Twitter provides more information and detects more results than StreamInvi, but is more difficult to implement. To determine which component will be best suited for this project, a weighted average will be used. The weights are as follows: ? Number of results: 30%. As stated in Section 1.3, one of the main goals of the ORM service is that it must be fast enough to keep the information as up to date as possible. As such the Twitter component must be capable of retrieving results as they are being generated without losing any. ? Existing features: 30%. Additional features may increase customer satisfaction. ? Complexity: 20%. How easy it is to implement and modify the software. ? Customizability: 20%. How easily the component can be adapted to fit the ORM specification. These factors are not the only ones that can be used to compare the components but are deemed the most important for this research. The results are shown in Table 5.
Table 5: Twitter weighted averages
Category                    StreamInvi    Linq2Twitter
Amount of results (30%)     27            30
Existing features (30%)     10            30
Complexity (20%)            20            10
Customizability (20%)       10            20
Total (100%)                67            90
In Table 5 the weighted averages were applied to each Twitter scanner. It can be seen that while StreamInvi is quite fast and easy to implement, the limited information provided by its results makes the component less viable than Linq2Twitter. If StreamInvi were to be used, additional filters would have to be implemented in order to filter out non-English results. As such, Linq2Twitter was selected as the Twitter scanning component for this research. 3.2.2.2. Facebook SDK The Facebook SDK for .Net is the only component available to interface with the Facebook system from within the .Net environment. Another set of tools, known as Parse, is also available but must be purchased before it can be used. Additionally, a test application demonstrated that the Facebook SDK for .Net supports all the features required for the ORM service, and it will therefore be used. 3.2.3. String similarity formula Once the web or social network crawlers have been initialized, all results for a specific keyword will be returned. However, a certain portion of the web crawler results will have insignificant meaning, such as tags within web pages. Manually filtering out such results will cost the user valuable time.
Therefore similarity filters are often included alongside web crawlers; these are filters that indicate the number of operations needed to transform one string into another. There are varying kinds of string similarity filters: some, such as the Levenshtein-Munkres algorithm [30] [31], analyse the strings and determine the number of changes that are needed to literally transform one string into another, while others, such as those proposed by Mihalcea [32] and Li [33], extract and compare the meaning of the given strings. As the user would simply wish to filter out sentences that are deemed too similar to the keyword, a technique that looks at the physical sentence is needed. As such the string similarity techniques proposed by Mihalcea [32] and Li [33] will not be ideal for this solution and a technique based on transforming one string into another will be used. Two such algorithms were identified, namely the Damerau-Levenshtein algorithm [34] and the Levenshtein-Munkres algorithm [30] [31]. The Damerau-Levenshtein algorithm was first proposed in 1964 by Fred J. Damerau in order to cope with errors in a coordinate indexing and retrieval system [35]. Damerau distinguished between four different edit operations, namely insertion, deletion, substitution of a single character and transposition. According to his research these four edit operations correspond to more than 80% of all human misspellings. However, his research only took misspellings that require at most one edit operation into account. The corresponding edit distance was addressed by V.I. Levenshtein, who developed the Levenshtein algorithm, which can be used to measure the similarity between two strings by calculating the least number of edit operations necessary to transform one string into another [30] [36]. When combined, the Damerau-Levenshtein algorithm can be used to calculate an edit distance that allows multiple edit operations. This algorithm is mostly used to improve spell checkers, although the Damerau-Levenshtein algorithm has also been used in biology to measure the variations between DNA strands. Like the Damerau-Levenshtein algorithm, the Levenshtein-Munkres algorithm also uses the Levenshtein algorithm, but combines it with an algorithm developed by H. Kuhn and J. Munkres. The Kuhn-Munkres algorithm is capable of solving linear assignment problem (LAP) instances. The algorithm was first published by H. Kuhn in 1955 and later improved upon by J. Munkres in 1957 [31]. The Levenshtein-Munkres algorithm calculates its similarity score by following three steps: ? Breaking up the result and the keyword into lists of their individual terms. These terms are called tokens. ? Computing the similarity between the tokens. This is done by using the Levenshtein algorithm, which compares tokens by providing a set of rules that calculates the cost of changing one token into another by using a series of one-step operations. Each one-step operation has an associated cost: substitution costs two units whereas insertions and deletions cost one unit each. ? Computing the similarity between the two token lists. This is done by the Kuhn-Munkres algorithm. When executed, the algorithm will give the user a percentage that indicates the similarity between the sentences.
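The Levenshtein step of this approach is simple to implement. The sketch below is a minimal C# illustration of the classic edit-distance calculation together with a simple length-normalised similarity percentage; it covers only the Levenshtein part, not the Kuhn-Munkres token assignment, and uses the standard unit costs rather than the 2/1/1 weighting described above, so its percentages will not match the Levenshtein-Munkres figures quoted in this section.

using System;

static class StringSimilarity
{
    // Classic Levenshtein edit distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn 'a' into 'b'.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                            d[i, j - 1] + 1),     // insertion
                                   d[i - 1, j - 1] + cost);       // substitution
            }
        return d[a.Length, b.Length];
    }

    // Express the distance as a percentage so that a threshold such as
    // "discard results more than X% similar to the keyword" can be applied.
    // Example: SimilarityPercent("It's Audi", "Audi") is roughly 44% with these
    // settings, which differs from the 50% the token-based method reports.
    public static double SimilarityPercent(string a, string b)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        if (maxLen == 0) return 100.0;
        return (1.0 - (double)Levenshtein(a, b) / maxLen) * 100.0;
    }
}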
While both algorithms measure the distance between different sentences, the Levenshtein-Munkres algorithm will be the most suitable for this project as it expresses the similarity between the sentences as a percentage. The Damerau-Levenshtein algorithm only provides the user with the number of edit operations that are needed to change one sentence into another, which makes it difficult to determine whether a sentence differs enough from the keyword. This is shown in the following result: Sentence: It's Audi; Keyword: Audi; Damerau-Levenshtein distance: 4; Levenshtein-Munkres similarity: 50%. As can be seen from the example above, the Levenshtein-Munkres algorithm is easier to interpret than the Damerau-Levenshtein algorithm. Therefore the Levenshtein-Munkres algorithm will be used within this project. 3.2.4. Sentiment analysis tool To finish the ORM process the sentiment of a record must be calculated. After the record has been processed by the similarity filter its sentiment must be calculated by a sentiment analysis tool. There are various sentiment analysis tools available, such as AlchemyAPI [37] and the Salience Engine provided by Lexalytics. AlchemyAPI was founded in 2005 with the aim of providing advanced language analysis tools for its customers. In order to aid developers AlchemyAPI provides the user with an SDK that can be used within most programming environments. In order to use the SDK the user first needs to register for an account on the AlchemyAPI system, which will provide the user with 2 keys. In order to determine the sentiment of a sentence the user must use the 2 keys to verify the application and provide AlchemyAPI with the sentence. AlchemyAPI will proceed to analyse the sentence and present the user with the result. AlchemyAPI is a commercial service. This project used the free version of AlchemyAPI, which limits the user to 1000 requests per day. There are paid options that allow more requests per day; the cheapest of these allows the user 5 000 requests per day for $250 per month. For more pricing options the user must contact AlchemyAPI support. The Salience Engine from Lexalytics provides the user with a multi-lingual text analysis engine that can be integrated into systems for business intelligence and social media monitoring. The Salience Engine supports various programming frameworks, including the .Net Framework, and includes many features such as sentiment analysis, named entity extraction and summarization. On features alone the Salience Engine would be better suited for this project, but in order to acquire the Salience Engine the user would have to contact Lexalytics. Additionally, there are no pricing options available on the website, whereas the AlchemyAPI SDK can easily be acquired after registration. Based on availability AlchemyAPI would be best suited for this project, but thought must be given to its accuracy; if the SDK cannot accurately calculate the sentiments of the given texts it will be of no use for the project. In order to determine the accuracy of the system it was used to calculate the sentiments of 36 random texts which were retrieved from Twitter. When the sentiment was manually determined by reading each message and deciding whether its content is positive or negative, it was found that the AlchemyAPI has an accuracy of 77.78% (28 of the 36 texts).
While this may not seem high enough, a study by Biz360 [38], a social media monitoring measurement and engagement platform, has shown that humans only agree to sentiments 79% of the time and even if a system would have an accuracy of 100% humans would disagree with the calculated sentiment 21% of the time. As the AlchemyAPI achieves an accuracy that matches human interpretation and can easily be acquired, it can be used for this research. 3.3. Revision of concept design Some of the components chosen in Section 3.2 cannot be implemented as envisioned in the original concept design due to operating differently than expected: ? The dtSearch Engine cannot save data directly to a database. Instead the dtSearch Engine creates index files on the local storage device which can be used later on. To accommodate this change the Back-End needs to be changed so it can save data to both a local storage device and to the database and the Front-End must be changed in order to read and filter the data. ? The Alchemy API does not always identify the sentiment correctly, most notably where the message is written in a sarcastic tone. Therefore the user needs to have an option where he or she may review the sentiment or assign one manually. In order to accommodate these changes the original design must be altered. 3.3.1. The Back-End The only change made to the Back-End was to allow the dtSearch Engine to access a storage device where the engine may store the crawling indexes. The final Back-End design is shown in Figure 17. 51 Figure 17: Final Back-End Architecture From Figure 17 it can be seen that the web crawlers, Facebook SDK and the Twitter API all acquire their information from the internet but whereas the Facebook SDK and the Twitter API save their data to the database, the web crawlers save their information to a separate storage device. The Front-End will be able to locate both the database and the storage device to extract the necessary data. 3.3.2. The Front-End The Front-End design required two changes; it required functionality in order to read the web crawler indexes from the storage device, and an option for the user to either add an existing sentiment to a processed result or to edit an existing one. The final Front-End architecture is show in Figure 18. 52 Figure 18: Final Front-End Architecture From Figure 18 it can be seen the Front-End will acquire data from the storage device as well as the database. Included in the design is the ability to verify any sentiments that are generated before they are saved to the database. 3.3.3. The Website No design changes were necessary for the website. 53 Chapter 4 Implementation This chapter will detail the process how each of the tools that were chosen in Chapter 3 were implemented and configured within the ORM system as well as a motivation for each of the configurations. At the end of a chapter the overall flow of data will be illustrated. 54 4.1. The Back-End The following components will be used in the Back-End ? The dtSearch Engine as web crawler; ? The Linq2Twitter library as Twitter API interface; ? The Facebook SDK for .NET as Facebook API interface; 4.1.1. Web crawler As explained in Section 3.2.1.1, using the dtSearch Engine within a programming environment requires the system administrator to specify which websites must be scanned, the crawl depth of each website as well as the location of the storage device to which the dtSearch Engine can save its index files. 
The websites, crawl depths and storage location are specified by a system administrator using the Front-End, which will save the web crawl information to a database where it can be accessed by the Back-End. This only tells the dtSearch Engine where it should find its data and save the results, not how the dtSearch Engine should be implemented within a software application. This allows the user to implement the dtSearch Engine according to his own specifications. For this research, a core requirement is to scan web pages as fast as possible in order to ensure the ORM system's information is always up to date, as stated in Section 1.3. During the concept design it was noticed that while the dtSearch Engine may quickly scan a few dozen web pages, scanning the thousands of web pages required for an ORM service will take significantly longer. This happens because the dtSearch Engine only scans a single page at a time while the other pages are kept in a queue. Such an implementation is called a Single Instance Web Crawler. In order to allow more than one web site to be crawled at a time, multiple instances of the dtSearch Engine must be initialized and the web pages divided amongst them. This will allow multiple web pages to be scanned simultaneously and each web crawler instance will have a shorter queue. However, using multiple web crawler instances will require more system resources than a single instance. This is called a Multiple Instance Web Crawler. The differences between a Single Instance Web Crawler and a Multiple Instance Web Crawler are shown in Figure 19. Figure 19: Single instance web crawler and multiple instance web crawler operation In Figure 19, four web pages were assigned to both the Single Instance Web Crawler and the Multiple Instance Web Crawler. As seen, the Single Instance Web Crawler can only process one page, namely Page 1, while Page 2, Page 3 and Page 4 are kept in the queue to be scanned. On the other side the Multiple Instance Web Crawler simultaneously processes Page 1, Page 2 and Page 3 while only Page 4 waits in the queue. This means that a Multiple Instance Web Crawler which uses 3 instances is 200% faster than the Single Instance Web Crawler but uses more resources; with 3 instances running in parallel both the CPU and RAM will have higher loads and the internet connection will carry more traffic. This raises two questions: ? how to initialize more than one web crawler; and ? what is the optimal number of web crawlers to use? There are several techniques capable of initializing more than one web crawler. The user may opt to execute several web crawler applications in parallel, but this technique is tedious to implement and will cause unnecessary overhead as the operating system will have to keep track of each application and not just the web crawler implementation. A better technique would be to make use of threading, a programming technique that creates sequences of instructions, called threads, that are executed in parallel. This technique allows software to execute multiple instructions on the same processor, or to split instructions across various processing units (CPUs). However, like all techniques threading has several disadvantages; initializing too many threads may cause the system to use too much of its resources and cause the software, and possibly the entire system, to become unstable. Therefore the number of threads must be managed and threads must be programmed to release all their resources after the completion of their task.
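To make the threading approach concrete, the sketch below shows one simple way of dividing a frontier among a fixed number of crawler threads. It is an illustration only: the CrawlerPool class and the crawlSite delegate are hypothetical, with crawlSite standing in for whatever a single crawler instance (for example a dtSearch index job) does with one website; it is not the actual Back-End implementation.

using System;
using System.Collections.Generic;
using System.Threading;

class CrawlerPool
{
    // Divide the frontier among a fixed number of crawler instances and run them in
    // parallel threads. Each thread works through its own share of the websites.
    public static void Run(IList<string> websites, int crawlerCount, Action<string> crawlSite)
    {
        var threads = new List<Thread>();

        for (int c = 0; c < crawlerCount; c++)
        {
            int crawlerIndex = c;                       // capture the loop variable for the closure
            var thread = new Thread(() =>
            {
                // Round-robin partition: crawler c handles URLs c, c + crawlerCount, ...
                for (int i = crawlerIndex; i < websites.Count; i += crawlerCount)
                    crawlSite(websites[i]);
            });
            thread.Start();
            threads.Add(thread);
        }

        foreach (var thread in threads)
            thread.Join();                              // wait until every crawler has emptied its queue
    }
}

A round-robin split keeps each thread's queue roughly the same length; the Back-End itself may of course divide its frontier differently.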
Using threading will allow the user to initialize multiple instances of the dtSearch Engine while minimizing unnecessary overhead. This brings us to the second question: what is the optimal number of threads? Initializing too many threads may increase the web crawling speed but will cause the system to become unstable while not initializing enough threads will unnecessary lower the number of web pages than can be scanned per second. In order to determine the optimal number of web crawlers the impact of threading upon the system must be examined. To do this several tests were executed where the system resources and number of websites to be crawled were kept the same but the number of web crawlers were varied. In each test the execution time as well as the number of system resources used was measured. The results are shown in Table 6. Table 6: Web crawler threading comparison, 1Mb/s Threads 1 2 4 8 16 32 Time (s) 2381 1259,83 722,37 573,41 430,03 382,01 CPU (%) 1,5 2,5 4,1 5,0 5,1 6,5 Memory (%) 1,4 2,5 4,1 7,08 14,5 25,1 Network Average (KB/s) 23,67 57,02 98,16 127,18 133,28 125,03 Thread time minimum (s) 2381 1116,83 626,56 469,03 325,67 NA Thread time maximum (s) 2381 1402,83 900,73 822,12 601,09 NA Comment Stable Stable Stable Stable Stable Unstable All tests were carried out on the following system: ? Intel Core i7-2600QM 2.4GHz; ? 8GB Memory; ? 1Mb/s internet connection speed. From Table 6 it can be seen that as the number of threads increased the amount of system resources used by the program increased as well but the web crawler execution time decreased. From 8 threads onwards it can be seen that while the memory usage increased as predicted, the CPU usage differs very little. This occurs because the internet bandwidth has been reached. As such, 8, 16 and 32 threads all extracted the same amount of data from the internet and therefore the CPU did not have to do any additional processing aside from maintaining the web crawlers. 57 In this test the ORM software became unstable at 32 threads. When investigated it was noticed that the limited internet bandwidth prevented all the crawlers from receiving data fast enough, which resulted in several of the crawlers timing out while their frontiers still contained URLs. To verify that the timeouts did indeed occur because of the insufficient internet bandwidth another test was carried out using a 16Mb/s internet connection. Table 7: Web crawler threading comparison, 16Mb/s Threads 1 4 16 32 64 Time (s) 2121,41 538,25 124,29 64,56 48,81 CPU (%) 2,5 5,1 6,1 7,5 Unknown Memory (%) 32,4 35,1 45,5 56,1 Est. 61% Network Average (KB/s) 34,51 151,65 628,95 1008,84 Unknown Comment Stable Stable Stable Stable Stable All tests were carried out on the following system: ? Intel Core i7-2600QM 2.4GHz; ? 8GB Memory; ? 16Mb/s internet connection speed. From Table 7 it can be seen that the internet bandwidth does indeed have an influence on the stability of the system. With a 16Mb/s internet connection the system was stable for 64 threads whereas the system became unstable at 16 threads when a 1Mb/s connection was used. The CPU network usage for the 64 thread test could not be calculated as the software did not execute long enough in order to get an average reading. Further tests showed that the system could support up to 128 crawlers before becoming unstable. The best way to determine the optimal number of threads that can be used for this system is to derive a formula that can be used to calculate the number of threads. 
From Table 6 and Table 7 it can be seen that the main factor limiting web crawler performance is the internet bandwidth. As such, the formula will use the internet bandwidth to determine the optimal number of web crawlers. Using the values from Table 6 and Table 7, it can be seen that each web crawler requires an estimated 35KB/s in order to function correctly. Therefore, an internet connection of 1Mb/s (125KB/s) can theoretically support 4 web crawlers. However, Table 6 showed that a 1Mb/s connection can support up to 16 simultaneous web crawlers. Upon investigation it was noticed that while the internet traffic from a web crawler may peak at 125KB/s, there are times when the web traffic is lower than indicated by the average. This is demonstrated for 2 web crawlers in Figure 20. Figure 20: Web crawler internet traffic Figure 20 shows a possible scenario for the network traffic that may occur when two web crawlers are active. Graph properties: ? X-Axis: Time units ? Y-Axis: Used bandwidth ? Black blocks: Immediate download speed ? Golden line: Download average. Theoretically, two web crawlers will use up to 70KB/s. As shown in Figure 20 this does happen from time to time, as seen at time unit 3 and time unit 9, but there will also be times when the web crawlers are processing the information that has been downloaded and will not be downloading new information. At such a time the moving average will give a false reading indicating that data is still being downloaded even though the network line is inactive; this happens because the moving average uses historical data in its calculations. In order to properly optimize the use of the line bandwidth, the slots where the internet traffic is low must be utilized to provide more web crawlers with data, as shown in Figure 21. Figure 21: More efficient internet traffic Graph properties: ? X-Axis: Time units ? Y-Axis: Used bandwidth ? Black blocks: Immediate download speed ? Green blocks: Additional web crawlers. In Figure 21 the number of web crawlers was doubled. This resulted in more efficient use of the internet bandwidth; when the crawlers indicated by the black squares are busy processing the information they have received, the web crawlers indicated by the green squares will proceed to download their data. Figure 20 and Figure 21 suggest that the number of web crawlers calculated by dividing the line speed by 35KB/s can be squared in order to make maximum use of the internet connection. This would explain why 16 web crawlers were able to operate on a line that otherwise would have supported only 4. Therefore, using the information from Table 6, Table 7, Figure 20 and Figure 21, the general method for determining the number of web crawlers can be summarized as: ? Determine the line speed in KB/s. ? Assuming each web crawler requires 35KB/s in order to operate, calculate the number of web crawlers by dividing the line speed by 35KB/s and rounding the answer to the nearest whole number. ? As a 70KB/s line could support 4 web crawlers instead of 2 and a 128KB/s line 16 instead of 4, square this number of web crawlers (a small helper implementing this estimate is sketched below).
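As a simple illustration of this estimate (not the Back-End's actual code), the helper below applies the three steps directly; the 35KB/s per-crawler figure and the squaring step are taken from the discussion above, and Equation 1 that follows formalizes the same calculation.

using System;

static class CrawlerSizing
{
    // Estimate the number of crawler threads for a given line speed: convert kb/s
    // to KB/s, divide by the ~35KB/s a single crawler needs, round to the nearest
    // whole number, then square the result to exploit the idle slots.
    public static int EstimateCrawlerCount(double lineSpeedKbps)
    {
        double lineSpeedKBps = lineSpeedKbps * 0.125;              // kb/s -> KB/s
        int baseCrawlers = (int)Math.Round(lineSpeedKBps / 35.0);  // crawlers the line supports on average
        return baseCrawlers * baseCrawlers;                        // squared, as motivated above
    }
}

// Example: EstimateCrawlerCount(1000) returns 16 for a 1Mb/s line, matching Table 6.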
Using this information, Equation 1 can be used to determine the optimal number of web crawlers per implementation:
Number of Crawlers = ((LS × 0.125) / 35)²
where LS = internet connection speed in kb/s.
Equation 1: Number of web crawlers for internet connection speed
Using Equation 1 shows that a 1Mb/s internet connection will be able to support 16 web crawlers, as proven by Table 6, and that a 16Mb/s internet connection can support 118 web crawlers, which was proven by a separate test running 128 threads. It should be noted that Equation 1 is merely a close estimation; the actual amount of data that a web crawler uses depends on the websites it has been assigned to crawl, for example some websites can only be downloaded at 30KB/s while others may download at 40KB/s. This is demonstrated by the test using the 16Mb/s internet connection; according to Equation 1 the system should run 118 threads, but the test proved that 128 threads were possible. Therefore, when selecting the active number of web crawlers the system administrator is encouraged to use this formula as a guideline, not as a definitive answer. The number of active threads can be set, at any time the web crawlers are not active, by using an interface within the Back-End. Figure 22: Verify number of web crawlers The method suggested in Figure 22 is a possible way of verifying whether the internet connection can support the number of active web crawlers. Using Equation 1 the system administrator can set the initial number of web crawlers and initiate the web crawling process. To verify whether the internet connection can support the number of active web crawlers it is not necessary to initiate a full crawl of all the websites, as the entire verification process would take too long. Instead the number of websites should be chosen so that each web crawler has 5 websites in its frontier. A different number of websites per crawler can be used, but during development it was found that 5 websites per web crawler was ideal. Once the web crawling process has finished the system administrator must verify whether all the websites have been scanned. A quick way of accomplishing this is by determining whether the last website that was scanned is the last website allocated to the web crawler. If not all of the websites were scanned, the internet connection does not have sufficient bandwidth for all the web crawlers and the number of web crawlers must be decreased. If the web crawl was successful the system administrator can either opt to save the results or increase the number of active web crawlers and restart the process. 4.1.2. Twitter API Unlike the web crawlers, the Linq2Twitter component does not access historical data but connects to a live stream of Twitter data and waits for data that matches specific criteria. The streaming process can be started once the required Twitter keys have been provided and the application has been verified. Once started, the application will wait until it receives a record containing one of the required keywords. When a record is received it will be presented to the user for processing while the library continues waiting for more results. As before, there are various methods of handling the records: ? Should the record be processed or directly saved to the database? ? Should each record be saved to the database as it is retrieved or should records be saved in a group? ? Will a multithreaded application have any benefits?
Whether the record can be processed or not will depend on whether the processing has any influence on the number of records that are retrieved from Twitter. To determine this, two tests were carried out; in the first test the Twitter records were directly saved to the database with no processing at all, while in the second a record viability filter was implemented. The record viability filter performs the following checks:
• Is the language of the Twitter record English?
• If yes, has the record been detected before?
• If no, does the message content contain the required keyword?
• If yes, store to memory.

The test was carried out 4 times; twice without the filter and twice with the filter. Each test was carried out for 10 minutes. The results are shown in Table 8.

Table 8: Twitter record processing; no filter vs filter

Result category | No Filter Test 1 | No Filter Test 2 | Filter Test 1 | Filter Test 2
Correct keywords detected | 1261 | 1438 | 611 | 572
Keyword not detected | 0 | 0 | 12 | 24
Non-English language | 0 | 0 | 585 | 615
Duplicate entries | 0 | 0 | 107 | 142
Total amount of tweets | 1261 | 1438 | 1315 | 1353

All tests were carried out on the following system:
• Intel Core i7-2600QM 2.4GHz;
• 8GB Memory;
• 1Mb/s internet connection speed.

From Table 8 it can be seen that including a record viability filter does not influence the number of results, and therefore it will be included in the Back-End. Once a result has been processed it must be saved to a database to be used by the Front-End. This brings us to the second question, namely should each record be saved to the database as it is retrieved or should results be saved in a batch? In order to decide which technique is better, the processes involved with each saving technique must be analysed. This is shown in Figure 23.

Figure 23: Database save methods comparison

For both methods it can be seen that a connection to a database must be created in order to access the database. Should the user wish to save each record as it is presented, a database connection will have to be created every time, which will lead to a significant amount of overhead required just to initialise the database connection. This will slow the program, especially if the database is located on a remote server. It would be possible to create the connection once and use it whenever a record has to be updated, but this will still lead to a significant number of database statements that pass the necessary information to the database. It might also cause the system to become unstable, as the database connection might close if it remains unused; this can happen because the user does not know when the Twitter API will deliver a result. It is therefore better to store a certain number of records in the system's memory and save the records to the database once enough results have been accumulated. This results in one connection to the database and one statement to pass all the information, thereby lessening the amount of overhead. During development it was decided to save the data to the database once 250 records have been accumulated. While this number may be altered at will, it was found that using 250 records does not slow the system and ensures that information is rapidly saved to the database. It also lessens the number of records that might be lost in case of a system failure such as a power outage or loss of internet connection. A sketch of this batching approach is given below.
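A minimal sketch of the batching strategy is shown here. The table name, column names and the use of ADO.NET (System.Data.SqlClient) are assumptions made for illustration only; the Tweet type is the hypothetical record from the previous sketch, and the real Back-End may structure its statements differently.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

// Minimal sketch of the "save in batches of 250" strategy: records are
// buffered in memory and written with one connection and one statement.
public class TweetBatchWriter
{
    private const int BatchSize = 250;
    private readonly List<Tweet> _buffer = new();
    private readonly string _connectionString;

    public TweetBatchWriter(string connectionString) => _connectionString = connectionString;

    public void Add(Tweet tweet)
    {
        _buffer.Add(tweet);
        if (_buffer.Count >= BatchSize)
            Flush();                                   // one connection, one statement
    }

    public void Flush()
    {
        if (_buffer.Count == 0) return;

        var sql = new StringBuilder("INSERT INTO TwitterResults (TweetId, Message) VALUES ");
        using var connection = new SqlConnection(_connectionString);
        using var command = connection.CreateCommand();

        for (int i = 0; i < _buffer.Count; i++)
        {
            if (i > 0) sql.Append(", ");
            sql.Append($"(@id{i}, @msg{i})");
            command.Parameters.AddWithValue($"@id{i}", (decimal)_buffer[i].Id);
            command.Parameters.AddWithValue($"@msg{i}", _buffer[i].Text);
        }

        command.CommandText = sql.ToString();
        connection.Open();
        command.ExecuteNonQuery();
        _buffer.Clear();
    }
}
```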
A final thought must be given to whether multithreading might be of use. Because results are only presented as they are found, adding additional threads will not yield any additional rewards, as new records might not be available at the time the extra threads are ready to process them. As more keywords are added to the system there might come a time when the library starts skipping records. At such a time multithreading techniques may be added, but for now the Twitter API component will be restricted to a single thread.

4.1.3. Facebook API

As stated in Section 2.3.2.2, using the Facebook API requires the user to provide 2 authentication keys, after which the user is able to perform queries using the API. The Facebook API provides the user with two methods of retrieving publicly posted data: the Graph API, which is the primary method for an application to access the Facebook servers, and the Public Feed API, which presents the user with a stream of user and page status updates. The Graph API lets the user retrieve information from the Facebook servers via queries, similar to the Twitter REST API (Section 3.2.2.1). After the user has authenticated the application, the user may use the Graph API to make various requests including, but not limited to, searching public posts, interacting with the current user's Facebook profile and handling various errors. On the other hand, the Public Feed API provides the user with a stream of public status updates as they are posted on Facebook. As before, it is tempting to make use of the Public Feed API as only a single connection is used over which continuous updates are passed. Unfortunately, the Public Feed API is not capable of providing the user with the full content of posts. Instead the Public Feed API only presents the user with limited information regarding user and page status updates. To retrieve further information the user has to use the Graph API instead. This is fully explained in the official Facebook API documentation [39] (please note that this link uses a secure connection; in order to view the page and its content the user will have to register a Facebook account). Due to this limitation the Public Feed API will not be of use to the ORM service and therefore the software will make use of the Graph API.

To use the Graph API within an application the user must first provide the 2 authentication keys to authenticate the application. If authentication is successful the user will be presented with an Access Token, which must be used with every subsequent request. In order to retrieve information regarding the latest public posts the user must use the Graph API to perform a search request. This is accomplished by accessing the Search functionality of the Graph API through the following URL:

https://graph.facebook.com/search?q=QUERY&type=OBJECT_TYPE

In order to specify the keywords that must be detected the user must replace "QUERY" with the specific keywords in a comma separated string. The "OBJECT_TYPE" specifies the information for each result that must be returned. A sample query is shown below:

https://graph.facebook.com/search?q=Microsoft,MTN&type=id,from,caption,message

This query will present the user with the unique Facebook ID, information regarding the user that created the message, the title of the message and the message itself from the latest posts in which Microsoft or MTN feature as keywords. Each query will return the latest 25 results. This is a restriction built into the Graph API over which the user has no control. A sketch of how such a search request can be issued from code is given below.
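The following sketch shows one possible way of issuing such a search request, using the search URL and sample parameters quoted above. The access_token query parameter, the placeholder token and the omission of error handling and JSON parsing are assumptions made for illustration; the actual implementation accesses the Graph API through the Facebook SDK for .Net.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch of the keyword search described above, issued against
// the Graph API search URL quoted in the text.
public static class FacebookSearchExample
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<string> SearchAsync(string accessToken, params string[] keywords)
    {
        string query = Uri.EscapeDataString(string.Join(",", keywords));
        string url = "https://graph.facebook.com/search"
                   + "?q=" + query
                   + "&type=id,from,caption,message"
                   + "&access_token=" + Uri.EscapeDataString(accessToken);

        // Returns the raw JSON document containing at most 25 of the latest posts.
        return await Http.GetStringAsync(url);
    }

    public static async Task Main()
    {
        string json = await SearchAsync("ACCESS_TOKEN_HERE", "Microsoft", "MTN");
        Console.WriteLine(json.Length + " characters of JSON received");
    }
}
```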
This may lead to several questions:
• Will 25 records for each query be enough?
• If not, is there any way to ensure a maximum number of results will be retrieved while staying within Facebook's restrictions?

To answer the first question the rate at which new Facebook public posts are generated must be determined. This was accomplished by developing a test application that requested data for 5 keywords from Facebook each minute. Once the data had been retrieved it was passed through a filter which removed all the records that had previously been found. This test was carried out in 5 minute segments for two separate groups of keywords.

Table 9: New/filtered Facebook results per minute

Keywords | Query 1 | Query 2 | Query 3 | Query 4 | Query 5
Batch 1 | 88/37 | 20/105 | 21/104 | 18/107 | 31/94
Batch 2 | 89/36 | 32/93 | 42/83 | 44/81 | 32/93

From Table 9 it can be seen that for the initial query the number of new results is much higher than the number of filtered results, as there is no historical data for the new data to be compared against. Any duplicates that occur are posts that have been repeated by multiple users on Facebook itself. It can also be noted that from the second query onwards the number of new posts is significantly lower than the number of filtered results. Though the number of new posts depends on which keywords are used, the number of new posts for both Batch 1 and Batch 2 dropped by more than 50% from the initial query. This shows that new data is generated very slowly and that 25 results per query will not cause the user to lose any data.

4.1.4. Complete Back-End implementation

Figure 24: Complete Back-End implementation

Figure 24 shows the complete implementation of the Back-End as discussed in Sections 4.1.1-4.1.3. Each system is designed to function as a separate unit; this ensures that even if one system malfunctions it will not influence the others. All results will be saved to a designated storage drive as well as a database where the Front-End and Website can access the data. To ensure the system runs at optimal capacity, the following system requirements are recommended:
• Quad-Core 2.4GHz CPU;
• 8GB Memory;
• 1TB HDD;
• 10Mb/s or faster Internet Connection.

Using the system described above will ensure that 256-512 web crawlers can be used simultaneously with the Twitter and Facebook scanners.

4.2. The Front-End

The following components will be used in the Front-End:
• The dtSearch Engine as result generator;
• The Levenshtein-Munkres algorithm as similarity checker;
• The AlchemyAPI tool as sentiment calculator.

4.2.1. dtSearch Engine

As the index files are created by an algorithm unique to the dtSearch Engine, the software must use the result generating capabilities presented by the dtSearch Engine. The result generating capabilities of the dtSearch Engine allow the user to:
• search for specific or multiple keywords within an index file;
• search for instances where some keywords are mentioned but others are not (Boolean searches);
• specify the number of results that have to be returned;
• search using synonyms of the specified keywords;
• regenerate the original web page.

To use the result generating capabilities the user must first specify which index files to use, the keywords that have to be detected and which features to use. Once specified, the dtSearch Engine will scan through all the index files to detect the specified keywords and report back any results. Unlike the components in the Back-End there are not many different implementations available for the dtSearch Engine result generator.
When provided with the results from a web crawl using 788 websites with a crawl depth of 0 and 26 keywords, the result generator took 3 seconds to generate 200 results for each keyword. This is deemed fast enough and therefore no multithreading capabilities will be included.

4.2.2. Similarity Filter

As discussed in Section 3.2.3, a similarity filter will be included because a portion of the web and social networking crawler results will have insignificant meaning. These results include tags within a web page or tags under a picture from Facebook or Twitter. However, some of the Twitter and Facebook results that pass through the Back-End filters will still not be viable for the ORM service, such as duplicate results that were detected in separate crawling sessions, or results that are too long and cannot be saved to the database without being truncated. Additionally, as Twitter messages are limited to 140 characters, it was decided to remove results that contain more than three question marks; such results have no significant meaning as they typically involve people asking each other questions with no answer in the text itself. Therefore the Front-End will contain two additional filters that will remove these results, as sketched below.
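A minimal sketch of these pre-filters, together with a simple Levenshtein-based similarity percentage of the kind tuned in the following paragraphs, is given here. The maximum message length, the in-memory duplicate cache and the normalisation to a percentage are assumptions made for illustration; the actual similarity checker combines the Levenshtein and Kuhn-Munkres algorithms as described in Section 3.2.3.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the Front-End pre-filters (duplicates, over-long results,
// more than three question marks) and a simple similarity percentage.
public static class ResultFilters
{
    private const int MaxLength = 4000;                 // assumed database column limit
    private static readonly HashSet<string> Seen = new(StringComparer.OrdinalIgnoreCase);

    public static bool PassesPreFilters(string text)
    {
        if (text.Length > MaxLength) return false;      // would be truncated in the database
        int questionMarks = 0;
        foreach (char c in text) if (c == '?') questionMarks++;
        if (questionMarks > 3) return false;             // question-only chatter
        return Seen.Add(text);                           // false if a duplicate was seen before
    }

    // Similarity of a result to its keyword, expressed as a percentage.
    // Edit distance with substitution cost 2 and insert/delete cost 1,
    // as described for the Levenshtein component in Section 3.2.3.
    public static double SimilarityPercent(string result, string keyword)
    {
        int[,] d = new int[result.Length + 1, keyword.Length + 1];
        for (int i = 0; i <= result.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= keyword.Length; j++) d[0, j] = j;

        for (int i = 1; i <= result.Length; i++)
            for (int j = 1; j <= keyword.Length; j++)
            {
                int subst = char.ToLowerInvariant(result[i - 1]) == char.ToLowerInvariant(keyword[j - 1]) ? 0 : 2;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + subst);
            }

        int worst = result.Length + keyword.Length;      // delete one string, insert the other
        if (worst == 0) return 100.0;
        return 100.0 * (worst - d[result.Length, keyword.Length]) / worst;
    }
}
```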
Implementing the similarity filter may not be difficult, but tuning the threshold of the filter may take some time. If the filter threshold is too low, not enough results will be passed through, but if the filter threshold is too high, too many useless results will be included within the results the user receives. To determine the optimal threshold a test was executed. In the test the web and social network crawlers were executed for twenty minutes, which provided 9 298 results. Of the 9 298 results, 2 150 were duplicates and 5 results were too long and were thus removed. The remaining 7 143 results were passed through a series of similarity filters, each with a different similarity threshold. The filtered out results were manually scanned to determine whether they were relevant to their keyword or not. The results are shown in Table 10.

Table 10: Similarity filter tuning table

Similarity threshold (%) | Amount of results passed through | Example of filtered out result
0 | 0 | N/A
10 | 6384 | I so love my apple devices
20 | 6996 | I want an Apple TV
30 | 7115 | Checkers after school
40 | 7131 | An Apple a day
50 | 7140 | N/A
60 | 7140 | Checkers, ??
70 | 7142 | Checkers ??
80 | 7143 | N/A
90 | 7143 | N/A
100 | 7143 | N/A

Table 10 shows the similarity thresholds that were used, the number of results that were passed through the filter and an example of a result that was filtered out at each threshold. It can be seen that results that are more than 30% similar to their keyword contain very little significant meaning, but results that are below the 20% threshold are indeed viable for an ORM service; for example, "I so love my apple devices" contains a positive sentiment towards Apple products. Therefore thresholds between 20% and 30% need to be investigated. For each threshold in this range all the filtered results were manually checked and the percentage of results that contained significant meaning was calculated. The results are shown in Table 11.

Table 11: 20% to 30% similarity filter investigation

Similarity threshold (%) | % Relevant results
20 | 93.4%
21 | 85.7%
22 | 64.7%
23 | 45.1%
24 | 40.0%
25 | 58.3%
26 | 42.8%
27 | 50.0%
28 | 75.0%
29 | 66.6%

Table 11 shows the similarity thresholds that were investigated as well as the percentage of results that were found to be relevant. It can be seen that the thresholds near 20% offer more viable results than those near 30%, but there is always a combination of viable and non-viable results. It is therefore impossible to set the threshold to a specific value where all viable results will be retrieved and all non-viable results filtered out. As such the user must make a decision: if the threshold is set to 30% the user will obtain the largest number of viable results, but the non-viable results will take some time to filter out. On the other hand, if the user sets the threshold to 20% a large number of non-viable results will be filtered out, but some viable results will be lost as well. For this project the threshold will be set to 20%, which will filter out most of the non-viable results while only losing a few viable ones, as shown by Table 11. It should be noted that the values obtained in Table 11 are specific to this test and may change depending on the current mood of the public or the keywords used.

A final thought can be given to multithreading, namely whether it is necessary. In the current test the similarity for all 9 298 results was calculated in 1.77 seconds. This is fast enough for everyday use and therefore multithreading will not offer any benefits. All tests were carried out on the following system:
• Intel Core i7-2600QM 2.4GHz;
• 8GB Memory;
• 1Mb/s internet connection speed.

4.2.3. Sentiment analysis tool

Once the results have passed through the similarity filter their sentiments have to be calculated. As discussed in Section 3.2.4, the AlchemyAPI will be used for this research. The only way to interface with AlchemyAPI is to use the library file that is provided when the user registers on the network. The library allows the user to send a sentiment request to the AlchemyAPI servers, which will process the given text and present the user with a result. Unfortunately AlchemyAPI does not allow batch processing; the user cannot pass multiple messages through a single sentiment request to be calculated together. This presents a problem for the ORM system, which requires a large number of messages to be processed in the shortest possible time in order to keep its information as up to date as possible. To compensate for this, AlchemyAPI allows the user to send multiple requests at the same time, the number of which differs depending on the user's AlchemyAPI account. As this project uses a free account it can only pass up to 5 concurrent requests. This raises the question: what is the optimal number of threads to use? In order to determine this, several tests were executed in which a number of messages were retrieved from the database and passed to the AlchemyAPI using a varying number of concurrent threads. Due to the restrictions of the free account this test was limited to 100 messages. The results are shown in Table 12.

Table 12: AlchemyAPI multithreading

Amount of threads | Time (seconds)
1 | 61.42
2 | 32.25
3 | 30.86
4 | 31.57
5 | 31.65

The results in Table 12 are not as expected. It can be seen that with 2 threads the calculation is indeed roughly twice as fast, but from there on increasing the number of threads does not influence the calculation speed in any significant way, whereas theoretically increasing the number of concurrent threads should bring the time down even further.
It is theorised that AlchemyAPI can only process a 71 limited number of results at a time and therefore implemented a queuing policy in order to process all the sent messages as fast as possible. AlchemyAPI was contacted in order to verify whether this is correct but no response has been retrieved at the time of writing. 4.2.4. Complete Front-End implementation Figure 25 shows the final Front-End Architecture. It can be seen that a Front-End instance does not interface with anything except the local storage drive where the dtSearch Engine index files are located and the database where the Facebook and Twitter files can be extracted. As such several Front-End instances can be opened at the same time, allowing several system administrators use the system at the same time. Figure 25: Final Front-End Architecture 72 4.3. The Website The website will not make use of any third-party tools as it will only be used to display results that have been generated by the Front-End. For this reason the implementation of the website will not be discussed here. 4.4. Final method of operation Figure 26: Final ORM operational flow The operational flow of the ORM system can be seen in Figure 26. To start the ORM process the system administrator will specify a series of keywords as well as a series of websites and their crawl depths by using an interface in the Front-End. This data will be saved to the database, and will be used by the Back-End in order to acquire information from the web, as detailed in Section 4.1. The results from the 73 Back-End will be saved back to the database as well as a local storage drive such as a hard disk. After saving the results the web crawlers will scale down their processes and wait for a pre-determined amount of time to pass before restarting the crawling process. The Twitter and Facebook scanners are unaffected by the web crawlers and will remain active when the web crawlers scale down their processes. The results from each web crawl session will be saved in a different location in order to preserve historical information. The system administrator can process the Back-End results by using a different Front-End interface, as detailed in Section 4.2. The results that can be processed depend on the Back-End status; if the Back- End web crawlers are idle all web crawler results can be processed but if the web crawlers are currently busy crawling the web the user can only process the results for a previous session. Twitter and Facebook results can be processed as they are saved to the database. Results for the website will become available as the system administrator updates the database with the processed results. 74 Chapter 5 Results The ORM system was evaluated by using it in a real world scenario similar as it would be used by a client. The scenario contained a list of keywords that must be detected as well as a list of websites and their respective crawl depths that must be scanned by the web crawlers. This chapter will detail the results that was retrieved by testing the ORM system in this manner. 75 5.1. Process The scenario used in the evaluation consisted of 200 websites with a crawl depth of 1 and the following keywords: ? Spar ? Shoprite ? Wimpy ? Spur ? CNA ? Musica ? Woolworths ? Edgars ? Audi ? Jeep These keywords were chosen as they span different sectors of the industry, namely general shopping, food, entertainment, clothing outlets and motoring. 
The websites were chosen by randomly selecting 200 websites from a list containing 788 websites in different categories. As the ORM system would run on a server in a real world scenario, the ORM system was installed on a system with higher specifications than the one used during development. The "server" system has the following specifications:
• Intel Core i7-3770K 3.0GHz;
• 16GB Memory;
• 1Mb/s internet connection speed.

5.2. The Back-End

To test the Back-End within a real world scenario, all three of the ORM components were run in parallel.

5.2.1. Web crawler

As a 1Mb/s internet connection was available, the Back-End was set to use 16 active web crawlers with the 200 websites divided amongst them. A graph indicating the execution time for each web crawler thread is given in Figure 27.

Figure 27: Web crawler execution time (execution time in seconds per web crawler instance)

The average execution time for all 16 web crawlers was 2 hours 18 minutes and 16 seconds. The shortest execution time for a web crawler was 1 hour 10 minutes and 28 seconds and the longest 2 hours 48 minutes and 13 seconds. From Figure 27 it can be seen that web crawlers 2 and 15 executed significantly faster than any of the others. When investigated it was noticed that the websites assigned to web crawlers 2 and 15 contained fewer links to other websites, and therefore fewer pages had to be scanned. This is shown in Figure 28.

Figure 28: Web crawler URLs and links scanned (items scanned per web crawler instance)

It should be noted that for all the web crawlers the number of links that were scanned is significantly higher than the number of URLs that were scanned. This happens because the dtSearch Engine implements a duplicate policy; once a link has been detected the dtSearch Engine will first verify that the page has not been scanned previously. If the page has not been scanned before, the dtSearch Engine will load the URL and scan the associated page. If the link has been scanned before, the dtSearch Engine will log that the link has been detected before proceeding to check the next one. Therefore the number of URLs scanned represents the number of pages that the dtSearch Engine has downloaded and indexed, whereas the number of links scanned represents the number of links the dtSearch Engine has detected in total.

As stated in Section 1.3, the web crawler must be able to scan the internet as fast as possible. Therefore the crawling rate of the web crawlers, measured in web pages per second, is very important. This is shown in Figure 29.

Figure 29: Web crawler URLs and links scanned per second (URL scan rate and link detection rate per web crawler instance)

Figure 29 shows that no single web crawler instance was able to scan the web at a rate faster than 1 website per second. However, as each web crawler instance downloads a separate website at a time, all the crawling rates can be added together in order to get an indication of how fast the web crawlers are scanning the internet as a whole (a small sketch of this aggregation is given below).
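The aggregation is simple arithmetic, but a short sketch makes it explicit. The page counts and execution times in the arrays below are placeholders rather than the measured values from Figures 27 and 28.

```csharp
using System;
using System.Linq;

// Minimal sketch of aggregating per-crawler scan rates into a cumulative
// rate, as done for Figure 29. The arrays are placeholders only.
public static class CrawlRateAggregation
{
    public static void Main()
    {
        int[] urlsScanned = { 900, 1100, 750, 1020 };            // placeholder page counts
        double[] executionSeconds = { 4200, 5100, 3600, 4800 };  // placeholder execution times

        double cumulativeRate = urlsScanned
            .Zip(executionSeconds, (pages, seconds) => pages / seconds)
            .Sum();                                              // add the individual rates together

        Console.WriteLine($"Cumulative crawl rate: {cumulativeRate:F2} pages per second");
    }
}
```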
Once all the web pages were scanned, the cumulative web crawling rate was calculated to be 2.45 web pages per second and the cumulative link detection rate to be 7.19 links per second. These rates will increase significantly should a faster internet connection be used.

5.2.2. Twitter Scanner

The Twitter Scanner was loaded with the 10 keywords specified in Section 5.1 and executed for 3 hours and 40 minutes. During this test 20 694 results were detected. Of the total number of results, 16 370 were non-English results and 1 311 were duplicates of previous results. As such 17 681 (85.44%) results were filtered out, which left 3 013 valid results. The information is graphically shown in Figure 30.

Figure 30: Twitter results breakdown (invalid language, duplicate results and valid results)

Analysing the results shows that valid tweets containing one of the ten keywords were detected at a rate of 0.22 valid results per second. While this may seem low, it should be kept in mind that this is the rate at which results that are of any use are presented by the Twitter public stream. To verify that no data was lost, two separate programs were connected to the Twitter public stream and both received the same number of results. If all results are taken into account the result detection rate is 1.57 results per second. Unlike the web crawlers, a faster internet connection will not have any additional benefit; the only way to detect more results would be to add more keywords to the Twitter filter.

5.2.3. Facebook Scanner

The Facebook Scanner was provided with the 10 keywords specified in Section 5.1 and executed for 3 hours and 40 minutes. During this test 27 991 results were detected. Of the total number of results, 23 416 were duplicates of previous results and 4 107 were results that did not contain significant meaning or were written in a different language. As such 27 523 (98.32%) results were filtered out, which left 468 valid results. The information is graphically shown in Figure 31.

Figure 31: Facebook results breakdown (invalid results 15%, duplicate results 83%, valid results 2%)

Analysing the results shows that valid Facebook posts containing one of the ten keywords were detected at a rate of 0.035 valid results per second. While this may seem low, it should be kept in mind that this is the rate at which results that are of any use were detected. Unlike Twitter it was not necessary to connect two separate programs to the Facebook public data, as the presence of duplicate entries presented by the same user proves that no data has been lost. If all results are taken into account the result detection rate is 2.12 results per second. It should be noted that although more results are retrieved from Facebook than from Twitter, Twitter has more valid results. The reason for this lies in the scanning methods of Facebook and Twitter. To extract data from Twitter a continuous stream is opened that automatically presents new results, whereas Facebook does not support this functionality and has to be scanned periodically. Each scan of Facebook returns a fixed number of results; however, results are generated very slowly on Facebook, as demonstrated in Section 4.1.3, and therefore a large number of duplicate results will be retrieved.

5.3. The Front-End

The data generated by the Back-End could be accessed by the Front-End once all the Back-End tests were complete.
5.3.1. Web crawler result generator

When provided with the index files generated by the web crawler, it took the result generator 2.53 seconds to generate results for the 10 keywords with a similarity threshold of 20%, as discussed in Section 4.2.2. The number of results for each keyword is shown in Table 13.

Table 13: Amount of web crawler results

Keyword | Amount of results
Audi | 79
CNA | 2
Edgars | 22
Jeep | 47
Musica | 4
Shoprite | 42
Spar | 45
Spur | 53
Wimpy | 10
Woolworths | 104
Total | 408

A screenshot of the web crawler results screen is shown in Figure 32. From the screenshot it can be seen that all web crawler results contain the following information:
• the title, location, page and paragraph of the web page where the result was located;
• the data along with the keyword that was detected;
• an internal score as calculated by dtSearch that indicates how strongly the result matches the keyword, along with a relevancy score which is used to filter out results that are deemed too similar to their keyword;
• the number of times the specific result was found as well as the number of times the keyword features within the text (Hit Count);
• the calculated sentiment of the given text.

Figure 32: Screenshot of web crawler results

When entering a keyword into the system, the software allows the system administrator to specify a competitor for the keyword. Competitors will be saved to the database as additional keywords and will be included in the scanning processes by the Back-End, but when analysing the information the system will show the information for the current keyword as well as the information for its competitors. This will allow the system administrator to compare the web crawling results for the keyword against those of its competitors. After the results have been analysed by passing them through the similarity filter and sentiment analysis tool, the analysed results will allow any potential client to see how their company compares against its competitors. It may be noticed that the "Competitors" grid at the bottom of the screen is currently empty. For this scenario no competitors were entered and as such no information can be displayed. If the user were to give the current profile a competitor, for example "Woolworths", its results would be shown in the grid. This demonstrates the effectiveness of scanning the internet and generating results at a later date. Because the index files already contain all the data, the user can request results for new keywords at any time without rescanning the web.

Figure 33: Results report of web crawler

The user has the ability to generate a report of all the results that were detected. The report will give a summarised version of all keyword mentions as well as the website, page and paragraph where each keyword was found. The report will exclude information such as calculated sentiments, as sentiments are only calculated once a user has marked a result as relevant. The user will also have the ability to regenerate the web page of any detected result in order to provide some information regarding the context in which the keyword was found. Examples of the results report and a regenerated web page are shown in Figure 33 and Figure 34.

Figure 34: Regenerated web page of web crawler result

As explained in Section 3.1.2, the web crawler cannot detect the context in which a keyword features and therefore not all of the results that are provided will be relevant.
As such the user must filter through the results provided by the web crawler and manually determine which results are relevant, as demonstrated by the results column in Figure 32. Once the user has marked a result as relevant, the system will calculate the result's sentiment and category by using AlchemyAPI before adding the result to a list that contains all the results that the user has marked as relevant. Once the user has finished marking the relevant results he or she may proceed to review the results. This is shown in Figure 35.

Figure 35: Web crawler results review

When reviewing the results the user has the ability to alter the title, category, sentiment and text of each result in order to remove any unnecessary characters which may have been added when the result was acquired from the web. After reviewing the results the user will be able to save the results to the database, from where they will be used to generate statistics.

5.3.2. Twitter scanner results

When provided with the results from the Twitter scanner it took the Front-End 0.53 seconds to process the 3 013 results that were retrieved by the Back-End. Like the web crawler results, a similarity threshold of 20% was used, which filtered out 284 results. A screenshot of the Twitter results processor for "Edgars" is shown in Figure 36.

Figure 36: Twitter results processor

From the screen shown in Figure 36 the user can select which results are relevant and mark them to be saved. Once the saving procedure starts, the sentiment of each marked result will be calculated by using AlchemyAPI and the result will be saved to the database to be used when generating statistics.

5.3.3. Facebook scanner results

When provided with the results from the Facebook scanner it took the Front-End 0.83 seconds to process the 468 results that were retrieved by the Back-End. This may seem strange, as the 3 013 Twitter results took only 0.53 seconds to process, but it should be noted that a single Twitter message can only be 140 characters in length, while a Facebook message can contain up to 7 700 characters. Like the web crawler and Twitter scanner results, a similarity threshold of 20% was used, which filtered out 123 results. A screenshot of the Facebook results processor is shown in Figure 37.

Figure 37: Facebook results processor

During a deeper analysis it was noticed that the searching techniques used by the Facebook Graph Search compensate for synonyms and spelling mistakes, and as such the user may not always receive the correct keyword. For example, searching for "Edgars" yielded results for both the clothing outlet and various people named "Edgar". As such an additional filter had to be implemented to ensure that the keyword the user searches for does indeed feature within the message returned from Facebook. After implementing such a filter, 161 results remained. From the screen shown in Figure 37 the user can select which results are relevant and mark them to be saved. Once the saving procedure starts, the sentiment of each marked result will be calculated by using AlchemyAPI and the result will be saved to the database to be used when generating statistics.

5.3.4. Online reputation calculation

Once the user has calculated the sentiment of the web crawler, Twitter and Facebook results, the profile overview can finally be calculated. This is done by generating the averages of the calculated sentiments for each of the categories (web crawler, Twitter Scanner and Facebook).
If the average sentiment for the keyword is larger than 0 then result will have a positive perception, and vice versa. It will take too long to determine the online reputation of all the brands that have been listed at the start of the chapter as the free account of AlchemyAPI only allows the user to process 1000 sentiments during a day. Therefore the online reputation of only one company will be determined in this section. It was decided to determine the online reputation of Edgars as the web and social networking crawlers have produced enough results to determine its online reputation but the number of results will not surpass the limitations of the AlchemyAPI free account. Figure 38 shows a screenshot of the final result for the scenario used in this chapter. It can be seen that Edgars currently has a positive perception on both the internet, Twitter and Facebook. When manually verifying the online sentiments it was found that 82.85% of the sentiments were correctly identified, which matches the figures presented in Section 3.2.4. Figure 38: Profile overview 5.4. The Website The Website will be able to show the processed results to any interested party without allowing them access to the database. In order to be useful the Website must allow the user to: ? View the processed results of any keyword, as well as indicate the origin of the keyword. ? Indicate whether the specified brand has a positive or negative online reputation. ? Allow the user to compare the keyword against similar brands. All of these results can easily be retrieved from the database by using the processed results. Screenshots of the webpages are shown in Figure 39 - Figure 42. 87 Figure 39: Website results – Internet Figure 39 shows the results from the web crawlers after they have been processed. In the screenshot the user can clearly see the title, location, a summarized version of the result, the keyword as well as the sentiment of each result. Figure 40: Website results – Twitter 88 Figure 40 shows the results that have been retrieved from Twitter after they have been processed. In the screenshot the user can clearly see the full Twitter message as well as the keyword that was detected and the calculated sentiment. Figure 41: Website results – Facebook Figure 41 shows the results that have been retrieved from Facebook after they have been processed. In the screenshot the user can clearly see the full Facebook message as well as the keyword that was detected and the calculated sentiment. 89 Figure 42: Website - Overall sentiment Figure 42 shows the final screen that gives the overall reputation of the selected keyword, in this case “Edgars”. From the screenshot it can be seen sentiments for the keywords are clearly indicated and are the same as the results provided by the Front-End as seen in Figure 38. 90 Chapter 6 Conclusion and Recommendations This chapter will provide an overview of the goals of the research before stating whether the goals have been met and whether the research was a success or not. A final thought will be given to any enhancements that may be researched in the future. 91 In order to determine whether the project was a success its accomplishments must be evaluated against the original goals. As stated in Section 1.3, the goals of this project were to investigate existing online reputation monitoring systems in order to develop a new system that will: ? scan web pages at a sufficient rate to ensure all the results are continually kept up to date; ? 
scan social networking sites for public opinions; ? analysing results in real time to give the user information such as the location, page, paragraph and sentiment of the results; ? using the analysed results to generate reports; ? present the user with as much control over the entire process as possible. By using the knowledge gained from investigating BrandsEye and Brand.Com, an online monitoring system was developed that is capable of: ? scanning web pages at a rate of 2.45 web pages per second by using a 1Mb/s internet connection; ? using the information from the web crawlers to detect 200 results for each of 10 random keywords in 2.53 seconds; ? using the web crawler results to generate various reports as well as the original website where the results were located; ? acquiring results from Twitter and Facebook at a rate of 0.22 and 0.035 valid results per second by using 10 random keywords; ? filtering and processing data retrieved in 3 hours 40 minutes from Twitter and Facebook results in 0.53 and 0.85 seconds respectively; ? generating the sentiments of the web crawler as well as the Twitter and Facebook results; ? generating a complete overview of the keyword to indicate how the keyword is seen by the online community. As can be seen from the results all the original goals were met and therefore the project was a success but several areas could be improved upon. Currently the main stumbling block is the use of AlchemyAPI in order to calculate the sentiments of the results. In order to use AlchemyAPI the user must pay a montly subscription fee which will become expensive. Therefore sentiment analysis tools can be investigated in the future to see whether it will be possible to develop a custom sentiment analysis tool that will minimize if not get rid of all the subscription fee costs. The dtSearch Engine is another possible area of investigation; while the dtSearch Engine only has to be purchased one time it does not allow the user a lot of flexibility. While this is not a problem at the present time additional functions might be added in the future that the dtSearch Engine may either 92 prohibit or cannot perform. It would be ideal if the system could use a custom web crawler specifically made for an ORM system over which he or she has full control. Lastly a thought must be given towards the social networking sites that are currently scanned. At present Twitter is generating sufficient data for an ORM service but due to its internal search processes Facebook may not always provide the user with sufficient information, as indicated by the “Edgars” example in Section 5.3.3. Additionally Facebook has upgraded its Graph API in April 2014 which has deprecated the post scanning functionality the current ORM system makes use of. Currently Facebook still provides the post scanning functionality by means of the previous API version but it is unknown how long this API will remain supported. As such either a new method of interfacing with the Facebook network must be investigated or the system must drop support for Facebook and instead make use of another system such as Google Plus. 93 Appendix A Index files This add-on chapter will give a quick overview of how index files are generated, used and what their advantages are. 94 Index files contain all the words that were detected within the web pages and their respective locations. 
The example below demonstrates this principle; it should be noted that this may not be the exact algorithm the dtSearch Engine uses and the example is intended for demonstration purposes only. Original Text The end game to South Africa's big platinum strike is drawing near after the producers said they would take their latest wage offer directly to employees after marathon wage talks to end the 13-week strike collapsed on Thursday. Source: "Platinum firms take offer to workers” by Reuters on 28/4/2014, published on News24.Com. Link: http://www.fin24.com/Companies/Mining/Platinum-firms-take-offer-to-workers-20140428 Index File: Word Location 13-week 34 Africa’s 6 After 13,27 Big 7 Collapsed 36 Directly 24 Drawing 11 Employees 26 End 2,32 Game 3 Is 10 Latest 21 Marathon 28 Near 12 Offer 23 On 37 Platinum 8 Producers 15 Said 16 South 5,25,31 95 Strike 9,35 Take 19 Talks 30 The 1,14,33 Their 20 They 17 Thursday 38 To 4 Wage 22,29 Would 18 Example 1: Index file demonstration By the table in Example 1 the location of all mentions of a specific word can be quickly retrieved, allowing the user or a software application to quickly scan a document. However, each index file implementation uses its own algorithm hence making it incapable for other indexing applications to read each other’s files. The dtSearch Engine is no different; when the dtSearch Engine indexes web page information it creates the index files on a designated storage device. It would be ideal if this data could be saved to a local database instead of a hard-drive, but due to the unique dtSearch algorithm it would be impossible to read the data from the index file with a custom application and the files are too large to consider saving the entire file to the database. 96 Appendix B Conference Presentations This add-on chapter will indicate where this research has been presented and provide any articles that have been written. 97 Title: Development of an online reputation monitor Format: Poster presentation Presented at: Southern African Telecommunications Networks and Applications Conference (SATNAC) Stellenbosch, Western Cape, South Africa 2 – 4 September 2014 Title: Development of an online reputation monitor by using existing components Format: Full-paper Presented at: Southern African Telecommunications Networks and Applications Conference (SATNAC) Port Elizabeth, Eastern Cape, South Africa 1 - 3 September 2014 98 Abstract- Customer opinion about companies are very important and companies often get customer feedback via surveys or other official methods. Some customers prefer to voice their opinion on the internet where they take comfort in anonymity. Currently this form of customer feedback is not closely monitored. This project aims to address this shortcoming by developing a system capable of monitoring various web and social networking sites for customer feedback. Index Terms— Online Reputation Monitor, Web crawler, Facebook, Twitter, dtSearch. I. INTRODUCTION Advertising is a tool that is used to establish a basic awareness of a product or service in a potential customer by providing selected information [1]. Research shows that advertising has a major influence on customer preferences and can help consumers to make decisions [1][2]. However, according to Reichheld [3] the tremendous cost of marketing makes it hard for a company to grow profitable. 
Reichheld [3] believes that the only path to a profitable growth rate lies in the company's ability to get loyal customers to promote a company by sharing information about the company’s products and the customer’s experience [4]. Getting customers to talk about a company is not enough as customers may either promote or advise against the company. As such the company needs to monitor its public reputation. Most companies know this and often employ techniques such as focus groups and surveys to generate various statistics but these methods are not always effective. People often feel under pressure when their opinions are personally asked and therefore adjust their answers to avoid any potential confrontation. Instead many customers prefer to voice their opinion on the internet where they take comfort in anonymity. Therefore companies need to determine their online reputation. Determining online reputation is not a new field and has been done for years by organizations such as BrandsEye and Brand.Com. However, the services these companies offer have certain limitations. The services normally require a monthly fee and companies are not allowed to purchase the online reputation monitoring (ORM) software. The user also has very little control over the software itself: such services search for user specified keywords over a range of websites, yet the user cannot specify which websites to monitor or change the keywords without restarting the service. Additionally the services rarely make use of historical data, meaning a company’s reputation can only be analyzed from the present day onwards. To solve this problem a new ORM system is proposed. The system will allow the user to scan a number of web pages and social networking sites such as Twitter and Facebook at a sufficient rate to ensure that all results are continually kept up to date. Once results from the web and social networking sites have been gathered the user will be presented with various methods capable of analyzing the results. The user will be capable of retrieving information regarding the result’s source such as the website, page and paragraph, the date at which the result was generated as well as the sentiment and the category of the result. Once all the results have been analyzed the user will have the option to generate various reports that will indicate how his or her company is seen by the online community. II. BACKGROUND The basic functionality of ORM involves scanning a variety of web pages and social networking sites, analyzing the results to determine their relevancy, using the analyzed results to generate various statistics and visualizing the results. Clearly an ORM system isn't a single process, but a collaboration of several processes that includes a web and social networking crawler, a method of analyzing the results and a sentiment analysis tool. A. Web crawler Web crawlers are programs that explore the World Wide Web, retrieve information and store the results for future use [5]. This process is known as web crawling. The type of data that is extracted from web pages depend on the implementation of the web crawler. Some web crawlers are configured to extract only specified phrases [6], while others extract and index each word in a web page for future use [7]. Figure 43: Basic Crawler Architecture Development of an online reputation monitor G.J.C. Venter, W.C. Venter, A.J. 
Hoffman School of Computer and Electronic Engineering, North-West University, Potchefstroom Campus, Private Bag x4001, x442, Potchefstroom 2520 Tel: +27 76 358 8241, Fax: +27 18 299 1977 Email: mgmGertV@gmail.com, willie.venter@nwu.ac.za, alwyn.hoffman@nwu.ac.za 99 Figure 3 shows the architecture of a basic web crawler [5]. Before a web crawler is initialized a user must specify a list of seed URLs which is stored in the frontier, a list of unvisited URLs. When the web crawler starts, it will load the first URL in the frontier, download and scan the contents and store any results in a database or on a local storage device. Once the crawler has finished scanning the page it will load the next URL from the frontier and repeat the process. This is known as the crawling loop. Many web sites contain multiple pages, which in turn contains additional subpages. This is illustrated in Figure 4. Figure 44: Crawl Depth Illustration Web crawlers are often configured to scan only a certain depth. This is the extent to which a web crawler scans the page and is specified when the user adds the seed URL. If the user wish to scan only the original page the crawl depth will be set to 0. If the user sets the crawl depth to 1, the web crawler will scan the seed URL and add the URLs for Page 1, Page 2 and Page3 to the top of the frontier. Once the web crawler has finished scanning the seed URL, it will proceed to download and scan Page 1, followed by Page 2 and finally Page 3. The same process will be followed for Page 4-9 if the user sets the crawl depth to 2. Web crawler execution ends when all the links in the frontier have been scanned. At this stage timers are usually initialized which will redeploy the crawler at a specific time or after a specified interval. B. Social Networking Crawlers Unlike regular web sites, web crawlers cannot scan social networking sites. Social networking sites only present data once the user has registered and even if logged in the site will only show data regarding the people the current user is connected to. Normal web crawlers do not possess authentication capacities and therefore cannot access social networking sites and those that do will only have access to limited info. Therefore various social networking sites have designed an interface that allows registered applications to access the site’s public data. This interface is known as an Application Programming Interface (API). All social networking sites have different API’s and therefore different methods of extracting information. As such it will not be possible to design a universal method of interfacing with all social networking sites. The project will only be able to interface with some social networking sites due to time constraints. It was decided to interface with Facebook and Twitter as they had the highest Alexa rank at the time, which is an indicator of network traffic. C. Sentiment analysis tools Once the web or social networking crawler returns a result it must first be analyzed to determine whether the result contains any relevant data before it can be presented to the user. However, the user often wishes to know whether the sentiment of the result is positive or negative. There are two main methods used to calculate the sentiment of some text, namely the lexical approach and the machine learning approach. A system that is based on the lexical approach uses a dictionary of pre-tagged words. Each word in the text that has to be analyzed is compared against the dictionary to find its polarity. 
This indicates whether the word has a positive or negative sentiment as well as the strength of the sentiment. After the polarity of each word has been determined, the polarity of the given text is calculated by summing the polarity for each word. According to Annett [8], the accuracy of such a system varies between 64% and 82%, depending on orientation of statistic metrics and the dictionary that was used. A system that uses the machine learning approach uses two components; a series of feature vectors and a collection of tagged corpora which is a collection of documents that the system uses to train itself [9]. Feature vectors are usually a variety of uni-grams, single words from within a document, or n-grams, two or more words from a document that are in sequential order. Other features that are often proposed include the number of positive words, number of negative words, and the length of a document. Both the feature vectors and collection of tagged corpora are used to train a classifier, which can be applied to an untagged document to determine its sentiment. According to Annett [8], the accuracy of such a system varies between 63% and 82%, but the results are dependent of the features that were selected. III. DESIGN As stated in the introduction, existing ORM system have various technical limitations that limit their usefulness and therefore it was decided to implement a new ORM system. To include all the functionality of ORM as discussed in Section II it was decided to develop a system consisting of three parts, namely a Back-End, a Front-End and a Website. A. The Back-End To begin the ORM process information must be retrieved from the internet. This will be done by the Back-End, which will initialize and maintain all the web and social networking crawlers. Before the ORM process can be started the user must specify a list of seed URLs, their respective crawl depths as well as a series of keywords. The URLs and the crawl depths will be used by the web crawler whereas the social networking crawler will use the keywords to find any results from Facebook and Twitter. 100 Information gathered by the web and social networking crawlers will be saved either to a database or a local storage device. The storage location will depend on the type of web and social networking crawler that will be used. Once the Back-End has finished crawling the web and social networking sites it will proceed to wait for a predetermined amount of time before restarting the process. The Back-End will be designed to function with as little human interaction as possible. The only interaction that may be required would be to verify that the Back-End has not encountered an error or to turn off the system in times of maintenance. B. The Front-End After the data has been acquired by the Back-End the results need to be processed. This is the goal of the Front-End. The Front-End will proceed to load the stored results for a specified date range and process the data by identifying the paragraphs in which the keywords feature before passing the paragraphs through a similarity filter and a sentiment analysis tool. Afterwards the user will be able to verify the results before saving them to a database. The similarity filter will filter out results that are too close to the keywords, such as tags within a web page. This will lessen the amount of results that will have to be processed by the sentiment analysis filter. 
The sentiment analysis tool will use the results that have passed through the similarity filter and determine whether the results are positive or negative. These results will be used to determine the overall opinion of the company or brand that is being investigated. Once the sentiment of the results have been calculated they will be saved back in the database. The Front-End will be designed to allow multiple instances of the software to run in parallel. This will allow multiple system administrators to access the system at the same time which will increase the amount of results that can be processed at once. While it will be possible to fully automate the Front-End, it would be wise to allow some degree of human interaction to verify if the results that pass through the filters are correct. The main reason for this is the fact that the meaning of words can differ depending on the context which it is used in. Using the keyword “Kalahari” to monitor the public opinion of the online marketplace “Kalahari.com” may yield results for both the online marketplace and the Kalahari Desert, which will influence the results and statistics. C. The Website Once the results have been processed the system administrator will require a method to display the processed results to the user or any other interested party. This can easily be accomplished via a web site as it will allow the data to be accessed anywhere. IV. COMPONENT SELECTION The implementation of the ORM system will make use of existing components - there is no use in redesigning the wheel. Available components will be evaluated and component selection will be based on the results of the evaluation. A. Web crawler Ideally the ORM system would make use of a powerful web crawler such as GoogleBot or BingBot, respectively developed by Google and Microsoft. Unfortunately these crawlers are not available for public use. An investigation discovered many free and commercially available web crawlers available on the internet. A commercial text retrieval engine as well as several open source web crawlers were acquired and evaluated. The dtSearch Engine is a commercial text retrieval engine with many features that provides the user with a web crawler that is actively maintained. The user has no control over the internal functionality of the dtSearch Engine as it is not open- source but the dtSearch Engine can be manipulated by using the settings that are presented. The dtSearch Engine creates index files when scanning the internet which can be accessed by a result generator included with the web crawler. The HTML Agility Pack (HAP) is a free library that allows the user to parse web pages directly from the web. The HAP does not have any internal web crawling capabilities but provides the user with components to build a custom web crawler and result generator. The HAP is open source; as such the HAP is completely cost-free but support for the library may stop if the community stops working on the project. A second open-source web crawler named Abot was evaluated. The Abot crawler contains pre-programmed functionality that is based on the HAP while still presenting the user with as much functionality as possible. Various aspects of the crawlers were compared using a weighted average to determine the web crawler best suited for this project. 
IV. COMPONENT SELECTION
The implementation of the ORM system will make use of existing components - there is no use in reinventing the wheel. Available components will be evaluated and component selection will be based on the results of the evaluation.

A. Web crawler
Ideally the ORM system would make use of a powerful web crawler such as GoogleBot or BingBot, developed by Google and Microsoft respectively. Unfortunately these crawlers are not available for public use. An investigation identified many free and commercially available web crawlers on the internet. A commercial text retrieval engine as well as several open-source web crawlers were acquired and evaluated.
The dtSearch Engine is a commercial text retrieval engine with many features that provides the user with an actively maintained web crawler. The user has no control over the internal functionality of the dtSearch Engine, as it is not open-source, but it can be manipulated through the settings that are exposed. The dtSearch Engine creates index files when scanning the internet, which can be accessed by a result generator included with the web crawler.
The HTML Agility Pack (HAP) is a free library that allows the user to parse web pages directly from the web. The HAP does not have any internal web crawling capabilities but provides the user with components to build a custom web crawler and result generator. The HAP is open source and therefore completely cost-free, but support for the library may stop if the community stops working on the project.
A second open-source web crawler named Abot was evaluated. The Abot crawler contains pre-programmed functionality that is based on the HAP while still presenting the user with as much functionality as possible.
Various aspects of the crawlers were compared using a weighted average to determine the web crawler best suited for this project. The aspects considered were the speed with which the web crawler is able to scan the web, the time it will take to incorporate the web crawler into the ORM system, additional features and customizability, as well as cost. The results are shown in Table 14. For a full discussion of the weights assigned to the different aspects, as well as the scores of the different web crawlers for each aspect, please refer to the dissertation “Development of an Online Reputation Monitor” by G.J.C. Venter at the North-West University.

Table 14: Web crawler comparison
Category                    dtSearch Engine   HAP   Abot
Speed (30%)                 26                27    19
Implementation time (30%)   30                6     12
Existing features (15%)     15                2     7
Customizability (15%)       7                 15    12
Cost (10%)                  0                 10    10
Total (100%)                78                60    60

From Table 14 it can be seen that, despite its cost, the dtSearch Engine is best suited for the ORM service. While the HAP and Abot web crawlers are almost as fast and are free, they fall short in the “implementation time” and “existing features” categories.

B. Social networking crawlers
As already explained, to extract sufficient information from social networking sites the ORM system has to make use of each social network’s API. The API provided by Twitter enables the user to extract data in one of two ways: by performing a search query via the REST API or by getting a continuous feed of data from the Streaming API. This project will make use of the Streaming API, because querying the REST API is slow and creates significant internet overhead.
Two free Twitter components are available to give the user access to the Twitter Streaming API, namely TweetInvi and Linq2Twitter. The differences between the components are marginal; both return only results that match a specified set of conditions. Various aspects of the components were compared using a weighted average to determine the component best suited for the project. The aspects considered were the number of results, the time it will take to implement the component within the ORM system, additional features and customizability. The results are shown in Table 15. For a full discussion of the weights assigned to the different aspects, as well as the scores of the different components for each aspect, please refer to the dissertation “Development of an Online Reputation Monitor” by G.J.C. Venter at the North-West University.

Table 15: Twitter API component comparison
Category                    TweetInvi   Linq2Twitter
Amount of results (30%)     27          30
Implementation time (30%)   30          10
Existing features (20%)     7           20
Customizability (20%)       10          20
Total (100%)                74          80

From Table 15 it can be seen that the TweetInvi component would be faster to implement, but the limited information in its results makes it less viable than Linq2Twitter. TweetInvi would also require additional filters to remove non-English results, which would take additional time. In accordance with these results Linq2Twitter was chosen for the ORM system.
Facebook has a different API than Twitter and requires a different interfacing method and a different component. As the ORM system has to operate within a Microsoft Windows environment, it will make use of the Facebook SDK for .NET.
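The totals in Tables 14 and 15 are simple weighted sums: each category score is already expressed out of its weight, so a component's total is obtained by adding its category scores. A short Python illustration using the Table 14 values (Table 15 works the same way):

    # Recompute the totals of Table 14: category scores are already scaled to
    # their weights, so each crawler's total is the sum of its category scores.
    table_14 = {
        "dtSearch Engine": [26, 30, 15, 7, 0],
        "HAP":             [27, 6, 2, 15, 10],
        "Abot":            [19, 12, 7, 12, 10],
    }
    for crawler, scores in table_14.items():
        print(crawler, sum(scores))   # 78, 60 and 60, matching Table 14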
C. Result Analysis Method – String Similarity Formula
The results of the web and social networking crawlers that contain the specified keywords will be saved, but a certain portion of the results, such as tags within a web page, will have insignificant meaning. Manually filtering out these results would cost the user valuable time, therefore a similarity filter based on the Levenshtein and Kuhn-Munkres algorithms will be used to filter out unnecessary results. Other string similarity formulas are available, as proposed by Mihalcea [10] and Li [11], but these were deemed too complex for this implementation.
The Levenshtein algorithm measures the similarity between two strings by calculating the least number of edit operations necessary to modify one string into the other [12]. The Kuhn-Munkres algorithm is an algorithm capable of solving linear assignment problem (LAP) instances [13]. The similarity between a result and its keywords can be calculated by breaking up both the result and the keywords into separate lists of their individual terms, called tokens, comparing the similarity of the tokens within each list against each other, and then comparing the tokens in the list of keywords against the tokens in the list of results. The Levenshtein algorithm compares two tokens by providing a set of rules that calculates the cost of changing one token into the other through a series of one-step operations. Each one-step operation has an associated cost: a substitution costs 2 units, whereas an insertion or a deletion costs 1 unit. After the tokens in each list have been compared, the algorithm compares the tokens between the list of results and the list of keywords. This produces a cost matrix, which is solved by the Kuhn-Munkres algorithm. For a full discussion of the string similarity tool, please refer to the dissertation “Development of an Online Reputation Monitor” by G.J.C. Venter at the North-West University.
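To illustrate the idea, the sketch below implements a token-based similarity score in Python using the stated edit costs (substitution 2, insertion/deletion 1) and SciPy's linear_sum_assignment as the Kuhn-Munkres solver. The normalisation step and the example keyword are simplifications for illustration and do not reproduce the project's actual implementation.

    # Token-based string similarity: Levenshtein distances between keyword
    # tokens and result tokens are arranged in a matrix and matched with the
    # Kuhn-Munkres (Hungarian) algorithm, here provided by SciPy.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def levenshtein(a: str, b: str) -> int:
        """Edit distance with cost 1 for insertions/deletions and 2 for substitutions."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                            # deletion
                               cur[j - 1] + 1,                         # insertion
                               prev[j - 1] + (0 if ca == cb else 2)))  # substitution
            prev = cur
        return prev[-1]

    def token_similarity(a: str, b: str) -> float:
        """Map the edit distance to a similarity between 0 and 1."""
        return 1.0 - levenshtein(a, b) / (len(a) + len(b))

    def string_similarity(result: str, keywords: str) -> float:
        r_tokens = result.lower().split()
        k_tokens = keywords.lower().split()
        sim = np.array([[token_similarity(k, r) for r in r_tokens] for k in k_tokens])
        rows, cols = linear_sum_assignment(sim, maximize=True)         # Kuhn-Munkres matching
        return sim[rows, cols].sum() / max(len(r_tokens), len(k_tokens))

    # A bare tag is almost identical to the keyword and would be filtered out,
    # whereas a full sentence mentioning the keyword scores much lower.
    print(round(string_similarity("BrandX", "BrandX"), 2))                             # 1.0
    print(round(string_similarity("BrandX opened a new store in town", "BrandX"), 2))  # ~0.14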
D. Sentiment Analysis Tool
Once a record has been generated and processed, its sentiment must be calculated to determine whether the message is positive or negative. Various sentiment analysis tools are available on the internet, such as the AlchemyAPI [14] and the Salience Engine provided by Lexalytics [15]. To use the AlchemyAPI the user must first register on the AlchemyAPI website to acquire a software development kit (SDK) and two registration keys. To determine the sentiment of a sentence the user uses the SDK with the two keys to verify the application and passes the sentence to be analysed as a parameter. AlchemyAPI then analyses the sentence and presents the user with the result.
The Salience Engine from Lexalytics provides the user with a multi-lingual text analysis engine that can be integrated into systems for business intelligence and social media monitoring. The Salience Engine includes many features, such as sentiment analysis, named entity extraction and summarization. On features alone the Salience Engine would be better suited for this project, but to acquire the Salience Engine the user has to contact Lexalytics and request access to the system. There are also no pricing options available on the website, whereas the AlchemyAPI SDK can easily be acquired once registered. As such the AlchemyAPI was used for this project, but the Salience Engine can be investigated in the future.

V. IMPLEMENTATION
The advantages and disadvantages of different implementations of the chosen tools are discussed in this section.

A. dtSearch Engine – Web crawling
Scanning thousands of web pages with the basic dtSearch Engine implementation takes a long time, because the dtSearch Engine scans a single page at a time while the other pages are kept in a queue. To overcome this problem, multiple instances of the dtSearch Engine are initiated in the ORM system and the web pages are divided amongst them. Multiple instances require the use of threading at the cost of more system resources. To investigate the threading option the number of threads was increased while the CPU, memory and internet bandwidth usage were monitored. All tests were done using a 1 Mb/s (128 kB/s) internet connection. The results are shown in Table 16.

Table 16: Threading comparison
Threads            1       2       4       8       16
Time (s)           2381    1259    722.37  573.41  430.03
CPU usage (%)      1.5     2.5     4.1     5.0     5.1
Memory usage (%)   1.4     2.5     4.1     7.08    14.5
Bandwidth (kB/s)   23.67   57.02   98.16   127.18  133.28
Websites loaded    788     788     788     788     788
Stable             Yes     Yes     Yes     Yes     Yes

From Table 16 it can be seen that as the number of threads increased, the time required to scan all the sites decreased, but the usage of system resources increased. More threads could be used, but at 32 threads there were too many simultaneous web connections, which caused some of the web crawlers to time out because they could not receive information fast enough. This problem can be solved with a faster internet connection.

B. Social networking sites – Twitter
The Linq2Twitter component provides results as they are received from the Twitter API. The number of results is not dependent on the computer speed and multiple instances are therefore not needed. During testing it was noticed that some results from the Linq2Twitter component cannot be used by the ORM service, e.g. non-English results and results that have been detected previously. Saving these results to the database would increase the execution time of the Front-End, as the results would have to be filtered out each time the results for a specific keyword are requested. To prevent these results from being saved, an additional filter was developed that only saves results that are in English, have not been detected before and actually contain the keyword in the message. Results failing any of these checks are discarded.

Table 17: Twitter filter results
                     Test 1   Test 2   Test 3
Correct              1377     1657     1942
Incorrect language   3426     3055     3184
Duplicate record     295      830      658
Total                5098     5542     5784
Percentage correct   27.0%    29.8%    33.5%

The effect of the filter can be seen in Table 17. Each result was passed through the filter as it was received. It can be seen that only about 30% of the results are indeed viable for the ORM service.
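A minimal Python sketch of these three checks is given below. The language test and the store of previously seen IDs are simplified stand-ins for the mechanisms used in the actual system.

    # Twitter result filter: keep a tweet only if it is in English, has not been
    # detected before and actually contains the keyword in its message.
    seen_ids = set()   # in the real system, previously detected results are in the database

    def looks_english(text: str) -> bool:
        # crude stand-in for a proper language check
        return all(ord(ch) < 128 for ch in text)

    def keep_tweet(tweet_id: int, message: str, keyword: str) -> bool:
        if not looks_english(message):
            return False                      # counted as "Incorrect language" in Table 17
        if tweet_id in seen_ids:
            return False                      # counted as "Duplicate record" in Table 17
        if keyword.lower() not in message.lower():
            return False                      # keyword missing from the message
        seen_ids.add(tweet_id)
        return True                           # counted as "Correct" in Table 17

    print(keep_tweet(1, "BrandX has great specials today", "BrandX"))  # True
    print(keep_tweet(1, "BrandX has great specials today", "BrandX"))  # False: duplicate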
C. Social networking sites – Facebook
Unlike Twitter, the Facebook streaming components do not provide the user with a continuous stream of posts, but with limited information regarding user and page status updates. To retrieve this information the user has to perform a search request using another component of the Facebook SDK called the Graph API. The Graph API provides the user with various pieces of information regarding the latest public Facebook posts, such as the user ID, the message and the date at which it was created, and always returns the latest 25 results. During integration it was discovered that if the Graph API is queried once per minute the ORM system will not lose any results, as results are generated very slowly on the Facebook system, but it will receive numerous duplicates. To prevent these duplicates from being saved, an additional filter was developed to filter out any duplicate results. To test the filter, five queries were executed, each a minute apart. For the first query the filter reported 70.4% unique results, for the second 16%, for the third 16.8%, for the fourth 14.4% and for the fifth 24.8%. The initial query contains the most unique results, and subsequent queries contain fewer unique results as more results are duplicated and filtered out.
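A sketch of this once-a-minute polling and duplicate-filtering loop is shown below. Here fetch_latest_posts is a hypothetical stand-in for the Graph API search request made through the Facebook SDK, assumed to return the latest 25 public posts as (id, message, created_time) tuples.

    # Once-a-minute polling of the (hypothetical) fetch_latest_posts stand-in,
    # discarding any post whose ID has already been seen in an earlier query.
    import time

    def poll_facebook(keyword, fetch_latest_posts, queries=5):
        seen_ids, unique_posts = set(), []
        for i in range(queries):
            if i:
                time.sleep(60)                       # query once per minute
            for post_id, message, created_time in fetch_latest_posts(keyword):
                if post_id not in seen_ids:          # duplicate filter
                    seen_ids.add(post_id)
                    unique_posts.append((post_id, message, created_time))
        return unique_posts

    # Example with a fake feed and a single query (avoids the 60 s wait in this demo).
    feed = [(1, "BrandX opened a new store", "2014-09-01"),
            (2, "Visited BrandX today", "2014-09-01")]
    print(len(poll_facebook("BrandX", lambda kw: feed, queries=1)))  # 2 unique posts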
VI. RESULTS

A. Web Crawling
The web crawler was loaded with 250 seed websites, each with a crawl depth of 1. The results of the crawler test are shown in Table 18.

Table 18: Web crawler results
ID    Time                 #Seed URLs   #Links detected   #Links scanned   Crawl rate (pages/s)
1     3 hours 31 minutes   16           6332              2877             0.23
2     3 hours 43 minutes   16           4221              1585             0.12
3     3 hours 17 minutes   16           4604              2075             0.18
4     2 hours 29 minutes   16           3182              1175             0.13
5     2 hours 32 minutes   16           3352              1239             0.14
6     2 hours 55 minutes   16           4056              1773             0.17
7     2 hours 40 minutes   16           4085              1555             0.16
8     3 hours 19 minutes   16           4091              1458             0.12
9     2 hours 53 minutes   16           3897              1690             0.16
10    2 hours 50 minutes   16           3611              1317             0.13
11    3 hours 29 minutes   15           4302              1617             0.13
12    3 hours 11 minutes   15           4805              1805             0.16
13    2 hours 5 minutes    15           2741              1184             0.16
14    3 hours 6 minutes    15           6631              2194             0.2
15    3 hours 34 minutes   15           3683              1326             0.1
16    2 hours 4 minutes    15           3014              1028             0.14
All   2 hours 59 minutes   250          66607             25898            2.41

From Table 18 it can be seen that the web crawler scanned web pages at a combined rate of 2.41 web pages per second. At this rate 1 000 web pages can be scanned well within 40 minutes. If a higher scan rate is required, it can be achieved by using a more powerful computer with a faster internet connection. To test whether the web was scanned correctly, a single keyword was chosen and the number of instances found was verified against the original web pages; all instances were found. The test was executed on a computer with a 3 GHz CPU, 16 GB RAM and a 1 Mb/s internet connection. It should be noted that fewer links were scanned than detected. This occurred because of duplicate links that were detected within web pages, such as style sheets, as well as links that are not allowed to be crawled, as dictated by the website.

B. Twitter Scanning
The Twitter Scanner was loaded with 26 keywords and executed for 1 hour 35 minutes. During the execution 74 201 results were detected at an average rate of 13.01 results per second. Of the 74 201 results, 51 556 were in a non-English language and 2 882 were duplicates of previously detected results. When the remainder was passed through the similarity filter, 6 542 results were removed, which provides the ORM service with 13 221 viable results. Due to the nature of social networking sites it is impossible to verify the results against the Twitter database, and therefore only the results as presented by Linq2Twitter can be verified.

C. Facebook Scanning
The Facebook Scanner was loaded with 26 keywords and executed for 1 hour 54 minutes. During the execution 13 250 results were detected at an average rate of 2.12 results per second. Of the 13 250 results, 11 469 were duplicates of previously detected results, which left 1 781 results. When passed through the similarity filter, 766 results were found viable for the ORM service. Due to the nature of social networking sites it is impossible to verify the results against the Facebook database, and therefore only the results as presented by the Facebook SDK can be verified.

D. Sentiment Analysis
To test the sentiment analysis tool, the sentiment of 125 random results from each of the three crawlers (web, Twitter and Facebook) was calculated and manually verified. For the web crawler 91.8% of the sentiments were calculated correctly, for the Twitter Scanner 91.2% and for the Facebook Scanner 95.2%.

VII. CONCLUSION
This paper demonstrated an online reputation monitoring system capable of monitoring both the web and various social networking sites. The system extracts information from the internet in real time from various user-defined websites as well as Twitter and Facebook, and can determine via various filters which results are most likely to contain relevant information. This improves upon existing ORM applications, as neither BrandsEye nor Brand.Com can simultaneously scan the internet, Twitter and Facebook, nor can either of these systems grant the user immediate access to historical data. For a full discussion of the results and comparisons of the ORM system, please refer to the dissertation “Development of an Online Reputation Monitor” by G.J.C. Venter at the North-West University.
While this system is capable of performing ORM services, there are various areas available for future research. Development of an offline sentiment analysis tool as well as a custom web crawler would greatly improve the cost-efficiency of the system. Some of the included filters could also be improved further; while the Twitter Scanner automatically filters out non-English results, this feature is not yet available for the Facebook Scanner.

VIII. BIBLIOGRAPHY
[1] A. B. Ayanwale, T. Alimi and M. A. Ayanbimipe, "The influence of advertising on consumer brand preference," Journal of Social Science, vol. 10, no. 1, pp. 9-16, 2005.
[2] S. J. Hoch and Y.-W. Ha, "Consumer learning: advertising and the ambiguity of product experience," Journal of Consumer Research, pp. 221-223, 1986.
[3] F. F. Reichheld, "The one number you need to grow," Harvard Business Review, vol. 81, no. 12, pp. 46-55, 2003.
[4] N. Hu, L. Liu and J. J. Zhang, "Do online reviews affect product sales? The role of reviewer characteristics and temporal effects," Information Technology and Management, vol. 9, no. 3, pp. 201-214, 2008.
[5] G. Pant, P. Srinivasan and F. Menczer, in Crawling the Web, Springer, 2004, pp. 153-177.
[6] Web2Mine, "Easy Web Extract Software," Web2Mine, 2013. [Online]. Available: http://webextract.net/. [Accessed July 2013].
[7] dtSearch, "How dtSearch Works," dtSearch, [Online]. Available: http://www.dtsearch.com/PLF_howdtworks.html. [Accessed March 2013].
[8] M. Annett and G. Kondrak, "A comparison of sentiment analysis techniques: Polarizing movie blogs," in Advances in Artificial Intelligence, Springer, 2008, pp. 25-35.
[9] R. Feldman, "Techniques and applications for sentiment analysis," Communications of the ACM, vol. 56, no. 4, pp. 82-89, 2013.
[10] R. Mihalcea and P. Tarau, "A language independent algorithm for single and multiple document summarization," Proceedings of IJCNLP, vol. 5, 2005.
[11] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138-1150, 2006.
[12] RosettaCode.org, "Levenshtein Distance," 25 January 2014. [Online]. Available: http://rosettacode.org/wiki/Levenshtein_distance. [Accessed 3 March 2014].
[13] D. Dasgupta, G. Hernandez, D. Garrett, P. K. Vejandla, A. Kaushal, R. Yerneni and J. Simien, "A comparison of multiobjective evolutionary algorithms with informed initialization and Kuhn-Munkres algorithm for the sailor assignment problem," in Proceedings of the 2008 GECCO Conference Companion on Genetic and Evolutionary Computation, ACM, 2008, pp. 2129-2134.
[14] AlchemyAPI, "About Us," Alchemy API, 2014. [Online]. Available: http://www.alchemyapi.com/company/about-us/. [Accessed 17 April 2014].
[15] Lexalytics, "Salience Engine," Lexalytics, 2014. [Online]. Available: http://www.lexalytics.com/technical-info/salience-engine-for-text-analysis. [Accessed 28 April 2014].

G.J.C. Venter received his undergraduate degree from the North-West University (NWU) in Potchefstroom in 2012. He is currently studying towards his Master's in Engineering at the same institution. His research interests include web data extraction and online reputation monitoring.

Bibliography
[1] N. Hu, L. Liu and J. J. Zhang, “Do online reviews affect product sales? The role of reviewer characteristics and temporal effects,” Information Technology and Management, vol. 9, no. 3, pp. 201-214, 2008.
[2] F. F. Reichheld, “The one number you need to grow,” Harvard Business Review, vol. 81, no. 12, pp. 46-55, 2003.
[3] E. Garner, “Customer Feedback Techniques,” Business Know-How, [Online]. Available: http://www.businessknowhow.com/marketing/customer-feedback.htm. [Accessed April 2013].
[4] BrandsEye, “About Overview,” BrandsEye, 2013. [Online]. Available: http://www.brandseye.com/about-us/overview/. [Accessed January 2013].
[5] Brand.Com, “Brand.com,” Brand.com, 2013. [Online]. Available: http://www.brand.com/biz-control-plans.html. [Accessed September 2013].
[6] A. Engelbrecht, Interviewee, BrandsEye employee. [Interview]. 02 August 2014.
[7] A. B. Ayanwale, T. Alimi and M. A. Ayanbimipe, “The influence of advertising on consumer brand preference,” Journal of Social Science, vol. 10, no. 1, pp. 9-16, 2005.
[8] S. J. Hoch and Y.-W. Ha, “Consumer learning: advertising and the ambiguity of product experience,” Journal of Consumer Research, pp. 221-223, 1986.
[9] E. Amigó, J. Artiles, J. Gonzalo, D. Spina, B. Liu and A. Corujo, “WePS-3 evaluation campaign: Overview of the online reputation management task,” in CLEF 2010 (Notebook Papers/LABs/Workshops), 2010.
[10] G. Pant, P. Srinivasan and F. Menczer, in Crawling the Web, Springer, 2004, pp. 153-177.
[11] Web2Mine, “Easy Web Extract Software,” Web2Mine, 2013. [Online]. Available: http://webextract.net/. [Accessed July 2013].
[12] dtSearch, “How dtSearch Works,” dtSearch, [Online]. Available: http://www.dtsearch.com/PLF_howdtworks.html. [Accessed March 2013].
[13] I. Lunden, “Analyst: Twitter Passed 500M Users In June 2012, 140M Of Them In US; Jakarta ‘Biggest Tweeting’ City,” 30 July 2012. [Online]. Available: http://techcrunch.com/2012/07/30/analyst-twitter-passed-500m-users-in-june-2012-140m-of-them-in-us-jakarta-biggest-tweeting-city/. [Accessed 11 March 2014].
[14] Pear Analytics, “Twitter Study - August 2009,” August 2009. [Online]. Available: http://web.archive.org/web/20110715062407/www.pearanalytics.com/blog/wp-content/uploads/2010/05/Twitter-Study-August-2009.pdf. [Accessed 11 March 2014].
[15] Twitter, “REST API v1.1 Resources,” Twitter, 2013. [Online]. Available: https://dev.twitter.com/docs/api/1.1. [Accessed 11 March 2014].
[16] Twitter, “The Streaming API,” Twitter, 2013. [Online]. Available: https://dev.twitter.com/docs/api/streaming. [Accessed 11 March 2014].
[17] D. Rushe, “Facebook posts record quarterly results and reports $1.5bn profit for 2013,” The Guardian, 29 Jan 2014. [Online]. Available: http://www.theguardian.com/technology/2014/jan/29/facebook-record-quarterly-results?CMP=EMCNEWEML6619I2. [Accessed 11 March 2014].
[18] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorability using natural language processing,” in Proceedings of the 2nd International Conference on Knowledge Capture, ACM, 2003, pp. 70-77.
[19] M. Annett and G. Kondrak, “A comparison of sentiment analysis techniques: Polarizing movie blogs,” in Advances in Artificial Intelligence, Springer, 2008, pp. 25-35.
[20] R. Feldman, “Techniques and applications for sentiment analysis,” Communications of the ACM, vol. 56, no. 4, pp. 82-89, 2013.
[21] P. Blinov, M. Klekovkina, E. Kotelnikov and O. Pestov, “Research of lexical approach and machine learning methods for sentiment analysis,” Russia, 2012.
[22] CodePlex, “Html Agility Pack,” 11 Jul 2012. [Online]. Available: http://htmlagilitypack.codeplex.com/. [Accessed March 2013].
[23] dtSearch Corporation, “About dtSearch Corp,” dtSearch Corporation, 2013. [Online]. Available: http://dtsearch.com/dtsoftware.html. [Accessed January 2014].
[24] dtSearch, “dtSearch Text Retrieval Engine Programmer's Reference 7.73: Building and Maintaining Indexes,” dtSearch, 2013. [Online]. Available: http://support.dtsearch.com/webhelp/dtsearchCppApi/frames.html?frmname=topic&frmfile=Overviews.html. [Accessed 07 March 2014].
[25] dtSearch, “ContractIQ Adds dtSearch to their Contract Management Solution in Support of the Commercial Industry and IC Clients,” dtSearch, [Online]. Available: http://www.dtsearch.com/CS_Contract_IQ.html. [Accessed 09 August 2014].
[26] dtSearch, “Densan Consultants developed NewsLink,” dtSearch, [Online]. Available: http://www.dtsearch.com/CS_densan_consultants.html. [Accessed 09 August 2014].
[27] dtSearch, “ATS Tackles World Trade Law with dtSearch,” dtSearch, [Online]. Available: http://www.dtsearch.com/cs_networkats.html. [Accessed 09 August 2014].
[28] CodePlex, “Tweetinvi a friendly Twitter C# API,” The TweetInvi Team, 6 March 2014. [Online]. Available: https://tweetinvi.codeplex.com/. [Accessed 17 March 2014].
[29] CodePlex, “Linq2Twitter,” 7 Feb 2014. [Online]. Available: http://linqtotwitter.codeplex.com/. [Accessed 11 March 2014].
[30] RosettaCode.org, “Levenshtein Distance,” 25 January 2014. [Online]. Available: http://rosettacode.org/wiki/Levenshtein_distance. [Accessed 3 March 2014].
[31] D. Dasgupta, G. Hernandez, D. Garrett, P. K. Vejandla, A. Kaushal, R. Yerneni and J. Simien, “A comparison of multiobjective evolutionary algorithms with informed initialization and Kuhn-Munkres algorithm for the sailor assignment problem,” in Proceedings of the 2008 GECCO Conference Companion on Genetic and Evolutionary Computation, ACM, 2008, pp. 2129-2134.
[32] R. Mihalcea and P. Tarau, “A language independent algorithm for single and multiple document summarization,” Proceedings of IJCNLP, vol. 5, 2005.
[33] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea and K. Crockett, “Sentence similarity based on semantic nets and corpus statistics,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138-1150, 2006.
[34] G. V. Bard, “Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric,” Proceedings of the Fifth Australasian Symposium on ACSW Frontiers, vol. 68, pp. 117-124, 2007.
[35] F. J. Damerau, “A technique for computer detection and correction of spelling errors,” Communications of the ACM, vol. 7, no. 3, pp. 171-176, 1964.
[36] V. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Academician P.S. Novikov, vol. 164, no. 4, pp. 845-848, 1965.
[37] AlchemyAPI, “About Us,” Alchemy API, 2014. [Online]. Available: http://www.alchemyapi.com/company/about-us/. [Accessed 17 April 2014].
[38] M. Ogneva, “How Companies Can Use Sentiment Analysis to Improve Their Business,” Biz360, 19 April 2010. [Online]. Available: http://mashable.com/2010/04/19/sentiment-analysis/. [Accessed 08 September 2014].
[39] Facebook, “Public Feed API,” Facebook, 2013. [Online]. Available: https://developers.facebook.com/docs/public_feed/. [Accessed 27 March 2014].
[40] CodeProject, “An improvement on capturing similarity between strings,” Code Project, 5 Aug 2005. [Online]. Available: http://www.codeproject.com/Articles/11157/An-improvement-on-capturing-similarity-between-str. [Accessed March 2013].