
Offline Implementation

fmalinowski edited this page Mar 18, 2015 · 18 revisions

ServerRunner Class that implements a basic web server running on port 8000. Before serving requests, it retrieves all the data from the MySQL database and generates the whole offline graph through an instance of HashtagGraphBuilder.

MySQLAccessor Used to connect to the MySQL database in order to extract tweets. It has the methods needed to query the database but, as described previously, the query is run only once, during startup, in order to improve performance.

HashtagGraphBuilder This class is used to build the complete offline graph in memory from the data extracted from the MySQL table tweets. It is called before the web server starts. The purpose is to avoid querying the database too often, so that the server can respond faster to the clients' requests. The HashtagGraphBuilder builds an instance of SortedHashtagGraph (a subclass of HashtagGraph). The generated SortedHashtagGraph is then used to produce smaller subgraphs when users send requests, rather than querying the MySQL database again, which can be quite slow.

The hashtags are the nodes of the graph. An edge exists between two hashtags when both are present in the same tweet. In addition, the nodes and edges are weighted according to their frequencies across all the tweets. If a hashtag appears in a tweet and its node does not exist yet, the node is added to the graph. If the node is already present in the graph, its weight is incremented. Edges behave the same way: if two hashtags appear together in the same tweet and no edge exists between them, that edge is added to the graph; if the edge is already present, its weight is incremented. This is the role of the method populateGraph in the HashtagGraphBuilder class, which is called during the creation of the SortedHashtagGraph.
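The add-or-increment rule described above can be sketched in a few lines. This is only an illustration in Python, not the project's populateGraph: the function name populate_graph, the input format (one list of hashtags per tweet), the lower-casing and the per-tweet de-duplication are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def populate_graph(tweets):
    """Build weighted hashtag nodes and edges from an iterable of hashtag lists."""
    node_weights = Counter()  # hashtag -> number of tweets containing it
    edge_weights = Counter()  # (tag_a, tag_b) with tag_a < tag_b -> co-occurrence count
    for hashtags in tweets:
        # Assumption: case is ignored and a tweet counts each hashtag once.
        unique_tags = sorted({tag.lower() for tag in hashtags})
        # A Counter adds a missing key or increments an existing one.
        node_weights.update(unique_tags)
        # Every unordered pair of hashtags in the same tweet is one edge occurrence.
        edge_weights.update(combinations(unique_tags, 2))
    return node_weights, edge_weights


tweets = [["NFL", "SuperBowl"], ["nfl", "Food"], ["NFL", "SuperBowl", "Food"]]
nodes, edges = populate_graph(tweets)
print(nodes["nfl"])                 # 3
print(edges[("nfl", "superbowl")])  # 2
```

Storing each edge under a sorted pair of hashtags keeps the graph undirected, so ("food", "nfl") and ("nfl", "food") never end up as two separate edges.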

Once the SortedHashtagGraph object is completely generated, the sortGraph() method is called. This method sorts all the nodes and edges by popularity (the weights of the nodes and edges), so that subgraphs containing only the most common nodes and edges can be extracted from it later.
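As a rough illustration of what sorting by popularity means here, the sketch below ranks the nodes by weight and each node's neighbours by edge weight. The function sort_graph, the adjacency-dictionary layout and the alphabetical tie-break are assumptions for the example, not the actual sortGraph() code.

```python
def sort_graph(node_weights, adjacency):
    """Return nodes ranked by weight and each node's neighbours ranked by edge weight."""
    # Heaviest nodes first; ties are broken alphabetically for a stable order.
    ranked_nodes = sorted(node_weights, key=lambda tag: (-node_weights[tag], tag))
    ranked_neighbours = {
        tag: sorted(neighbours, key=lambda other: (-neighbours[other], other))
        for tag, neighbours in adjacency.items()
    }
    return ranked_nodes, ranked_neighbours


node_weights = {"nfl": 3, "superbowl": 2, "food": 2}
adjacency = {
    "nfl": {"superbowl": 2, "food": 2},
    "superbowl": {"nfl": 2, "food": 1},
    "food": {"nfl": 2, "superbowl": 1},
}
ranked_nodes, ranked_neighbours = sort_graph(node_weights, adjacency)
print(ranked_nodes)               # ['nfl', 'food', 'superbowl']
print(ranked_neighbours["food"])  # ['nfl', 'superbowl']
```

Sorting once at startup means that answering a request only requires reading the first few entries of each list.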

HashtagQueryProcessor An instance of this class is created every time the server receives a request from a client for a given hashtag. The object instantiates a FrontEndGraphBuilder, which creates a subgraph (FrontEndGraph) of the original graph centered on the requested hashtag. The FrontEndGraph then generates a JSON object, which is sent back to the client.

FrontEndGraphBuilder This class is used to build the offline graph that will be displayed in the user's browser, that is to say the subgraph of the original one. That "small" graph is generated upon a request from a user.
We select the 20 most popular nodes from the SortedHashtagGraph (the original graph) that are linked to the requested hashtag. Then, for each of those 20 hashtags, we look for 5 hashtags that are linked to it. If one of those 5 new hashtags was already selected during the first selection, we select another node instead. Then, for the last hop, we select 2 hashtags linked to each of the 5 nodes retrieved in the previous step.
At the end of the process, we end up with 200 nodes, which is an acceptable number for displaying the graph with D3 without a significant impact on the responsiveness of the graph on the user's side.
However, this doesn't count the edges! Indeed, with such a number of nodes, we could get "200 choose 2" edges, that is to say about 20,000 edges. With the tweets that we handle in our project, though, about 1,200 to 1,300 edges are displayed in the worst cases.
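The upper bound quoted above is simply the number of unordered pairs among 200 nodes, which can be checked quickly (the variable names are only for this example):

```python
import math

node_count = 200
# Number of unordered pairs of distinct nodes: 200 * 199 / 2.
max_edges = math.comb(node_count, 2)
print(max_edges)  # 19900
```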
Once this subgraph is generated, the FrontEndGraphBuilder returns it to the HashtagQueryProcessor, which sends it back to the client in JSON format.
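The hop-by-hop selection can be sketched as follows. This is a simplified illustration, not the FrontEndGraphBuilder itself: the function build_front_end_graph, the fan_out parameter and the ranked_neighbours structure (each hashtag mapped to its neighbours, most popular first) are assumptions. The sketch also skips any hashtag that has already been selected at any hop, which may be stricter than what the project does.

```python
def build_front_end_graph(ranked_neighbours, root, fan_out=(20, 5, 2)):
    """Select a subgraph around `root`, taking the top-ranked unseen neighbours per hop."""
    selected = [root]
    seen = {root}
    frontier = [root]
    for limit in fan_out:
        next_frontier = []
        for tag in frontier:
            picked = 0
            for neighbour in ranked_neighbours.get(tag, []):
                if picked == limit:
                    break
                if neighbour in seen:
                    continue  # already selected: try the next most popular one
                seen.add(neighbour)
                selected.append(neighbour)
                next_frontier.append(neighbour)
                picked += 1
        frontier = next_frontier
    # Keep only the edges whose two endpoints were both selected.
    edges = {
        tuple(sorted((tag, other)))
        for tag in selected
        for other in ranked_neighbours.get(tag, [])
        if other in seen
    }
    return selected, sorted(edges)


ranked_neighbours = {
    "nfl": ["superbowl", "food", "patriots"],
    "superbowl": ["nfl", "patriots", "halftime"],
    "food": ["nfl", "pizza"],
    "patriots": ["superbowl", "nfl"],
    "halftime": ["superbowl"],
    "pizza": ["food"],
}
# A tiny fan-out keeps the example readable; the wiki describes 20, 5 and 2.
nodes, edges = build_front_end_graph(ranked_neighbours, "nfl", fan_out=(2, 1, 1))
print(nodes)       # ['nfl', 'superbowl', 'food', 'patriots', 'pizza']
print(len(edges))  # 5
```

Because the neighbour lists are already sorted by popularity, each hop just walks the front of a list, so no sorting is needed at request time.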

Some statistics are also added to the JSON object: the number of tweets in which each hashtag appears, the percentage of tweets that contain this hashtag, the degree of the node and its popularity rank. These statistics are based on the original graph and not on the subgraph, in order to give the user a correct picture of the original graph. All of them are computed on the backend to avoid any loss of performance on the client side.
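To make these statistics concrete, here is a minimal sketch of how they could be computed from the full graph. The function node_statistics, the JSON field names and the rounding are hypothetical; the actual JSON format produced by the FrontEndGraph is not described on this page.

```python
import json


def node_statistics(tag, node_weights, adjacency, total_tweets):
    """Compute per-hashtag statistics from the full graph, not the subgraph."""
    ranked = sorted(node_weights, key=lambda t: (-node_weights[t], t))
    count = node_weights[tag]
    return {
        "hashtag": tag,                                            # hypothetical field names
        "tweetCount": count,                                       # tweets containing the hashtag
        "tweetPercentage": round(100.0 * count / total_tweets, 2),
        "degree": len(adjacency.get(tag, {})),                     # number of linked hashtags
        "rank": ranked.index(tag) + 1,                             # 1 = most popular hashtag
    }


node_weights = {"nfl": 3, "superbowl": 2, "food": 2}
adjacency = {
    "nfl": {"superbowl": 2, "food": 2},
    "superbowl": {"nfl": 2, "food": 1},
    "food": {"nfl": 2, "superbowl": 1},
}
stats = node_statistics("food", node_weights, adjacency, total_tweets=4)
print(json.dumps(stats))
```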

Some pictures related to the offline mode:


The General info window of the graph



Over a node (here "NFL") for the query "NFL"



Over an edge (here between "valentinesday" and "valentine") for the query "NFL"



Over a node (here "Food") for the query "Food"
