
Assignment: Polyglot Persistence with NoSQL Systems

Introduction

In this assignment, you will show that you can work with different NoSQL systems for data persistence and that you understand the strengths and weaknesses of each system with respect to certain query workload features. You are asked to work with a question-and-answer data set and a list of target queries. The target queries include very basic OLTP-style queries and analytic queries. You will design a data schema for MongoDB and for Neo4j respectively, based on the features of the data set and the given query workload. You will show that your design can support the target queries by loading the data into each system following your schema and by running queries against the data.

Data set

The data that you will use is the latest dump (publication date: 2017-08-31) of the Open Data Stack Exchange question-and-answer site (https://opendata.stackexchange.com/). The dump is released and maintained by Stack Exchange: https://archive.org/details/stackexchange. The original dump contains many files in XML format. The assignment uses a subset of the data, stored in four CSV files. The data files and their description (readme.txt) can be downloaded from eLearning.

The assignment data set contains the following files (a sketch of how one row might map to a document follows the list):

• Posts.csv stores information about posts; each row represents a post, which can be a question or an answer.

• Users.csv stores user profiles; each row represents a user.

• Votes.csv stores detailed vote information about posts; each row represents a vote, including the vote type, the date the vote was made, and a few other details.

• Tags.csv contains a summary of tag usage on this site.
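To make the file descriptions concrete, here is one possible shape for a single Posts.csv row loaded into MongoDB. This is only an illustration, not a prescribed schema; the field names follow the usual Stack Exchange dump conventions and are assumptions to be checked against readme.txt.

```javascript
// Hypothetical document shape for one Posts.csv row (mongosh).
// Field names are assumed from the Stack Exchange dump; verify them in readme.txt.
db.posts.insertOne({
  Id: 123,                                        // post id
  PostTypeId: 1,                                  // 1 = question, 2 = answer
  AcceptedAnswerId: 456,                          // questions only
  ParentId: null,                                 // answers only: the parent question's Id
  CreationDate: ISODate("2017-01-15T10:20:30Z"),
  ViewCount: 42,
  OwnerUserId: 7,
  Tags: ["open-data", "csv"]                      // parsed into an array for topic queries
})
```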


Target Queries

Simple queries

(One possible MongoDB sketch for each simple query appears after this list.)

• Find all users involved in a given question (identified by id) and their respective profiles, including the creationDate, DisplayName, upVote and DownVote.

• Assuming each tag represents a topic, find the most viewed question in a certain topic.¹
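The following mongosh sketch shows one way both simple queries could be answered. It assumes posts and users collections with Stack Exchange dump field names (Id, ParentId, OwnerUserId, Tags, ViewCount, UpVotes, DownVotes); treat every name here as an assumption to adapt to your own schema.

```javascript
// Simple query 1: all users involved in a question, with selected profile fields.
const qid = 1234;  // hypothetical question id
db.posts.aggregate([
  { $match: { $or: [{ Id: qid }, { ParentId: qid }] } },   // the question and its answers
  { $group: { _id: "$OwnerUserId" } },                     // distinct participating users
  { $lookup: { from: "users", localField: "_id",
               foreignField: "Id", as: "profile" } },
  { $unwind: "$profile" },
  { $project: { _id: 0,
                DisplayName:  "$profile.DisplayName",
                CreationDate: "$profile.CreationDate",
                UpVotes:      "$profile.UpVotes",
                DownVotes:    "$profile.DownVotes" } }
])

// Simple query 2: most viewed question in a topic (Tags is a multikey array).
db.posts.find({ PostTypeId: 1, Tags: "open-data" })
        .sort({ ViewCount: -1 })
        .limit(1)
```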

Analytic queries

(Illustrative MongoDB sketches (a)–(f), one per query below, follow this list.)

• We want to discover the questions that are easy to answer. We measure easiness by the time it took to receive an accepted answer: the shorter the time, the easier the question. Given a list of topics, you are asked to list the question easiest to answer in each topic.

• We are interested in discovering trending topics in a certain period. An indicator of trending could be the number of users who have participated in the discussion during the period, by posting either questions or answers. You are asked to list the five hottest such topics in a given period, as well as the number of users in each.

• We are also interested in discovering the champion user in a certain topic. A champion user has the most accepted answers in that topic. Given a topic, you are asked to list the champion user and all questions in that topic in which the user has an accepted answer.

• Some questions may not have accepted answers. They are considered unanswered questions. We would like to recommend those questions to a group of potential answerers. For any active user, if the user has n answers accepted in a certain topic and n is greater than or equal to a given threshold value, we consider the user a potential answerer for unanswered questions in that topic. You are asked to recommend up to 5 of the most recent unanswered questions to a given user.

• We want to know whether the acceptance of an answer affects its upVote count as well as other answers' upVote counts. We denote the day an answer is accepted as the "decision day". Among all questions whose total number of upVotes is greater than or equal to a given threshold value, you are asked to find:

– The question whose accepted answer has the highest percentage of upVotes received after the "decision day";

– The question whose other answers, as a whole, have the highest percentage of upVotes received after the "decision day".

• Consider all users involved in a question as co-authors. For a given user, find the user's top 5 co-authors, based on the number of times they appear in the same question, either as originator or answerer.

¹ For any query, if there are more results than the specified limit (e.g. two questions tie for the highest view count but we only want one), you can return any of the valid results within the limit.
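Sketch (a), easiest question per topic. A minimal mongosh sketch; it assumes a precomputed TimeToAccept field on questions (the accepted answer's CreationDate minus the question's CreationDate, in milliseconds — see the preprocessing sketch in the Tasks section). All names are assumptions.

```javascript
// (a) Easiest question to answer in each of the given topics.
const topics = ["open-data", "csv"];  // hypothetical topic list
db.posts.aggregate([
  { $match: { PostTypeId: 1, Tags: { $in: topics },
              TimeToAccept: { $ne: null } } },
  { $unwind: "$Tags" },
  { $match: { Tags: { $in: topics } } },      // keep only the requested topics
  { $sort: { TimeToAccept: 1 } },             // shortest time to acceptance first
  { $group: { _id: "$Tags", easiest: { $first: "$$ROOT" } } }
])
```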
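Sketch (b), trending topics. Answers carry no Tags in the dump, so each answer is joined back to its question to find its topic; again, all collection and field names are assumptions.

```javascript
// (b) Five hottest topics in a period, by number of distinct participating users.
const from = ISODate("2017-01-01"), to = ISODate("2017-03-31");  // hypothetical period
db.posts.aggregate([
  { $match: { CreationDate: { $gte: from, $lte: to } } },
  { $lookup: { from: "posts", localField: "ParentId",
               foreignField: "Id", as: "q" } },              // answer -> its question
  { $addFields: { topicTags: { $cond: [{ $eq: ["$PostTypeId", 1] },
                                       "$Tags",
                                       { $arrayElemAt: ["$q.Tags", 0] }] } } },
  { $unwind: "$topicTags" },
  { $group: { _id: "$topicTags", users: { $addToSet: "$OwnerUserId" } } },
  { $project: { userCount: { $size: "$users" } } },
  { $sort: { userCount: -1 } },
  { $limit: 5 }
])
```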
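Sketch (c), champion user. Accepted answers are found by joining each question's AcceptedAnswerId back to the posts collection.

```javascript
// (c) Champion user for one topic: the most accepted answers in that topic.
db.posts.aggregate([
  { $match: { PostTypeId: 1, Tags: "open-data",              // hypothetical topic
              AcceptedAnswerId: { $ne: null } } },
  { $lookup: { from: "posts", localField: "AcceptedAnswerId",
               foreignField: "Id", as: "acc" } },
  { $unwind: "$acc" },
  { $group: { _id: "$acc.OwnerUserId",                       // the answerer
              accepted: { $sum: 1 },
              questions: { $push: "$Id" } } },
  { $sort: { accepted: -1 } },
  { $limit: 1 }
])
```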
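Sketch (d), recommending unanswered questions, in two steps: find the topics where the user clears the acceptance threshold, then pull recent unanswered questions in those topics. Whether a missing accepted answer is stored as null or as an absent field depends on your loading choices.

```javascript
// (d) Up to 5 most recent unanswered questions for a given user.
const uid = 7, threshold = 3;   // hypothetical user id and threshold
const topics = db.posts.aggregate([
  { $match: { PostTypeId: 1, AcceptedAnswerId: { $ne: null } } },
  { $lookup: { from: "posts", localField: "AcceptedAnswerId",
               foreignField: "Id", as: "acc" } },
  { $unwind: "$acc" },
  { $match: { "acc.OwnerUserId": uid } },      // accepted answers written by the user
  { $unwind: "$Tags" },
  { $group: { _id: "$Tags", n: { $sum: 1 } } },
  { $match: { n: { $gte: threshold } } }
]).toArray().map(t => t._id);

db.posts.find({ PostTypeId: 1, AcceptedAnswerId: null, Tags: { $in: topics } })
        .sort({ CreationDate: -1 })
        .limit(5)
```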
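Sketch (e), decision-day percentages, simplified to the first sub-query only. It assumes a votes collection with PostId, VoteTypeId and CreationDate, and the dump's convention that VoteTypeId 1 marks acceptance and 2 an upvote; for brevity the threshold is applied to the accepted answer's own upvotes, and mapping the winning answer back to its question (via ParentId) is left out.

```javascript
// (e) Accepted answer with the highest share of upvotes cast after its decision day.
const minUpvotes = 10;  // hypothetical threshold
db.votes.aggregate([
  { $match: { VoteTypeId: { $in: [1, 2] } } },        // 1 = accepted, 2 = upvote
  { $group: { _id: "$PostId",
              decisionDay: { $max: { $cond: [{ $eq: ["$VoteTypeId", 1] },
                                             "$CreationDate", null] } },
              upvoteDays: { $push: { $cond: [{ $eq: ["$VoteTypeId", 2] },
                                             "$CreationDate", null] } } } },
  { $match: { decisionDay: { $ne: null } } },         // keep accepted answers only
  { $addFields: { upvoteDays: { $filter: { input: "$upvoteDays",
                                           cond: { $ne: ["$$this", null] } } } } },
  { $addFields: { total: { $size: "$upvoteDays" },
                  after: { $size: { $filter: { input: "$upvoteDays",
                                               cond: { $gt: ["$$this", "$decisionDay"] } } } } } },
  { $match: { total: { $gte: minUpvotes } } },
  { $project: { pctAfter: { $divide: ["$after", "$total"] } } },
  { $sort: { pctAfter: -1 } },
  { $limit: 1 }
])
```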
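Sketch (f), top co-authors. Every post is first mapped to the question it belongs to, participants are collected per question, and co-occurrences are counted.

```javascript
// (f) Top 5 co-authors of a given user.
const uid = 7;  // hypothetical user id
db.posts.aggregate([
  { $addFields: { qid: { $cond: [{ $eq: ["$PostTypeId", 1] },
                                 "$Id", "$ParentId"] } } },   // owning question id
  { $group: { _id: "$qid", authors: { $addToSet: "$OwnerUserId" } } },
  { $match: { authors: uid } },           // questions the user participated in
  { $unwind: "$authors" },
  { $match: { authors: { $ne: uid } } },  // the co-authors
  { $group: { _id: "$authors", together: { $sum: 1 } } },
  { $sort: { together: -1 } },
  { $limit: 5 }
])
```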

Tasks

Your tasks include:

• Schema Design for each system

You should provide two schema design versions to support both the simple queries and the analytic queries. For each schema version, make sure you utilize features provided by the storage system, such as indexing, aggregation, ordering, filtering and so on. Please note that your schema may deviate a lot from the relational structure of the original data set. You will not get marks if you present a schema that is an exact copy of the relational structure of the original data set.

You can discard data that are not useful. You can duplicate data if you find that helps with certain queries. You may preprocess the data by adding fields that are not in the data set (see the preprocessing sketch after this list). Such preprocessing should not be used to create and save any query result directly in the database. You need to justify any preprocessing you have employed. The justification should include the benefits, the potential performance cost, and how often the preprocessing should be carried out. Keep in mind that the storage system is also used to support live online transactions: any activity such as posting or voting will send one or more write queries to the system. You should not use preprocessing that would cause significant delay to regular queries.

• Query Design and Implementation

Load the complete data into both systems and set up proper indexes that will be used by the target queries (a sketch of possible starting indexes follows this list). Design and implement all target queries. You may implement a query using the provided query command alone (e.g. the MongoDB shell or a Cypher query), as a combination of JavaScript and shell commands in the case of MongoDB, or as a Python/Java program. If a programming language is used, make sure that you do most processing on the server side. You may get a significant mark deduction if most of the analysis is implemented in the programming language.
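As an example of the kind of preprocessing the Schema Design task allows, the sketch below materialises a hypothetical TimeToAccept field on each question (the gap to its accepted answer, in milliseconds), which sketch (a) above relies on. It would run once after loading; afterwards a single accept event only needs to update one question, so regular writes stay cheap. All names are assumptions.

```javascript
// Precompute TimeToAccept (ms between a question and its accepted answer).
// One-off batch after loading; field names assumed from the Stack Exchange dump.
db.posts.find({ PostTypeId: 1, AcceptedAnswerId: { $ne: null } }).forEach(q => {
  const acc = db.posts.findOne({ Id: q.AcceptedAnswerId });
  if (acc) {
    db.posts.updateOne(
      { _id: q._id },
      { $set: { TimeToAccept: acc.CreationDate - q.CreationDate } }
    );
  }
});
```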
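And one possible starting set of indexes for the target queries, again assuming dump field names. The right set ultimately depends on your final schema, and explain() or profiling output should confirm the indexes are actually used.

```javascript
// Candidate indexes for the target queries (adjust to your final schema).
db.posts.createIndex({ Id: 1 }, { unique: true });    // point lookups and $lookup joins
db.posts.createIndex({ ParentId: 1 });                // question -> its answers
db.posts.createIndex({ Tags: 1, ViewCount: -1 });     // most-viewed question per topic
db.posts.createIndex({ Tags: 1, CreationDate: -1 });  // recent unanswered questions per topic
db.users.createIndex({ Id: 1 }, { unique: true });    // profile lookups
db.votes.createIndex({ PostId: 1, VoteTypeId: 1 });   // per-post vote scans
```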

Deliverables and Submission Guidelines

This is a group project; each group can have up to 2 students. Each group needs to produce the following:

• A Written Report

The report should contain four sections. The first section is a brief introduction to the project. Sections two and three should cover one system each. Section four should provide a comparison and summary.

There are no marks for section one. It is included to make the report complete, so please keep it really short.

Sections two and three should each contain the following two subsections:

– Schema Design

In this subsection, describe the schema with respect to the particular system. Your description should include information at the "table" and "column" level, as well as possible primary keys/row keys and secondary indexes. You should show sample data based on the schema. For instance, you may show sample documents of each MongoDB collection, or a sample property graph involving all node types and relationship types for Neo4j. Highlight the data that differ from the given data set and justify your decisions. These would include data you decide not to put in the database, data that are duplicated, or data that are the results of preprocessing.

– Query Design and Execution

In this subsection, describe the implementation of each target query. You may include the entire shell query command, or show pseudo code. You should also run sample queries and provide a brief performance analysis of each query, by citing actual query execution statistics if such statistics are available, or by indicating that certain indexes would be used.

In section four, compare the two systems with respect to ease of use, query performance and schema differences. You can also describe problems encountered in schema design or query design.

Submit a hard copy of your report, together with a signed group assignment sheet, in the Week 10 lab.

• System Demo

Each group will demo in the Week 10 lab. You can run the demo on your own machine, on a lab machine or on a cloud server. Please make sure you prepare the data before the demo; the marker does not need to see your data loading steps. The marker will ask you to run a few randomly selected queries to get an overview of the data model and query design. The actual marking will be done after the demo using the submitted code.

• Source Code/Script and Soft Copy of Report

Submit a zip file through the eLearning site before 2pm on Tuesday 10th of October, 2017 (week 10). The zip file should contain the following:

– a soft copy of the report

– the query script or program code for each system

– the data loading script or program for each system

– a Readme document explaining how to prepare the data and run the queries. The instructions should be detailed enough for the markers to quickly prepare the data and run the queries. For instance, it is preferable that you prepare a single script for data loading. You should also indicate where and how run-time arguments are supplied. If you used special features only available in a particular version or environment, indicate that as well. You may not get marks if the marker is not able to prepare the database and run your code.
