Task 1: Estimating Population Size
The United States collects and analyzes demographic data on the U.S. population. The U.S. Census Bureau provides annual estimates of the population size of each U.S. state and region. Many important decisions rely on these estimated population dynamics, including investing in new infrastructure, such as schools and hospitals; establishing new job training centers; opening or closing schools and senior centers; and adjusting emergency services to the size and demographic characteristics of metropolitan and other areas, states, or the country as a whole. The census data and estimates are publicly available on the U.S. Census Bureau website.
As a professional in the data analytics industry, you should know how to use tools that support the different stages and methods of analyzing data. These tools include environments that support scraping data, wrangling data, and applying various analyses.
For this project, you will use Python to scrape the web links from the HTML code of the U.S. Census Bureau’s Population Estimates web page, use SQL to spot differences in the population size, and use linear regression in R to predict the size of the population of your state in 2020.
The goal is to demonstrate your skill sets with Python, SQL, and R to support various data analytics processes.
You will use versions of Python, SQL, and R of your choosing that you will indicate in the attached “Student Submission Form.” You will also include the names of the files that house your responses to the task prompts, the code used, and all input and output files you used in your analyses.
Submit a completed copy of the attached “Student Submission Form” that includes the following elements:
1. versions of the programming environments for Python, SQL, and R used for the task
2. an inventory of the code, input, and output files used in each part
Submit one zipped folder with three subfolders that include the code, input, and output files from each part of the task, and that has a completed “Student Submission Form” in the main folder. Place the responses to the task prompts from each part in one PDF file for each part, and include these PDF files in the respective subfolders.
Your submission must be your original work. No more than a combined total of 30% of the submission and no more than a 10% match to any one individual source can be directly quoted or closely paraphrased from sources, even if cited correctly. Use the Turnitin Originality Report available in Taskstream as a guide for this measure of originality.
You must use the rubric to direct the creation of your submission because it provides detailed criteria that will be used to evaluate your work. Each requirement below may be evaluated by more than one rubric aspect. The rubric aspect titles may contain hyperlinks to relevant portions of the course.
Part I: Python
Develop a web link scraper program in Python that extracts all of the unique web links that point to other web pages from the HTML code of the “Current Estimates” web link and that writes them to a comma-separated values (CSV) file as absolute uniform resource identifiers (URIs).
A. Explain how the Python program extracts the web links from the HTML code of the “Current Estimates” web link.
B. Explain the criteria you used to determine if a link is a locator to another HTML page. Specify the code segment that executes this action as part of your explanation.
C. Explain how the program ensures that relative links are saved as absolute URIs in the output file. Specify the code segment that executes this action as part of your explanation.
D. Explain how the program ensures that there are no duplicated links in the output file. Specify the code that executes this action as part of your explanation.
E. Provide the Python code you wrote to extract all the unique web links from the HTML code of the “Current Estimates” web link that point to other HTML pages.
F. Provide the HTML code of the “Current Estimates” web page.
G. Provide the CSV file that your script created.
H. Test your script and provide a screenshot of the successfully executed results.
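The steps named in prompts A through D (extracting links, filtering them, making them absolute, and removing duplicates) can be sketched roughly as follows. This is only an illustration under stated assumptions, not a model answer: the class and function names, the sample HTML, and the example.org base URL are all hypothetical, and the "other web page" criterion used here (an http or https link that is not the page itself) is one possible interpretation. The sketch uses only the Python standard library and parses an HTML string rather than downloading the live page.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def extract_links(html_text, base_url):
    """Return unique absolute http(s) links, in first-seen order."""
    parser = LinkCollector()
    parser.feed(html_text)
    page_url = urldefrag(base_url)[0]
    seen = set()
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL, then drop "#fragment".
        absolute = urldefrag(urljoin(base_url, href))[0]
        # Assumed criterion: keep only http/https links (skips mailto:, etc.).
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Links back to the same page do not point to another page.
        if absolute == page_url:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def write_links_csv(links, path):
    """Write one link per row under a single 'url' header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])
        writer.writerows([link] for link in links)


# Hypothetical sample input; not the real "Current Estimates" page.
SAMPLE_BASE = "https://example.org/data/index.html"
SAMPLE_HTML = """
<a href="/data/tables.html">Tables</a>
<a href="tables.html#top">Tables again</a>
<a href="https://example.org/about">About</a>
<a href="/data/tables.html">Duplicate</a>
<a href="mailto:someone@example.org">Mail</a>
<a href="#main">Skip to content</a>
"""
sample_links = extract_links(SAMPLE_HTML, SAMPLE_BASE)
```

A set tracks which links have already been seen, while a list preserves the order in which they appear on the page. Note that this sketch does not verify that a target is actually an HTML document; checking the file extension or the response content type would be a separate design decision.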
Part II: SQL
I. Identify the differences between the population size estimates that the Census Bureau provided for each U.S. state in two consecutive years, using the most current dataset and the latest historical dataset for the national total population, each loaded into its own SQL table.
J. Write code to join the two tables on the year and state fields into one SQL table that identifies the absolute differences (in whole rounded hundreds) of 10,000 individuals or more between the estimates in the two datasets. If the earlier estimate is larger by 10,000 or more, the cell should indicate a negative value. Provide a screenshot of your tested code showing successful execution.
K. Explain how you prepared the data and how the datasets were imported into two SQL tables. Provide a screenshot of the successfully executed SQL code.
L. Export the data from the SQL table into a CSV file, with rows representing the states and columns representing the years that both datasets estimate, that only shows the differences between the datasets (in whole rounded tens of thousands) that exceed 10,000 individuals.
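The join described in prompt J might look something like the sketch below, which runs SQL through Python's built-in sqlite3 module so that it is self-contained. Everything specific here is an assumption: the table and column names are hypothetical, the rows are placeholder values rather than Census figures, and the sign convention (current minus historical, so a larger earlier estimate yields a negative number) is one reading of the prompt. SQLite's ROUND does not accept negative digit counts, so rounding to hundreds is done by dividing by 100, rounding, and multiplying back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE current_estimates    (state TEXT, year INTEGER, population INTEGER);
    CREATE TABLE historical_estimates (state TEXT, year INTEGER, population INTEGER);
""")

# Placeholder rows for illustration only; NOT real Census data.
conn.executemany("INSERT INTO current_estimates VALUES (?, ?, ?)", [
    ("State A", 2010, 1_025_340),
    ("State B", 2010, 2_000_000),
    ("State C", 2010, 500_000),
])
conn.executemany("INSERT INTO historical_estimates VALUES (?, ?, ?)", [
    ("State A", 2010, 1_000_000),
    ("State B", 2010, 2_004_000),
    ("State C", 2010, 512_360),
])

# Join on state and year, keep differences of 10,000 or more in magnitude,
# and round the signed difference to the nearest hundred.
QUERY = """
    SELECT c.state,
           c.year,
           CAST(ROUND((c.population - h.population) / 100.0) * 100 AS INTEGER)
               AS difference
    FROM current_estimates AS c
    JOIN historical_estimates AS h
      ON c.state = h.state AND c.year = h.year
    WHERE ABS(c.population - h.population) >= 10000
    ORDER BY c.state, c.year
"""
differences = conn.execute(QUERY).fetchall()
```

With these placeholder rows, State B is filtered out because its difference is below 10,000, and State C shows a negative value because its historical estimate is the larger one. Other SQL engines offer different rounding functions, so the rounding expression may need adjusting for the environment you choose.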
Part III: R
M. Create a linear regression analysis with R to predict the size of the population for the state you live in for 2020 based on the Current Estimates Data dataset.
N. Explain how you prepared the data and how the dataset was imported into R, including a screenshot of your results.
O. Using the estimates for the most recent year in the dataset, create an R script to display a histogram (using one million as the interval size) of the current estimated population size of your state. Provide a screenshot of your results.
P. Create an R script that will tabulate a statistical description of the estimated 2020 data. Provide a screenshot of your results.
Q. Predict the population size of your state using a linear regression. Provide a screenshot of your results.
R. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.
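The calculation behind the linear regression in prompts M and Q can be illustrated with a short least-squares sketch. It is written in Python only to keep the examples in this document in one language; the task itself requires R, where the lm() and predict() functions fit and apply the same kind of model. The years and population values below are hypothetical placeholders chosen to lie on an exact line, not Census estimates, and fit_line is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder data: a made-up state growing by exactly 10,000 per year.
years = [2015, 2016, 2017, 2018, 2019]
population = [1_000_000, 1_010_000, 1_020_000, 1_030_000, 1_040_000]

slope, intercept = fit_line(years, population)
predicted_2020 = slope * 2020 + intercept
```

Because the placeholder points fall on a straight line, the fitted slope is 10,000 people per year and the 2020 prediction continues that trend. Real estimates will not fit perfectly, which is why reporting the fit statistics alongside the prediction is useful.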