
Syllabus: CS 181 - Data Systems

Course Information

Professor Thomas C. Bressoud (Dr. B)
Office Olin 211
Email bressoud@denison.edu
Office Hours MTWF: 2:30 p.m. to 3:30 p.m.
Class Meeting MTWF: 11:30 a.m. to 12:20 p.m.
Final Exam E: Wed., May 9, 9:00 - 11:00 a.m.
Course Page https://notebowl.denison.edu
Teaching Assistants Gavin Thomas (thomas_g8@denison.edu), Luke Wiest (wiest_l1@denison.edu)

Course Description

The Data Systems course provides a broad perspective on the access, structure, storage, and representation of data. It encompasses traditional Database systems, but extends to other structured and unstructured repositories of data and their access/acquisition in a client-server model of Internet computing. Also developed are an understanding of data representations amenable to structured analysis, and the algorithms and techniques for transforming and restructuring data to allow such analysis. The primary programming language used in the course will be Python.

Course Content

We have divided the course content into six units that scaffold the learning goals. The units start from the basis of the CS 11X introductory course; each builds on the last and introduces additional complexity in Data Systems. Since this is just the second offering of the course, units may take more or less class time to treat adequately, so the durations listed for each unit in the table below are approximate.

Unit Name Description and Learning Goals
01 Tabular Data Model The focus of this unit is on single, rectangular table/data frame datasets. We will learn about Python facilities for generating/transforming lists (list comprehensions and lambda functions) and work toward structured data frames supported by the pandas module. Data will be input from URLs and files and organized in rectangular text formats like CSV and variations. The data model implied by so-called “tidy” data arrangements will be covered. This unit will also reinforce foundational concepts from the CS1 class.
Duration: 3 weeks
02 Relational Data Model The single most important form of Data System, grounded in nearly 50 years of development of both the underlying theory of structure and efficient implementations, is the Relational Database system. This unit will explore the data model associated with relational databases and the operations and organization involved in sound design of the set of tables for a relational database. We will also gain skills in querying existing databases and creating new ones using the Structured Query Language (SQL), both through direct commands and through programmatic use.
Duration: 3 weeks
03 Tree Data Model When we interact with data systems on the Internet, many data providers structure and represent their data in a flexible form that is based on a tree data model. Dominant examples include the XML and JSON data formats, but HTML trees will also be discussed. This unit will focus on the operations involved in “traversing” such structures and formats and explore both declarative and programmatic means to process the data and build data frames in our programs to allow for analysis.
Duration: 2.5 weeks
04 Unstructured Data Data acquired over the Internet often has less structure. The original form of the data may be fully unstructured. Alternatively, data “fields” within a structured format may simply be viewed as an unstructured blob of text. In this unit we learn the programmatic tools of string methods and regular expressions, which allow us to process these unstructured sequences of text and build patterns of structure out of them.
Duration: 1.5 weeks
05 APIs Systems providing data (servers) must be built in ways that allow client systems to follow established protocols and conventions for determining and specifying what data is desired. Further, systems may need to identify particular users and/or client applications to distinguish valid requests from unauthorized access. The protocols and conventions for network-based access to data are known as an Application Programming Interface (API). This unit will cover some of the most common types of APIs and associated authentication, and look at the protocols and network communication that underlie these APIs.
Duration: 2 weeks
06 Further Topics In the final unit of the semester, we will choose from a number of advanced topics in Data Systems. Possible choices include applying our understanding of tree models to perform so-called “web scraping”, programming techniques for handling stream-based data sources, understanding and using NoSQL (Not Only SQL) systems, and acquiring and storing data using cloud-based systems like Amazon Web Services (AWS) or Google Cloud.
Duration: 2 weeks
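
To give a flavor of the first four units, here is a minimal sketch that uses only the Python standard library. It is not taken from the course materials: the sample data, the table name results, and the variable names are all hypothetical, and the course itself works with pandas data frames rather than plain lists.

```python
import csv
import io
import json
import re
import sqlite3

# Hypothetical sample data; none of these values come from the course materials.
CSV_TEXT = "name,year,score\nAda,2017,91\nLin,2018,78\nSam,2018,85\n"

# Unit 01: read rectangular text into rows, then filter with a list comprehension
# and sort with a lambda function as the key.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
scores_2018 = [int(r["score"]) for r in rows if r["year"] == "2018"]
by_score = sorted(rows, key=lambda r: int(r["score"]), reverse=True)

# Unit 02: load the same rows into an in-memory relational table and query it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, year INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(r["name"], int(r["year"]), int(r["score"])) for r in rows],
)
avg_2018 = conn.execute(
    "SELECT AVG(score) FROM results WHERE year = ?", (2018,)
).fetchone()[0]
conn.close()

# Unit 03: traverse a small JSON tree and flatten it into rectangular records.
TREE = json.loads(
    '{"course": "CS 181", "units": ['
    '{"id": 1, "topics": ["csv", "pandas"]}, {"id": 2, "topics": ["sql"]}]}'
)
flat = [(u["id"], t) for u in TREE["units"] for t in u["topics"]]

# Unit 04: pull structure out of free text with a regular expression.
TEXT = "Contact ada@example.edu or lin@example.edu for details."
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", TEXT)
```

Each step ends with data in a rectangular or list form, which is the common thread across the units: whatever the source model, the goal is a structure amenable to analysis.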

Integrated Course Aspects

- Visualizations: For each unit, we use Tableau, a cloud- and desktop-based tool that facilitates the design and construction of appealing visualizations from well-structured datasets.
- Real-world Data Sources: The world is filled with messy data, sometimes incomplete and often repurposed and combined from multiple sources. One of the goals of this course is to enable students to adapt to and compensate for the messiness of data and to structure and transform it into a form amenable to analysis and use. Projects in each instructional unit therefore ask students to acquire data from primary sources and adapt it to their needs.

Resources

  1. Textbook

    The only required textbook for this course is the trade book: SQL Queries for Mere Mortals: A Hands-On Guide to Data Manipulation in SQL (3rd Edition) by John L. Viescas and Michael J. Hernandez, 2014, Addison Wesley. ISBN-13: 978-0321992475. Available from Amazon Prime at $25.94.

  2. DataCamp

    Through the DataCamp for the Classroom initiative, the folks at DataCamp are actively working to engage more students in learning R, Python, and statistics. Their mission is in line with our course objectives and so students in the class are able to take DataCamp courses free of charge through the duration of the course.

  3. Delivered materials

    We will also be using chapters from a yet-to-be-released book, Data Technologies and Computational Reasoning, through the gracious cooperation of Deborah Nolan, University of California, Berkeley and Duncan Temple Lang, University of California, Davis. The material in this book is used for an introductory data science course taught at Berkeley and the language employed is R, but we have written complementary iPython notebooks for the chapters of interest to us.

    In addition, there will be numerous PDF-based materials distributed through the term for resources for which we have been granted permission to distribute.

  4. Online resources

    The Data Systems treated in this course cover a very broad spectrum, and there is no current textbook that adequately covers the subjects outlined above. Also, the Data Systems of the Internet are changing and evolving at a rapid rate. Therefore much of our reading and exploration of the topics of this course will involve investigation through tutorials, instructional materials, and other online sites and resources. One goal of this class is to get students comfortable enough with this kind of exploration that, when faced with a need to acquire and structure data from a new data source, they can learn about the system and “come up to speed” in figuring out how to solve the problems before them.

Course Components

  1. Classwork/Quizzes/Participation

    Several times per unit, we will engage in hands-on tutorial and in-class sessions working with iPython notebooks. These will be submitted and checked for due-diligence completion and to evaluate basic correctness. We will have periodic quizzes, both pop and scheduled, to help assess how well students are keeping up with the material. The lowest two Classwork/Quiz submissions will be dropped, but there will be no “negotiation” to drop additional submissions.

    In addition, students are expected to attend class and to be actively engaged, participating in discussions and asking questions.

  2. Exercises

    During each of the units covered, there will be exercise assignments intended to help students master the building blocks needed to achieve the learning goals of the unit. Typically, these will take the form of iPython notebooks, and many will have assert checks in the cells following the exercise cells so that students will have some level of immediate feedback regarding the correctness of their solutions. Exercises are individual and will be graded in detail. Late submissions will lose 10% for each 24-hour period after the deadline.

  3. Projects

    There will be approximately four main projects in the course, in which students will use and combine the knowledge gained from the exercises, the class coverage, and their own exploration from the Internet to synthesize solutions to project problems. Most projects involve the basic steps of acquiring data from one or more Internet sources, cleaning, processing and combining data, and designing and producing a visualization to answer basic questions about the data. Most of these projects will be completed in teams of two students.

  4. Midterms

    There will be five midterms spread throughout the semester, counted as 4.5: the first will count only half and will be predominantly review from CS1. Each subsequent midterm will assess students' mastery of new material, but some cumulative material may also be expected because of the way the units build from one to the next. Note that all midterms count toward the final grade. Midterms may not be “made up” after the fact, but an accommodation to take a midterm early will be arranged in the case of a University-sanctioned excuse. The first (review) midterm is scheduled for a week and a half from the start of class.
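
As an illustration of the assert checks described under Exercises, a notebook exercise cell and the check cell that follows it might look roughly like the sketch below. The function mean_of_evens and its specification are hypothetical and are not drawn from an actual course assignment.

```python
# Exercise cell (hypothetical): the student fills in the function body.
def mean_of_evens(values):
    """Return the mean of the even integers in values, or 0.0 if there are none."""
    evens = [v for v in values if v % 2 == 0]
    return sum(evens) / len(evens) if evens else 0.0


# Check cell: each assert passes silently when the solution is correct
# and raises AssertionError as immediate feedback when it is not.
assert mean_of_evens([1, 2, 3, 4]) == 3.0
assert mean_of_evens([1, 3, 5]) == 0.0
assert mean_of_evens([]) == 0.0
```

Passing these checks shows only that a solution handles the tested cases, which is why exercises are still graded in detail.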

Grading

Category Weight
Homework 25%
Projects 20%
Classwork/Quizzes 6%
Midterms (4.5) 25%
Final Exam 20%
Participation 4%

Policies

Academic Integrity

Proposed and developed by Denison students, passed unanimously by DCGA and Denison’s faculty, the Code of Academic Integrity requires that instructors notify the Associate Provost of cases of academic dishonesty, and it requires that cases be heard by the Academic Integrity Board. Further, the code makes students responsible for promoting a culture of integrity on campus and acting in instances in which integrity is violated.

Academic honesty, the cornerstone of teaching and learning, lays the foundation for lifelong integrity. Academic dishonesty is intellectual theft. It includes, but is not limited to, providing or receiving assistance in a manner not authorized by the instructor in the creation of work to be submitted for evaluation. This standard applies to all work ranging from daily homework assignments to major exams. Students must clearly cite any sources consulted—not only for quoted phrases but also for ideas and information that are not common knowledge. Neither ignorance nor carelessness is an acceptable defense in cases of plagiarism. It is the student’s responsibility to follow the appropriate format for citations. Students should ask their instructors for assistance in determining what sorts of materials and assistance are appropriate for assignments and for guidance in citing such materials clearly.

For further information about the Code of Academic Integrity see http://denison.edu/forms/code-of-academic-integrity.

In this class, you may discuss problems with other students in the class, at the level of drawing pictures and abstract solutions on the whiteboard, but written and typed work must be your own. Under no circumstances is it acceptable for students to exchange files, nor for one student to examine another student’s code on their screen. While it may be easy for students to pass files around in violation of this policy, it is also very easy for me to run programs to measure the degree of similarity between students' submissions, and I do so on a regular basis. Also note that these “plagiarism detectors” are not fooled by changing variable names and moving things around, tactics that some students think will fool the professor and disguise such cheating. See, for example, the MOSS (Measure of Software Similarity) system provided by Stanford University.

Students found responsible for breaches of academic integrity will earn a failing grade for the course and may receive even more severe consequences.

Accommodations

Any student who feels he or she may need an accommodation based on the impact of a disability should contact me privately as soon as possible to discuss his or her specific needs. I rely on the Academic Resource Center (ARC) in 020 Higley to verify the need for reasonable accommodations based on the documentation on file in that office.


© Thomas Bressoud 2018