wiki:cs122b-2017-winter-project3
Last modified 7 months ago Last modified on 02/16/17 15:00:41

CS122B Project 3: reCaptcha, HTTPS, Stored Procedure, XML Parsing

Due: 2/20/2017, Monday, 11:45 pm. Submit on EEE.
Notice that we use 1 day after the official deadline as the submission cut-off time on EEE to allow you to use the 24-hour grace period if you chose so. After that, EEE will no longer accept submissions.

Tasks

  1. (Optional) Register a domain name for Fabflix
  2. Adding reCAPTCHA to Project 2
  3. Adding HTTPS to Project 2
  4. Implementing a Dashboard using Stored Procedure
  5. Importing large XML data files into the Fabflix database

Task 1 (Optional): Register a domain name for Fabflix

If you are tired of using an IP address to visit your AWS instance and want to make it "real," you can register a domain name from a registrar company such as Godaddy. Then use the AWS instance to do the hosting. Now we are open for real business!


Task 2: Adding reCAPTHA to Project 2

Improve the user-authentication process of project 2 by adding a module that verifies real users and blocks bots.

  1. Download and deploy TomcatFormReCaptcha.war.
  2. Get a reCAPTCHA from Google. Use the public IP of your AWS instance as the domain. You will get a secret key and a site key.
  3. Follow the instructions on the page shown in step 2 to modify the file "webapps/TomcatFormReCaptcha/index.html". In particular, use your own "site key" to replace the "data-sitekey" value.
  4. Test the page http://YOUR_IP:8080/TomcatFormReCaptcha to see if reCAPTHA works for the front end (You should get an error saying "Recaptcha WRONG!!!!" if you followed the instructions properly)
  5. Read and understand the file "sources/TomcatFormReCaptcha.java", especially the logic related to verifying "g-recaptcha-response". Also modify the file "MyConstants.java" to use your own "SITE_KEY" and "SECRET_KEY" values.
  6. Recompile the Java files and reload the app using the Tomcat console. Test if the backend of reCAPTHA is working properly.
  7. Apply the same logic to your project 2 by making the necessary changes on the login frontend and backend logic.

Task 3: Adding HTTPS to Project 2

See this reference on how to configure Tomcat to enable HTTPS connections.

Step 1: On your AWS instance, use the following command to create a keystore to be used by Tomcat. Remember what password you use for the keystore. Our example uses "changeit".

sudo keytool -genkey -alias fabflix -keyalg RSA -keystore /etc/tomcat8.5/keystore

Step 2: Uncomment the Connector tag in /etc/tomcat8.5/server.xml, which has port="8443" and defines an SSL HTTP/1.1 connector. Modify it and use the following configuration:

<Connector port="8443" 
 protocol="HTTP/1.1"
 connectionTimeout="20000"
 redirectPort="8443"
 SSLEnabled="true"
 scheme="https"
 secure="true"
 sslProtocol="TLS"
 keystoreFile="/etc/tomcat8.5/keystore"
 keystorePass="changeit" />

Step 3: Restart tomcat:

sudo service tomcat restart

At this point, https://[YOUR_IP]:8443/manager/html should also an entry point to Tomcat. Note that it uses the secure kind of http (https) and a different port, which is 8443 by default. Make sure to open the port 8443 on your AWS instance.

Step 4: Limit a Tomcat application to HTTPS only. Currently http://YOUR_IP:8080/manager/html entry (which uses HTTP) is still open. In order to disable HTTP and enable HTTPS only, change your TomcatFormReCaptcha application's web.xml by adding the following code before the "</web-app>" tag:

<security-constraint>
 <web-resource-collection>
    <web-resource-name>HTTPSOnly</web-resource-name>
    <url-pattern>/*</url-pattern>
 </web-resource-collection>
 <user-data-constraint>
    <transport-guarantee>CONFIDENTIAL</transport-guarantee>
 </user-data-constraint>
</security-constraint>

This setting will force all HTTP requests to this application to get redirected to the corresponding HTTPS URL. Reload the application to see if the redirection works for the URL http://YOUR_IP:8080/TomcatFormReCaptcha/. Read this reference on more details about how to configure Tomcat to always require HTTPS.

Step 5: Make necessary changes to your project 2 to use HTTPS always with HTTP disabled.

Task 4: Implementing a Dashboard using Stored Procedure

Step 1: To store information about Fabflix's employees, create a new table called employees with the following attributes:

email varchar(50) primary key
password varchar(20) not null
fullname varchar(100)

Step 2: Insert an employee with classta@email.edu, classta, and TA CS122B as its email, password, and fullname respectively.

Step 3: Setup an entry point https://YOUR_IP:8443/fabflix/_dashboard that uses HTTPS. It allows employees to login with a valid email and password to access a dashboard.

Step 4: Allow a logged-in employee to do the following operations using this dashboard. (You can reuse part of your code in Project 1.)

  • Inserting a new star into the database. If the star has a single name, add it as his last_name and assign an empty string ("") to first_name.
  • Providing the metadata of the database; in particular, show the name of each table and, for each table, each attribute and its type.

Step 5: Write a stored procedure called add_movie . Its arguments include all the required fields of the movie, a single star and a single genre. (The reason we don't want to pass multiple stars and genres is that the current MySQL implementation doesn't allow stored procedures to take an array argument.) If a record corresponding to the star or genre exists, it is linked to the movie; if not, it is created and then linked to the movie. Insert a movie only if it does not exist. All necessary stars_in_movies and genres_in_movies records are also created. The procedure should output informative status messages to the user as it performs the task. If a movie's record set cannot be correctly made, the procedure needs to output a corresponding message, and no changes to the database are made.

Step 6: In the dashboard implemented in Step 4, add one more feature that allows an employee to add information related to a movie, including a single star and a single genre. The UI should allow the employee to provide the information, and the backend should call the stored procedure add_movie created above.


Task 5: Importing large XML data files into the Fabflix database

The goal of this task is to use a new source of data to increase the content size of the Fabflix movie database.

You can find the raw version of this new data source in this package. The main source of these files is here. Before starting to use these files, it is highly recommended to check the main source and read its documentation to get a better understanding of this dataset.

Particularly, you are required to parse the files mains243.xml and casts124.xml to add new movies, stars, and genres to the Fabflix database (i.e., on top of what you already have in the database). If needed, the stars_in_movies and genres_in_movies tables should also be updated accordingly. Also, for each XML file in this package, there is a DTD file to be used to understand/validate the structure of that XML file.

Please note that in this project it is expected that, if needed, you make reasonable decisions on how to use this new data source. The files are not clean and you may find inconsistencies in them. In such cases, do not be surprised. Make reasonable decisions. Such issues are very common in many real applications. You can use the following high-level description regarding the mapping of the XML tag attributes to the Fabflix database schema that was given in project 1.

As shown in Figure 1, the root tag in the file mains243.xml is <movies>, which consists of several instances of the tag <directorfilms>. Each <directorfilms> contains the information of a director, along with his/her movies, each movie being an instance of the tag <film>. It is clear where to find the values of all the columns of the movies table, except for the banner_url and trailer_url values that should be left NULL. Furthermore, each <film> has a list of assigned categories in a <cats> tag (i.e., each category as a <cat> tag). The value of the <cat> tag corresponds to the genre name in the genres table of the Fabflix database. If a <cat> does not exist in the genres table, a new genre must be added, and the genres_in_movies table must be updated accordingly.

Figure 1.

For updating the stars and stars_in_movies tables, you should use the content of the file casts124.xml, which is shown in a high-level view in Figure 2.

Figure 2.

Similar to the <movies> tag, the content of the <casts> tag, which is the root of the file casts124.xml, is also grouped by the movie directors. You can find the information of the star-movie connections in the instances of the <m> tag. In this tag, the film is identified using the <f> tag, which links to the <fid> tag of the <film> tag in the main file. Tag <m> also gives you the stagename of the actor of this relation in a tag called <a>. The stagename helps you connect this stars_in_movies record to the right star record. The stars to be added to the stars table are parsed out of the file actors63.xml. Except for the photo_url column of the stars table that should be left NULL, it is clear how to use the actors file to populate this table.

Although there are many libraries for parsing XML, there are two main approaches towards this problem: (1) SAX: event-based XML parsing and (2) DOM: using a tree-based object model for parsing XML. In SAX, the library moves through the XML tags and makes calls to event-handler functions that are implemented by the developer. In DOM, the developer is free to move on a tree of objects that is made available by the library, each object corresponding to an entity in the XML tree (e.g., a tag). You are free to choose between either of SAX or DOM approaches to parse the document. For examples, look here for SAX and here for DOM. You can find here, the instructions and files needed to run these examples.

Requirements:

  • XML parsing: You should write a Java program to parse the XML files and populate the database.
  • Performance Tuning: A naive implementation of your program is expected to be slow. Come up with at least two optimization techniques to reduce the running time of your program. Write an itemized, brief report accessible to describe your optimizations, and the corresponding time reductions.

Remarks:

  • In case of inconsistencies between the provided data with respect to the relational schema, make sure your program does not crash. The inconsistent data must be reported to the user (print the element name and node value) and your program should resume processing.
  • Use NULL for missing values when inserting records into the database.
  • For inconsistent values with the schema (like non-integer values for volume or roman numbers ) you can treat them as NULL.
  • For entries with multiple ISBN/publishers/etc, only use one of them.
  • Keep track of duplicate entries like multiple occurrences of an author and insert them only once.
  • Make sure the encoding of your parser is ISO-8859-1. You can ignore any special character that cannot be handled with this encoding.

Deliverables

Please submit all the required files (*.class, *.jar, *.jsp, *.html etc) to run your program as WAR. In addition, include any source files (*.java) under WEB-INF/sources. Read the following page to get information about how to create a WAR file.

  • Your submission "war" file should be named "project3_[groupID].war".
  • You should have the database name set to "moviedb", DB user name "testuser" and password "testpass" in the program and compile again before you submit.
  • In the WEB-INF/sources include a README file, which contains:
    • how to compile your java sources (if any) and how to install the generated class files.
    • the address of you AWS instance (please make sure it's the same one you used to register reCaptha).
    • the username and password of the tomcat manager app on your AWS instance.

If your project does not compile/run, it will get a very low score. Besides, please keep your AWS instance running until the grading is finished, and open your AWS instance ports 8080 and 8443 to the following IP addresses:

  • 128.195.52.56
  • 128.195.52.58

Demonstration and Grading Criteria

For this project, we do not require demonstration. Instead, we will run your submissions on your AWS instances to do the grading. Please make sure you follow the submission guidelines step by step carefully. If we cannot run/deploy your project easily, you may get some points deducted.

Attachments