Sunday, July 26, 2015

Amazon CloudSearch - Digital Asset Management (DAM)

Introduction

In this post, I introduce a simple search application built on Amazon CloudSearch for the Digital Asset Management (DAM) domain. The primary objective of this search application is to demonstrate the search capabilities of Amazon CloudSearch in the context of Digital Asset Management. We will also set up and deploy the search application and look at it from an end-to-end perspective.
The application includes source code examples for essential DAM functionality such as digital asset file upload, metadata extraction, and storing file-related information in a persistence store and a search store. We also share the application source code so that developers can reuse, extend, and customize it further.
The illustration below depicts the system workflow of the DAM search application.

System Flow

[Figure: system workflow of the DAM search application]

Technology stack

The DAM search application is built using the Java programming language. The DAM functionality is exposed as a set of API services and requires only a lightweight servlet container to run. The application also uses fault-tolerant Amazon Web Services: CloudSearch, S3, and DynamoDB. The table below summarizes the technology stack used in the application.

Front end: HTML
Back end: Java 8, AWS SDK
File store: Amazon S3
Search: Amazon CloudSearch
Persistence store: Amazon DynamoDB
AMI: Amazon Linux HVM image (if the application is hosted on Amazon EC2)
Other front-end tools: AngularJS, EvaporateJS
Servlet container: Embedded Jetty

Please note that this blog post is intended for novices and experienced users alike. We expect readers to have Java programming knowledge and AWS experience. A search background is nice to have but not mandatory.
At the end of this tutorial, you should be able to set up and deploy the application using the above-mentioned technology stack and execute essential search functionality such as basic search, faceting, and highlighting. We also encourage you to develop and customize new functionality by extending the source code. The download link for the application source code can be found in the following sections.

Application Setup

In this section, we provide step-by-step instructions to build, deploy, and set up the DAM application.

Pre-Requisites

The following pre-requisites have to be completed before proceeding to the next set of activities. They cover the installation and configuration of dependent software for the DAM application and should be carried out on the server that will host it.
Note 1: The application can be hosted either on-premise or on Amazon EC2. For on-premise hosting, ensure the firewall allows outbound access.
Note 2: We recommend hosting the application on a Linux-based operating system, as some of the dependent tools have limited support for non-Linux operating systems. The following steps apply only to Linux environments.
1. The application is developed in Java. Download and install Java 8. Validate the JDK path and ensure that environment variables such as JAVA_HOME, CLASSPATH, and PATH are set correctly.
2. Gradle is used to build the application. Download the latest Gradle version from the link below.
http://www.gradle.org/downloads
3. The application repository is hosted on Git. Download Git from the link below.
http://git-scm.com/book/en/v2/Getting-Started-Installing-Git
4. Download the application source code from
https://github.com/8KMiles/amazon-cs-dam/archive/master.zip
or clone the repository with the following command (requires a Git installation):
git clone https://github.com/8KMiles/amazon-cs-dam.git
Copy the source code to your working directory.
5. FFmpeg is used for extracting metadata from the digital assets. Download the tool from http://johnvansickle.com/ffmpeg/releases/ffmpeg-release-64bit-static.tar.xz and run the following commands to set up FFmpeg:
unxz ffmpeg-release-64bit-static.tar.xz
tar -xvf ffmpeg-release-64bit-static.tar
cd ffmpeg-2.6.3-64bit-static/
cp ffprobe /usr/local/bin/
(In the above commands, we downloaded FFmpeg, extracted it, and copied the ‘ffprobe’ executable to the user's local binary directory.)
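To give a feel for how metadata extraction with ffprobe works, here is a minimal, illustrative Java sketch (not the application's actual code) that runs the same ffprobe command the application configures later in application.conf and captures its JSON output. The media file path is a placeholder.

import java.io.InputStream;
import java.util.Scanner;

// Illustrative sketch only: run ffprobe against a sample media file and print its JSON output.
// The path /tmp/sample.mp4 is a placeholder; the ffprobe flags mirror the probe-cmd setting
// shown later in application.conf.
public class ProbeExample {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder(
                "/usr/local/bin/ffprobe", "-v", "quiet", "-of", "json",
                "-show_format", "-show_streams", "/tmp/sample.mp4")
                .redirectErrorStream(true)
                .start();
        try (InputStream in = p.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            String json = scanner.hasNext() ? scanner.next() : "";
            System.out.println(json); // contains format, duration, bit_rate, size, streams, ...
        }
        p.waitFor();
    }
}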

AWS Setup and configuration

In this section, we set up and configure AWS services required for the application.
1. We assume you have already set up an Amazon Web Services IAM account. Please ensure the IAM user has the right permissions to access AWS services such as S3, DynamoDB, and CloudSearch.
Note: If you do not have an AWS IAM account with the above-mentioned permissions, you cannot proceed further.
2. The IAM user should have an AWS access key and secret key. On the application hosting server, set up the AWS environment variables for the access key and secret key; it is important that the application runs using these AWS environment variables.
To set up the AWS environment variables, please read the links below.
http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html
http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-roles.html
Alternatively, you can set the AWS environment variables by running the commands below from the Linux console.
export AWS_ACCESS_KEY=Access Key
export AWS_SECRET_KEY=Secret Key
3. Note: This step is applicable only if the application is hosted on Amazon EC2.
If you do not have an AWS access key and secret key, you can instead attach an IAM role to the EC2 instance. A new IAM role can be created and attached to EC2 during instance launch. The IAM role should have access to AWS resources such as S3, DynamoDB, and CloudSearch.
For more information, read the link below.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
4. Amazon S3 is used for storing the digital assets uploaded by the end user.
a. Create an S3 bucket with a project-relevant name (example: amazon-cs-dam). A programmatic alternative is shown in the SDK sketch after this list.
b. The S3 bucket should have list and upload permissions for 'Everyone'.
http://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html
5. DynamoDB is used as the persistence store for application-related data and the meta-information of the digital assets.
a. Create a DynamoDB table with a single hash key. Use a table name relevant to the application (example: amazon-cs-dam).
b. The read and write throughput can be left at the defaults and changed later based on application requirements.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStartedCreateTables.html
6. CloudSearch is used as the search layer of this application.
a. Create a CloudSearch domain with a name relevant to the application (example: amazon-cs-dam).
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/getting-started-create-domain.html
b. Use the following values for the other inputs during CloudSearch domain creation:
Engine Type: CloudSearch (2013 API)
Desired Instance Type: Use default
Desired Replication Count: Use default
Configure Index: Manual Configuration
Index Fields: file_name (text), format (text), duration (text), bitrate (text), size (text), s3_url (text)
Access Policies: Search and Suggester service: Allow all; Document service: Account owner only
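If you prefer to create the S3 bucket, DynamoDB table, and CloudSearch domain from steps 4-6 programmatically rather than through the console, the following sketch with the AWS SDK for Java shows one way to do it. It is illustrative only and not part of the application source; the resource names match the examples above, the region is an assumption, and the clients rely on the default credential chain, which picks up the environment variables (or EC2 instance role) configured in steps 2 and 3.

import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.cloudsearchv2.AmazonCloudSearchClient;
import com.amazonaws.services.cloudsearchv2.model.CreateDomainRequest;
import com.amazonaws.services.cloudsearchv2.model.DefineIndexFieldRequest;
import com.amazonaws.services.cloudsearchv2.model.IndexDocumentsRequest;
import com.amazonaws.services.cloudsearchv2.model.IndexField;
import com.amazonaws.services.cloudsearchv2.model.IndexFieldType;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;
import com.amazonaws.services.s3.AmazonS3Client;

// Illustrative sketch: create the AWS resources used by the DAM application.
// All clients use the default credential chain (environment variables, profiles, or EC2 instance role).
public class AwsResourceSetup {
    public static void main(String[] args) {
        Region region = Region.getRegion(Regions.US_EAST_1); // assumption: adjust to your region

        // Step 4: S3 bucket for uploaded digital assets.
        AmazonS3Client s3 = new AmazonS3Client();
        s3.setRegion(region);
        s3.createBucket("amazon-cs-dam");

        // Step 5: DynamoDB table with a single string hash key named "id".
        AmazonDynamoDBClient dynamo = new AmazonDynamoDBClient();
        dynamo.setRegion(region);
        dynamo.createTable(new CreateTableRequest()
                .withTableName("amazon-cs-dam")
                .withAttributeDefinitions(new AttributeDefinition("id", ScalarAttributeType.S))
                .withKeySchema(new KeySchemaElement("id", KeyType.HASH))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));

        // Step 6: CloudSearch domain (2013 API) with the six text index fields.
        AmazonCloudSearchClient cs = new AmazonCloudSearchClient();
        cs.setRegion(region);
        cs.createDomain(new CreateDomainRequest().withDomainName("amazon-cs-dam"));
        for (String field : new String[] {"file_name", "format", "duration", "bitrate", "size", "s3_url"}) {
            cs.defineIndexField(new DefineIndexFieldRequest()
                    .withDomainName("amazon-cs-dam")
                    .withIndexField(new IndexField()
                            .withIndexFieldName(field)
                            .withIndexFieldType(IndexFieldType.Text)));
        }
        cs.indexDocuments(new IndexDocumentsRequest().withDomainName("amazon-cs-dam")); // apply index changes
    }
}

Defining each field explicitly corresponds to the 'Manual Configuration' option shown in the table above.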

Application build and deployment

1. After completing the pre-requisites, validate that the Java path, Gradle path, and environment variables are set correctly.
2. Copy the downloaded DAM application source code to your working directory.
The DAM application content should have the following directory structure:
[Figure: DAM application directory structure]
a. The source code of the application can be found inside the directory /server/src.
3. Run the following command from the base of your working directory. This Gradle command builds the DAM application and generates the build output.
  gradle :server:installDist 
a. The build output can be found inside the directory /server/build.
b. The /server/build directory should have the following directory structure:
[Figure: build output directory structure]
c. The 'install' directory is the most important one, as the server deployment content is stored here. The install directory contains a sub-directory 'server', under which the following directory structure can be found (example: DAM BASE DIRECTORY/server/build/install/server).
[Figure: install directory structure]
4. The directory DAM BASE DIRECTORY/server/build/install/server is the final output of the build. Under this directory, you will find an executable called server.py.
Run this executable from the directory to start the server.
About Configurations
5. The server deployment content is summarized below.
bin: binaries of the application
lib: libraries required to run the application
ui: user interface directory
application.conf: configuration file that allows end users to input parameters
logback.xml: log file and log path configuration
server.py: application server executable file
Note: We recommend configuring only application.conf and logback.xml. 
6. application.conf – the parameters configured in the application.conf file are explained below.
server {
  port = 8080
  host = "0.0.0.0"
  probe-cmd = "/usr/local/bin/ffprobe -v quiet -of json -show_format -show_streams"
  tmp-store-dir = /tmp
  debug = false
}
port – Application server port. Ensure no other application runs on the same port.
host – Application host name.
probe-cmd – ffmpeg probe command (do not change this; leave the default setting).
tmp-store-dir – Temporary path used while extracting metadata from digital assets.
debug – Debug mode flag.
aws {
  s3 {
    upload-bucket = amazon-cs-dam
    key-prefix = upload
    upload-access-key = ${?S3_UPLOAD_AWS_ACCESS_KEY}
    upload-secret-key = ${?S3_UPLOAD_AWS_SECRET_KEY}
  }
  dynamodb {
    table-name = amazon-cs-dam
    hash-key = id
  }
  cloudsearch {
    upload-endpoint = ${?UPLOAD_ENDPOINT}
    search-endpoint = ${?SEARCH_ENDPOINT}
  }
}
upload-bucket – S3 bucket name where digital assets are stored.
key-prefix – Key prefix of the uploaded object.
upload-access-key – AWS IAM user access key.
upload-secret-key – AWS IAM user secret key.
table-name – DynamoDB table name.
hash-key – Hash key id of the DynamoDB table.
upload-endpoint – CloudSearch domain document upload endpoint.
search-endpoint – CloudSearch domain search endpoint.
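The ${?VARIABLE} substitution syntax above is HOCON, the format read by the Typesafe Config library, which lets values such as the CloudSearch endpoints come from environment variables. As an illustration only (this is an assumption about how such a file can be read, not the application's actual bootstrap code), the settings could be loaded like this:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Illustrative sketch: load application.conf and read a few parameters.
// Assumes the aws block is a sibling of the server block, as shown above,
// and that UPLOAD_ENDPOINT / SEARCH_ENDPOINT are set in the environment.
public class ConfigExample {
    public static void main(String[] args) {
        Config conf = ConfigFactory.parseFile(new java.io.File("application.conf")).resolve();
        int port = conf.getInt("server.port");
        String bucket = conf.getString("aws.s3.upload-bucket");
        String searchEndpoint = conf.getString("aws.cloudsearch.search-endpoint");
        System.out.println("port=" + port + " bucket=" + bucket + " search=" + searchEndpoint);
    }
}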

 About the Application

In this section, we detail the API services with source code explanations. This will help you understand the search functionality and how it is implemented with Amazon CloudSearch. The application primarily supports two functions: upload and search.
Upload Functionality
The upload functionality works as follows:
  1. Store the uploaded media asset in S3.
  2. Extract the metadata of the file using the ‘FFMPEG’ tool.
When the user uploads a digital asset or media file, metadata such as media format, file duration, bitrate, and file size is extracted from it. The extracted data is indexed as a new document in the CloudSearch domain.
Java Class: amazon.cs.dam.api.Api
API Name: /probe
CloudSearch class: amazon.cs.dam.service.impl.CloudSearchService
URL:
Description: Upload Digital Asset
List of Tasks: 1. Extract metadata using ‘ffmpeg’; 2. Store the file in S3; 3. Store the file name, meta information, and S3 information in DynamoDB; 4. Store the file meta information in CloudSearch
Source
CloudSearchService.java – Line number 114
public void index(Map<String, Object> map) {
    // Wrap the metadata map in a CloudSearch "add" batch operation, using the "id" entry as the document id.
    String json = Utils.toJson(Arrays.asList(Utils.newMap("type", "add", "id", map.remove("id"), "fields", map)));
    if (json != null) {
        LOG.debug("CS: {}", json);
        try {
            // Upload the JSON document batch to the CloudSearch domain's document service.
            csd.uploadDocuments(new UploadDocumentsRequest().withDocuments(new StringInputStream(json))
                    .withContentLength((long) json.length()).withContentType(ContentType.Applicationjson));
        } catch (UnsupportedEncodingException e) {
            LOG.debug("", e);
        }
    }
}
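To make the indexing path concrete, here is a hypothetical caller (not part of the application source) that assembles extracted metadata into the map expected by index(); the field names mirror the CloudSearch index fields defined earlier, and all values are illustrative.

import java.util.HashMap;
import java.util.Map;
import amazon.cs.dam.service.impl.CloudSearchService;

// Hypothetical helper showing how extracted metadata can be handed to CloudSearchService.index().
// The "id" entry is removed by index() and used as the CloudSearch document id;
// the remaining entries become document fields.
public class IndexUsageExample {
    static void indexSample(CloudSearchService cloudSearchService) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "asset-0001");
        doc.put("file_name", "holiday-video.mp4");
        doc.put("format", "mp4");
        doc.put("duration", "120.5");
        doc.put("bitrate", "1200000");
        doc.put("size", "18000000");
        doc.put("s3_url", "https://amazon-cs-dam.s3.amazonaws.com/upload/holiday-video.mp4");
        cloudSearchService.index(doc);
    }
}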
 Search Functionality
Java Class: amazon.cs.dam.api.Api
API Name: /search
URL:
Description: Search Digital Asset
List of Tasks: 1. Search assets
Source
CloudSearchService.java – Line number 56
@Override
public Object search(Map<String, String[]> parameterMap) {
    // Start from the application defaults (highlighting, facets, return fields, sort, query, parser).
    SearchRequest req = new SearchRequest();
    req.withHighlight(DEFAULT_HIGHTLIGHTS_JSON);
    req.withFacet(DEFAULT_FACETS_JSON);
    req.withReturn(DEFAULT_RETURN_FIELDS);
    req.withSort(DEFAULT_SORT_ORDER);
    req.withQuery(DEFAULT_QUERY);
    req.withQueryParser(DEFAULT_QUERY_PARSER);
    // Override the defaults with any CloudSearch search parameters supplied in the request.
    for (Entry<String, String[]> entry : parameterMap.entrySet()) {
        String[] values = entry.getValue();
        String value = values[0];
        switch (entry.getKey()) {
        case "cursor":
            req.withCursor(value);
            break;
        case "expr":
            req.withExpr(value);
            break;
        case "facet":
            req.withFacet(value);
            break;
        case "fq":
            req.withFilterQuery(value);
            break;
        case "highlight":
            req.withHighlight(value);
            break;
        case "partial":
            req.withPartial(Boolean.parseBoolean(value));
            break;
        case "q":
            if (value.length() != 0 && !"*".equals(value))
                req.withQuery(value);
            break;
        case "q.options":
            req.withQueryOptions(value);
            break;
        case "q.parser":
            req.withQueryParser(value);
            break;
        case "return":
            req.withReturn(value);
            break;
        case "sort":
            req.withSort(value);
            break;
        case "start":
            req.withStart(Long.parseLong(value));
            break;
        case "size":
            req.withSize(Long.parseLong(value));
            break;
        }
    }
    return this.search.search(req);
}
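For example, a request such as /search?q=holiday&size=10 flows through the method above, with each query-string parameter overriding the corresponding default. The following hypothetical helper (not part of the application source) builds the same parameter map directly; the parameter names follow the CloudSearch 2013 search API handled by the switch statement, and all values are illustrative.

import java.util.HashMap;
import java.util.Map;
import amazon.cs.dam.service.impl.CloudSearchService;

// Hypothetical usage of CloudSearchService.search(); values are illustrative only.
public class SearchUsageExample {
    static Object searchSample(CloudSearchService cloudSearchService) {
        Map<String, String[]> params = new HashMap<>();
        params.put("q", new String[] {"holiday"});                        // free-text query
        params.put("fq", new String[] {"format:'mp4'"});                  // filter query (structured syntax)
        params.put("size", new String[] {"10"});                          // number of hits to return
        params.put("return", new String[] {"file_name,s3_url,duration"}); // fields to return
        return cloudSearchService.search(params);
    }
}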
[caption id="attachment_5112" align="alignnone" width="660"]Search Service Search Service[/caption]


[caption id="attachment_5112" align="alignnone" width="660"]Search Service using Query String Search Service using Query String[/caption]


[caption id="attachment_5124" align="alignnone" width="660"]Search Faceting Search Faceting[/caption]