Executing Long-Running Tasks in Google App Engine – How to Do It?

A question often flashes into the minds of developers, especially those who work on Google Cloud Platform:

What if I am using Google App Engine (Platform as a Service) and have long-running tasks that need to run in the background for hours or maybe even days; is that possible?

Yes, it is possible. The answer is to use “Task Queues”, one of the most valuable features provided by Google App Engine. However, once an application is hosted in production, many kinds of unexpected problems can arise around long-running tasks in task queues, and if they are not addressed the application may behave surprisingly in production. Towards the end, we will quickly discuss the configurations a long-running task needs in order to run in a task queue, depending on the application’s requirements.

Before diving into the discussion, let me briefly explain what Google App Engine and Task Queues are.

Google App Engine:

Google App Engine (commonly referred to as GAE) is a platform for building scalable web applications and mobile backends. It offers features such as automated security scanning for detecting web vulnerabilities, and it supports popular development tools such as Eclipse, IntelliJ, Maven, Git, Jenkins and PyCharm, which makes developing on GAE developer-friendly.

Moreover, features like user authentication using Google Accounts, the NoSQL Datastore, Memcache and Task Queues set GAE apart.

Therefore, it is a popular choice for developing web applications hosted on Google Cloud Platform.

GAE Task Queues:

Sometimes a user takes an action on the web application that triggers work which does not need to finish within the user’s request and can be executed later.

For example, suppose a user wants to upload an “online” file to the web application. The user can simply provide the link to the file and, instead of waiting for the upload to finish and being blocked from doing anything else on the application, can return later to check the progress. Here, the upload work is handed to a task in a task queue, which runs asynchronously outside the user’s request until it completes. Thus, the user can carry on with other work on the web application while the upload continues in the background.

In this way, task queues help us carry out important work in the background.
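
To make this concrete, here is a minimal sketch in Java (the GAE standard environment, which this article’s configuration file suggests) showing how a request handler might enqueue such an upload as a background task. The servlet name, worker URL and parameter name are illustrative assumptions, not part of any official sample:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    // Hypothetical servlet: schedules the upload as a task and returns immediately.
    public class EnqueueUploadServlet extends HttpServlet {
      @Override
      protected void doPost(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        String fileUrl = req.getParameter("fileUrl"); // link supplied by the user

        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
            .withUrl("/tasks/upload")    // assumed worker endpoint that does the real work
            .param("fileUrl", fileUrl)); // task payload

        resp.getWriter().println("Upload scheduled; check back later for progress.");
      }
    }

The task queue service then delivers an HTTP POST to /tasks/upload, where a second handler performs the actual download and storage outside the user’s request.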

Coming back to the topic, let us discuss how to run long-running tasks in task queues, and which configurations prevent tasks from doing so in production.

Let’s go over some important background for achieving this:

Often, an application behaves exactly as expected on the development server, but in production there is a chance that many of its features do not work or do not behave as expected. This is because when we deploy the application to the cloud, we actually use cloud resources and configurations that may differ from the development server. These resources and configurations can be tuned by understanding the different instance types offered by the cloud provider, which are generally well described in its documentation along with their respective costs.

GAE offers two types of instance configurations: Frontend instances and Backend instances. As the names suggest, frontend instances handle the computations carried out at the end-user level, while backend instances perform computations in the background.

By default, when we first deploy our application to GAE without editing appengine-web.xml (the application configuration file), the allocated instance is the most basic frontend instance. Beyond doubt, each instance class comes with its respective price.

As discussed above, scaling is one of Google App Engine’s most promising features, and each instance can be configured with an appropriate scaling type. GAE offers three types of scaling: automatic scaling, basic scaling and manual scaling. The choice of scaling option can considerably affect the cost of running the application.

Since the default instance is a frontend instance, the default scaling is automatic scaling, as per the documentation. This is where the concern about running long tasks in the task queue arises. How?

As per the official documentation, under automatic scaling a task in a task queue can run for at most 10 minutes. So if your tasks finish within the 10-minute deadline, this type of scaling does not cause many problems. But we are talking about long-running tasks that exceed the 10-minute deadline. So, how do we make it work?

Since automatic scaling will not help, we can switch to basic or manual scaling. But a frontend instance allows only automatic scaling, so it is also necessary to use a backend instance instead of a frontend instance.

Now, when using a backend instance, configure the app to use basic or manual scaling. For more information on scaling types, please see the official GAE documentation. (Link)
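
As a rough sketch, a dedicated backend service with basic scaling could be described in its appengine-web.xml along these lines. The service name, instance class and limits here are illustrative assumptions; consult the documentation for the exact options your runtime supports:

    <?xml version="1.0" encoding="utf-8"?>
    <appengine-web-app xmlns="http://appspot.com/ns/1">
      <!-- Assumed service name dedicated to long-running work -->
      <service>worker</service>
      <!-- B* instance classes are the backend classes that allow basic/manual scaling -->
      <instance-class>B2</instance-class>
      <basic-scaling>
        <max-instances>2</max-instances>
        <!-- Shut idle instances down after 10 minutes to control cost -->
        <idle-timeout>10m</idle-timeout>
      </basic-scaling>
    </appengine-web-app>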

With a backend instance and manual or basic scaling, tasks are no longer restricted to the 10-minute deadline; instead, a task can run in the background until it completes its execution. However, basic scaling is preferable for controlling costs when the application needs no complex initialization and does not rely on the state of an instance’s memory over time.
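
To make sure the long-running tasks actually execute on that backend service rather than on the default frontend, the queue can be pointed at it with the <target> element in queue.xml. A minimal sketch, reusing the assumed service name from above (the queue name is also an assumption):

    <?xml version="1.0" encoding="utf-8"?>
    <queue-entries>
      <queue>
        <name>long-running</name>
        <rate>1/s</rate>
        <!-- Route this queue's tasks to the basic-scaled "worker" service -->
        <target>worker</target>
      </queue>
    </queue-entries>

With this routing in place, the enqueuing handler would fetch the queue with QueueFactory.getQueue("long-running") instead of the default queue, and its tasks would run on the worker service, bound by that service’s scaling type rather than by the frontend’s 10-minute deadline.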

Final Words:

While using the different features of Google App Engine, keep in mind that the application behaves differently on the development server and on the production server. Thus, before making a release, every feature of the application should be well tested on the production server.

Data Lake – Why should we use one?

Over the last few years, we have observed more growth in data than we have ever seen before. Many organizations see an opportunity in this big data and develop different strategies to monetize it. But the major challenge is: where do we store all the data?

We have data warehouses that store data according to the prescribed standards of the organization. That means incoming data may be held while various cleaning and smoothing operations are performed before it is stored in the data warehouse. This raises the concern of what to do with data that will not be needed frequently, yet still consumes processing resources.

This is where the “Data Lake” can be introduced.

A Data Lake is a gigantic data repository where data is stored in its native form. It acts as a centralized repository where data coming from different sources is stored raw, without any cleaning or transformation, thereby preserving the data in its true form.

So why should one opt for a Data Lake?

In recent years, massive amounts of data have been generated, and this explosion of data needs to be addressed. Data Lakes are often compared with Data Warehouses, but a Data Warehouse consists of different components and stores data to standards prescribed by its data transformation processes. The data lake can be thought of as a system that sits before the data warehouse.

The term “Data Lake” was first coined by James Dixon, CTO of Pentaho, in 2010 to contrast it with the “Data Mart” or “Data Warehouse”, a smaller repository of refined data extracted from the raw data.

He explained: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”

Indeed, the Data Lake is not a replacement for the Data Warehouse; in fact, if designed right, it can complement your existing Data Warehouse and the two can work effectively together. The best part of this integration is that the lake can store data of all formats (structured, semi-structured and unstructured) in one place.

(Image Source: www.solutionsreview.com)

Data Science – Understanding the Concept and Why It Is Important

Over the last decade, there has been massive growth in both the data generated and the data retained. This data is retained by companies as well as by you and me, isn’t it? Sometimes we call this “Big Data”.

Nowadays, the term “Data Science” is gaining wide recognition. But what does a data scientist do? Data scientists are the people who make sense of all the big data and determine what can be done with it in order to increase productivity.

Let’s understand with an example:

Consider visiting a candy shop. A person generally picks only the candies he likes; in contrast, data scientists are the people who will get all the flavors of the candies and analyze them, because they really need to know what each one tastes like. In short, the title “Data Scientist” encompasses different flavors of work. In my view, that is the major difference between a “Data Scientist”, a “Statistician”, an “Analyst” and an “Engineer”: a data scientist does a little of each of the tasks done by a statistician, an analyst and an engineer.

To be more specific, a data scientist is one who does the following primary tasks:

  1. Data Cleaning
  2. Data Analysis
  3. Statistics
  4. Engineering

Let’s have a look at each of the tasks in brief:

  • Data Cleaning:

Data coming from different sources may contain a lot of noise, may be unformatted, and may not be useful for generating valuable insights. This task ensures that all the data is well formatted and conforms to a set of rules and standards.
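
As a toy illustration in Java (with made-up records, not a real pipeline), cleaning might be as simple as dropping malformed rows and normalizing the rest:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;
    import java.util.stream.Collectors;

    public class DataCleaning {
      // Keep only well-formed records; normalize whitespace, casing and duplicates.
      static List<String> clean(List<String> rawRecords) {
        return rawRecords.stream()
            .filter(r -> r != null && !r.trim().isEmpty()) // drop missing/empty rows
            .map(String::trim)                             // strip stray whitespace
            .map(r -> r.toLowerCase(Locale.ROOT))          // one canonical casing
            .distinct()                                    // remove duplicates
            .collect(Collectors.toList());
      }

      public static void main(String[] args) {
        // Prints [apple, banana]
        System.out.println(clean(Arrays.asList(" Apple ", "apple", "", "Banana")));
      }
    }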

  • Data Analysis:

In this task, many plots of the data are made in order to understand its patterns. Through this process, theories about the data’s behavior are crafted in a way that is easy to communicate and easy to act on.

  • Statistics:

A data scientist develops different models by understanding data patterns through data analysis, and builds strategies based on the resulting statistics. The most challenging aspect of this task is that the models or statistics cannot act as a permanent solution to the defined problem. Therefore, a lot of time is dedicated to evaluating and revising existing models, as well as going back to the data to bring out new features that help build better models.

  • Engineering:

The tasks discussed above are just the tip of the iceberg, because even state-of-the-art data models for different applications do not do anyone much good if the insights are not delivered to customers or users, and delivered consistently. This means building a data product of sorts that can be used by people who are not data scientists. It can take many forms: chart visualizations, metrics on a dashboard, or an application.

Having looked briefly at all these tasks, it also becomes clear that the long-term life cycle of a data science project may involve going back and re-analyzing the data models whenever a new source of data arrives and needs to be incorporated.

Analyzing these traits and tasks of a data scientist, it can be concluded beyond doubt how important data science and data scientists can be to the growth of any organization in an era of intense competition and the constant need to improve an organization’s services.

(Image Courtesy: www.georgianpartners.com)

Internet of Medical Things (IoMT) – A Digital Revolution

Indeed, the Internet of Things (IoT) is constantly contributing to solving our day-to-day problems. Products such as the pizza button and IoT refrigerators may seem futuristic, but they are powering a great amount of technology around us.

When the first pacemaker was implanted in 1958, it set off a wave of medical engineering innovation that has saved lives and reduced disability for millions of people.

Now, the emergence of the Internet of Things in the field of healthcare has marked an extraordinary revolution: the Internet of Medical Things (IoMT).

The Internet of Medical Things promises better, less invasive control of chronic conditions by delivering real-time data on a person’s insulin levels, heart rate, blood pressure, treatment compliance and other factors.

Let’s understand with an example:

Consider a patient equipped with IoMT devices who receives consistent care on any day, in any place, without always needing a face-to-face office visit for therapies and medication. As the devices generate huge amounts of patient health data, the strengths and deficits of the patient become instantly available and clearly understandable. Deeper insights and data tracking can lift the curtains of understanding. Interesting!

Additionally, it may include features that enable patients to share their data and get new insights, receive support from a team of health advisors, and enjoy many other benefits in exchange for their data, keeping them coming back.

Although IoMT has a wealth of benefits, medical device security is still a topic of deep concern. Since unsecured medical devices may be a major threat to our health, this technology will only be set to explode once these concerns are resolved and patients’ trust is gained.


(Image Source: www.pixabay.com)

Business Intelligence – Just Big Data or “Good” Data?

Suppose you are running a shop on a main road and, fortunately, there is high growth in customers at your shop, say at least 50 customers per minute, and everyone is willing to pay in cash. Your cash box contains separate sections for 100-rupee, 500-rupee and 1000-rupee notes. Answer the following simple question:

Will you make your customers wait so that you can arrange your cash box before they pay, or will you collect all the money and arrange it at the end of the day?

A passionate business will obviously favor the second choice, won’t it?

Nowadays, the business scenario is actually much like the scheme above.

Think of Business Intelligence, and the first thing that comes to mind is graphs and charts that reveal valuable data about customers, profits and operations!

Nowadays, as businesses expand, merely getting insights and acting on them by trusting these “raw” analytics may not be effective enough to stay in the competition.

What if we could simplify our data more?

As a business grows, its data inflow also expands. Storage, processing and development costs scale up as well, thereby increasing the need for and use of non-relational databases.

But collecting, storing and generating analytics from the data is not all that matters; nowadays, the term “Good Data” is equally important for businesses.

I would define “Good Data” as information that helps a business take decisions that ultimately lead to profits and a higher return on investment.

Businesses may have data warehouses storing “Big Data”, but instead of generating insights from this “Big Data” alone, if further refinement is applied to produce “Good Data” (as defined above), the resulting insights will be more reliable and far more valuable. This has already been proven by certain organizations that worked on this concern and are successfully leading the competitive market.

Therefore, it is not just Big Data, but Good Data in the real sense, that makes business intelligence systems more brilliant and perceptive.

(Image Source: www.datapine.com)