Sunday, October 27, 2019

Applied Data Science - Capstone project




Using location-based data and clustering to build a recommendation system


          I.     Introduction: Business Problem

Ho Chi Minh City is the largest city in Vietnam, with a population of nearly ten million and a large number of non-residents such as business travelers and tourists. Statistics show that more than 6 million international tourists visited Ho Chi Minh City in the first nine months of 2019. This is a remarkable number for any related business, and it is the reason a group of young investors would like to find a good location to open a restaurant or coffee shop in this crowded and dynamic city. The investors would like to use data analysis to advise them on where a good location for their business would be.
From this standpoint, there are several possible approaches, such as identifying the places that attract the most people in the city or locating the business centres. One approach is to analyse available location-based data and make a recommendation from it.

        II.     Description of Data

To solve this problem, the location data comes from a CSV file that defines the latitude, longitude and other information for all the cities in Vietnam as well as their neighborhoods. Below is a sample of the data for a city and its neighborhoods:

Source file: https://raw.githubusercontent.com/dodtoan/Coursera_Capstone/master/vn.csv
This raw data needs to be cleaned before it can be used. The cleaned data then serves as the "source" data for exploring venues in every neighborhood through the Foursquare API. Some unnecessary fields should be removed, because Foursquare needs only the latitude and longitude of each location. Since the analysis focuses on Ho Chi Minh City, the information for other cities is removed too. After cleansing, the data looks like this:
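The cleaning step described above can be sketched roughly as follows. This is only a minimal illustration: the column names ("city", "admin", "lat", "lng", "population"), the output column names and every value in the sample frame are assumptions made for the example, not the actual contents of vn.csv, and clean_locations is a hypothetical helper.

```python
import pandas as pd

# Made-up stand-in for the raw CSV. Column names and all values are
# assumptions for illustration only, not the real contents of vn.csv.
raw = pd.DataFrame({
    "city": ["Neighborhood A", "Neighborhood B", "Elsewhere C"],
    "admin": ["Ho Chi Minh", "Ho Chi Minh", "Other Province"],
    "lat": [10.78, 10.85, 21.03],
    "lng": [106.70, 106.63, 105.85],
    "population": [1000, 2000, 3000],
})


def clean_locations(raw, region):
    """Keep the rows of one region and only the four columns needed later."""
    subset = raw.loc[raw["admin"] == region, ["admin", "city", "lat", "lng"]]
    cleaned = subset.rename(columns={
        "admin": "City",
        "city": "Neighborhood",
        "lat": "Latitude",
        "lng": "Longitude",
    })
    # Rows without coordinates cannot be sent to a location API, so drop them.
    return cleaned.dropna(subset=["Latitude", "Longitude"]).reset_index(drop=True)


hcm = clean_locations(raw, "Ho Chi Minh")
print(hcm)
```

With the real file, `raw` would instead be loaded with `pd.read_csv` from the source URL, and the filter value would have to match however the file spells the region.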

After the venue data is retrieved from Foursquare, a clustering algorithm is applied to group the neighborhoods into several clusters with similar properties. From that view, a good location to start a cafe or restaurant business can be suggested.
To make a suggestion, the following properties of the Foursquare data are analysed to find patterns and relations between the venues:
1.    Name of venue;
2.    Categories;
3.    Latitude;
4.    Longitude;

      III.     Methodology

As stated in the business problem, the expected outcome is a recommendation of a suitable location to open a restaurant. This kind of question is a good use case for unsupervised machine learning, more precisely the K-means clustering algorithm, combined with the Foursquare API for venue data and the folium library to visualize the result.
Let us take a detailed look at the data. The location source data for Ho Chi Minh City and its neighborhoods has shape (19, 4), which means there are 19 neighborhoods in the investigated area.

With the folium library, the data set can be shown on a map as below:

From this point, the next step is to use Foursquare to explore the venues in the neighborhoods. Because of the limits of the subscription, a maximum of 100 venues is returned in the results, and the search radius was set to 700 m. This gives the result below:
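As a rough sketch of how such a query could be parameterized, the snippet below only builds the request URL and does not call the service. The report does not name the endpoint, so the legacy Foursquare v2 "venues/explore" URL, the version date and the placeholder credentials are all assumptions. Only the 700 m radius and the limit of 100 come from the text, and build_explore_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare v2 "explore" API.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=700, limit=100):
    """Build the query URL for one neighborhood (no network call is made)."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # assumed API version date
        "ll": f"{lat},{lng}",            # latitude,longitude of the neighborhood
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues returned
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


url = build_explore_url(10.78, 106.70, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The URL would then be fetched once per neighborhood, and the name, category, latitude and longitude of each returned venue collected into one table.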

The result showed that 98 venue categories were found. To analyse each neighborhood and how it relates to its venues, the above data needs to be converted into a uniform numeric form. After applying the pandas get_dummies() function and merging the result, the new dataset looks like:
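The encoding step above can be illustrated on a tiny made-up venue table. The column names "Neighborhood" and "Venue Category" and the rows are assumptions for the example. Averaging the indicator columns per neighborhood is one common way to do the "merging", and it is assumed here rather than confirmed by the report.

```python
import pandas as pd

# Made-up venue list: one row per venue found near a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Hotel",
                       "Vietnamese Restaurant", "Coffee Shop"],
})

# One indicator column per category (1.0 if the venue is in that category).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the indicators gives the share of each category per neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` then describes one neighborhood by the relative frequency of every venue category, which is the kind of numeric table a clustering algorithm can work with.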

This dataset is still complicated to analyse because there are 98 venue categories, most of which may not be relevant to the features of interest, so it is transformed into a new shape. A good idea is to reduce the result to the top 5 most common venues per neighborhood. After transforming the dataset, the new result is:
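One way to derive the top 5 venues per neighborhood from a frequency table is sketched below. The sample frequencies, the column labels and the helper name top_venues are illustrative assumptions, not values from the report.

```python
import pandas as pd

# Made-up category frequencies per neighborhood (each row sums to 1).
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Coffee Shop": [0.40, 0.12],
    "Hotel": [0.10, 0.30],
    "Vietnamese Restaurant": [0.30, 0.20],
    "Bar": [0.05, 0.25],
    "Spa": [0.15, 0.05],
    "Market": [0.00, 0.08],
})


def top_venues(grouped, n=5):
    """Return one row per neighborhood with its n most frequent categories."""
    freq = grouped.set_index("Neighborhood")
    n = min(n, freq.shape[1])  # cannot rank more categories than exist
    rows = []
    for name, row in freq.iterrows():
        ranked = row.sort_values(ascending=False, kind="stable").index[:n]
        rows.append([name, *ranked])
    columns = ["Neighborhood"] + [f"Most Common Venue {i + 1}" for i in range(n)]
    return pd.DataFrame(rows, columns=columns)


top = top_venues(grouped)
print(top)
```

The ranked table is easier to read than 98 frequency columns, although the clustering itself can still be run on the full frequency table.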

The dataset now has only 18 neighborhoods instead of the 19 at the beginning. This missing neighborhood is discussed later, but the new dataset is good enough to analyse.
It is now time to apply the K-means clustering algorithm. Before running K-means on the dataset, it is necessary to find the best K. Using the Elbow method, the result showed that K would be 3 or 4; K = 3 is used here.
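The Elbow method mentioned above can be sketched with scikit-learn as follows. The feature matrix here is synthetic (18 rows drawn around three made-up centres), so the curve is only an illustration of the technique, not a reproduction of the report's result.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the neighborhood features: 18 rows, 4 columns,
# drawn around three made-up centres so that an elbow is visible.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(6, 4)) for c in (0.0, 1.0, 2.0)])

# Inertia = sum of squared distances of samples to their closest centre.
ks = list(range(1, 8))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 3))
```

The "elbow" is the K after which the inertia stops dropping sharply. On real data the bend is often less distinct, which is consistent with the report finding both 3 and 4 plausible.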

     IV.     Results

Applying the K-means algorithm with K=3 and merging the output with the original data gives the result shown in the table below, with 3 clusters:
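A minimal sketch of this step, on made-up data, is shown below. The neighborhood names, coordinates and frequencies are invented for illustration. The left merge is a design choice assumed here: it keeps a neighborhood visible even when it has no row in the feature table, which is one possible way a neighborhood could drop out of the result (the report does not confirm the cause).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up location table with seven neighborhoods.
locations = pd.DataFrame({
    "Neighborhood": list("ABCDEFG"),
    "Latitude": [10.70 + 0.02 * i for i in range(7)],
    "Longitude": [106.60 + 0.02 * i for i in range(7)],
})

# Made-up category frequencies; "G" is absent (e.g. no venues were returned).
grouped = pd.DataFrame({
    "Neighborhood": list("ABCDEF"),
    "Coffee Shop": [0.8, 0.7, 0.1, 0.2, 0.1, 0.1],
    "Vietnamese Restaurant": [0.1, 0.2, 0.8, 0.7, 0.1, 0.2],
    "Hotel": [0.1, 0.1, 0.1, 0.1, 0.8, 0.7],
})

# Cluster on the numeric features only, with K = 3 as chosen in the text.
features = grouped.drop(columns="Neighborhood")
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
labels = grouped[["Neighborhood"]].assign(Cluster=kmeans.labels_)

# Left merge: neighborhoods without features keep a row with a missing label.
merged = locations.merge(labels, on="Neighborhood", how="left")
print(merged)
```

Rows with a missing cluster label can then be inspected separately instead of disappearing silently from the final table.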

Using the folium library to visualize the result on the map, the clusters are represented as follows:

Exploring each cluster in more detail, the data suggests the following recommendations:
(i)             to open a restaurant, a good location is in Cluster 0, which contains the neighborhoods below:

(ii)           to open a coffee shop, a good location is in Cluster 2, which contains the neighborhoods below:

       V.     Discussion

Even though the algorithm generated a recommendation, there are several points that need to be considered.
Firstly, from the data source point of view, the data was not rich enough to analyse. This limitation stems from the city data. To overcome it, a more detailed data source should be collected, such as neighborhood locations at the ward level instead of the district level used currently.
Secondly, the business proposal used only a simple feature for analysis, namely venue categories. It is suggested that more features be applied in a future version of the solution, such as venue price, venue likes and venue rating.
These limitations are visible in the report: one neighborhood is missing from the final result, and the Elbow method did not show a clear optimal K.

     VI.     Conclusion

Machine learning can resolve many business problems nowadays, but this study shows that the data for analysis must be detailed enough and that the analyst must pay close attention to exploring the data. Most of the algorithms are already implemented in libraries, which saves a lot of effort in a data science project. In this example project, although several aspects need to be improved, the result was presented to the audience in a highly visual way.