Applied Data Science - Capstone Project
Using location-based data and clustering to build a recommendation system
I. Introduction: Business Problem
Ho Chi Minh City is the largest city in Vietnam, with a population of
nearly ten million and a large number of non-residents such as business
travelers and tourists. Statistics show that more than 6 million
international tourists visited Ho Chi Minh City in the first nine months of
2019. That is an attractive number for any related business, and it is the
reason a group of young investors would like to find a good location to open
a restaurant or coffee shop in this crowded and dynamic city. The investors
would like to use data analysis to advise them on where to open their
business.
From this standpoint, there are several possible approaches, such as
identifying the places that attract the most people in the city or locating
the business centres. One approach is to analyse available location-based
data and make a recommendation from it.
II. Description of Data
To solve this problem, the location data comes from a CSV file that defines
the latitude, longitude and other information for all the cities in Vietnam
as well as their neighborhoods. This is a sample of the city and
neighborhood data:
Source file:
https://raw.githubusercontent.com/dodtoan/Coursera_Capstone/master/vn.csv
This is raw data and needs to be cleaned before use. The cleaned data can
then serve as the "source" data for exploring venues in every neighborhood
with the Foursquare API. Some unnecessary fields should be removed because
Foursquare needs only the latitude and longitude of each location, and
because the analysis focuses on Ho Chi Minh City alone, the rows for other
cities are removed too. After cleansing, the data looks like this:
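The cleaning step described above could be sketched with pandas as follows.
This is only an illustration: the column names (`city`, `admin_name`, `lat`,
`lng`, `country`) and the sample rows are assumptions, not taken from vn.csv,
so the real header should be checked before reusing it.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumed, not read from vn.csv.
raw = pd.DataFrame({
    "city": ["Neighborhood A", "Neighborhood B", "Some other place"],
    "admin_name": ["Ho Chi Minh", "Ho Chi Minh", "Ha Noi"],
    "lat": [10.78, 10.85, 21.03],
    "lng": [106.70, 106.75, 105.85],
    "country": ["Vietnam", "Vietnam", "Vietnam"],
})


def clean_locations(df, region="Ho Chi Minh"):
    """Keep only the rows of one region and the columns needed for venue lookup."""
    subset = df[df["admin_name"] == region]
    subset = subset[["admin_name", "city", "lat", "lng"]]
    subset = subset.rename(columns={
        "admin_name": "City",
        "city": "Neighborhood",
        "lat": "Latitude",
        "lng": "Longitude",
    })
    # Rows without coordinates cannot be sent to a location-based search.
    return subset.dropna(subset=["Latitude", "Longitude"]).reset_index(drop=True)


hcm = clean_locations(raw)
```

With the real file, the same function would be applied to the frame returned
by `pd.read_csv` on the source URL, after adjusting the assumed column names.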
After the venue data has been retrieved from Foursquare, a clustering
algorithm is applied to group the neighborhoods into several clusters with
similar properties. From that view, good locations to start a cafe or
restaurant business can be suggested.
To make a suggestion, the following properties of the Foursquare data are
used to find patterns and relations between the venues:
1. Name of venue;
2. Categories;
3. Latitude;
4. Longitude.
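As a rough sketch, the four properties listed above could be pulled out of a
venue-search response like this. The nesting (response, groups, items, venue)
follows the legacy Foursquare v2 "explore" layout as an assumption, and the
sample payload is made up for illustration.

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, category, lat, lng) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories", [])
            # Use the first listed category; None if the venue has no category.
            category = categories[0]["name"] if categories else None
            rows.append((
                venue["name"],
                category,
                venue["location"]["lat"],
                venue["location"]["lng"],
            ))
    return rows


# Hypothetical payload with the assumed structure.
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Sample Cafe",
                           "categories": [{"name": "Coffee Shop"}],
                           "location": {"lat": 10.78, "lng": 106.70}}},
                {"venue": {"name": "Unlabeled Place",
                           "categories": [],
                           "location": {"lat": 10.79, "lng": 106.71}}},
            ]
        }]
    }
}
venues = extract_venues(sample)
```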
III. Methodology
As stated in the business problem, the expected outcome is a recommendation
of a suitable location to open a restaurant. This kind of question is a good
use case for unsupervised machine learning, more precisely the K-means
clustering algorithm, with the venue data coming from the Foursquare API and
the folium library used to visualize the result.
Let us take a detailed look at the data. The location data for Ho Chi Minh
City and its neighborhoods has shape (19, 4), which means there are 19
neighborhoods in the investigated area. With the folium library, the data
set can be represented on the map as below:
From this point, the next step is to use Foursquare to explore the venues in
the neighborhoods. Because of the limits of the subscription, at most 100
venues are returned per query, and the search radius was set to 700 m. This
gives the result below:
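A query with these settings could be assembled as in the sketch below. It
only builds the request URL and sends nothing. The endpoint is the legacy
Foursquare v2 "explore" URL, which is an assumption about what the project
used and may no longer be available; the credentials and the version date are
placeholders.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; check the current Foursquare documentation before use.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=700, limit=100):
    """Build an explore request for venues around one coordinate pair."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (placeholder value)
        "ll": f"{lat},{lng}",
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues returned
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


url = build_explore_url(10.78, 106.70, "YOUR_ID", "YOUR_SECRET")
```

One such URL would be built per neighborhood, using the latitude and
longitude from the cleaned location data.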
The result showed that 98 venue categories were found. To analyze each
neighborhood and how it relates to its venues, the data above needs to be
converted into a numeric form. After applying the get_dummies() function
from the pandas library in Python and merging the result, the new dataset
looks like:
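A minimal sketch of this encoding step is shown below on made-up data.
Averaging the one-hot columns per neighborhood is an assumed reading of
"merging the result"; it yields the share of each venue category in each
neighborhood.

```python
import pandas as pd

# Made-up venue list: one row per venue found in a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Hotel", "Hotel", "Bar"],
})

# One column per category, 1.0 where the venue belongs to it, 0.0 otherwise.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicator columns = frequency of each category per neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
```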
This dataset is still complicated to analyze because there are 98 venue
categories, most of which may not be relevant to the analysis, so it is
transformed into a new shape. A good idea is to reduce the result to the top
5 most common venues per neighborhood. After transforming the dataset, the
new result is:
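The reduction to the most common venues could look like the sketch below. The
input is a made-up frequency table with one row per neighborhood, and the
helper name and output column names are illustrative only.

```python
import pandas as pd


def top_venues(freq, n=5):
    """Return the n most frequent venue categories for each neighborhood."""
    categories = freq.columns[1:]  # every column except "Neighborhood"
    rows = []
    for _, row in freq.iterrows():
        ranked = row[categories].astype(float).sort_values(ascending=False)
        rows.append([row["Neighborhood"], *ranked.index[:n]])
    columns = ["Neighborhood"] + [f"Most Common Venue {i + 1}" for i in range(n)]
    return pd.DataFrame(rows, columns=columns)


# Made-up frequencies; n=2 keeps the toy example small.
freq = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Bar": [0.1, 0.5],
    "Cafe": [0.6, 0.2],
    "Hotel": [0.3, 0.3],
})
top = top_venues(freq, n=2)
```

Ties between categories are broken arbitrarily here, which is worth keeping in
mind when many categories appear only once in a neighborhood.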
Note that the dataset now has only 18 neighborhoods instead of the 19 at the
beginning. This missing neighborhood will be discussed later, but the new
dataset is good enough to analyse.
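A quick way to see which neighborhood dropped out is to compare the names
before and after grouping, as sketched below with hypothetical names. One
possible explanation, not confirmed in this report, is that the venue search
returned nothing for that location, so it has no rows to group.

```python
def missing_neighborhoods(source_names, grouped_names):
    """Names present in the source list but absent after grouping the venues."""
    return sorted(set(source_names) - set(grouped_names))


# Hypothetical names standing in for the real neighborhood lists.
source_names = ["A", "B", "C", "D"]
grouped_names = ["A", "B", "D"]
missing = missing_neighborhoods(source_names, grouped_names)
```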
It is time to apply the K-means clustering algorithm. Before running K-means
on the dataset, it is necessary to find the best K. Using the Elbow method,
the result showed that K would be 3 or 4; K = 3 is used here.
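The Elbow method can be sketched with scikit-learn as below: K-means is fitted
for a range of K values and the within-cluster sum of squares (`inertia_`) is
recorded, and the "elbow" is the K after which the curve flattens. The
features here are synthetic stand-ins, not the project's venue frequencies.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(features, k_values, seed=0):
    """Fit K-means for each K and return the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        inertias.append(model.inertia_)
    return inertias


# Synthetic data: 18 points in three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.3 * rng.standard_normal((6, 2)) for c in centers])

inertias = elbow_inertias(features, range(1, 7))
```

Plotting `inertias` against K gives the elbow curve. On real data with few
rows and many sparse columns the bend is often much less distinct than on this
synthetic example.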
IV. Results
Applying the K-means algorithm with K = 3 and merging the labels with the
original data gives the table below, with 3 clusters:
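This step could be sketched as follows, on made-up frequencies and
coordinates. A left merge keeps every original neighborhood, so one without
venue data shows up with an empty cluster label instead of disappearing.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venue frequencies per neighborhood.
freq = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "Cafe": [0.9, 0.8, 0.1, 0.0, 0.1, 0.0],
    "Restaurant": [0.1, 0.1, 0.8, 0.9, 0.1, 0.0],
    "Hotel": [0.0, 0.1, 0.1, 0.1, 0.8, 1.0],
})
# Made-up location table; "G" has no venue data.
locations = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F", "G"],
    "Latitude": [10.76, 10.77, 10.78, 10.79, 10.80, 10.81, 10.82],
    "Longitude": [106.66, 106.67, 106.68, 106.69, 106.70, 106.71, 106.72],
})

# Cluster on the numeric columns only; label numbers themselves are arbitrary.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
freq["Cluster"] = model.fit_predict(freq.drop(columns="Neighborhood"))

merged = locations.merge(freq[["Neighborhood", "Cluster"]],
                         on="Neighborhood", how="left")
```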
Using the folium library to visualize the result on the map, the clusters
are represented as:
Exploring each cluster in more detail, the data suggests the following
recommendations:
(i) to open a restaurant, a good location is in Cluster 0, which contains
the neighborhoods below:
(ii) to open a coffee shop, a good location is in Cluster 2, which contains
the neighborhoods below:
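The report does not spell out the rule used to choose these clusters. One
simple way to inspect them, shown here on made-up data, is to find which
category most often ranks first among the neighborhoods of each cluster. The
helper name and the column names are illustrative.

```python
import pandas as pd


def dominant_venue_by_cluster(df):
    """For each cluster, return the category that is most often ranked first."""
    dominant = (
        df.groupby("Cluster")["Most Common Venue 1"]
          .agg(lambda s: s.value_counts().idxmax())
    )
    return dominant.to_dict()


# Made-up cluster assignments and top venues.
clustered = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Cluster": [0, 0, 0, 1, 1],
    "Most Common Venue 1": ["Hotel", "Hotel", "Bar", "Cafe", "Cafe"],
})
summary = dominant_venue_by_cluster(clustered)
```

Whether a cluster dominated by a category is a good or a bad place for one
more venue of that kind is a business judgment that this summary alone does
not settle.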
V. Discussion
Even though the algorithm generated a recommendation, there are several
points that need to be considered.
Firstly, the data source was not rich enough for the analysis. This
limitation comes from the city data itself. To overcome it, a more detailed
data source could be collected, such as neighborhood locations at the ward
level instead of the district level used at present.
Secondly, the analysis used only a simple feature, the venue category.
Future versions of the solution could use more features, such as venue
price, venue likes and venue rating.
These limitations are visible in the report: one neighborhood is missing
from the final result, and the Elbow method did not show a clear optimal K.
VI. Conclusion
Machine learning can help resolve many business problems nowadays, but this
study shows that the data must be detailed enough for the analysis and that
the analyst must pay close attention to exploring it. Most of the algorithms
are already implemented in libraries, which saves a lot of effort in a data
science project. In this example project, although several aspects need to
be improved, the result was presented to the audience in a highly visual
way.