<p><em>Murali Kashaboina's Posts - Data Science Central</em></p>
<p><strong>Simple Correlational Analysis on Socioeconomic Factors Impacting Covid-19 Outbreak in US Counties</strong><br/>Murali Kashaboina, June 21, 2020</p>
<p><strong>Background</strong></p>
<p>As part of my PhD work, I recently had to analyze any dataset(s) of my interest and present findings. I ended up conducting a study on US County-wise Covid-19 data. I wanted to share my key findings through this blog.</p>
<p><strong>Study Question</strong></p>
<p>The primary question I wanted to address through data analysis was <em>“Do counties’ socioeconomic factors such as population size, poverty rate, unemployment rate, education percent and urbanization rate have any direct impact on the Covid-19 outbreak across counties within United States?”</em></p>
<p><strong>Selected Datasets</strong></p>
<ol>
<li>Covid-19 daily cases and deaths data for every county within the United States (US), captured and published by the New York Times between January 1, 2020 and May 31, 2020 (Coronavirus (Covid-19) data in the United States, 2020).</li>
<li>Socioeconomic characteristics data, such as population sizes and poverty, unemployment and education rates, for every county within the US, captured and published by the United States Department of Agriculture (USDA) (County-level socio-economic data in United States, 2020).</li>
<li>Land area data for every county within the US, captured and published by the United States Census Bureau (County-level land area data in United States, 2020).</li>
<li>Urban-rural classification data for every county within the US, captured and published by the United States Census Bureau (Counties’ urban-rural classification data in United States, 2020).</li>
</ol>
<p><strong>Data Details</strong></p>
<ol>
<li><strong>Covid-19 data</strong>: This data contained daily cases and deaths reported by each county across US captured between January 1, 2020 and May 31, 2020. Figure 1 below shows a snippet of the data.</li>
<li><strong>Counties’ population data</strong>: This data provided population estimates for each US county between 2010 and 2019. Besides yearly estimates, it contained many other estimates categorized by demographics. For the purposes of the current study, the 2019 estimate was used.</li>
<li><strong>Counties’ poverty data</strong>: This data provided poverty percentage estimates for each US county, last updated in 2018. It included several other estimates and metrics such as household incomes. For the purposes of the current study, the overall poverty percentages updated in 2018 were used.</li>
<li><strong>Counties’ unemployment data</strong>: This data contained yearly unemployment rates for each US county between 2000 and 2019. For the purposes of the current study, the 2019 rates were used.</li>
<li><strong>Counties’ education data</strong>: This data provided, for every US county, the percentages of population with less than a high school diploma, with a high school diploma, with some college degree, and with a bachelor’s degree or higher, measured every decade starting in 1970, with the most recent estimates covering 2014 through 2018. For the purposes of this study, the percent with some college degree and the percent with a bachelor’s degree or higher were used.</li>
<li><strong>Counties’ urban-rural data</strong>: This data provided census-based urban-rural classification data, expressed in percentages, for every US county. It was captured for the 2010 census with incremental updates. The overall rural percent was used.</li>
<li><strong>Counties’ land area data</strong>: This data provided land area in square miles for each US county, captured every decade starting in 1990 and last updated in 2019. For the purposes of this study, the 2019 values were used.</li>
</ol>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/5956598858?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5956598858?profile=RESIZE_710x" class="align-center" width="250"/></a><em>Figure 1</em>: US Counties’ Covid-19 Cases By Date</p>
<p><strong>Method</strong></p>
<p>The method used in this analysis was to find correlations between US counties’ socioeconomic factors and the number of county-wise Covid-19 cases reported. The socioeconomic factors studied included population density, poverty rate, unemployment rate, education percent and urbanization percent. Correlation analysis is a statistical method that gives insight into the existence of relationships between quantitative variables and provides a metric to infer the strength of such relationships.</p>
<p><strong>Analysis Tool</strong></p>
<p>Jupyter Notebook was used to script the analysis steps using Python 3 programming language. While Python data analysis libraries such as NumPy, Pandas and Sklearn were used for data cleansing, integration, wrangling and executing correlations, Python’s plotting libraries such as Matplotlib and Seaborn were used to plot the results.</p>
<p><strong>Analysis Steps</strong></p>
<p>The steps involved extracting the required data fields from each dataset, cleansing the data, wrangling it, computing additional fields and integrating the data records to prepare a dataset ready for correlational analysis. Cleansing involved imputation of missing values in the different datasets and standardization of the state and county fields so that records from different datasets could be mapped. Data wrangling involved collapsing and consolidating metric values and reshaping the datasets. Since each dataset had county and state name fields, the integration step used these two fields as the unique keys to match records from all the processed datasets into a consolidated dataset ready for analysis. The final step involved executing the correlational analysis on the integrated dataset.</p>
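<p>As a sketch of the integration step, the following pandas merge matches records on the standardized state and county key pair. The frames and field names here are illustrative stand-ins, not the original datasets:</p>

```python
import pandas as pd

# Hypothetical minimal frames standing in for two of the cleaned datasets.
covid = pd.DataFrame({"county": ["Baldwin", "Cook"],
                      "state": ["AL", "IL"],
                      "total_cases": [312, 89000]})
poverty = pd.DataFrame({"county": ["Baldwin", "Cook"],
                        "state": ["AL", "IL"],
                        "poverty_pct": [10.2, 13.1]})

# Integration: match records using (county, state) as the unique key pair.
merged = covid.merge(poverty, on=["county", "state"], how="inner")
```

<p>An inner join keeps only the counties present in both datasets, which mirrors the consolidation described above.</p>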
<p>While poverty rate and unemployment rate were used directly from the respective datasets, the factors population density, education percent and urbanization percent were computed and derived. Population density was calculated as population size per square mile, using counties’ population estimates from the population dataset and the counties’ land areas from the land area dataset. Education percent of a county was computed by adding the percent with some college degree and the percent with a bachelor’s or higher degree found in the counties’ education rates dataset. Urban percent of a county was deduced by subtracting the rural percent value from 100.</p>
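<p>The three derived factors reduce to simple column arithmetic. The column names and values below are illustrative, not the original field names:</p>

```python
import pandas as pd

# Toy rows standing in for two counties; values are made up.
df = pd.DataFrame({
    "population_2019": [223234, 5150233],
    "land_area_sq_mi": [1589.8, 945.3],
    "pct_some_college": [31.0, 27.5],
    "pct_bachelors_or_higher": [29.0, 41.2],
    "pct_rural": [42.3, 0.0],
})

# Population density: population size per square mile.
df["pop_density"] = df["population_2019"] / df["land_area_sq_mi"]
# Education percent: some college plus bachelor's or higher.
df["education_pct"] = df["pct_some_college"] + df["pct_bachelors_or_higher"]
# Urban percent: the complement of the rural percent.
df["urban_pct"] = 100.0 - df["pct_rural"]
```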
<p><strong>Data Preparation</strong></p>
<p>The state and county fields in different datasets were represented in different formats. For example, the county field in different datasets had values such as ‘Baldwin County, Alabama’, ‘Baldwin County, AL’ and ‘Baldwin, AL’, where the state was indicated either by full name or by state code. In other datasets, the county field had values such as ‘Baldwin County’ and ‘Baldwin’, with the state captured in a separate field either by full name or by state code.<br/> As part of data preparation, the county field was standardized to contain only the actual name part, such as ‘Baldwin’. Similarly, the state field was standardized to contain only the state code, such as ‘AL’. For datasets where the state field had the full name, the full name was transformed into the two-letter state code.</p>
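<p>One way to sketch this standardization in Python (the state-name lookup here is abbreviated and hypothetical; a full implementation would cover all states and territories):</p>

```python
import re

# Abbreviated, hypothetical lookup from full state name to two-letter code.
STATE_CODES = {"alabama": "AL", "illinois": "IL", "new york": "NY"}

def standardize(raw):
    """Split a raw county string such as 'Baldwin County, Alabama' or
    'Baldwin, AL' into a standardized (county, state_code) pair."""
    county_part, state_part = [s.strip() for s in raw.split(",")]
    # Keep only the actual name part of the county.
    county = re.sub(r"\s+County$", "", county_part)
    # Transform a full state name into its two-letter code if needed.
    state = state_part if len(state_part) == 2 else STATE_CODES[state_part.lower()]
    return county, state.upper()
```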
<p>Metric values in different datasets were on different scales. For example, the total number of covid-19 cases was in the hundreds of thousands, while population density values were on the scale of hundreds. It was challenging to visualize the data because values on different scales could not fit the chart area. As such, there was a need to normalize and scale the data values specifically for visualization purposes. For example, a logarithmic scale was used for the total number of covid-19 cases, while min-max normalization coupled with a scaling factor was used to visualize the other metrics.</p>
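<p>The two visualization transforms described above can be sketched with NumPy; the values here are made up for illustration:</p>

```python
import numpy as np

cases = np.array([12, 340, 8900, 191000])        # spans several orders of magnitude
density = np.array([38.0, 412.0, 1350.0, 72000.0])

# Logarithmic scale for case counts so extreme values fit the chart area.
log_cases = np.log10(cases)

# Min-max normalization coupled with a scaling factor for other metrics.
scale = 5.0
density_scaled = scale * (density - density.min()) / (density.max() - density.min())
```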
<p><strong>Results</strong></p>
<p>Correlations were evaluated for all counties in US and separately for counties from top 4 states where the reported cases were higher. The top 4 states are New York, New Jersey, California and Illinois. Figure 2 below captures the correlations between the counties’ socioeconomic factors and total number of covid-19 cases.</p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/6157512084?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/6157512084?profile=RESIZE_710x" class="align-center" width="700"/></a><em>Figure 2</em>: Correlation Matrices for All Counties, Counties in NY, NJ, CA, IL</p>
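<p>With the integrated dataset in hand, a correlation matrix like those shown in Figure 2 takes a single pandas call; the toy values below are fabricated for illustration, and Seaborn can render the result as a heatmap:</p>

```python
import pandas as pd

# Toy county-level frame; values are fabricated for illustration only.
df = pd.DataFrame({
    "total_cases": [120, 900, 15000, 88000],
    "pop_density": [40.0, 310.0, 2200.0, 17000.0],
    "poverty_pct": [18.0, 14.0, 12.0, 16.0],
})

# Pairwise Pearson correlation matrix across all factor columns.
corr = df.corr(method="pearson")
# seaborn.heatmap(corr, annot=True) would plot it as in Figure 2.
```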
<p style="text-align: left;"><strong>Findings</strong></p>
<ul>
<li style="text-align: left;">Population density manifested a positive correlation with the total number of covid-19 cases. This was an expected result because it is generally believed that the outbreak was significant in highly populated counties.</li>
<li style="text-align: left;">State-level correlations indicated that covid-19 impact was considerably higher in densely populated counties in New York, New Jersey and Illinois.</li>
<li style="text-align: left;">Socioeconomic factors such as poverty, unemployment and education rates did not manifest any correlations with the total number of covid-19 cases. However, there was an observable positive correlation between poverty rates and covid-19 cases in New Jersey. The joint positive correlations of population density and poverty rate in New Jersey perhaps indicate that the impact of covid-19 was higher in highly populated counties with higher poverty rates.</li>
<li style="text-align: left;">Urban percentage manifested an observable positive correlation with the total number of covid-19 cases. While urban percent showed a strong positive correlation in New Jersey, it also correlated significantly in the other three states. This result indicates that covid-19 impact was much higher in counties with greater urbanization, where higher population is typically expected.</li>
</ul>
<p><strong>Other Findings</strong></p>
<p>Results also showed correlations among the socioeconomic factors themselves. The following are key non-covid related inferences.</p>
<ul>
<li>Population density manifested a high degree of positive correlation with urban and education percents. From this it can be inferred that urban counties had populations with higher education levels.</li>
<li>Poverty rate manifested a strong positive correlation with unemployment rate. This is perhaps an expected result, since most counties with higher poverty rates also generally manifested higher unemployment rates.</li>
<li>Both poverty and unemployment rates manifested strong negative correlations with education percent. This indicates that most populations in counties with higher poverty and unemployment rates lacked a college degree.</li>
</ul>
<p>Figure 3 below shows a scaled view of county-wise population density vs. covid-19 cases in New York state.</p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/5956999095?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5956999095?profile=RESIZE_710x" class="align-center" width="500"/></a><em>Figure 3</em>: NY Counties’ Covid Cases Vs. Population Density Scaled Dot Plot</p>
<p style="text-align: left;"><strong>Key Inference</strong></p>
<ul>
<li style="text-align: left;">Such insights can be used by decision makers in both healthcare and government organizations to plan, prepare and implement measures against such outbreaks. They would also help decision makers identify vulnerable counties, particularly those with higher populations and greater urbanization, where the results indicate a higher possibility of outbreak.</li>
<li style="text-align: left;">The findings also uncovered the need for more structured approaches to collect and process data on epidemiological outbreaks, for a robust mechanism to collect such data in an automated fashion, and for tools to conduct analysis and assess actionable responses.</li>
</ul>
<p><strong>Analysis Artifacts</strong></p>
<p>If anyone is interested in the analysis artifacts, please message me on LinkedIn.</p>
<p style="text-align: center;"><strong>References</strong></p>
<p>Coronavirus (Covid-19) data in the United States [Data set]. (2020). The New York Times. Retrieved from <a href="https://github.com/nytimes/covid-19-data">https://github.com/nytimes/covid-19-data</a></p>
<p>Counties’ urban-rural classification data in United States [Data set]. (2020). United States Census Bureau. Retrieved from <a href="https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html">https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html</a></p>
<p>County-level land area data in United States [Data set]. (2020). United States Census Bureau. Retrieved from <a href="https://www.census.gov/library/publications/2011/compendia/usa-counties-2011.html#LND">https://www.census.gov/library/publications/2011/compendia/usa-counties-2011.html#LND</a></p>
<p>County-level socio-economic data in United States [Data set]. (2020). United States Department of Agriculture Economic Research Service. Retrieved from <a href="https://www.ers.usda.gov/data-products/county-level-data-sets/">https://www.ers.usda.gov/data-products/county-level-data-sets/</a></p>
<p>Garattini, C., Raffle, J., Aisyah, D. N., Sartain, F., & Kozlakidis, Z. (2019). Big data analytics, infectious diseases and associated ethical impacts. Philosophy & Technology, 1, 69. <a href="http://dx.doi.org/10.1007/s13347-017-0278-y">http://dx.doi.org/10.1007/s13347-017-0278-y</a></p>
<p>Morgan, O. (2019). How decision makers can use quantitative approaches to guide outbreak responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 374(1776), 1</p>
<p>Redding, D. W., Atkinson, P. M., Cunningham, A. A., Lo Iacono, G., Moses, L. M., Wood, J. L. N., & Jones, K. E. (2019). Impacts of environmental and socio-economic factors on emergence and epidemic potential of Ebola in Africa. Nature Communications, 10(1), 4531. <a href="http://dx.doi.org/10.1038/s41467-019-12499-6">http://dx.doi.org/10.1038/s41467-019-12499-6</a></p>
<p><strong>Key Graph Based Shortest Path Algorithms With Illustrations - Part 2: Floyd-Warshall's And A-Star Algorithms</strong><br/>Murali Kashaboina, January 25, 2020</p>
<p>In part 1 of this article series, I provided a quick primer on the graph data structure, acknowledged that there are several graph based algorithms with the notable ones being the shortest path/distance algorithms, and illustrated Dijkstra’s and Bellman-Ford algorithms. Continuing with the shortest path/distance algorithms, I have illustrated the Floyd-Warshall and A* (A-Star) algorithms in this part 2 of the article. As was stated in part 1, while the inner workings of these algorithms are thoroughly covered in many textbooks and informative resources online, I felt that not many provided visual examples that illustrate the processing steps to sufficient granularity to enable easy understanding of the working details. In addition, a solid understanding of the intuition behind such algorithms not only helps in appreciating the logic behind them but also helps in making conscious decisions about their applicability in real life cases.</p>
<p><strong>Floyd-Warshall's Algorithm</strong></p>
<p>Floyd-Warshall’s algorithm is a dynamic programming based algorithm that computes the shortest distances between every pair of vertices in a weighted graph where negative weights are allowed. Dynamic programming is a problem-solving approach in which a complex problem is solved incrementally, essentially iteratively, such that values computed in a previous iteration are used to compute values in the current iteration. As such, to qualify for dynamic programming, a problem should be divisible into sub-problems with identical problem context. Floyd-Warshall employs the dynamic programming approach by dividing a path between two vertices into two sub-paths connected via a third intermediary vertex. For example, a path between vertices A and D can be considered a path via a third intermediary vertex C, so that the distance d(A,D) can be computed as the sum of the distances d(A,C) and d(C,D), i.e., d(A,D) = d(A,C) + d(C,D); the path between vertices A and C can then be further considered a path via yet another intermediary vertex B, so that d(A,C) = d(A,B) + d(B,C). As such, the main problem of computing the distance between two vertices is divided into sub-problems of computing intermediary distances, wherein both the main problem and the sub-problems deal with the identical context of finding distances between two vertices. At every step in the dynamic programming approach, the Floyd-Warshall algorithm determines a shortest tentative distance between two vertices. Note that a distance between two vertices is considered tentative until the algorithm confirms it to be the shortest distance. The core of the algorithm is that the shortest distance between vertices A and C is the minimum of either the tentative shortest distance found so far between A and C or the sum of the tentative shortest distances from A to B and from B to C, where B is the intermediate vertex. The algorithm iterates over the vertices of the graph and, at each iteration, treats the current iteration’s vertex as the intermediary while evaluating the tentative distances between pairs of the other vertices. The algorithm completes after testing every possible pair in the graph and every possible intermediate vertex between a pair.</p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3831604935?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3832208605?profile=RESIZE_710x" class="align-center"/></a><strong><em>Figure 3 : Illustrations of Floyd-Warshall Algorithm</em></strong></p>
<p style="text-align: left;">Figure 3 illustrates the logical steps in the execution of the Floyd-Warshall algorithm. In essence, the algorithm maintains an adjacency matrix comprising the evaluated distances between every pair of vertices, which gets updated at each iteration. The example graph contains 4 vertices, hence the algorithm maintains an adjacency matrix of size 4 x 4 = 16 values. As an example, the value in the 3<sup>rd</sup> row and 2<sup>nd</sup> column represents the evaluated distance between vertex 3 and vertex 2. At step 0, the algorithm initializes the matrix values. The algorithm sets the distance of a vertex to itself as 0, as shown in the diagonal values of the matrix. The algorithm sets known direct edge weights between any two vertices as initial distances in the matrix. For all other pairs of vertices without direct edges, the algorithm sets the distances to infinity.</p>
<p style="text-align: left;">The algorithm then begins step 1. At step 1, the algorithm employs the dynamic programming approach by evaluating the minimum tentative distance between a pair of vertices while treating vertex 1 as the intermediate vertex. As such, the algorithm evaluates the tentative distance between vertices i and j, A1( i, j ), as the minimum of either A0( i, j ) or A0( i, 1 ) + A0( 1, j ) for all values of i and j where i is not equal to j and neither i nor j is equal to 1. Here, A1 is the copy of the adjacency matrix being updated in step 1, while A0 is the adjacency matrix created at initialization step 0. Essentially, the dynamic programming formula for evaluating distances at step 1 is given by A1( i, j ) = Min( A0( i, j ), A0( i, 1 ) + A0( 1, j ) ) ∀ i ≠ j ⋀ i ≠ 1 ⋀ j ≠ 1. Because the formula constraints must be satisfied during the execution, the algorithm fixes row 1, column 1 and the diagonal values for step 1 as shown in figure 3, meaning those fixed values cannot be updated during this step. As such, the only distances that can be evaluated in step 1 are A1( 2, 3 ), A1( 2, 4 ), A1( 3, 2 ), A1( 3, 4 ), A1( 4, 2 ) and A1( 4, 3 ). The algorithm evaluates the value A1( 2, 3 ) using the formula A1( 2, 3 ) = Min( A0( 2, 3 ), A0( 2, 1 ) + A0( 1, 3 ) ). Similarly, the algorithm evaluates the value A1( 4, 3 ) using the formula A1( 4, 3 ) = Min( A0( 4, 3 ), A0( 4, 1 ) + A0( 1, 3 ) ). As illustrated in figure 3 step 1, the algorithm evaluates and updates the values of the A1 matrix accordingly to complete step 1.</p>
<p style="text-align: left;">The algorithm then begins step 2. At this step, the algorithm employs the dynamic programming approach by evaluating the minimum tentative distance between a pair of vertices while treating vertex 2 as the intermediate vertex. Essentially, the dynamic programming formula for evaluating distances at step 2 is given by A2( i, j ) = Min( A1( i, j ), A1( i, 2 ) + A1( 2, j ) ) ∀ i ≠ j ⋀ i ≠ 2 ⋀ j ≠ 2. Here, A2 is the copy of the adjacency matrix being updated in step 2, while A1 is the copy updated in step 1. In this step, the algorithm fixes row 2, column 2 and the diagonal values, hence the distances that can be evaluated in step 2 are A2( 1, 3 ), A2( 1, 4 ), A2( 3, 1 ), A2( 3, 4 ), A2( 4, 1 ) and A2( 4, 3 ). The algorithm evaluates the value A2( 1, 3 ) using the formula A2( 1, 3 ) = Min( A1( 1, 3 ), A1( 1, 2 ) + A1( 2, 3 ) ). Similarly, the algorithm evaluates the value A2( 4, 3 ) using the formula A2( 4, 3 ) = Min( A1( 4, 3 ), A1( 4, 2 ) + A1( 2, 3 ) ). Using this approach, as shown in figure 3 step 2, the algorithm evaluates and updates the values of the A2 matrix to complete step 2. In a similar manner, the algorithm executes step 3 and step 4. Note that at step 3, the algorithm evaluates the minimum tentative distance between a pair of vertices by treating vertex 3 as the intermediate vertex, using the dynamic programming formula A3( i, j ) = Min( A2( i, j ), A2( i, 3 ) + A2( 3, j ) ) ∀ i ≠ j ⋀ i ≠ 3 ⋀ j ≠ 3, where A3 is the copy of the adjacency matrix being updated in step 3 and A2 is the copy updated in step 2. Similarly, at step 4, the algorithm evaluates the minimum tentative distance between a pair of vertices by treating vertex 4 as the intermediate vertex, using the dynamic programming formula A4( i, j ) = Min( A3( i, j ), A3( i, 4 ) + A3( 4, j ) ) ∀ i ≠ j ⋀ i ≠ 4 ⋀ j ≠ 4, where A4 is the copy of the adjacency matrix being updated in step 4 and A3 is the copy updated in step 3. The updated matrix at step 4 contains the final shortest distances between every pair of vertices in the example graph.</p>
<p style="text-align: left;">As can be inferred from the illustration, the algorithm executes N steps, where N is the number of vertices in the graph. At each step, the algorithm evaluates at most (N<sup>2</sup> - 3N + 2) tentative distance values. Therefore, the total number of distances evaluated by the algorithm is given by N x (N<sup>2</sup> - 3N + 2). As such, the time complexity of the Floyd-Warshall algorithm is in the order of N<sup>3</sup>. This time complexity is the same as executing Dijkstra’s algorithm (with time complexity of N<sup>2</sup>) for N iterations, where at each iteration a different vertex in the graph is treated as the source vertex to evaluate its distances to the remaining vertices.</p>
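<p>The iteration scheme described above can be sketched in a few lines of Python. The 4-vertex directed graph here is a hypothetical example, not the graph of Figure 3:</p>

```python
import math

INF = math.inf

def floyd_warshall(weights):
    """All-pairs shortest distances. weights[i][j] is the direct edge
    weight from vertex i to vertex j, or INF when there is no edge."""
    n = len(weights)
    # Step 0: initialize with direct edge weights and zero diagonals.
    dist = [row[:] for row in weights]
    for i in range(n):
        dist[i][i] = 0
    # Step k: allow vertex k as the intermediary and keep the minimum of
    # the tentative distance so far and the path through k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Example: directed edges 0->1 (3), 0->3 (7), 1->0 (8), 1->2 (2),
# 2->0 (5), 2->3 (1), 3->0 (2).
weights = [[INF, 3, INF, 7],
           [8, INF, 2, INF],
           [5, INF, INF, 1],
           [2, INF, INF, INF]]
dist = floyd_warshall(weights)
```

<p>For instance, the shortest distance from vertex 0 to vertex 3 drops from the direct weight 7 to 6 once the path via vertices 1 and 2 is considered.</p>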
<p style="text-align: left;"><strong>A* (A-Star) Algorithm</strong></p>
<p style="text-align: left;">The A* algorithm is a heuristic based, greedy, best-first search algorithm used to find an optimal path from a source vertex to a target vertex in a weighted graph. As was stated in part 1, an algorithm is said to be greedy if it leverages a locally optimal solution at every step in its execution with the expectation that such local optimal solutions will ultimately lead to the global optimal solution. The A* algorithm works on the greedy expectation that the vertex with the lowest evaluated cost is the best vertex to be on the path that will lead to the optimal path from the starting vertex to the target vertex. The incremental estimate of the cost is evaluated based on two cost components. If V is the next vertex on the path, then the estimated cost is given by f(V) = h(V) + g(V), where h(V) is a heuristic associated with vertex V and g(V) is the estimated cost of the path thus far from the source vertex to vertex V. While g(V) is estimated as the sum of edge weights from the source vertex to vertex V, as in Dijkstra’s algorithm, the value h(V) is a value assigned to vertex V that can be considered a metric indicating how close vertex V is to the target vertex.</p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3832702701?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3832710184?profile=RESIZE_710x" class="align-center" width="500"/></a><em><strong>Figure 4 : Dijkstra's Algorithm Example</strong></em></p>
<p style="text-align: center;"><strong><a href="https://storage.ning.com/topology/rest/1.0/file/get/3832703548?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3832710506?profile=RESIZE_710x" class="align-center" width="500"/></a></strong></p>
<p style="text-align: center;"><em><strong>Figure 5 : Pure Greedy Best First Algorithm Example</strong></em></p>
<p style="text-align: left;">As was illustrated in part 1 and as shown in figure 4, Dijkstra’s algorithm guarantees a shortest path from a source vertex A to target vertex F, however at the expense of computing shortest paths from source vertex A to every other vertex. Dijkstra’s algorithm greedily selects the next vertex on the path purely based on how close that vertex is to the source vertex. It does not take into consideration any heuristic indicating how close the next vertex is to the target vertex. The shortest path computed by Dijkstra’s algorithm in the example is {A, B, D, F}. On the other hand, as shown in figure 5, a pure heuristic driven, greedy, best-first algorithm picks the next vertex purely based on the vertex’s closeness to the target vertex as indicated by the associated heuristic value. It does not take into consideration how close the next vertex is to the source vertex. As shown in figure 5, starting at source vertex A, the algorithm first selects vertex E as the next vertex because it has a lower heuristic compared to vertices B and C. Then, starting at vertex E, the algorithm selects vertex G as the next vertex because it has a lower heuristic compared to vertex D, and finally, starting at vertex G, the algorithm selects the target vertex F, whose heuristic is 0. As such, the optimal path computed by the pure greedy best-first algorithm in the example is {A, E, G, F}.</p>
<p style="text-align: left;">A* can be considered an algorithm that tries to combine the best of both worlds – Dijkstra’s and the pure greedy best-first algorithm. In essence, the A* algorithm considers not only how close the next vertex is to the target vertex, indicated by h(n), but also how close the next vertex is to the source vertex, indicated by g(n), to find the optimal path. Starting at the source vertex, the algorithm evaluates the value f(n) = h(n) + g(n) for all the neighbors and picks the neighbor V with the lowest f(n) value as the next vertex on the path. The algorithm then computes f(n) values for the neighbors of vertex V and selects the next vertex with the lowest f(n) value. The algorithm repeats this process until the evaluation reaches the target vertex.</p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3832704522?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3832704522?profile=RESIZE_710x" class="align-center"/></a><em><strong>Figure 6 : Illustrations of A* Algorithm</strong></em></p>
<p style="text-align: left;">Figure 6 illustrates the logical steps in the execution of A* algorithm. The steps are intentionally simplified to ensure that focus is on understanding the intuition. The goal of the algorithm in the illustration is to find the optimal path between source vertex A and target vertex F. Each vertex in the weighted graph is assigned with a heuristic value relative to target vertex. The algorithm maintains an ordered set representing path vertices, initially containing the source vertex A. As shown in figure 6, at step 1, the algorithm starts at source vertex A and computes f(n) = h(n) + g(n) for each of its neighbors, B, C and E. The algorithm selects vertex E as the next best vertex as its f(n) value 10 is the lowest as compared to B’s value of 13 and C’s value of 11. The algorithm adds vertex E to the path vertices set. At step 2, the algorithm moves to vertex E and computes f(n) value for each of the neighbors of vertex E – value of 23 for vertex C, 11 for vertex D and 13 for vertex G in the example. Since vertex D’s f(n) value is the lowest, the algorithm selects vertex D as the next best and adds it to the path vertices set. The algorithm then moves to vertex D at step 3 and computes f(n) values for its neighbors – a value of 28 for vertex C, 11 for vertex F and 20 for vertex G. The algorithm selects vertex F since it has the lowest f(n) value, adds vertex F to the path vertices set and moves to vertex F. The algorithm determines that it has reached the target vertex F and stops the execution. The resulting path vertices set contains the optimal path {A, E, D, F}.</p>
<p style="text-align: left;">One of the main challenges in employing the A* algorithm is the determination of vertex heuristic values. There are suggested methods to compute either exact or approximate values, the details of which are perhaps a different discussion topic. Whichever method is employed, it is generally suggested that the heuristic values assigned to vertices be distinct, so that no two vertices get the same heuristic value. This avoids the need to break a potential tie when more than one vertex ends up with the same f(n) value. The selection of the heuristic computation method also impacts the time complexity of the A* algorithm. In the worst case, the time complexity can depend exponentially on the number of vertices for which the f(n) value needs to be evaluated.</p>
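<p>A compact A* sketch follows. It maintains an open set as a priority queue ordered by f(n), which is slightly more general than the simplified walkthrough of Figure 6; the graph, edge weights and heuristic values below are hypothetical, not those of the figure:</p>

```python
import heapq

def a_star(graph, heuristic, source, target):
    """Optimal path from source to target. graph maps each vertex to a
    dict of {neighbor: edge weight}; heuristic maps each vertex to an
    estimate of its distance to the target (heuristic[target] == 0)."""
    # Open set entries are (f, g, vertex, path so far), ordered by f = g + h.
    open_set = [(heuristic[source], 0, source, [source])]
    best_g = {source: 0}
    while open_set:
        f, g, vertex, path = heapq.heappop(open_set)
        if vertex == target:
            return path
        for neighbor, weight in graph[vertex].items():
            tentative_g = g + weight
            # Only pursue a neighbor if this route improves its g(n).
            if tentative_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = tentative_g
                heapq.heappush(open_set,
                               (tentative_g + heuristic[neighbor],
                                tentative_g, neighbor, path + [neighbor]))
    return None  # target is unreachable from source

# Hypothetical example where f(n) steers the search through C rather than B.
graph = {"A": {"B": 1, "C": 4}, "B": {"D": 5}, "C": {"D": 1}, "D": {"F": 3}, "F": {}}
heuristic = {"A": 5, "B": 6, "C": 4, "D": 3, "F": 0}
path = a_star(graph, heuristic, "A", "F")
```

<p>Here the route via C costs 8 in total against 9 via B, so A* returns the path A, C, D, F even though B initially looks cheaper by edge weight alone.</p>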
<p style="text-align: left;"><strong>More Algorithms</strong></p>
<p style="text-align: left;">In the upcoming continuation parts of this article, I will cover additional graph based shortest path algorithms with concrete illustrations, and I hope such illustrations will help in understanding the intuitions behind those algorithms.</p>
<p><strong>Key Graph Based Shortest Path Algorithms With Illustrations - Part 1: Dijkstra's And Bellman-Ford Algorithms</strong></p>
<p>While many programming libraries encapsulate the inner workings of graph and other algorithms, as a data scientist it helps a lot to have a reasonably good familiarity with those details. A solid understanding of the intuition behind such algorithms not only helps in appreciating the logic behind them but also helps in making conscious decisions about their applicability to real-life cases. Among the many graph based algorithms, the shortest path algorithms are the most notable; Dijkstra’s, Bellman-Ford, A*, Floyd-Warshall and Johnson’s algorithms are commonly encountered. While these algorithms are discussed in many textbooks and online resources, I felt that few provided visual examples illustrating the processing steps at sufficient granularity to make the working details easy to follow. As such, I used simple graphs to visualize the algorithmic flow for my own understanding, and I wanted to share those examples along with explanations through this article. Since there are many algorithms to illustrate, I have divided the article into several parts. In part 1, I illustrate Dijkstra’s and Bellman-Ford algorithms. Before diving into the algorithms, I also want to highlight some salient points about the graph data structure.</p>
<p><strong>Quick Primer On Graph Data Structure<br/></strong></p>
<p>A graph is a data structure comprising a finite non-empty set of vertices in which some pairs of vertices are connected. In real life, vertices represent real-world objects, and a relationship between a pair of objects is represented by a link between the corresponding vertices. The link between a pair of vertices is referred to as an edge. Edges have directionality. In the case of a unidirectional edge, an arrow points from the tail vertex (source) to the head vertex (target), and hence the link goes one way. As such, an edge between vertices v1 and v2 is an ordered pair (v1, v2) where v1 is the tail vertex and v2 is the head vertex. In the case of a bidirectional edge, arrows point in both directions and hence the link goes both ways; an edge between vertices v1 and v2 is then an unordered pair, and both (v1, v2) and (v2, v1) represent the same edge. A graph containing only unidirectional edges is called a directed graph, a graph containing only bidirectional edges is called an undirected graph, and a graph in which some edges are unidirectional and some are bidirectional is called a mixed graph. The number of edges incident to a vertex is called the degree of the vertex. The out-degree of a vertex is the number of directed edges incident to the vertex where the vertex is the tail, and the in-degree is the number of directed edges incident to the vertex where the vertex is the head. In addition, edges can have weights; an edge weight represents the capacity, cost or distance of that edge and can be a positive or negative number. A path from vertex v1 to vertex vn is a sequence of vertices v1, v2, v3...vn such that the pairs (v1, v2), (v2, v3)…(vn-1, vn) are connected via edges in the graph. As such, two vertices are connected if a path exists between them. A path is said to be simple if all of its vertices are distinct, with the possible exception of the first and the last. A path is said to be circular or cyclic if the first and the last vertex are the same. A directed graph without any circular paths is called a Directed Acyclic Graph (DAG). The number of edges in a path represents the path’s length, and the sum of the edge weights in the path represents the capacity or cost or distance of that path. A cyclic path whose total weight is negative is referred to as a negative cycle.</p>
<p>A graph is said to be complete if each of its vertices is connected to all other vertices. An undirected complete graph with N vertices has N(N-1)/2 edges. Complete graphs are also commonly referred to as universal graphs.</p>
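<p>To make the primer concrete, the following short sketch represents a small directed, weighted graph as an adjacency list and computes the in-degree and out-degree of each vertex, along with the edge count of a complete undirected graph. The vertex names and weights are arbitrary examples.</p>

```python
# A small directed, weighted graph as an adjacency list:
# each vertex maps to {neighbor: edge_weight}.
graph = {
    "v1": {"v2": 5, "v3": 2},  # two outgoing edges from v1
    "v2": {"v3": 1},
    "v3": {},
}

# Out-degree: number of edges where the vertex is the tail.
out_degree = {v: len(neighbors) for v, neighbors in graph.items()}

# In-degree: number of edges where the vertex is the head.
in_degree = {v: 0 for v in graph}
for neighbors in graph.values():
    for head in neighbors:
        in_degree[head] += 1

# Edge count of a complete undirected graph on N vertices: N(N-1)/2.
def complete_graph_edges(n):
    return n * (n - 1) // 2

print(out_degree)               # {'v1': 2, 'v2': 1, 'v3': 0}
print(in_degree)                # {'v1': 0, 'v2': 1, 'v3': 2}
print(complete_graph_edges(5))  # 10
```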
<p><strong>Dijkstra’s Algorithm</strong></p>
<p>Dijkstra’s algorithm is a greedy algorithm used to find the shortest paths between a source vertex and the other vertices in a graph with weighted edges. An algorithm is said to be greedy if it makes the locally optimal choice at every step of its execution with the expectation that those local choices ultimately lead to a globally optimal solution. Dijkstra’s algorithm relies on the greedy observation that a sub-path between vertices A and B within a globally shortest path from A to C is itself a shortest path between A and B. The limitation of Dijkstra’s algorithm is that it may not work if there are negative edge weights and definitely will not work if there are negative cycles in the graph. The algorithm finds the shortest paths from the source vertex to all other vertices by visiting a vertex, measuring the path lengths from the source to all of that vertex’s neighbors, and then visiting the unvisited vertex with the shortest tentative path. The algorithm repeats these steps until it has visited all reachable vertices in the graph.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3817275953?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3821291440?profile=RESIZE_710x" class="align-center" width="750"/></a></p>
<p style="text-align: center;"><em>Figure 1: Illustrations of Dijkstra's Algorithm</em></p>
<p>Figure 1 illustrates the logical steps in the execution of Dijkstra’s algorithm. The algorithm maintains two sets: 1] a set with all unvisited vertices and 2] a set with all visited vertices, which is initially empty. In addition, the algorithm maintains a tally of the tentative shortest distance to each vertex in the graph, measured thus far from the source vertex, and keeps a reference to a predecessor vertex on the path from the source to each vertex. As shown in figure 1, at step 1 the algorithm sets the distance from the source vertex, A in the example, to every other vertex as infinity and the distance from the source to itself as 0. At step 2, the algorithm measures the tentative distances to each of the unvisited neighbors of the source vertex, B, C and E in the example. The distance of a vertex from the source is considered tentative until the algorithm confirms it to be the shortest distance. The tentative distance of vertex V, given the tentative distance of vertex U from the source, is measured using the formula d(U) + d(U,V), where d(U) is the tentative distance of vertex U from the source vertex and d(U,V) is the edge weight between vertices U and V. If the newly measured tentative distance, i.e., d(U) + d(U,V), is less than the previously assigned tentative distance of vertex V, d(V), then the algorithm updates the tentative distance of vertex V with the new value, i.e., d(V) = d(U) + d(U,V). This process of updating a vertex’s distance when the newly measured distance is smaller is commonly referred to as relaxation, and a vertex whose distance gets updated this way is said to be relaxed. In the example, the algorithm relaxes vertices B, C and E and at the same time sets vertex A as their predecessor vertex. The algorithm marks the current vertex A as visited, removes it from the unvisited set and places it in the visited set. The algorithm then selects the neighbor with the shortest tentative distance as the next vertex to visit, vertex C in the example, and iterates to the next step. At step 3, the algorithm measures the tentative distances of the unvisited neighbors of vertex C, i.e., B, D and E, relative to the original source vertex A. The algorithm relaxes vertex D with a new tentative distance and sets its predecessor vertex to C. The algorithm does not update the tentative distances of vertices B and E since their distances measured via vertex C are greater than their previously assigned tentative distances. As in step 2, the algorithm marks the current vertex C as visited, removes it from the unvisited set and places it in the visited set. The algorithm then selects the unvisited vertex with the shortest tentative distance as the next vertex to visit, vertex B in the example, and iterates to the next step. The algorithm repeats these steps until all vertices have been marked as visited or there are no more connected vertices to evaluate. The shortest path from the source vertex to any other vertex can then be determined by looking up the predecessor vertices in the evaluated table. For example, to determine the shortest path from vertex A to vertex G, the table is looked up to find the predecessor vertex of G, which is D; the predecessor of D is B, and the predecessor of B is A. As such, the shortest path from vertex A to vertex G is <em>{A,B,D,G}</em> with a shortest distance of 11.</p>
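<p>The walkthrough above can be sketched as follows. The implementation below uses a min-priority queue to pick the next vertex to visit; the example graph is hypothetical (its weights are not those of Figure 1), but it is chosen so that the shortest path from A to G comes out as {A,B,D,G} with distance 11, matching the illustration.</p>

```python
import heapq

def dijkstra(graph, source):
    """Dijkstra's algorithm over {vertex: {neighbor: weight}} with
    non-negative edge weights. Returns (distances, predecessors)."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    predecessor = {}
    visited = set()
    pq = [(0, source)]  # min-priority queue of (tentative distance, vertex)
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue  # stale queue entry; u was already finalized
        visited.add(u)
        for v, w in graph[u].items():
            if d + w < dist[v]:  # relaxation step
                dist[v] = d + w
                predecessor[v] = u
                heapq.heappush(pq, (dist[v], v))
    return dist, predecessor

def shortest_path(predecessor, source, target):
    """Walk the predecessor table back from target to source."""
    path = [target]
    while path[-1] != source:
        path.append(predecessor[path[-1]])
    return path[::-1]

# A hypothetical weighted graph (not the one in Figure 1).
graph = {
    "A": {"B": 2, "C": 1, "E": 4},
    "B": {"D": 3},
    "C": {"B": 2, "D": 6, "E": 2},
    "D": {"G": 6},
    "E": {"G": 9},
    "G": {},
}
dist, pred = dijkstra(graph, "A")
print(shortest_path(pred, "A", "G"), dist["G"])  # ['A', 'B', 'D', 'G'] 11
```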
<p>In a complete graph comprising N vertices, where each vertex is connected to all other vertices, the number of vertices the algorithm visits is N, and the number of vertices potentially relaxed each time a vertex is visited is also N. As such, the worst-case time complexity of this simple form of Dijkstra’s algorithm is in the order of NxN = N<sup>2</sup>. (With a min-priority queue such as a binary heap, the running time improves to the order of (N + E) log N, where E is the number of edges.)</p>
<p><strong>Bellman-Ford Algorithm</strong></p>
<p>The Bellman-Ford algorithm finds the shortest paths from a source vertex to all other vertices in a weighted graph. Unlike Dijkstra’s algorithm, the Bellman-Ford algorithm works when there are negative edge weights, although it still cannot produce shortest paths when a negative cycle is reachable from the source; an extra iteration can, however, detect such a cycle. The core of the algorithm is centered on iteratively relaxing the path distances of vertices from the source vertex. If there are N vertices, a shortest path from the source to any vertex can contain at most (N-1) edges. As such, the algorithm iterates at most (N-1) times to relax the vertex distances. In every iteration, the algorithm starts at the source vertex, walks the outgoing edges to the connected neighbors, evaluates the tentative distance of each neighbor and updates it if the new value is less than the previous one. The algorithm then moves to the next vertex and repeats the process of walking the outgoing edges and relaxing the tentative distances of the connected neighbors where possible. Thus, in every iteration, the algorithm visits all the vertices and walks all the edges, relaxing vertex distances wherever possible. The algorithm repeats these relaxation iterations at most (N-1) times or until no vertex distance changes anymore.</p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3819318314?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3819318314?profile=RESIZE_710x" class="align-center" width="750"/></a><em>Figure 2: Illustrations of Bellman-Ford Algorithm</em></p>
<p style="text-align: left;">Figure 2 illustrates the logical steps in the execution of the Bellman-Ford algorithm. In essence, the algorithm maintains a table containing the evaluated shortest distance to each vertex from the source vertex, along with the predecessor vertex, both of which can be updated at every iteration. In the example graph, there are 4 vertices and hence the algorithm executes at most 4-1 = 3 iterations. At the initiation step, the algorithm sets the distances from the source vertex, A in the example, to all other vertices as infinity and the distance from the source vertex to itself as 0 in the table. At this step, the predecessor vertex row is empty. The algorithm then begins iteration 1. It starts at the source vertex and evaluates the distances to the neighbors connected by outgoing edges. In the example, vertex A has only one outgoing edge, to vertex C with a weight of -2. As such, the measured distance of vertex C from vertex A is d(A) + d(A,C) = 0 - 2 = -2, which is less than C’s currently assigned value of infinity. Hence the algorithm relaxes vertex C in the table and sets vertex A as the predecessor vertex of C. The algorithm then moves to the next vertex, B in the example. Since the tentative distance of vertex B is still infinity, none of its neighbors connected by outgoing edges can be relaxed. As such, the algorithm moves to the next vertex, C. Vertex C has one outgoing edge, connecting to vertex D. Since the current tentative distance of C is -2 and the outgoing edge weight to vertex D is 2, the algorithm evaluates the tentative distance of vertex D as -2 + 2 = 0, and since this is less than vertex D’s current distance of infinity, the algorithm relaxes vertex D and sets C as D’s predecessor vertex. The algorithm then moves to vertex D. Vertex D has one outgoing edge, to vertex B with a weight of -1.
As such, the algorithm evaluates the tentative distance of vertex B as 0 - 1 = -1, and since this is less than B’s current distance of infinity, the algorithm relaxes vertex B and sets D as B’s predecessor vertex. With this, the algorithm completes iteration 1 and begins iteration 2. As in iteration 1, the algorithm starts at the source vertex A. Since vertex C is the only neighbor connected by an outgoing edge, the algorithm evaluates the distance of C from A. The distance is unchanged at -2, so the algorithm moves to the next vertex, B. The current tentative distance of vertex B, as measured in iteration 1, is -1. Vertex B has two outgoing edges, one connecting to vertex A and the other to vertex C. The algorithm evaluates the tentative distances of vertices A and C relative to vertex B. Since the outgoing edge weight to vertex A is 4, the tentative distance of vertex A relative to vertex B is -1 + 4 = 3; since this is greater than vertex A’s current distance of 0, the algorithm does not relax vertex A. Similarly, the outgoing edge weight to vertex C is 3, so the tentative distance of vertex C relative to vertex B is -1 + 3 = 2; since this is greater than C’s current distance of -2, the algorithm does not update vertex C in the table. The algorithm then moves to vertex C. Since vertex C has one outgoing edge to vertex D, the algorithm evaluates the tentative distance of vertex D relative to vertex C as -2 + 2 = 0, and since this is the same as vertex D’s current distance, the algorithm does not update vertex D in the table. The algorithm then moves to vertex D, which has one outgoing edge to vertex B. The algorithm evaluates the tentative distance of vertex B as 0 - 1 = -1, and since this is the same as B’s current distance, the algorithm does not update vertex B in the table.
With this, the algorithm completes iteration 2. Having determined that no vertex was relaxed in iteration 2 and that the distances remained unchanged, the algorithm stops the execution even though iteration 3 is pending. The shortest path from the source vertex to any other vertex can then be determined by looking up the predecessor vertices in the evaluated table. For example, to determine the shortest path from vertex A to vertex B, the table is looked up to find the predecessor vertex of B, which is D; the predecessor of D is C, and the predecessor of C is A. As such, the shortest path from vertex A to vertex B is <em>{A,C,D,B}</em> with a shortest distance of -1.</p>
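<p style="text-align: left;">The walkthrough above translates directly into code. The sketch below uses the same graph as Figure 2 (edges A→C with weight -2, B→A with 4, B→C with 3, C→D with 2, D→B with -1), including the early exit when a full pass relaxes nothing and the extra pass that detects a negative cycle reachable from the source.</p>

```python
def bellman_ford(vertices, edges, source):
    """Bellman-Ford over a list of directed edges (tail, head, weight).
    Returns (distances, predecessors); raises if a negative cycle is
    reachable from the source."""
    dist = {v: float("inf") for v in vertices}
    dist[source] = 0
    predecessor = {}
    for _ in range(len(vertices) - 1):   # at most N-1 relaxation passes
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:    # relaxation step
                dist[v] = dist[u] + w
                predecessor[v] = u
                changed = True
        if not changed:
            break  # early exit: a full pass with no relaxation means done
    # One extra pass: any further relaxation implies a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative cycle")
    return dist, predecessor

# The example graph from Figure 2.
vertices = ["A", "B", "C", "D"]
edges = [("A", "C", -2), ("B", "A", 4), ("B", "C", 3),
         ("C", "D", 2), ("D", "B", -1)]
dist, pred = bellman_ford(vertices, edges, "A")
print(dist["B"])  # -1, via the path A -> C -> D -> B
```

<p style="text-align: left;">As in the walkthrough, the second pass relaxes nothing, so the loop exits after two of the possible three iterations.</p>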
<p style="text-align: left;">In essence, the Bellman-Ford algorithm performs at most E edge relaxations in every iteration, where E is the number of edges in the graph. Since the algorithm executes at most (N-1) iterations, where N is the number of vertices, the total number of relaxations is at most E x (N-1). As such, the time complexity of the algorithm is in the order of (E x N). In a complete graph comprising N vertices, where each vertex is connected to all other vertices, the total number of edges is N(N-1)/2, so the total number of relaxations is N(N-1)/2 x (N-1). As such, the worst-case time complexity of the Bellman-Ford algorithm is in the order of N<sup>3</sup>.</p>
<p style="text-align: left;"><strong>More Algorithms To Cover<br/></strong></p>
<p style="text-align: left;">In the upcoming continuation parts of this article, I will cover several other graph based shortest path algorithms with concrete illustrations. I hope these illustrations help in getting a good grasp of the intuitions behind those algorithms.</p>