DBpedia-Spotlight-Dashboard - GSoC 2021

Objective

This dashboard supports the analysis of both the DBpedia datasets (instance-types, redirects and disambiguations) and the Wikipedia statistics files (uriCounts, pairCounts, sfAndTotalCounts and tokenCounts). It computes statistical measures over these data that reveal trends in DBpedia resources, Wikipedia links and surface forms.

To make the dashboard, these steps have been followed:

  1. Obtain raw data from the DBpedia Databus
  2. Entity validation process: during the project, some Spotlight entities were found to have an unknown type. This process separates the DBpedia entities with known types from those with unknown types. An entity with a known type appears in at least one of the following datasets: instance-types, redirects or disambiguations. An entity with an unknown type appears in none of them.
  3. Computation of statistical measures: percentage of entities with known types over the total (precision), percentage of entities with unknown types over the total (impact), mean, median, standard deviation, quartiles, percentiles…
  4. Plot dashboard figures
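
The entity validation step (step 2) can be sketched as a set-membership check. This is only a minimal illustration under stated assumptions, not the project's actual code: the function name `split_by_known_type` and the toy data are hypothetical, and URIs are shortened to their local names.

```python
def split_by_known_type(spotlight_entities, instance_types, redirects, disambiguations):
    """Split Spotlight entities into those found in any DBpedia dataset and the rest."""
    # An entity has a known type if it appears in at least one of the three datasets.
    known_universe = set(instance_types) | set(redirects) | set(disambiguations)
    known = {entity for entity in spotlight_entities if entity in known_universe}
    unknown = set(spotlight_entities) - known
    return known, unknown


# Toy data (hypothetical; the last entity appears in none of the datasets).
spotlight = {"Cristiano_Ronaldo", "Artesanal", "Abate", "Some_Unlisted_Page"}
known, unknown = split_by_known_type(
    spotlight,
    instance_types={"Cristiano_Ronaldo"},
    redirects={"Artesanal"},
    disambiguations={"Abate"},
)
```

Using sets keeps each lookup constant-time, which matters when the datasets contain millions of entities.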

Raw Data

As mentioned above, the statistical measures have been calculated from the DBpedia datasets and the Wikipedia statistics files (Wikistats).

DBpedia Datasets

DBpedia Dataset | Sample
redirects.nt | <http://es.dbpedia.org/resource/Artesanal> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Artesanía> .
disambiguations.nt | <http://es.dbpedia.org/resource/Abate> <http://dbpedia.org/ontology/wikiPageDisambiguates> <http://es.dbpedia.org/resource/Carlo_Abate> .
instance_types.nt | <http://es.dbpedia.org/resource/Cristiano_Ronaldo> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Athlete> .
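
Each sample above is one N-Triples statement. As a rough sketch, a line whose three terms are all IRIs (as in these samples) could be split with a regular expression. The names `TRIPLE_RE` and `parse_uri_triple` are hypothetical, and a real N-Triples parser would be needed for literals or blank nodes.

```python
import re

# Matches only triples whose subject, predicate and object are all IRIs,
# which is the shape of the samples above (assumption: no literals).
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_uri_triple(line):
    """Return (subject, predicate, object) for an IRI-only triple, or None."""
    match = TRIPLE_RE.match(line.strip())
    return match.groups() if match else None


sample = (
    "<http://es.dbpedia.org/resource/Cristiano_Ronaldo> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Athlete> ."
)
subject, predicate, obj = parse_uri_triple(sample)
```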

Wikistats

File | Sample
uriCounts | http://es.dbpedia.org/resource/Ciudadanía_rusa 69
pairCounts | ciudadanos rusos http://es.dbpedia.org/resource/Ciudadanía_rusa 5
sfAndTotalCounts | ciudadanos rusos 5 133
tokenCounts | http://es.wikipedia.org/wiki/14_Wall_Street {(street,13),(wall,11),(edifici,10),(del,5),(adyacent,5),(broadway,4)...}
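
The first three Wikistats files could be read line by line as sketched below. This assumes the fields are separated by tabs, which the flattened samples above do not show but which would be needed because surface forms can contain spaces. The function names are hypothetical, and the labels `sf_count` and `total_count` are a reading of the file name rather than something the text states.

```python
def parse_uri_counts(line):
    """uriCounts: <uri> TAB <count>."""
    uri, count = line.rstrip("\n").split("\t")
    return uri, int(count)


def parse_pair_counts(line):
    """pairCounts: <surface form> TAB <uri> TAB <count>."""
    surface_form, uri, count = line.rstrip("\n").split("\t")
    return surface_form, uri, int(count)


def parse_sf_and_total_counts(line):
    """sfAndTotalCounts: <surface form> TAB <first count> TAB <second count>."""
    surface_form, sf_count, total_count = line.rstrip("\n").split("\t")
    return surface_form, int(sf_count), int(total_count)


# The sample rows above, rewritten with explicit tabs (assumed separator).
pair = parse_pair_counts(
    "ciudadanos rusos\thttp://es.dbpedia.org/resource/Ciudadanía_rusa\t5"
)
sf = parse_sf_and_total_counts("ciudadanos rusos\t5\t133")
```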

DBpedia Spotlight Dashboard Flowchart

Figure 1. DBpedia Spotlight Dashboard Flowchart

Dashboard Content

The dashboard consists of 4 tabs: Information, Instance-types comparison, Details and Feedback.

Figure 2 shows the 4 main tabs of the dashboard.

Figure 2. Dashboard tabs

Information tab

This tab explains:

Instance-types comparison tab

This tab is used to compare the instance-types of the October 2016, October 2020, May 2021 and June 2021 versions for the English and Spanish languages.

It is divided into 3 views:

Figure 3 shows a table with the entities and types of the October 2016 and May 2021 versions for the English language.

Figure 3. Version comparison

Figure 4 shows a chart with the number of entities in the October 2016 and May 2021 versions for the English language.

Figure 4. Version 1 VS Version 2

Figure 5 shows a chart with the number of entities by DBpedia type in the October 2016 and May 2021 versions for the English language.

Figure 5. DBpedia types comparison

Details tab

It contains 6 sub-tabs: Summary, Instance-types, uriCounts, pairCounts, tokenCounts and sfAndTotalCounts.

Figure 6 shows the 6 sub-tabs of the Details tab.

Figure 6. Sub-tabs of Details tab

Summary

It shows the calculated statistics.

Figure 7 shows measures of central tendency (mean and mode), which indicate where the data cluster the most; here, they show how the DBpedia entities, surface forms and Wikipedia tokens are grouped. It also shows the standard deviation, the main measure of dispersion, which indicates the degree of variability of the DBpedia entities, surface forms and Wikipedia tokens.

Figure 7. Table with statistical measures of the June 2021 version for the English language
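
As an illustration of these measures, the sketch below computes the mean, mode and standard deviation with Python's standard `statistics` module. The input values are hypothetical, and the choice of the population standard deviation (`pstdev`) is an assumption; the text does not say which variant the dashboard uses.

```python
import statistics

# Hypothetical counts, e.g. the count column of a few Wikistats lines.
counts = [1, 1, 2, 3, 5, 69, 133]

mean = statistics.mean(counts)        # central tendency
mode = statistics.mode(counts)        # most frequent value
std_dev = statistics.pstdev(counts)   # dispersion (population standard deviation)
```

With skewed data such as this, the standard deviation exceeds the mean, which is one simple sign of high dispersion.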

Instance-types

This sub-tab shows the instance-types in more detail for the selected language and version.

Figure 8 shows part of the content of the instance-types sub-tab of the May 2021 version for the English language.

Figure 8. Instance-types details

It is divided into two main sections:

Both sections are formed by the following views:

Figure 9 shows the entities by DBpedia types chart of the May 2021 version for the English language.

Figure 9. Entities by DBpedia types chart

Moreover, the following views can be seen in the DBpedia Spotlight section:

These measurements are used to find out which Spotlight entities have known DBpedia types and which entities have unknown types. They are calculated as follows:

Precision = Number of entities with known types / Number of entities
Impact = Number of entities with unknown types / Number of entities
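
The two formulas above translate directly into code. The sketch below assumes that every entity is either known or unknown, so the total is the sum of both groups. The function name and the input counts are hypothetical and are not taken from the dashboard.

```python
def precision_and_impact(n_known, n_unknown):
    """Return (precision, impact) as fractions of the total number of entities."""
    total = n_known + n_unknown
    if total == 0:
        raise ValueError("no entities to evaluate")
    return n_known / total, n_unknown / total


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
precision, impact = precision_and_impact(n_known=750, n_unknown=250)
```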

Figure 10 shows that, for the English language in the May 2021 version, 63% of entities have known types and 27% have unknown types.

Figure 10. Precision and Impact indicators

Figure 11 shows the position measures for DBpedia types chart of the May 2021 version for the English language.

Figure 11. Position measures for DBpedia types
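
Position measures such as quartiles and percentiles can be obtained with `statistics.quantiles` (Python 3.8 or later). This is a sketch over hypothetical numbers of entities per DBpedia type; the interpolation method the dashboard uses is not stated, so the library default is assumed.

```python
import statistics

# Hypothetical number of entities for each of nine DBpedia types.
entities_per_type = [3, 7, 8, 12, 20, 45, 60, 150, 900]

# Three cut points that split the data into four groups (quartiles).
q1, q2, q3 = statistics.quantiles(entities_per_type, n=4)

# 99 cut points that split the data into 100 groups (percentiles).
percentiles = statistics.quantiles(entities_per_type, n=100)
p90 = percentiles[89]  # 90th percentile
```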

Figure 12 shows the table of the top 50 DBpedia types with the most entities in the May 2021 version for the English language.

Figure 12. Top 50 DBpedia types with more entities
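
A ranking like this one can be built by counting entities per type and keeping the largest groups. The sketch below uses `collections.Counter` over hypothetical (entity, type) pairs and is not the project's actual implementation.

```python
from collections import Counter

# Hypothetical (entity, type) pairs, as they could be read from instance-types.
entity_types = [
    ("Cristiano_Ronaldo", "Athlete"),
    ("Lionel_Messi", "Athlete"),
    ("Madrid", "City"),
    ("Lisbon", "City"),
    ("Rafael_Nadal", "Athlete"),
    ("Don_Quixote", "Book"),
]

type_counts = Counter(dbpedia_type for _, dbpedia_type in entity_types)
top_types = type_counts.most_common(50)  # at most 50 (type, count) pairs
```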

uriCounts

This sub-tab shows metrics calculated from the uriCounts file.

The main measures are:

Figure 13 shows the measures of central tendency and dispersion calculated from the uriCounts file of the May 2021 version for the English language.

Figure 13. Calculated measures from uriCounts file

pairCounts

This sub-tab shows metrics calculated from the pairCounts file.

The main measures are:

Figure 14 shows the measures of central tendency and dispersion calculated from the pairCounts file of the May 2021 version for the English language.

Figure 14. Calculated measures from pairCounts file

tokenCounts

This sub-tab shows metrics calculated from the tokenCounts file.

The main measures are:

Figure 15 shows the measures of central tendency and dispersion calculated from the tokenCounts file of the May 2021 version for the English language.

Figure 15. Calculated measures from tokenCounts file

sfAndTotalCounts

This sub-tab shows metrics calculated from the sfAndTotalCounts file.

The main measures are:

Figure 16 shows the measures of central tendency and dispersion calculated from the sfAndTotalCounts file of the May 2021 version for the English language.

Figure 16. Calculated measures from sfAndTotalCounts file

In addition, Figure 17 shows the surface forms according to their state in the Wikipedia dump.

Figure 17. Surface forms state

Feedback tab

Questions and suggestions for improvement can be submitted through the following form: https://forms.gle/YKiibhasVuYQ5goe6

Figure 18 shows the Feedback tab.

Figure 18. Feedback tab

Evaluation

The usability of the Dashboard has been evaluated according to the following usability principles:

The people who carried out the evaluation work in the area of Entity Linking or have a Computer Science profile.

The results obtained are the following:

Usability principle | Severity rating
Visibility of System Status | No Usability Problem: 66.7%; Cosmetic Problem Only: 33.3%
Match between System and the Real World | No Usability Problem: 83.3%; Minor Usability Problem: 16.7%
User Control and Freedom | No Usability Problem: 100%
Consistency and Standards | No Usability Problem: 100%
Recognition rather than Recall | No Usability Problem: 83.3%; Minor Usability Problem: 16.7%
Flexibility and Efficiency of Use | No Usability Problem: 83.3%; Cosmetic Problem Only: 16.7%
Aesthetic and Minimalist Design/Remove the Extraneous (Ink) | No Usability Problem: 83.3%; Cosmetic Problem Only: 16.7%
Spatial Organization | No Usability Problem: 83.3%; Cosmetic Problem Only: 16.7%
Information Coding | No Usability Problem: 100%
Orientation | No Usability Problem: 100%

In addition, the dashboard as a whole was also evaluated:

Number of people who gave a global rating | Mark (from 0 to 10)
2 | 8
1 | 8.5
3 | 9

After observing the results of the evaluation, it has been determined that visual adjustments can be made to improve the rating of the following usability principles:

Also, corrections or functionalities can be added to the dashboard to solve minor usability problems in the following usability principles:

Finally, it has been concluded that the dashboard can be improved in some aspects but is usable in general terms.

Used Tools

GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.

Dash is a productive Python framework for building web analytic applications.

Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It’s particularly suited for anyone who works with data in Python.

Plotly’s Python graphing library makes interactive, publication-quality graphs. It supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

Spyder is a free and open source scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts. It features a unique combination of the advanced editing, analysis, debugging, and profiling functionality of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and beautiful visualization capabilities of a scientific package.

How to Run

To run the dashboard on your local system, it is only necessary to:

The script will install all the necessary packages and modules.

The dashboard web page will be running at: http://localhost:8050

Conclusions

Throughout this work:

  1. The raw data that DBpedia Spotlight uses to build its models has been obtained.
  2. These data have been submitted to the entity validation process.
  3. Statistical measures have been calculated.
  4. A dashboard has been built that shows these measures using cards and charts.

Measures of central tendency, measures of dispersion and position measures have been calculated. Measures of central tendency are used to see where the data are grouped the most. Measures of dispersion are used to see the degree of variability of the data. Position measures divide the data into intervals of the same size.

The analysis of these measures shows a high degree of dispersion: many values lie far from the mean, that is, the data are highly imbalanced. In addition, since entity types are highly unbalanced, much of the information in the dataset is covered by a small group of entities. After ordering the entities from highest to lowest, the first quartile was covered by 1 or 2 types of entities, while the last quartile contained a large number of types of entities.
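
The quartile observation can be reproduced on a small scale by sorting types by size and counting how many are needed to cover a given share of all entities. This is a sketch under stated assumptions: the function name `types_needed_to_cover` is hypothetical, and the data is invented to be deliberately skewed.

```python
def types_needed_to_cover(entities_per_type, fraction):
    """Count how many of the largest types are needed to cover `fraction` of all entities."""
    total = sum(entities_per_type.values())
    covered = 0
    sorted_counts = sorted(entities_per_type.values(), reverse=True)
    for n_types, count in enumerate(sorted_counts, start=1):
        covered += count
        if covered >= fraction * total:
            return n_types
    return len(entities_per_type)


# Hypothetical, deliberately skewed distribution (1000 entities in total).
entities_per_type = {
    "Person": 500, "Place": 300, "Work": 100, "Species": 50,
    "Event": 20, "Device": 15, "Food": 10, "Award": 5,
}
first_quartile_types = types_needed_to_cover(entities_per_type, 0.25)
```

In this toy distribution a single type already covers the first quarter of the entities, while the last quarter of the coverage is spread over many small types.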

Future Work

These are some tasks that would be interesting to do in the future:

Progress

[17/05/2021]: Proposal acceptance and community bonding period started.

[27/05/2021]: Meeting the mentors on Google Meet to introduce ourselves and talk about the project and interesting ideas:

[10/06/2021]: Second meeting with the mentors, first advances in the project and new ideas:

[14/06/2021]: Some progress:

[24/06/2021]: The problem of URLs validation has been resolved:

SPARQL validation

If the value returned by the query is 0, it means that this URL does not have any type, that is, it is a URL that does not exist and therefore is invalid.
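
The query itself is not reproduced in the text, so the following is only a guess at what such a check could look like: a SPARQL query that counts the `rdf:type` statements of a resource, plus the rule described above. Both function names and the query shape are assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_type_count_query(resource_uri):
    """Build a SPARQL query that counts the rdf:type statements of a resource."""
    return (
        "SELECT (COUNT(?type) AS ?n) WHERE { "
        f"<{resource_uri}> <{RDF_TYPE}> ?type "
        "}"
    )


def is_valid_resource(type_count):
    """Apply the rule from the text: a count of 0 means the URL is invalid."""
    return type_count > 0


query = build_type_count_query("http://es.dbpedia.org/resource/Cristiano_Ronaldo")
```

Sending the query to an endpoint is left out on purpose, since the endpoint and client library used in the project are not stated.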

Spanish valid types

[25/06/2021]: DBpedia entities used by Spotlight have been validated for both Spanish and English languages. Now it is time to think of other interesting statistical measures to show on the dashboard:

English valid types

English statistics

[01/07/2021]: Some statistical measures have been calculated from DBpedia datasets (redirects, disambiguations and instance-types) and Wikistats (uriCounts, pairCounts, tokenCounts, sfAndTotalCounts) for English and Spanish:

[Images: Redirects and disambiguations; Instance-types 1 to 3; uriCounts1 and uriCounts2; pairCounts1 and pairCounts2; tokenCounts1; sfAndTotalCounts1 to sfAndTotalCounts3]

Next tasks:

   1. Review all the statistics generated (especially those of the instance-types file) -> In progress
   2. Think about other statistics that may be interesting to have on the dashboard
   3. Think about how these statistics will be displayed on the dashboard

[02/07/2021]: Statistics have been revised and now they all seem to be correct. The next step is to think about how to display them in the Dashboard.

[06/07/2021]: I have thought about how to display the statistics on the dashboard.

These are the statistics that are currently displayed:

This is how the dashboard is at the moment, waiting for feedback from the mentors to change what is necessary:

[Images: dashboard_stats1 to dashboard_stats10]

[08/07/2021]: Meeting with the mentors and new ideas for the Dashboard:

[11/07/2021]: DBpedia Spotlight and Spotlight Dashboard flow charts have been made. Some Dashboard charts have also been changed, and the data now looks better. Waiting for feedback from mentors:

information


[15/07/2021]: Meeting with the mentors in which the Dashboard has been reviewed and the following tasks to be carried out have been defined:

[18/07/2021]:

Subtabs

Comparison 1

Comparison 2

[19/07/2021]:

Comparison 3

Pending tasks:

[20/07/2021]: Once the final version of the Dashboard is made (there are still pending tasks), the idea is to update it in the future with suggestions for improvement from the users:

Evaluation

Evaluation Tab

[28/07/2021]:

[Images: Cards1 to Cards4; DBpedia datasets; Wikistats; Appearance]

Tasks in progress:

[29/07/2021]: A Summary tab has been added to show users a summary of the statistics for English and Spanish.

Summary

[05/08/2021]: Meeting with mentors. Finally, the publication of the statistics as linked data is pending as a future task due to lack of time. It only remains to make some visual improvements to the Dashboard.

[11/08/2021]: Some visual changes have been made to the Dashboard:

[15/08/2021]: Most of the visual changes suggested by the mentors have been implemented:

Details

Comparison table