DBpedia-Spotlight-Dashboard - GSoC 2021

Objective
Raw data
Dashboard chart
Dashboard content
Evaluation
Used tools
How to run
Conclusions
Future work
Progress

Objective

The purpose of this dashboard is to facilitate the understanding and analysis of both the DBpedia datasets (instance-types, redirects and disambiguations) and Wikipedia’s statistics (uriCounts, pairCounts, sfAndTotalCounts and tokenCounts). It does so by calculating statistical measures on these data that reveal trends in DBpedia resources, Wikipedia links and surface forms.

To make the dashboard, these steps have been followed:

Obtain raw data from the DBpedia Databus
Entity validation process: during the project it became apparent that some Spotlight entities have an unknown type. This process separates the DBpedia entities with known types from those with unknown types. An entity with a known type appears in at least one of the following datasets: instance-types, redirects and disambiguations, whereas an entity with an unknown type appears in none of them.
Computation of statistical measures: percentage of entities with known types over the total (precision), percentage of entities with unknown types over the total (impact), mean, median, standard deviation, quartiles, percentiles…
Plot dashboard figures
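
The entity validation step can be sketched as a set-membership check. This is a minimal illustration, not the project's actual implementation: the function name and the `ex:` URIs are hypothetical, and it assumes each dataset has already been reduced to a set of entity URIs.

```python
def split_by_known_type(spotlight_entities, instance_types, redirects, disambiguations):
    """Split entities into (known, unknown) by membership in the three datasets."""
    # An entity has a known type if it appears in at least one dataset.
    known_pool = set(instance_types) | set(redirects) | set(disambiguations)
    known = {entity for entity in spotlight_entities if entity in known_pool}
    unknown = set(spotlight_entities) - known
    return known, unknown


# Hypothetical URIs, for illustration only.
entities = {"ex:A", "ex:B", "ex:C", "ex:D"}
known, unknown = split_by_known_type(
    entities,
    instance_types={"ex:A"},
    redirects={"ex:B"},
    disambiguations={"ex:Z"},
)
print(sorted(known))    # ['ex:A', 'ex:B']
print(sorted(unknown))  # ['ex:C', 'ex:D']
```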

Raw Data

As mentioned before, the statistical measures have been calculated from the DBpedia datasets and the Wikipedia statistical files (Wikistats).

DBpedia Datasets

redirects.nt: contains the redirect links extracted from Wikipedia redirection pages
disambiguations.nt: contains the disambiguation links extracted from Wikipedia disambiguation pages
instance_types.nt: contains the classification of instances according to the DBpedia Ontology, as triples of the form `<$resource> rdf:type <$dbpedia_ontology_class>` generated by the mappings extraction.

DBpedia Dataset	Sample
redirects.nt	`<http://es.dbpedia.org/resource/Artesanal> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Artesanía> .`
disambiguations.nt	`<http://es.dbpedia.org/resource/Abate> <http://dbpedia.org/ontology/wikiPageDisambiguates> <http://es.dbpedia.org/resource/Carlo_Abate> .`
instance_types.nt	`<http://es.dbpedia.org/resource/Cristiano_Ronaldo> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Athlete> .`
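
As a rough illustration of how such lines can be read, the sketch below parses a triple whose three terms are all IRIs, as in the samples above. The regular expression and the function name are assumptions made for this example; literals and blank nodes, which the full N-Triples syntax allows, are deliberately not handled.

```python
import re

# Matches "<subject> <predicate> <object> ." where all three terms are IRIs.
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_iri_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE_RE.match(line.strip())
    return match.groups() if match else None


sample = (
    "<http://es.dbpedia.org/resource/Artesanal> "
    "<http://dbpedia.org/ontology/wikiPageRedirects> "
    "<http://es.dbpedia.org/resource/Artesanía> ."
)
subject, predicate, obj = parse_iri_triple(sample)
print(subject)  # http://es.dbpedia.org/resource/Artesanal
```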

Wikistats

uriCounts: contains the number of times each DBpedia resource (URI) appears in the Wikipedia dump
pairCounts: contains the number of times that a text (surface form) is used to link a DBpedia resource
sfAndTotalCounts: contains the number of times a text (surface form) appears linked to a DBpedia resource (second column) and also the number of times it appears unlinked (third column)
tokenCounts: contains the number of times each word (token) appears in each Wikipedia article

File	Sample
uriCounts	`http://es.dbpedia.org/resource/Ciudadanía_rusa 69`
pairCounts	`ciudadanos rusos http://es.dbpedia.org/resource/Ciudadanía_rusa 5`
sfAndTotalCounts	`ciudadanos rusos 5 133`
tokenCounts	`http://es.wikipedia.org/wiki/14_Wall_Street {(street,13),(wall,11),(edifici,10),(del,5),(adyacent,5),(broadway,4)...}`
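
The first three files can be read line by line with a few small helpers. The sketch below assumes the columns are tab-separated (surface forms contain spaces, so a space separator would be ambiguous); the function names are hypothetical, and the column meanings follow the descriptions above.

```python
def parse_uri_count(line):
    """uriCounts: URI, count."""
    uri, count = line.rstrip("\n").split("\t")
    return uri, int(count)


def parse_pair_count(line):
    """pairCounts: surface form, URI, count."""
    surface_form, uri, count = line.rstrip("\n").split("\t")
    return surface_form, uri, int(count)


def parse_sf_and_total_count(line):
    """sfAndTotalCounts: surface form, second column, third column."""
    surface_form, second, third = line.rstrip("\n").split("\t")
    return surface_form, int(second), int(third)


row = parse_pair_count(
    "ciudadanos rusos\thttp://es.dbpedia.org/resource/Ciudadanía_rusa\t5"
)
print(row[0], row[2])  # ciudadanos rusos 5
```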

DBpedia Spotlight Dashboard Flowchart

Figure 1. DBpedia Spotlight Dashboard Flowchart

Dashboard Content

The dashboard consists of 4 tabs:

Information
Instance-types comparison
Details
Feedback

Figure 2 shows the 4 main tabs of the dashboard.

Figure 2. Dashboard tabs

Information tab

This tab explains:

The purpose of this dashboard
How the statistics have been computed
The entity validation process
The raw files that DBpedia Spotlight uses during the generation of a language model

Instance-types comparison tab

This tab compares the instance-types of the October 2016, October 2020, May 2021 and June 2021 versions for the English and Spanish languages.

It is divided into 3 views:

Version comparison: a table to compare the number of entities and types of the selected versions as well as their differences

Figure 3 shows a table with entities and types of October 2016 and May 2021 versions for the English language.

Figure 3. Version comparison
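
The numbers behind such a comparison can be sketched as follows. This assumes each version is available as a list of (entity, type) pairs; the helper names and the sample pairs are hypothetical and do not come from the dashboard.

```python
def summarize(pairs):
    """Count distinct entities and distinct types in (entity, type) pairs."""
    return {
        "entities": len({entity for entity, _ in pairs}),
        "types": len({type_ for _, type_ in pairs}),
    }


def difference(older, newer):
    """Per-metric difference between two summaries (newer minus older)."""
    return {key: newer[key] - older[key] for key in older}


# Hypothetical data for two versions.
old_pairs = [("ex:A", "Person"), ("ex:B", "Place")]
new_pairs = [("ex:A", "Person"), ("ex:B", "Place"), ("ex:C", "Person")]
delta = difference(summarize(old_pairs), summarize(new_pairs))
print(delta)  # {'entities': 1, 'types': 0}
```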

Version 1 VS Version 2: a chart that illustrates the number of entities from the selected versions

Figure 4 shows a chart with the number of entities of October 2016 and May 2021 versions for the English language.

Figure 4. Version 1 VS Version 2

DBpedia types comparison: the entities from each selected version are graphically compared based on the DBpedia hierarchy of classes

Figure 5 shows a chart with the number of entities by DBpedia types of October 2016 and May 2021 versions for the English language.

Figure 5. DBpedia types comparison

Details tab

It contains 6 sub-tabs:

Summary
Instance-types
uriCounts
pairCounts
tokenCounts
sfAndTotalCounts

Figure 6 shows the 6 sub-tabs of the Details tab

Figure 6. Sub-tabs of Details tab

Summary

It shows the calculated statistics.

Figure 7 shows measures of central tendency (mean and mode), which indicate where the data cluster the most. Here they show how the DBpedia entities, surface forms and Wikipedia tokens are grouped. The table also includes the standard deviation, the main measure of dispersion, which indicates the degree of variability of the DBpedia entities, surface forms and Wikipedia tokens.

Figure 7. Table with statistical measures of the June 2021 version for the English language

Instance-types

This sub-tab shows the instance-types in more detail for the selected language and version.

Figure 8 shows part of the content of the instance-types sub-tab of the May 2021 version for the English language.

Figure 8. Instance-types details

It is divided into two main sections:

DBpedia Extraction Framework: to see metrics about the raw files of the DBpedia Databus that Spotlight uses to generate the models
DBpedia Spotlight: to see metrics about the entities and types that are actually used by DBpedia Spotlight after the entity validation process

Both sections are formed by the following views:

Measures of Central Tendency: mean, mode
Measures of Dispersion: standard deviation
Entities by DBpedia types

Figure 9 shows the entities by DBpedia types chart of the May 2021 version for the English language.

Figure 9. Entities by DBpedia types chart

Moreover, the following views can be seen in the DBpedia Spotlight section:

Precision and impact calculated after entity validation process

These measurements are used to find out which Spotlight entities have known DBpedia types and which entities have unknown types. They are calculated as follows:

Precision = Nº entities with known types / Nº entities
Impact = Nº entities with unknown types / Nº entities
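
A small worked example of the two formulas, using made-up counts rather than figures from the dashboard. Since every entity has either a known or an unknown type, the two ratios always sum to 1. The function name is hypothetical.

```python
def precision_and_impact(known_count, unknown_count):
    """Return (precision, impact) as fractions of the total number of entities."""
    total = known_count + unknown_count
    if total == 0:
        raise ValueError("no entities to evaluate")
    return known_count / total, unknown_count / total


# Hypothetical counts, for illustration only.
precision, impact = precision_and_impact(known_count=3, unknown_count=1)
print(precision, impact)  # 0.75 0.25
```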

Figure 10 shows that, for the English language in the May 2021 version, 63% of entities have known types and 27% have unknown types.

Figure 10. Precision and Impact indicators

Position measures for DBpedia types (quartiles and percentiles)

Figure 11 shows the position measures for DBpedia types chart of the May 2021 version for the English language.

Figure 11. Position measures for DBpedia types
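
Position measures can be reproduced with Python's standard `statistics` module, as in the sketch below. This is not how the dashboard computes them (it relies on GNU datamash), and the interpolation method may differ between tools, so the exact cut points are not guaranteed to match. The counts are hypothetical entity counts per DBpedia type.

```python
import statistics

# Hypothetical number of entities for each of 11 DBpedia types.
counts = [1, 2, 2, 3, 5, 8, 13, 40, 90, 250, 1200]

# n=4 yields the three quartile cut points (Q1, Q2, Q3).
quartiles = statistics.quantiles(counts, n=4)

# n=100 yields the 99 percentile cut points (P1 .. P99).
percentiles = statistics.quantiles(counts, n=100)

print(quartiles)
print(percentiles[89])  # the 90th percentile
```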

Top 50 DBpedia types with more entities

Figure 12 shows the top 50 DBpedia types with more entities table of the May 2021 version for the English language.

Figure 12. Top 50 DBpedia types with more entities
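
A ranking of this kind can be obtained with `collections.Counter`. The sketch assumes a list holding one DBpedia type per entity; the function name and the sample types are hypothetical.

```python
from collections import Counter


def top_types(entity_types, n=50):
    """Return the n most frequent types as (type, entity_count) pairs."""
    return Counter(entity_types).most_common(n)


# Hypothetical list with one type per entity.
types = ["Person", "Place", "Person", "Athlete", "Person", "Place"]
ranking = top_types(types, n=2)
print(ranking)  # [('Person', 3), ('Place', 2)]
```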

uriCounts

This sub-tab shows metrics calculated from the uriCounts file.

The main measures are:

Measures of Central Tendency: mean, mode, median
Measures of Dispersion: standard deviation
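
For reference, the same measures can be computed with Python's standard `statistics` module. The dashboard itself uses GNU datamash, so this is only an equivalent sketch; whether the population or the sample standard deviation is reported is not stated, and the population version is assumed here. The counts are hypothetical.

```python
import statistics


def describe(counts):
    """Central tendency and dispersion for a list of counts."""
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "mode": statistics.mode(counts),
        # Assumption: population standard deviation.
        "stdev": statistics.pstdev(counts),
    }


# Hypothetical counts, e.g. occurrences of five URIs.
stats = describe([1, 1, 2, 4, 12])
print(stats["mean"], stats["median"], stats["mode"])  # 4 2 1
```

In this example the mean (4) lies well above the median (2) because a single large value dominates, which is the kind of skew the standard deviation makes visible.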

Figure 13 shows the calculated measures of central tendency and dispersion from the uriCounts file of the May 2021 version for the English language.

Figure 13. Calculated measures from uriCounts file

pairCounts

This sub-tab shows metrics calculated from the pairCounts file.

The main measures are:

Measures of Central Tendency: mean, mode, median
Measures of Dispersion: standard deviation

Figure 14 shows the calculated measures of central tendency and dispersion from the pairCounts file of the May 2021 version for the English language.

Figure 14. Calculated measures from pairCounts file

tokenCounts

This sub-tab shows metrics calculated from the tokenCounts file.

The main measures are:

Measures of Central Tendency: mean, mode, median
Measures of Dispersion: standard deviation

Figure 15 shows the calculated measures of central tendency and dispersion from the tokenCounts file of the May 2021 version for the English language.

Figure 15. Calculated measures from tokenCounts file

sfAndTotalCounts

This sub-tab shows metrics calculated from the sfAndTotalCounts file.

The main measures are:

Measures of Central Tendency: mean, mode, median
Measures of Dispersion: standard deviation

Figure 16 shows the calculated measures of central tendency and dispersion from the sfAndTotalCounts file of the May 2021 version for the English language.

Figure 16. Calculated measures from sfAndTotalCounts file

In addition, Figure 17 shows the surface forms according to their state in the Wikipedia dump:

Without associated link (-1 in second file column)
Not appearing as text (0 in third file column)
Not appearing as text without associated link (-1 in second file column and 0 in third file column)
Rest (surface forms with associated link and appearing as text)
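
The four states can be derived from the second and third columns as sketched below. The function name, the state labels and the sample rows are hypothetical. The sketch also assumes the categories are mutually exclusive, as a pie chart requires, so the combined case is tested first.

```python
from collections import Counter


def surface_form_state(second_column, third_column):
    """Classify a sfAndTotalCounts row by its second and third columns."""
    if second_column == -1 and third_column == 0:
        return "not as text, no link"
    if second_column == -1:
        return "no link"
    if third_column == 0:
        return "not as text"
    return "rest"


# Hypothetical rows: (surface form, second column, third column).
rows = [("a", 5, 133), ("b", -1, 12), ("c", 3, 0), ("d", -1, 0), ("e", 7, 9)]
tally = Counter(surface_form_state(second, third) for _, second, third in rows)
print(tally["rest"])  # 2
```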

Figure 17. Surface forms state

Feedback tab

Any questions or suggestions for improvement can be made by filling out the following form: https://forms.gle/YKiibhasVuYQ5goe6

Figure 18 shows the Feedback tab.

Figure 18. Feedback tab

Evaluation

The usability of the Dashboard has been evaluated according to the following usability principles:

Visibility of System Status
Match between System and the Real World
User Control and Freedom
Consistency and Standards
Recognition rather than Recall
Flexibility and Efficiency of Use
Aesthetic and Minimalist Design / Remove the Extraneous (Ink)
Spatial Organization
Information Coding
Orientation

The evaluators work in the area of Entity Linking or have a Computer Science profile.

The results obtained are the following:

Usability principle	Severity rating
Visibility of System Status	No Usability Problem - 66.7% | Cosmetic Problem Only - 33.3%
Match between System and the Real World	No Usability Problem - 83.3% | Minor Usability Problem - 16.7%
User Control and Freedom	No Usability Problem - 100%
Consistency and Standards	No Usability Problem - 100%
Recognition rather than Recall	No Usability Problem - 83.3% | Minor Usability Problem - 16.7%
Flexibility and Efficiency of Use	No Usability Problem - 83.3% | Cosmetic Problem Only - 16.7%
Aesthetic and Minimalist Design / Remove the Extraneous (Ink)	No Usability Problem - 83.3% | Cosmetic Problem Only - 16.7%
Spatial Organization	No Usability Problem - 83.3% | Cosmetic Problem Only - 16.7%
Information Coding	No Usability Problem - 100%
Orientation	No Usability Problem - 100%

In addition, the dashboard as a whole was also evaluated:

Nº people who gave a global rating	Mark (from 0 to 10)
2	8
1	8.5
3	9

After observing the results of the evaluation, it has been determined that visual adjustments can be made to improve the rating of the following usability principles:

Visibility of System Status
Flexibility and Efficiency of Use
Aesthetic and Minimalist Design / Remove the Extraneous (Ink)
Spatial Organization

Also, corrections or functionalities can be added to the dashboard to solve minor usability problems in the following usability principles:

Match between System and the Real World
Recognition rather than Recall

Finally, it has been concluded that the dashboard can be improved in some aspects but is usable in general terms.

Used Tools

GNU datamash for statistics calculation

GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.

Dash framework for building the web app

Dash is a productive Python framework for building web analytic applications.

Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It’s particularly suited for anyone who works with data in Python.

Plotly Python graphing library for visualizations

Plotly’s Python graphing library makes interactive, publication-quality graphs. It supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

Spyder IDE for development and integration

Spyder is a free and open source scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts. It features a unique combination of the advanced editing, analysis, debugging, and profiling functionality of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and beautiful visualization capabilities of a scientific package.

How to Run

To run the dashboard on your local system, it is only necessary to:

Clone the repository
Go to the root folder and execute the main.sh script

The script will install all the necessary packages and modules.

The dashboard web page will be running at: http://localhost:8050

Conclusions

Throughout this work:

Raw data used by DBpedia Spotlight for the elaboration of models has been obtained
These data have been submitted to the entity validation process
Statistical measures have been calculated
A dashboard has been built showing these measures using cards and charts.

Measures of central tendency, measures of dispersion and position measures have been calculated. Measures of central tendency are used to see where the data are grouped the most. Measures of dispersion are used to see the degree of variability of the data. Position measures divide the data into intervals of the same size.

After analyzing all these measures, a high degree of dispersion has been observed: the data lie very far from the mean, that is, they present a high imbalance ratio. In addition, since entity types are highly unbalanced, much of the information in the dataset is covered by a small group of entities. Thus, after ordering the entities from highest to lowest, it was observed that the first quartile was covered by 1 or 2 types of entities, while the last quartile contained a large number of types of entities.
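
The imbalance described above can be quantified by counting how many of the largest types are needed to cover a given share of all entities. The sketch below is one possible way to do so, not the method used in the project; the function name and the counts are hypothetical.

```python
def types_needed_for_share(type_counts, share):
    """Number of largest types whose entities cover `share` of the total."""
    total = sum(type_counts.values())
    target = share * total
    covered = 0
    ordered = sorted(type_counts.values(), reverse=True)
    for needed, count in enumerate(ordered, start=1):
        covered += count
        if covered >= target:
            return needed
    return len(type_counts)


# Hypothetical entity counts per type (1000 entities in total).
counts = {"Person": 700, "Place": 150, "Organisation": 80, "Work": 40, "Species": 30}
print(types_needed_for_share(counts, 0.25))  # 1
print(types_needed_for_share(counts, 0.90))  # 3
```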

Future Work

These are some tasks that would be interesting to do in the future:

Include the rest of the languages available in DBpedia-Spotlight in the Details and Instance-types comparison tabs.
Define the statistical information as Linked Data.
Define an ontology for the representation of statistical information.

Progress

[17/05/2021]: Proposal acceptance and community bonding period started.

[27/05/2021]: Meeting the mentors on Google Meet to introduce ourselves and talk about the project and interesting ideas:

Load data from Wikistats (uriCounts, pairCounts, sfAndTotalCounts and tokenCounts) and DBpedia artifacts (instance-types, redirects and disambiguations) in dataframes using Pandas and RDFLib libraries.
Create the desired visualizations using Matplotlib library.
Use frameworks like Dash for building the dashboard.
Compute the desired statistics over the dataframe using NumPy library.
Publish the statistical data generated using Linked Open Vocabularies once the dashboard is built.

[10/06/2021]: Second meeting with the mentors, first advances in the project and new ideas:

Get model raw data for Spanish and English -> Done
Visualize DBpedia types for Spanish and English -> Done (problem: some hierarchy types are missing in the instance_types file)
Validation of DBpedia links (entities) -> In progress (problem: IP address blocked for 1 day due to excessive requests) (new idea: get ALL DBpedia distinct resources doing SPARQL queries and store results in local file, then look for valid URLs comparing that generated file URLs and instance_types URLs using UNIX commands)

[14/06/2021]: Some progress:

Validation of DBpedia links (entities) -> In progress (solving problem). I found out that all entities (both valid and invalid entities) are found on SPARQL endpoints, so the idea I came up with doesn’t work in this case. Regarding the first idea, even putting timeout between each request my IP address is still blocked (already 3 times in total)
Review of the code generated so far -> Done
Dashboard draft using Dash -> Done

[24/06/2021]: The problem of URLs validation has been resolved:

URLs of the latest version of uriCounts file have been validated for Spanish language. For this, each URL of the file has been checked by means of the following SPARQL query (using a local SPARQL Endpoint to avoid DBpedia IP blocking):

SPARQL validation

If the value returned by the query is 0, it means that this URL does not have any type, that is, it is a URL that does not exist and therefore is invalid.
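
The original query is not reproduced in this text, so the following is only a guess at its shape: a query that counts the `rdf:type` statements of a URI, plus the validity rule described above. The function names are hypothetical, and sending the query to the local SPARQL endpoint is left out.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_type_count_query(uri):
    """Build a SPARQL query that counts the rdf:type statements of `uri`."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in uri for ch in '<>"{}|^`\\ '):
        raise ValueError(f"unsafe character in URI: {uri!r}")
    return f"SELECT (COUNT(?type) AS ?n) WHERE {{ <{uri}> <{RDF_TYPE}> ?type }}"


def is_valid(type_count):
    """A URL is considered valid when it has at least one type."""
    return type_count > 0


query = build_type_count_query("http://es.dbpedia.org/resource/Cristiano_Ronaldo")
print(query)
```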

Once valid and invalid URLs for Spanish were obtained, types of valid URLs have been obtained and can be viewed according to the DBpedia hierarchy:

Spanish valid types

Precision and impact of Spanish URLs has also been calculated.
URLs validation of the latest version of uriCounts file for English language -> In progress (executing)
Types of valid English URLs -> In progress (executing)
Precision and impact of English URLs -> In progress (executing)

[25/06/2021]: DBpedia entities used by Spotlight have been validated for both Spanish and English languages. Now it is time to think of other interesting statistical measures to show on the dashboard:

URLs validation of the latest version of uriCounts file for English language -> Done
Types of valid English URLs -> Done

English valid types

Precision and impact of English URLs -> Done

English statistics

Think about other interesting statistical measures to show on the dashboard -> In progress

[01/07/2021]: Some statistical measures have been calculated from DBpedia datasets (redirects, disambiguations and instance-types) and Wikistats (uriCounts, pairCounts, tokenCounts, sfAndTotalCounts) for English and Spanish:

Redirects and disambiguations:

Redirects and disambiguations

Instance-types:

Instance-types 1 Instance-types 2 Instance-types 3

uriCounts:

uriCounts1 uriCounts2

pairCounts:

pairCounts1 pairCounts2

tokenCounts:

tokenCounts1

sfAndTotalCounts:

sfAndTotalCounts1 sfAndTotalCounts2 sfAndTotalCounts3

Next tasks:

Review all the statistics generated (especially those of the instance-types file) -> In progress
Think about other statistics that may be interesting to have on the dashboard
Think about how these statistics will be displayed on the dashboard

[02/07/2021]: Statistics have been revised and now they seem to be all good. Next thing to do is think about how to display them in the Dashboard.

[06/07/2021]: I have thought about how to display the statistics on the dashboard.

These are the statistics that are currently displayed:

Firstly, the information of the DBpedia Extraction Framework is displayed:
1. Number of non-repeating entities (instance_types.nt)
2. Non-repeating types (instance_types.tsv)
3. Bar chart of instance-types.nt (instance_types.tsv)
Then, the Spotlight information is displayed:
1. Number of non-repeating entities with known types from DBpedia (valid_urls)
2. Number of non-repeating types (valid_types.tsv)
3. Bar chart with known types (valid_types.tsv)
4. Statistics on known types: median, mean, percentiles, quartiles, etc (valid_types.tsv)

This is how the dashboard is at the moment, waiting for feedback from the mentors to change what is necessary:

(Screenshots dashboard_stats1 to dashboard_stats10)

[08/07/2021]: Meeting with the mentors and new ideas for the Dashboard:

Add a flowchart at the beginning where it can be seen how DBpedia Spotlight works
Change some of the graphs so that the dispersion of the data can be better appreciated

[11/07/2021]: DBpedia Spotlight and Spotlight Dashboard flow charts have been made. Some Dashboard charts have also been changed so that the data is easier to read. Waiting for feedback from mentors:

An additional tab has been added to the Dashboard called Information where the process of creating Spotlight models and the purpose and operation of the Spotlight Dashboard are explained, as well as flow charts to see both processes graphically.

information

Some of the Dashboard graphics have been changed to better appreciate the dispersion of the data. In the case of Wikistats, since the files contain millions of records, slicers have been added so that the user can select different samples of the data. The information reflected in these graphs, together with the tables (top 50), gives an idea of what the data used by Spotlight is like.

information

[15/07/2021]: Meeting with the mentors in which the Dashboard has been reviewed and the following tasks to be carried out have been defined:

Make the Comparison tab, where the number of entities and the types found in the DBpedia instance-types dataset will be compared for both Spanish and English for the following versions:
```
2016.10.01
2020.10.01
2021.05.01
2021.06.01
```
Think about the ontology to use to represent the calculated statistics as Linked Data. Also think if it is worth creating our own ontology to represent the information instead of using any of the existing ones or reusing terms from different ontologies.

[18/07/2021]:

Instance types, uriCounts, pairCounts, tokenCounts and sfAndTotalCounts sub-tabs have been added within the Spanish and English tabs to be able to view the information of interest in parts.

Subtabs

The Comparison tab has been made, where the different versions can be compared and the variation of entities and types between them can be appreciated.

Comparison 1

Comparison 2

[19/07/2021]:

DBpedia types comparison between different versions of instance-types dataset has been added to the Comparison tab.

Comparison 3

Pending tasks:

Ontology: think about statistics of interest to show and think about vocabularies to use or else, create an ontology.
Wikistats: think about what data is interesting to show for the user or another way to show the data.

[20/07/2021]: Once the final version of the Dashboard is made (there are still pending tasks), the idea is to update it in the future with suggestions for improvement from the users:

A form has been made to receive feedback on the Dashboard. This form evaluates the usability principles contained in the paper: The Development of Heuristics for Evaluation of Dashboard Visualizations.

Evaluation

A Feedback tab has been added to the Dashboard. The direct link to the form appears in this tab, so that users can contribute their ideas for improvement after having examined the Dashboard.

Evaluation Tab

[28/07/2021]:

My mentors and I have considered eliminating the bar graphs from the Wikistats because they did not provide relevant information for the user and implied an excess of unnecessary information
Instead, we have thought of displaying the relevant measurements in the form of cards:

Cards1

Cards2

Cards3

Cards4

The corresponding statistical measures have also been added for the October 2016 version (2016.10.01)
All statistical measurements have been reviewed and appear to be correct
Added to the Information tab an explanation of the files from which the Dashboard statistics are obtained:

DBpedia datasets

Wikistats

Modified the Dashboard header and added the Spotlight logo as well as the appearance of the tabs, also added margins to the dashboard:

Appearance

Tasks in progress:

Add statistics of 2020.10.01 and 2021.06.01 versions for Wikistats files
Add a Summary tab to summarize the statistics in a table

[29/07/2021]: A Summary tab has been added to show users a summary of the statistics for English and Spanish.

Summary

[05/08/2021]: Meeting with mentors. Finally, the publication of the statistics as linked data is pending as a future task due to lack of time. It only remains to make some visual improvements to the Dashboard.

[11/08/2021]: Some visual changes have been made to the Dashboard:

Changed the web page title and added the Spotlight icon
Changed the order of the Instance-types tab
- The cards of each version have been put in different columns
- The 2 bars of the left graph have been put separately
The Position measures graphs have been changed since the previous graphs were more confusing
The dashboard is also publicly available at http://134.155.95.24:8050/

[15/08/2021]: Most of the visual changes suggested by the mentors have been implemented:

The median measure has been eliminated from the Summary tab tables
Dashboard logo has been moved to the left
English and Spanish tabs have been removed and replaced by a Details tab so as not to mix topics with languages

Details

Cards have been removed from the Comparison tab and in their place a table with highlighted differences has been put

Comparison table

Added space between buttons
Horizontal lines have been added to separate views better
The Summary sub-tab has been assigned as default when clicking on the Details tab
The Details tab dropdown has been synchronized so that the version only has to be selected once (preferably in the Summary tab)
A summary of how statistics have been computed has been added to the Information tab
The width of the tables has been modified so that they can be seen well in all possible screens (Not checked yet)

An integrated statistical information tool from the Wikipedia dumps and the DBpedia Extraction Framework artifacts
