Workdocumentation 2021-03-30
Red Links and Data Fixes
Broken redirects
- Issue 106: Red links, natural language IDs, ambiguous, wrong, and N/A values for country/region/city (and other fields)
- The list of all broken redirects on openresearch can be found here: https://www.openresearch.org/mediawiki/index.php?title=Special:BrokenRedirects&limit=500&offset=0
Broken file links
- Issue 106: Red links, natural language IDs, ambiguous, wrong, and N/A values for country/region/city (and other fields)
- The list of all links that point to a file path where the file does not exist can be found here: https://www.openresearch.org/wiki/Category:Pages_with_broken_file_links
Broken properties and errors
- All invalid property values are recorded by the property Property:Has improper value for.
Ordinals
- Issue 119: Text values in the Ordinal field
- The following approaches can be used to fix the ordinals:
- This approach finds all events with improper ordinals and fixes them in place:
wikiedit -t wikiId -q "[[Has improper value for::Ordinal]]" --search "(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b" --replace "\1"
- Alternatively, the ordinal_to_cardinal.py code snippet coupled with wikibackup and bash tools can be used for targeted editing of pages; a sketch is given after the pipeline below.
- Pipeline usage:
grep Ordinal /path/to/backup -l -r | python ordinal_to_cardinal.py -stdin -d '../dictionary.yaml' -ro -f | wikirestore -t ormk -stdinp -ui
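- The ordinal_to_cardinal.py script is not reproduced on this page; the following is a minimal sketch of the numeric replacement it presumably performs (the dictionary.yaml lookup for spelled-out ordinals and the -stdin/-d/-ro/-f options are omitted):
Code Snippet:
import re
import sys

ORDINAL_SUFFIX = re.compile(r'(\|Ordinal\s*=\s*[0-9]+)(?:st|nd|rd|th)\b', re.IGNORECASE)

def ordinal_to_cardinal(wikitext):
    '''Replace numeric ordinals such as '34th' in the Ordinal field with the cardinal '34'.'''
    return ORDINAL_SUFFIX.sub(r'\1', wikitext)

if __name__ == "__main__":
    # read file paths from STDIN (as produced by grep -l) and rewrite each file in place
    for line in sys.stdin:
        path = line.strip()
        with open(path, 'r') as f:
            text = f.read()
        fixed = ordinal_to_cardinal(text)
        if fixed != text:
            with open(path, 'w') as f:
                f.write(fixed)
            print(path)  # emit the path so wikirestore can pick it up via -stdinp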
Improper Null values for Has person
- Issue 150: Improper null value handling
- Has person used "some person" as a null value. There was incorrect usage: the free text of an event would contain "some person" while the WikiSON markup already contained the actual person name.
- The first approach is to remove the free text altogether. The following code snippet was used together with the bash utility grep.
Code Snippet:
import ntpath

def remove_FreeText(path, restoreOut=False):
    '''
    Remove free text from a given wiki page, keeping only the {{...}} template blocks.
    Args:
        path(str): path to the wiki file
        restoreOut(bool): if True, write the cleaned page to the wikibackup/rdfPages directory
    '''
    f = open(path, 'r')
    eventlist = f.readlines()
    f.close()
    i = 0
    newwiki = ''
    while i < len(eventlist):
        if "{{" in eventlist[i]:
            # copy the template block up to and including its closing "}}"
            while i < len(eventlist):
                newwiki += eventlist[i]
                if "}}" in eventlist[i]:
                    break
                i += 1
        i += 1
    if restoreOut:
        if newwiki == '':
            return
        # getHomePath and ensureDirectoryExists are helpers defined elsewhere in Data_Fixes.py
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)  # emit the target path so wikirestore can pick it up via -stdinp
        ensureDirectoryExists(dirname)
        f = open(fullpath, 'w')
        f.write(newwiki)
        f.close()
Usage:
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rmf | wikirestore -t ormk -stdinp -ui
Output result: ICFHR 2020
- The second approach is to remove only those "some person" entries whose property also occurs with a real value on the page. The following Python snippet is used with the bash utility grep.
Code Snippet:
import ntpath
import re

def remove_DuplicateFields(path, restoreOut=False):
    '''
    Remove "some person" null-value entries whose property also occurs with a real value on the page.
    '''
    f = open(path, 'r')
    event = f.read()
    f.close()
    flag = False
    eventlist = event.split('\n')
    for i in eventlist:
        if 'some person' in i:
            # extract the property name, e.g. "Has person" from "[[Has person::some person]]"
            chk = re.findall(r'\[\[(.*?)\:\:', i)
            if chk and event.count(chk[0]) > 1:
                flag = True
                event = event.replace(i, '')
    if restoreOut and flag:
        # getHomePath and ensureDirectoryExists are helpers defined elsewhere in Data_Fixes.py
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)  # emit the target path so wikirestore can pick it up via -stdinp
        ensureDirectoryExists(dirname)
        f = open(fullpath, 'w')
        f.write(event)
        f.close()
Usage:
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rdf | wikirestore -t ormk -stdinp -ui
Output result: ACCV_2020
Main function for command-line usage:
import argparse
import sys

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', dest="path", help='path to the wiki file')
    parser.add_argument('-stdin', dest="pipein", action='store_true', help='read the input file paths from STDIN via pipes')
    parser.add_argument('-ro', dest="ro", action='store_true', help='output the restore paths to STDOUT for wikirestore')
    parser.add_argument('-rmf', dest="rmf", action='store_true', help='flag to remove free text from the wiki file')
    parser.add_argument('-rdf', dest="rdf", action='store_true', help='flag to remove duplicate null value fields from the wiki file')
    args = parser.parse_args()
    if args.pipein:
        # file paths arrive on STDIN, one per line (e.g. as produced by grep -l)
        if args.rmf:
            allx = sys.stdin.readlines()
            for i in allx:
                remove_FreeText(i.strip(), restoreOut=args.ro)
        if args.rdf:
            allx = sys.stdin.readlines()
            for i in allx:
                remove_DuplicateFields(i.strip(), restoreOut=args.ro)
    elif args.path is not None:
        if args.rmf:
            remove_FreeText(args.path, restoreOut=args.ro)
- Usage Statistics:
- The null value 'some person' has been used 223 times.
Dates
- Issue 71: Fix Property End_date and invalid date entries
- 153 improper start date entries.
- 6 improper end date entries.
- Reference page
- End dates (and dates in general) are sometimes filled with free-text strings instead of valid dates.
- Decision needed: remove the field altogether or fix the values?
- Fixing the values requires manual intervention.
- Removing a field altogether can be done with a small code snippet; a sketch is given below.
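- The following is a minimal sketch, assuming the field appears as a |Field name=value line in the WikiSON markup; the field name and backup path are placeholders:
Code Snippet:
import glob
import re

def remove_field(wikitext, field):
    '''Drop every "|<field>=..." line from the WikiSON markup of a page.'''
    pattern = re.compile(r'^\|' + re.escape(field) + r'\s*=.*\n?', re.MULTILINE)
    return pattern.sub('', wikitext)

for path in glob.glob('/path/to/backup/*.wiki'):
    with open(path, 'r') as f:
        text = f.read()
    fixed = remove_field(text, 'End date')  # hypothetical: drop the End date field
    if fixed != text:
        with open(path, 'w') as f:
            f.write(fixed)
        print(path)  # the printed paths can be piped to wikirestore as in the examples above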
Acceptance Rate Issue
- Issue 152: Acceptance rate not calculated
- Statistics for the missing values of Submitted papers and Accepted papers:
- With the bash utility grep, the following totals can be found:
- Number of pages that have the field "Submitted papers": 1716
- Number of pages that have the field "Accepted papers": 1965
- With a small Python code snippet the following can be found:
Code Snippet:
import glob

allx = glob.glob('/path/to/backup/*.wiki')  # path to the wikibackup directory
nosub = 0
noacc = 0
for i in allx:
    with open(i, 'r') as f:
        event = f.read().lower()
    if 'submitted papers' not in event and 'accepted papers' in event:
        nosub += 1  # page has "Accepted papers" but no "Submitted papers"
    elif 'submitted papers' in event and 'accepted papers' not in event:
        noacc += 1  # page has "Submitted papers" but no "Accepted papers"
print(nosub)
print(noacc)
- Number of pages that have the field "Submitted papers" but no "Accepted papers" field: approximately 63
- Number of pages that have the field "Accepted papers" but no "Submitted papers" field: approximately 302
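- Once both fields are present, the acceptance rate itself can be derived directly from them. The following is a minimal sketch, assuming the counts appear as plain numbers in |Submitted papers= and |Accepted papers= lines:
Code Snippet:
import re

def acceptance_rate(wikitext):
    '''Return the acceptance rate in percent if both paper counts are available, else None.'''
    def field_value(name):
        m = re.search(r'^\|' + name + r'\s*=\s*([0-9]+)\s*$', wikitext, re.MULTILINE | re.IGNORECASE)
        return int(m.group(1)) if m else None

    submitted = field_value('Submitted papers')
    accepted = field_value('Accepted papers')
    if submitted is None or accepted is None or submitted == 0:
        return None
    return round(100 * accepted / submitted, 1)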