Difference between revisions of "Workdocumentation 2021-03-30"
Both revisions by Musaab Khan
Revision as of 09:23, 2 April 2021
Red Links and Data Fixations
Broken redirects
- The list of all broken redirects on openresearch can be found here: https://www.openresearch.org/mediawiki/index.php?title=Special:BrokenRedirects&limit=500&offset=0
Broken file links
- The list of all pages with broken file links (a file path is given but the file does not exist) can be found here: https://www.openresearch.org/wiki/Category:Pages_with_broken_file_links
Broken properties and errors
- The property "Property:Has improper value for" stores all invalid values.
Ordinals
- The following approaches can be used to fix ordinals:
- This approach finds all events with improper ordinals and fixes them in place:
wikiedit -t wikiId -q "[[Has improper value for::Ordinal]]" --search "(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b" --replace "\1"
- Alternatively, a code snippet can be combined with wikibackup and bash tools to edit specific pages: Code Snippet
- Pipeline usage:
grep Ordinal /path/to/backup -l -r | python ordinal_to_cardinal.py -stdin -d '../dictionary.yaml' -ro -f | wikirestore -t ormk -stdinp -ui
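The search and replace pattern passed to wikiedit above can be tried out locally before any page is edited. The following is a minimal sketch, assuming that wikiedit interprets the pattern like Python's re module; the function name strip_ordinal_suffix and the sample markup are illustrative only and not part of the original tooling.

```python
import re

# Same pattern as in the wikiedit call: keep "|Ordinal=<digits>", drop the suffix.
ORDINAL_PATTERN = re.compile(r"(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b")


def strip_ordinal_suffix(wikitext: str) -> str:
    """Return the wiki markup with st/nd/rd/th removed from the Ordinal field."""
    return ORDINAL_PATTERN.sub(r"\1", wikitext)


# Made-up sample page, used only to show the effect of the substitution.
sample = "{{Event\n|Acronym=EXAMPLE 2020\n|Ordinal=17th\n}}"
print(strip_ordinal_suffix(sample))
```

Values that are already plain numbers are left untouched, so the substitution can safely be run more than once on the same page.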
Improper Null values for Has person
- Has person was using "some person" as a null value. In some events this was used incorrectly: the free text contained "some person" while the Wikison format info contained the actual person name.
- The first way to fix this is to remove the free text altogether. A code snippet was used together with the bash utility grep.
Code Snippet:
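import ntpath


def remove_FreeText(path, restoreOut=False):
    '''
    Remove free text from any given wiki page
    Args:
        path(str): path to wiki file
        restoreOut(bool): if True, write the result to the wikibackup/rdfPages directory
    Returns:
        page without any free text.
    '''
    with open(path, 'r') as f:
        eventlist = f.readlines()
    i = 0
    newwiki = ''
    # visit every line, including the last one
    while i < len(eventlist):
        if "{{" in eventlist[i]:
            # copy the template block line by line up to the closing braces;
            # stop at the end of the file if the braces are never closed
            while i < len(eventlist):
                newwiki += eventlist[i]
                if "}}" in eventlist[i]:
                    break
                i += 1
        i += 1
    if restoreOut and newwiki != '':
        # getHomePath and ensureDirectoryExists are helpers defined elsewhere
        # in the script (not shown here)
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        # the printed path is piped on to wikirestore in the usage below
        print(fullpath)
        ensureDirectoryExists(dirname)
        with open(fullpath, 'w') as f:
            f.write(newwiki)
    return newwiki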
Usage:
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rmf | wikirestore -t ormk -stdinp -ui
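Output result: ICFHR 2020 (https://confident.dbis.rwth-aachen.de/ormk/index.php?title=ICFHR_2020&type=revision&diff=19206&oldid=15877)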
- The second way is to remove only the 'some person' entry from the wiki free text. A Python snippet is used together with the bash utility grep.
Code Snippet:
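import ntpath
import re


def remove_DuplicateFields(path, restoreOut=False):
    '''
    Remove free text lines that use 'some person' for a property
    that also occurs elsewhere on the page.
    '''
    with open(path, 'r') as f:
        event = f.read()
    flag = False
    for line in event.split('\n'):
        if 'some person' in line:
            # property name of a free text annotation such as [[property::value]]
            chk = re.findall(r'\[\[(.*?)::', line)
            # skip lines without such an annotation instead of raising IndexError
            if chk and event.count(chk[0]) > 1:
                flag = True
                event = event.replace(line, '')
    if restoreOut and flag:
        # getHomePath and ensureDirectoryExists are helpers defined elsewhere
        # in the script (not shown here)
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        # the printed path is piped on to wikirestore in the usage below
        print(fullpath)
        ensureDirectoryExists(dirname)
        with open(fullpath, 'w') as f:
            f.write(event)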
Usage:
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rdf | wikirestore -t ormk -stdinp -ui
Output result: ACCV_2020
- Usage Statistics:
- The null field 'some person' has been used 223 times
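One way such a usage count could be obtained from a local backup is sketched below. This is not necessarily how the figure above was computed: the function name count_null_person is hypothetical, and the sketch assumes that the backup stores one page per file with the extension .wiki.

```python
from pathlib import Path


def count_null_person(backup_dir: str, null_value: str = "some person") -> int:
    """Count occurrences of null_value in all *.wiki files below backup_dir."""
    total = 0
    # assumption: every backed-up page is a text file ending in .wiki
    for page in Path(backup_dir).rglob("*.wiki"):
        text = page.read_text(encoding="utf-8", errors="replace")
        total += text.count(null_value)
    return total
```

Calling count_null_person('/path/to/backup') returns the number of occurrences over all pages, not the number of pages, so a page that uses the null value twice is counted twice.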
Dates
- 153 improper start date entries.
- 6 improper end date entries.
- Reference page
- End dates, and dates in general, contain arbitrary strings instead of proper dates.
- Open decision: remove the field altogether or fix the values?
- Fixing the values requires manual intervention.
- Removing a field altogether can be done with a small code snippet.
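A snippet for removing a field altogether could look like the following minimal sketch. It assumes that fields appear one per line as |Field=value inside the page template, as in the |Ordinal= example earlier; the function name remove_field and the field name "End date" are assumptions used for illustration.

```python
import re


def remove_field(wikitext: str, field: str) -> str:
    """Drop every '|<field>=...' line from the wiki markup."""
    # match the whole line, including its trailing newline if there is one
    pattern = re.compile(
        r"^\|[ \t]*" + re.escape(field) + r"[ \t]*=.*\n?", re.MULTILINE
    )
    return pattern.sub("", wikitext)


# Made-up sample page with an unparseable end date.
sample_page = "{{Event\n|Acronym=EXAMPLE 2020\n|End date=to be announced\n}}"
print(remove_field(sample_page, "End date"))
```

Because the field name is escaped and anchored to the start of the line, other fields whose names merely begin with the same words are left in place.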
Acceptance Rate Issue
- Statistics for the missing values for Submitted papers:
- These can be found with bash utilities:
- Number of pages that have the field "Submitted papers" : 1716
- Number of pages that have the field "Accepted papers" : 1965
- With a small Python code snippet the following can be found:
- Number of pages that have the field "Submitted papers" but no field of "Accepted papers" : Approximately 63
- Number of pages that have the field "Accepted papers" but no field of "Submitted papers" : Approximately 302
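The comparison of the two fields can be sketched as follows. This is an illustration under assumptions rather than the snippet that produced the numbers above: the helper names has_field and unmatched_paper_fields are hypothetical, the pages are passed in as a mapping from page title to wiki markup, and a field counts as present when a line of the form |Field=... exists.

```python
import re
from typing import Dict, List, Tuple


def has_field(wikitext: str, field: str) -> bool:
    """True if the markup contains a line starting with '|<field>='."""
    pattern = r"^\|[ \t]*" + re.escape(field) + r"[ \t]*="
    return re.search(pattern, wikitext, re.MULTILINE) is not None


def unmatched_paper_fields(pages: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (titles with only Submitted papers, titles with only Accepted papers)."""
    only_submitted: List[str] = []
    only_accepted: List[str] = []
    for title, text in pages.items():
        submitted = has_field(text, "Submitted papers")
        accepted = has_field(text, "Accepted papers")
        if submitted and not accepted:
            only_submitted.append(title)
        elif accepted and not submitted:
            only_accepted.append(title)
    return only_submitted, only_accepted
```

The lengths of the two returned lists correspond to the two counts listed above; pages with both fields or with neither field are ignored.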