Workdocumentation 2021-03-30

Latest revision as of 10:37, 2 April 2021

Red Links and Data Fixations

Broken redirects

  • 106 Red links, Natural Language Ids, ambiguous, wrong and N/A values for country/region/city (and other fields)
  • The list of all broken redirects on openresearch can be found here: https://www.openresearch.org/mediawiki/index.php?title=Special:BrokenRedirects&limit=500&offset=0

Broken file links

  • 106 Red links, Natural Language Ids, ambiguous, wrong and N/A values for country/region/city (and other fields)
  • The list of all broken links that point to a file that does not exist can be found here: https://www.openresearch.org/wiki/Category:Pages_with_broken_file_links

Broken properties and errors

  • The property Property:Has improper value for stores all invalid values.
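A list of the affected pages can be pulled from Semantic MediaWiki's ask API. The helper below is only a sketch, and the api.php path is an assumption:

```python
import urllib.parse

def ask_url(api, query):
    """Build a Semantic MediaWiki 'ask' API request URL for the given query."""
    return api + "?" + urllib.parse.urlencode(
        {"action": "ask", "query": query, "format": "json"})

# Illustrative: request all pages annotated with 'Has improper value for'
url = ask_url("https://www.openresearch.org/mediawiki/api.php",
              "[[Has improper value for::+]]|limit=500")
```

Fetching the resulting URL with any HTTP client should return the matching pages as JSON under query.results.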

Ordinals

  • 119 Text Values in Ordinal Field
  • To fix the ordinals, the following approaches can be used:
    • This approach finds and edits all events with improper ordinals:
wikiedit -t wikiId -q "[[Has improper value for::Ordinal]]" --search "(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b" --replace "\1"
  • A code snippet (ordinal_to_cardinal.py), coupled with wikibackup and bash tools, can be used for targeted editing of pages.
  • Pipeline usage:
grep Ordinal /path/to/backup -l -r | python ordinal_to_cardinal.py -stdin -d '../dictionary.yaml' -ro -f | wikirestore -t ormk -stdinp -ui
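The two fixes above can be sketched in plain Python: numeric ordinals are handled like the wikiedit regex, and word ordinals go through a lookup table standing in for dictionary.yaml (the table below is a hypothetical excerpt, not the real dictionary):

```python
import re

# Hypothetical excerpt of the dictionary.yaml used by ordinal_to_cardinal.py
WORD_ORDINALS = {"first": "1", "second": "2", "third": "3", "fourth": "4", "fifth": "5"}

def ordinal_to_cardinal(value):
    """Convert an ordinal field value such as '21st' or 'Third' to a cardinal string."""
    # Numeric ordinals: strip the st/nd/rd/th suffix, e.g. '21st' -> '21'
    m = re.fullmatch(r"(\d+)(st|nd|rd|th)", value.strip(), re.IGNORECASE)
    if m:
        return m.group(1)
    # Word ordinals: look them up; unknown values are returned unchanged
    return WORD_ORDINALS.get(value.strip().lower(), value.strip())
```

Applied to each |Ordinal= value found by grep, this produces the cardinal form that wikirestore writes back.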

Improper Null values for Has person

  • 150 Improper Null Value handling
  • Has person was using "some person" as a null value. There was incorrect usage: the free text of an event would contain "some person" while the WikiSon format info would contain the actual person name.
  1. The first way to do this is to remove the free text altogether. A code snippet was used coupled with the bash utility grep.

Code Snippet:

def remove_FreeText(path, restoreOut=False):
    '''
    Remove free text from any given wiki page.

    Args:
        path(str): path to the wiki file
    Returns:
        page without any free text.
    '''
    f = open(path, 'r')
    eventlist = f.readlines()
    f.close()
    i = 0
    newwiki = ''
    # Keep only the template blocks ({{ ... }}); everything else is free text
    while i < len(eventlist):
        if "{{" in eventlist[i]:
            while i < len(eventlist):
                newwiki += eventlist[i]
                if "}}" in eventlist[i]:
                    break
                i += 1
        i += 1
    if restoreOut:
        if newwiki == '':
            return
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)
        ensureDirectoryExists(dirname)
        f = open(fullpath, 'w')
        f.write(newwiki)
        f.close()

Usage:

 
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rmf | wikirestore -t ormk -stdinp -ui

Output result: ICFHR 2020

  2. The second way is to remove only the 'some person' entry from the wiki free text. A Python snippet is used with the bash utility grep.

Code Snippet:

def remove_DuplicateFields(path, restoreOut=False):
    '''
    Remove 'some person' annotations whose property also occurs with a real value.
    '''
    f = open(path, 'r')
    event = f.read()
    f.close()
    flag = False
    eventlist = event.split('\n')
    for i in eventlist:
        if 'some person' in i:
            # Extract the property name from a '[[Property::value]]' annotation
            chk = re.findall(r'\[\[(.*?)\:\:', i)
            # Drop the line only if the same property also appears elsewhere
            if chk and event.count(chk[0]) > 1:
                flag = True
                event = event.replace(i, '')
    if restoreOut and flag:
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)
        ensureDirectoryExists(dirname)
        f = open(fullpath, 'w')
        f.write(event)
        f.close()

Usage:

grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rdf | wikirestore -t ormk -stdinp -ui

Output result: ACCV_2020 (https://confident.dbis.rwth-aachen.de/ormk/index.php?title=ACCV_2020&type=revision&diff=19177&oldid=17957)


Main Function for usage:

import argparse
import sys

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', dest="path", help='path to the file')
    parser.add_argument('-stdin', dest="pipein", action='store_true', help='use the input from STDIN via pipes')
    parser.add_argument('-ro', dest="ro", action='store_true', help='output to STDOUT for wikirestore by pipe')
    parser.add_argument('-rmf', dest="rmf", action='store_true', help='flag to remove free text from the wiki file')
    parser.add_argument('-rdf', dest="rdf", action='store_true', help='flag to remove duplicate null value fields from the wiki file')
    args = parser.parse_args()
    if args.pipein:
        # File paths arrive one per line on STDIN (e.g. from grep -l)
        if args.rmf:
            for i in sys.stdin.readlines():
                remove_FreeText(i.strip(), restoreOut=args.ro)
        if args.rdf:
            for i in sys.stdin.readlines():
                remove_DuplicateFields(i.strip(), restoreOut=args.ro)
    elif args.path is not None:
        # Note: only -rmf is handled for a single file in this version
        if args.rmf:
            remove_FreeText(args.path, restoreOut=args.ro)
  • Usage Statistics:
    • The null field 'some person' has been used 223 times

Dates

  • 71 Fix Property End_date and invalid date entries
  • 153 improper start date entries.
  • 6 improper end date entries.
  • Reference page
  • End dates, and dates in general, are sometimes filled with free-text strings instead of valid dates.
  • Decision: remove the field altogether or fix the entries?
    • For fixing, manual intervention is needed.
    • For removing a field altogether, a small code snippet can do the trick.
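Removing such a field altogether could look like the sketch below; the default field name and the line-based matching are assumptions for illustration, not the snippet that was actually used:

```python
import re

def remove_field(wikitext, field="End date"):
    """Drop every '|<field>=...' parameter line from a WikiSon event page."""
    # Match a whole template-parameter line such as '|End date=sometime later'
    pattern = re.compile(r"^\|\s*" + re.escape(field) + r"\s*=.*\n?", re.MULTILINE)
    return pattern.sub("", wikitext)
```

This would be applied to each backup .wiki file before restoring the pages.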

Acceptance Rate Issue

  • 152 Acceptance Rate Not calculated
  • Statistics for the missing values for Submitted papers:
  • With a bash utility these can be found:
    • Number of pages that have the field "Submitted papers" : 1716
    • Number of pages that have the field "Accepted papers" : 1965
  • With a small python code snippet the following can be found:

Code Snippet:

import glob

# Backslashes in a plain string are escape sequences, so use forward slashes
allx = glob.glob('path/to/backup/*.wiki')
nosub = 0
noacc = 0
for i in allx:
    with open(i, 'r') as f:
        event = f.read().lower()
        if 'submitted papers' not in event and 'accepted papers' in event:
            nosub += 1
        elif 'submitted papers' in event and 'accepted papers' not in event:
            noacc += 1
print(nosub)  # pages with "Accepted papers" but no "Submitted papers"
print(noacc)  # pages with "Submitted papers" but no "Accepted papers"
    • Number of pages that have the field "Submitted papers" but no field of "Accepted papers" : Approximately 63
    • Number of pages that have the field "Accepted papers" but no field of "Submitted papers" : Approximately 302
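With both fields in place, the missing acceptance rate from ticket 152 could be recomputed roughly as follows; the field-parsing regex is an assumption about the WikiSon page layout:

```python
import re

def acceptance_rate(wikitext):
    """Return accepted/submitted papers as a percentage, or None if it cannot be computed."""
    def field(name):
        # Parse an integer template parameter such as '|Submitted papers=100'
        m = re.search(r"\|\s*" + name + r"\s*=\s*(\d+)", wikitext, re.IGNORECASE)
        return int(m.group(1)) if m else None

    submitted = field("Submitted papers")
    accepted = field("Accepted papers")
    if not submitted or accepted is None:
        return None  # missing field or zero submissions: rate not calculable
    return round(100 * accepted / submitted, 1)
```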